commit 798df6c4b7f8d331357062137207ea5af4601ef5
Author: Cayden Yap
Date:   Mon Nov 24 09:43:21 2025 -0800

    upload notebook

diff --git a/skip_gram.ipynb b/skip_gram.ipynb
new file mode 100644
index 0000000..fcbd8a4
--- /dev/null
+++ b/skip_gram.ipynb
@@ -0,0 +1,359 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Abstract\n",
+        "\n",
+        "> \"No one is going to implement word2vec from scratch\", or some šŸ¤“ commentary like that.\n",
+        "\n",
+        "This notebook provides a brief explanation and implementation of a Skip Gram model, one of the two model architectures that word2vec refers to (the other being CBOW)."
+      ],
+      "metadata": {
+        "id": "JZwIogzJPENc"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Intuition"
+      ],
+      "metadata": {
+        "id": "rqxbpHtxdtp_"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Problem\n",
+        "\n",
+        "Given a corpus C, map every token to a vector such that words with similar semantics (a similar probability of appearing in the same contexts) are close to each other."
+      ],
+      "metadata": {
+        "id": "PZeBycn2df3M"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Idea\n",
+        "\n",
+        "**The idea of a Skip Gram model proceeds from these two observations:**\n",
+        "\n",
+        "1. Similar words should appear in similar contexts\n",
+        "2. Similar words should appear together\n",
+        "\n",
+        "The intuition behind the Skip Gram model is to map a target token to all the words appearing in a context window around it.\n",
+        "\n",
+        "> The MIMS major **Quentin** is a saber fencer.\n",
+        "\n",
+        "In this case the target token **Quentin** should map to all the other tokens in the window. As such, the target token should end up with mappings similar to those of words such as MIMS, saber, and fencer.\n",
+        "\n",
+        "Skip Gram treats each token's vector representation as a set of weights and uses a linear-linear-softmax model to optimize them. At the end, the first weight matrix is a list of $n$ vectors (one per vocabulary token) that map a token to a prediction of its context tokens, solving the initial mapping problem.\n",
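+        "\n",
+        "To make this concrete, the snippet below is a small illustrative sketch (plain Python, window size 2, not part of the model built later) of how the example sentence turns into (target, context) training pairs:\n",
+        "\n",
+        "```python\n",
+        "sentence = \"the mims major quentin is a saber fencer\".split()\n",
+        "window = 2\n",
+        "pairs = [(sentence[i], c)\n",
+        "         for i in range(len(sentence))\n",
+        "         for c in sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]]\n",
+        "# pairs includes ('quentin', 'mims'), ('quentin', 'major'), ('quentin', 'is'), ('quentin', 'a'), ...\n",
+        "```"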
+      ],
+      "metadata": {
+        "id": "dsEGSoXwdj62"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Code & Detailed Implementation"
+      ],
+      "metadata": {
+        "id": "yKrQicLLoEiY"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Preprocessing\n",
+        "\n",
+        "Tokenize the corpus, build a vocabulary, generate training pairs from words that co-occur within a context window, and draw negative samples:"
+      ],
+      "metadata": {
+        "id": "6dvQ80wQdTOi"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import numpy as np\n",
+        "\n",
+        "class Preproccess:\n",
+        "\n",
+        "    @staticmethod\n",
+        "    def tokenize(text):\n",
+        "\n",
+        "        \"\"\"Returns a list of lowercase tokens\"\"\"\n",
+        "\n",
+        "        # split() without an argument avoids empty tokens from repeated whitespace\n",
+        "        return \"\".join([t for t in text.lower().replace(\"\\n\", \" \") if t.isalpha() or t == \" \"]).split()\n",
+        "\n",
+        "    @staticmethod\n",
+        "    def build_vocab(tokens, min_count=1):\n",
+        "\n",
+        "        \"\"\"Create an id-to-word and a word-to-id mapping\"\"\"\n",
+        "\n",
+        "        token_counts = {}\n",
+        "        for token in tokens:\n",
+        "            if token not in token_counts:\n",
+        "                token_counts[token] = 0\n",
+        "            token_counts[token] += 1\n",
+        "\n",
+        "        sorted_tokens = sorted(token_counts.items(), key=lambda t: t[1], reverse=True)  # Sort tokens by frequency\n",
+        "        vocab = {}\n",
+        "        id_to_word = []\n",
+        "        for token, count in sorted_tokens:\n",
+        "            if count < min_count:\n",
+        "                break  # Every remaining token is rarer than min_count\n",
+        "            vocab[token] = len(id_to_word)\n",
+        "            id_to_word.append(token)\n",
+        "\n",
+        "        return vocab, id_to_word\n",
+        "\n",
+        "    @staticmethod\n",
+        "    def build_pairs(tokens, vocab, window_size=5):\n",
+        "\n",
+        "        \"\"\"Generate (center, context) training pairs\"\"\"\n",
+        "\n",
+        "        pairs = []\n",
+        "        token_len = len(tokens)\n",
+        "\n",
+        "        for center in range(token_len):\n",
+        "            tokens_before = tokens[max(0, center - window_size):center]\n",
+        "            tokens_after = tokens[(center + 1):min(token_len, center + 1 + window_size)]\n",
+        "            context_tokens = tokens_before + tokens_after\n",
+        "            for context in context_tokens:\n",
+        "                if tokens[center] in vocab and context in vocab:\n",
+        "                    pairs.append((tokens[center], context))\n",
+        "\n",
+        "        return pairs\n",
+        "\n",
+        "    @staticmethod\n",
+        "    def build_neg_sample(word, context, vocab, samples=5):\n",
+        "\n",
+        "        \"\"\"Draw negative samples: ids of words that are neither the target nor the context\"\"\"\n",
+        "\n",
+        "        neg_words = [vocab[w] for w in vocab if (w != word) and (w != context)]\n",
+        "        neg_samples = np.random.choice(neg_words, size=samples, replace=False)\n",
+        "        return neg_samples"
+      ],
+      "metadata": {
+        "id": "116wFEAdoyHH"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
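+    {
+      "cell_type": "markdown",
+      "source": [
+        "As a quick sanity check (added for illustration, not part of the original pipeline), the helpers can be run on a toy corpus; the sentence and `window_size=2` below are arbitrary:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Illustrative sanity check of the preprocessing helpers on a toy corpus\n",
+        "toy_text = \"The MIMS major Quentin is a saber fencer. Quentin fences saber.\"\n",
+        "toy_tokens = Preproccess.tokenize(toy_text)\n",
+        "toy_vocab, toy_id_to_word = Preproccess.build_vocab(toy_tokens, min_count=1)\n",
+        "toy_pairs = Preproccess.build_pairs(toy_tokens, toy_vocab, window_size=2)\n",
+        "\n",
+        "print(toy_tokens)\n",
+        "print(toy_pairs[:6])\n",
+        "print(Preproccess.build_neg_sample(\"quentin\", \"fencer\", toy_vocab, samples=3))  # ids of 3 random other words"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },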
sigmoid\"\"\"\n", + "\n", + " x = np.clip(x, -500, 500)\n", + " return 1 / (1 + np.exp(-x))\n", + "\n", + " def cross_entropy_loss(self, probability):\n", + "\n", + " \"\"\"Cross entropy loss function\"\"\"\n", + "\n", + " return -np.log(probability + 1e-10) # 1e-10 added for numerical stability\n", + "\n", + " def neg_sample_train(self, center_token, context_token, negative_tokens, learning_rate=0.01):\n", + "\n", + " \"\"\"Negative sampling training for a single training pair\"\"\"\n", + "\n", + " total_loss = 0\n", + " total_W1_gradient = 0\n", + "\n", + " # Forward prop for positive case\n", + " center_embedding = self.W1[center_token, :] # L₁ = XW₁\n", + " context_vector = self.W2[:, context_token]\n", + " score = np.dot(center_embedding, context_vector) #Lā‚‚ = L₁Wā‚‚, but only for the context token vector\n", + " sigmoid_score = self.sigmoid(score)\n", + " loss = self.cross_entropy_loss(sigmoid_score)\n", + " total_loss += loss\n", + "\n", + " # Backward prop for positive case\n", + " score_gradient = 1 - sigmoid_score # āˆ‚L/āˆ‚S\n", + " W2_gradient = center_embedding * score_gradient # āˆ‚L/āˆ‚Wā‚‚ = āˆ‚L/āˆ‚S * āˆ‚S/āˆ‚Wā‚‚ = XW₁ * āˆ‚L/āˆ‚S\n", + " W1_gradient = context_vector * score_gradient # āˆ‚L/āˆ‚W₁ = āˆ‚L/āˆ‚S * āˆ‚S/āˆ‚W₁ = Wā‚‚ * āˆ‚L/āˆ‚S\n", + "\n", + " # Update weights\n", + " self.W2[:, context_token] -= learning_rate * W2_gradient\n", + " total_W1_gradient += learning_rate * W1_gradient\n", + "\n", + " for neg_token in negative_tokens:\n", + "\n", + " # Forward prop for negative case\n", + " neg_vector = self.W2[:, neg_token]\n", + " neg_score = np.dot(center_embedding, neg_vector)\n", + " neg_sigmoid_score = self.sigmoid(neg_score)\n", + " neg_loss = -np.log(1 - neg_sigmoid_score)\n", + " total_loss += neg_loss\n", + "\n", + " # Backward prop for negative case\n", + " neg_score_gradient = sigmoid_score\n", + " neg_W2_gradient = center_embedding * neg_score_gradient\n", + " neg_W1_gradient = context_vector * neg_score_gradient\n", + "\n", + " # Update weights\n", + " self.W2[:, neg_token] -= learning_rate * neg_W2_gradient\n", + " total_W1_gradient -= learning_rate * neg_W1_gradient\n", + "\n", + " # Update W1\n", + " total_W1_gradient = np.clip(total_W1_gradient, -1, 1)\n", + " self.W1[center_token, :] += total_W1_gradient\n", + "\n", + " return total_loss\n", + "\n", + " def find_similar(self, token):\n", + "\n", + " \"\"\"Use cos similarity to find similar words\"\"\"\n", + "\n", + " word_vec = self.W1[token, :]\n", + " similar = []\n", + " for i in range(self.vocab_size):\n", + " if i != token:\n", + " other_vec = self.W1[i, :]\n", + " norm_word = np.linalg.norm(word_vec)\n", + " norm_other = np.linalg.norm(other_vec)\n", + " if norm_word > 0 and norm_other > 0:\n", + " cosine_sim = np.dot(word_vec, other_vec) / (norm_word * norm_other)\n", + " else:\n", + " cosine_sim = 0\n", + " similar.append((cosine_sim, i))\n", + " similar.sort(key=lambda x:x[0], reverse=True)\n", + " return [word[1] for word in similar]" + ], + "metadata": { + "id": "dNh8VOgWMKUc" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Run Model" + ], + "metadata": { + "id": "hkXCFeJrJzVH" + } + }, + { + "cell_type": "code", + "source": [ + "def epoch(model, pairs, vocab):\n", + " loss = 0\n", + " pair_len = len(pairs)\n", + " done = 0\n", + " for word, context in pairs:\n", + " neg_samples = Preproccess.build_neg_sample(word, context, vocab, samples=5)\n", + " loss += model.neg_sample_train(word, context, neg_samples)\n", + " done += 1\n", 
+ " if ((100 * done) / pair_len) // 1 > ((100 * done - 100) / pair_len) // 1:\n", + " print(\"_\", end=\"\")\n", + " return loss\n", + "\n", + "with open(\"corpus.txt\") as corpus_file:\n", + " CORPUS = corpus_file.read()\n", + "\n", + "EPOCHS = 100\n", + "tokens = Preproccess.tokenize(CORPUS)\n", + "vocab, id_to_token = Preproccess.build_vocab(tokens, min_count=3)\n", + "print(\"~VOCAB LEN~:\", len(vocab))\n", + "pairs = Preproccess.build_pairs(tokens, vocab, window_size=5)\n", + "model = Word2Vec(len(id_to_token), embedding_dim=100)\n", + "print(\"~STARTING TRAINING~\")\n", + "for i in range(EPOCHS):\n", + " print(f\"Epoch {i}: {epoch(model, pairs, vocab) / len(id_to_token)}\")\n", + "print(\"~FINISHED TRAINING~\")\n" + ], + "metadata": { + "id": "hR47oUJxJ23n", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + }, + "outputId": "6e589c28-eac9-4128-802a-32d7b3dc14a4" + }, + "execution_count": 1, + "outputs": [ + { + "output_type": "error", + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: 'corpus.txt'", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/tmp/ipython-input-1995861593.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mloss\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"corpus.txt\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcorpus_file\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mCORPUS\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcorpus_file\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'corpus.txt'" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Notes (Pedantic Commentary Defense :P)\n", + "\n", + "1. I use the term \"similar\" and \"related\" in reference to words, which implies some sort of meaning is encoded. However in practice word2vec is just looking for words with high probabilities of being in similar contexts, which happens to correlate to \"meaning\" decently well.\n", + "2. CBOW shares a very similar intuition to Skip Gram, the only difference is which way you map a target token to context tokens.\n", + "3. Of course, a good deal of mathamatical pain can be shaved off this excercise by using Tensorflow (here is a [Colab](https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb#scrollTo=iLKwNAczHsKg) from Tensorflow that does it) - but this is done from scratch so the inner workings of word2vec can be more easily seen.\n", + "4. Results are (very) subpar with a small corpus size, and this isn't optimized for GPUs sooo... at least the error goes down!\n", + "\n", + "# Sources\n", + "1. https://en.wikipedia.org/wiki/Word2vec\n", + "2. https://arxiv.org/abs/1301.3781 (worth a read - not a long paper and def on the less math intensive side of things)\n", + "3. 
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Notes (Pedantic Commentary Defense :P)\n",
+        "\n",
+        "1. I use the terms \"similar\" and \"related\" in reference to words, which implies that some sort of meaning is encoded. In practice, however, word2vec is just looking for words with a high probability of appearing in similar contexts, which happens to correlate with \"meaning\" decently well.\n",
+        "2. CBOW shares a very similar intuition with Skip Gram; the only difference is the direction in which a target token is mapped to its context tokens.\n",
+        "3. Of course, a good deal of mathematical pain can be shaved off this exercise by using TensorFlow (here is a [Colab](https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb#scrollTo=iLKwNAczHsKg) from TensorFlow that does it) - but doing it from scratch makes the inner workings of word2vec easier to see.\n",
+        "4. Results are (very) subpar with a small corpus, and this isn't optimized for GPUs, sooo... at least the error goes down!\n",
+        "\n",
+        "# Sources\n",
+        "1. https://en.wikipedia.org/wiki/Word2vec\n",
+        "2. https://arxiv.org/abs/1301.3781 (worth a read - not a long paper, and definitely on the less math-intensive side of things)\n",
+        "3. https://ahammadnafiz.github.io/posts/Word2Vec-From-Scratch-A-Complete-Mathematical-and-Implementation-Guide/#implementation"
+      ],
+      "metadata": {
+        "id": "0e7TsIRoSmnV"
+      }
+    }
+  ]
+}
\ No newline at end of file