Observation Log

Word2Vec 1. Efficient Estimation of Word Representations in Vector Space

  1. What are the main advantages and disadvantages of simple language models?

    • Simplicity and robustness
    • Removing the hidden layer (so the model is no longer a full neural network) greatly reduces the computational cost, which makes it possible to train on much larger vocabularies and corpora
    • Because they are simple, they are efficient to train and run. However, natural human language has many nuances, and simple models often fall short in capturing them.
  2. Summarize the main objectives of this paper

    • Build a model that learns word vectors capturing both syntactic and semantic similarities between words
    • Preserve linear algebraic regularities between words, so that relationships correspond to vector offsets
    • Do this at very large scale - billions of words of training data
  3. Explain the 2 main computational bottlenecks in NNLMs.

    • The two main bottlenecks show up in Q, the per-example training complexity of the model (the formulas are written out below)
    • When computing Q for an NNLM, the first problem is the H x V term between the hidden layer and the output layer; it can be reduced with techniques such as hierarchical softmax, which works with any network that uses a softmax output layer and brings the term down to roughly H x log2(V)
    • Even then, the N x D x H term between the projection layer and the hidden layer remains, so in their new architectures the authors removed the non-linear hidden layer entirely to obtain a simpler model
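    • For reference, the per-example training complexities Q from the paper, with vocabulary size V, embedding dimensionality D, context size N (or C for Skip-gram), and hidden-layer size H:

```latex
% Per-example training complexity Q, as given in the paper; hierarchical softmax
% is what replaces the V factor in the output layer with roughly log2(V).
\begin{align*}
Q_{\text{NNLM}}       &= N \cdot D + N \cdot D \cdot H + H \cdot V \\
Q_{\text{CBOW}}       &= N \cdot D + D \cdot \log_2 V \\
Q_{\text{Skip-gram}}  &= C \cdot \left(D + D \cdot \log_2 V\right)
\end{align*}
```
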
  4. Describe the continuous Bag of Words model

    • Reference Figure 1: the CBOW architecture predicts the current word from its surrounding context; the input vectors of the context words are averaged into a single projection, which is then used to predict the current word (a minimal sketch follows below)
    • The order of the context words does not matter, hence "bag of words"
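    • A minimal sketch of the CBOW forward pass (my own illustration, not the authors' code); the toy corpus, dimensionality, and random weights are assumptions for demonstration only:

```python
# A minimal CBOW forward pass (an illustration, not the authors' code).
# Assumptions: toy corpus, tiny vocabulary, D=8 dimensions, untrained random weights.
import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                  # vocabulary size, embedding dimensionality

W_in = rng.normal(0.0, 0.1, (V, D))   # input (projection) embeddings
W_out = rng.normal(0.0, 0.1, (V, D))  # output embeddings

def cbow_probs(context_words):
    """Distribution over the vocabulary for the word in the middle of the context."""
    h = W_in[[word2id[w] for w in context_words]].mean(axis=0)  # order is ignored
    scores = W_out @ h
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Predict the middle word of "quick brown ? jumps over" from its four neighbours.
p = cbow_probs(["quick", "brown", "jumps", "over"])
print(vocab[int(p.argmax())])         # untrained weights, so essentially a random guess
```
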
  5. Describe the continuous skip-gram model

    • Reference Figure 1: the Skip-gram predicts surrounding words given the current word
    • Distance to the current word matters: more distant context words are sampled less often (the effective window size is drawn at random up to a maximum C), so closer words get more weight - see the pair-generation sketch below
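    • A sketch of how skip-gram training pairs can be generated with this reduced-window trick (my own illustration; the toy corpus and maximum window C are assumptions):

```python
# Generating skip-gram training pairs with the reduced-window trick described in
# the paper: for each centre word, sample R in [1, C] and use R words before and
# after it, so nearer words are sampled more often than distant ones.
import random

random.seed(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
C = 4  # maximum window size

def skipgram_pairs(tokens, max_window):
    pairs = []
    for i, center in enumerate(tokens):
        r = random.randint(1, max_window)            # effective window for this position
        for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
            if j != i:
                pairs.append((center, tokens[j]))    # (current word, context word to predict)
    return pairs

for center, context in skipgram_pairs(corpus, C)[:10]:
    print(center, "->", context)
```
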
  6. What metric did they introduce? Although they never define it mathematically, can you write it explicitly?

    • For an analogy question "a is to b as c is to ?", compute y = vec(b) - vec(a) + vec(c) and take the vocabulary word whose vector is closest to y by cosine similarity
    • A question counts as correct only if that closest word matches the expected answer exactly; accuracy is the fraction of exact matches, with no partial credit (a worked sketch follows below)
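    • Written out as code, a sketch of the metric (the tiny hand-made vectors are assumptions for illustration only):

```python
# The analogy metric written out: for "a is to b as c is to ?", compute
# y = vec(b) - vec(a) + vec(c) and return the word closest to y by cosine
# similarity; only an exact match with the expected word counts as correct.
# The tiny hand-made vectors below are assumptions purely for illustration.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def analogy(a, b, c):
    y = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -float("inf")
    for w, v in emb.items():
        if w in (a, b, c):
            continue                                            # skip the question words themselves
        sim = v @ y / (np.linalg.norm(v) * np.linalg.norm(y))   # cosine similarity
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman"))                          # expected: "queen"
```
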
  7. Describe the Tasks of Table 1

    • The task contains 5 types of semantic questions and 9 types of syntactic questions, each an analogy of the form "a is to b as c is to d"; accuracy is evaluated per question type, which shows how well the model captures syntactic and semantic similarities (a few illustrative questions are listed below)
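    • A few illustrative questions in the style of Table 1 (paraphrased from memory, not quoted verbatim from the paper):

```python
# Illustrative analogy questions in the style of Table 1 (paraphrased from memory,
# not copied verbatim from the paper); each tuple is (a, b, c, expected d).
semantic_examples = [
    ("Athens", "Greece", "Oslo", "Norway"),               # common capital city
    ("brother", "sister", "grandson", "granddaughter"),   # man-woman
]
syntactic_examples = [
    ("apparent", "apparently", "rapid", "rapidly"),       # adjective to adverb
    ("great", "greater", "tough", "tougher"),             # comparative
]
```
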
  8. What does Table 2 show?

    • Table 2 shows accuracy as a function of word-vector dimensionality and training-set size, using the CBOW architecture; it helps estimate which configuration gives the best results in the least time, and visualizes how much accuracy is lost with lower-dimensional vectors or a smaller training set
    • Dimensionality here means the size of the embedding vectors
  9. How are CBOW and Skip-Gram learning word representations that display an algebraic linearity in their relationships?

    • The answer is unclear: the paper does not really explain this link; it states the objective and shows the resulting behaviour, but offers no explanation of why the linearity emerges
    • Worth checking whether the authors were partly lucky, and considering why this gap in the explanation exists; there should be a logical step connecting the training objective to the linear structure
    • The operations needed to get from input to output are nearly linear (projection and averaging, with no non-linear hidden layer)
    • Because the same linear projection is shared across all words, related words end up grouped in similar regions, and consistent relationships appear as consistent vector offsets
  10. Can you think of alternative tasks to CBOW or Skip-Gram that would be learning something useful? (Both are unsupervised: take any valid text, apply CBOW or Skip-gram, and the data scales easily.)

    • Machine translation (mentioned in the paper): translation can be framed as supervised or unsupervised; if you give the model a corpus of English books and a corpus of German books with no aligned labels and ask it to translate, the task is unsupervised, and CBOW or Skip-gram embeddings can support it

    • Autocomplete / sentence completion: this could be optimized further, but these models appear to be rather accurate (for 2013); a training sketch using gensim follows below
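    • A sketch using gensim to show the unsupervised setup (an independent implementation, not the paper's code; the toy sentences are assumptions):

```python
# A sketch using gensim (an independent implementation, not the paper's code),
# assuming gensim 4.x: train CBOW and Skip-gram on raw, unlabeled sentences.
# The toy sentences are assumptions; in practice you would stream a huge corpus.
from gensim.models import Word2Vec

sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the dog sleeps while the fox runs".split(),
]

cbow = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)  # sg=1 -> Skip-gram

print(cbow.wv.most_similar("fox", topn=3))   # nearest neighbours by cosine similarity
print(skipgram.wv["dog"][:5])                # first few dimensions of a learned vector
```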

https://doi.org/10.48550/arXiv.1301.3781