GloVe: Global Vectors for Word Representation
Problem
Framing
Count models used global corpus statistics but missed the linear regularities seen in predictive embeddings. GloVe closes this gap with a weighted log-bilinear fit to co-occurrence counts, reaching 75.0% analogy accuracy with 300d vectors on 42B tokens.
Currently Used Methods
Foundational
- @mikolovWord2vec2013 — predictive CBOW and skip-gram embeddings with strong analogy structure.
- Limitation in context: uses local windows, not explicit global co-occurrence statistics.
- Neural Probabilistic Language Model — neural embeddings from next-word prediction.
- Limitation in context: expensive normalization, no direct count factorization.
- A Unified Architecture for Natural Language Processing — full-context neural word representations for NLP.
- Limitation in context: task-tied training, not simple unsupervised corpus statistics.
- Improving Distributional Similarity with Lessons Learned from Word Embeddings — SVD-style count baselines for distributional semantics.
- Limitation in context: weaker analogy structure than predictive embeddings.
- Word Embeddings through Hellinger PCA — Hellinger PCA count-based embeddings.
- Limitation in context: lower semantic and overall analogy accuracy.
Proposed Method
Architecture
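GloVe learns word vectors $w_i$, context vectors $\tilde{w}_j$, and biases $b_i$ and $\tilde{b}_j$. It regresses $\log X_{ij}$, the log co-occurrence count, from the bilinear score $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j$ and uses $w_i + \tilde{w}_i$ as the final embedding.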

Loss / Objective
The model minimizes a weighted least-squares fit over nonzero co-occurrences.
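For reference, the objective can be written out as follows. Here $V$ is the vocabulary size, $X_{ij}$ the co-occurrence count of word $i$ with context $j$, $w_i$ and $\tilde{w}_j$ the word and context vectors, and $b_i$, $\tilde{b}_j$ their biases. The weighting function $f$ vanishes at $X_{ij} = 0$, which is what restricts the sum to nonzero entries.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$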
Algorithm
After optimization, the final representation sums the two learned embeddings.
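The fit and the final summing step can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `weight` and `train_glove`, the initialization scale, the learning rate, and starting the AdaGrad accumulators at 1.0 are all choices made here for the example.

```python
import math
import random


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def train_glove(cooc, vocab_size, dim=8, epochs=100, lr=0.05, seed=0):
    """Fit vectors to nonzero co-occurrence counts with per-parameter AdaGrad.

    cooc maps (word_id, context_id) -> count > 0.
    Returns (summed embeddings, weighted squared error per epoch).
    """
    rng = random.Random(seed)

    def small_random():
        return [[(rng.random() - 0.5) / dim for _ in range(dim)]
                for _ in range(vocab_size)]

    def ones():
        return [[1.0] * dim for _ in range(vocab_size)]

    W, C = small_random(), small_random()      # word and context vectors
    bw, bc = [0.0] * vocab_size, [0.0] * vocab_size
    gW, gC = ones(), ones()                    # AdaGrad accumulators (toy init)
    gbw, gbc = [1.0] * vocab_size, [1.0] * vocab_size

    pairs = list(cooc.items())
    history = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for (i, j), x in pairs:
            score = sum(a * b for a, b in zip(W[i], C[j])) + bw[i] + bc[j]
            diff = score - math.log(x)
            fx = weight(x)
            total += fx * diff * diff
            g = 2.0 * fx * diff                # shared gradient factor
            for k in range(dim):
                grad_w, grad_c = g * C[j][k], g * W[i][k]
                W[i][k] -= lr * grad_w / math.sqrt(gW[i][k])
                C[j][k] -= lr * grad_c / math.sqrt(gC[j][k])
                gW[i][k] += grad_w ** 2
                gC[j][k] += grad_c ** 2
            bw[i] -= lr * g / math.sqrt(gbw[i])
            bc[j] -= lr * g / math.sqrt(gbc[j])
            gbw[i] += g * g
            gbc[j] += g * g
        history.append(total)

    # Final representation: sum of word and context vectors for each type.
    embeddings = [[w + c for w, c in zip(W[i], C[i])] for i in range(vocab_size)]
    return embeddings, history
```

Only the nonzero entries of `cooc` are visited, so the cost of one epoch scales with the number of observed pairs rather than with the square of the vocabulary size.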
Training Procedure
- Corpora: 1.6B, 6B, and 42B tokens.
- Dimensions: 50, 100, 300, 1000.
- Window size: 10 in key sweeps.
- Weighting: $x_{\max} = 100$, $\alpha = 3/4$.
- Optimizer: AdaGrad.
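The window size controls which word pairs enter the co-occurrence matrix that training consumes. A minimal sketch of that counting step is shown here; the function name `build_cooccurrence` and its parameters are hypothetical. The 1/d distance weighting is how the GloVe paper describes its counts, but it is not stated in these notes, so treat it as an assumption.

```python
from collections import defaultdict


def build_cooccurrence(tokens, window=10, symmetric=True):
    """Accumulate distance-weighted co-occurrence counts X[(word, context)].

    A context word d positions away contributes 1/d, so nearby words
    count more than distant ones.
    """
    counts = defaultdict(float)
    for pos, word in enumerate(tokens):
        start = max(0, pos - window)
        for ctx_pos in range(start, pos):
            d = pos - ctx_pos
            # Left context: the earlier token is a context of the current word.
            counts[(word, tokens[ctx_pos])] += 1.0 / d
            if symmetric:
                # Mirror entry: the current word is a right context of the earlier token.
                counts[(tokens[ctx_pos], word)] += 1.0 / d
    return dict(counts)
```

For example, `build_cooccurrence("a b c".split(), window=2)` gives weight 1.0 to the adjacent pair `("a", "b")` and 0.5 to `("a", "c")`. Passing `symmetric=False` keeps left contexts only, which corresponds to the asymmetric setting in the ablations.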
Evaluation
Datasets
- Word analogy.
- Word similarity.
- Named entity recognition.
- Training corpora: 1.6B, 6B, 42B tokens.
Metrics
- Analogy accuracy: semantic, syntactic, overall.
- Word similarity: Spearman correlation.
- NER: F1.
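The first two metrics can be sketched as follows. This is an illustration, not the evaluation scripts used in the paper: it assumes analogies "a is to b as c is to ?" are answered by the vector offset b - a + c scored with cosine similarity, and it computes Spearman correlation as the Pearson correlation of average ranks. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest (by cosine) to v_b - v_a + v_c, excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation between model scores and human ratings."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)
```

Analogy accuracy is then the fraction of questions for which `solve_analogy` returns the expected word, and the similarity score is `spearman` applied to model cosine similarities and human ratings over the same word pairs.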
Headline results
- Analogy, 100d, 1.6B: 67.5 semantic, 54.3 syntactic, 60.3 overall.
- Analogy, 300d, 1.6B: 80.8 semantic, 61.5 syntactic, 70.3 overall.
- Analogy, 300d, 6B: 77.4 semantic, 67.0 syntactic, 71.7 overall.
- Analogy, 300d, 42B: 81.9 semantic, 69.3 syntactic, 75.0 overall.
- NER, 50d: 85.0 F1.
Table 1: Word analogy accuracy across models, dimensions, and corpus sizes
| Model | Dim. | Size | Sem. | Syn. | Tot. |
|---|---|---|---|---|---|
| ivLBL | 100 | 1.5B | 55.9 | 50.1 | 53.2 |
| HPCA | 100 | 1.6B | 4.2 | 16.4 | 10.8 |
| GloVe | 100 | 1.6B | 67.5 | 54.3 | 60.3 |
| SG | 300 | 1B | 61 | 61 | 61 |
| CBOW | 300 | 1.6B | 16.1 | 52.6 | 36.1 |
| vLBL | 300 | 1.5B | 54.2 | 64.8 | 60.0 |
| ivLBL | 300 | 1.5B | 65.2 | 63.0 | 64.0 |
| GloVe | 300 | 1.6B | 80.8 | 61.5 | 70.3 |
| SVD | 300 | 6B | 6.3 | 8.1 | 7.3 |
| SVD-S | 300 | 6B | 36.7 | 46.6 | 42.1 |
| SVD-L | 300 | 6B | 56.6 | 63.0 | 60.1 |
| CBOW | 300 | 6B | 63.6 | 67.4 | 65.7 |
| SG | 300 | 6B | 73.0 | 66.0 | 69.1 |
| GloVe | 300 | 6B | 77.4 | 67.0 | 71.7 |
| CBOW | 1000 | 6B | 57.3 | 68.9 | 63.7 |
| SG | 1000 | 6B | 66.1 | 65.1 | 65.6 |
| SVD-L | 300 | 42B | 38.4 | 58.2 | 49.2 |
| GloVe | 300 | 42B | 81.9 | 69.3 | 75.0 |
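The Tot. column is not the plain average of Sem. and Syn. It is consistent with a mean weighted by the number of questions in each subset. The sketch below checks this under an assumption that is not stated in these notes: the standard analogy test set has 8,869 semantic and 10,675 syntactic questions. The helper name `overall_accuracy` is hypothetical.

```python
# Assumed question counts for the analogy test set (not given in these notes).
N_SEMANTIC = 8869
N_SYNTACTIC = 10675


def overall_accuracy(sem, syn, n_sem=N_SEMANTIC, n_syn=N_SYNTACTIC):
    """Question-weighted mean of semantic and syntactic accuracy, in percent."""
    return (sem * n_sem + syn * n_syn) / (n_sem + n_syn)


# GloVe, 300d, 42B row: 81.9 semantic, 69.3 syntactic.
print(round(overall_accuracy(81.9, 69.3), 1))
```

Because syntactic questions make up a little over half of the set, a model with a large semantic lead and a small syntactic deficit can still rank lower overall than the plain average would suggest.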
Ablations
- Vector size: accuracy rises with dimension, then saturates.
- Window size: larger windows help semantic analogies more.
- Context type: symmetric helps semantic; asymmetric helps syntactic.
- Training time: GloVe reaches target accuracy faster than CBOW and skip-gram.
Method Strengths and Weaknesses
Strengths
- Combines global counts with linear analogy structure.
- Avoids softmax normalization with sparse weighted regression.
- Best reported analogy score is 75.0 overall at 42B tokens.
- Summing word and context vectors improves results.
Weaknesses
- Requires explicit construction of the co-occurrence matrix.
- Weighting choices $x_{\max}$ and $\alpha$ are empirical.
- Learns static type embeddings, not token-specific meanings.
- Similarity gains are smaller than analogy gains.
Suggestions from the authors
- Compare count and prediction models under matched compute budgets.
- Explain why predictive models outperform many count baselines.
- Develop principled weighting and multi-pass training schemes.
- Study vector-space structure beyond scalar similarity.
Links
Prior Papers
- @mikolovWord2vec2013 — predictive embeddings are the direct baseline GloVe matches and often exceeds on analogies.
Further Papers
- @petersELMo2018 — contextual token representations address GloVe's static embedding limitation.
- @devlinBERT2018 — masked-language-model pretraining supersedes static word vectors on downstream NLP.
- @radfordGPT2018 — autoregressive contextual representations shift modeling from word types to token states.