GloVe: Global Vectors for Word Representation

Jeffrey Pennington, Richard Socher, Christopher D. Manning

2014 · EMNLP

Problem

Framing

Count models used global corpus statistics but missed the linear regularities seen in predictive embeddings. GloVe closes this gap with a weighted log-bilinear fit to co-occurrence counts, reaching 75.0% analogy accuracy with 300d vectors on 42B tokens.

Currently Used Methods

Foundational

@mikolovWord2vec2013 — predictive CBOW and skip-gram embeddings with strong analogy structure.
- Limitation in context: uses local windows, not explicit global co-occurrence statistics.
A Neural Probabilistic Language Model — neural embeddings from next-word prediction.
- Limitation in context: expensive normalization, no direct count factorization.
A Unified Architecture for Natural Language Processing — full-context neural word representations for NLP.
- Limitation in context: task-tied training, not simple unsupervised corpus statistics.
Improving Distributional Similarity with Lessons Learned from Word Embeddings — SVD-style count baselines for distributional semantics.
- Limitation in context: weaker analogy structure than predictive embeddings.
Word Embeddings through Hellinger PCA — count-based embeddings from Hellinger PCA.
- Limitation in context: lower semantic and overall analogy accuracy.

Proposed Method

Architecture

GloVe learns word vectors $\mathbf{w}_i$, context vectors $\tilde{\mathbf{w}}_j$, and biases $b_i, \tilde{b}_j$. It regresses $\log X_{ij}$ from a bilinear score and uses $\mathbf{w}_i + \tilde{\mathbf{w}}_i$ as the final embedding.

Figure: the model equation $\mathbf{w}_i^T \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log X_{ik}$ beside the weighting curve $f(X_{ij})$, which saturates at $x_{\max}$ with $\alpha = 3/4$.

Loss / Objective

The model minimizes a weighted least-squares objective. The sum nominally runs over all word pairs, but $f(0) = 0$, so only nonzero co-occurrence counts contribute.

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{if } x \ge x_{\max} \end{cases}$$
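
The objective and weighting function above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the dict-of-pairs storage for $X$, and the list-of-lists vectors are all assumptions made for readability.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max)^alpha below the cutoff, 1 at or above it."""
    if x <= 0:
        return 0.0  # f(0) = 0, so empty cells never contribute
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Weighted least-squares objective J over the stored (nonzero) entries.

    cooc maps (i, j) -> X_ij > 0 (hypothetical storage format);
    W / W_ctx are lists of word / context vectors, b / b_ctx their biases.
    """
    total = 0.0
    for (i, j), x in cooc.items():
        score = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j])) + b[i] + b_ctx[j]
        total += weight(x) * (score - math.log(x)) ** 2
    return total
```

Storing only nonzero pairs keeps the cost proportional to the number of observed co-occurrences rather than to $V^2$.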

Algorithm

After optimization, the final representation sums the two learned embeddings.

\mathbf{z}_i = \mathbf{w}_i + \tilde{\mathbf{w}}_i
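
As a rough sketch of the whole procedure, the loop below fits the objective with per-parameter AdaGrad updates and then sums the two vector sets. Everything beyond the objective, the weighting parameters, and the use of AdaGrad is an assumption for illustration: the function name `train_glove`, the learning rate, the initialization range, the accumulator start value of 1.0, and the toy-sized defaults are not taken from the source.

```python
import math
import random

def train_glove(cooc, vocab_size, dim=5, epochs=200, lr=0.05,
                x_max=100.0, alpha=0.75, seed=0):
    """Fit GloVe-style vectors to a dict {(i, j): X_ij > 0} with AdaGrad."""
    rng = random.Random(seed)

    def small_random_matrix():
        return [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
                for _ in range(vocab_size)]

    W, W_ctx = small_random_matrix(), small_random_matrix()
    b, b_ctx = [0.0] * vocab_size, [0.0] * vocab_size
    # AdaGrad accumulators of squared gradients, started at 1.0 (assumption).
    hW = [[1.0] * dim for _ in range(vocab_size)]
    hW_ctx = [[1.0] * dim for _ in range(vocab_size)]
    hb, hb_ctx = [1.0] * vocab_size, [1.0] * vocab_size

    entries = list(cooc.items())
    history = []  # summed weighted squared error per epoch
    for _ in range(epochs):
        rng.shuffle(entries)
        epoch_loss = 0.0
        for (i, j), x in entries:
            f = (x / x_max) ** alpha if x < x_max else 1.0
            diff = (sum(W[i][k] * W_ctx[j][k] for k in range(dim))
                    + b[i] + b_ctx[j] - math.log(x))
            epoch_loss += f * diff * diff
            g = 2.0 * f * diff  # factor shared by every partial derivative
            for k in range(dim):
                g_w, g_c = g * W_ctx[j][k], g * W[i][k]
                W[i][k] -= lr * g_w / math.sqrt(hW[i][k])
                W_ctx[j][k] -= lr * g_c / math.sqrt(hW_ctx[j][k])
                hW[i][k] += g_w * g_w
                hW_ctx[j][k] += g_c * g_c
            b[i] -= lr * g / math.sqrt(hb[i])
            b_ctx[j] -= lr * g / math.sqrt(hb_ctx[j])
            hb[i] += g * g
            hb_ctx[j] += g * g
        history.append(epoch_loss)

    # Final embedding: z_i = w_i + w~_i
    Z = [[W[i][k] + W_ctx[i][k] for k in range(dim)] for i in range(vocab_size)]
    return Z, history
```

Calling `train_glove({(0, 1): 10.0, (1, 0): 10.0}, vocab_size=2)` returns the summed vectors and a per-epoch loss history that can be inspected for convergence.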

Training Procedure

Corpora: 1.6B, 6B, and 42B tokens.
Dimensions: 50, 100, 300, 1000.
Window size: 10 in key sweeps.
Weighting: $x_{\max}=100$, $\alpha=3/4$.
Optimizer: AdaGrad.
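
Before training, the co-occurrence counts $X_{ij}$ have to be collected from the corpus. The sketch below shows one way this might look for a tokenized text and a fixed window. The function name and data layout are hypothetical, and the $1/d$ weighting of pairs that are $d$ words apart is an assumption that these notes do not state. Here `symmetric=False` counts only left-hand context.

```python
from collections import defaultdict

def build_cooccurrence(tokens, vocab, window=10, symmetric=True):
    """Accumulate distance-weighted co-occurrence counts X_ij.

    tokens: list of strings; vocab: dict mapping word -> integer id.
    Returns a dict {(i, j): weighted count} holding only nonzero entries.
    """
    ids = [vocab.get(t) for t in tokens]  # None marks out-of-vocabulary tokens
    cooc = defaultdict(float)
    for pos, i in enumerate(ids):
        if i is None:
            continue
        for d in range(1, min(window, pos) + 1):
            j = ids[pos - d]
            if j is None:
                continue
            cooc[(i, j)] += 1.0 / d      # j appears d words to the left of i
            if symmetric:
                cooc[(j, i)] += 1.0 / d  # i appears d words to the right of j
    return dict(cooc)
```

The resulting dictionary has the same `(i, j) -> X_ij` shape that a sparse weighted regression over nonzero entries would consume.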

Evaluation

Datasets

Word analogy.
Word similarity.
Named entity recognition.
Training corpora: 1.6B, 6B, 42B tokens.

Metrics

Analogy accuracy: semantic, syntactic, overall.
Word similarity: Spearman correlation.
NER: $F_1$.
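
To make the analogy metric concrete, here is a small sketch of how a question "a is to b as c is to ?" is commonly scored: pick the word whose vector is closest in cosine similarity to $\mathbf{z}_b - \mathbf{z}_a + \mathbf{z}_c$. The helper names, the dict-based embedding table, and the exclusion of the three query words from the candidates are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def solve_analogy(emb, a, b, c):
    """Return the word d whose vector is most similar to emb[b] - emb[a] + emb[c]."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    best_word, best_sim = None, -math.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # query words are excluded from the candidates (assumption)
        sim = cosine(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered exactly right."""
    correct = sum(1 for a, b, c, d in questions if solve_analogy(emb, a, b, c) == d)
    return correct / len(questions)
```

Semantic, syntactic, and overall accuracy would then be the same function applied to the corresponding subsets of questions.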

Headline results

Analogy, 100d, 1.6B: 67.5 semantic, 54.3 syntactic, 60.3 overall.
Analogy, 300d, 1.6B: 80.8 semantic, 61.5 syntactic, 70.3 overall.
Analogy, 300d, 6B: 77.4 semantic, 67.0 syntactic, 71.7 overall.
Analogy, 300d, 42B: 81.9 semantic, 69.3 syntactic, 75.0 overall.
NER, 50d: 85.0 $F_1$.

Table 1: Word analogy accuracy across models, dimensions, and corpus sizes

Model	Dim.	Size	Sem.	Syn.	Tot.
ivLBL	100	1.5B	55.9	50.1	53.2
HPCA	100	1.6B	4.2	16.4	10.8
GloVe	100	1.6B	67.5	54.3	60.3
SG	300	1B	61	61	61
CBOW	300	1.6B	16.1	52.6	36.1
vLBL	300	1.5B	54.2	64.8	60.0
ivLBL	300	1.5B	65.2	63.0	64.0
GloVe	300	1.6B	80.8	61.5	70.3
SVD	300	6B	6.3	8.1	7.3
SVD-S	300	6B	36.7	46.6	42.1
SVD-L	300	6B	56.6	63.0	60.1
CBOW $^{\dagger}$	300	6B	63.6	67.4	65.7
SG $^{\dagger}$	300	6B	73.0	66.0	69.1
GloVe	300	6B	77.4	67.0	71.7
CBOW	1000	6B	57.3	68.9	63.7
SG	1000	6B	66.1	65.1	65.6
SVD-L	300	42B	38.4	58.2	49.2
GloVe	300	42B	81.9	69.3	75.0

Ablations

Vector size: accuracy rises with dimension, then saturates.
Window size: larger windows help semantic analogies more.
Context type: symmetric helps semantic; asymmetric helps syntactic.
Training time: GloVe reaches target accuracy faster than CBOW and skip-gram.

Method Strengths and Weaknesses

Strengths

Combines global counts with linear analogy structure.
Avoids softmax normalization with sparse weighted regression.
Best reported analogy score is 75.0 overall at 42B tokens.
Summing word and context vectors improves results.

Weaknesses

Requires explicit construction of the co-occurrence matrix.
Weighting choices $x_{\max}$ and $\alpha$ are empirical.
Learns static type embeddings, not token-specific meanings.
Similarity gains are smaller than analogy gains.

Suggestions from the authors

Compare count and prediction models under matched compute budgets.
Explain why predictive models outperform many count baselines.
Develop principled weighting and multi-pass training schemes.
Study vector-space structure beyond scalar similarity.
