Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

2013 · ICLR Workshop

Problem

Framing

Earlier neural language models spent most of their computation in a nonlinear hidden layer, and none had been trained on more than a few hundred million words. The paper closes this gap with two shallow log-linear architectures, CBOW and skip-gram, that learn high-quality vectors from a 1.6-billion-word corpus in less than a day.

Currently Used Methods

Foundational

Proposed Method

Architecture

The paper proposes two shallow models that share a single projection (embedding) layer and have no nonlinear hidden layer. CBOW averages the vectors of four past and four future context words to predict the center word. Skip-gram uses the center word to predict surrounding words within a window whose size is sampled for each training word, up to a maximum distance of $C = 10$.

Architecture diagram: CBOW sums context embeddings to predict the center word, while skip-gram uses the center word embedding to predict surrounding words.
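The way the two architectures form training examples can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`cbow_example`, `average_vectors`, `skipgram_pairs`) and the sample sentence are hypothetical, and the prediction step (hierarchical softmax) is left out.

```python
import random

def cbow_example(tokens, center, window=4):
    """Return (context_words, target): up to `window` words on each side of the center."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right, tokens[center]

def average_vectors(vectors):
    """Element-wise mean of equal-length vectors (the shared CBOW projection)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def skipgram_pairs(tokens, center, max_window=10, rng=random):
    """Return (center, context) pairs for a window R drawn uniformly from 1..max_window."""
    r = rng.randint(1, max_window)
    lo = max(0, center - r)
    hi = min(len(tokens), center + r + 1)
    return [(tokens[center], tokens[j]) for j in range(lo, hi) if j != center]

# Hypothetical sample sentence, used only for illustration.
tokens = "the quick brown fox jumps over the lazy dog".split()
context, target = cbow_example(tokens, center=4)
print(context, "->", target)
print(skipgram_pairs(tokens, center=4, max_window=2, rng=random.Random(0)))
```

Sampling the window size means nearby words are drawn as skip-gram targets more often than distant ones, which gives them more weight during training.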

Loss / Objective

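The paper designs both models to minimize per-token training complexity $Q$ under hierarchical softmax. Here $E$ is the number of training epochs, $T$ the number of words in the training set, $N$ the number of context words, $D$ the vector dimensionality, $V$ the vocabulary size, and $C$ the maximum context distance.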

$$O = E \times T \times Q$$

$$Q_{\mathrm{CBOW}} = N \times D + D \times \log_2(V)$$

$$Q_{\mathrm{SG}} = C \times \left(D + D \times \log_2(V)\right)$$
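To make the complexity formulas concrete, the sketch below evaluates them for one set of inputs. The values N = 8, D = 300, V = 2^20, and C = 10 are illustrative assumptions chosen so the logarithm is exact; they are not figures reported in the paper, and the function names are hypothetical.

```python
import math

def q_cbow(n, d, v):
    """Per-token CBOW complexity: N*D + D*log2(V)."""
    return n * d + d * math.log2(v)

def q_skipgram(c, d, v):
    """Per-token skip-gram complexity: C*(D + D*log2(V))."""
    return c * (d + d * math.log2(v))

def total_complexity(epochs, train_words, q):
    """Overall training complexity: O = E * T * Q."""
    return epochs * train_words * q

# Illustrative values only (assumptions, not taken from the paper).
N, D, V, C = 8, 300, 2 ** 20, 10
print(q_cbow(N, D, V))       # 8400.0
print(q_skipgram(C, D, V))   # 63000.0
```

Under these assumed values skip-gram costs 7.5 times as much per token as CBOW, because the $D \times \log_2(V)$ output term is paid once per predicted context word.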

Algorithm

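Word relations are tested with vector offsets: an analogy question is answered by computing an offset vector and retrieving the nearest word by cosine distance, excluding the question words themselves. For example, the word that relates to "small" as "biggest" relates to "big" should be "smallest".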

$$\mathbf{x} = \operatorname{vector}(\text{``biggest''}) - \operatorname{vector}(\text{``big''}) + \operatorname{vector}(\text{``small''})$$
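The offset test can be sketched in a few lines of plain Python. The two-dimensional `toy_vectors` below are hand-made for illustration (the first coordinate loosely encodes size, the second the superlative form); they are an assumption, not learned embeddings, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using x = vec(b) - vec(a) + vec(c)."""
    x = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The question words themselves are excluded from the search.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(x, vectors[w]))

# Hand-made toy vectors (hypothetical, for illustration only).
toy_vectors = {
    "big": [1.0, 0.0],
    "biggest": [1.0, 1.0],
    "small": [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "cat": [0.0, -1.0],
}

print(analogy(toy_vectors, "big", "biggest", "small"))  # smallest
```

With real embeddings the candidate set is the whole vocabulary, so the search is usually done as one matrix product over normalized vectors instead of a Python loop.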

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: Restricted analogy accuracy for CBOW across vector dimensionality and training words.

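| Dimensionality / Training words | 24M | 49M | 98M | 196M | 391M | 783M |
|---|---|---|---|---|---|---|
| 50 | 13.4 | 15.7 | 18.6 | 19.1 | 22.5 | 23.2 |
| 100 | 19.4 | 23.1 | 27.8 | 28.7 | 33.4 | 32.2 |
| 300 | 23.2 | 29.2 | 35.3 | 38.6 | 43.7 | 45.9 |
| 600 | 24.0 | 30.1 | 36.5 | 40.8 | 46.6 | 50.4 |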

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers