Note: Deep contextualized word representations
Peters et al. - 2018 - Deep contextualized word representations
Introduction
They introduce a new type of deep contextualised word representation that models complex characteristics of word use (e.g. syntax and semantics) and how these uses vary across linguistic contexts. Their word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).
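A minimal sketch of the idea of "learned functions of the internal states": the biLM's per-token layer states are collapsed into one vector via learned scalar weights, in the style of the ELMo weighted combination. The class name `LayerMixer`, the layer count, and all tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LayerMixer(nn.Module):
    """Collapses the biLM's layer states into one vector per token via learned scalar weights."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # per-layer weights, softmax-normalised
        self.gamma = nn.Parameter(torch.ones(1))               # overall scale

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, dim) -- internal states of the biLM
        weights = torch.softmax(self.scalars, dim=0).view(-1, 1, 1, 1)
        return self.gamma * (weights * layer_states).sum(dim=0)

# Usage: 3 layers of hidden states for 2 sentences of length 5, hidden size 1024 (assumed)
states = torch.randn(3, 2, 5, 1024)
contextual_vectors = LayerMixer(num_layers=3)(states)  # (2, 5, 1024), one vector per token
```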
Bidirectional language models
Given a sequence of $N$ tokens $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modelling the probability of each token $t_k$ given its history:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$
A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$$
The biLM formulation jointly maximises the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$
The parameters for both the token representation $\Theta_{x}$ and the softmax layer $\Theta_{s}$ are shared between the forward and backward directions, while separate parameters are maintained for the LSTMs in each direction.
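A minimal PyTorch sketch of this joint objective, assuming a toy single-layer LSTM in each direction: the token embedding ($\Theta_x$) and softmax layer ($\Theta_s$) are shared across directions, each direction keeps its own LSTM, and the two per-direction negative log likelihoods are summed. The `BiLM` class, dimensions, and vocabulary size are assumptions for illustration, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)             # shared token representation (Theta_x)
        self.fwd_lstm = nn.LSTM(dim, dim, batch_first=True)    # forward-direction parameters
        self.bwd_lstm = nn.LSTM(dim, dim, batch_first=True)    # backward-direction parameters
        self.softmax = nn.Linear(dim, vocab_size)              # shared softmax layer (Theta_s)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); returns the negative joint log likelihood to minimise
        x = self.embed(tokens)
        # forward LM: predict t_k from t_1 .. t_{k-1}
        fwd_h, _ = self.fwd_lstm(x[:, :-1])
        fwd_loss = nn.functional.cross_entropy(
            self.softmax(fwd_h).reshape(-1, self.softmax.out_features),
            tokens[:, 1:].reshape(-1))
        # backward LM: predict t_k from t_{k+1} .. t_N (run over the reversed sequence)
        bwd_h, _ = self.bwd_lstm(x.flip(1)[:, :-1])
        bwd_loss = nn.functional.cross_entropy(
            self.softmax(bwd_h).reshape(-1, self.softmax.out_features),
            tokens.flip(1)[:, 1:].reshape(-1))
        return fwd_loss + bwd_loss

# Usage: one backward pass on a random token batch (batch 4, length 10, vocab 100 assumed)
model = BiLM(vocab_size=100, dim=32)
loss = model(torch.randint(0, 100, (4, 10)))
loss.backward()
```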