Deep Learning Notes | Word Embedding
Contents
Natural Language Processing: building statistical models for text
0. Sequence Data
Many data sources are sequential in nature and call for special treatment when building predictive models:
- Documents such as books, movie reviews, newspaper articles, and tweets
  - The sequence and relative positions of words in a document capture the narrative, theme, and tone.
  - Tasks: topic classification, sentiment analysis, and language translation
- Time Series of weather and financial information
  - Tasks: weather / market index prediction
- Recorded Speech and Sound Recordings
  - Tasks: text transcription of a speech, or music generation
A sentence can be represented as a sequence of L words, which may include slang or non-words and may contain spelling errors. The simplest and most common featurization is the bag-of-words model:
- score each text for the presence or absence of each of the words in a language dictionary
- e.g., given a language dictionary that contains the 10,000 most frequently occurring words, each document is represented as a binary vector of length 10,000 (see the sketch after this list)
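A minimal sketch of the bag-of-words featurization, using a tiny illustrative vocabulary in place of the 10,000-word dictionary (the dictionary contents and the example sentences are made up for illustration):

```python
# Toy "language dictionary": in practice, the 10,000 most frequent words in the corpus.
dictionary = ["the", "movie", "was", "great", "terrible", "plot", "acting"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def bag_of_words(text: str) -> list[int]:
    """Score a document for the presence (1) or absence (0) of each dictionary word."""
    vector = [0] * len(dictionary)
    for token in text.lower().split():
        if token in word_to_index:
            vector[word_to_index[token]] = 1
    return vector

print(bag_of_words("The movie was great"))            # [1, 1, 1, 1, 0, 0, 0]
print(bag_of_words("terrible acting but great plot")) # [0, 0, 0, 1, 1, 1, 1]
```

Note that the resulting vector discards word order entirely; that loss of sequence information is what motivates the richer representations below.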
1. Word Representation
The One-Hot Representation is simple
- however, it carries NO information about a word's relationship to other one-hot-encoded vectors
- Solution: create a Matrix of Features to describe the words
- Word Embeddings!
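A small sketch of the idea, assuming a 10,000-word dictionary and 300 features per word; the embedding matrix here is filled with random numbers, whereas in practice its rows are learned or taken from a pretrained model, and the word index used is purely hypothetical:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300    # dictionary size, number of features per word
rng = np.random.default_rng(0)

# Embedding matrix: one row of embed_dim features per word (random here, learned in practice).
E = rng.normal(size=(vocab_size, embed_dim))

word_index = 4281                      # hypothetical dictionary index of some word

# One-hot representation: all zeros except a 1 at the word's index.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by E simply selects one row of E ...
embedding_via_matmul = one_hot @ E
# ... so in practice the embedding is obtained by a direct row lookup.
embedding_via_lookup = E[word_index]

assert np.allclose(embedding_via_matmul, embedding_via_lookup)
print(embedding_via_lookup.shape)      # (300,)
```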
Two Pretrained Embeddings are widely used:
- word2vec
- GloVe
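A sketch of loading pretrained GloVe vectors from their plain-text format (one word per line followed by its vector components); it assumes a file such as glove.6B.100d.txt has already been downloaded locally, and uses cosine similarity to show that related words lie close together in the embedding space:

```python
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    """Parse a GloVe text file: each line is a word followed by its vector components."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumes the 100-dimensional GloVe file has been downloaded beforehand.
glove = load_glove("glove.6B.100d.txt")

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(glove["king"], glove["queen"]))
print(cosine_similarity(glove["king"], glove["banana"]))
```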