Deep Learning Notes | Word Embedding
Contents
Natural Language Processing: building statistical models for text
0. Sequence Data
Many data sources are sequential in nature and call for special treatment when building predictive models:
- Documents such as books, movie reviews, newspaper articles, and tweets
  - The sequence and relative positions of words in a document capture the narrative, theme, and tone.
  - Tasks: topic classification, sentiment analysis, and language translation
- Time Series of weather and financial information
  - Tasks: weather / market index prediction
- Recorded Speech and Sound Recordings
  - Tasks: text transcription of a speech, or music generation
A sentence can be represented as a sequence of L words, which may include slang or non-words and may contain spelling errors. The simplest and most common featurization is the bag-of-words model:
- score each text for the presence or absence of each of the words in a language dictionary
- e.g., given a language dictionary that contains the 10,000 most frequently occurring words, each document is represented as a binary vector of length 10,000 (see the sketch after this list)
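A minimal sketch of the bag-of-words featurization, using a tiny illustrative vocabulary in place of the 10,000-word dictionary (the dictionary contents and the example sentences are made up for illustration):

```python
# Toy "language dictionary": in practice, the 10,000 most frequent words in the corpus.
dictionary = ["the", "movie", "was", "great", "terrible", "plot", "acting"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def bag_of_words(text: str) -> list[int]:
    """Score a document for the presence (1) or absence (0) of each dictionary word."""
    vector = [0] * len(dictionary)
    for token in text.lower().split():
        if token in word_to_index:
            vector[word_to_index[token]] = 1
    return vector

print(bag_of_words("The movie was great"))            # [1, 1, 1, 1, 0, 0, 0]
print(bag_of_words("terrible acting but great plot")) # [0, 0, 0, 1, 1, 1, 1]
```

Note that the resulting vector discards word order entirely; that loss of sequence information is what motivates the richer representations below.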
1. Word Representation
The One-Hot Representation is simple
- however, it carries NO information about a word's relationship to other one-hot-encoded vectors
- Solution: create a Matrix of Features to describe the words
- Word Embeddings!
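A small sketch of the idea, assuming a 10,000-word dictionary and 300 features per word; the embedding matrix here is filled with random numbers, whereas in practice its rows are learned or taken from a pretrained model, and the word index used is purely hypothetical:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300    # dictionary size, number of features per word
rng = np.random.default_rng(0)

# Embedding matrix: one row of embed_dim features per word (random here, learned in practice).
E = rng.normal(size=(vocab_size, embed_dim))

word_index = 4281                      # hypothetical dictionary index of some word

# One-hot representation: all zeros except a 1 at the word's index.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by E simply selects one row of E ...
embedding_via_matmul = one_hot @ E
# ... so in practice the embedding is obtained by a direct row lookup.
embedding_via_lookup = E[word_index]

assert np.allclose(embedding_via_matmul, embedding_via_lookup)
print(embedding_via_lookup.shape)      # (300,)
```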
Two Pretrained Embeddings are widely used:
- word2vec
- GloVe
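A sketch of loading pretrained GloVe vectors from their plain-text format (one word per line followed by its vector components); it assumes a file such as glove.6B.100d.txt has already been downloaded locally, and uses cosine similarity to show that related words lie close together in the embedding space:

```python
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    """Parse a GloVe text file: each line is a word followed by its vector components."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumes the 100-dimensional GloVe file has been downloaded beforehand.
glove = load_glove("glove.6B.100d.txt")

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(glove["king"], glove["queen"]))
print(cosine_similarity(glove["king"], glove["banana"]))
```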