TensorFlow NLP cheat sheet
Some tips for TensorFlow and Keras in natural language processing.

Basic Implementation
The standard language model starts with an embedding layer. Its output then needs to be flattened (or pooled) down to a vector, after which we can add a dense layer before the output layer.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, embedding_dim, input_length=maxlen),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
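To train a model like this, a minimal sketch (assuming integer labels 0-4 and padded input sequences; padded_sequences and labels are placeholder names, not defined above):

# Integer labels + softmax output -> sparse categorical cross-entropy
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(padded_sequences, labels, epochs=10, validation_split=0.2)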
The Embedding layer creates a vector space for the text data. For example, the words "beautiful" and "ugly" may point in opposite directions, while words such as "cat" and "kitten" may sit close together in vector space.
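A rough sketch of inspecting the learned vectors after training (assuming the model above and a fitted Tokenizer as described in the Tokenizer section; variable names are illustrative):

# Weights of the Embedding layer: shape (num_words, embedding_dim)
embedding_weights = model.layers[0].get_weights()[0]

# Learned vector for a single word, looked up via the tokenizer's word index
cat_vector = embedding_weights[tokenizer.word_index['cat']]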
GlobalAveragePooling1D() can be replaced by Flatten().
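In the model above that means swapping the pooling line for:

    tf.keras.layers.Flatten(),

Note that Flatten() gives a vector of length maxlen * embedding_dim, whereas GlobalAveragePooling1D() gives one of length embedding_dim, so the following Dense layer has more weights with Flatten().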
The model above does not take the order of words into account. If we want to do this, we can insert an additional layer after the embedding layer, for example an LSTM layer as below.
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAXLEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
We can even insert a convolution layer after the embedding instead:
tf.keras.layers.Conv1D(128, 5, activation='relu')
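A full convolutional classifier along these lines might look like the sketch below (the pooling layer after the convolution and the layer sizes are illustrative choices, not taken from above):

model_conv = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAXLEN),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])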
For two consecutive RNN layers, set return_sequences=True on the first one so it passes a full sequence to the second:
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm1_dim, return_sequences=True)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm2_dim)),
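Put together, a stacked bidirectional LSTM model might look like this sketch (lstm1_dim and lstm2_dim are whatever sizes you pick):

model_stacked = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAXLEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm1_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm2_dim)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])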
Text data Tokenizer
- Create a Tokenizer instance
- Fit the tokenizer to the text data with tokenizer.fit_on_texts(text_data)
- Convert text to sequences with sequences = tokenizer.texts_to_sequences(text_data)
- For example, the following words may get the indices: apple->1, brain->2, cat->3, that->4, is->5
- A piece of text in the data can then be converted to a sequence of indices: “that cat apple is brain” -> (4, 3, 1, 5, 2)
- Get the word index with word_index = tokenizer.word_index
- Get the text back from sequences with text = tokenizer.sequences_to_texts(sequences)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>", num_words=10_000)
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)
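pad_sequences (imported above but not yet used) makes all sequences the same length before they go into an Embedding layer. A minimal sketch, where maxlen should match the model's input_length:

# Pad (or truncate) every sequence to exactly maxlen tokens
padded_sequences = pad_sequences(sequences, maxlen=maxlen,
                                 padding='post', truncating='post')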