TensorFlow NLP cheat sheet

Some tips for TensorFlow and Keras in Natural Language Processing

Author

Thomas H. Simm

Basic Implementation

The standard language model starts with an embedding layer. Its output then needs to be flattened (or pooled) into a vector, after which we can add a dense layer before the output layer.

The Embedding layer maps each word in the text data to a point in a vector space. So, for example, the words beautiful and ugly may point in opposite directions, while words such as cat and kitten may end up close together in vector space.
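As a rough sketch of what this looks like in isolation (the vocabulary size, dimension, and token ids below are made-up numbers), the layer looks up a dense vector for each integer token id:

import tensorflow as tf

# toy Embedding layer: 1,000-word vocabulary, 16-dimensional vectors
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)

token_ids = tf.constant([[3, 1, 5]])   # one sequence of three token ids
vectors = embedding(token_ids)         # shape (1, 3, 16): one vector per token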

In the model below, GlobalAveragePooling1D can be replaced by Flatten()

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, embedding_dim, input_length=maxlen),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
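The model still needs to be compiled and trained. A minimal sketch, assuming the inputs are padded sequences in padded_sequences and the labels are integer class ids in labels (both names are placeholders):

# 5 output classes with integer labels -> sparse categorical cross-entropy
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(padded_sequences, labels,
                    epochs=10, validation_split=0.2)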

The model above does not take account of the order of words.

If we want to do this, we can insert an additional layer after the embedding layer, for example a bidirectional LSTM as below.

model_lstm= tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE,EMBEDDING_DIM,input_length=MAXLEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24,activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES,activation='softmax')   
])

We can instead insert a convolution layer after the embedding:

tf.keras.layers.Conv1D(128,5,activation='relu')
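A Conv1D layer still outputs a sequence, so it is usually followed by a pooling layer before the dense layers. A sketch of a full model along those lines (layer sizes are just examples):

model_conv = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAXLEN),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),   # collapse the sequence to one vector
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])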

For two consecutive RNN layers, set return_sequences=True on the first so it passes the full sequence to the second:

tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm1_dim, return_sequences=True)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm2_dim)),
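Put together, a stacked version of the earlier LSTM model might look like this (a sketch; lstm1_dim and lstm2_dim are whatever sizes you choose):

model_stacked = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAXLEN),
    # the first LSTM returns the full sequence so the second has a sequence to read
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm1_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm2_dim)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])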

Text data Tokenizer

  • Create a Tokenizer instance
  • Fit tokenizer to text data with tokenizer.fit_on_texts(text_data)
  • Convert text to sequences with sequences = tokenizer.texts_to_sequences(text_data)
    • For example, the following words have the indices: apple->1, brain->2, cat->3, that->4, is->5
    • And a sequence of text within the data can be converted to a sequence: “that cat apple is brain” -> (4, 3, 1, 5, 2)
  • Get the word index with word_index = tokenizer.word_index
  • Get the text back from the sequences with text = tokenizer.sequences_to_texts(sequences)

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>", num_words=10_000)
tokenizer.fit_on_texts(text_data)

sequences = tokenizer.texts_to_sequences(text_data)
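The sequences come out with different lengths, so they are normally padded to a fixed length with pad_sequences (imported above) before going into the model. A short sketch, with the maxlen and padding options chosen as examples:

padded_sequences = pad_sequences(sequences, maxlen=120,
                                 padding='post', truncating='post')

word_index = tokenizer.word_index                    # dict mapping each word to its index
text_back = tokenizer.sequences_to_texts(sequences)  # map sequences back to text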