How to build an auto-complete model with TensorFlow.js

Yicong
5 min read · Jul 2, 2021
Demo link: https://ohyicong.github.io/portfolio/autocomplete_model.html

Auto-complete is a basic feature that is built into every phone and computer, where it learns from your typing habits and magically adds new words to its dictionary. Sometimes, it is so intuitive that we forget it even exists.

But have you wondered how it works? I did.

Out of curiosity, I did some research and realized that there isn’t a simple tutorial on implementing an auto-complete model with machine learning. So I decided to write this article to share and explain my process of building, training and deploying an auto-complete model.

To allow everyone to follow my tutorial, I have decided to use TensorFlow.js for the implementation. It is an amazing JavaScript library that allows you to develop your Machine Learning (ML) models in web browsers such as Edge, Chrome, and Firefox.

So anyone with a web browser and text editor can follow my tutorial.

Auto-Complete Intuition

In general, an ideal auto-complete function should predict the complete word from just the first few characters you type, based on the vocabulary it has learned.

Selecting Machine Learning Model

There are many ML models available but not all are suitable for our context. We have to select a model that is able to process and infer from a sequence of inputs to predict the most probable output.

With this in mind, I have chosen the Long Short-Term Memory (LSTM) model for this application. The LSTM model has a recurrent neural network architecture that allows it to draw inferences from inputs at time steps t-0 to t-n, where t represents the current time step and n represents the sequence length.

Simplified Overview of an LSTM Model

Creating Dataset

After selecting the model, we need to figure out what kind of data is suitable for training. Here are some questions to help with your thinking process:

  1. What are the features and labels?
  2. What are the data limitations?
  3. What kind of data format is required?
  4. What is the data shape?

Qn 1: What are the features and labels?

To decide on the training features and labels, put yourself into the shoes of an English teacher. Think about the ways you can teach English to a toddler.

One of the methods is “fill-in-the-blanks”, where some of the characters are obscured and the toddler is trained to write down the correct characters to form a word.

Similarly, this method can be applied to create the dataset to train our model.

For example, to train the model to recognize the word “apple”, we can obscure the characters in the word from left to right, as shown in the table below.

Dataset
+------------+------------+
| features   | labels     |
+------------+------------+
| a          | apple      |
| ap         | apple      |
| app        | apple      |
| appl       | apple      |
| apple      | apple      |
+------------+------------+
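
To make this concrete, here is a minimal sketch of how these feature/label pairs can be generated (the makePairs helper is an illustrative name, not necessarily how the linked source code does it):

// sketch: generate (feature, label) pairs for one word
function makePairs(word) {
  const pairs = [];
  for (let i = 1; i <= word.length; i++) {
    pairs.push({ feature: word.slice(0, i), label: word });
  }
  return pairs;
}

console.log(makePairs("apple"));
// [ { feature: "a", label: "apple" }, { feature: "ap", label: "apple" }, ... ]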

Qn 2: What are the data limitations?

There are two data limitations that you should take note of.

  1. Maximum word length (you decide this number; for this tutorial, I will be using 10)
  2. Alphabetic characters only (do not include any numbers or special characters in your dataset)

Since the maximum length is 10, we need to pad the remaining positions with zeros, as shown in the table below.

Dataset
+------------+------------+
| features   | labels     |
+------------+------------+
| a000000000 | apple00000 |
| ap00000000 | apple00000 |
| app0000000 | apple00000 |
| appl000000 | apple00000 |
| apple00000 | apple00000 |
+------------+------------+
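
Continuing the sketch, the zero-padding can be done with JavaScript's built-in padEnd (the pad helper is again an illustrative name):

// sketch: right-pad a word with "0" up to the maximum length of 10
function pad(word) {
  return word.padEnd(10, "0");
}

console.log(pad("appl"));  // "appl000000"
console.log(pad("apple")); // "apple00000"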

Qn 3: What kind of data format is required?

As computers only work with numbers, we need to find a way to convert the characters into numbers. One simple method is integer encoding, where padding=0, a=1, b=2, c=3 … z=26, which produces the output shown below.

apple00000 = 1,16,16,12,5,0,0,0,0,0
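
A minimal sketch of this integer encoding (the intEncode helper is an illustrative name):

// sketch: integer-encode a padded word, with "0" -> 0, "a" -> 1, ..., "z" -> 26
function intEncode(paddedWord) {
  return Array.from(paddedWord).map(c =>
    c === "0" ? 0 : c.charCodeAt(0) - "a".charCodeAt(0) + 1
  );
}

console.log(intEncode("apple00000")); // [1, 16, 16, 12, 5, 0, 0, 0, 0, 0]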

Even though we can use this dataset to train the model, it may not yield good results. This is because integer encoding imposes an artificial ordering on the characters: the model may assume that the character “z” has higher precedence than “a” simply because the integer representing “z” has a higher value.

A simple way to neutralize this precedence effect is one-hot encoding, where each character is represented by an array of zeros with a single one at the position of its index. See the one-hot encoding output below.

apple00000 = 
[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
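
With TensorFlow.js, the same matrix can be produced with tf.oneHot (this sketch reuses the hypothetical pad and intEncode helpers from above):

// sketch: one-hot encode the padded word into a (10, 27) matrix
const encoded = tf.oneHot(
  tf.tensor1d(intEncode(pad("apple")), "int32"), // [1, 16, 16, 12, 5, 0, 0, 0, 0, 0]
  27                                             // depth: 26 letters + 1 padding slot
);
encoded.print(); // prints the 10 x 27 matrix shown above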

Qn 4: What is the data shape?

With the zero-padding and one-hot encoding, the resultant shape for each word is (10, 27), i.e. (max word length, number of letters + 1 for the padding character).
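
The training code later in this article expects two tensors, train_features and train_label. One way to build them from the sketches above (the three-word vocabulary here is just a placeholder):

// sketch: stack every (feature, label) pair into training tensors of shape [N, 10, 27]
const words = ["apple", "angle", "amber"]; // placeholder vocabulary
const featureRows = [];
const labelRows = [];
for (const word of words) {
  for (const { feature, label } of makePairs(word)) {
    featureRows.push(intEncode(pad(feature)));
    labelRows.push(intEncode(pad(label)));
  }
}
const toOneHot = rows =>
  tf.stack(rows.map(r => tf.oneHot(tf.tensor1d(r, "int32"), 27).toFloat()));
const train_features = toOneHot(featureRows);
const train_label = toOneHot(labelRows);
console.log(train_features.shape); // [15, 10, 27] for this placeholder vocabulary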

Building Machine Learning Model

Now, we finally reach the exciting part where you are going to build the ML model for this application.

The model architecture consists of two layers:

  1. LSTM layer (input). It is used for drawing inference from sequential input.
  2. Softmax Dense layer (output). It is used to predict the one-hot encoded output characters.

After creating the two layers, you need to define the input and output shapes. This is a very important step as specifying a wrong number will result in a “shape mismatch” error.

As calculated before, the shape will be (10,27). Hence, we should define max_len = 10 and alpha_len = 27.

// code for building auto-complete model
const max_len = 10;   // maximum word length
const alpha_len = 27; // 26 letters + 1 padding character

var model = tf.sequential();

// LSTM input layer: reads the (10, 27) one-hot encoded sequence
model.add(tf.layers.lstm({
  units: alpha_len * 2,
  inputShape: [max_len, alpha_len],
  dropout: 0.2,
  recurrentDropout: 0.2,
  useBias: true,
  returnSequences: true,
  activation: "relu"
}));

// time-distributed softmax output layer: predicts a character
// distribution for each of the 10 positions
model.add(tf.layers.timeDistributed({
  layer: tf.layers.dense({
    units: alpha_len,
    activation: "softmax"
  })
}));
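
If you want to double-check these shapes before training, model.summary() prints the output shape of each layer (the values in the comments are what I would expect with max_len = 10 and alpha_len = 27):

// optional sanity check on the layer shapes
model.summary();
// expected: LSTM output [null, 10, 54], timeDistributed output [null, 10, 27]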

Training Machine Learning Model

In order to train the model, we need to decide on a few parameters:

  1. Optimizer. The Adam optimizer is used to train our model, as it handles sparse data well and has an adaptive learning rate that trains the model effectively.
  2. Loss Function. Categorical cross-entropy is used because the model predicts one-hot encoded outputs, which are categorical.
  3. Epoch. The default setting is 250; adjust it depending on the dataset size.
  4. Batch size. The default setting is 32; adjust it depending on the dataset size.
  5. Evaluation Metrics. Mean Squared Error (MSE) is tracked as a secondary metric alongside the cross-entropy loss.
// code to train the model
const epochs = 250;   // default mentioned above; adjust for your dataset size
const batchSize = 32; // default mentioned above; adjust for your dataset size

model.compile({
  optimizer: tf.train.adam(),
  loss: 'categoricalCrossentropy',
  metrics: ['mse']
});

// train_features and train_label are the one-hot encoded tensors built earlier
model.fit(train_features, train_label, {
  epochs,
  batchSize,
  shuffle: true,
  callbacks: tfvis.show.fitCallbacks(
    { name: 'Training' },
    ['loss', 'mse'],
    { height: 200, callbacks: ['onEpochEnd'] }
  )
});
Adam optimizer minimizing MSE over 250 iterations.

Demonstration

After training, you will be able to create your own user interface and test out the ML model!

Prediction capability after training the model
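
If you are building your own interface, a minimal inference helper could look like the sketch below (autocomplete is an illustrative name, and it reuses the hypothetical pad and intEncode helpers from earlier):

// sketch: predict the most likely completion for a partial input
async function autocomplete(prefix) {
  const input = tf.oneHot(tf.tensor1d(intEncode(pad(prefix)), "int32"), 27)
    .toFloat()
    .expandDims(0);                             // shape [1, 10, 27]
  const pred = model.predict(input);            // shape [1, 10, 27]
  const indices = await pred.argMax(-1).data(); // most probable index per position
  return Array.from(indices)
    .map(i => (i === 0 ? "" : String.fromCharCode(96 + i))) // 1 -> "a", ..., 26 -> "z"
    .join("");
}

autocomplete("app").then(word => console.log(word)); // ideally prints "apple"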

Resources

Demo link: https://ohyicong.github.io/portfolio/autocomplete_model.html

Source code: https://gist.github.com/ohyicong/b1e9dab5eec6371b404dbe603ac4685d
