We are living in the era of smartphones, and digital communication has changed the way we interact with our peers. We spend increasing amounts of time typing on our devices with friends and colleagues, and on social media in general. Depending on the size of the device, accurately typing what we want can be quite challenging, potentially wasting a lot of time.

Fortunately, machine learning and statistical data-based methods can be used to make our lives easier; in fact, there are a lot of text predictors already out there implemented within keyboard apps.

In this project, we build our own text prediction algorithm as a prototype for possible later implementation on smartphones and other personal gadgets. The key aspects we focus on for this project are:

  • Size in memory.
  • Execution time.
  • Performance.
  • Presentation.

Size in memory

Random Access Memory (RAM) is expensive and therefore limited in consumer electronics. Our program needs to use little system RAM so it can run smoothly alongside other important applications. Lowering the memory requirements also broadens our potential user base and lets the app start faster, since less memory has to be allocated and filled with data.

Execution time

It is critically important that the prediction time keeps up with the average typing speed, since we need to predict words as people are typing, not after they have completed the idea.

Performance

The program needs to make relevant and accurate predictions of the next word in order to be useful; otherwise people won’t use it.

Presentation

The way the prediction is presented and displayed on the screen can be very important for the program’s appeal and relevance.

Data and Corpora

In Natural Language Processing (NLP), the field that studies the interaction between computers and the way people use language, the collection of text documents used to train a prediction algorithm, or to draw any other insight about language from data, is commonly called a corpus (plural: corpora).

In our application, we use a corpus made of three text collections coming from tweets (http://twitter.com), blog entries and news articles from different sources, all of them from US English speakers. The corpus comes from the Heliohost Corpora, originally hosted at http://www.corpora.heliohost.org, and can be downloaded from that site.
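As a reference, here is a minimal sketch of how the three English corpora can be read into R. The file names (en_US.twitter.txt, en_US.blogs.txt, en_US.news.txt) and the skipNul workaround for embedded NUL characters are assumptions about this particular download.

# read each corpus line by line into a character vector (one entry per line)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)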

The original zipped file is 548 MB, although it contains corpora in two additional languages which we did not use. Each corpus has entries of different lengths, as we would expect; for example, Twitter has a 140-character limit, while blog entries need more space to convey their ideas. The average size per line for the twitter, blogs and news corpora, respectively, is shown below.

# average number of characters per line in each corpus
mean(nchar(twitter))
mean(nchar(blogs))
mean(nchar(news))

The total number of entries for each corpus is shown below.

# number of lines (entries) in each corpus
length(twitter)
length(blogs)
length(news)

As we show in the next section, the size of this corpus makes computation times unbearably high for the purposes of this prototype, so we need to work with a subset of it.

Exploratory Analysis

The way we are going to model the data in order to make a prediction is with the basic N-gram model. This representation of the structure of the data has several advantages:

  • It is very easy to interpret.
  • It has low computation times compared with other models.
  • It can be very accurate, depending on several factors and assumptions that we will cover in the next sections.
  • It scales well, in the sense that we can relax assumptions and make the model more complex if necessary.
  • Some versions of the N-gram model are probabilistic, so we can make inferences about the population, even for cases unseen in the training set.

An \(N\)-gram is a decomposition of the corpus into every combination of \(N\) consecutive words forming a phrase; for example, the phrase "thanks for the follow" contains the 2-grams "thanks for", "for the" and "the follow". In its basic form, we take an \((N-1)\)-gram and predict the last word based on the conditional probability derived from the original \(N\)-gram. The mathematical representation of the model is presented in the next sections.

Tokenization is the process of splitting the corpus into tokens, which can be phrases, words, characters and so on. Once the tokenization is done, we can compute our \(N\)-grams and their corresponding frequencies.
We found the ngram package [1] to be relatively fast at tokenizing and building the associated N-grams, since the relevant code is written in low-level C. However, converting the results into R data frames is still very time consuming, so we needed to use a sample of the corpus to build our algorithm. We sampled each corpus separately to keep each type of text represented, randomly sampling 30% of the lines. Here is an example of the sampling process.

set.seed(147)
# flag roughly 30% of the lines for the training set
twitter.intrain <- as.logical(rbinom(n = length(twitter), 
                                    size = 1, prob = .3))
twitter.train <- twitter[twitter.intrain]
twitter.test <- twitter[!twitter.intrain]

Computing the N-gram and frequency tables took slightly under one hour for the 1-grams and 2-grams, and more than one hour each for the 3-grams and 4-grams. We decided that, for the purposes of this prototype, these four N-gram orders would be enough to model the structure of the data and make good predictions.
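For illustration, here is a minimal sketch of how one of these frequency tables can be built with the ngram package; the preprocessing choices (lower-casing, removing punctuation) are assumptions and not necessarily the exact ones used for the final tables.

library(ngram)

# collapse the sampled lines into a single string and normalize it
train.text <- paste(twitter.train, collapse = " ")
train.text <- preprocess(train.text, case = "lower", remove.punct = TRUE)

# build the 2-grams and extract their frequency table
# (get.phrasetable returns columns ngrams, freq and prop)
ng2 <- ngram(train.text, n = 2)
freq2 <- get.phrasetable(ng2)
head(freq2)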

One important aspect of the corpus is that it contains profanity. We considered the Google Ngram Project profanity list (a version of that list is available at https://gist.github.com/jamiew/1112488) appropriate for our purposes; however, in the end we decided not to filter profanity in the prototype, since we found no compelling reason to do so.

Recall that one of our main goals for this project is to keep the memory footprint as small as possible. Predicting from the full set of N-grams and frequencies would be very inefficient: the tables are too large for smartphones, and N-grams with very low frequencies account for a large part of them. The next figure plots the 2-gram frequency distribution.

# 'count' holds the frequency of each 2-gram; plot the 100 most frequent
barplot(head(count[order(count, decreasing = TRUE)], 100), main = "2-gram Frequency Distribution")

The previous plot shows that we can save a lot of memory by pruning each N-gram table without losing much information, since the size in memory depends on the number of N-grams, while precision depends on their relative frequencies.

In fact, the Google Ngram Project team decided to prune the distribution, dropping N-grams with frequency lower than \(k=40\) [2]. We cannot reuse Google’s \(k=40\) because this threshold is determined by:

  1. The size of the corpus.
  2. The cumulative frequency they are willing to retain.

We chose \(k=5\) for the 2-gram and 3-gram tables and \(k=1\) for the 4-gram table, retaining a cumulative frequency of around 60% of the N-grams; we kept the complete 1-gram table since it does not take much memory.
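A minimal sketch of the pruning step, assuming each table is a data frame with a freq column as returned by get.phrasetable (the object names, and whether the threshold is inclusive, are assumptions):

# drop N-grams observed fewer than k times; the threshold trades memory for coverage
prune <- function(freq.table, k) {
  freq.table[freq.table$freq >= k, ]
}

freq2.pruned <- prune(freq2, k = 5)   # 2-gram table
freq3.pruned <- prune(freq3, k = 5)   # 3-gram table
freq4.pruned <- prune(freq4, k = 1)   # 4-gram table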

Mathematics and Handling of Unseen Cases

To describe the mathematics behind our model we need some notation commonly used in NLP. We are usually interested in the probability of a word given the chain of words that precedes it, \(P(w_n|w_1, w_2, w_3, \ldots, w_{n-1})\), that is, the probability of observing \(w_n\) given \(w_1, w_2, \ldots, w_{n-1}\). We write \(w_1^{n-1}\) to denote a chain of words, not an exponential. By the Chain Rule of Probability, the joint probability of the whole chain factors into these conditional probabilities:
\(P(w_1, w_2, \ldots, w_n) = P(w_1)P(w_2|w_1)P(w_3|w_1, w_2)\cdots P(w_n|w_1^{n-1}) = \prod_{k=1}^{n} P(w_k|w_1^{k-1})\).

In practice it is computationally intensive to evaluate this formula for very long phrases, but we can approximate the conditional probability with a Markov assumption: only the current state (the most recent words) matters for the next word, not the way we arrived there. This is a good approximation for NLP models because usually only a few preceding words provide the context needed for the next word, not a very long chain of words.

For the 2-gram, or bigram, model we can write this Markov assumption as
\(P(w_n|w_1^{n-1})\approx P(w_n|w_{n-1})\)

The Maximum Likelihood Estimator (MLE) of this conditional probability can be constructed from frequencies in the training set [3]. For the bigram model, for example, the MLE is
\(\hat{P}(w_n|w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}\)
where \(C\) is the observed frequency in the training set and the denominator counts all the bigrams that begin with \(w_{n-1}\).
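For concreteness, here is a small sketch of how this MLE can be computed in R from a bigram frequency table; the toy data frame and its column names are purely illustrative.

# toy bigram counts: first word, second word, observed frequency
bigrams <- data.frame(
  w1   = c("thanks", "thanks", "for"),
  w2   = c("for", "to", "the"),
  freq = c(50, 10, 120)
)
# total count of bigrams sharing the same first word (the denominator)
totals <- ave(bigrams$freq, bigrams$w1, FUN = sum)
# MLE of P(w2 | w1)
bigrams$mle <- bigrams$freq / totals
bigrams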

To handle unseen cases we use a simple backoff without interpolation. We start with the \(4\)-gram model; if the three-word prefix used to predict the last word is unseen in the training data, we back off to the next lower-order model, and so on down to the \(1\)-gram model.
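Below is a minimal sketch of this backoff logic, assuming pruned frequency tables freq1 through freq4 with an ngrams column of space-separated words, as returned by get.phrasetable (hypothetical object names; the deployed code may differ).

predict.next <- function(words, freq1, freq2, freq3, freq4, n.best = 5) {
  tables <- list(freq2, freq3, freq4)   # table to query, indexed by prefix length
  for (p in 3:1) {
    if (length(words) < p) next
    prefix  <- paste(tail(words, p), collapse = " ")
    tbl     <- tables[[p]]
    matches <- tbl[startsWith(tbl$ngrams, paste0(prefix, " ")), ]
    if (nrow(matches) > 0) {
      # the last word of each matching N-gram, ranked by frequency, is a candidate
      matches <- matches[order(matches$freq, decreasing = TRUE), ]
      cand <- sapply(strsplit(trimws(matches$ngrams), " "), tail, 1)
      return(head(unique(cand), n.best))
    }
  }
  # unigram fallback: the most frequent words overall
  head(trimws(freq1$ngrams[order(freq1$freq, decreasing = TRUE)]), n.best)
}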

The Stupid Backoff [2] introduced by the Google team uses this kind of non-probabilistic backoff, but they use interpolation, computing scores from all the \(n\)-grams, from the \(5\)-gram down to the unigram, at every step of the way. We believe that, for the purposes of this prototype, the simple backoff model we implemented is good enough.

Implementation

The app is deployed on RStudio’s Shiny Apps server and showcases the text prediction by displaying the five best predictions. We decided not only to predict the next word from the \(n\)-grams but also to complete the word currently being typed using the unigram table.
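A small sketch of how the word being typed can be completed from the unigram table, reusing the hypothetical freq1 table from the previous section:

# suggest completions for a partially typed word by prefix matching
# against the unigram table, ordered by frequency
complete.word <- function(partial, freq1, n.best = 5) {
  hits <- freq1[startsWith(trimws(freq1$ngrams), partial), ]
  head(trimws(hits$ngrams[order(hits$freq, decreasing = TRUE)]), n.best)
}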

We present the predictions by making the size of each word proportional to its likelihood, using the wordcloud2 JavaScript library implementation in R [4].
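For example, the predictions can be passed to wordcloud2 as a two-column data frame of words and weights (a sketch with made-up scores; the deployed app may style it differently):

library(wordcloud2)

# word sizes are driven by the freq column, here the prediction scores
predictions <- data.frame(word = c("the", "a", "to", "you", "it"),
                          freq = c(0.35, 0.25, 0.20, 0.12, 0.08))
wordcloud2(predictions)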

The prediction algorithm runs acceptably fast, with runtimes of hundredths of a second, satisfying our speed goal.

Next steps

The next steps consist of using the whole corpus to build the \(n\)-grams and perhaps extending to the \(5\)-gram case if it adds meaningful accuracy. These additions increase the computation time considerably for single-threaded processes, so parallel processing would be needed, perhaps together with hashing to store word chains efficiently. Using a low-level programming language like C would also reduce processing times considerably.

Another possible addition would be a profanity filter and case-sensitivity options for the algorithm.

References

1: Schmidt D and Heckendorf C (2017). “ngram: Fast n-Gram Tokenization.” R package version 3.0.2, https://cran.r-project.org/package=ngram and Schmidt D and Heckendorf C (2017). Guide to the ngram Package: Fast n-gram Tokenization. R Vignette, https://cran.r-project.org/package=ngram.

2: Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean. “Large Language Models in Machine Translation.” http://www.aclweb.org/anthology/D07-1090.pdf.

3: Daniel Jurafsky and James H. Martin. Speech and Language Processing. Draft of September 1, 2014. Chapter 4. https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf

4: Dawei Lang. “wordcloud2: Create Word Cloud by htmlWidget”. R package version 0.2.0. https://github.com/lchiffon/wordcloud2.