Text Classification with Python & NLTK

Machine learning frameworks such as TensorFlow and Keras are currently all the rage, and you can find several tutorials demonstrating the use of CNNs (Convolutional Neural Networks) to classify text. Often this is overkill, and in this post we are going to show you how to classify text using Python's NLTK library. NLTK (the Natural Language Toolkit) provides Python users with a number of different tools for dealing with text content, including some basic classification capabilities.

Input Data

In this example, I'm using a set of 10,000 tweets which have been classified as either positive or negative. Our classifier is going to take input in CSV format, with the left column containing the tweet and the right column containing the label. An example of the data can be found below:
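(The rows below are invented to illustrate the expected format; the exact wording of the labels in the real dataset may differ.)

```
staying up all night to watch the meteor shower,positive
my train is delayed again. terrible start to the day,negative
```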

Using your own data is very simple: the left column should contain your text document, while the column on the right should contain the correct label. This allows our classifier to classify a wide range of documents with labels of your choosing. The data used for this example can be downloaded here.

The Bag of Words Approach

We are going to use a bag of words approach. Simply put, we take a certain number of the most common words found throughout our dataset, and then for each document we check whether the document contains each of these words. The bag of words approach is conceptually simple and doesn't require us to pad documents to ensure that every document in our sample set is the same length. However, it tends to be less accurate than a word embedding approach: by simply checking whether a document contains a certain set of words, we miss out on a lot of valuable information, including the position of the words within a given document. Despite this, we can easily train a classifier which achieves around 80% accuracy.
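As a toy illustration of the idea (the three-word vocabulary and the tweet here are invented for the example), each document is reduced to a dictionary recording which vocabulary words it contains:

```python
vocabulary = ["good", "bad", "day"]   # stand-in for our "most common words"
tweet = "what a good day".split()     # a tokenized document
features = {word: (word in tweet) for word in vocabulary}
print(features)  # {'good': True, 'bad': False, 'day': True}
```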

Initialising our class and reading our CSV file

Our CSV classifier is going to take several arguments. Firstly, we pass the name of our CSV file. Then, as optional parameters, we pass a featureset_size and a test ratio. By default, our classifier will use the 1,000 most common words found in our dataset to create our feature set, and will test its accuracy against 10% of the items contained in the dataset. We then initialise a few variables which will be used later by our classifier.
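A minimal sketch of how such a class might be initialised (the class and attribute names here are illustrative, not the author's exact code):

```python
import random
import re
import pickle

import nltk


class CSVClassifier:
    def __init__(self, csv_file, featureset_size=1000, test_ratio=0.1):
        self.csv_file = csv_file
        self.featureset_size = featureset_size  # how many common words to keep
        self.test_ratio = test_ratio            # fraction of data held out for testing
        self.documents = []       # one (word_list, label) tuple per document
        self.all_words = []       # every word seen, used for frequency counts
        self.word_features = []   # the bag-of-words vocabulary
        self.classifier = None    # set once the model has been trained
```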

We then come on to reading our CSV file. We simply iterate through each line before splitting it at the final comma. The text after the last comma is the document's label, while everything to the left is the document itself. By applying a regex to the document, we produce a list of the words it contains. In the example, I used a very simple regex to pull out the words, but it is possible to replace this with a more complex tokenizer. We append each word in the document to a list of all words seen, which will allow us to determine how frequently words occur across the dataset. We also place the list of words found in the document into the variable where we store all the documents found in our dataset.
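A sketch of the reading step, continuing the class above (the simple `[a-zA-Z]+` regex is one plausible choice for pulling out words):

```python
    def load_csv(self):
        with open(self.csv_file, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Everything after the last comma is the label;
                # everything before it is the document itself.
                document, _, label = line.rpartition(",")
                # A deliberately simple regex tokenizer; a more complex
                # tokenizer could be swapped in here.
                words = re.findall(r"[a-zA-Z]+", document.lower())
                self.all_words.extend(words)
                self.documents.append((words, label))
```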

Extracting Word Features

We then write the functions that handle generating our feature set. Here, we use NLTK's FreqDist class to store the frequency with which different words were found throughout the dataset. We iterate through all of the words in each document, creating a new record if we have not seen the word before and incrementing its count if we have. We then limit our bag of words to the feature set size we passed when we initialised the class.
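Continuing the sketch, FreqDist handles the counting for us, and its most_common method trims the vocabulary to the requested size:

```python
    def build_word_features(self):
        # FreqDist counts how often each word occurs across the dataset.
        freq_dist = nltk.FreqDist(self.all_words)
        # Keep only the featureset_size most common words as our vocabulary.
        self.word_features = [word for word, _count in
                              freq_dist.most_common(self.featureset_size)]
```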

Now that we have a list of the most frequently found words, we can write a function to generate features for each of the documents in our dataset. As we are using a bag of words approach, we are only interested in whether the document contains each of the 1,000 most frequent words. If the document contains the word we record True, otherwise we record False. Eventually, we get a dictionary of 1,000 features which will be used to train the classifier.
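A sketch of the feature extraction function:

```python
    def document_features(self, words):
        # For each vocabulary word, record whether the document contains it.
        word_set = set(words)
        return {word: (word in word_set) for word in self.word_features}
```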

Training

We start by shuffling the documents. Some algorithms and classifiers can be sensitive to the order of the data, which makes it important to shuffle our data before training. We then use our feature set function within a list comprehension, which returns a list of tuples containing each feature set dictionary and the document's label. We then calculate where to split our data into training and test sets; the test set allows us to check how the classifier performs against unseen data. We can then pass our training set to NLTK's Naive Bayes classifier. The actual training may take some time, and will take longer the larger the dataset used. Finally, we check the classifier's accuracy against both the training and test sets. In all likelihood the classifier will perform significantly better against the training set.
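Putting the training step together might look like this (again a sketch rather than the author's exact code):

```python
    def train(self):
        # Shuffle so the train/test split is not biased by the file's ordering.
        random.shuffle(self.documents)
        featuresets = [(self.document_features(words), label)
                       for words, label in self.documents]
        # Hold back test_ratio of the data to measure accuracy on unseen tweets.
        cutoff = int(len(featuresets) * (1 - self.test_ratio))
        train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
        print("Train accuracy:", nltk.classify.accuracy(self.classifier, train_set))
        print("Test accuracy: ", nltk.classify.accuracy(self.classifier, test_set))
```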

Classifying New Documents

Once we have trained a classifier, we can write a function to classify new documents. If we have not already loaded our CSV file and generated the word features, we will have to do this before classifying the new document. We then simply generate a new set of features for this document and call the classifier's classify method with that feature set. The function returns the predicted label as a string.
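A sketch of the classification helper, reusing the same tokenizing regex as before:

```python
    def classify_document(self, text):
        # Load the data and build the vocabulary if we haven't already.
        if not self.word_features:
            self.load_csv()
            self.build_word_features()
        words = re.findall(r"[a-zA-Z]+", text.lower())
        return self.classifier.classify(self.document_features(words))
```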

Saving and Loading the Model

Rather than training the model every time we want to classify a sentence, it makes sense to save the model. We can write two simple functions to allow us to reuse our model whenever we want. The save function simply saves our classifier and feature word objects to files, which can then be reloaded by our load model function.
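One straightforward way to do this is with pickle (the default file path below is just a placeholder):

```python
    def save_model(self, path="model.pickle"):
        # Persist the classifier together with the vocabulary it depends on.
        with open(path, "wb") as f:
            pickle.dump((self.classifier, self.word_features), f)

    def load_model(self, path="model.pickle"):
        with open(path, "rb") as f:
            self.classifier, self.word_features = pickle.load(f)
```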

Accuracy

Algorithm                        Train     Test
Naive Bayes Classifier (NLTK)    84.09%    72.89%
BernoulliNB (Sklearn)            83.93%    79.78%
MultinomialNB (Sklearn)          84.58%    74.67%
LogisticRegression (Sklearn)     89.05%    75.33%
SGDClassifier (Sklearn)          81.23%    69.32%

The algorithm performs relatively well against our example data, correctly classifying whether a tweet is positive or negative around 72% of the time. NLTK gives its users the option to replace the standard Naive Bayes classifier with a number of other classifiers found in the scikit-learn package. I ran the same test swapping in these classifiers for the Naive Bayes classifier, and a number of them significantly outperformed it. As you can see, the BernoulliNB model performed particularly well, correctly classifying documents around 80% of the time.

The accuracy of the classifier could be further improved by using an ensemble classifier. To build one, we would simply train several models using different classifiers and then classify new documents against all of them, selecting the answer given by the majority of our classifiers (a hard voting classifier). Such a classifier would likely outperform any one of the above classifiers on its own. The full code below provides a function that allows you to try out other Sklearn classifiers.
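Swapping in a scikit-learn model is a matter of wrapping it in NLTK's SklearnClassifier; a sketch, assuming train_set is the list of (features, label) tuples built in the training step:

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

# Any scikit-learn estimator can be wrapped and trained on NLTK feature sets.
sk_classifier = SklearnClassifier(BernoulliNB()).train(train_set)
```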

Example Usage

The class is pretty easy to use. The code below outlines all of the steps required to train a classifier and classify an unseen sentence. More usage examples and the full code can be found on Github here.
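A sketch of what usage might look like with the class outlined above ("tweets.csv" and the example sentence are placeholders):

```python
classifier = CSVClassifier("tweets.csv")
classifier.load_csv()
classifier.build_word_features()
classifier.train()
print(classifier.classify_document("I had a really great day!"))
classifier.save_model()
```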

Full Code

The full classifier, combining all of the snippets above, can be found in the Github repository linked in the previous section.
