Classification of movie review sentiment using ULMFiT
We test how well ULMFiT manages to predict the sentiment of Norwegian-language movie reviews using a fine-tuned language model.
- Load data and language model
- Setup dataloader
- Creating our model
- Looking inside our model
- Training the model
from fastai2.text.all import *
In this post we'll try to predict the sentiment of movie reviews from a Norwegian-language dataset explored in an earlier post. It's taken a couple of steps to get here:
- first we explored the underlying AWD-LSTM architecture
- then we fine-tuned a pretrained Norwegian language model
- and tried to interpret what the model learns with dimensionality reduction techniques.
In this post we want to do the actual classification, the final step of the ULMFiT method, using the fastai2 deep learning library.
First we'll grab the dataframe with reviews and labels from this post (available directly from github), the vocabulary from the language model, and the encoder from the fine-tuned language model (see this post).
df = pd.read_csv('https://raw.githubusercontent.com/hallvagi/dl-explorer/master/uploads/norec.csv')
df.head(3)
We are mainly interested in the text, split and sentiment columns. Let's do a quick check for missing values:
df.info()
It seems there are a few missing values in our text columns, so let's get rid of them and reset the index:
df = df.dropna()
df = df.reset_index(drop=True)
We'll load the language model vocabulary lm_itos.pkl that we made in a previous post:
path = Path('~/.fastai/data/norec/')
Path.BASE_PATH = path
(path/'models').ls()
with open(path/'models/lm_itos.pkl', 'rb') as f:
itos = pickle.load(f)
itos[:10], itos[-5:], len(itos)
We'll load the encoder, finetuned_encoder.pth, at a later stage when our classifier is set up.
Before we can make a model we need a dataloader. The dataloader is responsible for keeping track of the data, labels and dataset splits, among other things, and will feed batches to our model during training. We will try the data block API instead of a simpler factory method. The data block represents the mid-level of the fastai API and is a flexible way to set up our data loader. For the trickiest datasets, one might have to drop down to the lowest level of the API.
A data block is basically an assembly line that we will send our data through. That means we set up a series of functions that will take our dataframe and extract the relevant information the data loader needs:
- define which types (blocks) our independent and dependent variables are: in our case text and category.
- create a tokenizer and a vocabulary. We need to turn our text into a numerical representation according to some vocabulary.
- define how to get the text: in our case we read it from the dataframe column 'text'.
- define how to get the labels: in our case we read them from the dataframe column 'sentiment'.
- split the data into train, validation and test sets. In our case we locate the row indexes according to the value of the 'split' column: 'train', 'dev' or 'test'.
First let's set up a basic function that splits the data according to which split each row belongs to. The data block expects a set of indexes for the training, validation and test datasets:
def split(df):
    # return the row indexes belonging to the train, validation and test splits
    train_idx = df.loc[df['split'] == 'train'].index.to_list()
    valid_idx = df.loc[df['split'] == 'dev'].index.to_list()
    test_idx = df.loc[df['split'] == 'test'].index.to_list()
    return L(train_idx), L(valid_idx), L(test_idx)
If we pass our df through this function it simply returns the indexes of the various splits:
split(df)
We can use the built-in class 'ColReader' to read the actual texts and labels. The ColReader class has a __call__ method that reads a particular column from the dataframe it is passed:
reader = ColReader(cols='sentiment')
reader(df)[:3]
We now have all we need to set up our data block:
reviews = DataBlock(blocks=(TextBlock.from_df(text_cols='text', vocab=itos, seq_len=72), CategoryBlock),
                    get_x=ColReader('text'),
                    get_y=ColReader('sentiment'),
                    splitter=split)
A great way of debugging your data block is the summary(df) method. It takes you through all the steps in the pipeline. The output is very long though, so I've left it out of this notebook.
#reviews.summary(df)
Finally let's create the actual data loader from our source dataframe:
dls = reviews.dataloaders(df, bs=64)
dls.show_batch(max_n=4)
Note that our data block reviews is specific to the source we plan to use it with. If we instead pass the underlying numpy values of the dataframe to it, the reader and splitter functions won't work:
#reviews.dataloaders(df.values)
What does our data actually look like? Grabbing a batch and inspecting it is often a useful way to understand what the data looks like from the model's point of view:
xb, yb = dls.one_batch()
xb.shape, yb.shape
The independent variable xb represents the numericalized text of the reviews. We have a batch size of 64, and the longest text in this particular batch has a length of 2990 tokens - but this will vary from batch to batch. We can get the vocab from our data loader, dls, to check what the numbers represent. We have two vocabs, one for the text and one for the labels of our data:
tokens, classes = dls.vocab
tokens[:5], classes
Let's first grab a few numericalized tokens from our batch:
nums = xb[0][10:15].cpu()
nums
We can use the fastai L-class to look up those indexes in the vocab. Remember that token 2631 is simply the token at index 2631 in our vocab:
tokens[2631]
L(tokens)[nums]
Our dependent variable is as expected, either positive or negative. Once again we can use the vocab to get the actual classes. 1 is positive and 0 is negative:
yb[:5].cpu(), L(classes)[yb[:5]]
The longest text in the batch decides the shape of the entire batch. What happens with the shorter texts? They are padded with our padding token, in our case token id 1:
tokens[1]
If we look at the final text in the batch we see that it's padded on both sides:
xb[-1].cpu(), len(xb[-1])
We can count the padding of each text in the batch:
(xb==1).sum(dim=1).cpu()
We see the longest review is placed first, and the batch has progressively shorter texts.
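We can double-check this claim by counting the non-padding tokens per review (a quick verification of my own, not part of the original analysis):
lengths = (xb != 1).sum(dim=1).cpu()  # number of real (non-padding) tokens per review
(lengths[:-1] >= lengths[1:]).all()   # True if the batch is ordered from longest to shortest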
Setting up the model is pretty straightforward, but since our language model had a hidden size of 1150, we have to change the default in the config dictionary. Note also that we pass pretrained=False. We'll load our pretrained encoder manually instead.
awd_lstm_clas_config['n_hid'] = 1150
learn = text_classifier_learner(dls, arch=AWD_LSTM, metrics=accuracy,
                                config=awd_lstm_clas_config, pretrained=False).to_fp16()
learn.load_encoder(path/'models/finetuned_encoder')
When looking at the model we can see that the first layers of the model are frozen, i.e. the trainable column says False. It's only the randomly initialized linear classification layers that are trainable. This makes sense. We first want to calibrate these layers as much as possible before we proceed to fine tune the language model. We'll come back to freezing and unfreezing of the model later in the post.
learn.summary()
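Another way to see the same thing (my own quick check, not from the original post) is to count how many parameters currently require gradients:
# parameters the optimizer will update vs. parameters that are currently frozen
n_trainable = sum(p.numel() for p in learn.model.parameters() if p.requires_grad)
n_frozen = sum(p.numel() for p in learn.model.parameters() if not p.requires_grad)
n_trainable, n_frozen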
We also recognize most of the other dimensions in the architecture:
- batch size of 64
- embedding size of 400
- hidden size of LSTMs is 1150
- 1200 and 50 output channels from the linear classifier layers
- but why is seq_len 38? Didn't we set it to 72 for the data loader? We'll investigate this later in the post.
learn.model
Before we do the actual classification, let's have a look at what the model actually does at each step.
In short, the SentenceEncoder takes text and outputs a vector representation of it. Let's split the model into the SentenceEncoder and the classifier and inspect both:
enc, lin = learn.model.children()
enc
enc.summary(xb)
Here we see a seq_len of 38 again. This is an artifact of the model summary, which is specific to the batch. The LSTM processes a batch in chunks of 72 - the actual seq_len. The summary shows the final chunk, where there simply happen to be 38 tokens left:
2990%72
What comes out of the encoder if we pass it a batch of data?
enc_x = enc(xb)
[e.shape for e in enc_x]
The SentenceEncoder outputs two things, and if we check the source code with SentenceEncoder??, we find that it returns: return outs,mask.
The first output is our encoded text. The shape of the output is (bs, len, embedding size). But where does the length of 1406 come from? This is also specific to the batch we pass in. The get_text_classifier function is called with a default max_len of 72*20 = 1440, so the SentenceEncoder encodes the final tokens of each text up to a maximum length of 1440, while making sure to include the last chunk of the batch:
SentenceEncoder??
Our batch has a sequence length of 2990, much longer than the maximum. So we fill up 19 full chunks of seq_len 72 and then add the remaining 38 tokens of the final chunk as the last part of the encoded sentence:
2990%72
So 19 full chunks and the remainder gives us:
72*19 + 38
The mask output is a padding mask that tells us where the padding tokens are in the processed text:
enc_x[1].sum(1).cpu()
If we take the final 1406 tokens of our batch we get the same sum of padding tokens:
(xb[:, -1406:]==1).sum(1).cpu()
The classifier takes the encoded sentence and produces a binary prediction:
lin.summary(enc_x)
If we pass the encoded batch through the model, we get a binary output for each item in our batch:
lin(enc_x)[0].shape
It's not entirely clear how we go from the 1406-token input to the first batchnorm layer that expects a 1200-dimensional input. But if we check the source code of the PoolingLinearClassifier?? module, it has a masked_concat_pool in its forward method which changes the dimensionality of the input.
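As a rough sketch of the idea (my own simplification, not the fastai source): the pooling concatenates the hidden state at the last real token with a max pool and an average pool over the non-padded positions, so the 400-dimensional encoder output turns into 3 * 400 = 1200 features:
# simplified masked concat pooling: [last hidden state, max pool, avg pool]
# (illustration only - fastai's masked_concat_pool offsets into the final chunk to find the last real token)
out, mask = enc_x                                        # out: (bs, len, 400), mask is True where padding
avg_pool = out.masked_fill(mask[:, :, None], 0).sum(dim=1) / (~mask).sum(dim=1, keepdim=True)
max_pool = out.masked_fill(mask[:, :, None], -float('inf')).max(dim=1)[0]
last = out[:, -1]                                        # approximation of the last real hidden state
torch.cat([last, max_pool, avg_pool], dim=1).shape       # -> (64, 1200)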
Let's pass the batch we grabbed above through the entire model:
preds = learn.model(xb)
[p.shape for p in preds]
Our model produces three things: the binary predictions for each item in the batch, plus two copies of the encoded sentences. Note that the two encoded batches of sentences are identical:
(preds[1]!=preds[2]).sum().item()
Since the model is untrained at this point, the prediction should be random. If we take the softmax of the final dimension we see that the model is mostly 50-50 on every prediction - so just a random guess at this point!
preds[0].softmax(-1)[:5].cpu()
Let's also take a quick glance at the various weights of the model. If we plot one of the weight matrices of the classifier head, it's clear that it has been randomly and uniformly initialized, and thus not trained yet:
plt.hist(to_np(lin.state_dict()['layers.0.2.weight']).flatten());
If we instead check one of the encoder weight matrices, we get a completely different pattern. Most weights are centered around 0:
plt.hist(to_np(enc.state_dict()['module.rnns.1.module.weight_ih_l0']).flatten(), bins=50);
Fastai operates with a concept of frozen parameter groups. That is, the model is split into groups which can be made trainable or not. We can check the status of the model's parameter groups:
len(learn.opt.param_groups), learn.opt.frozen_idx
That is a total of 5 parameter groups, and currently the first 4 are frozen. This corresponds to what we saw in the model summary - only the top classifier layer was trainable. During training we can unfreeze parts of the model:
learn.freeze_to(-3), learn.opt.frozen_idx
Let's reset the model before we train it:
learn.freeze_to(-1), learn.opt.frozen_idx
The parameter groups aren't named, but if we check the shape of the weights from each group we recognize the various layers:
- The embedding layer
- LSTM 1
- LSTM 2
- LSTM 3
- Linear classifier
[group.get('params')[0].shape for group in learn.opt.param_groups]
It's high time to actually train the model. We'll stick with the standard fastai procedure for fine-tuning a sentiment classifier:
- lr_find() to find a sensible learning rate
- train with the one cycle policy
- gradual unfreezing of the layers - train the top layers first
- discriminative learning rates, that is, train the lower layers with a smaller learning rate than the top layers.
This procedure is well known from the fastai course and documentation. It's also a very robust method in my experience; it seems to just work for most datasets.
learn.lr_find()
This pattern is typical for untrained networks: there is a part of the graph which shows substantially lower loss than the rest. This makes sense, since changing random weights should give us substantial improvements pretty fast. We'll start with a learning rate of 1e-2 and train for a few epochs. Note that we are only training the top layers in this step:
lr = 1e-2
learn.fit_one_cycle(3, lr)
We're already at nearly 78% accuracy, and we haven't touched the actual language model yet! Let's unfreeze the next parameter group and run lr_find one more time:
learn.freeze_to(-2)
learn.lr_find()
This pattern is what you would expect from a model that has trained for a while. There are no random weights left, and further progress is much harder to find. We'll set a lower base learning rate for the remainder of the training and slice it with a scaling factor. This means that the lowest parameter groups get a smaller learning rate and the top layers a larger one. When we slice we'll use the magic scaling factor of 2.6^4 ≈ 45. The source of this scaling factor is fastai's Jeremy Howard, who found it to work well empirically. I've rounded it to 50 for the sake of simplicity.
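A quick check of that number, in the spirit of the small arithmetic cells above:
2.6**4  # roughly 45.7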
lr = 1e-3
scaling = 50
learn.fit_one_cycle(1, slice(lr/scaling, lr))
We continue to unfreeze the next layer, and train a bit more:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(lr/scaling,lr))
Our model is gradually improving. Finally we unfreeze the entire model and train a final few epochs:
learn.unfreeze()
learn.fit_one_cycle(5, slice(lr/scaling,lr))
We end up with around 83% validation accuracy in our best epoch. This is much better than the 50-50 we would have gotten from a random guess:
df['sentiment'].value_counts(normalize=True)
But is our model really a good one? This is hard to say, since I haven't seen any results for this particular dataset, and we don't have any other models as baselines to compare with. The English IMDb dataset gets over 95% accuracy with ULMFiT, but that task likely has access to a better language model and roughly 10x more data. On the other hand, the reviews in our dataset are written by relatively few journalists, and we would maybe expect their writing style to be more consistent. Finally, it should also be noted that our sentiment classes are probably closer to each other than in the IMDb example: we labeled ratings of 1-3 out of 6 as negative and 5-6 as positive. The IMDb reviews are more polarized, so that task is perhaps a bit easier.
Note that we also have a test set for this model, which is the dataset we should actually score our model on.
We already passed the test dataset to our dataloaders, so we can access it directly. The test set has index 2 (0 is train and 1 is validation). We can simply validate on it to get the score, since we're not interested in the particular predictions at this point.
learn.validate(ds_idx=2)
The test set accuracy is the same as the validation set accuracy. This is what we hoped for: if there were larger differences, we would have to go back to our model to see if we did something wrong, or check whether the test set was sampled from a different distribution.
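If we also want the raw predictions, a quick cross-check (my own addition, not in the original notebook) is to grab them with get_preds and recompute the accuracy manually:
# predicted probabilities and targets for the test split (ds_idx=2), then accuracy by hand
preds, targs = learn.get_preds(ds_idx=2)
accuracy(preds, targs)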
A final important point is to investigate the predictions of the model. This could increase our understanding of what the model actually learns, uncover data leaks, and in general increase our confidence in the model. We'll come back to model interpretation in a later post!