from fastai2.text.all import *

Load data and language model

In this post we'll try to predict the sentiment of movie reviews from a Norwegian language dataset explored in a previous post. It's taken a couple of steps to get here:

  • first we explored the underlying AWD-LSTM architecture
  • then we fine-tuned a pretrained Norwegian language model
  • and finally tried to interpret what the model learns with dimensionality reduction techniques.

In this post we want to do the actual classification, the final step of the ULMFiT method, using the fastai2 deep learning library.

First we'll grab the dataframe with reviews and labels from this post (available directly from github), the vocabulary from the language model, and the encoder from the fine-tuned language model (see this post).

df = pd.read_csv('https://raw.githubusercontent.com/hallvagi/dl-explorer/master/uploads/norec.csv')
df.head(3)
filename rating title split sentiment text
0 html/train/000000.html 6 Rome S02 train positive Den andre og siste sesongen av Rome er ute på DVD i Norge. Om du så sesong 1, vet du at du har noe stort i vente. Har du aldri sett Rome før, stikk ut og kjøp begge sesongene. Dette er nemlig en av verdens beste tv-serier, og etter å ha sett de fire første episodene av sesong 2, konstaterer jeg at kvaliteten ser ut til å holde seg på et nesten overraskende høyt nivå! Sesong 2 starter nøyaktig der sesong 1 sluttet. Julius Cæsar ligger myrdet i Senatet og Lucius Vorenus hulker over liket av Neobie. Så blir historien enda mørkere. Marcus Antonius tar over styringen av Roma, men utfordres fra ...
1 html/train/000001.html 6 Twin Peaks - definitive gold box edition train positive Tv-serien Twin Peaks, skapt av David Lynch og Mark Frost, trollbandt publikum på starten av 1990-tallet. Nå er begge sesongene samlet på DVD i en såkalt ”definitive gold box edition” som viser at serien ikke har mistet noe av appellen. Det eneste som egentlig røper alderen, er at serien ikke er i widescreen, og at flere av skuespillerne fremdeles er unge og vakre. 17 år etter premieren har de falmet, som mennesker gjør, men Twin Peaks sikrer dem evig liv. Serien handler om et mordmysterium i den lille byen Twin Peaks, et sted langs USAs grense til Canada. Unge, vakre Laura Palmer blir funn...
2 html/train/000002.html 6 The Wire (sesong 1-4) train positive I neste uke kommer sesong 5 av tv-serien ”The Wire” på DVD. 2008 har for meg vært sterkt preget av denne serien. Hjemme hos oss begynte vi med sesong 1 i vår. Da hadde jeg i lengre tid hørt panegyriske lovord om serien fra både venner og media. Vi ble også fanget av skildringene av purk og skurk i Baltimore, og pløyde oss igjennom alt til og med sesong 4 på sensommeren. Jeg vil ikke gå så langt som å kalle det ”verdens beste serie”, som noen har gjort, men det er ingen tvil om at dette er noe av det bedre som er blitt vist på tv! Serien forteller om en gruppe politietterforskere som samles...

We are mainly interested in the text, split and sentiment columns. Let's do a quick check for missing values:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8613 entries, 0 to 8612
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   filename   8613 non-null   object
 1   rating     8613 non-null   int64 
 2   title      8613 non-null   object
 3   split      8613 non-null   object
 4   sentiment  8613 non-null   object
 5   text       8557 non-null   object
dtypes: int64(1), object(5)
memory usage: 403.9+ KB

It seems there are a few missing values in our text column, so let's get rid of those rows and reset the index:

df = df.dropna()
df = df.reset_index(drop=True)

We'll load the language model vocabulary lm_itos.pkl that we made in a previous post:

path = Path('~/.fastai/data/norec/')
Path.BASE_PATH = path
(path/'models').ls()
(#8) [Path('models/finetuned_model.pth'),Path('models/norwegian_wgts.h5'),Path('models/norwegian_enc.pth'),Path('models/lm_itos.pkl'),Path('models/finetuned_encoder.pth'),Path('models/norwegian.zip'),Path('models/norwegian_enc.h5'),Path('models/norwegian_itos.pkl')]
with open(path/'models/lm_itos.pkl', 'rb') as f:
    itos = pickle.load(f)
itos[:10], itos[-5:], len(itos)
(['xxunk', 'xxpad', '.', 'i', ',', 'og', '\n\n', 'av', 'som', 'en'],
 ['learning', 'initiativtager', 'forskningsleder', 'devils', 'graeme'],
 30002)

We'll load the encoder, finetuned_encoder.pth, at a later stage when our classifier is set up.

Setup dataloader

Before we can make a model we need a dataloader. The dataloader is responsible for keeping track of the data, labels and dataset splits, among other things, and will feed batches to our model during training. We will use the data block API instead of one of the simpler factory methods. The data block represents the mid-level of the fastai API and is a flexible way to set up our dataloader. For the ickiest datasets, one might have to drop down to the lowest level of the API.

A data block is basically an assembly line that we will send our data through. That means we set up a string of functions that will take our dataframe and extract the relevant information the dataloader needs:

  1. define which types (blocks) our dependent and independent variables are. In our case that is text and category.
  2. create a tokenizer and a vocabulary. We need to turn our text into a numerical representation according to some vocabulary.
  3. define how to get the text: in our case we read it from the dataframe column 'text'
  4. define how to get the labels: in our case we'll read it from the dataframe column 'sentiment'
  5. split the data into train, validation and test datasets. In our case we have to locate the indexes of the rows based on the value of the 'split' column: 'train', 'dev' or 'test'.

Note: In fastai and kaggle lingo the test set is an unlabeled dataset on which we test our model. In our case, though, we also have labels for the test set. So we can assess the performance on both the validation and test set.

First let's set up a basic function that splits the data according to which split each row belongs to. The data block expects a set of indexes for the training, validation and test datasets:

def split(df):
    train_idx = df.loc[df['split'] == 'train'].index.to_list()
    valid_idx = df.loc[df['split'] == 'dev'].index.to_list()
    test_idx = df.loc[df['split'] == 'test'].index.to_list()
    return L(train_idx), L(valid_idx), L(test_idx)

If we pass our df through this function it simply returns the indexes of the various splits:

split(df)
((#6863) [0,1,2,3,4,5,6,7,8,9...],
 (#876) [136,137,138,140,141,142,143,144,145,146...],
 (#818) [183,184,186,187,188,189,190,191,192,193...])

We can use the built-in class ColReader to read the actual texts and labels. The ColReader class has a __call__ method that reads a particular column from the dataframe that is passed to it:

reader = ColReader(cols='sentiment')
reader(df)[:3]
0    positive
1    positive
2    positive
Name: sentiment, dtype: object

We now have all we need to set up our data block:

Note: The seq_len has to be the same as the one we trained our language model with!
reviews = DataBlock(blocks=(TextBlock.from_df(text_cols='text', vocab=itos, seq_len=72), CategoryBlock),
                    get_x=ColReader('text'),
                    get_y=ColReader('sentiment'),
                    splitter=split
                   )

A great way of debugging your data block is the summary(df) method. It takes you through all the steps in the pipeline. The output is very long though, so I left it out of this notebook.

#reviews.summary(df)

Finally let's create the actual data loader from our source dataframe:

dls = reviews.dataloaders(df, bs=64)
dls.show_batch(max_n=4)
text category
0 xxbos html , body { border : xxunk ; } — xxunk det enkle er grunnlaget for all \n xxunk , sa xxunk xxunk xxunk xxunk til xxunk xxunk xxunk xxunk i 1923 . xxunk filmen « xxunk xxunk xxunk \n & xxunk igor xxunk xxunk » , som handler om xxunk xxunk korte xxunk med den \n russiske komponisten xxunk igor xxunk xxunk , baserer seg på samme dekret . xxunk xxunk selv \n skjærer som en mørk xxunk inn i scenene , med sine smale , svarte antrekk . xxunk store \n deler av filmen foregår på xxunk xxunk landsted , dit den velstående xxunk \n inviterer den fattige komponisten og familien hans for at han skal få jobbe , og \n de svarte og hvite linjene i xxunk xxunk xxunk xxunk seg xxunk gjennom regissør \n xxunk jan xxunk xxunk bilder slik xxunk xxunk selv skjærer gjennom positive
1 xxbos xxunk vakre , xxunk xxunk christine xxunk brown sover trygt i sin xxunk xxunk , med kjæresten ved sin side . xxunk ei flue flyr inn gjennom vinduet , summer xxunk gjennom rommet og lander på xxunk , før den setter kursen mot vår blonde xxunk . xxunk musikken er xxunk og xxunk . xxunk xxunk likeså og xxunk inn i xxunk browns ene xxunk . xxunk og ut igjen fra det andre . xxunk før den forsvinner inn i xxunk hennes og vekker henne til hennes livs verste mareritt , ei xxunk , xxunk xxunk , som xxunk over henne med brune xxunk , xxunk for å bite henne til døde . xxup scenen xxup fra « drag xxunk me to xxunk hell » representerer det meste xxunk sam xxunk xxunk står for som filmskaper . xxunk den er xxunk , xxunk og preget av humor . xxunk positive
2 xxbos xxunk lawrence of xxunk arabia er en film for de store lerret . xxunk den bør helst oppleves i en xxunk , fortrinnsvis i sitt opprinnelige 70 mm format . xxunk men siden det er en xxunk i 2012 , er den nye xxunk blu - ray - utgivelsen det nest beste . xxunk den er faktisk dobbelt restaurert . xxunk den digitale xxunk er nemlig basert på en xxunk xxunk fra 1988 , da den også ble rekonstruert til sin opprinnelige lengde . xxunk og på xxunk blu - ray har xxunk lawrence of xxunk arabia en fantastisk klarhet og dybde som forsterker følelsen av xxunk xxunk . xxunk det er helt utrolig at en 50 år gammel film kan se så bra ut ! xxunk david xxunk xxunk mesterverk skildrer en mann som lar sin indre kamp komme til uttrykk i ytre handlinger . xxunk når en positive
3 xxbos xxunk 3 | • xxunk dagbladets reporter xxunk i xxunk cannes . xxunk les hans rapport her . xxunk xxup film : xxunk vi har lest om xxunk og \n skandale . xxunk om sex , blod , xxunk og vold . xxunk om xxunk av xxunk og \n xxunk xxunk som xxunk blod . xxunk om journalister som har xxunk og \n kritikere som har slaktet og xxunk om hverandre . xxunk xxunk for en tid tilbake befant regissør von xxunk trier seg i en dyp \n depresjon . xxunk den svært xxunk dansken har fortalt om psykiske problemer før , \n men ifølge ham selv var tidligere depresjoner ingen ting sammenliknet med mørket \n som rammet ham nå . i lang tid var han ikke i stand til å gjøre annet enn å ligge \n og xxunk tomt ut i lufta . xxunk som en form for positive

Note that our data block reviews is specific to the source we plan to use it with. If we instead pass the underlying numpy values of the dataframe to it, the reader and splitter functions won't work:

#reviews.dataloaders(df.values)

Inspect a batch

What does our data actually look like? Grabbing a batch and inspecting it is often a useful way of understanding what the model will actually see:

xb, yb = dls.one_batch()
xb.shape, yb.shape
(torch.Size([64, 2990]), torch.Size([64]))

The independent variable xb represents the numericalized text of the reviews. We have a batch size of 64, and the longest text in this particular batch has a length of 2990 tokens - but this will vary from batch to batch. We can get the vocab from our dataloader, dls, to check what the numbers represent. We have two vocabs, one for the text and one for the labels of our data:

tokens, classes = dls.vocab
tokens[:5], classes
(['xxunk', 'xxpad', '.', 'i', ','], (#2) ['negative','positive'])

Let's first grab a few numericalized tokens from our batch:

nums = xb[0][10:15].cpu()
nums
tensor([2631,    0,   18, 2770,   10])

We can use the fastai L-class to look up those indexes in the vocab. Remember that token 2631 is simply the token at index 2631 in our vocab:

tokens[2631]
'—'
L(tokens)[nums]
(#5) ['—','xxunk','det','enkle','er']

Note: L is a fastai version of the python List datatype with some added functionality such as slicing from a list of indexes
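To make this concrete, here is a tiny, self-contained example (with made-up values, not our data) of indexing an L with a list of positions:

L(['a', 'b', 'c', 'd'])[[0, 2]]  # should print something like (#2) ['a','c']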

Our dependent variable is as expected, either positive or negative. Once again we can use the vocab to get the actual classes. 1 is positive and 0 is negative:

yb[:5].cpu(), L(classes)[yb[:5]]
(tensor([1, 1, 0, 1, 1]),
 (#5) ['positive','positive','negative','positive','positive'])

Padding

The longest text in the batch decides the shape of the entire batch. What happens with the shorter texts? They are padded with our padding token, in our case token id 1:

tokens[1]
'xxpad'

If we look at the final text in the batch we see that it's padded on both sides:

xb[-1].cpu(), len(xb[-1])
(tensor([1, 1, 1,  ..., 1, 1, 1]), 2990)

We can count the padding of each text in the batch:

(xb==1).sum(dim=1).cpu()
TensorText([   0,   94,  404, 1179, 1352, 1374, 1376, 1388, 1405, 1434, 1499, 1513,
        1563, 1580, 1615, 1617, 1617, 1634, 1687, 1717, 1730, 1743, 1763, 1793,
        1794, 1813, 1851, 1870, 1893, 1893, 1895, 1900, 1951, 1980, 1984, 1986,
        1991, 1996, 2006, 2010, 2014, 2014, 2023, 2032, 2055, 2067, 2072, 2079,
        2081, 2082, 2091, 2091, 2117, 2118, 2128, 2131, 2136, 2140, 2143, 2144,
        2147, 2159, 2164, 2165])

We see the longest review is placed first, and the batch has progressively shorter texts.
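Fastai handles this sorting and padding for us behind the scenes. Just to illustrate the idea, here is a minimal sketch of sorting a batch by length and padding the shorter texts with token id 1 (a simplified sketch, not fastai's actual padding transform, which is more sophisticated about where the padding goes):

def pad_batch(seqs, pad_id=1):
    # sort longest first, then pad every sequence to the length of the longest
    seqs = sorted(seqs, key=len, reverse=True)
    max_len = len(seqs[0])
    return [[pad_id] * (max_len - len(s)) + s for s in seqs]

pad_batch([[5, 6, 7, 8], [9, 10], [11]])  # -> [[5, 6, 7, 8], [1, 1, 9, 10], [1, 1, 1, 11]]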

Creating our model

Setting up the model is pretty straightforward, but since our language model had a hidden size of 1150, we have to change the default in the config dictionary. Note also that we pass pretrained=False. We'll load our pretrained encoder manually instead.

awd_lstm_clas_config['n_hid'] = 1150

learn = text_classifier_learner(dls, arch=AWD_LSTM, metrics=accuracy, 
                                config=awd_lstm_clas_config, pretrained=False).to_fp16()

learn.load_encoder(path/'models/finetuned_encoder')
<fastai2.text.learner.TextLearner at 0x7fb1d16c5350>

Inspecting the model

When looking at the model we can see that the first layers of the model are frozen, i.e. the trainable column says False. It's only the randomly initialized linear classification layers that are trainable. This makes sense: we first want to calibrate these layers as much as possible before we proceed to fine-tune the language model. We'll come back to freezing and unfreezing of the model later in the post.

learn.summary()
SequentialRNN (Input shape: ['64 x 2990'])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
RNNDropout           64 x 38 x 400        0          False     
________________________________________________________________
RNNDropout           64 x 38 x 1150       0          False     
________________________________________________________________
RNNDropout           64 x 38 x 1150       0          False     
________________________________________________________________
BatchNorm1d          64 x 1200            2,400      True      
________________________________________________________________
Dropout              64 x 1200            0          False     
________________________________________________________________
Linear               64 x 50              60,000     True      
________________________________________________________________
ReLU                 64 x 50              0          False     
________________________________________________________________
BatchNorm1d          64 x 50              100        True      
________________________________________________________________
Dropout              64 x 50              0          False     
________________________________________________________________
Linear               64 x 2               100        True      
________________________________________________________________

Total params: 62,600
Total trainable params: 62,600
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7fb1fc4a0f80>
Loss function: FlattenedLoss of CrossEntropyLoss()

Model frozen up to parameter group number 4

Callbacks:
  - ModelReseter
  - RNNRegularizer
  - ModelToHalf
  - TrainEvalCallback
  - Recorder
  - ProgressCallback
  - MixedPrecision

We also recognize most of the other dimensions in the architecture:

  • batch size of 64
  • embedding size of 400
  • hidden size of LSTMs is 1150
  • 1200 and 50 output channels from the linear classifier layers
  • but why is seq_len 38? Didn't we set it to 72 for the data loader? We'll investigate this later in the post.
learn.model
SequentialRNN(
  (0): SentenceEncoder(
    (module): AWD_LSTM(
      (encoder): Embedding(30002, 400, padding_idx=1)
      (encoder_dp): EmbeddingDropout(
        (emb): Embedding(30002, 400, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): WeightDropout(
          (module): LSTM(400, 1150, batch_first=True)
        )
        (1): WeightDropout(
          (module): LSTM(1150, 1150, batch_first=True)
        )
        (2): WeightDropout(
          (module): LSTM(1150, 400, batch_first=True)
        )
      )
      (input_dp): RNNDropout()
      (hidden_dps): ModuleList(
        (0): RNNDropout()
        (1): RNNDropout()
        (2): RNNDropout()
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): LinBnDrop(
        (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.2, inplace=False)
        (2): Linear(in_features=1200, out_features=50, bias=False)
        (3): ReLU(inplace=True)
      )
      (1): LinBnDrop(
        (0): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.1, inplace=False)
        (2): Linear(in_features=50, out_features=2, bias=False)
      )
    )
  )
)

Looking inside our model

Before we do the actual classification, let's have a look at what the model actually does at each step.

What does the SentenceEncoder do?

In short, the SentenceEncoder takes text and outputs a vector representation of it. Let's split the model into the SentenceEncoder and the classifier and inspect both:

enc, lin = learn.model.children()
enc
SentenceEncoder(
  (module): AWD_LSTM(
    (encoder): Embedding(30002, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(30002, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
)
enc.summary(xb)
SentenceEncoder (Input shape: ['64 x 2990'])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
RNNDropout           64 x 38 x 400        0          False     
________________________________________________________________
RNNDropout           64 x 38 x 1150       0          False     
________________________________________________________________
RNNDropout           64 x 38 x 1150       0          False     
________________________________________________________________

Total params: 0
Total trainable params: 0
Total non-trainable params: 0

Here we see a seq_len of 38 again. This is an artifact of the model summary and is specific to the batch. The LSTM processes a batch in chunks of 72 - the actual seq_len. The summary shows the final chunk, where there simply happen to be 38 tokens left:

2990%72
38

What comes out of the encoder if we pass it a batch of data?

enc_x = enc(xb)
[e.shape for e in enc_x]
[torch.Size([64, 1406, 400]), torch.Size([64, 1406])]

The SentenceEncoder outputs two things, and if we check the source code, SentenceEncoder??, we find that it returns: return outs,mask.

The first output is our encoded text. The shape of the output is bs, len, embedding size. But where does the 1406 size come from? This is also specific to the batch we pass in. The get_text_classifier is called with a default max_len of 72*20 = 1440. So the SentenceEncoder encodes the final sequence of tokens up to a maximum length of 1440. It also makes sure to include the last sequence of the batch:

SentenceEncoder??

Our batch has size 2990, much longer than the maximum. So we will fill up 19 full chunks of seq_len 72 and then add the remainder of 38 from the final chunk as the last part of the encoded sentence:

2990%72
38

So 19 full chunks and the remainder gives us:

72*19 + 38
1406
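In other words, the SentenceEncoder walks through the batch in chunks of seq_len, carrying the LSTM hidden state between chunks, and only keeps the outputs that fall within the last max_len tokens. A simplified sketch of that loop (not the actual fastai source - it ignores details such as the padding mask, and encode_in_chunks is just a name I made up):

def encode_in_chunks(awd_lstm, input_ids, seq_len=72, max_len=72*20):
    # input_ids: (bs, sl) tensor of token ids
    bs, sl = input_ids.shape
    awd_lstm.reset()                               # start with a fresh hidden state
    outs = []
    for i in range(0, sl, seq_len):
        out = awd_lstm(input_ids[:, i:i+seq_len])  # hidden state carries over between chunks
        if sl - i <= max_len:                      # only keep outputs within the last max_len tokens
            outs.append(out)
    return torch.cat(outs, dim=1)                  # for our batch: 19*72 + 38 = 1406 time steps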

The mask output is a padding mask that tells us where the padding tokens are in the processed text:

enc_x[1].sum(1).cpu()
tensor([  0,  22,  44,  27,  56,   6,   8,  20,  37,  66,  59,   1,  51,  68,
         31,  33,  33,  50, 103, 133, 146, 159, 179, 209, 210, 229, 267, 286,
        309, 309, 311, 316, 367, 396, 400, 402, 407, 412, 422, 426, 430, 430,
        439, 448, 471, 483, 488, 495, 497, 498, 507, 507, 533, 534, 544, 547,
        552, 556, 559, 560, 563, 575, 580, 581])

If we take the final 1406 tokens of our batch we get the same sum of padding tokens:

(xb[:, -1406:]==1).sum(1).cpu()
tensor([  0,  22,  44,  27,  56,   6,   8,  20,  37,  66,  59,   1,  51,  68,
         31,  33,  33,  50, 103, 133, 146, 159, 179, 209, 210, 229, 267, 286,
        309, 309, 311, 316, 367, 396, 400, 402, 407, 412, 422, 426, 430, 430,
        439, 448, 471, 483, 488, 495, 497, 498, 507, 507, 533, 534, 544, 547,
        552, 556, 559, 560, 563, 575, 580, 581])

What does the classifier look like?

The classifier takes the encoded sentence and produces a binary prediction:

lin.summary(enc_x)
PoolingLinearClassifier (Input shape: ["['64 x 1406 x 400', '64 x 1406']"])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
BatchNorm1d          64 x 1200            2,400      True      
________________________________________________________________
Dropout              64 x 1200            0          False     
________________________________________________________________
Linear               64 x 50              60,000     True      
________________________________________________________________
ReLU                 64 x 50              0          False     
________________________________________________________________
BatchNorm1d          64 x 50              100        True      
________________________________________________________________
Dropout              64 x 50              0          False     
________________________________________________________________
Linear               64 x 2               100        True      
________________________________________________________________

Total params: 62,600
Total trainable params: 62,600
Total non-trainable params: 0

If we pass the encoded batch through the model, we get a binary output for each item in our batch:

lin(enc_x)[0].shape
torch.Size([64, 2])

It's not entirely clear how we go from the 1406-token input to the first batchnorm layer that expects an input of size 1200. But if we check the source code of the PoolingLinearClassifier module, PoolingLinearClassifier??, we see that it indeed has a masked_concat_pool in its forward method which changes the dimensionality of the input.
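The idea of masked_concat_pool is to reduce the (bs, 1406, 400) encoder output to a fixed-size vector by concatenating the last hidden state with a max pool and an average pool over the non-padded time steps - hence 3*400 = 1200. A simplified sketch of the idea (the real implementation is more careful about picking the last non-padding time step):

def concat_pool_sketch(out, mask):
    # out: (bs, sl, 400) encoder activations; mask: (bs, sl), True at padding positions
    lens = (~mask).sum(dim=1, keepdim=True).type(out.dtype)            # real tokens per review
    avg_pool = out.masked_fill(mask[:, :, None], 0).sum(dim=1) / lens  # mean over real tokens only
    max_pool = out.masked_fill(mask[:, :, None], -float('inf')).max(dim=1)[0]
    last = out[:, -1]                                                  # last time step (a simplification)
    return torch.cat([last, max_pool, avg_pool], dim=1)                # (bs, 3*400) = (bs, 1200)

concat_pool_sketch(enc_x[0], enc_x[1]).shape  # -> torch.Size([64, 1200])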

Making a prediction

Let's pass the batch we grabbed above through the entire model:

preds = learn.model(xb)
[p.shape for p in preds]
[torch.Size([64, 2]), torch.Size([64, 1406, 400]), torch.Size([64, 1406, 400])]

Our model produces three things: the binary predictions for each item in the batch, plus two copies of the encoded sentences (presumably there for the RNNRegularizer callback we saw in the learner summary). Note that the two encoded batches of sentences are identical:

(preds[1]!=preds[2]).sum().item()
0

Since the model is untrained at this point, the prediction should be random. If we take the softmax of the final dimension we see that the model is mostly 50-50 on every prediction - so just a random guess at this point!

preds[0].softmax(-1)[:5].cpu()
tensor([[0.4973, 0.5027],
        [0.4912, 0.5088],
        [0.4805, 0.5195],
        [0.4879, 0.5121],
        [0.5008, 0.4992]], grad_fn=<CopyBackwards>)

A quick glance at the weights

Let's also take a quick glance at the various weights of the model. If we plot one of the classifier weight matrices, it's clear that it has been randomly and uniformly initialized, and thus has not been trained yet:

plt.hist(to_np(lin.state_dict()['layers.0.2.weight']).flatten());

If we instead check one of the encoder weights, we get a completely different pattern. Most weights are centered around 0:

plt.hist(to_np(enc.state_dict()['module.rnns.1.module.weight_ih_l0']).flatten(), bins=50);

What are the frozen parameter groups?

Fastai operates with a concept of frozen parameter groups, i.e. the model is split into groups which can be made trainable or not. We can check the status of the model's parameter groups:

len(learn.opt.param_groups), learn.opt.frozen_idx
(5, 4)

That is a total of 5 parameter groups, and currently the first 4 are frozen. This corresponds with what we saw in the model summary - only the top classifier layers were trainable. During training we can unfreeze parts of the model:

learn.freeze_to(-3), learn.opt.frozen_idx
(None, 2)

Let's reset the model before we train it:

learn.freeze_to(-1), learn.opt.frozen_idx
(None, 4)

The parameter groups aren't named, but if we check the shape of the weights from each group we recognize the various layers:

  1. The embedding layer
  2. LSTM 1
  3. LSTM 2
  4. LSTM 3
  5. Linear classifier
[group.get('params')[0].shape for group in learn.opt.param_groups]
[torch.Size([30002, 400]),
 torch.Size([4600, 1150]),
 torch.Size([4600, 1150]),
 torch.Size([1600, 400]),
 torch.Size([1200])]

Training the model

It's high time to actually train the model. We'll stick with the standard fastai procedure for fine-tuning a sentiment classifier:

  • lr_find() to find a sensible learning rate
  • train with the one cycle policy
  • gradual unfreezing of the layers - train the top layers first
  • discriminative learning rates, that is, train the lower layers with a smaller learning rate than the top layers.

This procedure is well known from the fastai course and documentation. It's also a very robust method in my experience; it seems to just work for most datasets.
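As a condensed overview, the schedule boils down to roughly this (just a sketch of the steps we run one by one below, with lr_find in between, using the learning rates we end up choosing):

learn.fit_one_cycle(3, 1e-2)                      # 1. train only the classifier head
for n in (-2, -3):                                # 2. unfreeze one more parameter group at a time
    learn.freeze_to(n)
    learn.fit_one_cycle(1, slice(1e-3/50, 1e-3))  #    discriminative learning rates
learn.unfreeze()                                  # 3. finally train the whole model
learn.fit_one_cycle(5, slice(1e-3/50, 1e-3))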

learn.lr_find()
SuggestedLRs(lr_min=0.010000000149011612, lr_steep=0.0063095735386013985)

This pattern is typical for untrained networks: there is a part of the graph which shows substantially lower loss than the other parts. This makes sense, since changing random weights should give us substantial improvements pretty fast. We'll start with a learning rate of 1e-2 and train for a few epochs. Note: We are only training the top layers in this step:

lr = 1e-2
learn.fit_one_cycle(3, lr)
epoch train_loss valid_loss accuracy time
0 0.619132 0.507724 0.752283 00:24
1 0.579055 0.467570 0.768265 00:23
2 0.540422 0.468025 0.779680 00:24

We're already at nearly 78% accuracy, and we haven't touched the actual language model yet! Let's unfreeze the next parameter group (the topmost LSTM) and run lr_find one more time:

learn.freeze_to(-2)
learn.lr_find()
SuggestedLRs(lr_min=6.309573450380412e-08, lr_steep=3.0199516913853586e-05)

This pattern is what you would expect from a model that has been trained for a while. There are no random weights left, and further progress is much harder to find. We'll set a lower base learning rate for the remainder of the training, and slice it with a scaling factor. This means that the lowest parameter groups get a lower learning rate, and the top layers a higher one. When we slice we'll use the magic scaling factor of 2.6^4 ≈ 45. The source of this scaling factor is fastai guru Jeremy Howard, who found it to work well empirically. I've rounded it to 50 for the sake of simplicity.
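To get a feel for what slice(lr/scaling, lr) gives each of our 5 parameter groups, here is a rough illustration of learning rates spaced geometrically between the two endpoints (an approximation of the idea, not fastai's exact implementation):

import numpy as np
lr, scaling, n_groups = 1e-3, 50, 5
np.geomspace(lr / scaling, lr, n_groups)  # lowest parameter group gets the smallest learning rate
# -> roughly array([2.0e-05, 5.3e-05, 1.4e-04, 3.8e-04, 1.0e-03])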

lr = 1e-3
scaling = 50
learn.fit_one_cycle(1, slice(lr/scaling, lr))
epoch train_loss valid_loss accuracy time
0 0.519882 0.468881 0.779680 00:29

We continue to unfreeze the next layer, and train a bit more:

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(lr/scaling,lr))
epoch train_loss valid_loss accuracy time
0 0.468425 0.448788 0.797945 00:43

Our model is gradually improving, and we're at 80% accuracy. Finally we unfreeze the entire model and train a final few epochs:

learn.unfreeze()
learn.fit_one_cycle(5, slice(lr/scaling,lr))
epoch train_loss valid_loss accuracy time
0 0.440907 0.427563 0.809361 00:56
1 0.374653 0.418673 0.819635 00:54
2 0.329612 0.423127 0.831050 00:51
3 0.279565 0.419359 0.828767 00:57
4 0.262696 0.429828 0.825342 00:55

We end up with around 83% validation accuracy in our best epoch. This is much better than the 50-50 we would have gotten from a random guess:

df['sentiment'].value_counts(normalize=True)
negative    0.511044
positive    0.488956
Name: sentiment, dtype: float64

But is our model really a good one? This is difficult to say, since I haven't seen any results for this particular dataset, and we don't have any other models as baselines to compare with. The English IMDb dataset gets over 95% accuracy with ULMFiT, but that task likely has access to a better language model and roughly 10x more data. On the other hand, the reviews in our dataset are written by relatively few journalists, and we would maybe expect their writing style to be more consistent. Finally, it should also be noted that our sentiment classes are probably closer together than in the IMDb example. That is, we defined negative as a rating of 1-3 out of 6, and positive as 5-6. The IMDb reviews are more polarized, so that task is perhaps a bit easier.

Note that we also have a test set, so this is the actual dataset to score our model on.

We already passed the test dataset to our dataloaders, so now we can access it directly. The test set has index 2 (0 is train, 1 is validation). We can simply validate on this data to get the score, since we're not interested in the particular predictions at this point.

learn.validate(ds_idx=2)
(#2) [0.42844530940055847,0.8227384090423584]

The test set accuracy is roughly the same as the validation set accuracy. This is what we hoped for: if there were larger differences, we would have to go back to our model to see if we did something wrong, or check whether the test set was sampled from a different distribution.

A final important point is to investigate the predictions of the model. This could increase our understanding of what the model actually learns, help us discover data leaks, and in general increase our confidence in the model. We'll come back to model interpretation in a later post!