Finding a Norwegian language dataset for sentiment analysis
In this post I'll try to find a Norwegian language dataset suitable for sentiment analysis. This dataset will be used in the upcoming posts to test various deep learning methods such as ULMFiT and MultiFiT. I will also experiment with the fastai2 deep learning library.
I'm currently taking the fast.ai online course Practical Deep Learning for Coders (to be released publicly in July 2020). As part of my studies I want to explore the fastai2 deep learning library, and I will do so by testing various NLP methods.
Over the next few posts I'll go through the entire process of finding and processing data, training various models and interpreting the results. I'll try to highlight the things I've been struggling with or confused about. I think writing short blog posts like these is a great way of learning new material.
I wanted to find a dataset in my native Norwegian to analyze. NLP for languages other than English is often challenging, even though many new techniques are addressing this. After a bit of searching I found the Norec dataset, which contains Norwegian-language reviews of films, music and more. The dataset even comes with a paper that explains how the data is set up! This seems like a great case study, and it is similar to the IMDb movie review sentiment analysis, which is one of the built-in datasets of the fastai2 library.
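For reference, the English IMDb data that comparison points to ships with fastai2 and can be fetched in a couple of lines. This is just an aside, and it uses the fastai2 import introduced further down:

# aside: fastai2's built-in IMDb sample, for comparison with the Norwegian data
from fastai2.text.all import untar_data, URLs
imdb_path = untar_data(URLs.IMDB_SAMPLE)   # downloads and extracts a small IMDb sample
imdb_path.ls()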
The repo includes a utility library and a download.sh script. Normally you would use these to ease the process of actually getting the data, but in this case I'd like to do things manually. We'll be using the fastai2 library (which upon release will be renamed to fastai).
Note that import * is usually not encouraged, but the fastai2 library defines its __all__ variables properly, so this won't be a problem. See this file for the various imports that are part of the library.
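As a quick generic illustration (not fastai2 code) of how __all__ controls what import * exposes:

# mymodule.py -- toy example, purely for illustration
__all__ = ['public_fn']       # `from mymodule import *` will only bring in this name

def public_fn(): ...
def other_fn(): ...           # still importable explicitly, but not pulled in by *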
from fastai2.text.all import *
I'm using fastai2 '0.0.17' and fastcore '0.1.17'. To check the version you can uncomment the following lines:
#collapse
# import fastai2, fastcore
# fastai2.__version__, fastcore.__version__
First we'll download the data from the url given in the download.sh file in the GitHub repo to a dest with filename archive_name. Note the !command syntax, which runs the corresponding wget shell command. The wget command only has to be run once.
url = 'http://folk.uio.no/eivinabe/norec-1.0.1.tar.gz'
data_path = Path('~/.fastai').expanduser()  # expand ~ so the path also works outside the shell
dest = data_path/'archive'
dest.mkdir(parents=True, exist_ok=True)     # make sure the download folder exists
archive_name = 'norec.tar.gz'
!wget {url} -O {dest/archive_name} -q
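If you'd rather avoid the shell, a rough standard-library equivalent (just a sketch, not what this post uses) could look like this; it also skips the download if the archive is already in place:

# plain-Python alternative to the wget call above; only downloads once
from urllib.request import urlretrieve
if not (dest/archive_name).exists():
    urlretrieve(url, dest/archive_name)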
Then we'll extract the archive with the tarfile library (imported via the fastai2 import at the beginning of the notebook). We'll extract from the archive location to a data folder. Finally we set a new path that points to this location.
tarfile.open(dest/archive_name).extractall(data_path/'data')
path = data_path/'data/norec'
Path.BASE_PATH=path # this avoids printing the entire file-path when we list files in a directory
path.ls()
The archive contains a .json file with metadata and, among other things, an html.tar.gz archive with our desired raw texts. We'll extract this archive too. The conllu.tar.gz archive contains tokenized and filtered text, which we don't need for the time being.
tarfile.open(path/'html.tar.gz').extractall(path)
path.ls()
The extracted archive is now in the data/norec/html folder, which contains the train, dev (we'll call it validation) and test splits.
(path/'html').ls()
Let's inspect the metadata file and see if we are able to make sense of the data. But what does this json file look like? We can use the head command to have a look at the raw file contents before we attempt to read it into a pandas dataframe:
!head -n 20 {path/'metadata.json'}
Let's try to read the data with pandas' default read_json():
pd.read_json(path/'metadata.json').head(3)
Not quite what we were looking for! From the read_json documentation we see that we can change the orientation of the data with the orient option, and 'index' seems to be what we're after. Note that we could also have transposed the data frame for a similar result. There is another, harder-to-spot problem: the index of our dataframe should be the string representation of the json key, which is the actual file name of the corresponding review, but it gets cast to an int. This turns '000000' into 0. So we'll set convert_axes to False. We'll also reset the index and rename it to filename.
df = pd.read_json(path/'metadata.json', orient='index', convert_axes=False)
df = df.reset_index().rename(columns={'index': 'filename'})
df.head(3)
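As an aside, the transpose mentioned above would look something like this; it still needs convert_axes=False to keep the string filenames:

# alternative: read with the default orientation and transpose
pd.read_json(path/'metadata.json', convert_axes=False).T.head(3)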
There are several category-like columns, as explained in the paper. We will use the category column.
df['category'].value_counts()
df['language'].value_counts()
We'll proceed with a subset of the data: the screen category and the Bokmål (nb) language. The screen category contains both movie and TV reviews. The data frame also contains several columns we won't be using for now, so let's select only the relevant ones. We're left with ~13000 reviews.
screen = df.loc[(df['category']=='screen') & (df['language']=='nb')]
screen = screen.loc[:, ['filename', 'rating', 'title', 'split']]
print(screen.shape)
screen.head(3)
Let's also change the filename so that it gives the path to our review files.
screen['filename'] = 'html/'+screen['split']+'/'+screen['filename']+'.html'
screen.sample(3)
The ratings in the screen category have a slight positive skew.
screen['rating'].value_counts().sort_index().plot(kind='bar');
We'll encode ratings of 1 to 3 as negative, and 5 and 6 as positive. Reviews rated 4 will be removed. We lose a bit of data this way, but the positive and negative reviews become more distinct, which will make downstream classification a bit simpler. This leaves us with ~8600 reviews.
screen = screen.loc[screen['rating']!=4].reset_index(drop=True)
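# note: since ratings of 4 were dropped above, k>=4 below only matches ratings 5 and 6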
screen['sentiment']=screen['rating'].apply(lambda k: 'positive' if k>=4 else 'negative')
print(screen.shape)
screen.head(3)
The train, validation (dev) and test splits look reasonable:
screen['split'].value_counts(normalize=True)
And the dataset is well balanced, i.e. there is a similar number of labels for each class.
screen['sentiment'].value_counts(normalize=True)
Let's also add the full text to the dataframe for convenience. This step is not strictly necessary, and it doesn't scale to big data. The html/train folder contains html files, and the data frame gives us our filenames.
(path/'html/train').ls()
screen.head(3)
Let's inspect the second file in the data frame, which we expect to be a review of Twin Peaks. We'll open the file and print the raw contents.
fn = screen.loc[1, 'filename']
item = (path/fn)
item
with open(item) as f:
    review = f.read()
review[:1500]
The data contains normal text but also several html tags. The REMOVE tag was added by the authors of the dataset to mark unwanted text such as image captions. In general we want to keep our text as intact as possible, but some of it is clearly noise, so we'll get rid of the REMOVE tags and the titles. The norec repo contains a method to do this with the lxml library. We'll change the code slightly to also remove headers.
from lxml.html import fragments_fromstring

def html_to_text(html):
    # keep only the text content of top-level <p> fragments
    return ' '.join(elem.text_content() for elem in fragments_fromstring(html) if elem.tag == 'p')
html_to_text(review)[:1500]
That looks much better! Now let's combine the two steps into a function that easily gets the cleaned review text from a file name:
def get_review(fn):
    # read an html review file and return its cleaned plain text
    with open(fn) as f:
        return html_to_text(f.read())
get_review(item)[:1500]
Finally we append the review text to our data frame, and save it for future use.
screen['text'] = screen['filename'].apply(lambda o: get_review(path/o))
screen.to_feather(path/'norec_df')
screen.head(3)
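As a quick sanity check (and a preview of how later posts can pick the data back up), the feather file reads straight back into pandas:

# verify the round trip by reading the saved dataframe back in
check = pd.read_feather(path/'norec_df')
check.shape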
I will use this dataset in future posts to explore various NLP techniques. In the upcoming post we will see if we are able to train a ULMFiT classifier on this dataset, and how it compares to the results for the similar English IMDb dataset. I also hope to test MultiFiT in a later post.
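As a rough, untested sketch of where that next post is headed (the actual code may well differ), the saved dataframe could be wired into fastai2's DataBlock API along these lines:

# sketch only: build text dataloaders for a classifier from the saved dataframe
norec = pd.read_feather(path/'norec_df')
norec = norec.loc[norec['split'] != 'test'].copy()   # hold the test split back for later
norec['is_valid'] = norec['split'] == 'dev'          # use the dev split as validation
dls = DataBlock(
    blocks=(TextBlock.from_df('text'), CategoryBlock),  # tokenize the 'text' column
    get_x=ColReader('text'),
    get_y=ColReader('sentiment'),
    splitter=ColSplitter()                           # splits on the 'is_valid' column
).dataloaders(norec, bs=64)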