Basic Sentiment Analysis with Julia using LSTM

Emory Raphael Viana Freitas
Sep 15, 2020 · 5 min read

The idea of this post is to give an introduction to sentiment analysis using Julia, a language designed for high performance with a syntax similar to Python's.

Sentiment analysis has grown within the artificial intelligence landscape over the last years, changing how we collect information about users' perception of a product, treat patients, discover diseases, and more. Many datasets have been used by researchers to measure performance, and for this post we are using the IMDB dataset, which contains movie reviews written by users.

There are packages in Julia that provide a pre-processed dataset. For this article, we will use the dataset provided by CorpusLoaders.
To load the dataset we just need a simple command:

using CorpusLoaders
dataset_train_pos = load(IMDB("train_pos"))

Variations of this command take different parameters: “train_pos”, “train_neg”, “test_pos”, “test_neg”. Each one gives you the corresponding part of the dataset, already labeled.

dataset_test_pos = load(IMDB("test_pos"))
dataset_train_neg = load(IMDB("train_neg"))
dataset_test_neg = load(IMDB("test_neg"))

Let’s collect part of this dataset into an array of tokenized documents:

julia> using Base.Iterators

julia> docs = collect(take(dataset_train_pos, 2))

This transforms our dataset into an array of arrays (Array{Array{String,1}}); in short, we get a list of tokenized sentences. But among the tokens we can still find stopwords, so in the next step let’s remove them.

Stopwords

Stopwords are words that don’t add value to a sentence, such as: is, like, as, and many others.

Julia has a package that already comes with a list of stopwords; this package is called Languages.

using Languages
list_stopwords = stopwords(Languages.English())

490-element Array{String,1}:
 "a"
 "about"
 "above"
 "across"
 "after"
 "again"
 "against"
 "all"
 "almost"
 "alone"
 "along"
 "already"
 "also"
 ⋮
 "young"
 "younger"
 "youngest"
 "your"
 "you're"
 "yours"
 "yourself"
 "yourselves"
 "you've"
 "z"
 ""

This way we can create a function to remove all the stopwords from our arrays:

function removeStopWords(tokens)
    filtered_sentence = []
    for token in tokens
        # keep only tokens that are not in the stopword list
        if !(lowercase(token) in list_stopwords)
            push!(filtered_sentence, lowercase(token))
        end
    end
    return filtered_sentence
end

Removing punctuation

Since our sentences are tokenized, we will assume that every punctuation mark occupies its own cell in the array. With that assumption, we can use the following function to strip accents and drop every non-alphanumeric character from each token:

using Unicode

function convert_clean_arr(arr)
    arr = string.(arr)
    # strip accents and other diacritical marks
    arr = Unicode.normalize.(arr, stripmark=true)
    # remove every character that is not a letter, digit, or underscore
    arr = map(x -> replace(x, r"[^a-zA-Z0-9_]" => ""), arr)
    return arr
end
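The next steps refer to docs_train_pos and docs_train_neg without showing how they were produced. A minimal sketch of the cleaning pipeline, assuming those names (and the analogous test splits) are simply the loaded reviews passed through the two functions above:

# Sketch only: the post collects just 2 documents for demonstration;
# here every review of each split is cleaned the same way.
clean(dataset) = [convert_clean_arr(removeStopWords(tokens)) for tokens in collect(dataset)]

docs_train_pos = clean(dataset_train_pos)
docs_train_neg = clean(dataset_train_neg)
docs_test_pos  = clean(dataset_test_pos)   # assumed name, used further below
docs_test_neg  = clean(dataset_test_neg)   # assumed name, used further below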

Now it’s time to create our vocab. The vocab will be used to transform the strings into numbers; by doing so, the weights can be fit on numeric inputs that activate the neurons in our model.

First, let’s put all the positives and negatives together:

train_set = [docs_train_pos; docs_train_neg]
all_letters = collect(Iterators.flatten(train_set));

Now we can count word frequencies over all_letters and map each word to an index in the vocabulary.
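The loop below iterates over counter_letter, which is never defined in the post. Assuming it is a word-frequency map over all_letters, one way to build it is with countmap from StatsBase:

using StatsBase   # for countmap

# Assumed definition: word => number of occurrences across the training set
counter_letter = countmap(lowercase.(all_letters))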

vocab = Dict()
index = 1
for (item, v) in counter_letter
    vocab[lowercase(item)] = index
    index = index + 1
end

So let’s transform the words into indices:

reviews_index_vocab = []
for review in train_set
    # words that are not in the vocab map to 0
    r = [get(vocab, lowercase(w), 0) for w in review]
    push!(reviews_index_vocab, r)
end
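The test reviews need the same treatment so we can measure accuracy later. A sketch, reusing the assumed docs_test_pos / docs_test_neg from the cleaning step:

test_set_reviews = [docs_test_pos; docs_test_neg]

test_reviews_index_vocab = []
for review in test_set_reviews
    push!(test_reviews_index_vocab, [get(vocab, lowercase(w), 0) for w in review])
end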

Pad sequences

In order to fit our data to our model, let’s pad the sentences; this way every array ends up with the same size.

function pad_features(reviews_int, length_max)
    features = []
    for review_int in reviews_int
        dim_review = size(review_int)[1]
        pad_size = length_max - dim_review
        if pad_size > 0
            # left-pad short reviews with zeros
            pad_array = zeros(Int64, pad_size)
            result = append!(pad_array, review_int)
        else
            # truncate long reviews to length_max
            result = review_int[1:length_max]
        end
        push!(features, result)
    end
    return features
end

This function returns arrays of exactly length_max elements; if a sentence is longer than length_max, it gets truncated.
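The training loop further down uses new_features and new_test_features, which are not shown being built in the post. A minimal sketch, assuming a maximum length of 300 so the padded vectors match the LSTM(300, 128) input layer defined below:

length_max = 300   # assumed, to match the LSTM(300, 128) input layer

# Flux layers expect Float32 inputs, so convert the padded integer vectors
new_features      = [Float32.(r) for r in pad_features(reviews_index_vocab, length_max)]
new_test_features = [Float32.(r) for r in pad_features(test_reviews_index_vocab, length_max)]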

Create LSTM model

Flux provides us the Chain structure, which simplifies how we stack multiple layers in our deep learning model. Let's build two LSTM layers followed by a softmax output layer.

using Flux

model = Chain(
    LSTM(300, 128),
    LSTM(128, 10),
    Dense(10, 2),
    softmax)

Let’s create a loss function. Fortunately, Flux provides several for us, so for this example we are using Flux.mse, with ADAM as our optimizer.

Loss(x, y) = Flux.mse(model(x), y)

opt = ADAM(0.001)

Now we need to measure whether our model is getting better, so let’s create a function that computes the prediction: findmax returns the index of the largest output, and subtracting 1 maps it to the labels 0 and 1.

prediction(i) = findmax(model(new_test_features[i]))[2]-1

Before we train the model, it’s necessary to put the dataset into the format accepted by Flux.train!: a collection of (data, target) tuples.
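The label arrays are never defined in the post either. One construction that keeps the Flux.mse loss and the integer comparison in the accuracy check consistent (an assumption, with 1 = positive and 0 = negative) is one-hot training targets and plain integer test labels:

# Assumed labels: 1 = positive, 0 = negative
train_label_int = [ones(Int, length(docs_train_pos)); zeros(Int, length(docs_train_neg))]
test_label      = [ones(Int, length(docs_test_pos));  zeros(Int, length(docs_test_neg))]

# One-hot targets matching the 2-element softmax output used with Flux.mse
train_label = [Float32.(Flux.onehot(l, 0:1)) for l in train_label_int]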

train_set_full = [ (new_features[i], train_label[i])  for i = 1:size(new_features)[1]];
test_set_full = [ (new_test_features[i], test_label[i]) for i = 1:size(new_test_features)[1]];

Now we need to train our model over that data. For this we will use the Flux.train! function, a powerful tool from Flux that iterates over your dataset and updates the parameters according to the loss function. Here is an example of how to use it and save the model for later use.

@info("Beginning training loop...")
best_acc = 0.0
last_improvement = 0
for epoch_idx in 1:200

global best_acc, last_improvement
Flux.train!(Loss, params(model), train_set_full, opt)

# Calculate accuracy:
acc = sum(prediction(i) == test_label[i] for i in 1:length(test_label))/length(test_label)
@info(@sprintf("[%d]: Test accuracy: %.4f", epoch_idx, acc))

# If our accuracy is good enough, quit out.
if acc >= 0.999
@info(" -> Early-exiting: We reached our target accuracy of 99.9%")
break
end

# If this is the best accuracy we've seen so far, save the model out
if acc >= best_acc
@info(" -> New best accuracy! Saving model out to mnist_conv.bson")
BSON.@save "mnist_conv.bson" model epoch_idx acc
best_acc = acc
last_improvement = epoch_idx
end

# If we haven't seen improvement in 5 epochs, drop our learning rate:
if epoch_idx - last_improvement >= 5 && opt.eta > 1e-6
opt.eta /= 10.0
@warn(" -> Haven't improved in a while, dropping learning rate to $(opt.eta)!")

# After dropping learning rate, give it a few epochs to improve
last_improvement = epoch_idx
end

if epoch_idx - last_improvement >= 10
@warn(" -> We're calling this converged.")
break
end
end

After running the cell above, we can see the messages:

┌ Info: [30]: Test accuracy: 0.5057
└ @ Main In[91]:13
┌ Info: [31]: Test accuracy: 0.5057
└ @ Main In[91]:13
┌ Info: [32]: Test accuracy: 0.5057
└ @ Main In[91]:13
┌ Info: [33]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Info: [34]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Info: [35]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Info: [36]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Warning: -> We're calling this converged.
└ @ Main In[91]:39

Conclusion

This post focused on creating a basic sentiment model and showing how to do it step by step. The best result I got so far was only 0.51 accuracy, but there is plenty of room for improvement: better pre-processing, changing the model, using a bi-directional LSTM, or even using more of the sentiment dataset loaded at the beginning.

I hope you have enjoyed it; let me know what you think and share your ideas.
