The Numerical Blindspot of Sentiment Models and How to Fix It

By Marvin Rajwadi, Data Scientist, Intelligent Voice Ltd

Sentiment analysis, also known as opinion mining, is a method of determining the emotional tone of a text. Organizations and companies widely use it to identify and categorize customer feedback about a service or a product. This gives the unstructured data a structure and yields insight that can be used to improve, build on, or innovate on the existing product or service.

Sentiment analysis technology mines data from sources such as social media comments, blogs, and product evaluations using natural language processing (NLP), machine learning, and computational linguistics. This information is usually categorised as positive, neutral, or negative.

Sentiment analysis in conversational text

Most existing sentiment analysis approaches focus on identifying the sentiment of movie reviews or similar types of text (product reviews, Twitter posts, etc.). The review data used in these investigations is written as single narratives, with no interaction between the authors or speakers.

With advances in technology and the availability of large-scale user data from social networking services such as WhatsApp, Twitter and WeChat, conversational messaging has become a popular means of communication. As a result, a significant number of interactive texts have been created, each of which contains a wealth of subjective data.

Many organizations, such as banks, insurance companies and the education sector, rely on phone calls and audio-visual customer interaction to provide, monitor, and evaluate their services. Applying sentiment analysis to customer interaction calls can help determine which specific agents or teams correlate with high or low customer satisfaction. It can help agents flag calls that require attention based on customer sentiment, and it can help show how the length of a call relates to satisfied and unsatisfied customers.

Using sentiment analysis in more sensitive settings such as healthcare or emergency services presents extra challenges, as the model’s prediction can contribute to a life-or-death situation when used to help look for signs of vulnerability or deception.

Challenges of sentiment analysis in conversational text

The top English sentiment analysis datasets include:

1) Amazon Product Reviews

2) Stanford Sentiment Treebank

3) Multi-Domain Sentiment Dataset

4) IMDB Movies Reviews Dataset

5) Sentiment140

6) Twitter US Airline Sentiment

7) Paper Reviews Dataset

However, deep learning models trained on these datasets can misclassify sentences that contain numbers.

As the above datasets are reviews, the numbers in the data largely skew the sentiment of the sentence. For example, in “The movie was 10 out of 10” the overall sentiment of the sentence is conveyed by numbers. Using models trained on such datasets in more conversational environments can lead to misclassification when numbers appear without that context, such as when a customer gives their contact details or states their age. Most of the top available sentiment datasets are also based on movie or product reviews and are limited to English. Classifying sentiment in other languages is difficult due to the scarcity of resources. To combat this issue, we are going to use a technique called “zero-shot” learning.

Zero-shot Learning

Zero-shot learning is the ability to predict on a kind of data that the model has never seen in training. For instance, using zero-shot learning we can train a model on English sentiment data only and use the same model to predict sentiment in other languages without any explicit training on those languages.

For this experiment, we will be using a pre-trained model as a base for speed, in this case Multilingual BERT (mBERT). mBERT is a multilingual model released by Google and trained on 104 different languages, using Wikipedia articles in each of those languages.

Conversational Sentiment Dataset

ScenarioSA is a high-quality English conversational dataset with labelled sentiment. There are 2,214 multi-turn discussions in the dataset, totalling approximately 24K utterances. Each dialogue has two speakers, identified only as A and B. Each phrase is carefully labelled with the polarity of its associated sentiment: positive, negative, or neutral.

When the dialogue is over, each participant’s final emotional state is likewise labelled. Furthermore, ScenarioSA’s conversations cover more grounded and natural scenarios, such as shopping, student life, employment, and so on.

To pre-process the data, we remove any trailing whitespace, remove all kinds of brackets along with their contents and, most importantly, replace all numbers with ‘num’.
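The pre-processing steps above could be sketched roughly as follows. This is a minimal illustration and not the code used in the experiment: the function name `preprocess`, the exact bracket types handled, the definition of a "number" (digits with optional decimal or thousands separators) and the collapsing of repeated spaces are all assumptions made for the example.

```python
import re

# Assumption: "all kinds of brackets" means (), [], {} and <>, non-nested.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>")
# Assumption: a "number" is a run of digits, optionally with . or , separators
# between digit groups (10, 3.5, 1,000).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def preprocess(utterance: str) -> str:
    """Remove bracketed content, replace numbers with 'num', tidy whitespace."""
    text = BRACKETED.sub(" ", utterance)   # drop brackets and their contents
    text = NUMBER.sub("num", text)         # replace every number with 'num'
    text = re.sub(r"\s+", " ", text)       # collapse gaps left by removals (extra step)
    return text.strip()                    # remove leading/trailing whitespace


if __name__ == "__main__":
    print(preprocess("I am 42 years old (laughs)  "))
```

With this replacement, a sentence such as “The movie was 10 out of 10” and a customer reading out a phone number both present the same ‘num’ token to the model, so a specific digit carries no sentiment of its own.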


Our mBERT model was trained on this dataset. The data was randomly split into 80% for training and 20% for testing. We used the Adam optimizer with a learning rate of 5e-5 and cross-entropy loss as the loss function. The model was trained for 10 epochs and evaluated at the end of each epoch. The best evaluation result was achieved in the 8th epoch, with an F1 score of 87.71.
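As a rough sketch of the setup described above, the random 80/20 split can be written in a few lines of plain Python. The helper name `split_train_test`, the fixed seed and the `TRAINING_CONFIG` dictionary are hypothetical and exist only for illustration. The article does not say whether the split was made per utterance or per dialogue, so the sketch simply splits whatever list of examples it is given.

```python
import random

# Hyperparameters as stated in the text; the dictionary itself is illustrative.
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "loss": "cross_entropy",
    "epochs": 10,
    "test_fraction": 0.2,
}


def split_train_test(examples, test_fraction=0.2, seed=13):
    """Shuffle a copy of the examples and split them into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = split_train_test(range(100), TRAINING_CONFIG["test_fraction"])
    print(len(train), len(test))
```

Fixing the seed keeps the held-out 20% identical across runs, which makes per-epoch evaluation scores comparable.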


We tested the model on review datasets in various languages and compared the results with AWS sentiment for the positive and negative classes (the review data did not include a neutral class and the trained model did not include a mixed class). In the tables below, the class column contains the ground-truth sentiment labels, followed by the accuracy achieved by AWS and by our model. The total sentences column gives the number of data points in each class, and the data source column links to the review dataset used. mBERT is pre-trained on the 104 different language versions of Wikipedia, but the languages do not all have the same number of articles: some, like English, have many more than others. The wiki article column shows the number of Wikipedia articles that mBERT was pre-trained on for the corresponding language.

English Amazon review dataset

French Amazon review dataset

German Amazon review dataset

Spanish Amazon review dataset

Real-life conversation data labelled by linguistic expert
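The per-class accuracy and total sentence counts described above can be computed with a short helper like the one below. This is only a sketch: the function name `per_class_accuracy` and the label strings are assumptions, and it treats accuracy within a class as the fraction of sentences with that ground-truth label that were predicted correctly.

```python
from collections import Counter


def per_class_accuracy(gold, predicted):
    """Return {label: (accuracy, total)} keyed by ground-truth label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals = Counter(gold)  # number of sentences per ground-truth class
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: (correct[label] / totals[label], totals[label])
            for label in totals}


if __name__ == "__main__":
    # Made-up labels purely to show the call; these are not experimental data.
    gold = ["positive", "positive", "negative", "negative"]
    predicted = ["positive", "negative", "negative", "negative"]
    for label, (accuracy, total) in per_class_accuracy(gold, predicted).items():
        print(label, accuracy, total)
```

Running the same helper over the AWS predictions and over our model’s predictions for each language gives directly comparable rows.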

Interpreting the models’ classifications

We trained two different BERT-base models, one on the IMDB movie review dataset (OLD model) and the other on the ScenarioSA (Zhang, 2020) dataset (NEW model). We also compared them against Amazon’s AWS Comprehend English sentiment model (AWS model). To visualize and interpret these models’ predictions, we use a technique called Text Deconvolution by Occlusion (Rajwadi et al., 2019), which masks words to observe shifts in sentiment polarity and estimate the importance of each word in the sentence. We observed the shift in the positive and negative probabilities generated by the model, regardless of the model’s overall prediction, which showed us how positive or negative each word in the sentence is.
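The general idea of occlusion-based word importance can be sketched as follows. This is a simplified illustration and not the implementation from the cited paper: the function `occlusion_importance`, the `[MASK]` placeholder and the whitespace tokenisation are assumptions, and `toy_predict_proba` is a deliberately trivial, hypothetical stand-in for a real classifier so that the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Probabilities = Dict[str, float]


def occlusion_importance(
    sentence: str,
    predict_proba: Callable[[str], Probabilities],
    mask_token: str = "[MASK]",
) -> List[Tuple[str, Probabilities]]:
    """Mask one word at a time and record how each class probability shifts.

    A positive shift for a class means that masking the word lowered that
    class's probability, i.e. the word was pushing the model towards it.
    """
    words = sentence.split()
    baseline = predict_proba(sentence)
    results = []
    for i, word in enumerate(words):
        occluded = " ".join(words[:i] + [mask_token] + words[i + 1:])
        probs = predict_proba(occluded)
        shifts = {label: baseline[label] - probs[label] for label in baseline}
        results.append((word, shifts))
    return results


def toy_predict_proba(text: str) -> Probabilities:
    """Hypothetical keyword counter standing in for a trained sentiment model."""
    words = text.lower().split()
    pos = sum(w in {"great", "good"} for w in words)
    neg = sum(w in {"missing", "bad"} for w in words)
    total = pos + neg + 1
    return {"positive": pos / total, "negative": neg / total, "neutral": 1 / total}


if __name__ == "__main__":
    for word, shifts in occlusion_importance("the food was great", toy_predict_proba):
        print(word, shifts)
```

In practice `predict_proba` would wrap the trained model, and the per-word shifts in the positive and negative probabilities are what gets visualised.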

In the above figure, we can see the interpretation of each model’s classification. For the first sentence, the OLD model classifies it as Positive based on the words ‘rating’, ‘10’, and ‘review’, whereas the NEW model classifies it as Negative based on the context of the words ‘rating’ and ‘just’. We can also see this pattern in the AWS model, which likewise considers the number 10 to be negative, but not strongly enough to change its prediction of Neutral. We can observe the same behaviour in the second example sentence, where the OLD model considers the phrase ‘5 star’ so positive that it overwhelms the negativity of the word ‘missing’. The NEW model classifies the sentence correctly based on the words ‘then’ and ‘missing’. The AWS model considers the phrase ‘5 star’ positive and the word ‘missing’ negative, and so produces a Mixed classification. This shows how the numbers in a review dataset can have a negative impact on the models’ predictions.

We asked a forensic statement analyst to review these results, and he was very clear that our “NEW” model had predicted correctly, and that both AWS and our “OLD” model had been thrown by the presence of numbers.


To conclude, excluding numbers from training data and employing the latest corpora for conversational sentiment can make models more suitable for deployment in a conversational setting, and can prevent misclassifications due to numerical bias whilst maintaining competitively high accuracy. It should also be noted that the AWS sentiment models used in the test are individual models, each trained on its corresponding language with proprietary data. We demonstrate here that a single multilingual model trained on publicly available English conversational sentiment data can provide comparable accuracy. Using the AWS results as a benchmark, this supports the efficacy of zero-shot learning for sentence classification tasks such as sentiment. This is a simple experiment using a limited data set and a pre-trained base model. In production, we enhance these models with further labelled data to make them even more robust across a wider variety of languages and use cases.


1) Zhang, Y., 2020, ‘ScenarioSA: A Dyadic Conversational Database for Interactive Sentiment Analysis’, IEEE Access.

2) Blog post by Analytics Vidhya

3) Google’s mBERT repo

4) 10 Popular sentiment datasets by Sameer Balagunur

5) IMDB movie review dataset

6) Rajwadi, M., et al., 2019, ‘Explaining Sentiment Classification’, Interspeech 2019.