“How do the themes and sentiments in critic reviews compare between highly rated and poorly rated movies?”
Hello, I need to analyze the attached data using R to answer the research question
“How do the themes and sentiments in critic reviews compare between highly rated and
poorly rated movies?”
Below is the link to the data: https://www.transfernow.net/dl/20240421001BnhBN
We aim to use text mining and sentiment analysis techniques, combining feature extraction with
classifiers such as decision trees and SVMs, to categorize sentiments so that reviews are not only
distinguished as positive or negative but also assessed for intensity and subjectivity. The text
mining approach aims to dissect and compare the thematic and sentiment nuances in critic reviews
for highly rated versus poorly rated movies. This comparative analysis leverages sophisticated text
mining techniques to extract and analyze linguistic features, identifying distinctive patterns that
correlate with the movies’ ratings. Initially, reviews will undergo preprocessing to cleanse and
standardize the text, facilitating deeper linguistic analysis. Key phrases and terms will be extracted
using bag of words, n-gram, and TF-IDF (Term Frequency-Inverse Document Frequency) metrics to
highlight significant words and phrases that are uniquely prevalent in either high or low-rated movie
reviews.
Unveiling Sentiments: A Data-Driven Semantic Analysis of the Polarity in Movie Reviews
Background and Literature Review
With decades of development behind it, the movie industry is a huge,
worldwide industry that continually reinvents itself to keep up with the sentiment of the times. The
various genres of movies, from comedy to action to thriller, not only provide colorful
entertainment to audiences, but also connect them with the vision and message the director is trying
to portray. In the era of technology and social media, it has never been easier to watch a movie and
leave a subsequent review of it online. While movie reviews have traditionally come in both
numerical and descriptive formats, there has been a shift where new websites, apps, and forms of
social media have developed to encourage textual reviews. Websites like Rotten Tomatoes,
Letterboxd, and Criticker are growing in popularity among frequent movie watchers, and are often
used to determine if a movie is worth watching in the first place. In fact, a 2018 study on
topic consistency found that about 63% of U.S. adults rely heavily on online reviews before seeing
a movie.¹ To the everyday user, a collection of these reviews provides both a quantitative and
qualitative description of a movie’s success. The rating figures determine how high a movie is
scored among others in the same genre, and the content of the review illustrates a deeper insight
into the movie, highlighting characteristics from the plot to the actor choice to the wardrobe to the
screenplay and so on, and gives a conclusion as to whether the film met the expectations of the
reviewer or not. Similarly, to the filmmaker and cast, descriptive movie reviews speak to the
audience’s sentiment about the movie, both its highlights and drawbacks, and serve as feedback
that could be incorporated into future films.
Sentiment analysis is a machine learning approach that aims to extract opinions from a piece
of text by categorizing them as positive, negative, or neutral.² In the context of movie reviews,
sentiment analysis extracts subjective descriptors from written reviews to label them as ‘exciting’,
‘upsetting’, ‘thrilling’, and so on, and uses techniques similar to text mining and natural language
processing. In this research project, we apply sentiment analysis to a dataset of critic reviews
collected from Rotten Tomatoes to gauge the overall reaction to each movie, that is, whether
reviewers liked or disliked it. We further aim to use the specific word relationships and patterns
that are highlighted in a review to assess whether the review is positive or negative. This research
analysis can then be used by filmmakers to measure the overall sentiment and performance of their
creation, or to build a recommender system, as part of a piece of software, that recommends movies
to critics based on their preferences.
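To make the idea of polarity and intensity concrete, here is a minimal lexicon-based sketch. It is an illustration only, not the method used in this project: the analysis itself is meant to be done in R, the sketch is written in Python for brevity, and the word lists `POLARITY` and `INTENSIFIERS` and the function `score_review` are hypothetical.

```python
import re

# Hypothetical toy lexicon: word -> polarity in [-1, 1]. A real analysis
# would use an established sentiment lexicon instead of this list.
POLARITY = {"brilliant": 1.0, "good": 0.5, "dull": -0.5, "terrible": -1.0}

# Hypothetical intensity modifiers that scale the next sentiment word.
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}


def score_review(text):
    """Return (mean polarity, label) for one review text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total, hits, weight = 0.0, 0, 1.0
    for token in tokens:
        if token in INTENSIFIERS:
            # Remember the modifier and apply it to the following word.
            weight = INTENSIFIERS[token]
            continue
        if token in POLARITY:
            total += weight * POLARITY[token]
            hits += 1
        # Any non-intensifier word resets the modifier.
        weight = 1.0
    if hits == 0:
        return 0.0, "neutral"
    mean = total / hits
    if mean > 0:
        return mean, "positive"
    if mean < 0:
        return mean, "negative"
    return mean, "neutral"
```

The magnitude of the returned number acts as a rough intensity measure, while its sign gives the positive or negative label; reviews with no lexicon words fall back to neutral.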
¹ Kim, E., Ding, M., Wang, X. et al. (2022) Does Topic Consistency Matter? A Study of Critic and User Reviews in the Movie Industry. Journal of Marketing, 87 (3), 428-450, doi: 10.1177/00222429221127927
² Sharma, H., Pangaonkar, S., Gunjan, R. & Rokade, P. (2023) Sentimental Analysis of Movie Reviews Using Machine Learning. ITM Web Conf., 53, 02006. DOI: https://doi.org/10.1051/itmconf/20235302006
The literature on sentiment analysis is extensive. Pang and Lee (2008)³
published a survey describing the different techniques used in sentiment analysis and opinion
mining, along with their applications, demand, and classification. Similarly, Maas et al. (2011)⁴
conducted research on word vectors via classification models to understand and differentiate the
polarity among reviews. Their approach is useful for rich datasets that use subjective language
and phrasing. Turney (2002)⁵ also implemented a simple learning algorithm that classifies reviews
as either “thumbs up” or “thumbs down”, predicted from a collection of adjectives or adverbs that
denote semantic orientation. All the studies mentioned above employ feature extraction techniques
along with Random Forest and SVMs as classifiers; however, they exclude the concept of neutrality
entirely. In each paper, the researchers study only polarity among reviews, because the neutral
category is too similar to the two polar classes, which makes it difficult to label. The varying
language used in reviews, along with structure, length, and tone, can also create errors when
machines try to interpret them. As such, our research aims to test some of the common machine
learning and sentiment analysis methods employed in the literature above, to understand the
sentiments in descriptive reviews provided by online critics, and to assess how these factors are
associated with a movie’s success.
Research Questions & Modelling
In order to test various modeling approaches, our research study will assess the following
questions:
1. What is the distribution of sentiment across critic reviews, and how does sentiment
correlate with the numerical scores assigned by critics? Can we predict the score of a
review based on its sentiment?
We aim to use clustering to assess the distribution of sentiment and a linear regression to see if we
can predict the score.
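The regression in question 1 can be sketched as a one-predictor least-squares fit of the normalized review score on a sentiment value. This is only a hedged illustration: the project itself is to be carried out in R (where a function such as `lm()` would normally do this), the sketch uses Python for brevity, and the function names and the assumption of a single numeric sentiment value per review are ours.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    xs: sentiment values (assumed numeric, one per review)
    ys: normalized review scores in [0, 1]
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sentiment values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of score variance explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    if ss_total == 0:
        raise ValueError("all scores are identical")
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total
```

A slope clearly above zero together with a reasonable R² would suggest that sentiment carries information about the score; a value of R² near zero would suggest it does not.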
2. How do the themes and sentiments in critic reviews compare between highly rated and
poorly rated movies?
We aim to use text mining and sentiment analysis techniques, combining feature extraction with
classifiers such as decision trees and SVMs, to categorize sentiments so that reviews are not only
distinguished as positive or negative but also assessed for intensity and subjectivity. The text
mining approach aims to dissect and compare the thematic and sentiment nuances in critic reviews
for highly rated versus poorly rated movies. This comparative analysis leverages sophisticated
text mining techniques to extract and analyze linguistic features, identifying distinctive patterns
that correlate with the movies’ ratings. Initially, reviews will undergo preprocessing to cleanse
and standardize the text, facilitating deeper linguistic analysis. Key phrases and terms will be
extracted using bag of words, n-gram, and TF-IDF (Term Frequency-Inverse Document Frequency)
metrics to highlight significant words and phrases that are uniquely prevalent in either high or
low-rated movie reviews.
³ Pang, B. and Lee, L. (2008) Opinion Mining and Sentiment Analysis. Foundations and Trends® in Information Retrieval, 2, 1-135. https://doi.org/10.1561/1500000011
⁴ Maas, Andrew & Daly, Raymond & Pham, Peter & Huang, Dan & Ng, Andrew & Potts, Christopher. (2011). Learning Word Vectors for Sentiment Analysis. 142-150.
⁵ Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. arXiv preprint cs/0212032.
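The extraction step described for research question 2 (bag of words, n-grams, and TF-IDF) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the analysis is planned in R, the sketch uses Python for brevity, the tokenizer is deliberately crude, and the function names are hypothetical. TF is taken as the term count divided by the document length and IDF as ln(N / document frequency), which is one common variant among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude preprocessing)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents, n=1):
    """Return one {term: tf-idf} dict per document, using n-grams of size n."""
    # Bag of words (or bag of n-grams): term counts per document.
    term_counts = [Counter(ngrams(tokenize(doc), n)) for doc in documents]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    total_docs = len(documents)
    scores = []
    for counts in term_counts:
        length = sum(counts.values())
        scores.append({
            term: (count / length) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

One way to use this for the comparison is to concatenate all reviews of highly rated movies into one document and all reviews of poorly rated movies into another: terms that occur in both groups then receive a score of zero, and terms specific to one group stand out.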
Data
The data used in our research project come from the Rotten Tomatoes movies and critic reviews
dataset posted on Kaggle (Leono, 2020). It was collected from the Rotten Tomatoes website as of
October 31, 2020, and consists of two datasets. The first covers all the movies and contains
17,712 rows with the columns movie title, description, genres, duration, director, actors, users’
ratings, and critics’ ratings. The second contains data about reviews and includes 1,130,017 rows
with the critic name, review publication, date, score, and content. The two datasets are linked by
the column “rotten_tomatoes_link”, which holds a unique code for each movie available on Rotten
Tomatoes.
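The link between the two datasets can be sketched as a key lookup on “rotten_tomatoes_link”. The sketch is illustrative only: the project works in R (where base `merge()` would be a natural choice), Python is used here for brevity, rows are assumed to be plain dictionaries, and apart from `rotten_tomatoes_link` and `review_score` the field names in the usage note are hypothetical.

```python
def join_reviews_to_movies(movies, reviews, key="rotten_tomatoes_link"):
    """Inner join: attach the movie fields to every review with a matching key.

    movies, reviews: lists of dicts (one dict per row).
    Reviews whose key has no matching movie are dropped.
    """
    # Index the movies once so each review lookup is a dictionary access.
    movies_by_key = {movie[key]: movie for movie in movies}

    combined = []
    for review in reviews:
        movie = movies_by_key.get(review.get(key))
        if movie is None:
            continue  # orphan review: no movie with this link
        row = dict(movie)   # copy the movie fields
        row.update(review)  # then add the review fields
        combined.append(row)
    return combined
```

For example, a movie row `{"rotten_tomatoes_link": "m/x", "movie_title": "X"}` joined with a review row `{"rotten_tomatoes_link": "m/x", "review_score": "3.5/5"}` yields a single combined row that carries both the title and the score.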
Data Preparation
The raw datasets were processed in R to clean and prepare them for further analysis. In the
reviews dataset, three columns (critic_name, review_score, and review_content) contain
missing values. As it is not feasible to impute missing values in textual columns, we removed the
records with missing values from the dataset. R read the column “review_date” as character
rather than date, so we converted it to the correct format. The main part of data preparation was
dealing with the column “review_score”. This character column contained various kinds of data,
including letter-style data like “A-” or even “A -”, numeric-style data like “8” (assuming a 1-10
score scale) or “80” (assuming a 1-100 score scale), and, most complex of all, fraction-style data
like “3.5/5”. For convenience of analysis, we converted this column to a quantitative variable
ranging between 0 and 1. Abnormal scores like “4/0” were removed from the dataset. After applying
similar steps to the other dataset, which contains the information about movies, we merged it with
the reviews dataset to obtain a new combined dataset. This dataset will be used in the following
stages of our research project.
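The conversion of “review_score” to a 0-1 scale can be sketched as below. This is a hedged illustration rather than the code actually used (the cleaning was done in R; Python is used here for brevity). Two points are assumptions of ours and not stated in the text: letter grades are spaced evenly from F (0) to A+ (1), and a bare number is read on a 10-point scale if it is at most 10 and on a 100-point scale otherwise.

```python
import re

# Assumption: letter grades are spaced evenly between F = 0 and A+ = 1.
LETTER_GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+",
                 "B-", "B", "B+", "A-", "A", "A+"]
LETTER_TO_SCORE = {grade: index / (len(LETTER_GRADES) - 1)
                   for index, grade in enumerate(LETTER_GRADES)}

FRACTION = re.compile(r"(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def normalize_score(raw):
    """Map a raw review_score string to [0, 1], or None if it is unusable."""
    if raw is None:
        return None
    # Remove all whitespace so that "A -" and "A-" are treated alike.
    text = re.sub(r"\s+", "", str(raw)).upper()

    if text in LETTER_TO_SCORE:
        return LETTER_TO_SCORE[text]

    fraction = FRACTION.fullmatch(text)
    if fraction:
        numerator, denominator = float(fraction.group(1)), float(fraction.group(2))
        # Reject abnormal scores such as "4/0" or "6/5".
        if denominator == 0 or numerator > denominator:
            return None
        return numerator / denominator

    if NUMBER.fullmatch(text):
        value = float(text)
        # Assumption: values up to 10 use a 10-point scale, up to 100 a 100-point scale.
        if value <= 10:
            return value / 10
        if value <= 100:
            return value / 100

    return None
```

Rows for which the function returns None would then be dropped, which corresponds to the removal of abnormal scores described above.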
As the cleaned combined dataset is too large to upload, here is a link to the shared Google
folder that contains the file cleaned_combined_dataset.csv:
https://drive.google.com/drive/folders/1w0LlBRfR5WJNrU0XR5_AIHwnondH5D37?usp=sharing
References
Kim, E., Ding, M., Wang, X. et al. (2022) Does Topic Consistency Matter? A Study of Critic and User Reviews in the Movie Industry. Journal of Marketing, 87 (3), 428-450. doi: 10.1177/00222429221127927
Sharma, H., Pangaonkar, S., Gunjan, R. & Rokade, P. (2023) Sentimental Analysis of Movie Reviews Using Machine Learning. ITM Web Conf., 53, 02006. doi: 10.1051/itmconf/20235302006
Pang, B. & Lee, L. (2008) Opinion Mining and Sentiment Analysis. Foundations and Trends® in Information Retrieval, 2, 1-135. doi: 10.1561/1500000011
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. & Potts, C. (2011) Learning Word Vectors for Sentiment Analysis. 142-150.
Turney, P. D. (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. arXiv preprint cs/0212032.
Williams, S. D. (n.d.) How Filmmakers Connect With Audiences. Movie Outline. https://www.movieoutline.com/articles/how-filmmakers-connect-with-audiences.html
d’Astous, A. & Touil, N. (1999) Consumer evaluations of movies on the basis of critics’ judgments. Psychology and Marketing, 16 (8), 677-694. doi: 10.1002/(SICI)1520-6793(199912)16:8%3C677::AID-MAR4%3E3.0.CO;2-T
Legoux, R., Larocque, D., Laporte, S. et al. (2016) The effect of critical reviews on exhibitors’ decisions: Do reviews affect the survival of a movie on screen? International Journal of Research in Marketing, 33 (2), 357-374. doi: 10.1016/j.ijresmar.2015.07.003
Leono, S. (2020) Rotten Tomatoes movies and critic reviews dataset. Kaggle. https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset/data
Adam, N. L., Rosli, N. H. & Soh, S. C. (2021) Sentiment Analysis on Movie Review using Naïve Bayes. 2021 2nd International Conference on Artificial Intelligence and Data Sciences (AiDAS), Ipoh, Malaysia, pp. 1-6. doi: 10.1109/AiDAS53897.2021.9574419