Movie Review Sentiment Analysis
A movie review website allows users to submit reviews describing what they either liked or disliked about a particular movie. Being able to mine these reviews and generate valuable meta data that describes its content provides an opportunity to understand the general sentiment around that movie in a democratized way. That’s a pretty cool thing if you think about it. Using machine learning we can democratize subjectivity about anything in the world. We can make an objective analysis of subjective content, giving us the ability to better understand trends around products and services that we can use to make better decisions as consumers.
Sentiment Analysis Data Model
One of the major barriers to unlocking this ability is in the way we structure and transform our data. The current state-of-the-art methods include approaches such as Naive Bayes, Support Vector Machines, and Maximum Entropy. The challenges imposed by these approaches still remains in how features are extracted from a text and structured as data in a way that is least costly in terms of performance. I decided to focus on solving the problem of performance, in the way features are selected and extracted, and the availability of that data as the number of features grow over time.
Using a feature selection algorithm I describe here, I used the Graph Database Neo4j to solve the challenge of data transformation and availability. While the state of the art natural language parsing algorithms are focused on sentence structure, I’ve decided to pursue a statistical approach to natural language grammar induction. My approach focuses on generalizations across a vast corpus of text, generating new features using deep learning to predict features with the highest probability of being present to the left or right of a new feature.
Graph-based NLP Example
Let’s assume that the phrase “one of the worst” has been extracted as a feature of a set of texts. The reason that this phrase was extracted was that a phrase that it was descended from had determined that this particular phrase was the most statistically relevant, meaning that the phrase had the best chance of being matched after the parent phrase. Using Neo4j we can determine the line of inheritance that produced this phrase as a feature.