Do you know what an “Arianator” is? Even if you have never seen the term before, you can intuit its meaning when it occurs alongside another word used in a similar way. Here is “Arianator” used in the same context as “Belieber”:
Arianators are the best fandom!!! We love ari!
Beliebers are the best fandom!
Anyone who has studied for the SAT knows that when you come across a word you don’t know in a passage, you should treat the words around it as clues to its meaning. This is especially useful given the ever-changing nature of human language: meanings constantly shift and new words surface regularly, expanding the range of concepts we can name. Online resources do their best to keep up with this rapidly changing system, but even urbandictionary.com can’t reliably keep tabs on the constantly shifting relationships among the words we use.
Not long ago, the Natural Language Processing (NLP) community witnessed the development of “Word2Vec,” a very promising technique for automatically identifying relationships between words and their meanings using context. By training the model on many thousands of news documents or Wikipedia articles, researchers demonstrated that they could develop very accurate word representations that reflect real word usage in those domains.
At NetBase, we are primarily interested in understanding language use in social media. Training our system to process language based on a corpus of news documents can produce interesting results, but let’s face it: a Millennial on Twitter doesn’t use the same language as a professional journalist writing about the heat wave in Islamabad.
So instead of training our model on typical edited-and-proofread articles, we decided to train it using Google’s Word2Vec on about 50 million Tweets from our enormous store of social media posts. As part of our quest to find cutting-edge solutions for the most challenging problems in social-media analytics, we wanted to explore the behavior of this kind of model when used with our in-house, state-of-the-art NLP.
What is Word2Vec?
Google’s Word2Vec is a deep-learning-inspired method that attempts to capture meaning and semantic relationships among words. It works similarly to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. Since its introduction in a 2013 paper from Google, Word2Vec has taken the blogosphere by storm as a system that yields shockingly accurate and useful results without any human or hand-coded annotation: it relies entirely on unsupervised machine learning.
Word2Vec uses a set of unsupervised machine learning algorithms that train a neural network to learn the behavior of words in a corpus. More specifically, given a word, the neural network tries to predict the context of that word (the paper discusses a number of ways to do this; in this blog post, we use the Skip-gram model). Each word is then represented as a high-dimensional vector, and the user can perform basic linear algebra on these vectors to capture impressively accurate word relationships. Generally, bloggers point out the interesting properties of Word2Vec models when trained on large news corpora like Google News. The following canonical example has made the rounds on the Internet:
King – Man + Woman = Queen
The model appears to learn the semantic properties of words, and therefore knows which feature differentiates “King” from “Queen” (namely, gender).
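To make the context-prediction idea concrete, here is a minimal Python sketch of how Skip-gram training pairs could be generated from a tokenized post. The function name `skipgram_pairs` and the fixed `window` size are our own illustrative choices, not part of the Word2Vec code. Real implementations also subsample frequent words and vary the window size, which we leave out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbors.

    Simplified illustration: every word within `window` positions of the
    center word counts as context. This is an assumption-laden sketch,
    not the reference Word2Vec implementation.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Example post taken from the introduction, lowercased and split on spaces.
tweet = "arianators are the best fandom".split()
for center, context in skipgram_pairs(tweet, window=1):
    print(center, "->", context)
```

The network is then trained to predict the context word from the center word across many such pairs, so words that appear in similar contexts end up with similar vectors.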
How Word2Vec Works on Commonplace Language
Using the Gensim library for Python, we first trained a phrase-detection model to identify two- and three-word collocations that we want to treat as single word units. Using the phrase model to tokenize our 50-million-Tweet corpus, we then trained a model with the Gensim Word2Vec implementation. After training, we set a minimum frequency threshold of 50 to remove unwanted noise (such as gibberish, uninformative words, and strange misspellings) from the vector space.
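The preprocessing steps above can be sketched in plain Python. This is a simplified stand-in, not Gensim's implementation: the helper names (`find_phrases`, `merge_phrases`, `frequent_vocab`), the toy corpus, and the parameter values are all hypothetical. The score is in the style of the count-based collocation score from the Word2Vec phrase paper, (count(ab) - discount) / (count(a) * count(b)). The sketch handles two-word phrases only; running it a second time over the merged output is one way to pick up three-word units.

```python
from collections import Counter


def find_phrases(sentences, discount=2, threshold=0.1):
    """Return the set of adjacent word pairs that score above `threshold`."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        # The discount keeps very rare pairs from scoring highly.
        score = (n_ab - discount) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def merge_phrases(sentence, phrases):
    """Join detected pairs into single tokens, e.g. 'taco_bell'."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


def frequent_vocab(sentences, min_freq):
    """Keep only tokens that occur at least `min_freq` times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_freq}


# Made-up toy corpus; the real corpus is about 50 million Tweets.
TOY_TWEETS = [
    ["i", "love", "taco", "bell"],
    ["taco", "bell", "is", "open"],
    ["taco", "bell", "run", "tonight"],
    ["i", "love", "pizza"],
]

phrases = find_phrases(TOY_TWEETS, discount=2, threshold=0.1)
merged = [merge_phrases(sent, phrases) for sent in TOY_TWEETS]
vocab = frequent_vocab(merged, min_freq=2)  # the post uses a threshold of 50
print(phrases)
print(merged[0])
print(sorted(vocab))
```

On a corpus this small the thresholds are arbitrary. With millions of Tweets, the same two ideas (score adjacent pairs, then drop rare tokens) are what keep multi-word names together and noise out of the vector space.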
This allowed us to see some very interesting results that reflect insights about commonplace, conversational language. Our model, after reading millions of Tweets, could identify the parallel usage of two slang terms of endearment and recognize on which dimension they differ. Compare the canonical example above with the following result of our model:
dawg – man + woman = shawty
When controlling for gender, the words “dawg” and “shawty” have basically the same behavior on Twitter.
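Both analogies, the canonical one and the Twitter one, come down to the same arithmetic: add and subtract word vectors, then look for the nearest remaining word by cosine similarity. Below is a minimal sketch of that step. The two-dimensional vectors are hand-made for illustration (real vectors have hundreds of dimensions and are learned from data), and the `analogy` helper is our own naming, not a Gensim function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative).

    The query words themselves are excluded from the results.
    """
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    ranked = sorted(
        ((cosine(target, vec), word) for word, vec in vectors.items()
         if word not in excluded),
        reverse=True,
    )
    return [word for _, word in ranked[:topn]]


# Hand-made toy vectors: first axis is roughly "royalty",
# second axis is roughly "gender". These are NOT learned values.
TOY_VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "prince": [0.7, 0.7],
}

# king - man + woman
print(analogy(TOY_VECTORS, positive=["king", "woman"], negative=["man"]))
```

With a trained model, the same query shape (a list of words to add and a list to subtract) is what produces results like the "dawg" and "shawty" example.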
Finding Conversational Insights
After you have a model that knows about word usage and behavior on Twitter, you can use it to mine useful conversational insights. For instance, if we want to find out which restaurant is talked about in the most similar way to Taco Bell, we can submit this to the model:
taco bell ≈ chipotle
This is accurate and straightforward, as both Taco Bell and Chipotle are affordable, quick-service restaurants that serve Mexican-inspired foods. So let’s get a little more complicated. For example, we want to know what restaurant is discussed in the same way as Taco Bell on Twitter, but we don’t want to limit the results to Mexican-style food:
taco bell – taco ≈ pizza hut
Subtracting the vector for “taco” in essence removes the Mexican food component from the “taco bell” vector in our model. The resulting vector, “taco bell – taco,” is most similar to “Pizza Hut.” Note that in this case “taco” is simply a word that is representative of the Mexican-food aspect of the restaurant. We get the same result if we subtract the vector for “churros” instead:
taco bell – churros ≈ pizza hut
These results are surprisingly accurate; both Pizza Hut and Taco Bell are Yum Brands restaurants with a similar target market. After controlling for the type of food (subtracting “taco” or “churros” from “taco bell”), Taco Bell and Pizza Hut are the most similar.
The demonstrated functionality of finding similar entities based on how people talk about them on Twitter can be extended to a variety of domains. Let’s take musical artists as a second use case. How can you discover new artists similar to one that you know and like? One option is to observe how people talk about the artist you like and find others who are discussed in a similar way. If two artists’ names are found to occur in the same environment in millions of posts, we can consider them similar. Our Word2Vec model does this for us:
dierks bentley ≈ (darius rucker, rascal flatts, keith urban, kip moore, jason aldean)
And for the hip-hop heads out there:
childish gambino ≈ (frank ocean, kendrick lamar, a$ap rocky, denzel curry, mac miller)
The most impressive aspect of all of these examples is that we gave our model absolutely zero additional information—no annotated sentences, no world-knowledge data—just millions of Tweets. Using only this input data, our model was able to learn the semantic properties of words, places, and people in an almost human-like way.
Discovering New Words
In addition to these impressive results, the model can also be used to discover new words. Take as an example the term “beliebers,” the name of the legion of extreme Justin Bieber fans on social media. If we query our model for terms similar to “beliebers,” we find the names of a number of other pop-music fan-groups: “selenators” (Selena Gomez), “directioners” (One Direction), and “arianators” (Ariana Grande). If we want to explore this further, we can pick any of these three and find similar terms in the model.
arianators ≈ (lovatics, 5sosfam, smilers, swifties, vampettes, pentaholics, mahomies)
Identifying these fan groups would take time and research without the help of the Word2Vec model to read and analyze millions of Tweets, extracting actionable insights in the process.
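The "query, pick a result, query again" loop described above is a breadth-first expansion from a seed term. Here is a minimal sketch of that loop. The `expand_terms` helper is hypothetical, and the lookup table is typed in by hand from the results quoted in this post so that the example runs without a trained model. In practice, the `neighbors` function would wrap the model's nearest-neighbor query.

```python
from collections import deque


def expand_terms(seed, neighbors, max_depth=2):
    """Collect terms reachable from `seed` within `max_depth` similarity hops.

    `neighbors` is any function mapping a term to a list of similar terms.
    Results are returned in the order they were discovered.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    discovered = []
    while queue:
        term, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested number of hops
        for candidate in neighbors(term):
            if candidate not in seen:
                seen.add(candidate)
                discovered.append(candidate)
                queue.append((candidate, depth + 1))
    return discovered


# Stand-in for model queries, copied by hand from the results in this post.
QUOTED_RESULTS = {
    "beliebers": ["selenators", "directioners", "arianators"],
    "arianators": ["lovatics", "5sosfam", "smilers", "swifties",
                   "vampettes", "pentaholics", "mahomies"],
}


def lookup(term):
    return QUOTED_RESULTS.get(term, [])


fandoms = expand_terms("beliebers", lookup, max_depth=2)
print(fandoms)
```

Limiting the number of hops matters on a real model, because each extra hop drifts further from the seed term's meaning.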
To extend this functionality even further, let’s look at the modern-day hieroglyphs of the Internet—emojis. Since our model isn’t burdened by part-of-speech classes or hand-coded lexical properties, it has no problem learning the semantics of pictographic words and identifying relationships between them:
Our model easily learns the semantics of emojis, something that even users have difficulty with at times, by keeping track of the context in which they are used. In learning word representations, as in real estate, it’s all about location, location, location!
Training Word2Vec on millions of Tweets proved to be a useful experiment, allowing us to discover relationships between a wide range of words and concepts used in social media. We saw how to find restaurants based on similarity and feature specification, and that conversation about musical artists reflects characteristics of the artists themselves (such as musical genre). Furthermore, the model’s ability to rapidly digest such large amounts of data means that it can learn the meanings and relationships of novel words, such as the names of social media fandom groups and even diverse emojis.
This insight has proved invaluable when analyzing the sentiment of today’s online consumers. Reach out, and we’ll be happy to share specific use cases and even run a scenario specific to you!