The Most Popular Emoji Characters on Twitter

On Twitter, about 10% of general-topic tweets contain emoji characters, the tiny icons and emoticons that are getting more attention in analyses of tweets, Facebook messages, and text messages. An emoji can capture an emotion or completely change the meaning of the written text. Before exploring how different emojis are used and what they mean to people, I wanted to get an idea of how prevalent they are and which ones are the most popular on Twitter.

Emotion:

Changes Meaning:

How I Did This

I collected tweets using a sampled stream from Twitter. To get a broadly representative sample of tweets, I tracked five common, basic words: ‘the’, ‘and’, ‘to’, ‘you’, and ‘it’. These make good search words, since few sentences or thoughts avoid all of them. A Python script found and counted all the emojis present in a collection of over 100,000 tweets. To avoid skew from a popular celebrity or viral tweet, I removed obvious retweets but kept retweets that function more like mentions.
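A minimal sketch of the counting step, not the original script. The character ranges in `EMOJI_RE` are my own rough approximation of "emoji" (the main pictograph blocks plus the older symbol and dingbat blocks), and the sample tweets are made up. Each emoji is counted once per tweet, so the counts answer "how many tweets contain this emoji".

```python
import re
from collections import Counter

# Assumption: a rough emoji matcher, not a complete definition of "emoji".
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emoji_stats(texts):
    """Return (share of tweets with any emoji, tweets-per-emoji Counter)."""
    tweets_with_emoji = 0
    per_emoji = Counter()
    for text in texts:
        found = set(EMOJI_RE.findall(text))  # count each emoji once per tweet
        if found:
            tweets_with_emoji += 1
            per_emoji.update(found)
    share = tweets_with_emoji / len(texts) if texts else 0.0
    return share, per_emoji

# Made-up sample tweets for illustration only.
sample = [
    "that game was unreal \U0001F602\U0001F602",
    "I miss you \U0001F62D \u2764",
    "the bus is late again",
    "you and me \U0001F602",
]
share, per_emoji = emoji_stats(sample)
print(f"{share:.0%} of sample tweets contain an emoji")
for emoji, n in per_emoji.most_common(3):
    # Print code points rather than glyphs to stay terminal-safe.
    print(f"U+{ord(emoji):04X}", n)
```

Dividing each count by the number of emoji-bearing tweets gives the kind of percentage shown in an emoji-selection chart.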

Results

Emoji Use on Twitter

In the general collection of tweets, I found that 10.23% of tweets contained at least one emoji. That is not an overwhelming share, but 10% of Twitter content is still a significant portion. The ‘Emoji Selection’ graph shows the percentage of tweets containing a particular emoji out of the tweets that had at least one emoji. The most popular emoji by far was the ‘tears of joy’ emoji, followed by the ‘loudly crying’ emoji. Heart-related emojis [the ones I thought would prove most popular] were third and fourth.

Emoji Selection on Twitter

Since I only collected these over the course of a day and not over several weeks or months, I would hesitate to say these results will hold up over time. An event or seasonality can trigger a cascade of people using a certain emoji. For example, the Christmas tree emoji was popular, appearing in 2.16% of tweets that included emojis; I would expect that share to grow as we get closer to Christmas and shrink afterward. Another interesting find is that the emoji ranks high. My pure conjecture is that this emoji’s high use rate is due to protests in Ferguson and around the country. To confirm this I would need a sample of tweets from before the grand jury announcement, or I would need to track the use as time passes.

Further analysis could utilize emoji groups or clusters. A sentiment expressed through several similar emojis will not rank highly if people spread their choices over five emojis instead of one. I plan to update and expand on this as time passes and I’m able to collect more data.

Technical

To avoid any conflicts with the ASCII conversions that some Python or R packages perform on Twitter data, I stored tweets from the Twitter Streaming API directly in a MongoDB database, which encodes strings in UTF-8. Since tweets come from the API as JSON objects, they can be stored naturally in the document-oriented database, with each metadata field in the tweet accessible without parsing the entire tweet into a data frame or SQL database. Retweets were removed by finding any tweets with ‘RT’ in the first two characters of the text entry. This is how Twitter represents automatic retweets in JSON format.
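The retweet filter can be sketched as follows. This uses plain JSON strings rather than a MongoDB collection, and the three sample tweets are hypothetical; only the `text` field and the ‘RT’ prefix rule come from the description above.

```python
import json

def is_retweet(tweet):
    """True when the text entry starts with 'RT' (its first two characters)."""
    return tweet.get("text", "").startswith("RT")

# Hypothetical raw JSON lines, shaped like simplified stream output.
raw_lines = [
    '{"id": 1, "text": "RT @someone: the best day"}',
    '{"id": 2, "text": "you and it \\ud83d\\ude02"}',
    '{"id": 3, "text": "to the RT crowd: hello"}',
]
tweets = [json.loads(line) for line in raw_lines]
originals = [t for t in tweets if not is_retweet(t)]
print([t["id"] for t in originals])
```

Tweet 3 is kept because ‘RT’ appears mid-text, which is the "functions more like a mention" case. The JSON parser also decodes the escaped surrogate pair in tweet 2 into a single emoji character, so no ASCII conversion is involved.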

Also, since I collected 103,416 tweets, the margin of error for any of the proportions given is well below 1%. Events within the social network would definitely outweigh any margin of error.
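As a rough check on that claim, here is the standard normal-approximation margin of error for a proportion. The 95% confidence level (z = 1.96) and the treatment of the collection as a simple random sample are my assumptions. Proportions computed only among emoji-bearing tweets have a smaller denominator and therefore a wider margin.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

n_tweets = 103_416
moe_any_emoji = margin_of_error(0.1023, n_tweets)  # share of tweets with an emoji
moe_worst_case = margin_of_error(0.5, n_tweets)    # p = 0.5 maximizes the margin
print(f"{moe_any_emoji:.3%} {moe_worst_case:.3%}")
```

Both values come out at a fraction of one percentage point.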

2014 ALWCG Twitter Graphs

The Royals and A’s had quite the entertaining 12-inning game Tuesday night. These are a few graphs I made from Twitter data. Yellow is Oakland; blue is Kansas City. The proportions of tweets between teams might be off, but I would venture to guess the Royals had much more social media activity than the A’s. The map shows geotagged tweets from 5PM to 1AM EDT yesterday. The middle of the country was solid blue, California was pretty yellow, and the East Coast was rather mixed.

AL Wild Card Game Twitter Map

The volume of tweets per minute is a pretty cool view of what happened during the game. It looks like the Royals really outpaced the A’s for volume, but I’d have to use some controls to determine that for sure. These are just for fun.

AL Wild Card Game Twitter Time Series

I used Twitter’s streaming API to collect tweets with keywords like “Royals”, “A’s”, “TakeTheCrown”, “GreenCollar”, etc. I could have missed a crucial element of discussion, and none of this takes sentiment into account, just the frequency of mention in a tweet.
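A sketch of how tweets could be assigned to a team and counted per minute. The keyword lists contain only the four terms named above (the full list is not given), the team labels are my own, and the sample tweets and timestamps are invented. Substring matching is crude: "a's" will also match unrelated words such as "mama's".

```python
from collections import Counter
from datetime import datetime

# Assumption: only the keywords named in the post; the real list was longer.
TEAM_KEYWORDS = {
    "KC": ("royals", "takethecrown"),
    "OAK": ("a's", "greencollar"),
}

def teams_mentioned(text):
    """Return the teams whose keywords appear anywhere in the text."""
    lowered = text.lower()
    return [team for team, words in TEAM_KEYWORDS.items()
            if any(word in lowered for word in words)]

def volume_per_minute(tweets):
    """Count tweets per (minute, team); tweets are (timestamp, text) pairs."""
    counts = Counter()
    for ts, text in tweets:
        minute = ts.replace(second=0, microsecond=0)
        for team in teams_mentioned(text):
            counts[(minute, team)] += 1
    return counts

# Invented sample tweets.
tweets = [
    (datetime(2014, 9, 30, 21, 15, 5), "Let's go Royals! #TakeTheCrown"),
    (datetime(2014, 9, 30, 21, 15, 40), "A's take the lead #GreenCollar"),
    (datetime(2014, 9, 30, 21, 16, 2), "Royals tie it up"),
]
counts = volume_per_minute(tweets)
for (minute, team), n in sorted(counts.items()):
    print(minute.strftime("%H:%M"), team, n)
```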

Twitter Retweet Decay

This uses the same data set I obtained for my NU Data Mining final project [summary].

Recently, @MLBcathedrals tweeted a photo I submitted to them:

I got a bunch of retweet/favorite notifications at first, then fewer as the day went on. Now, a month later, I still get a favorite or retweet [RT] notification every so often. Retweets arriving over time can be modeled as a Poisson process: the outcome is a discrete and fairly small count, such as the number of retweets per minute.

I used the tweets I had lying around from my project and pulled out several collections of native RTs that had their first RT in the data set and a high volume of retweets. This ensured I had the first part of each tweet’s life and had not just captured it in the middle. Time is measured in seconds from the first retweet event. This simplifies things by giving each collection of RTed tweets a time base relative to when RTing started.
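The relative time base can be sketched like this. The function name and the sample timestamps are hypothetical; the only ideas taken from the text are measuring seconds from the first retweet and counting retweets per minute.

```python
from collections import Counter
from datetime import datetime

def retweets_per_minute(retweet_times):
    """Bin retweet timestamps by whole minutes elapsed since the first retweet."""
    start = min(retweet_times)
    seconds = [(t - start).total_seconds() for t in retweet_times]
    bins = Counter(int(s // 60) for s in seconds)
    last = max(bins)
    # Include minutes with zero retweets so the series has no gaps.
    return [bins.get(m, 0) for m in range(last + 1)]

# Invented retweet times for one tweet.
times = [
    datetime(2014, 8, 25, 12, 0, 0),
    datetime(2014, 8, 25, 12, 0, 10),
    datetime(2014, 8, 25, 12, 0, 50),
    datetime(2014, 8, 25, 12, 1, 30),
    datetime(2014, 8, 25, 12, 3, 5),
]
series = retweets_per_minute(times)
print(series)
```

Because every series starts at minute 0, series from different tweets can be laid over each other directly.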

The common time base enabled me to make this comparison chart of different RTed tweets:

Twitter Retweet Analysis

Not every RT pattern is the same. Some have many more RTs and some take a little while to gain momentum, but generally they start off strong, then slowly die out, taking roughly the shape of a logarithmic function. The total number of RTs over time is interesting, but this problem works better if we look at the rate of RTing. The RT pattern flattens out because the RTing rate steadily decreases over time. This makes intuitive sense if you have ever used Twitter: people react to things as they happen, then rarely go back to them.

It turns out you can mathematically model the RT rate with a Poisson generalized linear model reasonably well. The following three graphs show the actual RT rate data points as red dots, the expected value regression as the black line, and a probability range as blue bands.
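For readers who want to see what fitting such a model involves, here is a dependency-free sketch of Poisson regression with a log link and a single predictor, fitted by Newton-Raphson. It is not the code behind the graphs; in practice a statistics package would do this. The function name is my own, and the demo data are noiseless expected values generated from known coefficients, so the fit should simply recover them.

```python
import math

def fit_poisson_loglinear(t, y, iterations=50, tol=1e-10):
    """Fit ln(E[Y|t]) = a + b*t by Newton-Raphson on the Poisson log-likelihood."""
    a = math.log(sum(y) / len(y))  # start from the overall mean rate
    b = 0.0
    for _ in range(iterations):
        mu = [math.exp(a + b * ti) for ti in t]
        # Gradient of the log-likelihood with respect to a and b.
        g_a = sum(yi - mi for yi, mi in zip(y, mu))
        g_b = sum(ti * (yi - mi) for ti, yi, mi in zip(t, y, mu))
        # Fisher information (the negated Hessian), a 2x2 matrix.
        h_aa = sum(mu)
        h_ab = sum(ti * mi for ti, mi in zip(t, mu))
        h_bb = sum(ti * ti * mi for ti, mi in zip(t, mu))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: inverse information times gradient.
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + step_a, b + step_b
        if abs(step_a) < tol and abs(step_b) < tol:
            break
    return a, b

# Synthetic demo: expected counts from a = 3.0, b = -0.1 (not real retweet data).
t_demo = list(range(30))
y_demo = [math.exp(3.0 - 0.1 * ti) for ti in t_demo]
a_hat, b_hat = fit_poisson_loglinear(t_demo, y_demo)
print(round(a_hat, 4), round(b_hat, 4))
```

With real count data the estimates would carry sampling noise, and the fitted line would be the expected-value curve drawn through the scattered points.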

Twitter Retweets Per Minute

The model for this particular RTed tweet is described by the equation:

$$\ln(E[Y|t]) = 2.980154 - 0.0017236\,t$$

$Y$ is the number of RTs per minute, and $t$ is time. $E[Y|t]$ is read as the expected value of $Y$ given $t$, or the average number of RTs per minute expected at a given time. The constant [2.980154] is the natural log of the expected rate at t=0, so the model starts at about $e^{2.980154} \approx 19.7$ RTs per minute, and the negative regression coefficient [-0.0017236] indicates that the rate declines as time passes. This regression line represents the expected value, which is essentially an average of possible outcomes. Using the Poisson distribution and the expected value, I constructed a probability distribution showing a band where 50% of all data points should be located, and another band that should encompass 90% of them.
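Because the model is linear on the log scale, exponentiating gives the expected rate directly. The short sketch below evaluates the fitted equation and derives how long it takes the expected rate to halve, which is ln(2) divided by the size of the slope. The result is in whatever units $t$ was measured in for the regression; the constant names are my own.

```python
import math

INTERCEPT = 2.980154   # constant from the fitted equation
SLOPE = -0.0017236     # regression coefficient on t

def expected_rate(t):
    """Expected RTs per minute at time t under the fitted model."""
    return math.exp(INTERCEPT + SLOPE * t)

# Time for the expected rate to fall by half, in units of t.
half_life = math.log(2) / -SLOPE

print(round(expected_rate(0), 2), round(half_life, 1))
```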

The bands, in my opinion, are more important than the regression line, because we are dealing with count data. So having an expected value of say 2.5342 doesn’t mean much if you don’t know the probability for getting a value of 0, 1, 2, 3, 4, etc. For this reason the last graph in the series of three has only the actual data points in the probable area bands. For each minute, the data point has a 50% chance to be in the dark blue region, a 40% chance to be in the light blue region [90%-50%], and a 10% chance to be in the white.
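One way such bands could be built, as a sketch rather than the exact construction behind the graphs: take the expected value at a given minute as the Poisson mean and read off the quantiles that bracket the central 50% and 90% of the distribution. Because counts are discrete, each band covers at least its nominal probability. The function names are my own.

```python
import math

def poisson_quantile(mu, q):
    """Smallest count k whose cumulative Poisson probability reaches q."""
    k = 0
    pmf = math.exp(-mu)  # P(Y = 0)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mu / k    # P(Y = k) from P(Y = k - 1)
        cdf += pmf
    return k

def central_band(mu, coverage):
    """(low, high) counts bracketing the central `coverage` of Poisson(mu)."""
    tail = (1 - coverage) / 2
    return poisson_quantile(mu, tail), poisson_quantile(mu, 1 - tail)

mu = 2.5342  # the example expected value mentioned above
print(central_band(mu, 0.50), central_band(mu, 0.90))
```

Repeating this for the expected value at each minute traces out the inner and outer bands over time.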

This is all predicated on RTing being a random process with a fixed audience. That described most of the RTs I looked at fairly well, but there will always be other factors such as viral growth and time of day. Viral growth means a tweet starts off slow, then grows large. If this were to happen, it would not follow this pattern; it would look more like an S. For better or worse, most RTs come from accounts with a large number of followers, so they aren’t actually viral; they are propagations of already popular tweets.

This specific regression by itself won’t predict how many RTs a tweet might get before it’s tweeted, but it describes what happens after people have begun retweeting.

Twitter Sentiment Analysis

This is a summary of a final project I did for my Introduction to Data Mining class at NU. The goal of the project was to find a business need and execute a data mining process. The general process I used is outlined here and the sentiment lexicon is found here. The lexicon is from a paper: Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

My experiences using social media, the business-centric focus of my grad classes, and my love for burritos inspired me to look into Twitter sentiment analysis. [I also needed to research something that wasn’t baseball.] Imagine every time you’ve misinterpreted a text message from your friend. Or every time an irate Twitter follower takes a sarcastic tweet seriously. That’s how hard it is for normal people to correctly interpret sentiment of written communication. So now picture trying to get a computer to do the same thing. Not easy. But at the very least we can find a way to categorize more tweets correctly than we misidentify.

On August 25th & 26th, I scraped tweets containing the Twitter handles of several companies:

Company Twitter Handles

I chose these based on my personal preferences or because I thought the companies might attract strong sentiment. The tweets were scraped using R and the package streamR. [These are pretty easy to use. If anyone wants to start doing Twitter research, I’d start here.] The tweets are saved as JSON files, which are a mess but human-readable. R parses the tweets into a data frame, which can then go into a SQL database if so desired.

I determined a tweet’s sentiment by matching words in the tweet to a predetermined sentiment lexicon. This process is outlined in this presentation, if you are curious about how to execute it. The algorithm is simple enough to write with one FOR loop. The toughest problem I had was dealing with the tweet’s data object. The scoring system was simple: each word from a tweet that matches the lexicon counts as +1 [positive] or -1 [negative]. The word scores are added together to give a sentiment score. The sentiment score can be interpreted like a signed number [0 is neutral, greater than 0 is positive, and less than 0 is negative].
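The scoring rule can be sketched in a few lines. The project itself was done in R; this Python version uses a tiny stand-in lexicon of my own choosing instead of the full Hu and Liu word lists, and the tokenizer is a simple assumption.

```python
import re

# Stand-in lexicon for illustration; the project used the Hu and Liu lexicon.
POSITIVE = {"love", "great", "good", "delicious"}
NEGATIVE = {"hate", "terrible", "slow", "worst"}

def sentiment_score(text):
    """Sum +1 for each positive word and -1 for each negative word."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for word in words:  # the single loop the description refers to
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

for tweet in ["I love this delicious burrito",
              "worst service, so slow",
              "the bus is late"]:
    print(sentiment_score(tweet), tweet)
```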

I grouped the results by company:

Absolute Twitter Sentiment Corporations

The bar graph sums the sentiment scores of every tweet I captured. So, for example, two negative words in one tweet will drop the company’s total score by 2 points. It’s no surprise that Comcast is in last place. I’ve never heard anyone say anything nice about Comcast, and apparently Twitter is not much kinder. News about the Burger King-Tim Hortons merger broke while I was capturing tweets, which accounts for the great discontent about Burger King. I’m not quite sure what BK’s baseline should be, since this is the only data I have on hand. Verizon has very positive scores. I am a little skeptical of this, because Verizon (and Apple) had a lot of tweets that were ad-based. While it’s great to know who is advertising your product, it’s not the goal of this particular project. Ads and news stories get retweeted and regurgitated a lot, but the tweets from real customers don’t.

One way to account for a large volume of tweets is to look at average sentiment per tweet:

Average Sentiment Per Tweet -- Corporations

This graph takes the total sentiment score and divides it by the total number of tweets mentioning that company. This will give more weight to a company’s Twitter-customer base that’s strongly opinionated one way or another. To no one’s surprise, Comcast is dead last again. But to my delight, Chipotle ranks first! [I love burritos…and apparently a lot of Twitter users do too.] Chipotle is a good example of a company that does not have the volume that Comcast or Verizon has, but their users feel strongly about their product and tweet positively about it.
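The difference between the two graphs comes down to one division. In this sketch the handles and scores are invented to show the effect: a high-volume account with mostly neutral tweets can lead on total score while a smaller, more opinionated one leads on average.

```python
from collections import defaultdict

def summarize(scored_tweets):
    """Map each company to (total score, average score per tweet)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for company, score in scored_tweets:
        totals[company] += score
        counts[company] += 1
    return {c: (totals[c], totals[c] / counts[c]) for c in totals}

# Invented handles and scores, for illustration only.
scored = [("@bigco", s) for s in (1, 0, 0, 1, 0, 0, 1, 0)]
scored += [("@smallco", s) for s in (1, 1)]

summary = summarize(scored)
for company, (total, average) in summary.items():
    print(company, total, average)
```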

Luckily, there is a bunch of metadata available with the tweets including my favorite variable, a timestamp. First, here’s a time series baseline for August 25, 2014:

Total Tweets Per Hour -- 8/25/14

The volume of tweets increased throughout the day and peaked around lunchtime EDT. Let’s look at Chipotle’s time series graph broken up by sentiment classification:

Chipotle Sentiment Per Hour 8/25/14

Neutral and positive tweets peaked during the 1PM EDT hour, with no corresponding spike in negative tweets. This is very encouraging for Chipotle, since you would expect the negative tweets to follow the same pattern, and they don’t. They actually rise later in the day. Further research would have to be done to determine if this trend is real and what the source of it is. It could be a time-zone delay, general staffing/production issues in non-lunch hours, selection bias among people who go to a later lunch, or a random fluctuation that happened that day.

Comcast also has some interesting patterns:

Comcast Sentiment Per Hour -- 8/25/2014

There are two spikes in neutral tweet volume early in the day. I think these result from mentions in a news-related tweet that was retweeted a lot. The large spike at 2PM EDT is probably caused by retweets as well. However, during the early afternoon, there are distinctly negative customer tweets accounting for the surge in negative sentiment after lunch. My conjecture is that this surge reflects an increase in people dealing with Comcast’s customer service. It would be interesting to see if call center data matched up.

This is just the most basic implementation of sentiment analysis. There are more advanced machine learning techniques that can weight words differently and look at consecutive word groups (n-grams) in addition to individual words. The advantage of this is easy to see. The phrase ‘not good’ is negative, but looking only at the constituent words, it would be scored positive, since ‘good’ matches the lexicon and ‘not’ does not. There are many other processes I can try to get more accurate results, but unfortunately, not in this post.
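To make the ‘not good’ example concrete, here is a sketch of the simplest possible two-word rule: flip a word’s score when the word before it is a negator. This is a hand-written rule, not one of the machine learning techniques mentioned above, and the word lists are stand-ins of my own.

```python
import re

# Stand-in word lists for illustration.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}
NEGATORS = {"not", "never", "no"}

def score_with_negation(text):
    """Lexicon score where a preceding negator flips a word's sign."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, word in enumerate(words):
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if i > 0 and words[i - 1] in NEGATORS:
            value = -value
        score += value
    return score

for phrase in ["not good", "not bad at all", "I love it"]:
    print(score_with_negation(phrase), phrase)
```

A rule like this still misses contractions such as "isn't" and negators that sit more than one word away, which is part of why learned n-gram models tend to do better.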

Twitter Analysis – Penguins Game 7

I’ve been listening to 93.7 The Fan while running the analysis for this, and I never realized that people can say the same thing over and over again but in slightly different ways. Also all tweets were captured AFTER THE CONCLUSION OF THE 1st PERIOD.

Everyone knows Twitter is the best venue to vent your anger about sports teams. I was able to use the statistical programming language R to scrape tweets that had certain keywords or hashtags in them, put them in a database, and then flag the tweets that contained certain keywords or collections of words. I had about 20 keywords, including “penguins”, “pens”, “rangers”, “game 7”, and “firebylsma”. I also searched for the handles of the local hockey writers, because a lot of people reply to the sports writers during the game with their own opinions.

The first graph has the total number of tweets that I scraped and the tweets that I flagged as ‘swearing’. For the most part, I feel that if someone swore in a tweet, it indicates anger or at the very least aggressiveness. As mentioned before, these graphs begin after the 1st period ends. Tweets containing the keyword ‘rangers’ have been filtered out of this first graph so that it includes a greater share of Penguins fans.
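The flagging step could look something like the sketch below. The original analysis was done in R and the flagged word list is not given, so the `SWEAR_WORDS` set here is a mild placeholder and the sample tweets are invented.

```python
import re

# Placeholder list; the actual flagged terms are not given in the post.
SWEAR_WORDS = {"damn", "hell", "crap"}

def words_in(text):
    """Lowercased set of word-like tokens in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def flag_tweet(text):
    """Flag a tweet for swearing and for mentioning 'rangers'."""
    words = words_in(text)
    return {"swearing": bool(words & SWEAR_WORDS),
            "rangers": "rangers" in words}

# Invented sample tweets.
tweets = [
    "Damn it Pens, shoot the puck",
    "Rangers win game 7",
    "Pens need a goal #firebylsma",
    "what the hell Rangers",
]
kept = [t for t in tweets if not flag_tweet(t)["rangers"]]
swearing = [t for t in kept if flag_tweet(t)["swearing"]]
print(len(kept), len(swearing))
```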

The quickest and most basic analysis is the number of tweets as a time-series. As soon as you look at the time-series line graph, you can tell when the game ends [9:41 PM]. It’s like Mt. Everest in the graph. Looking closer you can see the spikes where each team scores. An interesting occurrence happened right before the end of the game. Twitter got quiet. The tweets per minute dropped below 500 right before it exploded to a few thousand per minute. I am attributing this silence to people actually watching the game during the tense last minute.

The tweets peaked about 2 minutes after the game ended indicating a minimal lag which includes the time of picking up the smart phone, unlocking it, and composing the tweet. However, the angry, swearing tweets peaked right as the game ended indicating more visceral emotions instead of more thought-out, 140-character commentary. There are two severe dips that I can’t account for at 9:49 PM and 10:04 PM. If anyone knows something that occurred at this time, please let me know. Since there is a clear downward trend and the game was no longer being played, I am going to write those dips off as some technical difficulties that didn’t allow a lot of tweets to be sent at those times.

Since Twitter isn’t an invention specific to just Penguins fans, I separated and compared two sets of tweets: one containing the word “penguins” and one containing “rangers”. There appear to be a lot more Rangers fans than Penguins fans on Twitter, since the “rangers” tweets outnumber the “penguins” tweets at almost every point in time. The “penguins” tweets did spike when they scored their only goal. Interestingly enough, there was a lot of swearing right when the goal was scored. Penguins fans are just so angry!

Bottom line: calm down; it’s just sports.