Tag Archives: python

Emoji iOS 9.1 Update — The Taco Emoji Analysis

Before I get too far, I don’t actually analyze taco emojis. At least not yet. I do, however, give you the tools to start parsing them from tweets, text, or anything else you can get into Python.

This past month Apple released iOS 9.1 and the latest OS X 10.11.1 El Capitan update. That update included a bunch of new emojis. I’ve made a quick primer on how to handle emoji analysis in Python. When Apple released the diversity emoji update, I updated my small Python class for emoji counting to include the newest emojis, and I also looked at what actually happens with the unicode when the diversity modifier patches are used.

Click for Updated socialmediaparse Library

With this latest update, Apple and the Unicode Consortium didn’t really introduce any new concepts, but I did update the Python class to include the newest emojis. In my GitHub repo, the data folder includes a text file with all the emojis delimited by ‘\n’. The class uses this file to find any emojis in a unicode string passed to the add_emoji_count() method.

Building off of the diversity emoji update, I added a skin_tones_dict property to the EmojiDict class. This property returns a dictionary with the number of unique human emojis per tweet and their skin tones. It will not catch multiple copies of the same human emoji within a single execution of the add_emoji_count() method.

import socialmediaparse as smp #loads the package
 
counter = smp.EmojiDict() #initializes the EmojiDict class
 
#goes through list of unicode objects calling the add_emoji_count method for each string
#the method keeps track of the emoji count in the attributes of the instance
for unicode_string in collection:
   counter.add_emoji_count(unicode_string)  
 
#output of the instance
print counter.dict_total #dict of the absolute total count of the emojis in corpus
print counter.dict       #dict of the count of strings with the emoji in corpus
print counter.baskets    #list of lists, emoji in each string.  one list for each string.
print counter.skin_tones_dict #dictionary of unique human emojis and their skin tones aggregated by the counter

#print counter.skin_tones_dict output
#{'human_emoji': 4, '\\U0001f3fe': 1, '\\U0001f3fd': 1, '\\U0001f3ff': 0, '\\U0001f3fc': 2, '\\U0001f3fb': 1}
 
counter.create_csv(file='emoji_out.csv')  #method for creating csv

Above is an example of how to use the new attribute. It is a dictionary, so you can work it into your analysis however you like. I will eventually create better methods and outputs to make this feature more robust and useful.
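
For example, here is a quick sketch of turning that dictionary into skin tone proportions; it assumes the keys other than 'human_emoji' are the skin tone modifier code points, as in the sample output above:

#sketch: share of each skin tone modifier among the modified human emojis
skin_tones = counter.skin_tones_dict
modifier_total = sum(v for k, v in skin_tones.items() if k != 'human_emoji')

for code_point, count in skin_tones.items():
    if code_point != 'human_emoji' and modifier_total > 0:
        print code_point, round(float(count) / modifier_total, 2)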

The full code / class I used in this post can be found on my GitHub.

Collecting Twitter Data: Using a Python Stream Listener

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


I use the term stream listener [2 words] to refer to the program built with this code and StreamListener [1 word] to refer to the specific class from the tweepy package. The two are related but not the same. The StreamListener class makes the stream listener program what it is, but the program entails more than the class.

While using R and its streamR package to scrape Twitter data works well, Python allows more customization than R does. It also has a steeper learning curve, because the coding is more involved. Before using Python to scrape Twitter data, a software package like tweepy must be installed. If you have the pip installer installed on your system, the installation procedure is rather easy and is executed in the Terminal.

Call Tweepy Library

Terminal:

$ pip install tweepy

After the software package is installed, you can start writing a stream listener script. First, the libraries have to be imported.

import time
import io
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import os

The three tweepy class imports will be used to construct the stream listener, the time library will be used to create a time-out feature for the script, the io library will be used to write the tweets to a UTF-8 encoded file, and the os library can be used to set your working directory.
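
If you want the output file to end up in a specific folder, a minimal sketch using os (the path is just a placeholder):

#optional: set the working directory so raw_tweets.json lands where you expect
#'/path/to/project' is a placeholder -- replace it with your own directory
os.chdir('/path/to/project')
print os.getcwd() #confirm the current working directory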

Set Variable Values

Before diving into constructing the stream listener, let’s set some variables. These variables will be used in the stream listener by being fed into the tweepy objects. I code them as variables instead of hard-coding them into the function calls so that they can be easily changed.

ckey = '**CONSUMER KEY**'
consumer_secret = '**CONSUMER SECRET KEY**'
access_token_key = '**ACCESS TOKEN**'
access_token_secret = '**ACCESS TOKEN SECRET**'


start_time = time.time() #grabs the system time
keyword_list = ['twitter'] #track list

Using and Modifying the Tweepy Classes

I believe that tweet scraping with Python has a steeper learning curve than with R, because Python is dependent on combining instances of different classes. If you don’t understand the basics of object-oriented programming, it might be difficult to comprehend what the code is accomplishing or how to manipulate the code. The code I show in this post does the following:

  • Creates an OAuthHandler instance to handle OAuth credentials
  • Creates a listener instance with start time and time limit parameters passed to it
  • Creates a Stream instance with the OAuthHandler instance and the listener instance

Before these instances are created, we have to “modify” the StreamListener class by creating a child class to output the data into a .json file.

#Listener Class Override
class listener(StreamListener):

    def __init__(self, start_time, time_limit=60):

        self.time = start_time      #system time when the stream started
        self.limit = time_limit     #number of seconds to collect tweets
        self.tweet_data = []        #accumulates the raw tweet JSON strings

    def on_data(self, data):

        #collect tweets until the time limit is reached
        while (time.time() - self.time) < self.limit:

            try:

                self.tweet_data.append(data) #store the raw tweet JSON string

                return True #returning True keeps the stream alive

            except BaseException, e:
                print 'failed ondata,', str(e)
                time.sleep(5)
                pass

        #once the time limit is hit, write the collected tweets as a JSON array
        saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
        saveFile.write(u'[\n')
        saveFile.write(','.join(self.tweet_data))
        saveFile.write(u'\n]')
        saveFile.close()
        exit()

    def on_error(self, status):

        print status #print the error status code returned by the Twitter API

This is the most complicated section of the code. It overrides the actions taken when the StreamListener instance receives data [the tweet JSON].

saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(u'[\n')
saveFile.write(','.join(self.tweet_data))
saveFile.write(u'\n]')
saveFile.close()

This block of code opens an output file, writes the opening square bracket, writes the JSON data as text separated by commas, then inserts a closing square bracket, and closes the document. This is the standard JSON format with each Twitter object acting as an element in a JavaScript array. If you bring this into R or Python, the built-in JSON parsers, like Python’s json library, can handle it properly.
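
For example, here is a minimal sketch of reading the output back into Python with the built-in json library, assuming the raw_tweets.json file produced by the listener above:

import json

#parse the JSON array of tweets written by the stream listener
with open('raw_tweets.json') as f:
    tweets = json.load(f)

print len(tweets)       #number of tweets collected
print tweets[0]['text'] #text of the first tweet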

This section can be modified to change what ends up in the JSON file. For example, you can place other properties/fields, like a UNIX time stamp or a random variable, into the JSON. You can also modify the output file, or eliminate the output file entirely and insert the tweets directly into a MongoDB database. As it is written, this will produce a file that can be parsed by Python’s json library.
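
As a rough illustration, not part of the original listener above, adding a UNIX time stamp inside on_data() could look something like this; the collected_at field name is just a placeholder, and it assumes the json library has been imported at the top of the script:

#sketch only: tag each tweet with the collection time before storing it
tweet = json.loads(data)                  #parse the raw tweet JSON string
tweet['collected_at'] = time.time()       #placeholder field holding a UNIX time stamp
self.tweet_data.append(json.dumps(tweet)) #store the modified tweet
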
After the child class is created, we can create the instances and start the stream listener.

auth = OAuthHandler(ckey, consumer_secret) #OAuth object
auth.set_access_token(access_token_key, access_token_secret)


twitterStream = Stream(auth, listener(start_time, time_limit=20)) #initialize Stream object with a time out limit
twitterStream.filter(track=keyword_list, languages=['en'])  #call the filter method to run the Stream Object

Here the OAuthHandler uses your API keys [consumer key & consumer secret key] to create the auth object. The access token, which is unique to an individual user [not an application], is set in the following line. Unlike the filterStream() function in R, this takes all four of your credentials from the Twitter Dev site. The modified StreamListener class, simply called listener, is used to create a listener instance. This contains the information about what to do with the data once it comes back from the Twitter API call. Both the listener and auth instances are used to create the Stream instance, which combines the authentication credentials with the instructions on what to do with the retrieved data. The Stream class also contains a method for filtering the Twitter stream. This method works just like the R filterStream() function, taking similar parameters, because the parameters are passed to the Stream API call.
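
For example, a hedged sketch of some of the other parameters the filter method accepts in this version of tweepy; the keywords and the bounding box below are placeholders:

#sketch: filter by keywords, language, and a longitude/latitude bounding box
twitterStream.filter(track=['python', 'rstats'],
                     languages=['en'],
                     locations=[-74.0, 40.0, -73.0, 41.0]) #SW corner then NE corner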

Python vs R

At this stage in the tutorial, I would recommend parsing this data using the parser in R from the last section of the Twitter tutorial or creating your own. Since it's easier to customize the StreamListener methods in Python, I prefer to use it over R. Generally, I think Python works better for collecting and processing data, but isn't as easy to use for most statistical analysis. Since tweet scraping falls into the data collection category, I like Python. It becomes easier to access databases and to manipulate the data when you are already working in Python.

11-10-2015 -- I've updated the StreamListener to output properly formatted JSON. The old script, which works well with R's tweetParse, is still available on my GitHub.

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV -- Errors | Part VI: Twitter JSON to CSV -- ASCII | Part VII: Twitter JSON to CSV -- UTF-8

Emoji, UTF-8, and Python

I have updated [better] code that allows for easy counting of emojis in string objects in Python; it can be found on my GitHub. There are two counting classes in a mini-package loaded there.

Emoji, those ubiquitous emoticons that popped up when iPhone users found them in 2011 with iOS 5, are a different set of characters from the traditional alphanumeric and punctuation characters. They are essentially another alphabet, and this concept will be useful when working with emoji in Python. Emoji are NOT a font like Wingdings from Windows 95; they are unique characters with no corresponding letter or symbol representation. If you have a document or webpage that uses the Wingdings font, you can simply change the font to a typical Latin font to see the normal characters the Wingdings font represents.
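
To make the “another alphabet” idea concrete, here is a minimal sketch, written in the same Python 2 style as the rest of this post, showing that an emoji behaves like any other character in a unicode string:

#an emoji is just another character in a unicode string
tweet = u'I love tacos \U0001f32e \U0001f32e' #taco emoji written as escape sequences
taco = u'\U0001f32e'

print taco in tweet     #True -- membership tests work like they do for any letter
print tweet.count(taco) #2 -- so do ordinary string methods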

Technical Background

Without getting into the technical encoding problems, emoji are defined in Unicode and encoded in UTF-8, which can represent just over a million characters. A lot of applications or software packages default to ASCII, which only encodes the typical 128 characters. Some Python IDEs, csv writing packages, or parsing software default to or translate to ASCII, so they don’t necessarily handle the emoji characters properly.
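
A quick sketch of the difference between the two encodings, with the emoji written as an escape sequence so it stays ASCII-safe on this page:

emoji = u'\U0001f604' #smiling-face emoji as a unicode object

print emoji.encode('utf-8') #works: UTF-8 can represent it in four bytes

try:
    print emoji.encode('ascii') #ASCII cannot represent it
except UnicodeEncodeError as e:
    print 'ascii codec fails:', e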

I wrote a Python script [or this Python ‘package’] that takes tweets that are stored in a MongoDB database (more on that later) and counts the number of different emoji in the tweet corpus. To make sure Python plays nice with the emojis, I first loaded in the data making sure UTF-8 encoding was specified; otherwise you’ll get this encoding error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)

I loaded an emoji key I made, using all the emojis in Apple’s implementation, into a pandas data frame with this code:

import pandas as pd
emoji_key = pd.read_csv('emoji_table.txt', encoding='utf-8', index_col=0)

If Python loads your data correctly with UTF-8 encoding, each emoji will be treated as a separate, unique character, so string functions and regular expressions can be used to find the emojis in other strings such as Twitter text. In some IDEs emojis don’t display [Canopy] or don’t display well [PyCharm]. I remedied the invisible/messy emojis by running the script in Mac OS X’s Terminal application, which displays emoji. Python can also produce an ASCII-compliant string by using the unicode escape encoding:

unicode_object.encode('unicode_escape')

The escape encoded string will display something like this:

\U0001f604

All IDEs will display the ASCII string. You would need to decode it from the unicode escape to get it back into a unicode object; a round-trip sketch follows the code below. Ultimately I had a pandas data frame containing unicode objects. To make sure the correct encoding was used on the output text file, I used the following code:

with open('emoji_out.csv', 'w') as f:
    emoji_count.to_csv(f, sep=',', index=False, encoding='utf-8')
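
And here is the round trip mentioned above, a minimal sketch of escape-encoding a unicode object and decoding it back:

#escape-encode a unicode object into an ASCII-safe byte string...
escaped = u'\U0001f604'.encode('unicode_escape')
print escaped #\U0001f604

#...then decode it to get the original unicode object back
restored = escaped.decode('unicode_escape')
print restored == u'\U0001f604' #True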

Emoji Counter Class

I made an emoji counter class in Python to simplify the process of counting and aggregating emoji counts. The code [socialmediaparse] is on my GitHub along with the necessary emoji data file, so it can load the key when the instance is created. Using the package, you can repeatedly call the add_emoji_count() method to update the internal count for each emoji. The results can be retrieved using the .dict, .dict_total, and .baskets attributes of the instance. I wrote this because it organizes and simplifies the analysis for any social media or emoji application. Separate emoji counter objects can be created for different sets of tweets that someone would want to analyze.

import socialmediaparse as smp #loads the package

counter = smp.EmojiDict() #initializes the EmojiDict class

#goes through list of unicode objects calling the add_emoji_count method for each string
#the method keeps track of the emoji count in the attributes of the instance
for unicode_string in collection:
   counter.add_emoji_count(unicode_string)  

#output of the instance
print counter.dict_total #dict of the absolute total count of the emojis in corpus
print counter.dict       #dict of the count of strings with the emoji in corpus
print counter.baskets    #list of lists, emoji in each string.  one list for each string.

counter.create_csv(file='emoji_out.csv')  #method for creating csv

Project

MongoDB was used for this project because it stores the JSON files very well, with no need for a parser or a csv writer. It also has the advantage of natively storing strings in UTF-8. If I had used R’s streamR csv parser, there would have been many encoding errors and virtually no emojis present in the data. There might be possible workarounds, but MongoDB was the easiest way I’ve found to work with Twitter JSON, UTF-8 encoded data.
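
For reference, here is a rough sketch of pulling tweet text out of MongoDB with pymongo and feeding it to the emoji counter; the database and collection names are placeholders, not necessarily the ones used in this project:

from pymongo import MongoClient
import socialmediaparse as smp

client = MongoClient() #connects to localhost:27017 by default
tweets = client['twitter_db']['tweets'] #placeholder database and collection names

counter = smp.EmojiDict()
for doc in tweets.find({}, {'text': 1}): #pymongo hands back strings as unicode objects
    counter.add_emoji_count(doc.get('text', u''))

print counter.dict_total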