There are three doors. Hidden behind them are two goats and a car. Your objective is to win the car. Here’s what you do:
- Pick a door.
- The host opens one of the doors you didn’t pick that has a goat behind it.
- Now there are just two doors to choose from.
- Do you stay with your original choice or switch to the other door?
- What’s the probability you get the car if you stay?
- What’s the probability you get the car if you switch?
It’s not a 50/50 choice. I won’t digress into the math behind it, but instead let you play with the simulator below. The game will tally up how many times you win and lose based on your choice.
What’s going on here? Marilyn vos Savant published the solution to this puzzle in 1990. You can read vos Savant’s explanations and some of the ignorant responses she received. In short, because the door that’s opened is not opened randomly, the host gives you additional information about the set of doors you didn’t choose. Effectively, if you switch, you are selecting all the other doors; if you choose to stay, you are selecting just one door.
In her answer, she suggests:
Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?
To illustrate that in the simulation, you can increase the number of doors in the simulator. It becomes pretty clear that switching is the correct choice.
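If you want to check the numbers outside the simulator, here’s a quick Monte Carlo sketch of the game [my own code, not the simulator embedded in this post]. It models the host opening every unpicked goat door except one, so it also works for the million-door version:

//plays one round: the host opens every unpicked goat door except one,
//so the last closed door hides the car whenever your first pick was wrong
function playRound(doors, switchDoors) {
    var car = Math.floor(Math.random() * doors)
    var pick = Math.floor(Math.random() * doors)
    return switchDoors ? pick !== car : pick === car
}

//estimates the win probability over many trials
function winRate(doors, switchDoors) {
    var wins = 0
    var trials = 100000
    for (var i = 0; i < trials; i++) {
        if (playRound(doors, switchDoors)) wins++
    }
    return wins / trials
}

console.log(winRate(3, false)) //stay:   ~0.333
console.log(winRate(3, true))  //switch: ~0.667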
Finally, here’s some Kevin Spacey:
Make an HTML Table with jQuery
For a project I was working on, I needed a quick, simple way to build a dynamic table from data sent back by an AJAX call. I used jQuery to build and manipulate the table HTML, since it was quick to use and already in my project.
After considering a few different ways to approach this, I decided different arrays would be the easiest way to handle the data. The data looks like this:
var data = {
    k: ['Name', 'Occupation', 'Salary', 'Roommate'],
    v: [['Chandler', 'IT Procurement Manager', '$120,000', 'Joey'],
        ['Joey', 'Out-of-work Actor', '$50,000', 'Chandler'],
        ['Monica', 'Chef', '$80,000', 'Rachel'],
        ['Rachel', 'Assistant Buyer', '$70,000', 'Monica'],
        ['Ross', 'Dinosaurs', '$100,000', 'No Roommate']]
}
It’s a JavaScript object with two keys: one for the header, which I abbreviated k, and one for the main data values, which have the key v. The header is just an array of strings, while the values are an array of arrays. I specifically designed this code to work within these parameters, so there could be more checks built in, but the data source is rather rigid.
To make the Table class, I defined the attributes:
function Table() {
    //sets attributes
    this.header = []
    this.data = [[]]
    this.tableClass = ''
}
Using prototype code like this is a little bit of overkill, but it can be reused and extended. I plan on having the application update with new data and possibly add other features, and creating a prototype makes that a little easier and cleaner.
I have three setter methods, which just allow the Table object’s attributes and data to be set.
Table.prototype.setHeader = function(keys) {
    //sets header data
    this.header = keys
    return this
}

Table.prototype.setData = function(data) {
    //sets the main data
    this.data = data
    return this
}

Table.prototype.setTableClass = function(tableClass) {
    //sets the table class name
    this.tableClass = tableClass
    return this
}
All the methods I’ve written have return this in them. That allows method chaining, which makes the implementation of the code a lot simpler. The meat of the code is in the build method.
Table.prototype.build = function(container) {
    //default selector
    container = container || '.table-container'

    //[the original snippet was cut off after this point; the rest is a
    //reconstruction based on the description below]

    //creates the prototype jQuery objects for the table structure
    var table = $('<table></table>').addClass(this.tableClass)
    var tr = $('<tr></tr>') //table row
    var th = $('<th></th>') //table header cell
    var td = $('<td></td>') //table data cell

    //builds the header row; clone() creates a new element each time
    var headerRow = tr.clone()
    this.header.forEach(function(d) {
        headerRow.append(th.clone().text(d))
    })
    table.append(headerRow)

    //builds a row for each array in the data
    this.data.forEach(function(rowData) {
        var row = tr.clone()
        rowData.forEach(function(d) {
            row.append(td.clone().text(d))
        })
        table.append(row)
    })

    //replaces the contents of the container with the finished table
    $(container).empty().append(table)
    return this
}
I’ve annotated most of the code, but basically it creates jQuery objects for each part of the table structure: the table (table), a row (tr), a header cell (th) and a normal table cell (td). The clone() method is necessary so that jQuery creates a new HTML element each time. Otherwise it would keep removing, modifying and appending the same element.
Using the prototype we just created is rather easy; we did the hard part already. We use the new keyword to instantiate a new object. This allows us to create many independent Table objects which can be manipulated individually within the application.
//creates new table object
var table = new Table()

//sets table data and builds it
table
    .setHeader(data.k)
    .setData(data.v)
    .setTableClass('sean')
    .build()
Above is the short snippet of code that uses method chaining, which saves us from writing a separate line of code for each method call, like table.setData(). I used setHeader() to set the array (data.k) which populates the table’s header. The setData() method sets the array of arrays (data.v) as the source of the data for the rest of the table.
Finally, the build() method uses the data we just set to actually run the code that manipulates the HTML, and this is what you see in your web browser.
Before using it on a web page, there has to be some HTML. (And some CSS so that the table looks decent.) The most important part is that the div container has the class of "table-container". The Table class looks for that class by default to append the table to. You can customize that by passing a jQuery selection string as a parameter to the table.build([jQuery selection string]) method.
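The markup can be as simple as an empty div with that class [a minimal sketch; any surrounding page structure and the CSS are up to you]:

<div class="table-container"></div>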
Above is a working version of all the code from this post. The full code I used can be found on my GitHub.
D3 Visualization Basics — First Steps
D3 visualizations work by manipulating elements in the browser window. This short tutorial will demonstrate the very basics of that. This is also a working, simple demonstration of the interplay of HTML, CSS and JavaScript from the introduction page in this D3 tutorial set.
For the sake of simplicity, everything will come from one HTML document, which can be found in my GitHub. It will contain both the HTML and the JavaScript. You can [and should] separate the JavaScript into its own file on bigger projects.
In this small project, we will start with a simple div container with the class of “container”.
Right now this doesn’t do anything, so it’s not worth showing. But if D3 code is added to create a blue box, it might look a little more interesting. [Provided you find blue boxes interesting.]
Blue Block
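The live example doesn’t carry over here, so below is a sketch of the call the post describes [the exact size and shade of blue are my assumptions]:

d3.selectAll('.container')             //finds every element with the class "container"
    .append('div')                     //adds a child div inside the container
    .attr('class', 'new-block')        //gives the new div a class of "new-block"
    .style('width', '100px')           //inline styles [these exact values are guesses]
    .style('height', '100px')
    .style('background-color', 'blue')
    .text('Blue block')                //puts the text "Blue block" into the div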
Above is a simple example of a basic D3 procedure. It can be helpful to think about this command as having two parts: getting a DOM [browser] element to manipulate and giving instructions for those manipulations. Here is what the code is doing:
- The select statement [d3.selectAll()] finds every instance of the class of “container”. [There is only one element on the page.]
- The append() method adds another div as a child element inside of the container div.
- The attr() method gives the new div a class of “new-block”.
- The style() method gives the new div several style properties.
- The text() method puts the text “Blue block” into the div.
The style() and attr() methods can be a little confusing since they do similar things, but attr() places attributes in the HTML tag, while style() writes inline CSS for the element, which will override any CSS stylesheet you load in the header. These are just a few of the methods you can use on a D3 DOM selection object, and you can find more in the D3 API reference.
Blue Block with Functions
Creating the blue block was a blast, but adding some functionality might make this a little more useful. Let’s make the block change color on a click and then remove it on a double click.
The first part of the script to create the blue block is the same, except I’ve doubled the code to have two blue blocks, and I added new code that interacts with them. The on() method allows you to attach an event listener to elements rendered in the browser. These wait until a certain event happens; in the example, a function turns the box orange when the blue box is clicked. You can put many different instructions in here and aren’t limited to manipulating the element that is being clicked. I will devote another post to the details of D3 event listeners.
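Here’s a sketch of what that interaction code might look like [the orange value comes from later in this post; the rest is an assumption]:

d3.selectAll('.new-block')
    .on('click', function() {
        //"this" is the specific DOM element that was clicked
        d3.select(this).style('background-color', '#FF9821') //turns the block orange
    })
    .on('dblclick', function() {
        d3.select(this).remove() //deletes the block on a double click
    })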
You might notice the d3.select(this) in the code. this is a fun keyword in JavaScript which deals with scope; here it refers to the specific DOM element which was clicked or double clicked. If you click on the left block, only that block turns orange. You could change the code to replace this with '.new-block', and clicking one block would change both blocks to orange.
Having the d3.select(this) code blocks in the event listener functions makes it so they are only executed when the event happens. The block’s background color is changed to orange [#FF9821] when it’s clicked. The remove() method deletes any DOM elements within the selection, including the children elements. This comes in handy when you need to rebuild or update a data visualization.
[Next]
Data! D3’s most powerful tool.
D3 Visualization Basics — Introduction
Data visualization is important, really important. I can’t be more blunt than that. We are able to process much more information much faster by seeing a visual representation than by looking at a table or database or interacting with a spreadsheet. I will be writing a series of posts that explore some of the foundations D3 is built on, along with how to create engaging data visualizations using it.
D3 is a powerful tool that allows you to create interactive data visualizations for the web. Understanding how D3 works starts with understanding how modern web pages are designed.
If you have found this page, you probably have at least some knowledge of how to make a modern website: HTML, CSS, JavaScript, responsive design, etc. D3 uses basic elements from these components of web design to create its visualizations. This is by no means the only way to create interactive visualizations, but it is an effective one.
Before jumping into D3 nuts and bolts, let’s look at what each of these components does. [If you already know this stuff, feel free to skip ahead…once I get the other posts built out.]
In the most simplistic terms, HTML provides the structure of the webpage, CSS provides the styling and formatting, and JavaScript provides the functionality of the site. The browser brings these three components together and interprets them into something the end user (you) can understand and use. Sometimes one component can accomplish what the other does, but if you stick to this generalization you’ll be in good shape.
To produce a professional-looking, fully-functional D3 data visualization you will need to understand, write and manipulate all three components.
HTML
The most vivid memories I have of HTML are from the websites of the late 90s: Geocities, Angelfire, etc. HTML provides instructions on how browsers should interpret information; it organizes the information. Everything you see on a webpage has corresponding HTML code.
If you look at the source HTML or inspect one of this site’s pages, you’ll see some of the structure. When HTML renders in the browser, these elements are referred to as DOM elements. DOM stands for Document Object Model, which is the structure of a webpage.
Looking at the DOM tree, you can see many of the div containers that provide structure for how the site is laid out. The p tags contain each paragraph in the content of my posts. h1, h2 and h3 are subheadings I’ve made to keep the posts organized. You’ll also notice some attributes, especially class, which have many uses in CSS, JavaScript and D3. Classes in particular are used to identify what role a DOM element plays in JavaScript or how to style it in CSS.
CSS
A house without painted walls or decorations is pretty boring. The same thing happens with bare bones HTML. You can organize the information, but it won’t be in an appealing format.
Most sites have style sheets (CSS) which sets margins, colors, display options, etc. Style sheets have a specific syntax which identifies HTML elements by type, class or id. This identification and selection concept is used extensively in D3.
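The original snippet didn’t survive the move here, but based on the description below it looked something like this [the specific values are placeholders]:

.page-links {
    font-size: 14px;           /* font size */
    margin: 10px 0;            /* margins */
    height: 40px;              /* height */
    width: 200px;              /* width */
    text-transform: uppercase; /* makes the text all uppercase */
}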
The CSS above contains formatting instructions for elements of the class “page-links”: the font size, margins, height, width, and an instruction to make the text all uppercase. The advantage of CSS is that it keeps formatting away from the structure of the HTML, allowing you to format many elements at once. For example, if you wanted to change the color of every link, you could easily do that by modifying the CSS.
There is an alternative to using CSS style sheets and that’s by using inline style definitions.
Inline styles use the same markup as the CSS in the style sheets. Inline styles:
- control only the element they are in
- OVERRIDE any CSS styles [without an !important tag]
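For example [my own markup, standing in for the original snippet]:

<p style="text-align: right;">This paragraph ignores the stylesheet and aligns right.</p>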
The code above overrides the normal paragraph style, which aligns it left. Using inline styles is generally bad for web design, but it’s important to understand how they work since D3 manipulates inline styles often.
JavaScript
JavaScript breathes life into your web page. It’s certainly not the only way to make your website interactive or to build programming into it, but it is widely used and supported in all the popular browsers. D3 is a JavaScript library, so you will inevitably have to write JavaScript to use it.
For D3 visualizations, JavaScript will be used to:
- Manage and manipulate data for the visualization
- Create DOM elements
- Manipulate DOM elements
- Destroy DOM elements
- Attach data to DOM elements
JavaScript will be used to insert elements onto the page, and it will also be used to change the colors and styles of those elements. You might be able to see how this could be useful. For example, JavaScript could map data points to an element’s position for a scatter plot or to an element’s height or width for a bar chart.
I bolded the last function, attaching data to elements, because it’s so critical to D3. It allows you to attach data beyond simple x, y values, which makes rich visualizations possible.
As an example, in a D3 visualization I made for FanGraphs, I attached data detailing each team’s name, id, league, ERA and FIP. It’s a simple example, but using the attached data I was able to create the graph and tooltips. More complex designs can take advantage of the robust data structure D3 provides.
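As a generic illustration of attaching data [a toy sketch, not the FanGraphs code; the values are made up]:

//each element gets one of these objects bound to it
var teams = [
    {name: 'Mets', era: 3.43, fip: 3.52},
    {name: 'Nationals', era: 3.62, fip: 3.60}
]

d3.select('.chart').selectAll('div')
    .data(teams)            //attaches one data object to each div
    .enter().append('div')  //creates a div for each unmatched data object
    .text(function(d) {     //the bound datum is available as "d"
        return d.name + ': ' + d.era
    })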
[Next]
I’ll look at how to set up a basic project by organizing data, files and code.
Stattleship! Sport Stats API
I’ve been in contact with the team over at Stattleship. They have a cool API that allows you to get various stats for basketball, football and hockey. I used data from that API to create the following data visualization for their blog. The visualization shows the offensive and special teams yards gained by each team remaining in the playoffs. The yardage is totaled for the entire season as well as for the one playoff game each team has played. I’ve also displayed the points scored off of offensive TDs and special teams scoring, color coded by wins and losses: a black background is a win, and a white background is a loss.
The Backwards K — Baseball Strikeout Looking
The backwards K is normally used to denote a called third strike in a strikeout. It’s typically written on a scorecard. I’ve been looking for the backwards K so I can denote the strikeout looking on Twitter, and I finally found it:
ꓘ
(for unsupported browsers — Chrome)
The easiest way to use this character is to copy the backwards K from above and save it in a note or somewhere you can copy and paste from routinely. This character is actually from Apple’s implementation of the Unicode block for the artificial, Latinized version of the Lisu alphabet. This alphabet contains an upside-down, turned K which looks similar enough to a backwards K that I think it passes on Twitter.
If you don’t see the backwards K in the block above, your computer or mobile device probably isn’t using a font that supports that specific character. It’s supported on Macs and iPhones (as well as the Edge browser in Windows 10).
References:
http://unicode.org/charts/PDF/UA4D0.pdf
https://en.wikipedia.org/wiki/Fraser_alphabet
Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]
The main drawback to the ASCII CSV parser and the csv library is that they can’t handle unicode characters or objects. I want to be able to make a CSV file that is encoded in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post, so the description of the json Python object can be found in that tutorial.
io.open
First, to handle the UTF-8 encoding, I used the io.open class. For the sake of consistency, I used this class for both reading the JSON file and writing the CSV file. This actually doesn’t require much change to the structure of the program, but it’s an important change. json.loads() reads the JSON data and parses it into an object you can access like a Python dictionary.
import json
import csv
import io

data_json = io.open('raw_tweets.json', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)

csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
Unicode Object Instead of List
This program uses the write() method instead of csv.writerow(), and write() requires a string (in this case a unicode object) instead of a list, so commas have to be manually inserted into the string to properly separate the fields. For the field names, I just rewrote the line of code to be a unicode string instead of the list used for the ASCII parser. u'*string*' is the syntax for a unicode string; these behave similarly to normal strings, but they are a different type, and using the wrong type of string can cause compatibility issues. The line of code that writes u'\n' creates a new line in the CSV. Once again, since this parser is built from scratch, it needs to insert the newline character itself to start a new line in the CSV file.
fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')
The for loop and Delimiters
This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. For this flavor of CSV, the text field is entirely enclosed by quotation marks (") and commas (,) separate the different fields. To account for the possibility of quotation marks in the actual text content, any real quotation mark is designated by double quotes (""). This can give rise to triple quotes, which happens when a quotation mark starts or ends a tweet’s text field.
for line in data_python:
    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]
    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

csv_out.close()
This parser implements the delimiter requirements for the text field by:
- Replacing all quotation marks with double quotes in the text.
- Adding quotation marks to the beginning and end of the unicode string:
'"' + line.get('text').replace('"','""') + '"', #creates double quotes
Joining the row list using a comma as a separator is a quick way to build the unicode string for each line of the CSV file:
row_joined = u','.join(row)
The full code I used in this tutorial can be found on my GitHub.
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]
Collecting Twitter Data: Converting Twitter JSON to CSV — ASCII
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8
I outlined some of the potential hurdles you have to overcome when converting Twitter JSON data to a CSV file in the previous section. Here I outline a quick Python script that allows you to parse your Twitter JSON file with the csv library. This has the obvious drawback that it can’t handle the UTF-8 encoded characters that can be present in tweets. But this program will produce a CSV file that works well in Excel or other programs that are limited to ASCII characters.
The JSON File
The first requirement is to have a valid JSON file. This file should contain an array of Twitter JSON objects, or in analogous Python terms, a list of Twitter dictionaries. The tutorial for the Python Stream Listener has been updated to produce a correctly formatted file that works in Python.
[{Twitter JSON Object}, {Twitter JSON Object}, {Twitter JSON Object}]
The JSON file is loaded into Python and automatically parsed into a Python-friendly object by the json library using the json.loads() method. The open() line opens and reads the file in as a string, then json.loads() decodes the string into a json Python object, which behaves like a list of Python dictionaries, one dictionary for each tweet.
import json
import csv

data_json = open('raw_tweets.json', mode='r').read() #reads in the JSON file into Python as a string
data_python = json.loads(data_json) #turns the string into a json Python object
The CSV Writer
Before getting too far ahead, the CSV writer should create a file and write the first row to label the data columns. The open() line creates a file and allows Python to write to it. This is a generic file, so anything could be written to it. The csv.writer() line creates an object which will write CSV-formatted text to the file we just opened. There are some other parameters you can specify, but it defaults to Excel specifications, so those options can be omitted.
csv_out = open('tweets_out_ASCII.csv', mode='w') #opens csv file
writer = csv.writer(csv_out) #create the csv writer object

fields = ['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav'] #field names
writer.writerow(fields) #writes field
The purpose of this parser is to get some really basic information from the tweets, so it will only get the date and time, text, screen name and the number of followers, friends, retweets and favorites [which are called likes now]. If you wanted to retrieve other information, you’d create the column names accordingly. The writerow() method writes a list, with each element becoming a value separated by commas in the CSV file.
The json Python object can be used in a for loop to access the individual tweets. From there, each line can be accessed to get the different variables we are interested in. I’ve condensed the code so that it is all in one statement. Breaking it down, line.get('*attribute*') retrieves the relevant information from the tweet, where line represents an individual tweet.
for line in data_python:
    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    writer.writerow([line.get('created_at'),
                     line.get('text').encode('unicode_escape'), #unicode escape to fix emoji issue
                     line.get('user').get('screen_name'),
                     line.get('user').get('followers_count'),
                     line.get('user').get('friends_count'),
                     line.get('retweet_count'),
                     line.get('favorite_count')])

csv_out.close()
You might not notice this line, but it’s critical to making this program work.
line.get('text').encode('unicode_escape'), #unicode escape to fix emoji issue
If the encode() method isn’t included, unicode characters (like emojis) are included in their native encoding. They would be sent to the csv.writer object, which can’t handle those characters and will fail. This is necessary for any field that could possibly contain a unicode character. I know the other fields I chose cannot have non-ASCII characters, but if you were to add name or description, you’d have to make sure they do not contain incompatible characters.
The unicode escape rewrites the unicode as a string of letters and numbers, much like \U0001f35f. These represent the characters and can actually be decoded later.
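Here’s a tiny round trip to show what that means [my own example, in Python 2 to match the tutorial]:

text = u'fries \U0001f35f'                  #a unicode string containing an emoji
escaped = text.encode('unicode_escape')     #an ASCII-safe str of backslash escapes
print escaped                               #the emoji is now letters and numbers

restored = escaped.decode('unicode_escape') #turns the escapes back into characters
print restored == text                      #True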
The full code I used in this tutorial can be found on my GitHub.
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8
Collecting Twitter Data: Converting Twitter JSON to CSV — Possible Errors
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8
ASCII JSON-to-CSV | UTF-8 JSON-to-CSV
Before diving into the problem of how to save tweets in a CSV file, let me say there are a thousand ways to do this and about a hundred complications that arise depending on which way you choose. I will devote two posts to covering both ASCII and UTF-8 encoding, because many tweets contain characters beyond the normal Latin alphabet.
Let’s look at some of the issues with writing CSV from tweets.
- Tweets are JSON and contain a massive amount of metadata. More than you probably want.
- The JSON isn’t a flat structure; it has levels. [Direct contrast to a CSV file.]
- The JSON files don’t all have the same elements.
- There are many foreign languages and emoji used in tweets.
- Tweets contain many different grammatical marks such as commas and quotation marks.
These issues aren’t incredibly daunting, but those unfamiliar with them will run into frustrating errors.
Tweets are JSON and contain a massive amount of metadata. More than you probably want.
I’m always in favor of keeping as much data as possible, but tweets contain a massive number of different metadata attributes. All of these are designed for the Twitter platform and its associated client apps. Some items, like the profile_background_image_url_https attribute, really don’t have much of an impact on any analysis. Choosing which attributes you want to keep is critical before embarking on a process to parse the data into a CSV. There’s a lot to choose from: timestamp data, user data, retweet data, geocoding data, hashtag data and link data.
The JSON isn’t a flat structure; it has levels.
This issue is an extension of the previous one, since tweet JSON data isn’t organized into a flat, spreadsheet-like structure. The created_at and text elements are located on the top level and are easy to access, but something as simple as the tweeter’s name and screen_name are located in the nested user object. Like everything else mentioned in this post, this isn’t a huge issue, but the structure of a tweet JSON object has to be considered when coding your program.
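To make the nesting concrete, here’s a stripped-down, fake tweet and how you reach each level [my own example; real tweets have far more fields]:

tweet = {
    'created_at': 'Mon Jan 01 00:00:00 +0000 2016', #top-level element
    'text': 'an example tweet',                     #top-level element
    'user': {                                       #nested user object
        'name': 'Example Name',
        'screen_name': 'example'
    }
}

print tweet.get('text')                    #easy, it sits on the top level
print tweet.get('user').get('screen_name') #two get() calls to reach the second level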
The JSON files don’t all have the same elements.
The final problem with the JSON files is that the fields aren’t necessarily present in every object. Many geo-related attributes do not appear unless geotagging is enabled. This means that if you write your program to look for geotagging data, it can throw a key error if those keys don’t exist in that specific tweet. To avoid this, you have to account for the exception or use a method that already does; I use the get() method to avoid these key errors in the CSV parser.
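A quick illustration of why get() is safer [again with a fake, stripped-down tweet]:

tweet = {'text': 'no geotag on this one'} #a fake tweet missing the 'coordinates' key

print tweet.get('coordinates') #prints None instead of raising an error
print tweet['coordinates']     #raises a KeyError and stops the program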
There are many foreign languages and emoji used in tweets.
I quickly addressed this issue in a few posts, and it’s one of the reasons why I like to store tweets in MongoDB. Tweets contain a lot of [read: important] unicode characters, typically foreign language characters and the ubiquitous emojis. This matters because the presence of UTF-8 unicode characters can and will cause encoding errors when parsing a file or loading a file into Excel. Excel (at least the version on my computer) can’t handle these characters. Other tools, like the built-in CSV writer in Python, can’t handle unicode out of the box. Being able to deal with these characters is critical to compatibility with other software as well as the integrity of your data. This is why I wrote two different parsers as examples: a CSV parser that outputs ASCII and imports well into Excel, along with a UTF-8 version which natively saves the characters and emojis in a human-readable CSV file.
Tweets contain many different grammatical marks such as commas and quotation marks.
This is a problem I had when I first started working with Twitter data and tried to write my own parser: characters that are part of your text content sometimes get confused with the delimiters. In this case I’m talking about quotation marks (") and commas (,). Commas separate the values for each ‘cell’, hence the acronym CSV [comma-separated values]. If you tweet, you’ve probably tweeted using one of these characters. I’ve previously stripped them out of the text to solve this problem, but that’s not a great solution. The way Excel handles this is to enclose any element that contains commas in quotation marks, and then to use double quotation marks to signify an actual quotation mark rather than enclosed text. This is demonstrated in the UTF-8 parser, since I made that one from scratch.
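As a small illustration of that convention [my own example]:

text = u'She said "hello, world"'               #text containing both delimiters
field = u'"' + text.replace(u'"', u'""') + u'"' #enclose the field, double the real quotes
print field                                     #"She said ""hello, world"""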
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8
R Bootcamp: Making a Subset
Data Manipulation: Subsetting
Making a subset of a data frame is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis: a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame that meet certain criteria.
[Illustration from the original post: a data frame with columns V1 through V7 and rows Row1 through Row6, alongside a subset of that data frame keeping only Row2, Row5 and Row6.]
The Data
For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s also provided in my GitHub. This data set has players’ names, teams, seasons and stats, and we are able to create a subset based on any one or more of these variables.
The Code
I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function, and using the filter() function from the dplyr package. All of these functions are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, especially when you have to loop an operation or run something repeatedly, so let’s start with that.
dplyr
The filter() function requires the dplyr package to be loaded in your R environment, and loading dplyr masks the filter() function from the default stats package. You don’t need to worry about that, but R does tell you about it when you first install and load the package.
#install.packages('dplyr')
library(dplyr) #load the package

#data from http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2015&month=0&season1=2010&ind=1&team=&rost=&age=&filter=&players=&page=2_30
setwd('***PATH***')
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data
Aside from loading the package, you’ll have to load the data in as well.
#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')

#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East

#Both of these find players in the NL East who have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign

#Finds players in the NL East or with more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)

#Finds players not in the NL East who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)
The filter() function is rather simple to use. The examples above illustrate a few simple cases where you specify the data frame you want to use and create true/false expressions, which filter() uses to find the rows it should keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.
Built-in Functions
#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]

#method 3 -- subset()
data.sub.3 <- subset(data, subset = (Team=='Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than
data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #dual requirements using AND (&)
data.sub.7 <- subset(data, subset = (AVG > .300 & PA > 600)) #using subset()
data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #dual requirements using OR (|)
data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector
data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams
If you don't want to use the dplyr package, you can accomplish the same thing with the basic functionality of R. Method 1 uses a boolean vector to select rows for the subset. Method 2 uses the which() function, which finds the indexes of the TRUE values in a boolean vector. Both of these techniques use row indexes into the original data frame to create the subset.
The subset() function works much like the filter() function, except the syntax is slightly different and you don't have to install a separate package.
Efficiency
While these methods produce the same result, they don't perform the same way. Some data manipulation might only happen once or a few times throughout a project, but many projects require constant subsetting, possibly inside a loop. So while the gains might seem insignificant for one run, multiply that difference and it adds up quickly.
I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.
Subset Method  | Elapsed Time (sec)
boolean vector | 0.87
which()        | 0.33
subset()       | 0.81
dplyr filter() | 0.21
The dplyr filter() function was by far the quickest, which is why I prefer to use it.
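If you want to run a similar comparison yourself, a sketch along these lines works [the data frame here is simulated rather than the FanGraphs set, and exact timings will vary by machine]:

library(dplyr)

#builds a simulated 500,000-row data frame
n <- 500000
sim <- data.frame(Team = sample(c('Marlins','Mets','Braves','Phillies'), n, replace = TRUE),
                  HR = sample(0:50, n, replace = TRUE))

#times the same subset using each method
system.time(sim[sim$Team == 'Marlins' & sim$HR > 30, ])        #boolean vector
system.time(sim[which(sim$Team == 'Marlins' & sim$HR > 30), ]) #which()
system.time(subset(sim, Team == 'Marlins' & HR > 30))          #subset()
system.time(filter(sim, Team == 'Marlins', HR > 30))           #dplyr filter()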
The full code I used to write up this tutorial is available on my GitHub.
References:
Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html