# Where Do People Tweet?

This is a representative map of Twitter activity from 11am to 11pm EDT yesterday.

# Predicting Baseball Wins with WAR

There is a lot of debate about the usefulness of the comprehensive baseball statistic WAR (Wins Above Replacement). I don’t think that WAR is the be-all, end-all statistic, but it is a useful tool. Why? Because it describes relatively accurately how a player contributes to a team, and it can help fans understand the real impact of a single player. I might have to refer people here once people start clamoring that one player will change the direction of a team at the trade deadline.

If anyone wants a primer on the details of what goes into the WAR stat, check out baseball-reference.com’s comparison between systems. Basically, WAR is the number of statistical wins a player is responsible for above a replacement player; in theory, the replacement is a mediocre AAA player who is not a prospect. The statistic is a middle estimate of a player’s impact: a player can be ‘responsible’ for more wins than his WAR number, but also drastically fewer. Think of WAR as the average number of wins he’s responsible for.

For probably over a year, I’ve wanted to see whether WAR can actually predict the number of wins a team will have. I forget my original methods for trying to determine this, but this time around I used FanGraphs’ WAR numbers for both pitching and batting from the last decade of seasons for all 30 teams. That’s 300 data points. After assembling the data and running it through a basic linear regression, I was quite happy with what I saw. I’ve heard that if you add 48 to a team’s WAR number you will get their total wins, and this can be seen mathematically by looking at real data.

I’ve graphed actual wins against WAR and actual wins against Pythagorean predicted wins for comparison. [Pythagorean wins performed better.] The linear regression for the WAR comparison turns out to be remarkably powerful. The regression coefficient is almost exactly one, meaning that each unit increase in WAR corresponds to an equal increase in wins. The y-intercept is +48.5, which means that for the last decade the number of theoretical replacement-level wins has been just about 48. This should make sense, since the calculation of WAR is calibrated to a 48-win replacement level. The actual implementation of WAR works really well to predict team wins. Unfortunately, the model has a 95% prediction interval of about 20 wins. That seems like a lot, but it shows how much luck has to do with a baseball season.
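
For anyone curious what that regression looks like mechanically, here is a minimal pure-Python sketch of ordinary least squares. The (WAR, wins) pairs below are made up for illustration, not the actual FanGraphs dataset:

```python
def linear_regression(xs, ys):
    """Ordinary least squares for one predictor.

    Returns (slope, intercept) fitted to y = slope * x + intercept.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical (team WAR, actual wins) pairs -- not the real 300-point dataset.
war = [30.0, 40.0, 50.0, 25.0, 45.0]
wins = [78, 89, 98, 74, 93]
slope, intercept = linear_regression(war, wins)
print(round(slope, 2), round(intercept, 1))  # 0.97 49.4
```

Even on toy numbers like these, you get the same shape of result: a slope near one and an intercept in the high 40s.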

Pythagorean wins are typically used to show how lucky a team has been in a given year. They are actually a slightly better predictor of a team’s success than WAR. There is less variance, since run differential is just one step removed from wins. You can see from the histograms that the spread of Pythagorean wins is smaller than for WAR. This also shows up in the r-squared for the linear regressions: the Pythagorean wins model has an r-squared of .87, while the WAR model has an r-squared of .77. This means that 87% and 77% of the variance, respectively, is explained by each model, indicating that Pythagorean wins are slightly more accurate. The trade-off is that WAR can give you player-level detail, while run differential is only available at the team level.
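
For reference, the classic Pythagorean expectation is simple to compute. This sketch uses the textbook exponent of 2 on made-up run totals; FanGraphs and others use tweaked exponents, so treat this as illustrative:

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2):
    """Pythagorean expectation: win% = RS^e / (RS^e + RA^e), scaled to a season."""
    rs_e = runs_scored ** exponent
    ra_e = runs_allowed ** exponent
    return games * rs_e / (rs_e + ra_e)

# Hypothetical season: 700 runs scored, 600 allowed.
print(round(pythagorean_wins(700, 600), 1))  # 93.4 expected wins
```

A team that actually won 98 games with that run differential would be flagged as lucky; one that won 88 would be flagged as unlucky.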

As always, let’s look at what the Pirates did.

A theme I always harp on is that the 2013 Pirates were good and really lucky. This can be seen in the data point for 2013 falling above the linear regression trend line. If you were wondering, 2012 and 2011 (the two ‘collapse’ years) also fall above this line. I don’t know if this is the best way to measure a collapse, but the in-season stats did indicate regression during all three seasons from 2011 to 2013.

# Twitter Analysis – Penguins Game 7

I’ve been listening to 93.7 The Fan while running the analysis for this, and I never realized that people can say the same thing over and over again in slightly different ways. Also, all tweets were captured AFTER THE CONCLUSION OF THE 1st PERIOD.

Everyone knows Twitter is the best venue to vent your anger about sports teams. I was able to use the statistical programming language R to scrape tweets that contained certain keywords or hashtags, put them in a database, and then flag the tweets containing certain keywords or collections of words. I had about 20 keywords, including: “penguins”, “pens”, “rangers”, “game 7”, and “firebylsma”. I also searched for the handles of the local hockey writers, because a lot of people reply to the sports writers during the game with their own opinions.
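
I did this in R, but the flagging idea is simple enough to sketch in a few lines of Python. The keyword and swear lists here are stand-ins, not my actual lists:

```python
# Stand-in keyword lists -- the post mentions about 20 keywords in total.
KEYWORDS = ["penguins", "pens", "rangers", "game 7", "firebylsma"]
SWEAR_WORDS = ["damn", "hell"]  # placeholders for the real swear list

def flag_tweet(text, terms):
    """Return True if any term appears in the tweet (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

tweets = [
    "Game 7 tonight, let's go Pens!",
    "Damn power play...",
    "Nice weather today",
]
matched = [t for t in tweets if flag_tweet(t, KEYWORDS)]
swearing = [t for t in tweets if flag_tweet(t, SWEAR_WORDS)]
print(len(matched), len(swearing))  # 1 1
```

Substring matching is crude (it would flag “strangers” for “rangers”), which is one reason a real pipeline would use word-boundary regexes instead.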

The first graph has the total number of tweets that I scraped and the tweets that I flagged as ‘swearing’. For the most part, I feel that if someone swore in a tweet, it indicates anger or at the very least aggressiveness. As mentioned before, these graphs begin after the 1st period ends. Tweets containing the keyword ‘rangers’ have been filtered from this first graph to include a greater proportion of Penguins fans.

The quickest and most basic analysis is the number of tweets as a time series. As soon as you look at the time-series line graph, you can tell when the game ended [9:41 PM]. It’s like Mt. Everest in the graph. Looking closer, you can see the spikes where each team scored. An interesting occurrence happened right before the end of the game: Twitter got quiet. The tweets per minute dropped below 500 right before exploding to a few thousand per minute. I attribute this silence to people actually watching the game during the tense last minute.
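
The binning behind a tweets-per-minute time series is straightforward; here is a sketch with hypothetical timestamps:

```python
from collections import Counter
from datetime import datetime

def tweets_per_minute(timestamps):
    """Bucket tweet timestamps into per-minute counts by zeroing seconds."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

# Hypothetical timestamps around the end of the game.
stamps = [
    datetime(2014, 5, 13, 21, 41, 5),
    datetime(2014, 5, 13, 21, 41, 40),
    datetime(2014, 5, 13, 21, 42, 12),
]
counts = tweets_per_minute(stamps)
print(counts[datetime(2014, 5, 13, 21, 41)])  # 2 tweets in the 9:41 minute
```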

The tweets peaked about 2 minutes after the game ended, indicating a minimal lag which includes the time to pick up the smartphone, unlock it, and compose the tweet. However, the angry, swearing tweets peaked right as the game ended, indicating more visceral emotion instead of more thought-out, 140-character commentary. There are two severe dips that I can’t account for, at 9:49 PM and 10:04 PM. If anyone knows of something that occurred at those times, please let me know. Since there is a clear downward trend and the game was no longer being played, I am going to write those dips off as technical difficulties that kept a lot of tweets from being sent.

Since Twitter isn’t an invention specific to Penguins fans, I separated and compared two sets of tweets: one containing the word “penguins” and one containing “rangers”. It appears there are a lot more Rangers fans than Penguins fans, because the “rangers” tweets outnumber the “penguins” tweets at almost every point in time. The “penguins” tweets did spike when they scored their only goal. Interestingly enough, there was a lot of swearing right when that goal was scored. Penguins fans are just so angry!

Bottom line: calm down; it’s just sports.

# Probability and Sunday Night Baseball

There’s nothing I like more than a bases-loaded, no-outs situation in baseball. This might be my favorite situation, and it comes with a stat no one realizes: there’s around a 15% chance that the team with the bases loaded will not score at all that inning! 15% might not seem like much, but over the course of a season it happens often.

Let’s set the scene: Bottom of the ninth, down by two, the Pirates knock in a run and get McCutchen on 1st with no outs to move within one run of the Cardinals.

This is the win probability graph FanGraphs has for every game. I’m not entirely sure what all they consider when calculating win probability, but it mirrors the data I have, so there’s not much to discuss there. Clearly, the closest the Pirates came to winning the game was after Barmes walked, putting Alvarez, the winning run, on 2nd.

source: FanGraphs

According to my run probability calculations for 2013, the probability of scoring at least one run with the bases loaded and no outs was lower than for the Pirates batting with runners on second/third or first/third and no outs. [Probabilities with no outs: 123: 77.9%, 1_3: 82.4%, _23: 90.9%] The advantage of having the bases loaded is that a walk or HBP brings a runner home, but the downside is that there is an easy force at home. That hurt the Pirates in this instance, because Mercer didn’t hit the ball past the pitcher’s mound, making for an easy 1-2-3 double play.
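
If you want to play with these numbers yourself, the lookup is trivial. The values below are just the 2013 probabilities quoted above; every other base-out state is omitted:

```python
# Probability of scoring at least one run with no outs, keyed by runner state
# ('123' = bases loaded, '1_3' = first and third, '_23' = second and third).
# Values are the 2013 figures from the post; all other states are omitted.
P_SCORE_NO_OUTS = {
    "123": 0.779,
    "1_3": 0.824,
    "_23": 0.909,
}

def p_no_runs(state):
    """Chance the inning ends scoreless from this no-out runner state."""
    return 1 - P_SCORE_NO_OUTS[state]

print(round(p_no_runs("123"), 3))  # 0.221 -- bases loaded, nobody out, still scoreless ~22% of the time
```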

# Chicago Transit Authority — Ridership

Waiting for the break of day…oooOOOOO…25 or 6 to 4!
-Chicago (formerly The Chicago Transit Authority)

I was lucky to live in Chicago during the summer of 2012. The thing I miss most from Chicago is the transit system. Taking the ‘L’ to work every day was much more relaxing and interesting than having to drive in. No parking hassles or gas. It was great. Transit data is critical to making those systems more efficient. Fortunately, Rahm Emanuel is kind enough to release some of the transit data from the Chicago Transit Authority (CTA). The data only contain ridership-per-day information for each station, so I am limited in the insight this analysis can produce.

Before diving into the results of the descriptive analytics, let’s look at how the ‘L’ is designed. There are eight different lines, each designated by a color. All of the lines go through the downtown area called ‘The Loop’, named because the elevated track forms a loop around a large block of the city. Two main lines also go underground: the Red and Blue Lines. These two lines run all night and carry the most passengers. When Chicagoans ride the ‘L’, they swipe their pass at the entrance of a station, then board their desired train in either direction. Unlike transit systems like the Metro in DC, there is no exit swipe. So every data point in this post is a person swiping at a station to board a train; we can’t determine direction or destination.

There’s another problem: several stations service multiple lines. Clark/Lake has practically every line going through it. Without more resolution in how the data are measured, the most I can infer is which stations are the most popular on certain days. This relies on the assumption that if a person arrives at a station, they will leave from the same station.

This visualization looks a lot like a CTA map. I don’t have a good way to automatically draw the lines between the stations, but I think the location data attached to the station names does a good job of recreating the CTA map everyone is used to seeing. I’ve labeled any ‘L’ stations which service multiple lines in an order of importance. The priority is Red, Blue, Orange, Brown, Green, Purple, Pink, Yellow, in that order. The reasoning is that these are the largest or most popular lines, so those stations will have the majority of patrons using those lines.

From this map, Clark/Lake, the station with the most train lines, is the most popular. Terminuses (termini?) of the lines also see a lot of use. This helps visualize where the transfer points and the most popular entry points and destinations are. The Red Line and Blue Line have the most stops and the most ridership. Admittedly, this analysis has problems parsing Brown/Red Line customers, but there is higher ridership at the non-transfer stations of the Red Line, which confirms that more customers are using the Red Line in general.

I have a separate post with a chart of the ridership of every station. The chart is way too big to put in this post; you’d be scrolling for days. It’s worth checking out, though, to drill down into the details.

Ridership is rarely constant. In fact, ‘L’ ridership falls into three predictable groups of days: weekdays, Saturdays, and Sundays. The differences between these day-groups would effectively ruin any naive time-series analysis, because the average values of the three groups differ so much that any trends would be hard to spot; the graph would look very erratic. To account for this, the time-series graphs are split into those three groups.
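
Splitting the data into those day-groups is a one-liner on each date; here is a sketch with hypothetical ridership records:

```python
from datetime import date

def day_group(d):
    """Classify a date into the three ridership groups used in this post."""
    if d.weekday() < 5:       # Monday=0 ... Friday=4
        return "weekday"
    elif d.weekday() == 5:
        return "saturday"
    return "sunday"

# Hypothetical ridership records: (date, entries at a station).
records = [
    (date(2012, 6, 11), 15000),  # a Monday
    (date(2012, 6, 16), 9000),   # a Saturday
    (date(2012, 6, 17), 7000),   # a Sunday
]
groups = {}
for d, riders in records:
    groups.setdefault(day_group(d), []).append(riders)
print(sorted(groups))  # ['saturday', 'sunday', 'weekday']
```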

I can’t write a post without tying baseball into it, and this will be no exception, because in 2012 I spent a lot of time watching baseball games on the North Side and South Side of Chicago. Both teams are connected by the Red Line, and the stations are extremely close to the parks. So what would we expect to see? Baseball games have attendance ranging from 10k to 30k, so a game should present a spike in daily ridership. I graphed the daily ridership for the two stops adjacent to the ballparks and then labeled when there was a home or away game.

Any spike or dip not explained by baseball is labeled. St. Patrick’s Day produced a large spike all over the CTA system, but the spike at Addison is particularly high because of all the bars in the neighborhood. The largest non-baseball spike in the Addison station’s ridership came during the gay pride parade, since this is an incredibly popular event in a neighborhood near the station.

There is one anomaly I forgot to point out on the graph: a spike when the Cubs were away on Sept 8th. At first I thought this might have been Labor Day, but it turns out that The Boss was playing at Wrigley Friday night, and I remember walking by it late at night. So there’s a spike for the 7th (the day of the concert) and then a larger spike on the 8th, presumably because the concert ended near or after midnight, or people stayed after and drank in one of the fine establishments around the park. [I went to a Sox game earlier that day, so I accounted for a few data points that day.]

I arrived on June 12, 2012, right in the middle of a Cubs-Tigers series. Those were three really crowded games in Wrigleyville. Ridership at the two ballpark ‘L’ stations peaks during the summer, especially at Wrigley. You wouldn’t have guessed that the Sox were in contention for a division title until the last week of the season. The Cubs have a huge tourist draw, including me, because I went to at least a dozen games while I lived there, and you can see a surge during the summer (vacation) months. I would leave Chicago on October 20, 2012, and start out on #SeanTrek about 11 days later.

# Text Message Analytics — Numbers

People communicate a lot through text messages, and lucky for me, iPhones keep track of the text messages I’ve sent. iPhones store your text messages in a SQLite database, and this database is readily accessible in your iPhone backup on your computer. [This is why encrypting your backup might be a good idea if you have sensitive data.] I eventually want to perform some advanced text analytics to try to interpret the content of the messages, but this post is only going to look at the ‘numbers’ aspect of my text messages. All the numbers on the following pages include both sent and received texts, and exclude texts that I either deleted or that were deleted by the system. [I know I’ve deleted threads. I don’t think iOS deletes old messages, but it’s a possibility till I know otherwise.]
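
To show what querying that database looks like, here is a sketch that builds a tiny stand-in for the backup’s message database in memory. The real schema varies by iOS version, but a `message` table with text, date, and sent/received columns is the core of it:

```python
import sqlite3

# A tiny in-memory stand-in for the iPhone backup's SQLite message store.
# Column names here are illustrative, not guaranteed to match every iOS version.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (text TEXT, date INTEGER, is_from_me INTEGER)")
conn.executemany(
    "INSERT INTO message VALUES (?, ?, ?)",
    [("hey", 1000, 1), ("what's up", 1060, 0), ("lunch?", 2000, 1)],
)

# The simplest possible stats: counts of sent vs. received messages.
sent = conn.execute("SELECT COUNT(*) FROM message WHERE is_from_me = 1").fetchone()[0]
received = conn.execute("SELECT COUNT(*) FROM message WHERE is_from_me = 0").fetchone()[0]
print(sent, received)  # 2 1
```

Against a real backup you would open the actual database file instead of `:memory:` and inspect its schema first, since Apple has changed the column layout over the years.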

The simplest stat from text messages is how many I’ve sent/received per day or per week. The chart below has both. The notable trend is that more text messages have been sent/received the longer I’ve had my iPhones. I’d suggest this is a little biased, since I would be more likely to delete text threads that are much older, but I still think there would be a slight upward trend regardless.

I wanted to look at area codes just out of curiosity. I thought that I would have the most texts between me and 412 or 724 numbers. I’m a little surprised how many 412 numbers there are, given how many people I know living in the Pittsburgh suburbs, and I’m a 724. I think 412 is Allegheny County, while 724 is anything outside of that. I’m also a little surprised how close the count of traditional SMS text messages, which go through your carrier’s network as opposed to over the Internet, comes to the count of Internet-based messages, since most of my friends have iPhones.

The last chart is my favorite: a breakdown of how often I text during each hour of the day. I think this tells you something about my behavior, albeit nothing common sense won’t tell you. I generally text earlier in the morning (7am to 10am) more often during the work week than on the weekend. There’s a spike at 12PM (lunch time) and 9PM (making plans/socializing) on any day of the week. There’s virtually no texting between 4am and 6am. There are some texts that occur after the ‘Ted-Mosby-Hour’ of 2am, after which nothing good happens, but not a spike like there might have been in college.
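
The hour-of-day breakdown is just a histogram over the hour field of each timestamp; a sketch with made-up message times:

```python
from collections import Counter
from datetime import datetime

def texts_by_hour(timestamps):
    """Count messages per hour of the day (0-23)."""
    return Counter(ts.hour for ts in timestamps)

# Hypothetical message times.
times = [
    datetime(2014, 3, 3, 12, 5),   # lunch
    datetime(2014, 3, 3, 12, 40),
    datetime(2014, 3, 3, 21, 15),  # evening plans
    datetime(2014, 3, 4, 5, 30),   # rare early-morning text
]
hist = texts_by_hour(times)
print(hist[12], hist[21], hist[4])  # 2 1 0
```

A `Counter` returns 0 for missing hours, which conveniently handles the dead 4am-to-6am window without any special casing.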

‘Kids, if it’s after 2am, don’t text, just go home….and watch How I Met Your Mother.’

# Charlie Morton — PitchFX

I’m in a predictive modeling class for my grad program at NU, and we are learning a statistical programming language called SAS. One of the things we are trying early on is cluster analysis, to determine whether variables are related. I decided to play around with data that’s a little more interesting than housing prices. Charlie Morton has been one of my favorite pitchers to watch. His curveball is just sexy. Cluster analysis can help separate Morton’s pitches into different pitch types using the PitchFX data I’ve been scraping.

I’ve plotted two charts: one is vertical movement vs. release speed, the second is vertical movement vs. horizontal movement. [The movement parameters are calculated from the deviation of the ball from a straight, spinless path, and the horizontal movement is from the perspective of the catcher/batter, so imagine that Morton is throwing toward you.] Fastballs with backspin will have positive vertical movement; curveballs with topspin will have negative vertical movement. I used SAS to look at the speed, vertical, and horizontal movement and cluster similar pitches together. Without much tweaking, I was able to identify Morton’s fastballs and curveballs. He also has a third group, which is a splitter according to brooksbaseball.net.
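
I used SAS’s clustering procedures, but the underlying idea is easy to sketch. Here is a bare-bones k-means in Python on made-up (release speed, vertical movement) points; the initialization is naive, which is fine for a demo:

```python
def kmeans(points, k, iters=20):
    """Basic k-means on numeric tuples using squared Euclidean distance.

    Centers start at the first k points -- good enough for a sketch,
    not a robust initialization.
    """
    centers = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster.
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up (release speed mph, vertical movement in.) points:
# three fastball-like pitches and three curveball-like pitches.
pitches = [(93, 8), (94, 9), (92, 7), (78, -6), (77, -7), (79, -5)]
centers, clusters = kmeans(pitches, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Because fastballs and curveballs differ so much in both speed and vertical movement, even this crude version separates them cleanly, which mirrors how little tweaking the SAS run needed.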

Morton is famous for his sinker, a two-seam fastball that ‘sinks’ relative to a four-seam fastball thrown at the same angle. I’ve annotated the sinker on the vertical movement vs. release speed chart below. Morton’s sinker is hard to differentiate because it’s almost as fast as his four-seamer (low-90s). It doesn’t stay as high, due to its different spin compared to the four-seam fastball. The advantage is that a batter will swing as if to hit the four-seam fastball, but the sinker will be an inch or two lower than what the batter adjusted for. Since the bat is round, the ball will come off the bat at a low angle, and bam! Ground ball.

Brooksbaseball.net has updated and historical PitchFX data presented very nicely. I suggest checking them out if you want to see visualizations like this for other games or pitchers. Their visualization tools are easy to use and updated right after games end.

# #SeanTrek GeoTracks 2012

You might remember #SeanTrek: the 46-day, 12,000-mile, 34-state excursion I took at the very end of 2012. I didn’t know how I was going to use it at the time, but I geotagged just about everything I did on the trip. I checked in to every place on Foursquare and earned over 700 points in Portland and San Francisco, which is insane, because I checked in at just about everything I did and every place I went. On top of Foursquare, I geotagged every tweet I sent and every picture I took. As a result, I now have thousands of data points with both timestamps and location data.

The above map is what happens when you put all of them together: it outlines my entire trip! The denser the marks, the longer I stayed in one place exploring it. Sparse points mean I was driving a lot. You’ll find a lot of marks around Pittsburgh, Portland, SF, LA, Austin, and New Orleans, because I spent the most time there and didn’t drive much in most of those cities. I have a rather nice record of a long trip that didn’t require me to painstakingly record exactly what I did.

This map only has the geotag data and the type of media. I’m hoping to use the geotag data and the timestamps to get an average speed between pairs of points. I also want to geocode some tweets and photos that were not geotagged in 2012 by interpolating from the timestamps.
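
The interpolation I have in mind is simple: assume roughly constant speed between two geotagged points and place the ungeotagged item proportionally by timestamp. A sketch with hypothetical coordinates:

```python
from datetime import datetime

def interpolate_position(t, t0, p0, t1, p1):
    """Linearly interpolate a (lat, lon) position at time t between two
    geotagged points (t0, p0) and (t1, p1). Assumes roughly constant speed
    and a short enough hop that straight-line interpolation is reasonable."""
    frac = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    return tuple(a + frac * (b - a) for a, b in zip(p0, p1))

# Hypothetical example: an ungeotagged tweet halfway between two check-ins.
t0, p0 = datetime(2012, 11, 5, 12, 0), (40.44, -80.00)   # Pittsburgh-ish
t1, p1 = datetime(2012, 11, 5, 14, 0), (41.50, -81.70)   # Cleveland-ish
tweet_time = datetime(2012, 11, 5, 13, 0)
print(interpolate_position(tweet_time, t0, p0, t1, p1))
```

This breaks down if the route between the two points wasn’t a straight line or the stop pattern was uneven, but for filling in a handful of missing geotags it’s a reasonable first pass.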

Once I properly extract the data from the tweets, I can have hashtags or mentions searchable by frequency and location. I used #SeanTrek a lot more than any other hashtag on the trip. Though curiously enough the first tweet mentioning #SeanTrek is not geotagged. (technical glitch) Hopefully, I’ll get some more things mapped out in the future.