Category Archives: coding

D3 Visualization Basics — First Steps

D3 visualizations work by manipulating elements in the browser window. This short tutorial will demonstrate the very basics of that. This is also a working, simple demonstration of the interplay of HTML, CSS and JavaScript from the introduction page in this D3 tutorial set.

For the sake of making this simple, everything will come from one HTML document, which can be found in my GitHub . This will container the HTML and JavaScript. You can [and should] separate the JavaScript into its own file on bigger projects.

In this small project, we will start with a simple div container with the class of “container”.

Right now this doesn’t do anything, so it’s not worth showing. But if D3 code is added to create a blue box, it might look a little more interesting. [Provided you find blue boxes interesting.]

Blue Block


Above is a simple example of a basic D3 procedure. It can helpful to think about this command as having two parts: getting a DOM [browser] element to manipulate and giving instructions for those manipulations. Here is what the code is doing:

  1. The select statement [d3.selectAll()] code is finding every instance of the class of “container”. [There is only one element on the page.]
  2. The append() method adds another div as a child element inside of the container div.
  3. The attr() method gives the new div a class of “new-block”.
  4. The style() method gives the new div several style properties
  5. The text() method puts the text “Blue block” into the div

The style() and attr() methods can be a little confusing since they do similar things, but attr() will place attributes in the HTML tag, while style() writes in-line CSS for the element, which will override any CSS stylesheet you load in the header. These are just a few of the different methods you can use for a D3 DOM selection object, and you can find more in the D3 API reference.

Blue Block with Functions

Creating the blue block was a blast, but adding some functionality might make this a little more useful. Let’s make the block change color on a click and then remove it on a double click.


The first part of the script to create the blue block is the same, except I’ve doubled the code to have to blue blocks and I added new code that interacts with the blue block. The on() method allows you to attach an event listener to elements rendered in the browser. These will wait until a certain event happens. In the example it executes a function to turn the box orange when the blue box is clicked. You can put many different instructions in here and aren’t limited to manipulating the element that is being clicked. Below is an illustration that I find useful for visualizing how event listeners are attached. I will devote another post to the details of D3 event listeners.

attach event listeners

You might notice the d3.select(this) in the code. this is a fun keyword in JavaScript syntax which deals with scope, and the this in the code refers to the specific DOM element which was clicked or double clicked. If you click on the left block, only that block turns orange. You could change the code replace this with '.block-new' and clicking one button will change both buttons to orange.

Having the d3.select(this) code blocks in the event listener function makes it so they are only executed when the event happens. The block’s background color is changed to orange [#FF9821] when it’s clicked. The remove() method deletes any DOM elements within the selection including the children elements. This comes in handy when you need to rebuild or update a data visualization.

[Next]

Data! D3’s most powerful tool.

D3 Visualization Basics — Introduction

Data visualization is important, really important. I can’t be more blunt than that. We are able to process much more information faster by seeing a visual representation than we could look at a table, database or interacting with a spreadsheet. I will be writing a series of posts that explore some of the foundations D3 is built on along with how to create engaging data visualizations using it.

D3 is a powerful tool that allows you to create interactive data visualizations for the web. Understanding how D3 works starts with understanding how modern web pages are designed.

If you have found this page, you probably at least have some knowledge of how to make a modern website: HTML, CSS, JavaScript, responsive design, etc. D3 uses basic elements from these components of web design to create the visualizations. This by no means the only way to create interactive visualizations, but this is an effective way to produce them.

Before jumping into D3 nuts and bolts, let’s look at what does each of these components do. [If you already know this stuff, feel free to skip ahead…once I get the other posts built out.]

Basic Web Programming

In the most simplistic terms, HTML provides the structure of the webpage, CSS provides the styling and formatting, and JavaScript provides the functionality of the site. The browser brings these three components together and interprets them into something the end user (you) can understand and use. Sometimes one component can accomplish what the other does, but if you stick to this generalization you’ll be in good shape.

To produce a professional-looking, fully-functional D3 data visualization you will need to understand, write and manipulate all three components.

HTML

The most vivid memories I have of HTML is from the websites of the late 90s: Geocities, Angelfire, etc. HTML provides instructions on how browsers should interpret information; it organizes the information. Everything you see on a webpage has corresponding HTML code.

If you look at the source HTML or inspect one of this site’s page you’ll see some of the structure. When HTML renders in the browser these elements are referred to DOM elements. DOM stands for Document Object Model, which is the structure of a webpage.

HTML containers

Looking at the DOM tree you can see the many of the div containers that provide structure for the how the site is laid out. The p tags contain each paragraph in the content of my posts. h1, h2 and h3 are subheadings I’ve made to make the post more organized. You also notice some attributes especially class which have many uses for CSS, JavaScript and D3. Classes in particular are used to identify what function that DOM element plays in JavaScript or how to style it in CSS.

CSS

A house without painted walls or decorations is pretty boring. The same thing happens with bare bones HTML. You can organize the information, but it won’t be in an appealing format.

Most sites have style sheets (CSS) which sets margins, colors, display options, etc. Style sheets have a specific syntax which identifies HTML elements by type, class or id. This identification and selection concept is used extensively in D3.

CSS example

Above is some CSS from this site. It contains formatting instructions for elements of the class “page-links”. It includes instructions for the font size, margins, height, width and to make the text all uppercase. The advantage of CSS is that it keeps formatting away from the structure of the HTML allowing you to format many elements at once. For example if you wanted to change the color of every link, you could easily do that by modifying the CSS.

There is an alternative to using CSS style sheets and that’s by using inline style definitions.

Inline styles use the same markup as the CSS in the style sheets. Inline styles

  • control only the element they are in
  • OVERRIDE any CSS styles [without an !important tag]

The code above overrides the normal paragraph’s style property which aligns it left. Using inline styles are generally bad for web design, but it’s important to understand how they work since D3 manipulates inline styles often.

JavaScript

JavaScript breathes life into your web page. It certainly not the only way to have your website become interactive or build programming into it, but it is widely used and supported in the popular browsers. D3 is a JavaScript library, so you will inevitably have to write JavaScript to use it.

For D3 visualization, JavaScript will be use to

  • Manage and manipulate data for the visualization
  • Create DOM elements
  • Manipulate DOM elements
  • Destroy DOM elements
  • Attach data to DOM elements

JavaScript will be used insert elements onto the page, it will also be used to change colors and styles of those elements. You might be able to see how this could be useful. For example JavaScript could map data points to an element’s position for a scatter plot or to an element’s height or width for a bar chart.

I bolded the last function D3 does, attaching data to elements, because it’s so critical to D3. This allows you to attach a data point beyond x, y data to allow for rich visualization.

__data__

Above is data attached to a D3 visualization I made for FanGraphs. This is a simple example, but I was able to attach data detailing the team’s name, id, league, ERA and FIP. Using the attached data I was able to create the graph and tooltips. More complex designs can take advantage of the robust data structure D3 provides.

[Next]

I’ll look at how to set up a basic project by organizing data, files and code.

Stattleship! Sport Stats API

I’ve been in contact with the team over at Stattleship. They have a cool API that allows you to get various stats for basketball, football and hockey. I used data from that API to create the following data visualization for their blog. The visualization shows the offensive and special team yards gained by each team remaining in the playoffs. The yardage is totaled for the entire season as well as the one playoff game each team played. I’ve displayed the points off of offensive TDs and special teams scoring, and that score is color coded with to wins and loses. A black background is a win, and a white background is a loss.

R Bootcamp: Making a Subset

Data Manipulation: Subsetting

Making a subset of a data frame in R is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis, a data frame is the most common data storage object in R and subsets are a collection of rows from that data frame based on certain criteria.

Data Frame
V1 V2 V3 V4 V5 V6 V7
Row1
Row2
Row3
Row4
Row5
Row6

Arrow

Subset
V1 V2 V3 V4 V5 V6 V7
Row2
Row5
Row6

The Data

For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s provided in my GitHub. This data set has players names, teams, seasons and stats. We are able to create a subset based on any one or more of these variables.

The Code

I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using filter() function from the dplyr package. All of these functions are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, so let’s start with that. This is especially true when you have to loop an operation or run something that will be run repeatedly.

dplyr

The filter() requires the dplyr package to be loaded in your R environment, and it removes the filter() function from the default stats package. You don’t need to worry about but it does tell you that when you first install and load the package.

#install.packages('dplyr')
library(dplyr) #load the package

#from http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2015&month=0&season1=2010&ind=1&team=&rost=&age=&filter=&players=&page=2_30
setwd('***PATH***') 
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data

Aside from the loading the package, you'll have to load the data in as well.

#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')

#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East

#Both of these find players in the NL East and have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign

#Finds players in the NL East or has more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)

#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)

The filter() function is rather simple to use. The examples above illustrate a few simple examples where you specify the data frame you want to use and create true/false expressions, which filter() uses to find which rows it should keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.

Built-in Functions

#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]

#method 3 -- subset()
data.sub.3 <- subset(data,subset = (Team=='Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than

data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #duel requirements using AND (&)
data.sub.7 <- data.sub3 <- subset(data, subset = (AVG > .300 & PA > 600)) #using subset()

data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #duel requirements using OR (|)

data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector

data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams

If you don't want to use the dplyr package, you are able to accomplish the same thing uses the basic functionality of R. #method 1 uses a boolean vector to select rows for the subset. #method 2 uses the which() function. This function finds the index of a boolean vector of True values. Both of these techniques use the original data frame and uses the row index to create a subset.

The subset() function works much like the filter() function, except the syntax is slightly different and you don't have to download a separate package.

Efficiency

While subset works in a similar fashion, it doesn't perform the same way. While some data manipulation might only happen once or a few times throughout a project, many projects require constant subsetting and possibly from a loop. So while the gains might seem insignificant for one run, multiply that difference and it adds up quickly.

I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.

Time to Subset 500,000 Rows
Subset Method Elapsed Time (sec)
boolean vector 0.87
which() 0.33
subset() 0.81
dplyr filter() 0.21

The dpylr filter() function was by far the quickest, which is why I prefer to use it.

The full code I used to write up this tutorial is available on my GitHub .

References:

Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

R Bootcamp — A Quick Introduction

R, a statistics programming language environment, is becoming more popular as organizations, governments and businesses have increased their use of data science. In an effort to provide a quick bootcamp to learn the basics of R quickly, I’ve assemble some of the most basic processes to give a new user a quick introduction to the R language.

This post assumes that you have already installed R and have it running correctly on your computer. I recommend getting RStudio to use to write and execute your code. It will make your life much easier.

Getting Started

First R is an interactive programming environment, which means you are able to send commands to its interpreter to tell it what to do.

There are two basic methods to send commands to R. The first is by using the console, which is like your old-school command line computing methods. The second method is more typically used by R coders, and that’s to write a script. An R script isn’t fancy. At its core it’s a text document that contain a collection of R commands. Then when the code is executed it is treated like a collection of individual commands being feed one-by-one into the R interpreter. This differs on how other, more fundamental programming languages work.

R - How R Works

Basics

Comments are probably the best place to start, especially because my code is chock-full of them. A comment is code that is fed to R, but it’s not executed and has no bearing on the function of your script or command. In R comments are lines prefaced with a #.

#comments start with #-signs

9-3 #basic math (this doesn't save this in a variable)

One of the most basic thing you could use R for is a calculator. For instance if we run the code 9-3, R will display a 6 as the result of that code. All of this is rather straight forward. The only operator you might not be familiar with if you are new to coding is the modulus operator, which yields the remainder when you divide the first number by the second. This gets used often when dealing with data. For example, you can get a 0 for even number and 1 for odd number if you take you variable use the modulus operator with the number 2.

#basic operations
#yields numeric value
1+2 #addition
3-2 #subtraction
3*2 #multiplication
4/5 #division
3 %% 2 #modulus (remainder operator)

Beyond the basic math and numeric operations you can do, R has several fundamental data types. NULL and NA are representative of empty objects or missing data. These two data types aren’t the same. NA will fill a position in an vector or data frame. The details are best left for another entry.

#basic data structure

NULL     #empty value
NA       #missing value
9100     #numeric value
'abcdef' #string
TRUE     #boolean
T        #equilvant form
FALSE 
F

Numeric values can have mathematical operations performed on them. Strings are essentially non-numeric values. You can’t add strings together or find the average of a string. In any type of data analysis, you’ll typically have some string data. It can be used to classify entries in categorically such as male/female or Mac/Windows/Linux. R will treat these like factors.

Finally, boolean values (True or False) are binary logical values. They work like normal logic operations you might have learned in math or a logic class with AND (&&) and OR (||) operators. These can be used in conditional statements and various other data manipulation operations such as subsetting.

#logical operators
T && F #and
F || T #or

Now that we covered the basic operations and data types, let’s look at how to store that — variables. To assign a value to a variable it’s rather easy. You can use a simple equation or the traditional R notation using an arrow.

x <- 1 #basic assignment
x = 1

#####Acceptable Variables
x <- 1
X <- 1
X1 <- 1
X.1 <- 1
X_1 <- 1
########################

####UNACCEPTABLE Variables
1X <- 1
X-1 <- 1  #CODE WILL NOT WORK
X,1 <- 1
#######################

Variables must begin with a letter and they are case-sensitive. Periods are acceptable faux separators in variable names, but that doesn’t translate to other programming languages like Python or JavaScript, so that might factor in how you establish naming conventions.

I’ve mentioned vectors a few times already. They are an important data structure within R. A vector is an ordered list of data. Typically, thought of as numeric data, but character (string) vectors are often used in R. The c() operator can create a vector. It’s important that vectors contain the same type of data: boolean, numeric or character. If you mix types it will force values into another type. And you can assign your vectors to variables. In fact, you can store just about any thing in R to a variable.

x.vector <- c() #vector operator
x.vector <- c(1,2,3,4,5,6) #creates a vector (typically numeric)
x.vector <- c(T, 1)
x.list <- list('A',12,'b') #creates a list (not used for numeric operations)

mean(c(1,3,2))

Lists are created with the list() command. They are used more for storage and organization than for data structure. For example you could store the mean, median and range for a set of data in a list. A vector would house the data used to calculated said summary stats. Lists are useful when you begin to write bigger programs and need to shuffle a lot of things around.

The basic statistic operators are listed below. All of these require a vector to operate on.

#basic stats
x.vector <- c(10,11,12,12,10,11,20,9) #puts your data into a vector

mean(x.vector) #takes mean of vector
median(x.vector) #median of vector
max(x.vector) #maximum of vector
min(x.vector) #minimum of vector
range(x.vector) #yields a vector with a range

sd(x.vector) #standard deviation
var(x.vector) #variance

Handling Data

Above we discussed some of the building blocks of basic analysis in R. Beyond introductory Statistics classes, R isn’t very useful unless you can import data. There are many ways to do this since data exists in many different formats. A .csv file is one of the most basic, compatible way data is stored to be used between different analytical tools.

Before loading this data file into R, it’s a good idea to set your working directory. This is where the data file is stored.

#load in data
setwd('**folder path**') #sets your working directory 
                                         #specific to each computer

Next you can use the read.csv() function to ingest a .csv file into R. This call won’t save the data in a variable, it just brings it in as a data frame and show it to you.

read.csv('data_bryant_kobe.csv') #reads the data into R
                                 #does not save it into a variable

data <- read.csv('data_bryant_kobe.csv') #reads the data and saves it
                                         #into a variable called 'data'

Data frames are the primary form of data structure you’ll encounter in R. Data frames are like tables in Excel or SQL in that they are rectangular and have a rigid schema. However, at a data frame’s core are a collection of equal-length vectors.

If you assign the data frame output of the read.csv() function to a variable, you can pass around the data frame to different data manipulation functions or modeling functions. One of the most basic ways to manipulate the data is to access different values within the data frame. Below are several different examples on how to get to values, rows or columns in a data frame.

The basic concept is that data frames can be accessed by row and column number. [row, column] And that an entire row or column can be accessed by omitting the dimension you aren’t trying to retrieve. You can retrieve individual fields (variables) by using the $ sign and using the variable name. This is the method I use most often. It requires you knowing and using the name of the variables, which can make your code easier to read.

#accessing values
row <- 3
column <- 2
data$Age #returns the Age variable column
data[row,column]  #individual value
data[data$Age <= 25,]        #returns entire row
data[,column]     #returns entire column as a vector
data$Age[3]       #returns entire column as a vector

By accessing rows, you can create a subset of data by using a logical argument to filter out your data set.

#creating a quick subset
data.U25 <- data[data$Age < 25,]  #creates an under-25 set
data.O25 <- data[data$Age >= 25,] #creates a 25 and older set

The code above creates two new data frames which separate Kobe Bryant’s season stats into an under-25 data set and a 25 and under data set.

Relationships Between Variables

Correlation is often used to summarize the linear relationship between two variables. Getting the correlation in R is simple. Use the cor() function with two equal length vectors. R uses the corresponding elements in each vector to get a Pearson correlation coefficient.

#correlation between different variables within the subset
cor(data.U25$MP, data.U25$PTS)
cor(data.O25$MP, data.O25$PTS)

cor(data$MP, data$PTS) #correlation with two related vectors from data set
cor(data$MP, data$FTpct)

A simple linear model can be made by using the lm() function. The linear model function requires two things: a formula and a data frame. The formula uses a tilde (~) instead an equal sign. The formula represent the variables you would use in your standard ordinary least squares regression. The data parameter is the data frame which contains all the data.

#create a basic linear model
linear.model <- lm(PTS ~ MP, data=data.U25)
linear.model <- lm(PTS ~ MP + Age, data=data.U25)

The summary() function will take the linear model object and displays information about the coefficients of your linear model.

summary(linear.model)

NOTES: The data set used in this tutorial is from basketball-reference.com.

The full code I used in this tutorial can be found on my GitHub .