I’ve been collecting a lot of data to be able to do my last few posts, and I’d mentioned that I wanted to try more with time series data. A few years ago I got to sit in on a lecture at Queensland University of Technology by Sudipto Banerjee on Bayesian spatiotemporal modelling. At the time the material was way too advanced for me, but the idea of analysing data points with time and space treated correctly has always stuck with me.
As I dug into the different things I could do with spatiotemporal data, I realised that I needed a lot more understanding of the data itself before I could do fun tricksy things with it. I needed something that would maintain my interest, but also force me to mess around munging spatiotemporal data.
An idea born of necessity
In the first year of my postgraduate research, I was really interested in data visualisations. Thankfully at the time a bunch of blogs like FlowingData were starting up, reporting on all types of cool data graphics. ‘Infographics’ also became a thing, threatening to destroy Data Science in its infancy. But what caught my eye at the time were the visualisations of flight paths like this one.
So now that I have some data and a bit of time and ability, I thought I’d try a more basic version of a spatiotemporal visualisation like this. My problem is that I hate installing extra one-time software for my whims, so the idea of using ImageMagick annoyed me. On top of that, when I tried it I couldn’t get it to work, so I decided to do what I could using base R and ggplot.
The result
This is probably the first article where I can say I’m pretty happy with the result:
The first thing you can see is that Europe, not the US, is the true home of theme parks, with Tivoli Gardens appearing in 1843 and remaining in the top 25 theme parks since before there were 25 parks to compete against.
Beyond that, you can also sort of see that there is a ‘contagion’ effect of parks – when one opens in an area, there are usually others opening nearby pretty soon. There are two reasons I can think of for this. First, once people are travelling to an area to go to a theme park, going to two theme parks probably isn’t out of the question, so someone’s bound to move in to capture that cash. Second, the people opening new parks have to learn to run theme parks somewhere, and if you’re taking a massive risk on opening a $100 million park with a bunch of other people’s money, you’ll want to minimise your risk by opening it in a place you understand.
Future stuff
Simply visualising the data turned out to be more than a data munging exercise for me – plotting this spatially as an animation gave some actual insights about how these things have spread over the world. It made me more interested in doing the spatiotemporal clustering as well – it would be really cool to do that and then redo this plot with the colours of the points determined by the park’s cluster.
Another direction to explore would be to learn more about how to scrape Wikipedia and fill out my data table with more parks rather than just those that have featured in the TEA reports. I know this is possible and it’s not exactly new, but it’s never crossed my radar, and web scraping is a pretty necessary tool in the Data Science toolkit.
What applications can you think of for this sort of visualisation? Is there anything else I could add to this one that might improve it? I’d love to hear your thoughts!
The code
Just in case you wanted to do the same, I’ve added the code with comments below. You’ll need to add your own file with one row per park, holding a unique name, latitude, longitude and opening date; the code expects these columns to be called park, lat, long and opened.
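If you want to check the shape of your file before running the script, here is a minimal sketch. The two rows are made-up placeholders (not real park data), and the names sample_parks and csv_path are just for illustration. The one thing worth noting is that as.Date() with no format argument expects dates written as yyyy-mm-dd.

```r
# Hypothetical example rows: the names, coordinates and dates are placeholders.
sample_parks <- data.frame(
  park   = c("Example Park A", "Example Park B"),
  lat    = c(55.7, 33.8),
  long   = c(12.6, -117.9),
  opened = c("1843-08-15", "1955-07-17"),  # yyyy-mm-dd so as.Date() can parse it
  stringsAsFactors = FALSE
)

# Write the rows to a temporary CSV file
csv_path <- file.path(tempdir(), "sample_parks.csv")
write.csv(sample_parks, csv_path, row.names = FALSE)

# Read the file back the same way the script does and inspect the column types
check <- read.csv(csv_path, stringsAsFactors = FALSE)
check$opened <- as.Date(check$opened)
str(check)
```

If str() shows park as character, lat and long as numeric and opened as Date, the file should be in the right shape for the code that follows.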
# Load the required libraries
library(ggplot2)     # plotting, including borders() for the world map
library(maps)        # map polygons that borders() draws
library(data.table)  # setDT(), setkey(), setnames(), year() and melt()

# Read the park data: one row per park, with columns park, lat, long and opened
info <- read.csv("***.csv", stringsAsFactors = FALSE)
info <- info[complete.cases(info), ]
setDT(info)
info$opened <- as.Date(info$opened)
# setkey() sorts the table; after both calls it is ordered by opening date
setkey(info, park)
setkey(info, opened)

# Setup for an animation
a_vec <- seq(1840, 2016, by = 1)  # Create a vector of the years you will animate over

# Create a matrix to hold the 'size' information for the graph
# (one row per year, one column per park, starting at 0 = no circle)
B <- matrix(0, nrow = length(a_vec), ncol = length(info$park))

for (i in 1:ncol(B)) {
  # I want to have a big dot when it opens that gets gradually smaller,
  # like the alpha in the flights visualisation.
  open_date <- as.numeric(year(info$opened[i]))
  for (x in 1:nrow(B)) {
    c_year <- a_vec[x]
    if (open_date > c_year) {
      # If the park hasn't opened yet give it no circle
      B[x, i] <- 0
    } else if (open_date == c_year) {
      # If the park is in its opening year, give it a big circle.
      B[x, i] <- 10
    }
  }
}

# Make the circle fade from size 10 to size 1, then stay at 1 until the end of the matrix
for (i in 1:ncol(B)) {
  for (x in 2:nrow(B)) {
    if (B[x - 1, i] > 1) {
      B[x, i] <- B[x - 1, i] - 1
    } else if (B[x - 1, i] == 1) {
      B[x, i] <- 1
    }
  }
}

B <- data.frame(B)
B <- cbind(a_vec, B)
setDT(B)
setnames(B, c("years", info$park))  # Set the column names to the names of the parks

# Convert to long format, keeping the park names as plain text for the join
xxx <- melt(B, id.vars = "years", variable.factor = FALSE)

# Create a table of locations
loc <- data.table(variable = info$park, lat = info$lat, long = info$long)

# Join the locations to the long table
xxx <- merge(xxx, loc, by = "variable", all.x = TRUE)
setkey(xxx, years)

# Build the world map layer in grey once, then reuse it for every frame
mapWorld <- borders("world", colour = "gray50", fill = "gray50")

# Create a ggplot image for each entry in the a_vec vector of years we made at the beginning.
for (i in 1:length(a_vec)) {
  mydata <- xxx[years == a_vec[i]]       # Only graph the rows for year i.
  mydata <- mydata[mydata$value != 0, ]  # Don't plot stuff not open yet.

  # Write the plot to a jpeg file and give it a zero-padded number to keep the frames in order.
  jpeg(filename = paste0("~/chosenfolder/animation", sprintf("%03d", i), ".jpeg"),
       width = (429 * 2), height = (130 * 2), units = "px")

  # Plot the world map and entitle it with the year.
  mp <- ggplot() + mapWorld + theme_bw() + ggtitle(a_vec[i])

  # Add the points on the map, using the size vector we spent all that time building matrices to produce.
  # ylim() restricts the plot to latitudes 0 to 60.
  mp <- mp +
    geom_point(data = mydata, aes(x = long, y = lat),
               color = "orange", size = mydata$value / 1.5) +
    ylim(c(0, 60))
  plot(mp)
  dev.off()
}
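For what it’s worth, the two nested loops that build the size matrix boil down to a simple rule: no dot before the park opens, size 10 in the opening year, one size smaller each year after that, and never smaller than 1. Here is a minimal sketch of that rule as a single function. fade_size() is a hypothetical helper of mine, not part of the script above, and it only matches the loops for parks that open within the years being animated.

```r
# Hypothetical helper: dot size for a park in a given year.
#   year       - the year(s) being drawn
#   open_year  - the year the park opened
#   start_size - size in the opening year (10 in the script)
#   min_size   - size the dot settles at (1 in the script)
fade_size <- function(year, open_year, start_size = 10, min_size = 1) {
  age <- year - open_year
  # Not open yet: size 0. Otherwise shrink by 1 per year, but never below min_size.
  ifelse(age < 0, 0, pmax(start_size - age, min_size))
}

# A park that opened in 1843, drawn for the years 1840 to 1855
fade_size(1840:1855, 1843)
# 0 0 0 10 9 8 7 6 5 4 3 2 1 1 1 1
```

Written this way, changing how quickly the dots fade is a matter of changing start_size rather than editing the loops.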