An animation of theme parks opening around the world


I’ve been collecting a lot of data to be able to do my last few posts, and I’d mentioned that I wanted to try more with time series data. A few years ago I got to sit in a lecture at Queensland University of Technology by Sudipto Banerjee on Bayesian spatiotemporal modelling. At the time the material was way too advanced for me, but the idea of analysing data points with time and space treated correctly has always stuck.

As I dug into the different things I could do with spatiotemporal data, I realised that I needed a much better understanding of the data itself before I could do fun, tricksy things with it. I needed something that would maintain my interest but also force me to mess around munging spatiotemporal data.

An idea born of necessity

In the first year of my postgraduate research, I was really interested in data visualisations. Thankfully at the time a bunch of blogs like FlowingData were starting up, reporting on all types of cool data graphics. ‘Infographics’ also became a thing, threatening to destroy Data Science in its infancy. But what caught my eye at the time were the visualisations of flight paths like this one.

So now that I have some data and a bit of time and ability, I thought I’d try a more basic version of a spatiotemporal visualisation like this. My problem is that I hate installing extra one-time software for my whims, so the idea of using ImageMagick annoyed me. On top of that, when I tried it I couldn’t get it to work, so I determined to do what I could using base R and ggplot.

The result

This is probably the first article where I can say I’m pretty happy with the result:

movie

 

The first thing you can see is that Europe, not the US, is the true home of theme parks, with Tivoli Gardens appearing in 1843 and remaining in the top 25 theme parks since before there were 25 parks to compete against.

Beyond that, you can also sort of see that there is a ‘contagion’ effect among parks: when one opens in an area, others usually open nearby pretty soon. There are two reasons I can think of for this. First, once people are travelling to an area to go to a theme park, going to two theme parks probably isn’t out of the question, so someone’s bound to move in to capture that cash. Second, the people opening new parks have to learn to run theme parks somewhere, and if you’re taking a massive risk on opening a $100 million park with a bunch of other people’s money, you’ll want to minimise that risk by opening it in a place you understand.

Future stuff

Simply visualising the data turned out to be more than a data munging exercise for me – plotting this spatially as an animation gave some actual insights about how these things have spread over the world. It made me more interested in doing the spatio-temporal clustering as well – it would be really cool to do that then redo this plot with the colours of the points determined by the park’s cluster.

Another direction to explore would be to learn more about how to scrape Wikipedia and fill out my data table with more parks rather than just those that have featured in the TEA reports. I know this is possible and it’s not exactly new, but it has never crossed my radar, and web scraping is a pretty necessary tool in the Data Science toolkit.

What applications can you think of for this sort of visualisation? Is there anything else I could add to this one that might improve it? I’d love to hear your thoughts!

The code

Just in case you wanted to do the same, I’ve added the code with comments below. You’ll need to add your own file with a unique name, latitude, longitude and date in each row.
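As a rough guide to the input, here is a minimal sketch of a file in that shape. The column names park, lat, long and opened are the ones the code below refers to. The three rows, with approximate coordinates, are there purely for illustration and are not taken from my data file, and the file is written to a temporary path rather than a real location.

```r
# Illustrative input only: coordinates are approximate placeholders.
example <- data.frame(
  park   = c("TIVOLI GARDENS", "DISNEYLAND", "MAGIC KINGDOM"),
  lat    = c(55.67, 33.81, 28.42),
  long   = c(12.57, -117.92, -81.58),
  opened = c("1843-08-15", "1955-07-17", "1971-10-01"),
  stringsAsFactors = FALSE
)

# Write it out and read it back the same way the script below reads its file.
tmp <- tempfile(fileext = ".csv")
write.csv(example, tmp, row.names = FALSE)
info_example <- read.csv(tmp, stringsAsFactors = FALSE)

# as.Date() parses ISO "YYYY-MM-DD" strings without a format argument.
info_example$opened <- as.Date(info_example$opened)
str(info_example)
```

If your dates are stored in another layout, as.Date() will need a format argument to match.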

# Load the required libraries
library(ggmap)
library(ggplot2)
library(maptools)
library(maps)
library(data.table)

info <- read.csv("***.csv", stringsAsFactors = FALSE)

info <- info[complete.cases(info),]
setDT(info)
info$opened <- as.Date(info$opened) # S
setkey(info, park)
setkey(info, opened)


# Setup for an animation 
a_vec <- seq(1840, 2016, by = 1) # Create a vector of the years you will animate over

# Create a matrix to hold the 'size' information for the graph
B = matrix( rep(0, length(a_vec)*length(info$park)),
 nrow= length(a_vec), 
 ncol= length(info$park))

for (i in 1:ncol(B))
{
 for (x in 1: nrow(B))
 { # I want to have a big dot when it opens that gets gradually smaller,
   # like the alpha in the flights visualisation.
  open_date <- as.numeric(year(info$opened[i]))
  c_year <- a_vec[x]
  # If the park hasn't opened yet, give it no circle
  if (open_date > c_year)
  {B[x,i] <- 0} else
  # If the park is in its opening year, give it a big circle.
  if (open_date == c_year)
  {B[x,i] <- 10}
  }}

# Make the circle fade from size 10 to size 1, then stay at 1 until
# the end of the matrix

for (i in 1:ncol(B))
{ for (x in 2: nrow(B))
  {if (B[x-1, i] > 1){ B[x,i] <- B[x-1, i] - 1}else
   if(B[x-1, i] == 1){ B[x,i] <- 1}}}

B <- data.frame(B)
B <- cbind( a_vec, B)
setDT(B)
names(B) <- c("years", info$park) # Set the column names to the names of the parks

xxx <- melt(B, "years") # Convert to long format

# Create a table of locations
loc <- data.table("variable" = info$park,
                   "lat"= info$lat, 
                   "long"= info$long)

#Join the locations to the long table
xxx <- merge(xxx, loc, by = "variable", all.x = TRUE)
setkey(xxx, years)

# Create a ggplot image for each entry in the a_vec vector of years we
# made at the beginning.
for (i in 1:length(a_vec))
    {mydata <- xxx[years == a_vec[i]] # Only graph the rows for year i.
     mydata <- mydata[mydata$value != 0,] # Don't plot stuff not open yet.
     # Write the plot to a jpeg file and give it a zero-padded number so
     # the frames stay in order when sorted by file name.
     jpeg(filename =
     sprintf("~/chosenfolder/animation%03d.jpeg", i),
     width = (429*2), height = (130*2), units = "px")
     mp <- NULL
     # Plot a world map in grey and entitle it with the year.
     mapWorld <- borders("world", colour="gray50", fill="gray50")
     mp <- ggplot() + mapWorld + theme_bw() + ggtitle(a_vec[i])
     # Add the points on the map, using the size vector we spent all that
     # time building matrices to produce.
     mp <- mp + geom_point(aes(x=mydata$long, y=mydata$lat),
     color = "orange", size = mydata$value/1.5) + ylim(c(0, 60))
     plot(mp)
     dev.off()
}
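
The two nested loops that fill the size matrix can also be written in one step. What follows is a sketch of an alternative, not the code that produced the animation. It assumes the same rule as above (size 10 in the opening year, shrinking by one each year down to 1, and 0 before the park opens) and uses three opening years picked for illustration. fade_sizes and its arguments are hypothetical names of my own.

```r
# Hypothetical helper: size of each park's dot in each year.
# Rows are years and columns are parks, matching the layout of B above.
fade_sizes <- function(years, open_years, max_size = 10, min_size = 1) {
  age <- outer(years, open_years, "-")     # years since opening (negative = not open yet)
  sizes <- pmax(max_size - age, min_size)  # shrink by 1 per year, never below min_size
  sizes[age < 0] <- 0                      # no dot before the park opens
  sizes
}

years <- 1840:2016
open_years <- c(1843, 1955, 1971)          # illustrative opening years
B_alt <- fade_sizes(years, open_years)

# Sizes for the first park in the years around its opening
B_alt[years %in% 1842:1845, 1]
```

The result has one row per year and one column per park, so it could stand in for B before the column names are set.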

 

Theme park ranks over ten years

I’m interested in understanding the competitive landscape of theme parks, and showing their ranks from year to year is a good way of seeing this. The best way I know of is to use everybody’s favourite chart – the bumps chart!

What’s a bumps chart?

This was invented in Cambridge to keep track of one of the most mental sporting events you’ll ever see – the May Bumps.

The May Bumps (credit Selwyn College)

In true Cambridge style, the May Bumps are a series of rowing races held every June. Timing aside, the races involve all the college rowing teams (usually around 20 of them at once) racing down the River Cam at high speed, trying desperately to run into (or ‘bump’) each other. If a crew catches up to the one in front, both crews pull over, and in the next race they swap starting positions. This means that over a week a crew can move from the front to the back of the race, and this tells the story of that year’s Bumps. The original bumps chart hangs in the Cambridge University Union building.

A bumps chart of a May Bumps series, showing Oriel winning the competition.

Results

The bumps chart I created was based on the Themed Entertainment Association (TEA) reports published online each year since 2006. The data were read into R, and I used the ggplot2 package to draw a line plot of visitor numbers over the years. The directlabels package was used for the labels.

bumps

CODE  PARK                         CODE     PARK
BLPB  BLACKPOOL PLEASURE BEACH     NASL     NAGASHIMA SPA LAND
BUSG  BUSCH GARDENS                OCEP     OCEAN PARK
CHOK  CHIMELONG OCEAN KINGDOM      OCTE     OCT EAST
DEEF  DE EFTELING                  PLEB     PLEASURE BEACH
DIAK  DISNEY ANIMAL KINGDOM        PORA     PORT AVENTURA
DICA  DISNEY CALIFORNIA ADVENTURE  SEAW     SEAWORLD
DIHS  DISNEY HOLLYWOOD STUDIOS     SEAF     SEAWORLD FL
DISN  DISNEYLAND                   SONP     SONGCHENG PARK
DISP  DISNEYLAND PARIS             SORP     SONGCHENG ROMANCE PARK
EPCO  EPCOT                        TIVG     TIVOLI GARDENS
EURP  EUROPA PARK                  TODS     TOKYO DISNEY SEA
EVER  EVERLAND                     TOKD     TOKYO DISNEYLAND
HOKD  HONG KONG DISNEYLAND         UNSF     UNIVERSAL STUDIOS FL
ISOA  ISLANDS OF ADVENTURE         UNSH     UNIVERSAL STUDIOS HOLLYWOOD
KNBF  KNOTTS BERRY FARM            UNSJ     UNIVERSAL STUDIOS JAPAN
LOTW  LOTTE WORLD                  WDSPADP  WALT DISNEY STUDIOS PARK AT DISNEYLAND PARIS
MAGK  MAGIC KINGDOM                YHSP     YOKOHAMA HAKKEIJIMA SEA PARADISE
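
For anyone wanting to try the general approach, here is a minimal sketch of turning yearly visitor numbers into ranks and drawing them as a bumps chart. It is not the code behind my chart: the park codes, years and visitor figures are invented, the column names are assumptions of mine, and the sketch plots ranks rather than raw visitor numbers. The plotting step is skipped if ggplot2 is not installed, and the direct labelling I did with directlabels is left out to keep things short.

```r
# Invented example data: three placeholder parks over three years.
visits <- data.frame(
  park     = rep(c("AAAA", "BBBB", "CCCC"), times = 3),
  year     = rep(2013:2015, each = 3),
  visitors = c(10, 8, 6,    # 2013
               9, 10, 7,    # 2014
               8, 11, 9)    # 2015
)

# Rank within each year: 1 = most visitors.
visits$rank <- ave(-visits$visitors, visits$year, FUN = rank)

# One line per park, with a reversed y axis so first place sits at the top.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(visits, aes(x = year, y = rank, colour = park, group = park)) +
    geom_line() +
    geom_point() +
    scale_y_reverse(breaks = 1:3) +
    scale_x_continuous(breaks = 2013:2015) +
    theme_bw()
  print(p)
}
```

Reversing the y axis is what makes it read like a bumps chart: a crew (or park) moving up the page is moving up the rankings.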

There are a few really noticeable things when we plot out the ranks of parks this way. The first is that Disney dominates the industry, and they run a tight ship. Their parks don’t compete with each other for audience, and they don’t tend to move up and down relative to each other.

The second noticeable thing about the plot is the recent rise of Universal through the ranks, to finally crack the Disney lockout. This probably explains the buzz within Comcast (Universal’s owners) at the moment, and all their talk about an aggressive growth strategy.

Finally, we can see really clearly here that the Asian parks, particularly the Chinese ones, are staking a claim as mega players in the industry. The Songcheng and Chimelong mega parks in particular are growing at an incredible rate and are showing no signs of stopping. If the trend continues, it is very possible that our children will be pleading with us to take them to China for the rides.

Future stuff

There are a whole lot of problems here around missing data. In particular we only get the top 20 to 25 parks each year, and TEA only recently started publishing year-to-year figures, so the data are really patchy for some parks. On the other hand, in the true spirit of Data Science, the missingness could probably be used to tell us something as well if we could derive any meaning from the patterns of dropping in and out of the top 25.

I’d also be really interested to aggregate the data in different ways to see other patterns in the rankings. We could aggregate parks by location to see which areas are most popular at the moment, or we could aggregate by owner to look at who’s actually performing the best on a budget level. Looking at ownership companies brings whole new dimensions to the data. For example, none of the Merlin Entertainments parks feature in the top 25, yet they have appeared in the top ten entertainment companies by income for the last ten years.
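
As a sketch of what the aggregation by owner might look like, assuming a long table with one row per park per year plus an owner column (an assumption about the data layout, not something I have built yet), base R’s aggregate() does most of the work. All names and numbers below are placeholders.

```r
# Placeholder data: park codes, owner labels and visitor counts are invented.
parks <- data.frame(
  park     = c("AAAA", "BBBB", "CCCC", "AAAA", "BBBB", "CCCC"),
  owner    = c("Owner 1", "Owner 1", "Owner 2", "Owner 1", "Owner 1", "Owner 2"),
  year     = c(2014, 2014, 2014, 2015, 2015, 2015),
  visitors = c(10, 8, 12, 11, 9, 12)
)

# Total visitors per owner per year.
by_owner <- aggregate(visitors ~ owner + year, data = parks, FUN = sum)

# Rank owners within each year: 1 = most visitors.
by_owner$rank <- ave(-by_owner$visitors, by_owner$year, FUN = rank)
by_owner
```

Swapping the owner column for a region or country column would give the aggregation by location in the same way.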

Do you think Universal can continue its rise? Will the Chinese parks continue to grow to be larger than the mighty Magic Kingdom, or will Disney retain its seat as the unchallenged leader?