A spatio-temporal clustering of the world’s top theme parks.

The ‘Mickey’s Friends’ or something show I saw at Magic Kingdom. My niece loves Elsa.

I’ve been playing a lot lately with my dataset of the locations, opening dates and other information about the top theme parks in the world, as measured by the Themed Entertainment Association. I mentioned that I wanted to try clustering the parks to see if I could find groups of them within the data. When I built my animation of theme parks opening, I thought there might be some sort of ‘contagion’ effect, where one park opening in an area increases the likelihood of another one opening in the same area within a short time. My idea is that companies and people try to reduce risk by opening parks in areas they understand, and that the risk they’re willing to take in a new market increases as time passes. This second idea comes from my contention that these companies are always trying to open new parks, but won’t do it if the current market is too competitive. Their two options are to build their share of the market they know, as they do in Florida, or to try to find a new market that looks promising. As the home market gets more and more competitive over time, those foreign markets start to look more attractive.

 

K-means clustering

K-means is currently one of the most popular methods for grouping data points of any type – so popular that the algorithm comes packaged in base R. In short, the model places your data points in a space with N dimensions, where N is the number of variables you’ve put into the model. This is pretty easy to imagine with two or three variables, but once we get to six or so (having used up colour, size and shape in a graph) you have to start bending your brain in some pretty specific ways to keep up.

Then it picks K points in that space (the ‘centroids’) so that the distance from each data point to its nearest centroid is as small as possible. The sum of the squared distances between the data points and their assigned centroids is called the ‘within Sum of Squares’, and is used to measure how well the centroids group the data points. (Here’s another explanation if you’re interested in the details.)

An example of centroids (represented as stars) grouping data points (represented as circles) on a two-dimensional surface.
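
To make that concrete, here’s a minimal sketch on made-up data (not the park dataset) of what base R’s kmeans() hands back:

# Two obvious groups of made-up points in two dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

fit <- kmeans(toy, centers = 2)
fit$centers       # the two centroids the algorithm settled on
fit$cluster       # which centroid each point was assigned to
sum(fit$withinss) # the total 'within Sum of Squares' used to compare models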

Choosing K: Slap that elbow

The main question K-means clustering asks you to answer before you start is how many clusters you want it to make. It sounds weird, but the best way I know to do this (without moving to Bayesian mixture modelling) is basically trial and error. Usually, with the Big Data I’m used to, this can take a long time unless you do some parallelisation work, but with only 30 or so entries it’s a pretty trivial task. So basically I run a loop in R building a cluster model with 2 to 15 clusters (more is kinda useless to me), measure the Within Sums of Squares error of the model at each stage, and get ready to slap that elbow.

[Plot: Within Sums of Squares error for models with 2 to 15 clusters.]

You can see from the graph that the error reduces massively from 2 to 3 clusters, but then eases off between 3 and 4 clusters, creating an ‘elbow’ in the line plot. That indicates that 3 clusters give a good explanation of the data, and while 4 clusters are slightly better, they don’t explain that much more. When trying to name and describe clusters it always gets more difficult with more groups to describe, so we don’t want to clog up our communication with clusters that don’t really mean much. Looking at this graph I could probably make two justifiable choices – three clusters is the strongest choice, but six clusters is probably defensible as well. This is one of the issues with this method – the results rely massively on K, but choosing K is a really subjective procedure.

The code

Here’s some code that does the within sums of squares loop:

# xxx is a data.frame or data.table object of only numbers
 
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers=i)$withinss)}

plot(wss, type = "l", xlab = "Clusters",ylab = "Within SS Error",
 main = "Error with different number of clusters")
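
One wrinkle worth mentioning: kmeans() starts from random centroids, so the error at each number of clusters can bounce around a little from run to run. A small variation on the loop (using the same nstart = 10 that appears in the model-fitting code further down, plus a fixed seed) keeps the elbow stable between runs:

# Optional tweak: fix the seed and use several random starts so the elbow
# plot doesn't change every time you run it.
set.seed(1234)
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers = i, nstart = 10)$withinss)}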

The results

This was another experiment, like my adventures in Holt Winters modelling, that looks promising but really needs more data. Here are the plots of the parks with three and six clusters:

[Maps of the parks coloured by cluster for the three-cluster and six-cluster models.]

The results of the three cluster model are pasted below. Tivoli stands out on its own, as expected, because its opening date is so far before everyone else’s. The other two groups, though, I’m struggling to describe by anything in particular.

Cluster 1: TIVOLI GARDENS

Cluster 2: CHIMELONG OCEAN KINGDOM, DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, LOTTE WORLD, OCT EAST, SONGCHENG PARK, SONGCHENG ROMANCE PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS FL, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 3: BUSCH GARDENS, DE EFTELING, DISNEYLAND, EPCOT, EUROPA PARK, EVERLAND, MAGIC KINGDOM, NAGASHIMA SPA LAND, OCEAN PARK, SEAWORLD, SEAWORLD FL, TOKYO DISNEYLAND, UNIVERSAL STUDIOS HOLLYWOOD

So I thought the six cluster model might do better in describing the parks. The results of the model are pasted below:

Cluster 1: DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, LOTTE WORLD, UNIVERSAL STUDIOS FL, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 2: BUSCH GARDENS, EPCOT, EUROPA PARK, EVERLAND, MAGIC KINGDOM, OCEAN PARK, SEAWORLD FL, TOKYO DISNEYLAND

Cluster 3: TIVOLI GARDENS

Cluster 4: CHIMELONG OCEAN KINGDOM, OCT EAST, SONGCHENG ROMANCE PARK

Cluster 5: DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, SONGCHENG PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK

Cluster 6: DE EFTELING, DISNEYLAND, NAGASHIMA SPA LAND, SEAWORLD, UNIVERSAL STUDIOS HOLLYWOOD

This one is a bit more descriptive, so I had a go at giving the clusters names. Remember that even though the parks are clustered by date amongst other variables, the cluster numbers aren’t organised chronologically – the algorithm just hands out labels in an arbitrary order.

Cluster 1: The Bright Future
Cluster 2: The Cold War
Cluster 3: The Start of it All
Cluster 4: The Asian Boom
Cluster 5: Pre-GFC growth
Cluster 6: The Classics
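
Since the cluster numbers don’t come out in date order, a quick per-cluster summary is a handy sanity check on names like these. Here’s a minimal sketch, assuming the info data.table and the clusterfun column created in the code further down:

# Park count and opening-year range for each cluster in the six-cluster model.
# Assumes 'info' (with its Date 'opened' column) and 'clusterfun' from the code below.
info[, .(parks = .N,
         earliest = min(year(opened)),
         latest = max(year(opened))),
     by = clusterfun][order(earliest)]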

The Bright Future

These parks all seem to have been built at times when the owner was willing to take a bigger risk because the future was looking good. The reasons for that optimism probably differ by location and owner, but if I had to pick something that makes these parks similar, I’d say it was the spirit of optimism in which they were built.

The Cold War

These parks were generally built in places and times where governments were spending a lot of time and attention on showing off how well they were doing, especially in the 60s and 70s. The Cold War was at its height throughout this period, and having a park like Magic Kingdom on your books was a massive draw for high-profile defectors and a boon for propaganda. Having said that, Magic Kingdom was notably built with very little government support, so I’m probably totally off here.

The Start of it All

Tivoli Gardens will always be a unique gem of this theme park world. Long live Tivoli!

The Asian Boom

These are the massive new Asian superparks, with OCT owned by the Chinese Government and the other two heavily sponsored by public interests. With these parks rocketing up the ranks, it’s very possible that this group will grow in the list of top 25 parks in the coming years.

Pre-GFC growth

Most of these parks were built in economies that were booming – in retrospect, largely because of increasing deregulation in financial markets in the late eighties and early nineties. They were built in a spirit of optimism like the Bright Future parks, but in this case that optimism stemmed from regulatory environments rather than real business growth. A lot of these parks have slipped down the ranks in recent years, possibly as a result of the market adjustment in those economies.

The Classics

These are the parks that really established the theme park industry. After Tivoli Gardens had gestated the idea of the amusement park, Walt Disney introduced the concept to Los Angeles and everything went mental. These parks are mainly the ones that made up that first wave, riding on the buzz caused by Disneyland.

Stuff I learned

The first and most obvious lesson of this exercise is that K-means clustering is a minefield of subjectivity and over-interpretation. As a statistician I really don’t like making a decision without a solid numerical threshold to rely on, so slapping the elbow of the within-groups error isn’t very satisfying. The other part of the process is naming and describing the clusters, which is pretty difficult to do from an analytical perspective. In writing my descriptions I had to be pretty creative, and as I wrote I could see all sorts of ways the list didn’t really fit what I was saying. The common excuse for this method is ‘you’ll never get it perfect’, but I should at least be able to say why I chose the things I did with more backup than ‘it felt right’.
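
One way to put a bit more of a number behind the choice of K is the average silhouette width, which rewards clusters that are tight and well separated (higher is better). Here’s a minimal sketch using the cluster package – an extra package, not something the analysis above used – on the same xxx table:

# Average silhouette width for 2-10 clusters; the peak suggests a K to defend.
library(cluster)

d <- dist(xxx) # pairwise distances between parks
avg_sil <- NULL
for (k in 2:10) {
  km <- kmeans(xxx, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  avg_sil[k] <- mean(sil[, "sil_width"])
}
plot(avg_sil, type = "l", xlab = "Clusters", ylab = "Average silhouette width")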

The second lesson is that, as always, more data is better. I’ve done this clustering on around thirty parks, but the clusters might be clearer if I added more parks and included more variables in the model I train. In addition, I’ve only trained this model on four variables at the moment, while normal Data Science clustering models should contain at least 50 or 60 to really start looking valid.

Things I’ll look at in the future

The next thing I’ll probably do is try a different approach to clustering, using visitor numbers in combination with the locations of the parks. That would tell me whether the different parks are catering to different markets with unique patterns of attendance, which might contribute to my machine learning approaches.

Another idea is to play with the different results produced by changing K, which gives progressively more detail about the groups as it increases. This is based on work I saw at the Australian Statistical Conference in a Peter Donnelly lecture, where he did this with genetic data to move back through history and show the gradual introduction of different genetic groups.

What do you think of my attempt at grouping theme parks? Do you think the clusters make sense, or did I just pull a bunch of meaning out of nothing? As always, I’d love to hear any critique or analysis you might have.

The code

Here’s some code in case you want to do something similar:

# Load libraries
library(data.table)
library(ggplot2)
library(maps) # Needed for the borders() world map used in the plots below

# Load the data
info <- read.csv("~/myspatialdata.csv", stringsAsFactors = FALSE)
info <- info[complete.cases(info),] #Get rid of any empty trailing rows
setDT(info) #Make it a data.table because Data Science
info$opened <- as.Date(info$opened) # Tell R this is a date
setkey(info, park) # Order by park
setkey(info, opened) # Order by Opening date
cols = c("opened", "lat", "long", "operator")

xxx <- info[, cols, with = FALSE] # Select only the columns we'll cluster on
xxx$opened <- as.numeric(as.Date(xxx$opened)) # Convert this to a number because K-means only takes numbers
xxx$operator <- as.numeric(as.factor(xxx$operator)) # Same for the operator factor

# Slap that elbow
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers=i)$withinss)}
plot(wss, type = "l", xlab = "Clusters",ylab = "Within SS Error", 
main = "Error with different number of clusters")

# Create models with 3 and 6 clusters based on the elbow approach.
parksclusterreal <- kmeans(xxx, 3, nstart =10)
parksclusterfun <- kmeans(xxx, 6, nstart =10)

# Add the cluster labels to the data frame
info$cluster <- parksclusterreal$cluster
info$clusterfun <- parksclusterfun$cluster

### Plot the parks by cluster on a world map

# Three cluster model
mp <- NULL
mapWorld <- borders("world", colour="gray10", fill="gray10") 
# create a layer of borders
mp <- ggplot(data = info, aes(x= long, y= lat , 
    color= as.factor(cluster))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
    ggtitle("Clusters of theme parks by location, operator, and opening date") +
    labs(colour = 'Cluster')
mp

# Six cluster model
mp <- NULL
mapWorld <- borders("world", colour="gray10", fill="gray10") 
# create a layer of borders
mp <- ggplot(data = info, aes(x= long, y= lat , 
    color= as.factor(clusterfun))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
    ggtitle("Clusters of theme parks by location, operator, and opening date") +
    labs(colour = 'Cluster')
mp

An animation of theme parks opening around the world


I’ve been collecting a lot of data to be able to do my last few posts, and I’d mentioned that I wanted to try more with time series data. A few years ago I got to sit in a lecture at Queensland University of Technology by Sudipto Banerjee on Bayesian spatiotemporal modelling. At the time the material was way too advanced for me, but the idea of analysing data points with time and space treated correctly has always stuck with me.

As I dug into the different things I could do with spatiotemporal data, I realised that I needed a lot more understanding of the data itself before I could do fun, tricksy things with it. I needed something that would maintain my interest, but also force me to mess around munging spatiotemporal data.

An idea born of necessity

In the first year of my postgraduate research, I was really interested in data visualisations. Thankfully, at the time a bunch of blogs like FlowingData were starting up, reporting on all types of cool data graphics. ‘Infographics’ also became a thing, threatening to destroy Data Science in its infancy. But what caught my eye at the time were the visualisations of flight paths like this one.

So now that I have some data and a bit of time and ability, I thought I’d try a more basic version of a spatiotemporal visualisation like this. My problem is that I hate installing extra one-off software for my whims, so the idea of using ImageMagick annoyed me. On top of that, when I tried I couldn’t get it to work, so I determined to do what I could using base R and ggplot.

The result

This is probably the first article where I can say I’m pretty happy with the result:

[Animation: the world’s top theme parks opening, 1840 to 2016.]

 

The first thing you can see is that Europe, not the US, is the true home of theme parks, with Tivoli Gardens appearing in 1843 and remaining in the top 25 theme parks since before there were 25 parks to compete against.

Beyond that, you can also sort of see that there is a ‘contagion’ effect of parks – when one opens in an area, others usually open nearby pretty soon. There are two reasons I can think of for this. First, once people are travelling to an area to go to a theme park, going to two theme parks probably isn’t out of the question, so someone’s bound to move in to capture that cash. Second, the people opening new parks have to learn to run theme parks somewhere, and if you’re taking a massive risk on opening a $100 million park with a bunch of other people’s money, you’ll want to minimise that risk by opening it in a place you understand.

Future stuff

Simply visualising the data turned out to be more than a data munging exercise for me – plotting this spatially as an animation gave some actual insights about how these things have spread over the world. It made me more interested in doing the spatio-temporal clustering as well – it would be really cool to do that and then redo this plot with the colours of the points determined by each park’s cluster.

Another direction to explore would be learning how to scrape Wikipedia and filling out my data table with more parks, rather than just those that have featured in the TEA reports. I know this is possible and it’s not exactly new, but it’s never come across my radar, and web scraping is a pretty necessary tool in the Data Science toolkit.
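
As a taste of what that could look like, here’s a minimal rvest sketch. rvest is an extra package, and the URL is just a stand-in for whatever Wikipedia page holds the table you’re after – I haven’t actually wired this into my dataset:

# Pull every HTML table from a Wikipedia page into a list of data frames.
# The URL below is only an example; swap in the page you actually want to scrape.
library(rvest)

page   <- read_html("https://en.wikipedia.org/wiki/List_of_amusement_park_rankings")
tables <- html_table(page) # one data frame per <table> on the page
str(tables[[1]])           # inspect the first table before cleaning it up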

What applications can you think of for this sort of visualisation? Is there anything else I could add to this one that might improve it? I’d love to hear your thoughts!

The code

Just in case you wanted to do the same, I’ve added the code with comments below. You’ll need to add your own file with a unique name, latitude, longitude and date in each row.

# Load the required libraries
library(ggmap)
library(ggplot2)
library(maptools)
library(maps)
library(data.table)

info <- read.csv("***.csv", stringsAsFactors = FALSE)

info <- info[complete.cases(info),]
setDT(info)
info$opened <- as.Date(info$opened) # Tell R this is a date
setkey(info, park)
setkey(info, opened)


# Setup for an animation
a_vec <- seq(1840, 2016, by = 1) # Create a vector of the years you will animate over

# Create a matrix to hold the 'size' information for the graph
B = matrix( rep(0, length(a_vec)*length(info$park)),
 nrow= length(a_vec), 
 ncol= length(info$park))

for (i in 1:ncol(B))
{
 for (x in 1: nrow(B))
 { # I want to have a big dot when it opens that gets gradually smaller,
   # like the alpha in the flights visualisation.
  open_date <- as.numeric(year(info$opened[i]))
  c_year <- a_vec[x]
  # If the park opened in an earlier year, leave it at zero for now;
  # the fade loop below fills in the years after opening.
  if ( open_date < c_year)
  {B[x,i] <- 0} else
  # If the park is in its opening year, give it a big circle.
  if (open_date == c_year)
  {B[x,i] <- 10}
  }}

# Make the circle fade from size 10 to size 1, then stay at 1 until the end of the matrix

for (i in 1:ncol(B))
{ for (x in 2: nrow(B))
  {if (B[x-1, i] > 1){ B[x,i] <- B[x-1, i] - 1}else
   if(B[x-1, i] == 1){ B[x,i] <- 1}}}

B <- data.frame(B)
B <- cbind( a_vec, B)
setDT(B)
names(B) <- c("years", info$park) # Set the column names to the names of the parks

xxx <- melt(B, "years") # Convert to long format

# Create a table of locations
loc <- data.table("variable" = info$park,
                   "lat"= info$lat, 
                   "long"= info$long)

#Join the locations to the long table
xxx <- merge(xxx, loc, by = "variable", all.x = TRUE)
setkey(xxx, years)

# Create a ggplot image for each entry in the a_vec vector of years we made at the beginning.
for (i in 1: length(a_vec))
    {mydata <- xxx[years ==a_vec[i]] # Only graph the rows for year i.
     mydata <- mydata[mydata$value!=0,] #Don't plot stuff not open yet.
     # Write the plot to a jpeg file and give it a number to keep the frames in order.
     jpeg(filename = 
     paste("~/chosenfolder/animation", i, ".jpeg", sep = ""),
     width = (429*2) , height = (130*2), units = "px") 
     mp <- NULL 
     # Plot a world map in grey and entitle it with the year.
     mapWorld <- borders("world", colour="gray50", fill="gray50") 
     mp <- ggplot() + mapWorld + theme_bw() + ggtitle(a_vec[i])
     # Add the points on the map, using the size vector we spent all that time building matrices to produce.
     mp <- mp+ geom_point(aes(x=mydata$long, y=mydata$lat) ,
     color = "orange", size = mydata$value/1.5) + ylim(c(0, 60))
     plot(mp)
     dev.off()
}