A spatio-temporal clustering of the world’s top theme parks.

[Image] The 'Mickey's Friends' or something show I saw at Magic Kingdom. My niece loves Elsa.

I’ve been playing a lot lately with my dataset of the locations, opening dates and other information about the top theme parks in the world, as measured by the Themed Entertainment Association. I mentioned that I wanted to try clustering the parks to see if I could find groups of them within the data. When I built my animation of theme parks opening, I thought there might be some sort of ‘contagion’ effect, where one park opening in an area increases the likelihood of another opening in the same area within a short time. My idea is that companies and people try to reduce risk by opening parks in areas they understand, and that the risk they’re willing to take in a new market increases as time passes. This second idea comes from my contention that these companies are always trying to open new parks, but won’t do it if the current market is too competitive. Their two options are to build their share of the market they know, like they do in Florida, or to try and find a new market that looks promising. As the home market gets more and more competitive over time, those foreign markets start to look more attractive.

 

K-means clustering

K-means is one of the most popular methods around at the moment for grouping data points of any type, so popular that the algorithm comes packaged in base R. In short, the model places the data points in a space with N dimensions, where N is the number of variables you’ve put into the model. This is pretty easy to imagine with two or three variables, but once we get to six or so (having used up colour, size and shape in a graph) you have to start bending your brain in some pretty specific ways to keep up.

Then it picks K points (the ‘centroids’) in that space, chosen so that the total distance between each data point and its nearest centroid is as small as possible. That total, the sum of squared distances from each point to its assigned centroid, is called the ‘within Sum of Squares’, and it’s used to judge how well the centroids group the data points. (Here’s another explanation if you’re interested in the details.)

[Image] An example of centroids (represented as stars) grouping data points (represented as circles) on a two-dimensional surface.
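To make that a bit more concrete, here’s a tiny sketch (on made-up two-dimensional data, not the park dataset) of what base R’s kmeans() gives you back:

set.seed(42)
# Two made-up blobs of points in two dimensions
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2)
km$centers       # coordinates of the two centroids
km$withinss      # within Sum of Squares for each cluster
sum(km$withinss) # the total within SS error I track below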

Choosing K: Slap that elbow

The main question that K-means clustering wants you to answer before you start is how many clusters you want it to make. It sounds weird, but the best way I know to do this (without moving to Bayesian mixture modelling) is basically trial and error. With the Big Data I’m used to, this can take a long time unless you do some parallelisation work, but with only 30 or so entries it’s a pretty trivial task. So basically I run a loop in R building cluster models with 2 to 15 clusters (more is kinda useless to me), measure the Within Sum of Squares error of the model at each stage, and get ready to slap that elbow.

[Image] Within SS error for different numbers of clusters.

You can see from the graph that the error drops massively from 2 to 3 clusters, but then eases off between 3 and 4 clusters, creating an ‘elbow’ in the line plot. That indicates that 3 clusters give a good explanation of the data; while 4 clusters fit slightly better, they don’t explain that much more about the data. Naming and describing clusters always gets harder with more groups to describe, so we don’t want to clog up our communication with clusters that don’t really mean much. Looking at this graph I could probably make two justifiable choices: three clusters is the strongest choice, but six clusters is probably defensible as well. This is one of the issues with this method: the results rely massively on K, but choosing K is a really subjective procedure.

The code

Here’s some code that does the within sums of squares loop:

# xxx is a data.frame or data.table object containing only numeric columns

wss <- NULL
for (i in 2:15) {
  wss[i] <- sum(kmeans(xxx, centers = i)$withinss) # total within SS for i clusters
}

# wss[1] stays NA, so the plotted line effectively starts at 2 clusters
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")

The results

This was another experiment, like my adventures in Holt-Winters modelling, that looks promising but really needs more data. Here are the maps of the parks with three and six clusters:

[Image] Maps of the parks coloured by cluster for the three and six cluster models.

The results of the three cluster model are pasted below. Tivoli stands out on its own as expected, because its opening date is so far before everyone else’s. The other two groups, though, I’m struggling to describe by anything in particular.

Cluster 1: TIVOLI GARDENS

Cluster 2: CHIMELONG OCEAN KINGDOM, DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, LOTTE WORLD, OCT EAST, SONGCHENG PARK, SONGCHENG ROMANCE PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS FL, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 3: BUSCH GARDENS, DE EFTELING, DISNEYLAND, EPCOT, EUROPA PARK, EVERLAND, MAGIC KINGDOM, NAGASHIMA SPA LAND, OCEAN PARK, SEAWORLD, SEAWORLD FL, TOKYO DISNEYLAND, UNIVERSAL STUDIOS HOLLYWOOD

So I thought the six cluster model might do a better job of describing the parks. The results of that model are pasted below:

Cluster 1: DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, EPCOT, LOTTE WORLD, TOKYO DISNEYLAND, UNIVERSAL STUDIOS FL, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 2: BUSCH GARDENS, EUROPA PARK, EVERLAND, MAGIC KINGDOM, OCEAN PARK, SEAWORLD FL

Cluster 3: TIVOLI GARDENS

Cluster 4: CHIMELONG OCEAN KINGDOM, OCT EAST, SONGCHENG ROMANCE PARK

Cluster 5: DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, SONGCHENG PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK

Cluster 6: DE EFTELING, DISNEYLAND, NAGASHIMA SPA LAND, SEAWORLD, UNIVERSAL STUDIOS HOLLYWOOD

This one is a bit more descriptive, so I had a go at giving the clusters names. Remember that even though they are clustered by date amongst other variables, the groups aren’t organised chronologically because the algorithm is unsupervised.

Cluster 1: The Bright Future
Cluster 2: The Cold War
Cluster 3: The Start of it All
Cluster 4: The Asian Boom
Cluster 5: Pre-GFC growth
Cluster 6: The Classics
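Because the algorithm doesn’t care about chronology, the cluster numbers are arbitrary. If you wanted the numbering to follow time, a quick sketch (assuming the info table with the clusterfun column created in the code at the end of the post) is to renumber the clusters by their average opening date:

# Renumber clusters so that cluster 1 has the earliest average opening date
ord <- order(tapply(as.numeric(info$opened), info$clusterfun, mean))
info$cluster_chrono <- match(info$clusterfun, ord)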

The Bright Future

These parks all seem to have been built at times when the owner was willing to take a bigger risk because they knew their future was looking good. The reasons for that optimism probably differ by location and owner, but if I had to pick something that makes these parks similar, I’d say it’s the spirit of optimism in which they were built.

The Cold War

These parks were generally built in places and at times where governments were spending a lot of time and attention on showing off how awesome they were, especially in the 60s and 70s. The Cold War was at its height throughout this period, and having a park like Magic Kingdom on your books was a massive draw for high-profile defectors and a boon for propaganda. Having said that, Magic Kingdom was notably built with very little government support, so I’m probably totally off here.

The Start of it All

Tivoli Gardens will always be a unique gem of this theme park world. Long live Tivoli!

The Asian Boom

These are the massive new Asian superparks, with OCT owned by the Chinese Government and the other two heavily sponsored by public interests. With these parks rocketing up the ranks, it’s very possible that this group will grow in the list of top 25 parks in the coming years.

Pre-GFC growth

Most of these parks were built in booming economies that (in retrospect) were growing largely because of deregulation in financial markets in the late eighties and early nineties. They were built in a spirit of optimism like the Bright Future parks, but in this case that optimism stemmed from the regulatory environment rather than real business growth. A lot of these parks have done less well in the rankings in recent years, possibly as a result of the market adjustment in those economies.

The Classics

These are the parks that really established the theme park industry. After Tivoli Gardens had gestated the idea of amusement parks, Walt Disney introduced the concept to Los Angeles and everything went mental. These parks were mainly those that made up this first wave, riding on the buzz caused by Disneyland.

Stuff I learned

The first and most obvious lesson of this exercise is that K-means clustering is a minefield of subjectivity and over-interpretation. As a statistician I really don’t like having to make a decision without a solid numerical threshold to rely on, so slapping the elbow of the within-groups error doesn’t sit well with me. The other part of the process is naming and describing the clusters, which is pretty difficult to do from an analytical perspective. In writing my descriptions I had to be pretty creative, and as I wrote I could see all sorts of ways the list didn’t really fit what I was saying. The common excuse from people using this method is ‘you’ll never get it perfect’, but I should at least be able to say why I chose the things I did with more backup than ‘it felt right’.
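If I wanted a more numerical backstop than eyeballing the elbow, one option (not what I used above, just a sketch using the cluster package, one of R’s recommended packages) is the average silhouette width, which peaks at the K that separates the groups most cleanly:

library(cluster) # provides silhouette()

# Average silhouette width for each candidate K (higher is better)
d <- dist(xxx)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(xxx, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:10, avg_sil, type = "b", xlab = "Clusters", ylab = "Average silhouette width")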

The second lesson is that, as always, more data is better. I’ve done this clustering on around thirty parks, but the clusters might come out more clearly if I added more parks and included more variables in the model I train. In addition, I’ve only trained this model on four variables at the moment, while normal Data Science clustering models should contain at least 50 or 60 to really start looking valid.
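One variable-related thing I’d try if I reran this (not something I did above): the opening date, stored as a number of days, sits on a much larger numeric scale than latitude, longitude and the operator code, so it will tend to dominate the distance calculation. A quick sketch of a fix is to standardise the columns with base R’s scale() before clustering (the object names here are just for illustration):

# Standardise each column to mean 0 and standard deviation 1 before clustering
xxx_scaled <- scale(xxx)
parkscluster_scaled <- kmeans(xxx_scaled, centers = 3, nstart = 10)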

Things I’ll look at in the future

The next thing I’ll probably do is take a different approach to clustering, using visitor numbers in combination with the locations of the parks. This would tell me whether the different parks are catering to different markets with unique patterns of attendance, which might contribute to my machine learning approaches.

Another idea is to play with the different results produced by changing K, which gives progressively more detail about the groups as it increases. This is based on work I saw at the Australian Statistical Conference, in a Peter Donnelly lecture where he did this with genetic data to move back through history and show the gradual introduction of different genetic groups.
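As a rough sketch of what that could look like (reusing the xxx matrix from the code below; the column names are just made up for illustration), the labels for a range of K can be collected side by side and compared:

# Store the cluster labels for K = 2 to 6 next to each other
set.seed(1)
labels_by_k <- sapply(2:6, function(k) kmeans(xxx, centers = k, nstart = 10)$cluster)
colnames(labels_by_k) <- paste0("k", 2:6)
head(labels_by_k)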

What do you think of my attempt at grouping theme parks? Do you think the clusters make sense, or did I just pull a bunch of meaning out of nothing? As always, I’d love to hear any critique or analysis you might have.

The code

Here’s some code in case you want to do something similar:

# Load libraries
library(data.table)
library(ggplot2)
library(maps)       # borders() further down pulls its world map data from this package

# Load the data
info <- read.csv("~/myspatialdata.csv", stringsAsFactors = FALSE)
info <- info[complete.cases(info), ]  # Get rid of any empty trailing rows
setDT(info)                           # Make it a data.table because Data Science
info$opened <- as.Date(info$opened)   # Tell R this is a date
setkey(info, park)    # Order by park
setkey(info, opened)  # Order by opening date
cols <- c("opened", "lat", "long", "operator")

xxx <- info[, cols, with = FALSE]                   # Select only the columns we'll cluster on
xxx$opened <- as.numeric(as.Date(xxx$opened))       # Convert the date to a number because K-means only takes numbers
xxx$operator <- as.numeric(as.factor(xxx$operator)) # Same for the operator factor

# Slap that elbow
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers=i)$withinss)}
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")

# Create models with 3 and 6 clusters based on the elbow approach.
parksclusterreal <- kmeans(xxx, 3, nstart = 10)
parksclusterfun <- kmeans(xxx, 6, nstart = 10)

# Add the cluster labels to the data frame
info$cluster <- parksclusterreal$cluster
info$clusterfun <- parksclusterfun$cluster

### Plot the parks by cluster on a world map

# Three cluster model
mp <- NULL
mapWorld <- borders("world", colour = "gray10", fill = "gray10") # create a layer of borders
mp <- ggplot(data = info, aes(x = long, y = lat,
                              color = as.factor(cluster))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
  ggtitle("Clusters of theme parks by location, operator, and opening date") +
  labs(colour = "Cluster")
mp

# Six cluster model
mp <- NULL
mapWorld <- borders("world", colour = "gray10", fill = "gray10") # create a layer of borders
mp <- ggplot(data = info, aes(x = long, y = lat,
                              color = as.factor(clusterfun))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
  ggtitle("Clusters of theme parks by location, operator, and opening date") +
  labs(colour = "Cluster")
mp