A spatio-temporal clustering of the world’s top theme parks.

[Image] The 'Mickey's Friends' or something show I saw at Magic Kingdom. My niece loves Elsa.

I’ve been playing a lot lately with my dataset of the locations, opening dates and other information about the top theme parks in the world, as measured by the Themed Entertainment Association. I mentioned that I wanted to try clustering the parks to see if I could find groups of them within the data. When I built my animation of theme parks opening, I thought there might be some sort of ‘contagion’ effect, where one park opening in an area increases the likelihood of another opening in the same area within a short time. My idea is that companies and people try to reduce risk by opening parks in areas they understand, and that the risk they’re willing to take in a new market increases as time passes. This second idea comes from my contention that these companies are always trying to open new parks, but won’t do it if the current market is too competitive. Their two options are to build their share of the market they know, like they do in Florida, or to try and find a new market that looks promising. As the home market gets more and more competitive over time, those foreign markets start to look more attractive.

 

K-means clustering

K-means is one of the most popular methods around at the moment for grouping data points of any type, so popular that the algorithm comes packaged in base R. In short, the model places the data points in a space with N dimensions, where N is the number of variables you’ve put into the model. This is pretty easy to imagine with two or three variables, but once we get to six or so (having used up colour, size and shape in a graph) you have to start bending your brain in some pretty specific ways to keep up.

Then it picks K points (the ‘centroids’) in that space, chosen so that the total distance between each data point and its nearest centroid is as small as possible. That total, the sum of squared distances from each point to its assigned centroid, is called the ‘within Sum of Squares’, and it’s used to judge how well the centroids group the data points. (Here’s another explanation if you’re interested in the details.)

[Image] An example of centroids (represented as stars) grouping data points (represented as circles) on a two-dimensional surface.
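To make that a bit more concrete, here’s a tiny sketch (on made-up two-dimensional data, not the park dataset) of what base R’s kmeans() gives you back:

set.seed(42)
# Two made-up blobs of points in two dimensions
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2)
km$centers       # coordinates of the two centroids
km$withinss      # within Sum of Squares for each cluster
sum(km$withinss) # the total within SS error I track below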

Choosing K: Slap that elbow

The main question that K-means clustering wants you to answer before you start is how many clusters you want it to make. It sounds weird, but the best way I know to do this (without moving to Bayesian mixture modelling) is basically trial and error. With the Big Data I’m used to, this can take a long time unless you do some parallelisation work, but with only 30 or so entries it’s a pretty trivial task. So basically I run a loop in R building cluster models with 2 to 15 clusters (more is kinda useless to me), measure the Within Sum of Squares error of the model at each stage, and get ready to slap that elbow.

[Image] Within SS error for different numbers of clusters.

You can see from the graph that the error drops massively from 2 to 3 clusters, but then eases off between 3 and 4 clusters, creating an ‘elbow’ in the line plot. That indicates that 3 clusters give a good explanation of the data; while 4 clusters fit slightly better, they don’t explain that much more about the data. Naming and describing clusters always gets harder with more groups to describe, so we don’t want to clog up our communication with clusters that don’t really mean much. Looking at this graph I could probably make two justifiable choices: three clusters is the strongest choice, but six clusters is probably defensible as well. This is one of the issues with this method: the results rely massively on K, but choosing K is a really subjective procedure.

The code

Here’s some code that does the within sums of squares loop:

# xxx is a data.frame or data.table object containing only numeric columns

wss <- NULL
for (i in 2:15) {
  wss[i] <- sum(kmeans(xxx, centers = i)$withinss) # total within SS for i clusters
}

# wss[1] stays NA, so the plotted line effectively starts at 2 clusters
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")

The results

This was another experiment, like my adventures in Holt-Winters modelling, that looks promising but really needs more data. Here are the maps of the parks with three and six clusters:

[Image] Maps of the parks coloured by cluster for the three and six cluster models.

The results of the three cluster model are pasted below. Tivoli stands out on its own as expected, because its opening date is so far before everyone else’s. The other two groups, though, I’m struggling to describe by anything in particular.

Cluster 1: TIVOLI GARDENS

Cluster 2: CHIMELONG OCEAN KINGDOM, DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, LOTTE WORLD, OCT EAST, SONGCHENG PARK, SONGCHENG ROMANCE PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS FL, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 3: BUSCH GARDENS, DE EFTELING, DISNEYLAND, EPCOT, EUROPA PARK, EVERLAND, MAGIC KINGDOM, NAGASHIMA SPA LAND, OCEAN PARK, SEAWORLD, SEAWORLD FL, TOKYO DISNEYLAND, UNIVERSAL STUDIOS HOLLYWOOD

So I thought the six cluster model might do a better job of describing the parks. The results of that model are pasted below:

Cluster 1: DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, EPCOT, LOTTE WORLD, TOKYO DISNEYLAND, UNIVERSAL STUDIOS FL, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 2: BUSCH GARDENS, EUROPA PARK, EVERLAND, MAGIC KINGDOM, OCEAN PARK, SEAWORLD FL

Cluster 3: TIVOLI GARDENS

Cluster 4: CHIMELONG OCEAN KINGDOM, OCT EAST, SONGCHENG ROMANCE PARK

Cluster 5: DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, SONGCHENG PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK

Cluster 6: DE EFTELING, DISNEYLAND, NAGASHIMA SPA LAND, SEAWORLD, UNIVERSAL STUDIOS HOLLYWOOD

This one is a bit more descriptive, so I had a go at giving the clusters names. Remember that even though they are clustered by date amongst other variables, the groups aren’t organised chronologically because the algorithm is unsupervised.

Cluster 1: The Bright Future
Cluster 2: The Cold War
Cluster 3: The Start of it All
Cluster 4: The Asian Boom
Cluster 5: Pre-GFC growth
Cluster 6: The Classics
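Because the algorithm doesn’t care about chronology, the cluster numbers are arbitrary. If you wanted the numbering to follow time, a quick sketch (assuming the info table with the clusterfun column created in the code at the end of the post) is to renumber the clusters by their average opening date:

# Renumber clusters so that cluster 1 has the earliest average opening date
ord <- order(tapply(as.numeric(info$opened), info$clusterfun, mean))
info$cluster_chrono <- match(info$clusterfun, ord)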

The Bright Future

These parks all seem to have been built at times when the owner was willing to take a bigger risk because they knew their future was looking good. The reasons for that optimism probably differ by location and owner, but if I had to pick something that makes these parks similar, I’d say it’s the spirit of optimism in which they were built.

The Cold War

These parks were generally built in places and at times where governments were spending a lot of time and attention on showing off how awesome they were, especially in the 60s and 70s. The Cold War was at its height throughout this period, and having a park like Magic Kingdom on your books was a massive draw for high-profile defectors and a boon for propaganda. Having said that, Magic Kingdom was notably built with very little government support, so I’m probably totally off here.

The Start of it All

Tivoli Gardens will always be a unique gem of this theme park world. Long live Tivoli!

The Asian Boom

These are the massive new Asian superparks, with OCT owned by the Chinese Government and the other two heavily sponsored by public interests. With these parks rocketing up the ranks, it’s very possible that this group will grow in the list of top 25 parks in the coming years.

Pre-GFC growth

Most of these parks were built in booming economies that (in retrospect) were growing largely because of deregulation in financial markets in the late eighties and early nineties. They were built in a spirit of optimism like the Bright Future parks, but in this case that optimism stemmed from the regulatory environment rather than real business growth. A lot of these parks have done less well in the rankings in recent years, possibly as a result of the market adjustment in those economies.

The Classics

These are the parks that really established the theme park industry. After Tivoli Gardens had gestated the idea of amusement parks, Walt Disney introduced the concept to Los Angeles and everything went mental. These parks were mainly those that made up this first wave, riding on the buzz caused by Disneyland.

Stuff I learned

The first and most obvious lesson of this exercise is that K-means clustering is a minefield of subjectivity and over-interpretation. As a statistician I really don’t like having to make a decision without a solid numerical threshold to rely on, so slapping the elbow of the within-groups error doesn’t sit well with me. The other part of the process is naming and describing the clusters, which is pretty difficult to do from an analytical perspective. In writing my descriptions I had to be pretty creative, and as I wrote I could see all sorts of ways the list didn’t really fit what I was saying. The common excuse from people using this method is ‘you’ll never get it perfect’, but I should at least be able to say why I chose the things I did with more backup than ‘it felt right’.
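If I wanted a more numerical backstop than eyeballing the elbow, one option (not what I used above, just a sketch using the cluster package, one of R’s recommended packages) is the average silhouette width, which peaks at the K that separates the groups most cleanly:

library(cluster) # provides silhouette()

# Average silhouette width for each candidate K (higher is better)
d <- dist(xxx)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(xxx, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:10, avg_sil, type = "b", xlab = "Clusters", ylab = "Average silhouette width")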

The second lesson is that, as always, more data is better. I’ve done this clustering on around thirty parks, but the clusters might come out more clearly if I added more parks and included more variables in the model I train. In addition, I’ve only trained this model on four variables at the moment, while normal Data Science clustering models should contain at least 50 or 60 to really start looking valid.
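One variable-related thing I’d try if I reran this (not something I did above): the opening date, stored as a number of days, sits on a much larger numeric scale than latitude, longitude and the operator code, so it will tend to dominate the distance calculation. A quick sketch of a fix is to standardise the columns with base R’s scale() before clustering (the object names here are just for illustration):

# Standardise each column to mean 0 and standard deviation 1 before clustering
xxx_scaled <- scale(xxx)
parkscluster_scaled <- kmeans(xxx_scaled, centers = 3, nstart = 10)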

Things I’ll look at in the future

The next thing I’ll probably do is take a different approach to clustering, using visitor numbers in combination with the locations of the parks. This would tell me whether the different parks are catering to different markets with unique patterns of attendance, which might contribute to my machine learning approaches.

Another idea is to play with the different results produced by changing K, which gives progressively more detail about the groups as it increases. This is based on work I saw at the Australian Statistical Conference, in a Peter Donnelly lecture where he did this with genetic data to move back through history and show the gradual introduction of different genetic groups.
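As a rough sketch of what that could look like (reusing the xxx matrix from the code below; the column names are just made up for illustration), the labels for a range of K can be collected side by side and compared:

# Store the cluster labels for K = 2 to 6 next to each other
set.seed(1)
labels_by_k <- sapply(2:6, function(k) kmeans(xxx, centers = k, nstart = 10)$cluster)
colnames(labels_by_k) <- paste0("k", 2:6)
head(labels_by_k)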

What do you think of my attempt at grouping theme parks? Do you think the clusters make sense, or did I just pull a bunch of meaning out of nothing? As always, I’d love to hear any critique or analysis you might have.

The code

Here’s some code in case you want to do something similar:

# Load libraries
library(data.table)
library(ggplot2)
library(maps)       # borders() further down pulls its world map data from this package

# Load the data
info <- read.csv("~/myspatialdata.csv", stringsAsFactors = FALSE)
info <- info[complete.cases(info), ]  # Get rid of any empty trailing rows
setDT(info)                           # Make it a data.table because Data Science
info$opened <- as.Date(info$opened)   # Tell R this is a date
setkey(info, park)    # Order by park
setkey(info, opened)  # Order by opening date
cols <- c("opened", "lat", "long", "operator")

xxx <- info[, cols, with = FALSE]                   # Select only the columns we'll cluster on
xxx$opened <- as.numeric(as.Date(xxx$opened))       # Convert the date to a number because K-means only takes numbers
xxx$operator <- as.numeric(as.factor(xxx$operator)) # Same for the operator factor

# Slap that elbow
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers=i)$withinss)}
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")

# Create models with 3 and 6 clusters based on the elbow approach.
parksclusterreal <- kmeans(xxx, 3, nstart = 10)
parksclusterfun <- kmeans(xxx, 6, nstart = 10)

# Add the cluster labels to the data frame
info$cluster <- parksclusterreal$cluster
info$clusterfun <- parksclusterfun$cluster

### Plot the parks by cluster on a world map

# Three cluster model
mp <- NULL
mapWorld <- borders("world", colour = "gray10", fill = "gray10") # create a layer of borders
mp <- ggplot(data = info, aes(x = long, y = lat,
                              color = as.factor(cluster))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
  ggtitle("Clusters of theme parks by location, operator, and opening date") +
  labs(colour = "Cluster")
mp

# Six cluster model
mp <- NULL
mapWorld <- borders("world", colour = "gray10", fill = "gray10") # create a layer of borders
mp <- ggplot(data = info, aes(x = long, y = lat,
                              color = as.factor(clusterfun))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
  ggtitle("Clusters of theme parks by location, operator, and opening date") +
  labs(colour = "Cluster")
mp