
I’ve been playing a lot lately with my dataset of the locations, opening dates and other information about the top theme parks in the world as measured by the Themed Entertainment Association. I mentioned that I wanted to try clustering the parks to see if I can find groups of them within the data. When I built my animation of theme parks opening I thought there might be some sort of ‘contagion’ effect of parks opening, where one opening in an area increased the likelihood of another one opening in the same area within a short time. My idea is that companies and people try to reduce risk by opening parks in areas they understand, and that the risk they’re willing to take in a new market increases as time passes. This second idea comes from my contention that these companies are always trying to open new parks, but won’t do it if the current market is too competitive. Their two options are to build their share of the market they know, like they do in Florida, or to try and find a new market that looks promising. As the home market gets more and more competitive over time, those foreign markets start to look more attractive.
K-means clustering
K-means is one of the most popular methods at the moment for grouping data points of any type – so popular that the algorithm comes packaged in base R. In short, the model places each data point in an N-dimensional space, where N is the number of variables you’ve put into the model. This is pretty easy to imagine with two or three variables, but once we get to six or so (having used up colour, size and shape in a graph) you have to start bending your brain in some pretty specific ways to keep up.
Then it picks K points in that space that minimise the distance between themselves and the data points assigned to them. The summed squared distance between each data point and its chosen point (or ‘centroid’) is called the ‘within Sum of Squares’, and is used to measure how well those K points group the data. (Here’s another explanation if you’re interested in the details.)
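To make that concrete, here’s a minimal sketch in base R – on some made-up two-dimensional points, not the park data – showing the centroids and the within Sum of Squares that kmeans() reports:

```r
# Two obvious blobs of 2-D points, clustered into two groups
set.seed(42)                  # kmeans starts from random centres, so fix the seed
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
fit <- kmeans(pts, centers = 2)
fit$centers                   # the two centroids
sum(fit$withinss)             # total within Sum of Squares across the clusters
```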

Choosing K: Slap that elbow
The main question K-means clustering asks you to answer before you start is how many clusters you want it to make. It sounds weird, but the best way I know to do this (without moving to Bayesian mixture modelling) is basically trial and error. Usually with the Big Data I’m used to, this can take a long time unless you do some parallelisation work, but with only 30 or so entries this is a pretty trivial task. So basically I run a loop in R building cluster models with 2–15 clusters (more is kinda useless to me), measure the within Sum of Squares error of the model at each stage, and get ready to slap that elbow.
You can see from the graph that the error drops massively from 2 to 3 clusters, but then eases off between 3 and 4 clusters, creating an ‘elbow’ in the line plot. That indicates that 3 clusters explain the data well, and while 4 clusters are slightly better, they don’t explain that much more. When trying to name and describe clusters, it always gets more difficult with more groups to describe, so we don’t want to clog up our communication with clusters that don’t really mean much. Looking at this graph I could probably make two justifiable choices – three clusters is the strongest choice, but six clusters is probably defensible as well. This is one of the issues with this method – the results rely massively on K, but choosing K is a really subjective procedure.
The code
Here’s some code that does the within sums of squares loop:
```r
# xxx is a data.frame or data.table object of only numbers
wss <- NULL
for (i in 2:15) {
  wss[i] <- sum(kmeans(xxx, centers = i)$withinss)
}
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")
```
The results
This was another experiment like my adventures in Holt-Winters modelling that looks promising but really needs more data. Here are the plots of parks with three and six clusters:
The results of the three cluster model are pasted below. Tivoli stands out on its own as expected, due to its opening date being so far before everyone else’s. The other two groups, though, I’m struggling to describe by anything in particular.
| Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|
| TIVOLI GARDENS | CHIMELONG OCEAN KINGDOM | BUSCH GARDENS |
| DISNEY ANIMAL KINGDOM | DE EFTELING | |
| DISNEY CALIFORNIA ADVENTURE | DISNEYLAND | |
| DISNEY HOLLYWOOD STUDIOS | EPCOT | |
| DISNEYLAND PARIS | EUROPA PARK | |
| HONG KONG DISNEYLAND | EVERLAND | |
| ISLANDS OF ADVENTURE | MAGIC KINGDOM | |
| LOTTE WORLD | NAGASHIMA SPA LAND | |
| OCT EAST | OCEAN PARK | |
| SONGCHENG PARK | SEAWORLD | |
| SONGCHENG ROMANCE PARK | SEAWORLD FL | |
| TOKYO DISNEY SEA | TOKYO DISNEYLAND | |
| UNIVERSAL STUDIOS FL | UNIVERSAL STUDIOS HOLLYWOOD | |
| UNIVERSAL STUDIOS JAPAN | | |
| WALT DISNEY STUDIOS PARK | | |
| YOKOHAMA HAKKEIJIMA SEA PARADISE | | |
So I thought the six cluster model might do better in describing the parks. The results of the model are pasted below:
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|
| DISNEY HOLLYWOOD STUDIOS | BUSCH GARDENS | TIVOLI GARDENS | CHIMELONG OCEAN KINGDOM | DISNEY ANIMAL KINGDOM | DE EFTELING |
| DISNEYLAND PARIS | EUROPA PARK | | OCT EAST | DISNEY CALIFORNIA ADVENTURE | DISNEYLAND |
| LOTTE WORLD | EVERLAND | | SONGCHENG ROMANCE PARK | HONG KONG DISNEYLAND | NAGASHIMA SPA LAND |
| UNIVERSAL STUDIOS FL | MAGIC KINGDOM | | | ISLANDS OF ADVENTURE | SEAWORLD |
| YOKOHAMA HAKKEIJIMA SEA PARADISE | OCEAN PARK | | | SONGCHENG PARK | UNIVERSAL STUDIOS HOLLYWOOD |
| | EPCOT | | | TOKYO DISNEY SEA | SEAWORLD FL |
| | TOKYO DISNEYLAND | | | UNIVERSAL STUDIOS JAPAN | |
| | | | | WALT DISNEY STUDIOS PARK | |
This one is a bit more descriptive, so I had a go at giving the clusters names. Remember that even though they are clustered by date amongst other variables, the groups aren’t organised chronologically because the algorithm is unsupervised.
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|
| The Bright Future | The Cold War | The Start of it All | The Asian Boom | Pre-GFC growth | The Classics |
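One way to put some numbers behind names like these is to look at each cluster’s centre, which kmeans() returns. This is just a sketch on a tiny made-up stand-in for the park data (opening date as a numeric, latitude, longitude) – the real columns are built in the code section at the end:

```r
# Fit on a synthetic stand-in, then read off roughly where and when each
# cluster sits by converting the date column of the centres back to dates
set.seed(1)
xxx <- data.frame(
  opened = as.numeric(as.Date(c("1955-06-01", "1971-10-01",
                                "2005-09-12", "2014-01-28"))),
  lat  = c(33.8, 28.4, 22.3, 22.8),
  long = c(-117.9, -81.6, 114.0, 113.5)
)
fit <- kmeans(xxx, centers = 2, nstart = 10)
centers <- as.data.frame(fit$centers)
centers$opened <- as.Date(round(centers$opened), origin = "1970-01-01")
centers  # one row per cluster: average opening date and location
```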
The Bright Future
These parks all seem to be built in times where the owner was looking to take a bigger risk because they knew their future was looking good. The reasons for their optimism are probably different by location and owner, but if I had to pick something that made these parks similar, I’d say it was the spirit of optimism in which they were built.
The Cold War
These parks were generally built in places and times where governments were spending a lot of time and attention on trying to show off how awesome they were, especially in the 60s and 70s. The Cold War was at its height throughout this period, and having a park like Magic Kingdom on your books was a massive draw for high-profile defectors and a boon for propaganda. Having said that, Magic Kingdom was notably built with very little government support, so I’m probably totally off here.
The Start of it All
Tivoli Gardens will always be a unique gem of this theme park world. Long live Tivoli!
The Asian Boom
These are the massive new Asian superparks, with OCT owned by the Chinese Government and the other two heavily sponsored by public interests. With these parks rocketing up the ranks, it’s very possible that this group will grow in the list of top 25 parks in the coming years.
Pre-GFC growth
Most of these parks were built in booming economies that (in retrospect) were growing because of the growing deregulation in financial markets – in the late eighties and early nineties. These were built in a spirit of optimism like the Bright Future parks, but that optimism stemmed from regulatory environments in this case rather than real business growth. A lot of these parks have done less well in the ranks in recent years, possibly as a result of the market adjustment in these economies.
The Classics
These are the parks that really established the theme park industry. After Tivoli Gardens had gestated the idea of amusement parks, Walt Disney introduced the concept to Los Angeles and everything went mental. These parks were mainly those that made up this first wave, riding on the buzz caused by Disneyland.
Stuff I learned
The first and most obvious lesson of this exercise is that K-means clustering is a minefield of subjectivity and over-interpretation. As a statistician I really don’t like having to make a decision without a solid numerical threshold to rely on, so slapping the elbow of the within-groups error isn’t very nice. The other part of the process is naming and describing the clusters, which is pretty difficult to do from an analytical perspective. In writing my descriptions I had to be pretty creative, and as I wrote I could see all sorts of ways the list didn’t really fit what I was saying. The common excuse of people using this method is ‘you’ll never get it perfect’, but I should at least be able to say why I chose the things I did with more backup than ‘it felt right’.
The second lesson is that, as always, more data is better. I’ve done this clustering on around thirty parks, but the clusters might become clearer if I added more parks and included more variables in the model I train. In addition, I only trained this model on four variables, while normal Data Science clustering models should contain at least 50 or 60 to really start looking valid.
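Related to the variables point, one caveat I should flag (this is a sketch on synthetic stand-ins, not a rerun on the park data): the variables sit on wildly different scales. A date-as-number spans thousands of days while latitude spans tens of degrees, so unscaled K-means is dominated by the date column. Base R’s scale() standardises each column before clustering:

```r
# Synthetic stand-ins for the real columns, on deliberately mismatched scales
set.seed(1)
raw <- data.frame(opened = rnorm(30, mean = 15000, sd = 4000),  # dates as day counts
                  lat    = rnorm(30, mean = 35, sd = 10),
                  long   = rnorm(30, mean = 0, sd = 90))
scaled <- scale(raw)               # centre each column and divide by its SD
fit <- kmeans(scaled, centers = 3, nstart = 10)
table(fit$cluster)                 # cluster sizes on the standardised data
```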
Things I’ll look at in the future
The next thing I’ll probably do is a different approach to clustering using visitor numbers in combination with the locations of the parks. This would tell me if the different parks are catering to different markets that have unique patterns of attendance, which might contribute to my machine learning approaches.
Another idea is to play with the different results produced by changing K, which gives progressively more detail about the groups as it increases. This is based on the work I saw once at the Australian Statistical conference in a Peter Donnelly lecture where he did this with Genetic data to move back in history and show the gradual introduction of different genetic groups.
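A rough sketch of that idea (on synthetic points, since I haven’t tried it on the park data yet): refit with increasing K and put the assignments side by side, so you can watch where each group splits as K grows.

```r
# Fit K = 2..5 on the same points and collect the cluster labels per model
set.seed(1)
pts <- matrix(rnorm(60), ncol = 2)   # 30 synthetic 'parks'
assignments <- sapply(2:5, function(k) kmeans(pts, centers = k, nstart = 10)$cluster)
colnames(assignments) <- paste0("K", 2:5)
head(assignments)   # each column is one model's grouping of the same rows
```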
What do you think of my attempt at grouping theme parks? Do you think the clusters make sense, or did I just pull a bunch of meaning out of nothing? As always, I’d love to hear any critique or analysis you might have.
The code
Here’s some code in case you want to do something similar:
```r
# Load libraries
library(data.table)
library(ggplot2)

# Load the data
info <- read.csv("~/myspatialdata.csv", stringsAsFactors = FALSE)
info <- info[complete.cases(info),] # Get rid of any empty trailing rows
setDT(info)                         # Make it a data.table because Data Science
info$opened <- as.Date(info$opened) # Tell R this is a date
setkey(info, park)                  # Order by park
setkey(info, opened)                # Order by opening date

cols = c("opened", "lat", "long", "operator")
xxx <- info[, cols, with = FALSE]   # Select only the columns we'll cluster on
xxx$opened <- as.numeric(as.Date(xxx$opened))      # Convert this to a number because K-means only takes numbers
xxx$operator <- as.numeric(as.factor(xxx$operator)) # Same for the operator factor

# Slap that elbow
wss <- NULL
for (i in 2:15) {
  wss[i] <- sum(kmeans(xxx, centers = i)$withinss)
}
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")

# Create models with 3 and 6 clusters based on the elbow approach
parksclusterreal <- kmeans(xxx, 3, nstart = 10)
parksclusterfun <- kmeans(xxx, 6, nstart = 10)

# Add the cluster labels to the data frame
info$cluster <- parksclusterreal$cluster
info$clusterfun <- parksclusterfun$cluster

### Plot the parks by cluster on a world map

# Three cluster model
mp <- NULL
mapWorld <- borders("world", colour = "gray10", fill = "gray10") # create a layer of borders
mp <- ggplot(data = info, aes(x = long, y = lat, color = as.factor(cluster))) +
  mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
  ggtitle("Clusters of theme parks by location, operator, and opening date") +
  labs(colour = "Cluster")
mp

# Six cluster model
mp <- NULL
mapWorld <- borders("world", colour = "gray10", fill = "gray10") # create a layer of borders
mp <- ggplot(data = info, aes(x = long, y = lat, color = as.factor(clusterfun))) +
  mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
  ggtitle("Clusters of theme parks by location, operator, and opening date") +
  labs(colour = "Cluster")
mp
```