Record numbers at Universal Studios Japan: The continued rise of Universal, or a story of the Asian Boom?

 

[Image: The Universal Studios Japan main entrance at night (credit: Travelcaffeine.com)]

Today Universal Studios Japan released a report showing that they had received a record number of visitors last month. The news led me to wonder – was this new record the result of Universal Studios’ meteoric rise as of late, or was it more a symptom of the renewed interest in Asian theme parks in the last few years?

Pulling apart the causes of things with multivariate regression

One of the most basic tools in the Data Scientist toolkit is multivariate regression. Not only is this a useful model in its own right, but I’ve also used its output as a component of other models in the past. Basically, it looks at how much a change in each predictor explains the change in the outcome, and gives each variable a weighting. It assumes the relationships between the predictors and the outcome are linear, but people tend to use it as a starting point for pretty much every question with a bunch of predictors and a continuous outcome.
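For the curious, the kind of call I mean is only one line in R. Here’s a minimal sketch, assuming a data frame (call it parks) with one row per park per year and hypothetical columns visitors, year, universal and asia – not my actual file:

# parks: hypothetical data frame with annual visitor numbers, the year, and
# 0/1 dummy variables for Universal ownership and Asian location
fit <- lm(visitors ~ year * universal * asia, data = parks)
summary(fit) # prints a coefficient table with all the main effects and interactions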

Is the Universal Studios Japan record because it is Universal, or because it’s in Asia?

To answer this question I ran a multivariate regression on annual park visitor numbers, using dummy variables indicating whether the park was Universal-owned and whether it was in Asia. After a decent amount of messing around in ggplot, I managed to produce these two plots:

[Plot: visitor numbers over time by ownership. Black is not Universal, red is Universal]
[Plot: visitor numbers over time by region. Black is not Asia, red is Asia]

In these two plots we can see that the Universal parks are catching up to the non-Universal parks, while the Asian parks still aren’t keeping pace with the non-Asian parks. So far this is looking good for the Universal annual report!

This is confirmed by the regression model, the results of which are pasted below:

Coefficients:
                      Estimate  Std. Error  t value  p-value
(Intercept)            7831953      773691   10.123  2.00E-16
year                    126228      125587    1.005   0.3158
universal             -3522019     1735562   -2.029   0.0435
asia                  -1148589     1228394   -0.935   0.3507
universal*asia         3044323     3341146    0.911   0.3631
year*universal          234512      280112    0.837   0.4033
year*asia                31886      193528    0.165   0.8693
year*universal*asia     267672      536856    0.499   0.6185

In this we can see that, firstly, only Universal ownership has a significant effect in the model. But you can also see the Estimate of that effect is negative, which is confusing until you control for time, which is the year*universal row of the table. We can see here that for each consecutive year, we expect a Universal park to gain 234512 more visitors than a non-Universal park. On the other hand, we’d only expect an Asian park to gain 31886 more visitors than a non-Asian park for each consecutive year over the dataset. This suggests that being a Universal park is far more responsible for Universal Studios Japan’s record visitor numbers than its location. However, the model fit is really bad (about 0.02), which suggests that in reality I’m doing worse than stabbing in the dark.

Lessons learned

The main thing I learned is that it’s really complicated to get your head around interpreting multivariate regression. Despite it being one of the things you learn in first-year statistics, and something I’ve taught multiple times, it still boggles the brain to work in many dimensions of data.

The second thing I learned is that I need to learn more about the business structure of the theme park industry to be able to provide valuable insights based on models built from the right variables. Having such a terrible model fit usually says there’s something major I’ve forgotten, so getting a bit more knowledgeable about how things are done in these areas would give me an idea of the variables I need to add to increase my accuracy.

Future things to do

The first thing to do here would be to increase my dataset with more parks and more variables – I think even after a small number of posts I’m starting to hit the wall with what I can do analytically.

Second thing I want to try is to go back to the Random Forest model I made that seemed to be predicting things pretty well. I should interrogate that model to get the importance of the variables (a pretty trivial task in R), which would confirm or deny that ownership is more important than being in Asia.
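That check would only be a couple of lines with the randomForest package – a quick sketch, assuming the fitted model object is called rf_fit:

library(randomForest)

# rf_fit is assumed to be a model fitted earlier with randomForest()
importance(rf_fit) # table of importance scores for each predictor
varImpPlot(rf_fit) # the same information as a plot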

What do you think? Are my results believable? Is this truly the result of the excellent strategic and marketing work done by Universal in recent years, or is it just luck that they’re in the right place at the right time? One thing is certain: the theme park world is changing, and between Universal’s charge to the top and the ominous growth of the Chinese megaparks, Disney is going to get a run for its money in the next few years.

 

A spatio-temporal clustering of the world’s top theme parks.

[Image: The ‘Mickey’s Friends’ or something show I saw at Magic Kingdom. My niece loves Elsa.]

I’ve been playing a lot lately with my dataset of the locations, opening dates and other information about the top theme parks in the world as measured by the Themed Entertainment Association. I mentioned that I wanted to try clustering the parks to see if I can find groups of them within the data. When I built my animation of theme parks opening I thought there might be some sort of ‘contagion’ effect of parks opening, where one opening in an area increased the likelihood of another one opening in the same area within a short time. My idea is that companies and people try to reduce risk by opening parks in areas they understand, and that the risk they’re willing to take in a new market increases as time passes. This second idea comes from my contention that these companies are always trying to open new parks, but won’t do it if the current market is too competitive. Their two options are to build their share of the market they know, like they do in Florida, or to try and find a new market that looks promising. As the home market gets more and more competitive over time, those foreign markets start to look more attractive.

 

K-means clustering

K-means is one of the most popular methods around at the moment for grouping data points of any type – so popular that the algorithm comes packaged in base R. In short, the model places the data points in an N-dimensional space, where N is the number of variables you’ve put into the model. This is pretty easy to imagine with two or three variables, but once we get to six or so (having used up colour, size and shape in a graph) you have to start bending your brain in some pretty specific ways to keep up.

Then it picks K points in that space that minimise the distance between themselves and the data points assigned to them. Each chosen point is called a ‘centroid’, and the sum of squared distances between the data points and their centroid is called the ‘within Sum of Squares’, which is used to measure how well the centroids group the data points. (Here’s another explanation if you’re interested in the details.)
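In R the whole thing is a one-liner; here’s a toy sketch on made-up two-dimensional data, just to show what the output contains:

set.seed(1)
toy <- data.frame(x = rnorm(30), y = rnorm(30)) # made-up points
km <- kmeans(toy, centers = 3) # ask for 3 centroids
km$centers  # coordinates of the centroids
km$cluster  # which centroid each point was assigned to
km$withinss # within Sum of Squares for each cluster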

[Figure: An example of centroids (represented as stars) grouping data points (represented as circles) on a two-dimensional surface.]

 Choosing K: Slap that elbow

The main question that K-means clustering wants you to answer before you start is how many clusters you want it to make. It sounds weird, but the best way I know to do this (without moving to Bayesian mixture modelling) is basically trial and error. Usually with the Big Data I’m used to this can take a long time unless you do some parallelisation work, but with only 30 or so entries it’s a pretty trivial task. So basically I run a loop in R building a cluster model with 2 – 15 clusters (more is kinda useless to me), measure the Within Sums of Squares error of the model at each stage, and get ready to slap that elbow.

[Plot: Within Sum of Squares error against the number of clusters]

You can see from the graph that the error reduces massively from 2 to 3 clusters, but then eases off between 3 and 4 clusters creating an ‘elbow’ in the line plot. That indicates that 3 clusters give a good explanation of the data, and while 4 clusters is slightly better, they don’t explain that much more about the data. When trying to name and describe clusters it always gets more difficult with more groups to describe, so we don’t want to clog up our communication with clusters that don’t really mean much. Looking at this graph I could probably make two justifiable choices – three clusters is the strongest choice but six clusters is probably defensible as well. This is one of the issues with this method – the results massively rely on K, but choosing K is a really subjective procedure.

The code

Here’s some code that does the within sums of squares loop:

# xxx is a data.frame or data.table object containing only numeric columns

# Fit k-means with 2 to 15 clusters and record the total within Sum of Squares
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers=i)$withinss)}

# Plot the error against the number of clusters and look for the elbow
plot(wss, type = "l", xlab = "Clusters", ylab = "Within SS Error",
     main = "Error with different number of clusters")

The results

This was another experiment, like my adventures in Holt-Winters modelling, that looks promising but really needs more data. Here are the plots of parks with three and six clusters:

[Maps: theme park clusters plotted by location for the three and six cluster models]

The results of the three cluster model are pasted below. Tivoli stands out on its own as expected, due to its opening date being so far before everyone else’s. The other two groups, though, I’m struggling to describe by anything in particular.

Cluster 1: TIVOLI GARDENS

Cluster 2: CHIMELONG OCEAN KINGDOM, DISNEY ANIMAL KINGDOM, DISNEY CALIFORNIA ADVENTURE, DISNEY HOLLYWOOD STUDIOS, DISNEYLAND PARIS, HONG KONG DISNEYLAND, ISLANDS OF ADVENTURE, LOTTE WORLD, OCT EAST, SONGCHENG PARK, SONGCHENG ROMANCE PARK, TOKYO DISNEY SEA, UNIVERSAL STUDIOS FL, UNIVERSAL STUDIOS JAPAN, WALT DISNEY STUDIOS PARK, YOKOHAMA HAKKEIJIMA SEA PARADISE

Cluster 3: BUSCH GARDENS, DE EFTELING, DISNEYLAND, EPCOT, EUROPA PARK, EVERLAND, MAGIC KINGDOM, NAGASHIMA SPA LAND, OCEAN PARK, SEAWORLD, SEAWORLD FL, TOKYO DISNEYLAND, UNIVERSAL STUDIOS HOLLYWOOD

So I thought the six cluster model might do better in describing the parks. The results of the model are pasted below:

Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6
DISNEY HOLLYWOOD STUDIOS BUSCH GARDENS TIVOLI GARDENS CHIMELONG OCEAN KINGDOM DISNEY ANIMAL KINGDOM DE EFTELING
DISNEYLAND PARIS EUROPA PARK OCT EAST DISNEY CALIFORNIA ADVENTURE DISNEYLAND
LOTTE WORLD EVERLAND SONGCHENG ROMANCE PARK HONG KONG DISNEYLAND NAGASHIMA SPA LAND
UNIVERSAL STUDIOS FL MAGIC KINGDOM ISLANDS OF ADVENTURE SEAWORLD
YOKOHAMA HAKKEIJIMA SEA PARADISE OCEAN PARK SONGCHENG PARK UNIVERSAL STUDIOS HOLLYWOOD
EPCOT SEAWORLD FL TOKYO DISNEY SEA
TOKYO DISNEYLAND UNIVERSAL STUDIOS JAPAN
WALT DISNEY STUDIOS PARK

This one is a bit more descriptive, so I had a go at giving the clusters names. Remember that even though they are clustered by date amongst other variables, the groups aren’t organised chronologically because the algorithm is unsupervised.

Cluster 1: The Bright Future
Cluster 2: The Cold War
Cluster 3: The Start of it All
Cluster 4: The Asian Boom
Cluster 5: Pre-GFC growth
Cluster 6: The Classics

The Bright Future

These parks all seem to be built in times where the owner was looking to take a bigger risk because they knew their future was looking good. The reasons for their optimism are probably different by location and owner, but if I had to pick something that made these parks similar, I’d say it was the spirit of optimism in which they were built.

The Cold War

These parks were generally built in places and times where governments were spending a lot of time and attention on trying to show off how awesome they were, especially in the 60s and 70s. The Cold War was at its height throughout this period, and having a park like Magic Kingdom on your books was a massive draw for high-profile defectors and a boon for propaganda. Having said that, Magic Kingdom was notably built with very little government support, so I’m probably totally off here.

The Start of it All

Tivoli Gardens will always be a unique gem of this theme park world. Long live Tivoli!

The Asian Boom

These are the massive new Asian superparks, with OCT owned by the Chinese Government and the other two heavily sponsored by public interests. With these parks rocketing up the ranks, it’s very possible that this group will grow in the list of top 25 parks in the coming years.

Pre-GFC growth

Most of these parks were built in the late eighties and early nineties, in booming economies that (in retrospect) were growing because of increasing deregulation in financial markets. They were built in a spirit of optimism like the Bright Future parks, but in this case that optimism stemmed from regulatory environments rather than real business growth. A lot of these parks have done less well in the ranks in recent years, possibly as a result of the market adjustment in these economies.

The Classics

These are the parks that really established the theme park industry. After Tivoli Gardens had gestated the idea of amusement parks, Walt Disney introduced the concept to Los Angeles and everything went mental. These parks were mainly those that made up this first wave, riding on the buzz caused by Disneyland.

Stuff I learned

The first and most obvious lesson of this exercise is that K-means clustering is a minefield of subjectivity and over-interpretation. As a statistician I really don’t like having to make a decision without a solid numerical threshold on which to rely, so slapping the elbow of the within-groups error isn’t very nice. The other part of the process is naming and describing the clusters, which is pretty difficult to do from an analytical perspective. In writing my descriptions I had to be pretty creative, and as I wrote I could see all sorts of ways the list didn’t really fit what I was saying. The common excuse of people using this method is ‘you’ll never get it perfect’, but I should at least be able to say why I chose the things I did with more backup than ‘it felt right’.

The second lesson is that, as always, more data is better. I’ve done this clustering on around thirty parks, but it might make the clusters clearer if I added more parks and included more variables in the model I train. In addition, I only trained this model on four variables at the moment, while normal Data Science clustering models should contain at least 50 or 60 to really start looking valid.

Things I’ll look at in the future

The next thing I’ll probably do is a different approach to clustering using visitor numbers in combination with the locations of the parks. This would tell me if the different parks are catering to different markets that have unique patterns of attendance, which might contribute to my machine learning approaches.

Another idea is to play with the different results produced by changing K, which gives progressively more detail about the groups as it increases. This is based on work I saw at the Australian Statistical Conference in a Peter Donnelly lecture, where he did this with genetic data to move back through history and show the gradual introduction of different genetic groups.

What do you think of my attempt at grouping theme parks? Do you think the clusters make sense, or did I just pull a bunch of meaning out of nothing? As always, I’d love to hear any critique or analysis you might have.

The code

Here’s some code in case you want to do something similar:

# Load libraries
library(data.table)
library(ggplot2) # borders() below also needs the 'maps' package installed

# Load the data
info <- read.csv("~/myspatialdata.csv", stringsAsFactors = FALSE)
info <- info[complete.cases(info),] #Get rid of any empty trailing rows
setDT(info) #Make it a data.table because Data Science
info$opened <- as.Date(info$opened) # Tell R this is a date
setkey(info, park) # Order by park
setkey(info, opened) # Re-key by opening date (this replaces the previous key)
cols = c("opened", "lat", "long", "operator")

xxx <- info[, cols, with = FALSE] # Select only the columns we'll cluster on
xxx$opened <- as.numeric(as.Date(xxx$opened)) # Convert to a number because K-means only takes numbers
xxx$operator <- as.numeric(as.factor(xxx$operator)) # Same for the operator factor

# Slap that elbow
wss <- NULL
for (i in 2:15) {wss[i] <- sum(kmeans(xxx, centers=i)$withinss)}
plot(wss, type = "l", xlab = "Clusters",ylab = "Within SS Error", 
main = "Error with different number of clusters")

# Create models with 3 and 6 clusters based on the elbow approach.
parksclusterreal <- kmeans(xxx, 3, nstart =10)
parksclusterfun <- kmeans(xxx, 6, nstart =10)

# Add the cluster labels to the data frame
info$cluster <- parksclusterreal$cluster
info$clusterfun <- parksclusterfun$cluster

### Plot the parks by cluster on a world map

# Three cluster model
mp <- NULL
mapWorld <- borders("world", colour="gray10", fill="gray10") 
# create a layer of borders
mp <- ggplot(data = info, aes(x= long, y= lat , 
    color= as.factor(cluster))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
    ggtitle("Clusters of theme parks by location, operator, and opening date") +
    labs(colour = 'Cluster')
mp

# Six cluster model
mp <- NULL
mapWorld <- borders("world", colour="gray10", fill="gray10") 
# create a layer of borders
mp <- ggplot(data = info, aes(x= long, y= lat , 
    color= as.factor(clusterfun))) + mapWorld + theme_bw()
mp <- mp + geom_point(size = 5, shape = 5) + ylim(c(0, 60)) +
    ggtitle("Clusters of theme parks by location, operator, and opening date") +
    labs(colour = 'Cluster')
mp

An animation of theme parks opening around the world


I’ve been collecting a lot of data to be able to do my last few posts, and I’d mentioned that I wanted to try more with time series data. A few years ago I got to sit in a lecture at Queensland University of Technology by Sudipto Banerjee on Bayesian spatiotemporal modelling. At the time the material was way too advanced for me, but the idea of analysing data points with time and space treated correctly has always stuck with me.

As I dug into the different things I could do with spatiotemporal data, I realised that I needed a lot more understanding of the data itself before I could do fun tricksy things with it. I needed something that would maintain my interest, but also force me to mess around munging spatiotemporal data.

An idea born of necessity

In the first year of my postgraduate research, I was really interested in data visualisations. Thankfully at the time a bunch of blogs like FlowingData were starting up, reporting on all types of cool data graphics. ‘Infographics’ also became a thing, threatening to destroy Data Science in its infancy. But what caught my eye at the time were the visualisations of flight paths like this one.

So now that I have some data and a bit of time and ability, I thought I’d try a more basic version of a spatiotemporal visualisation like this. My problem is that I hate installing extra one-time software for my whims, so the idea of using ImageMagick annoyed me. On top of that, when I tried I couldn’t get it to work, so I determined to do what I could using base R and ggplot.

The result

This is probably the first article where I can say I’m pretty happy with the result:

[Animation: theme park openings over time on a world map]

 

The first thing you can see is that Europe, not the US, is the true home of theme parks, with Tivoli Gardens appearing in 1843 and remaining in the top 25 theme parks since before there were 25 parks to compete against.

Beyond that, you can also sort of see that there is a ‘contagion’ effect of parks – when one opens in an area, there are usually others opening nearby pretty soon. There are two reasons I can think of for this. First, once people are travelling to an area to go to a theme park, going to two theme parks probably isn’t out of the question, so someone’s bound to move in to capture that cash. Second, the people opening new parks have to learn to run theme parks somewhere, and if you’re taking a massive risk on opening a $100 million park with a bunch of other people’s money, you’ll want to minimise your risk by opening it in a place you understand.

Future stuff

Simply visualising the data turned out to be more than a data munging exercise for me – plotting this spatially as an animation gave some actual insights about how these things have spread over the world. It made me more interested in doing the spatio-temporal clustering as well – it would be really cool to do that then redo this plot with the colours of the points determined by the park’s cluster.

Another direction to explore would be to learn more about how to scrape Wikipedia and fill out my data table with more parks rather than just those that have featured in the TEA reports. I know this is possible and it’s not exactly new, but it’s never come across my radar and web scraping is a pretty necessary tool in the Data Science toolkit.

What applications can you think of for this sort of visualisation? Is there anything else I could add to this one that might improve it? I’d love to hear your thoughts!

The code

Just in case you wanted to do the same, I’ve added the code with comments below. You’ll need to add your own file with a unique name, latitude, longitude and date in each row.

# Load the required libraries
library(ggmap)
library(ggplot2)
library(maptools)
library(maps)
library(data.table)

info <- read.csv("***.csv", stringsAsFactors = FALSE)

info <- info[complete.cases(info),]
setDT(info)
info$opened <- as.Date(info$opened) # Tell R this is a date
setkey(info, park)
setkey(info, opened)


# Setup for an animation 
a_vec <- seq(1840, 2016, by = 1) # Create a vector of the years you will animate over

# Create a matrix to hold the 'size' information for the graph
B = matrix( rep(0, length(a_vec)*length(info$park)),
 nrow= length(a_vec), 
 ncol= length(info$park))

for (i in 1:ncol(B))
{
 for (x in 1: nrow(B))
 { # I want to have a big dot when the park opens that gets gradually smaller,
   # like the alpha in the flights visualisation.
  open_date <- as.numeric(year(info$opened[i]))
  c_year <- a_vec[x]
  # Any year other than the opening year starts at 0 (the fade loop below handles the rest)
  if ( open_date < c_year)
  {B[x,i] <- 0} else
  # If the park is in its opening year, give it a big circle.
  if (open_date == c_year)
  {B[x,i] <- 10}
  }}

# Make the circle fade from size 10 to size 1, then stay at 1 until the end of the matrix

for (i in 1:ncol(B))
{ for (x in 2: nrow(B))
  {if (B[x-1, i] > 1){ B[x,i] <- B[x-1, i] - 1}else
   if(B[x-1, i] == 1){ B[x,i] <- 1}}}

B <- data.frame(B)
B <- cbind( a_vec, B)
setDT(B)
names(B) <- c("years", info$park) # Set the column names to the names of the parks

xxx <- melt(B, "years") # Convert to long format

# Create a table of locations
loc <- data.table("variable" = info$park,
                   "lat"= info$lat, 
                   "long"= info$long)

#Join the locations to the long table
xxx <- merge(xxx, loc, by = "variable", all.x = TRUE)
setkey(xxx, years)

# Create a ggplot image for each entry in the a_vec vector of years we made at the beginning
for (i in 1: length(a_vec))
    {mydata <- xxx[years ==a_vec[i]] # Only graph the rows for year i.
     mydata <- mydata[mydata$value!=0,] #Don't plot stuff not open yet.
     # Write the plot to a jpeg file and give it a number to keep the frames in order
     jpeg(filename = 
     paste("~/chosenfolder/animation", i, ".jpeg", sep = ""),
     width = (429*2) , height = (130*2), units = "px") 
     mp <- NULL 
     # Plot a world map in grey and entitle it with the year.
     mapWorld <- borders("world", colour="gray50", fill="gray50") 
     mp <- ggplot() + mapWorld + theme_bw() + ggtitle(a_vec[i])
     # Add the points on the map, using the size vector we spent all that time building matrices to produce
     mp <- mp+ geom_point(aes(x=mydata$long, y=mydata$lat) ,
     color = "orange", size = mydata$value/1.5) + ylim(c(0, 60))
     plot(mp)
     dev.off()
}

 

Using machine learning to improve predictions of visitor numbers

[Image: The torii at EPCOT with the globe thing in the background]

I wrote previously about using the Holt-Winters model for time series analysis, particularly to predict the number of visitors to two of the world’s top theme parks next year. I am using annual data from the last ten or so years (which is all that’s available from the Themed Entertainment Association at this point), and unfortunately we could see quite easily that data at this frequency (i.e. annual) is too sparse to make a decent prediction.

So the data are horrible, what are you going to do?

This kind of annoyed me – it takes ages to put together all this data in the first place and the results were disappointing. So I started thinking about other ways I could potentially model this using other data as well, and it was pretty easy to get general information about all these parks, like their location, opening date and company ownership. I can imagine that parks that are close to each other are probably serving a similar crowd, and are subject to the same factors. Same with park ownership – the parent companies of these parks each have their own strategies, and parks with the same owner probably share in each other’s successes or failures. But to allow for these sorts of assumptions, I needed some way of adding this information to my model and letting it use this sort of stuff to inform its predictions.

Machine Learning to the rescue

In current Data Science, Machine Learning is sort of a go-to when the normal models fail. It allows us to take a vast array of complex information and use algorithms to learn patterns in the data and make some pretty amazing predictions. In this case we don’t really have Big Data like we would at a major corporation, but given that the numbers are pretty stable and we’re only trying to predict a few cases, it’s possible that this approach could improve our predictions.

Machine what now?

I know, it’s both a confusing and kind of ridiculous name. The whole idea started when Computer Scientists, Mathematicians and Statisticians started using computers to run equations millions of times over, using the results of each round, or ‘iteration’, of the calculation to update the next. It started with doing some pretty basic models, like linear and logistic regression, over and over, testing the results and adjusting the weights of each factor in the model to improve them each time. Soon people started using these as building blocks in more complicated models, like Decision Trees, which evolved into Random Forests (which combine the results of hundreds or thousands of decision trees). The sophistication of the building blocks improves daily, as does the ability to stack these blocks into more and more complex combinations of models. The winners of many Kaggle competitions now take the most sophisticated of methods and combine them for ridiculously accurate predictions of everything from rocket fuel usage to credit card risk. In this article I’m going to use one of the most popular algorithms, the Random Forest. I like these because they can be used for both numeric and categorical data, and do pretty well on both.
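To make that concrete, here’s a rough sketch of the kind of model I fitted, using the randomForest package. The data frame parks and the column names (visitors, year, lat, long, operator) are stand-ins for my real dataset, so treat this as illustrative rather than the exact code:

library(randomForest)

# Hypothetical training data: one row per park per year, with the outcome
# (visitors) plus year, location and operator as predictors
train <- parks[parks$year <= 2015, ]
train$operator <- as.factor(train$operator) # categorical predictors go in as factors

set.seed(42) # random forests are stochastic, so fix a seed for repeatability
rf_fit <- randomForest(visitors ~ year + lat + long + operator,
                       data = train, ntree = 1000)

# To predict the next year, copy the latest rows and add 1 to the year
newdata <- train[train$year == 2015, ]
newdata$year <- newdata$year + 1
predict(rf_fit, newdata)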

The results

This time we actually started getting pretty close to a decent model. Below you can see the graph of predicted and actual (labeled as ‘value’) visitor numbers for each park in 2015:

[Graph: predicted vs actual visitor numbers by park for 2015]

It’s not too far off in a lot of cases, and pretty much everywhere it’s predicting just below what really happened, except in the case of Disneyland Paris. In a few cases I’m way off, like for Universal Studios Japan, which could possibly be due to the stellar performance of all the Universal parks recently. So with this information in hand, here are my predictions for 2016:

DISNEY ANIMAL KINGDOM 10262808.79
DISNEY CALIFORNIA ADVENTURE 7859777.858
DISNEY HOLLYWOOD STUDIOS 10161975.17
DISNEYLAND 15850608.32
DISNEYLAND PARIS 11303153.4
EPCOT 11048540.24
EUROPA PARK 4600339.552
EVERLAND 7108378.079
HONG KONG DISNEYLAND 6508497.992
ISLANDS OF ADVENTURE 7419398.232
MAGIC KINGDOM 17124831.22
NAGASHIMA SPA LAND 5305896.091
OCEAN PARK 6860359.451
SEAWORLD FL 5440392.711
TIVOLI GARDENS 4249590.638
TOKYO DISNEY SEA 13529866.78
TOKYO DISNEYLAND 15279509.39
UNIVERSAL STUDIOS FL 7079618.369
UNIVERSAL STUDIOS HOLLYWOOD 5956300.006
UNIVERSAL STUDIOS JAPAN 9611463.005

If you want to see how these relate to my 2015 predictions, here’s a graph:

[Graph: the 2016 predictions alongside the 2015 predictions]

 

Future stuff

As usual, I can still see a whole lot of things I can do to improve this model. At the moment there are only two variables ‘moving’ with each row – the date and the visitor number. I could add a few more features to my model to improve things – the GDP of the country the park is in, for example.

Second, Random Forests are notoriously bad at predicting time series data. In this case I converted the year of the data into a numeric vector rather than a date, adding 1 to the variable for the prediction. Given that each entry for each park was an even number of days apart (365 each row) I think that’s fair, but maybe I can’t treat annual entries that way. To be fair, though, there don’t seem to be many models particularly good at predicting time series. There are suggestions of using artificial neural networks, but these aren’t particularly noted for time-series or spatio-temporal modelling. I think ‘Data Science’ needs to draw a bit more from Statistics in this case, and I’ll probably look in that direction for improved results in future. Given that it’s annual data I have the advantage of having a long time to process my model, so things like MCMC using Stan might be promising here.

Finally, I need to get more practice at using ggplot2 for pretty graphs. I know a few tricks but my coding chops really aren’t up to building things with the right labels in the right places, especially when there are really long names. In this article I spent ages trying to fit the names of the parks into the first graph, but in the end I really couldn’t figure it out without making it really ugly. I’d love to be able to add my predictions as extensions on a line plot of the observed data, but that seems like epic level ggplot ninja-ing.

I’ll probably continue to attempt improving my predictions because it makes me feel like a wizard, but at this point I’ll most likely try this by playing with different models rather than ‘feature engineering’, which is the more popular approach on Kaggle.

I’m always keen to hear people’s feedback and I’d love to improve my analyses based on people’s suggestions. Do you think my estimates are accurate, or is there something major I’ve missed?

 

Theme park ranks over ten years

I’m interested in understanding the competitive landscape of theme parks, and showing their ranks from year to year is a good way of seeing this. The best way I know of is to use everybody’s favourite chart – the bumps chart!

What’s a bumps chart?

This was invented in Cambridge to keep track of one of the most mental sporting events you’ll ever see – the May Bumps.

[Image: The May Bumps (credit: Selwyn College)]

In true Cambridge style, the May Bumps are a rowing race held every June. Apart from their timing, the series of races involves all the college rowing teams (usually around 20 of them at once) racing down the river Cam at high speeds trying desperately to run into (or ‘bump’) each other. If a crew catches up to the one in front, both crews pull over and in the next race they swap positions for the start. This means that over a week a crew can move from the front to the back of the race, and this tells a story of that year’s Bumps. The original bumps chart hangs in the Cambridge University Union building.

[Image: A bumps chart of a May Bumps series, showing Oriel winning the competition.]

Results

The bumps chart I created was based on the Themed Entertainment Association reports published online each year since 2006. The data were read into R, and I used the ggplot2 package to draw a line plot of the parks’ ranks over the years. The directlabels package was used for the labels.

[Chart: bumps chart of theme park ranks by year]

Abbreviations used in the chart:
BLACKPOOL PLEASURE BEACH (BLPB)    NAGASHIMA SPA LAND (NASL)
BUSCH GARDENS (BUSG)    OCEAN PARK (OCEP)
CHIMELONG OCEAN KINGDOM (CHOK)    OCT EAST (OCTE)
DE EFTELING (DEEF)    PLEASURE BEACH (PLEB)
DISNEY ANIMAL KINGDOM (DIAK)    PORT AVENTURA (PORA)
DISNEY CALIFORNIA ADVENTURE (DICA)    SEAWORLD (SEAW)
DISNEY HOLLYWOOD STUDIOS (DIHS)    SEAWORLD FL (SEAF)
DISNEYLAND (DISN)    SONGCHENG PARK (SONP)
DISNEYLAND PARIS (DISP)    SONGCHENG ROMANCE PARK (SORP)
EPCOT (EPCO)    TIVOLI GARDENS (TIVG)
EUROPA PARK (EURP)    TOKYO DISNEY SEA (TODS)
EVERLAND (EVER)    TOKYO DISNEYLAND (TOKD)
HONG KONG DISNEYLAND (HOKD)    UNIVERSAL STUDIOS FL (UNSF)
ISLANDS OF ADVENTURE (ISOA)    UNIVERSAL STUDIOS HOLLYWOOD (UNSH)
KNOTTS BERRY FARM (KNBF)    UNIVERSAL STUDIOS JAPAN (UNSJ)
LOTTE WORLD (LOTW)    WALT DISNEY STUDIOS PARK AT DISNEYLAND PARIS (WDSPADP)
MAGIC KINGDOM (MAGK)    YOKOHAMA HAKKEIJIMA SEA PARADISE (YHSP)
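In case you’d like to draw something similar, the core of the chart is just a rank-by-year line plot with labels at the ends of the lines. Here’s a minimal sketch, assuming a long-format data frame called ranks with columns park, year and rank (my real data needed a fair bit more munging than this):

library(ggplot2)
library(directlabels)

# ranks: hypothetical long-format data, one row per park per year with its rank
p <- ggplot(ranks, aes(x = year, y = rank, group = park, colour = park)) +
  geom_line() +
  scale_y_reverse(breaks = 1:25) + # rank 1 belongs at the top of a bumps chart
  theme_bw() +
  theme(legend.position = "none")

# Label each line at its final point instead of using a legend
direct.label(p, "last.points")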

There are a few really noticeable things when we plot out the ranks of parks this way. The first is that Disney dominates the industry, and they keep a tight ship. Their parks don’t compete with each other for audience, and they don’t tend to move up and down relative to each other.

The second noticeable thing about the plot is the recent rise of Universal through the ranks, to finally crack the Disney lockout. This probably explains the buzz within Comcast (Universal’s owners) at the moment, and all their talk about an aggressive growth strategy.

Finally, we can see really clearly here that the Asian parks, particularly the Chinese ones, are staking a claim in the industry as mega players. The Songcheng and Chimelong mega parks in particular are growing at an incredible rate and are showing no signs of stopping. If the trend continues, it is very possible that our children will be pleading with us to take them to China for the rides.

Future stuff

There are a whole lot of problems here around missing data. In particular, we only get the top 20 – 25 parks each year, and TEA only recently started publishing year-to-year figures, so the data are really patchy for some parks. On the other hand, in the true spirit of Data Science, the missingness could probably be used to tell us something as well, if we could derive any meaning from the patterns of parks dropping in and out of the top 25.

I’d also be really interested to aggregate the data in different ways to see other patterns in the rankings. We could aggregate parks by location to see which areas are most popular at the moment, or we could aggregate by owner to look at who’s actually performing the best on a budget level. Looking at ownership companies brings forward whole new dimensions to the data – for example none of the Merlin Entertainment parks feature in the top 25, yet they have appeared in the top ten entertainment companies in income for the last ten years.

Do you think Universal can continue its rise? Will the Chinese parks continue to grow to be larger than the mighty Magic Kingdom, or will Disney retain its seat as the unchallenged leader?

Predictions of Disney and Universal visitor numbers

[Image: The Africa area of Disney’s Animal Kingdom]

When thinking about theme parks, one of the most obvious questions is how to predict the number of visitors expected for the coming years. This is not easy to do, but even an approximate answer would help in planning ride maintenance and staffing levels.

Why is this so difficult?

There are a bunch of reasons it’s difficult to predict visitor numbers to any large attraction.

First, all theme parks around the world are subject to global economics – if a park attracts lots of visitors from an area that happens to have a war or a recession then all bets are off.

Second, in places like Orlando where there is a high concentration of parks the number of visitors at a specific park depends heavily on the popularity of other parks in the area.

Finally, when we are talking about a global audience, there are any number of issues that can arise that destroy a park’s precious season. In 2010, when the Icelandic volcano Eyjafjallajökull erupted unexpectedly, Danish park Tivoli Gardens saw a drop of 20,000 visitors.

How is it done?

When forecasting pretty much anything, the go-to method is called the Holt-Winters model. There is a whole lot of clever maths behind this, but what you need to know is that it looks at data collected over time (annually in our case), placing more importance on values it saw more recently than on the ones it saw a long time ago.

The data come from the Themed Entertainment Association annual reports, which are sort of canonical for the theme park industry. In this set we go back as far as their published reports allow – to 2006. This isn’t a particularly long time, especially considering that all we get is annual data, but at least we might be able to get some idea of what we could expect.
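Mechanically it’s only a few lines in base R. Here’s a sketch, assuming visitors_mk is a plain numeric vector of annual visitor totals starting in 2006; with annual data there’s no within-year seasonality, so the seasonal term (gamma) is switched off:

# Hypothetical vector of annual visitor numbers from 2006 onwards
visits <- ts(visitors_mk, start = 2006, frequency = 1)

# Annual data has no seasonal component to estimate, so drop it
hw_fit <- HoltWinters(visits, gamma = FALSE)

# Forecast the next ten years with an 80% prediction interval
predict(hw_fit, n.ahead = 10, prediction.interval = TRUE, level = 0.80)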

Who cares?

We have data for the top 22 or so parks for that time (the bottom few tend to drop off every couple of years), but to show what we’re doing we’ll just look at the two major competitors in the theme park industry – Disney’s Magic Kingdom, and the first non-Disney competitor, Universal Studios Florida. This is interesting because Universal has recently announced an aggressive new strategy, likely based on the success of its recent Harry Potter attractions. But can Universal expect its rise to continue, or will Magic Kingdom maintain its unbeatable position?

The results

Well, it doesn’t look particularly good for Universal’s strategy. Here are plots of the Holt-Winters fitting of visitor numbers to the Magic Kingdom and Universal Studios:

[Plots: Holt-Winters fits of visitor numbers for Universal Studios Florida and the Magic Kingdom]

We can see that both parks are steady, but Universal Studios performs massively below Magic Kingdom. The red line shows the fitted Holt-Winters model, and to be honest I’m not that happy with it. Really we’re just predicting the value from the previous year, so I’m interested to see how it does with forecasting.

To see how the two parks might do against each other into the future, we use the Holt-Winters model to predict the next ten years of visitors:

[Plots: ten-year Holt-Winters forecasts for Universal Studios Florida and the Magic Kingdom]

We can see here that our (dumb) Holt-Winters model is predicting the Magic Kingdom to sustain its massive lead over Universal Studios. We can see this in the 80% confidence intervals for both parks at the ten year period – between 7 and 12.16 million visitors for Universal, and between 18.5 and 22.4 million for the Magic Kingdom. This isn’t even close to an overlap, and suggests that Universal has next to no chance of overtaking the Disney powerhouse.

The lessons

The main thing I learned from this exercise is that the Holt-Winters model is best suited to data that is more frequent than annual. The power of the model comes from estimating seasonal variations, so with monthly or even quarterly data our predictions would become a lot more interesting.
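For what it’s worth, the change needed is tiny once such data exist – a hypothetical sketch with a monthly series called visitors_monthly:

# Hypothetical monthly visitor counts starting January 2006
monthly <- ts(visitors_monthly, start = c(2006, 1), frequency = 12)

# With frequency > 1 the seasonal (gamma) component can be estimated as well
hw_seasonal <- HoltWinters(monthly)
predict(hw_seasonal, n.ahead = 24, prediction.interval = TRUE)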

I also learned that Universal Studios may have got a little over-excited by their recent success. It took many years before they were able to crack the Disney fortress of top ranks, and the Harry Potter world attraction seems to have had a bigger effect than they realise, even at this point.

Future stuff

There is a whole lot more I’m intending to do with this data. Most immediately, I’d like to try and improve my forecasts by adding in information about the parks, such as their location. As I mentioned at the top of the article, the success of parks in places like Orlando, and arguably the Benelux region, is highly dependent on the performance of their competitors, so a model would likely be able to gain a lot of information from the performance of nearby parks.

I also want to see if there are groupings of parks according to their visitor numbers over time. Seeing different clusters of parks by this metric would suggest they are catering to different populations, and might indicate which parks were truly competing against each other.

This was fun to do, and a great experience to play around with some time series data. Hope you learned something!