Today Universal Studios Japan released a report showing that they had received a record number of visitors last month. The news led me to wonder – was this new record the result of Universal Studios’ recent meteoric rise, or was it more a symptom of the renewed interest in Asian theme parks over the last few years?
Pulling apart the causes of things with multivariate regression
One of the most basic tools in the data scientist’s toolkit is multivariate regression. Not only is this a useful model in its own right, but I’ve also used its output as a component of other models in the past. Basically, it looks at how much the change in each predictor explains the change in the outcome and gives each variable a weighting. It assumes the outcome is a linear function of the predictors, but people tend to use it as a starting point for pretty much every question with a bunch of predictors and a continuous outcome.
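To make the "weighting" idea concrete, here is the textbook form of the model, assuming ordinary least squares is the fitting method (the post does not say which estimator was used). The symbols $y_i$, $x_{ij}$, $\beta_j$ and $\varepsilon_i$ are notation introduced here for illustration, not taken from the original analysis.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```

Each $\beta_j$ is the weighting for predictor $j$: the expected change in the outcome for a one-unit change in that predictor with the other predictors held fixed.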
Is the Universal Studios Japan record because it is Universal, or because it’s in Asia?
To answer this question I ran a multivariate regression on annual park visitor numbers, using dummy variables indicating whether the park was Universal-owned and whether it was in Asia. After a decent amount of messing around in ggplot, I managed to produce these two plots:
In these two plots we can see that the Universal parks are catching up to the non-Universal parks, while the Asian parks still aren’t keeping pace with the non-Asian parks. So far this is looking good for the Universal annual report!
This is confirmed by the regression model, the results of which are pasted below:
| Estimate | Std. Error | t value | p-value |
In this we can see, firstly, that only Universal ownership has a significant effect in the model. You can also see that the Estimate of that effect is negative, which is confusing until you take time into account through the interaction in the year*universal row of the table. We can see there that for each consecutive year, we expect a Universal park to gain 234512 more visitors than a non-Universal park. On the other hand, we’d only expect an Asian park to have 31866 more visitors than a non-Asian park for each consecutive year over the dataset. This suggests that being a Universal park is far more responsible for Universal Studios Japan’s record visitor numbers than its location. However, the model fit for this is really bad (.02), which suggests that in reality I’m doing worse than stabbing in the dark.
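The negative main effect next to a positive year*universal row is easier to see on a toy example. The sketch below is not the original analysis (which was done in R on real park data): it uses made-up, noise-free numbers, a hypothetical `fit_ols` helper written with the standard library only, and an assumed model with an intercept, a year term, a Universal dummy and their interaction. Under that assumed model, the gap between a Universal and a non-Universal park in a given year is the dummy weight plus the interaction weight times the year.

```python
# Ordinary least squares via the normal equations, standard library only.
# All numbers are invented for illustration; this is NOT the park dataset.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares weights for y ~ rows; rows must include the intercept column."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical "true" weights, in millions of visitors:
# intercept, year, universal dummy, year*universal interaction.
true_weights = [10.0, 0.5, -3.0, 1.2]

rows, visitors = [], []
for is_universal in (0, 1):
    for year in range(6):  # years since the start of the made-up dataset
        row = [1.0, float(year), float(is_universal), float(year * is_universal)]
        rows.append(row)
        visitors.append(sum(w * v for w, v in zip(true_weights, row)))

weights = fit_ols(rows, visitors)
for name, w in zip(["intercept", "year", "universal", "year*universal"], weights):
    print(f"{name:>15}: {w:.3f}")
```

Here the Universal dummy weight is the gap at year 0, so it can be negative even while the interaction weight says Universal parks gain on the others every year. Real data would add noise, and the weights would then come with the standard errors and p-values that a regression table reports.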
The main thing I learned is that it’s really complicated to get your head around interpreting multivariate regression. Despite it being one of the things you learn in first-year statistics, and something I’ve taught multiple times, it still boggles the brain to work in many dimensions of data.
The second thing I learned is that I need to learn more about the business structure of the theme park industry to be able to provide valuable insights based on models built from the right variables. Having such a terrible model fit usually says there’s something major I’ve forgotten, so getting a bit more knowledgeable about how things are done in these areas would give me an idea of the variables I need to add to increase my accuracy.
Future things to do
The first thing to do here would be to increase my dataset with more parks and more variables – I think even after a small number of posts I’m starting to hit the wall with what I can do analytically.
The second thing I want to try is to go back to the Random Forest model I made that seemed to be predicting things pretty well. I should interrogate that model to get the importance of the variables (a pretty trivial task in R), which would confirm or deny that ownership is more important than being in Asia.
What do you think? Are my results believable? Is this truly the result of the excellent strategic and marketing work done by Universal in recent years, or is it just luck that they’re in the right place at the right time? One thing is certain: the theme park world is changing players, and between Universal’s charge to the top and the ominous growth of the Chinese megaparks, Disney is going to get a run for its money in the next few years.