Analysing Florida theme park incidents: The road is long and full of regex

I’ve been interested in finding incident reports since I started writing this blog. In a world that seems so inherently dangerous but sells itself on being safe, I’ve been really curious what the data actually says. In this article I’m going to tell you how I found the incident reports submitted to the Florida government, cleaned most of the data, and converted it to a spreadsheet that I could actually analyse with some basic plots. Get in touch through the contact page if you’d like a copy of the original data or the final cleaned version to play with.

Finding the data


Some Google searching turned up reports from WDWinfo and the Orlando Sentinel (not available in the EU), which pointed to a Florida government site hosting a single document that appears to be continually updated at the same link.

I thought this was a bit of luck! If they were just updating the same page, then I could theoretically set up a script to check it each quarter and update my data sheet. Unfortunately that fell apart really quickly when I realised it was a PDF file, which was pretty much impossible for me to read with my skills at the time. So I decided to do something unforgivable and just copy-paste the whole thing into a spreadsheet – definitely not scalable! Little did I know that scalability would be thrown out the window very quickly once I started working with it in Python.
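
Looking back, a better first move would have been to try a PDF text extractor before giving up. Something like the pdfplumber library can often pull the raw text out – here’s an untested sketch (the file name is a placeholder for whatever the downloaded report is called):

import pdfplumber

# Untested sketch: pull raw text out of the report PDF instead of copy-pasting.
with pdfplumber.open("incident_report.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
print(text[:500])  # eyeball the first few lines to see what you're dealing with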

Pandas, but not that kind

To get this data into some sort of shape I decided to use the regex functions provided by the re and pandas modules in Python. This decision was mainly because Python is much faster at dealing with strings than R, and pandas is a really useful (and R-like) module that makes data handling even simpler.

I tried to just read it in using pandas at first, but there were way too many stray commas for it to handle. That left me with one option: import the whole thing as strings, stick the strings together, and figure out how to split it myself. Thankfully there was at least a tiny bit of standardisation in the file – each line started with a date preceded by a space. After a long time trying to work out the regex for a date, I gave in and started reading through regex tutorials, which was really boring but really useful! From there I started pulling out whatever information I could using any patterns I could see.

import pandas as pd
import re
import csv

# I use the csv module here to read in the file because pandas was doing too much formatting for me
i = ""
with open("unformatted_incidents.csv", 'rb') as csvfile:
    incidentreader = csv.reader(csvfile)
    for row in incidentreader:
        for item in row:
            i = i + " " + str(item)

# This giant clumsy regex gets rid of all the theme park names that snuck into the copy-paste
i = re.sub(r"Wet.{1,10}Wild:|Disney:|Universal:|Sea World:|Busch Gardens:|Disney World:|Legoland:|None [Rr]eported|/{0,1}MGM:{0,1}|Epcot,{0,1}|USF|Adventure Island|Magic Kingdom", "", i)

# Now I split it on the space before each date it sees - you'll see this didn't quite work in the end.
splitlist = re.split(r' (?=[0-9]{1,2}/[0-9]{1,2}/{0,1}[0-9]{2})', i)

# Convert the list of strings into a one-column data frame
incidents = pd.DataFrame(splitlist, columns = ['a'])

# Each date has a space after it so I split on that space to get a date column
incidents[['date', 'stuff']] = incidents['a'].str.split(' ', n=1, expand=True)
# The age of the person is the only digits left in the strings
incidents["age"] = incidents.stuff.str.extract(r"(\d+)", expand=False)
# The gender of the person always sits in a similar position, so I use positive lookbehinds to find it
incidents["gender"] = incidents.stuff.str.extract(r"((?<=year old ).|(?<=yo).)", expand=False)
# Any words before the age of the person are usually the ride name (not always though!)
incidents["ride"] = incidents.stuff.str.extract(r"(.* (?=[0-9]))", expand=False)
# incidents = incidents.drop(['a'], axis = 1)
incidents.drop(incidents.index[0], inplace=True)
print(incidents)
incidents.to_csv("incidents.csv")

This script gave me a relatively clean dataset of around 560 incidents from 2003 to 2018 that I could at least import into R. I was celebrating at this stage, thinking the pain was over, but little did I know what was to come…

Cleaning and plotting in R

Now that I had something I could load, it was time to have some fun with graphs. But before I could do that, I needed to actually examine the data a bit more. It all looked fine at first – most rows had a date, an age and a gender – but when I looked at the levels of the ‘ride’ column my blood ran cold as I realised how human-generated this data really was. I had 206 rides in the set, but as I started scrolling through them, almost all of them had duplicates with different spellings, capitalisations and punctuation. Spiderman was both “Spider Man” and “Spider-Man”. And don’t even talk about the Rip Ride Rockit and the million spellings they’ve used over the years in the report. This meant a LOT of dumb and non-scalable coding to clean it up:

library(data.table)
library(ggplot2)
incidents <- fread("~/Data/dis_incidents.csv")
incidents <- incidents[, V1 := NULL][
  , date := as.POSIXct(date, format = "%m/%d/%y")][
  , ride := as.factor(ride)][
  , condition := grepl("pre[-| |e]", stuff)][
  , year := year(date)][
  !is.na(date)][year < 2019]

levels(incidents$ride) <- trimws(levels(incidents$ride), which = "both")
levels(incidents$ride) <- gsub(",|;|/.", "", levels(incidents$ride))

levels(incidents$ride) <- tolower(levels(incidents$ride))
levels(incidents$ride)[levels(incidents$ride)%like% "rock" & !levels(incidents$ride)%like% "rip"] <- "rock n rollercoaster"
levels(incidents$ride)[levels(incidents$ride)%like% "soar"] <- "soarin"
levels(incidents$ride)[levels(incidents$ride)%like% "under"] <- "under the sea jtlm"
levels(incidents$ride)[levels(incidents$ride)%like% "storm"] <- "storm slides"
levels(incidents$ride)[levels(incidents$ride)%like% "transformers"] <- "transformers"
levels(incidents$ride)[levels(incidents$ride)%like% "mission"] <- "mission space"
levels(incidents$ride)[levels(incidents$ride)%like% "hulk"] <- "incredible hulk coaster"
levels(incidents$ride)[levels(incidents$ride)%like% "sim"] <- "the simpsons"
levels(incidents$ride)[levels(incidents$ride)%like% "men"] <- "men in black"
levels(incidents$ride)[levels(incidents$ride)%like% "kil"] <- "kilimanjaro Safaris"
levels(incidents$ride)[levels(incidents$ride)%like% "tom"] <- "tomorrowland speedway"
levels(incidents$ride)[levels(incidents$ride)%like% "harry potter" & levels(incidents$ride)%like% "escape"] <- "hp escape from gringotts"
levels(incidents$ride)[levels(incidents$ride)%like% "harry potter" & levels(incidents$ride)%like% "forbid"] <- "hp forbidden journey"
levels(incidents$ride)[levels(incidents$ride)%like% "pirate"] <- "pirates of the caribbean"
levels(incidents$ride)[levels(incidents$ride)%like% "honey"] <- "honey i shrunk the kids"
levels(incidents$ride)[levels(incidents$ride)%like% "caro-"] <- "caro-seuss-el"
levels(incidents$ride)[levels(incidents$ride)%like% "buzz"] <- "bl spaceranger spin"

levels(incidents$ride)[levels(incidents$ride) %like% "everest"] <- "expedition everest" 
levels(incidents$ride)[levels(incidents$ride) %like% "astro"] <- "astro orbiter" 
levels(incidents$ride)[levels(incidents$ride) %like% "typhoon"|levels(incidents$ride) %like% "wave pool"|levels(incidents$ride) %like% "surf pool" ] <- "typhoon lagoon" 
levels(incidents$ride)[levels(incidents$ride) %like% "tob"] <- "toboggan racer" 
levels(incidents$ride)[levels(incidents$ride) %like% "progress"] <- "carousel of progress" 
levels(incidents$ride)[levels(incidents$ride) %like% "rip" & !levels(incidents$ride) %like% "saw"] <- "rip ride rockit" 
levels(incidents$ride)[levels(incidents$ride) %like% "knee"] <- "knee ski"
levels(incidents$ride)[levels(incidents$ride) %like% "spider"] <- "spiderman"
levels(incidents$ride)[levels(incidents$ride) %like% "seas"] <- "seas w nemo and friends"
levels(incidents$ride)[levels(incidents$ride) %like% "terror"] <- "tower of terror"
levels(incidents$ride)[levels(incidents$ride) %like% "dinos"] <- "ak dinosaur"
levels(incidents$ride)[levels(incidents$ride) %like% "bliz"] <- "blizzard beach"
levels(incidents$ride)[levels(incidents$ride) %like% "space m"] <- "space mountain"
levels(incidents$ride)[levels(incidents$ride) %like% "drag" & levels(incidents$ride) %like% "chal"] <- "dragon challenge"
levels(incidents$ride)[levels(incidents$ride) %like% "drag" & levels(incidents$ride) %like% "chal" |levels(incidents$ride) %like% "duel" ] <- "dragon challenge"
levels(incidents$ride)[levels(incidents$ride) %like% "dragon coas"] <- "dragon coaster"
levels(incidents$ride)[levels(incidents$ride) %like% "rapid"& levels(incidents$ride) %like% "roa"] <- "roa rapids"
levels(incidents$ride)[levels(incidents$ride) %like% "riverboat" | levels(incidents$ride) %like% "liberty"] <- "liberty riverboat"
levels(incidents$ride)[levels(incidents$ride) %like% "jurassic"] <- "camp jurassic"
levels(incidents$ride)[levels(incidents$ride) %like% "seven"] <- "seven dwarves mine train"
levels(incidents$ride)[levels(incidents$ride) %like% "prince"] <- "prince charming carousel"
levels(incidents$ride)[levels(incidents$ride) %like% "toy"] <- "toy story mania"
levels(incidents$ride)[levels(incidents$ride) %like% "peter"] <- "peter pans flight"
levels(incidents$ride)[levels(incidents$ride) %like% "mayd"] <- "mayday falls"
levels(incidents$ride)[levels(incidents$ride) %like% "crush"] <- "crush n gusher"
levels(incidents$ride)[levels(incidents$ride) %like% "test track"] <- "test track"
levels(incidents$ride)[levels(incidents$ride) %like% "manta"] <- "manta"
levels(incidents$ride)[levels(incidents$ride) %like% "despic"] <- "dm minion mayhem"
levels(incidents$ride)[levels(incidents$ride) %like% "passage"] <- "flight of passage"
levels(incidents$ride)[levels(incidents$ride) %like% "mummy"] <- "revenge of the mummy"
levels(incidents$ride) <- gsub("e\\.t\\.", "et", levels(incidents$ride))

ridesort <- incidents[, .N, by = ride][order(-N)][1:10]
ridesort$ride <- factor(ridesort$ride, levels = ridesort$ride[order(-ridesort$N)])
ggplot(data = incidents[!is.na(age)], aes(age)) + geom_histogram(breaks = seq(0, 95, by = 5), col = "blue", fill = "black") + ggtitle("Florida theme park reported incidents by age") + xlab("Age") + ylab("Incidents") + scale_x_continuous(breaks = seq(0, 100, by = 5))

ggplot(data = incidents[!gender == ""][ride %in% c("expedition everest", "prince charming carousel", "typhoon lagoon")], aes(gender)) + geom_histogram(breaks=seq(0, 95, by =2), col=" blue", fill="black", stat = "count") + theme(legend.position="none") + ggtitle("Florida theme park reported incidents by gender") + xlab("Gender") + ylab("Incidents") + facet_wrap( ~ ride)

ggplot(data = incidents[!gender == ""], aes(x= year)) + geom_histogram(col=" blue", fill = "black", binwidth = 1) + theme(legend.position="none") + ggtitle("Florida theme park reported incidents by year") + xlab("Year") + ylab("Incidents") + xlim(c(2002, 2018))

ggplot(data = ridesort, aes(x = ride, y = N)) + geom_col(col = "blue", fill = "black") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Florida reported incidents by ride")

Now I’m totally willing to hear about any way I could have done this better, but to be honest, with the exception of a few &’s and |’s, I don’t see how it could have been much shorter – though storing the patterns as data rather than code would at least have been tidier, as sketched below. The problem is that when you have humans typing data themselves and submitting it to the government as a document, there is very little control over standardisation. Having cut my 206 rides down to 112 by merging duplicates, I could actually get some interesting graphs and numbers.
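
For what it’s worth, the tidier option I thought of later is to keep the pattern-to-name pairs in a lookup table and loop over it. A rough, untested sketch in pandas against the exported csv – only a few of the pairs are shown, and the column names assume the csv from the Python step:

import pandas as pd

inc = pd.read_csv("incidents.csv")

# (pattern, canonical name) pairs - one table row per ride instead of one line of code each.
# Only a sample shown; the real table would have an entry per ride.
RIDE_PATTERNS = [
    (r"spider", "spiderman"),
    (r"rip(?!.*saw)", "rip ride rockit"),
    (r"everest", "expedition everest"),
]

for pattern, name in RIDE_PATTERNS:
    mask = inc["ride"].str.contains(pattern, case=False, na=False)
    inc.loc[mask, "ride"] = name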

Results

The first thing I did was to check how many people in the list had pre-existing conditions. About 14.8% of incidents reported a pre-existing condition, which mostly tells us that being healthy generally doesn’t protect you from a theme park accident: the vast majority of reports involve people with no reported condition at all.
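
(If you want to reproduce that number from the cleaned csv, it’s just the mean of a regex flag – assuming the ‘stuff’ text column survived the export:)

import pandas as pd

inc = pd.read_csv("incidents.csv")
# Same idea as the grepl("pre[-| |e]", stuff) flag in the R code above
has_condition = inc["stuff"].str.contains("pre[-| |e]", na=False)
print(round(100 * has_condition.mean(), 1))  # prints the percentage, ~14.8 on my data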

Demographics

From here the next interesting things were to look at incidents by the very few demographic variables I was able to get:

[Figure: reported incidents by gender]

This could have been interesting, but it’s pretty much as you’d expect – men and women have about the same number of incidents. This seems to play out across all the rides except a couple. Here are the three with the biggest differences:

[Figure: incidents by gender for the three rides with the biggest differences]

Looking at the middle graph, it probably strikes you what’s happening here – some rides are definitely favoured by one gender. I’m assuming the Prince Charming Carousel at Magic Kingdom is favoured by young girls, and not that there’s some witch blocking the prince from future suitors in order to maintain some curse, of course. Having said that, the other two do surprise me a bit – I didn’t really think Expedition Everest would be so heavily favoured by males. My hypothesis is that Animal Kingdom (which hosts the ride) is really not heavy on thrill rides, so the park itself is probably less aimed at males, who anecdotally prefer thrill rides. I can definitely imagine a scenario where a family splits up for an hour, with Mum and the girls going to look at the animals while Dad and the boys ride the rollercoaster with the broken Yeti (fix the Yeti!). If I’m right, Expedition Everest is not really a ‘boy ride’ like it appears – it’s just the least female-friendly ride in the park.

[Figure: reported incidents by age]

This one is a lot more interesting to me because there’s a really clear spike at ages 40–45. I really expected this one to have a smoother curve, but once again I think we’re victims of selection bias. If you think about who goes to parks, it’s still generally families with children (although my bet is that will change soon). So these 40-and-over people are most likely parents, and before 40 in the US you’re less likely to have kids of theme-park age. So rather than the spike being interesting, it becomes more interesting to wonder why there are so few kids – after all, they’re riding just as much if not more! My only conclusion is that the gap between the under-20s and the over-40s is really an expression of how resilient kids are compared to their parents.

The next interesting part of this graph to me is the spike between 60 and 65. I think after 65 you’re really much less likely to be going to theme parks at all, so this spike might really mean something. While we really don’t have enough evidence to make a call, I’d definitely be thinking about a quieter holiday location once I get to 60.

Reporting over time

One of the biggest things I noticed when looking at the original report was that the format and standards for reporting have evolved a lot over time. In the beginning reporting was a real afterthought, and very little information was provided. Most of the regex work I had to do was for the first two years of the dataset, so my suspicion is that they weren’t reporting everything back then. This seems to play out when you graph it:

[Figure: reported incidents by year]

It really looks like a few things could be happening here. The first interesting thing is that more people seem to hurt themselves from 2010 onwards, but then the count drops back to 2009 levels in 2016. This could mean that parks reacted to a bad 2015 by reinforcing their safety standards, which would be a good news story. The not-so-good news is that in 2017 the numbers seem to climb again. There are a few stories about how bad 2015 was, but with the Orlando Sentinel blocked for me I can’t easily supply them to you.

Incidents by ride

I wanted to save what I think is the best for last – the incidents by ride. As with all of these graphs, the raw counts of incidents are heavily affected by ride popularity, which I can’t really measure directly (although I’m working on it!).

[Figure: reported incidents by ride (top ten)]

I’ve only plotted the top ten rides here, but I might show the full graph in a later article. On its own this is pretty interesting to me, because it backs up a lot of folklore about Space Mountain and Harry Potter and the Forbidden Journey being particularly intense. My first reaction was that HPFJ must be horribly built or something, but the awards it has won suggest otherwise. Then I spoke to a friend from Florida who told me the ride is famous for people throwing up as it swings them around, which makes it far more likely that the high number of incidents is down to minor incidents than anything serious. Space Mountain, on the other hand, is an old dark coaster – a breed of ride notorious for knocking people about, because the darkness means you can’t brace for the turns. Without being able to properly extract the incident descriptions yet it’s difficult to tell for sure, but a glance at the raw data tells me the injuries are a bit more serious on this ride. Mission: Space also has a reputation for being intense – a lot of people seem to at least get disoriented on it, and one incident is even a 4-year-old boy who died during the experience.

Expedition Everest and the case of the Yeti

The standout ride for me is again Expedition Everest, and I really can’t explain what’s going on there. It’s not known as a particularly dangerous or intense ride, and I’m not aware of any majorly new technology on it.

[Image: the Expedition Everest Yeti. Credit: WDWnews]

My only clue is the broken Yeti. I know it’s a long shot, but the Bayesian in me says that the Yeti only working for a short time before standing motionless to this day indicates something went wrong in the design process for this ride. After all, if such a major set piece turned out to operate unexpectedly, what else is operating unexpectedly? Combine that with an unusually high number of incidents for a largely outdoor rollercoaster with no inversions or particularly tight curves, and there is a weak sign here that someone designed this thing wrong. Having said that, there is a backwards section of the ride (which has the same effect as a dark coaster), so it could be just that section that’s the problem. I know absolutely nothing about Disney’s maintenance schedules or their organisational structure, but it might be possible that Animal Kingdom doesn’t have the same number of staff assigned to structural engineering tasks as other parks. Obviously if I knew anything about their operations my hypothesis would be a lot better, but I still think this is an interesting data point.

What I learned and what I’ll do next

The first thing I’ve learned is that the Florida government needs to start making theme parks standardise their reports, and preferably submit them as publicly available csv files. If reports were entered with drop-down menus and some sort of field validation, getting the data out would be a hell of a lot easier.

Another thing I learned (though I didn’t get to graph it) is that getting on and off a ride seems to be more dangerous than actually being on the ride itself. While I only got to eyeball the incidents in this iteration of the analysis, I was surprised how many said things like ‘tripped entering vehicle, fractured ankle’ or similar. I don’t get the inside knowledge the Disney or Universal data scientists do, but I’m not surprised that entering and exiting ride vehicles, with so many humans involved, produces so many incidents. If it were me, I’d be making sure the staff at load and unload were permanent employees with excellent training, particularly to look after guests over 40.
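
(If you want to put a rough number on that eyeball impression, a keyword screen over the cleaned csv gets you most of the way. The keyword list here is a guess, not a validated classifier:)

import pandas as pd

inc = pd.read_csv("incidents.csv")
# Flag reports that mention boarding or alighting; the phrasing varies a lot,
# so this only catches the obvious ones
boarding = inc["stuff"].str.contains(r"enter|exit|board|embark|stepping|alight", case=False, na=False)
print(boarding.sum(), "of", len(inc), "reports mention getting on or off")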

That brings me to another thing – if you’re a parent the clear message here is that your kids will be fine and you should be far more worried about your own safety than theirs. They might be running around like maniacs, but just because the place is dangerous for you doesn’t mean it is for them!

Now that I’ve finally gained a new dataset, my next step will probably be to combine it with some of the other datasets I have, like theme park visitor numbers or (hopefully) ride wait times as a proxy for ride popularity. If I can line these things up it might control for the growth of the industry overall, and then I’ll be able to tell whether things really are getting more dangerous or whether it’s just a side effect of increasing visitor numbers. In addition, I’m going to try to extract a few types of incidents and do a post on which rides seem to be most deadly (which would be sad) or most sickening (which would be more fun).
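
(The normalisation step itself is easy once the attendance numbers exist. A sketch, where visitors.csv is a hypothetical file of year versus total attendance in millions, e.g. transcribed from the TEA/AECOM attendance reports:)

import pandas as pd

inc = pd.read_csv("incidents.csv")
inc["year"] = pd.to_datetime(inc["date"], format="%m/%d/%y", errors="coerce").dt.year
per_year = inc.groupby("year").size()

visitors = pd.read_csv("visitors.csv").set_index("year")["visitors_millions"]  # hypothetical file
print((per_year / visitors).rename("incidents_per_million_visitors"))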

If you have more ideas on where I can get data or what I could do better in my code I’d love to hear your suggestions in the comments.

What are people saying about amusement parks? A Twitter sentiment analysis using Python.

One of the quintessential tasks of open data analysis is sentiment analysis. A very common example is using tweets from Twitter’s streaming API. In this article I’m going to show you how to capture Twitter data live, make sense of it, and do some basic plots based on the NLTK sentiment analysis library.

What is sentiment analysis?

The result of sentiment analysis is as it sounds – it returns an estimate of whether a piece of text is generally happy, neutral or sad. The magic behind this is a Python library known as NLTK – the Natural Language Toolkit. The smart people who wrote this package took what is known about Natural Language Processing in the literature and packaged it for dummies like me to use. In short, it has a database of commonly used positive and negative words that it checks text against and does a basic vote count – positives score 1 and negatives score -1, and the sign of the total decides whether the text comes out positive or negative. You can get really smart about how exactly you build the database, but in this article I’m just going to stick with the stock library it comes with.
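
(To make the vote-count idea concrete, here’s a toy version. This is not the actual NLTK/TextBlob internals – the real libraries use much bigger lexicons and some weighting – but it’s the same principle:)

# Made-up five-word lexicons, purely for illustration
POSITIVE = {"great", "fun", "love", "amazing", "happy"}
NEGATIVE = {"bad", "awful", "hate", "boring", "sad"}

def vote_sentiment(text):
    score = 0
    for word in text.lower().split():
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(vote_sentiment("I love Space Mountain, it was amazing"))  # positive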

Asking politely for your data

Twitter is really open with their data, and it’s worth being nice in return. That means telling them who you are before you start crawling through their servers. Thankfully, they’ve made this really easy as well.

Surf over to the Twitter Apps site, sign in (or create an account if you need to, you luddite) and click on the ‘Create new app’ button. Don’t freak out – I know you’re not an app developer! We just need to do this to create an API key. Now click on the app you just created, then on the ‘Keys and Access Tokens’ tab. You’ll see four strings of letters – your consumer key, consumer secret, access token and access token secret. Copy these and store them somewhere only you can get to – offline on your local drive. If you make them public (by publishing them on GitHub, for example) you’ll have to disable them immediately and get new ones. Don’t underestimate how much a hacker with your keys can completely screw you, Twitter and everyone on it – with you taking all the blame.
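
(A safer pattern than pasting the keys straight into the script, as I do below with X’s, is to export them as environment variables and read them at runtime. The variable names here are my own invention:)

import os

# Read credentials from the environment so they never end up in version control
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_SECRET"]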

Now that the serious, scary stuff is over, we can get to streaming some data! The first thing we’ll need is a script that captures the tweets we’re interested in – in our case anything mentioning Disney, Universal or Efteling. I expect there’ll be a lot more for Disney and Universal given they have multiple parks globally, but I’m kind of interested to see how the Efteling tweets do when just smashed into the same NLTK workflow.

Here’s the Python code you’ll need to start streaming your tweets:

# I adapted all this stuff from http://adilmoujahid.com/posts/2014/07/twitter-analytics/ - check out Adil's blog if you get a chance!

#Import the necessary methods from tweepy library
import re
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
access_token = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
consumer_key = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
consumer_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"


#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)



if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    #This line filters the Twitter stream to capture data by keywords commonly used in amusement park tweets.
    stream.filter(track= [ "#Disneyland", "#universalstudios", "#universalstudiosFlorida", "#UniversalStudiosFlorida", "#universalstudioslorida", "#magickingdom", "#Epcot","#EPCOT","#epcot", "#animalkingdom", "#AnimalKingdom", "#disneyworld", "#DisneyWorld", "Disney's Hollywood Studios", "#Efteling", "#efteling", "De Efteling", "Universal Studios Japan", "#WDW", "#dubaiparksandresorts", "#harrypotterworld", "#disneyland", "#UniversalStudios", "#waltdisneyworld", "#disneylandparis", "#tokyodisneyland", "#themepark"])

If you’d prefer, you can download this from my GitHub repo here instead. To be able to use it you’ll need to install the tweepy package using:

pip install tweepy

The only other thing you have to do is enter the strings you got from Twitter in the previous step, and you’ll have it running. To save the output to a file, you can use the terminal (cmd in Windows) by running:

python theme_park_tweets.py > twitter_themeparks.txt

For a decent body of text to analyse I ran this for about 24 hours. You’ll see how much I got back for that time and can make your own judgment. When you’re done hit Ctrl-C to kill the script, then open up the file and see what you’ve got.

Yaaaay! Garble!

So you’re probably pretty excited by now – we’ve streamed data live and captured it! You’ve probably been dreaming for the last 24 hours about all the cool stuff you’re going to do with it. Then you get this:

{"created_at":"Sun May 07 17:01:41 +0000 2017","id":861264785677189
120,"id_str":"861264785677189120","text":"RT @CCC_DisneyUni: I have
n't been to #PixieHollow in awhile! Hello, #TinkerBell! #Disney #Di
sneylandResort #DLR #Disneyland\u2026 ","source":"\u003ca href=\"ht
tps:\/\/disneyduder.com\" rel=\"nofollow\"\u003eDisneyDuder\u003c\/
a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_t
o_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user
_id_str":null,"in_reply_to_screen_name":null,"user":{"id":467539697
0,"id_str":"4675396970","name":"Disney Dude","screen_name":"DisneyDu
der","location":"Disneyland, CA","url":null,"description":null,"pro
tected":false,"verified":false,"followers_count":1237,"friends_coun
t":18,"listed_count":479,"favourites_count":37104,"statuses_count":
37439,"created_at":"Wed Dec 30 00:41:42 +0000 2015","utc_offset":nu
ll,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_e
...

So, not quite garble maybe, but still ‘not a chance’ territory. What we need is something that can make sense of all of this, cut out the junk, and arrange it how we need it for sentiment analysis.

To do this we’re going to employ a second Python script that you can find here. We use a bunch of other Python packages that you might need to install with pip – pandas, matplotlib and TextBlob (which contains the NLTK libraries I mentioned before); json comes with Python. If you don’t want to go to GitHub (luddite), the code you’ll need is here:

import json
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
import re

# These functions come from https://github.com/adilmoujahid/Twitter_Analytics/blob/master/analyze_tweets.py and http://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python//

def extract_link(text):
    """
    This function pulls out the first link in the tweet so we can keep it
    in its own column (clean_tweet strips links from the text itself)
    """
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''

def word_in_text(word, text):
    """
    Use regex to figure out which park or ride they're talking about.
    I might use this in future in combination with my wikipedia scraping script.
    """
    word = word.lower()
    text = text.lower()
    match = re.search(word, text, re.I)
    if match:
        return True
    return False

def clean_tweet(tweet):
    '''
    Utility function to clean tweet text by removing links and special
    characters using simple regex statements.
    '''
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def get_tweet_sentiment(tweet):
    '''
    Utility function to classify sentiment of passed tweet
    using textblob's sentiment method
    '''
    # create TextBlob object of passed tweet text
    analysis = TextBlob(clean_tweet(tweet))
    # set sentiment
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

# Load up the file generated from the Twitter stream capture.
# I've assumed it's loaded in a folder called data which I won't upload because git.
tweets_data_path = '../data/twitter_themeparks.txt'

tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue
# Check you've created a list that actually has a length. Huzzah!
print(len(tweets_data))

# Turn the tweets_data list into a Pandas DataFrame with a wide section of True/False for which park they talk about
# (Adapted from https://github.com/adilmoujahid/Twitter_Analytics/blob/master/analyze_tweets.py)
tweets = pd.DataFrame()
tweets['user_name'] = [tweet['user']['name'] if tweet['user'] is not None else None for tweet in tweets_data]
tweets['followers'] = [tweet['user']['followers_count'] if tweet['user'] is not None else None for tweet in tweets_data]
tweets['text'] = [tweet['text'] for tweet in tweets_data]
tweets['retweets'] = [tweet['retweet_count'] for tweet in tweets_data]
tweets['disney'] = tweets['text'].apply(lambda tweet: word_in_text(r'(disney|magickingdom|epcot|WDW|animalkingdom|hollywood)', tweet))
tweets['universal'] = tweets['text'].apply(lambda tweet: word_in_text(r'(universal|potter)', tweet))
tweets['efteling'] = tweets['text'].apply(lambda tweet: word_in_text('efteling', tweet))
tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
tweets['sentiment'] = tweets['text'].apply(lambda tweet: get_tweet_sentiment(tweet))

# I want to add in a column called 'park' that lists which park is being talked about, with 'unknown' as a fallback
# I'm 100% sure there's a better way to do this... (see the note after the script)
park = []
for index, tweet in tweets.iterrows():
    if tweet['disney']:
        park.append('disney')
    elif tweet['universal']:
        park.append('universal')
    elif tweet['efteling']:
        park.append('efteling')
    else:
        park.append('unknown')

tweets['park'] = park

# Create a dataset that will be used in a graph of tweet count by park
parks = ['disney', 'universal', 'efteling']
tweets_by_park = [tweets['disney'].value_counts()[True], tweets['universal'].value_counts()[True], tweets['efteling'].value_counts()[True]]
x_pos = list(range(len(parks)))
width = 0.8
fig, ax = plt.subplots()
plt.bar(x_pos, tweets_by_park, width, alpha=1, color='g')

# Set axis labels and ticks
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Tweet Frequency: disney vs. universal vs. efteling', fontsize=10, fontweight='bold')
ax.set_xticks(x_pos)  # bars are centred on x_pos in current matplotlib
ax.set_xticklabels(parks)
# You need to call this for the graph to actually appear.
plt.show()

# Create a graph of the proportion of positive, negative and neutral tweets for each park
# I have to do two groupby's here because I want proportion within each park, not global proportions.
sent_by_park = tweets.groupby(['park', 'sentiment']).size().groupby(level = 0).transform(lambda x: x/x.sum()).unstack()
sent_by_park.plot(kind = 'bar' )
plt.title('Tweet Sentiment proportions by park')
plt.show()
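
About that ‘I’m 100% sure there’s a better way’ comment in the middle of the script: there is – numpy’s select does the whole cascade in one vectorised call. A sketch, assuming the same tweets DataFrame:

import numpy as np

# Conditions are checked in order, so disney wins ties, exactly like the loop above
tweets["park"] = np.select(
    [tweets["disney"], tweets["universal"], tweets["efteling"]],
    ["disney", "universal", "efteling"],
    default="unknown")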

The Results

If you run this in your terminal, it spits out how many tweets you recorded overall, then gives these two graphs:

[Figures: tweet frequency by park; tweet sentiment proportions by park]

So you can see from the first graph that, out of the tweets I could classify with my dodgy regex skills, Disney was by far the most talked about, with Universal a long way behind. This is possibly to do with the genuine popularity of the parks and the enthusiasm of their fans, but it’s probably more to do with the variety of hashtags and keywords people use for Universal compared to Disney. In retrospect I should have added a lot more of the Universal brands as keywords – things like Marvel or NBC. The Efteling keywords didn’t really pick up much at all, which isn’t surprising – most of those tweets would be in Dutch and I really don’t know what keywords they use to mark them. I’m not even sure how many Dutch people use Twitter!

The second graph shows something relatively more interesting – Disney parks seem to come out on top in the proportion of positive tweets as well. This is somewhat surprising – after all, Universal and Efteling should elicit similar levels of positive sentiment – but I really don’t trust these results at this point. For one, there’s a good number of tweets I wasn’t able to classify despite filtering on those terms in the initial script. This is probably down to my regex skills, but I’m happy that I’ve proved the point and done something useful in this article. Second, there are far too many neutral tweets in the set, and while I know most tweets are purely informative (“Hey, an event happened!”), this is still too high for me not to be suspicious. When I dig into the tweets themselves I can find ones that are distinctly negative (“Two hours of park time wasted…”) that get classed as neutral. It seems the stock NLTK library might not be all that was promised.

Stuff I’ll do next time

There are a few things I could do here to improve my analysis. First, I need to work out what went wrong with my filtering and sorting terms such that I ended up with so many unclassified tweets. There should be none, and the obvious fix is to have both scripts read from the same keyword list (see the sketch below).
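
(Here’s roughly what I mean – a tiny module both scripts could import, so the stream filter and the classification regexes are built from one list and can’t drift apart. The module name and groupings are my own invention:)

# theme_park_keywords.py (hypothetical module imported by both scripts)
DISNEY = ["disney", "magickingdom", "epcot", "wdw", "animalkingdom", "hollywood"]
UNIVERSAL = ["universal", "potter"]
EFTELING = ["efteling"]

# For stream.filter(track=TRACK) in the capture script
TRACK = ["#" + word for word in DISNEY + UNIVERSAL + EFTELING]

# For word_in_text in the analysis script, e.g. word_in_text(DISNEY_PATTERN, tweet)
DISNEY_PATTERN = "(" + "|".join(DISNEY) + ")"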

Second, I should start digging into the language libraries in Python and train my own model on collected data. This is basically linguistic machine learning, but it requires going through and rating the tweets myself – not really something I’m going to do by hand. I need to figure out a way to label the data reliably, then build my own libraries to learn from.

Finally, all this work could be presented a lot better in an interactive dashboard running off live data. I’ve had some experience with RShiny, but I don’t really want to switch tools at this point as it would mean a massive slowdown in processing. Ideally I’d work out a JavaScript solution that I can post on here.

Let me know how you go and what your results are – I’d love to see what you apply this code to. A lot of credit goes to Adil Moujahid and Nikhil Kumar, on whose code a lot of this is based. Check out their profiles on GitHub when you get a chance.

Thanks for reading, see you next time 🙂