
I've been pretty quiet on this blog for the last few weeks because, as I mentioned a few times, I was hitting the limit of what I could do with the data I could collect manually. Manual data collection has been one of my most hated tasks since my time as a researcher in the Social Sciences. Back then we had to encode thousands of surveys by hand, but in a scenario where the outcome was within a set range of parameters (their answers had to add up to 100, for example). They insisted at the time on manually checking the input, and (groan) colour coding the spreadsheets by hand when it looked like there was a problem. It was the first time I had used conditional formatting in Excel to automate such an arduous task, and I remember everyone's suspicion that I had finished so quickly.
Nowadays I work in a tech company dealing with the proverbial 'Big Data' that everyone goes on about. In these scenarios manual coding or checking of your data is not just arduous, it's absolutely impossible, so automating the task is a necessity.
Grey Data
A recent article I read interviewing someone from Gartner stated that more than 99% of the information on the Internet is 'grey data'. By this they mean unstructured, unformatted data with themes and meanings hidden beneath layers of language, aesthetics, semiotics and code. Say I want to find out what people think about Universal theme parks in the universe of WordPress blogs. It's pretty rare that the site itself is tagged with any metadata telling a machine 'in this blog I'm talking about theme parks and how I feel about Universal'. However, if I can use a script that reads all the blogs that contain the words 'theme park' and 'Universal', I'd be somewhere closer to finding out how people feel about Universal theme parks generally. On top of this, all these blogs probably have memes about Universal attractions and IP, use specific fonts and layouts, and include images of the Universal offerings. If I were able to read these and classify them into something shorter and more standardised, I'd be able to learn a lot more about what people are saying.
From little things, big things grow
As someone with more of an analytical background than a data engineering one, I’ve always been afraid of building my own datasets. In statistics we keep telling each other that we’re specialists, but the reality of the Data Science world is that specialists are just not needed yet – if you’re going to make your bones in the industry you’re going to have to get generalist skills including querying MySQL and Hadoop, and using Spark and Python. As such, the project I’ve undertaken is to start scraping Wikipedia (to begin with) and see if I can build a bit of a database of theme park knowledge that I can query, or analyse in R.
Scraping isn’t the hard part
So I started looking around online and found a few resources on scraping Wikipedia, but they were either outdated or simply didn't seem to work. There was also the option of DBpedia, which uses Linked Data standards to try and build a sort of dynamic relational database online by scraping the less standardised Wikipedia pages. This option sounded really useful, but it looks like they're still very much trying to flesh out WikiDB and it's unlikely they'll get to theme park lists any time soon. So it looks like I'm stuck with StackOverflow threads on what to do.
The first resources I found told me to use BeautifulSoup, which I had never heard of. In short, the way I use it is as a Python module that parses the HTML returned by an http request so I can pull out the tables. It reads the site's markup, uses the standard tags to identify where a table starts and finishes, and then assigns that table to an object in Python that you can do things to.
from bs4 import BeautifulSoup
import re
import urllib2
import csv

# Define the page you want to scrape and set up BeautifulSoup to do its magic
wiki = "http://en.wikipedia.org/wiki/List_of_Disney_theme_park_attractions"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "lxml")

attraction = []
full_details = []
But then all you have is a bunch of jargon that looks like this:
<td><a class="mw-redirect" href="/wiki/WEDway_people_mover" title="WEDway people mover">WEDway people mover</a> (aka Tomorrowland Transit Authority)</td>, <td bgcolor="#FF8080">Tomorrowland</td>, <td></td>, <td bgcolor="#80FF80">Tomorrowland</td>…
Which I can see has the right information in it, but really isn’t what I’m after for analysis. I need to be able to loop through all these, find the rows and figure out where a cell starts and finishes. Thankfully, BeautifulSoup recognises how these are flagged in our jargon string, so I can loop over the rows and cells in the table. Once I can do this, I’ll be able to make some sort of data frame that stores all this information in a concise and easily analysable format.
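To give a rough idea of what that looping looks like, here is a minimal sketch using the soup object from the snippet above (the table, row and cell names are just mine for illustration; this only prints each cell's text so you can see the structure, it isn't the final code):

# A rough sketch only: walk the first wikitable row by row and print each cell's text
table = soup.find("table", {"class": "wikitable"})
for row in table.findAll("tr"):
    for cell in row.findAll("td"):
        print(cell.get_text())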
Learning to read what you’ve got
If you're planning to scrape a Wikipedia table, you're going to have to spend a reasonable amount of time staring at the source of the page you want to scrape (I'm sure the time spent here reduces greatly with a little bit of coding skill) to see how the information you want has been encoded.
In my case, each column of the table represents one of Disney's theme parks and each row represents a ride. The first column gives the name of the ride, and when that ride exists in a given park, the date and region of the ride is written in the cell where the two meet. Very easy for a human to read, but difficult to get into the sort of 'long' format (with individual columns for park, ride and features) that R and Python like to use.
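As a toy illustration of what I mean by 'long' (the ride and park names here are made up, and the real script below gets there in a more roundabout way):

# One 'wide' row: the ride name plus one cell per park
wide_row = {"ride": "Example Coaster",
            "Park A": "1985, Fantasyland",   # ride exists in this park
            "Park B": ""}                    # blank cell = ride not in this park

# The 'long' version: one record per (park, ride) combination, skipping blanks
long_rows = [{"park": park, "ride": wide_row["ride"], "details": details}
             for park, details in wide_row.items()
             if park != "ride" and details]
# long_rows -> [{'park': 'Park A', 'ride': 'Example Coaster', 'details': '1985, Fantasyland'}]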
The first thing I want to do is get the names of the parks that each ride is attached to. To do this, I define a function that looks for cells with the specific formatting the park names are listed in, and returns all the park names in a list that I'll use later:
def get_park_names(table):
    '''
    Get all the names of the parks in the table - they all have a
    unique style so I use that to identify them.
    '''
    park = []
    for row in table.findAll("tr"):
        for cell in row:
            a = str(cell)
            if 'style="width:7.14%"' in a:
                m = re.search('(?<=title=")(.*)(?=">)', a)
                park.append(m.group(0))
    return park
I also want to be able to tell if the ride is still open or not, which is encoded in my table with background colour:
def get_open_status(cell):
    '''
    Find out whether the ride is still open or not based on the
    background colour of the cell.
    '''
    statuses = ["extinct", "planned", "operating"]
    status = ""
    if 'FF8080' in cell:
        status = statuses[0]
    elif 'FFFF80' in cell:
        status = statuses[1]
    elif '80FF80' in cell:
        status = statuses[2]
    elif 'FFA500' in cell:
        status = statuses[0]
    return status
Finally, I need to tie all this together, so I loop through the rows of the table and look at the cells that aren't empty. The loop pulls the name of the ride out of the cell's string using regex, puts it into a dict with park, ride name and status, and finally collects all the dicts into a list:
# We can do this for one table or many - you can just uncomment this line
# and unindent the outer for loop
# table = soup.find("table", { "class" : "wikitable"} )
tables = soup.findAll("table", { "class" : "wikitable"})

for table in tables:
    # Get a list of all the names of the parks in this table
    park = get_park_names(table)
    for row in table.findAll("tr"):
        cells = row.findAll("td")  # For each "tr", assign each "td" to a variable.
        if len(cells) > 11:  # I just counted the columns on the page to get this
            a = str(cells[0])  # Making it a string allows regex
            if "href=" in a:  # Do this if the row has a link in it
                b = re.search('(?<=title=")(.*)(?=")', a)
                # If there is no title in the row (like when the ride has no link),
                # the regex will return None
                if b is not None:
                    # Some of the rows are subheadings, but they all contain 'List of' in the string
                    if "List of" not in b.group(0):
                        attraction.append(b.group(0))
                        a = b.group(0)
                else:
                    # There is a lack of standardisation in the table regarding quotations
                    d = re.search("(?<=title=')(.*)(?=')", a)
                    if "List of" not in d.group(0):
                        attraction.append(d.group(0))
                        a = d.group(0)
            else:
                # The cells with no links just have the name
                e = re.search('(?<=>)(.*)(?=<)', a)
                attraction.append(e.group(0))
                a = e.group(0)
            x = 0  # Make a counter
            for c in cells[1:]:
                if len(c) > 0:  # Only look at the cells in the row that aren't blank
                    c = str(c)
                    s = get_open_status(c)  # Use the function I defined above
                    if "List of" not in c:
                        qqq = {"park": park[x], "ride": a, "status": s}  # Throw it all into a dict
                        full_details.append(qqq)  # A list of dicts seems like a useful format
                x = x + 1
So, not really knowing what I want to do with all this new data yet, my final move in the script is to write the whole thing to a csv file:
keys = full_details[0].keys()
with open('parkrides.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(full_details)
And there you have it! A reasonably clean csv that I can read into R or whatever else and start doing some analyses.
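A quick sanity check I like to do is read the file straight back in; just a sketch, assuming the parkrides.csv written above:

import csv

with open('parkrides.csv') as f:
    rows = list(csv.DictReader(f))

print(len(rows))   # how many (park, ride, status) records came out
print(rows[0])     # peek at the first one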
Things I learned
The first thing I learned from this exercise is not to feel too dumb when embarking on a new automation task – it looks like there is a tonne of resources available, but it will take all your critical literacy skills to figure out which ones actually work. Or you can just copy and paste their code and discover that it doesn't. This is a really frustrating experience for someone starting out, especially when you're led to believe it's easy. My advice here is to keep looking until you find something that works – it's usually not the highest hit on Google, but it's there.
The second thing I learned is that regex is a harsh mistress. Even once you've managed to figure out how the whole thing works, you have to do a lot of squinting to work out what you're going to tell it to do. Here I don't think there's much more one can do except practice more.
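For what it's worth, the pattern I lean on most in this script is a lookbehind/lookahead pair; here it is pulled out on a made-up cell so you can squint at it in isolation:

import re

# A made-up cell, similar to the jargon string earlier in the post
cell = '<td><a href="/wiki/Example_Ride" title="Example Ride">Example Ride</a></td>'

# Grab everything between title=" and the closing ">
m = re.search('(?<=title=")(.*)(?=">)', cell)
print(m.group(0))  # Example Ride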
Future stuff
There is a whole bunch of things I'm planning to do now that I can do this. The first will be to try and build some set visualisations to look at which parks are most similar to each other. Disney port their successful rides from one park to the next, so it's really interesting to see what the overlap is, as it shows what they might be expecting their audiences in that area to like. Rides that feature in more parks could be seen as more universally popular, while rides that only ever go to one park are probably more popular with a local audience. In particular, I'd expect Disneyland Paris to have a smaller overlap with other Disney parks, based on my previous clustering of theme park audiences, which suggested that Disneyland Paris caters to a more local audience.
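As a first pass at that overlap idea, something like the following sketch should work on the parkrides.csv produced above (the Jaccard similarity is just one reasonable choice of overlap measure, and rides_by_park is my own name for the intermediate structure):

import csv
from itertools import combinations

# Build a set of rides for each park from the scraped csv
rides_by_park = {}
with open('parkrides.csv') as f:
    for row in csv.DictReader(f):
        rides_by_park.setdefault(row['park'], set()).add(row['ride'])

# Jaccard similarity between every pair of parks - higher means more shared rides
for park_a, park_b in combinations(sorted(rides_by_park), 2):
    shared = rides_by_park[park_a] & rides_by_park[park_b]
    total = rides_by_park[park_a] | rides_by_park[park_b]
    print("%s vs %s: %.2f" % (park_a, park_b, len(shared) / float(len(total))))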