We live in a world where the technology exists for a government, or any other technically sophisticated group, to monitor and analyze a substantial fraction of the communications of the world's population, track their movements throughout the day, and keep tabs on their financial transactions.
And that world is called World of Warcraft.
While the NSA has been capturing and analyzing international phone calls and electronic communications, I've spent much of the last year, with far less press coverage, collecting and helping to analyze data scraped from World of Warcraft as part of the largest quantitative study of virtual worlds to date.
We've run into a series of problems trying to scrape information from five of WoW's servers -- some expected, some not -- and developed some rules of thumb in the process.
A Brief History of Peeping
World of Warcraft is not PlayOn’s first entry into listening in on a virtual world. Early in 2004, Nic Ducheneaut and Bob Moore placed bots in the cantinas and starports of one server for Star Wars Galaxies, and collected the chat logs from those environments. My own descent into PlayOn was to help analyze the gigabyte of chat logs that had been collected.
Coming to WoW, we realized that Blizzard had opened up the game's programming interface so that most of a player's in-game actions can be automated, while still preventing game-playing bots. The /who command, combined with enough patience and a small matter of programming, enables a bot that can take a census of the entire logged-in world. Worlds, in fact, one faction at a time.
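To make the trick concrete, here is a minimal sketch of the census logic in Python (the real bot is a Lua addon running inside the WoW client). send_who() is a hypothetical helper that issues one /who query for a level range and returns the matches, truncated at the server's per-query cap, which we take here to be 49:

```python
RESULT_CAP = 49  # /who answers with only a limited number of matches per query

def census(send_who, min_level=1, max_level=60):
    """Enumerate logged-in characters by splitting the level range
    until each query's results fit under the cap."""
    results = send_who(min_level, max_level)
    if len(results) < RESULT_CAP or min_level == max_level:
        # Everything in this range fit in one reply. (If a single level
        # still hits the cap, you must subdivide further, e.g. by class.)
        return results
    mid = (min_level + max_level) // 2   # otherwise, split and recurse
    return (census(send_who, min_level, mid) +
            census(send_who, mid + 1, max_level))
```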
Starting last April with a census bot, we’ve collected 190 million sightings of the form
```
Magtheridon,01/01/06 21:16:51,Deputura,59,Un,st,y,Dire Maul,
Magtheridon,01/01/06 21:16:51,Onlysurface,28,Ta,er,y,Warsong Gulch,TerrifyingPulsar
```
(That is, a level 59 Undead Priest and a level 28 Tauren Hunter.) Most of the analysis on the PlayOn blog, and the data in Nic's earlier post, is derived from the simple kinds of information shown above. Since then we've added scrapers for gender, zone chat, guild rank, and PvP rank (while failing repeatedly to add a scraper for economic data), chugging away on 6 dilapidated PCs spanning 5.5 WoW realms. We've learned a lot about scraping virtual worlds over this period.
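For the curious, here is a hypothetical reader for lines in this format. The realm, timestamp, name, and level fields are unambiguous; the remaining labels (race and class codes, a one-letter flag, the zone, and what appears to be a guild name) are our own guesses, for illustration only:

```python
import csv
from datetime import datetime

# Field names past 'level' are illustrative labels, not an official schema.
FIELDS = ['realm', 'timestamp', 'name', 'level',
          'race', 'class_', 'flag', 'zone', 'guild']

def read_sightings(path):
    """Yield one dict per sighting line, with typed level and timestamp."""
    with open(path, newline='') as f:
        for row in csv.reader(f):
            rec = dict(zip(FIELDS, row))
            rec['level'] = int(rec['level'])
            rec['timestamp'] = datetime.strptime(rec['timestamp'],
                                                 '%m/%d/%y %H:%M:%S')
            yield rec
```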
Things Take Longer and Cost More
Virtual worlds conspire against the would-be analyst, so that the "small matter of programming" mentioned earlier became a burgeoning monstrosity. The SMOP needed to test whether we can extract some type of information from the virtual world is usually not enough to run 24/7 in the face of changing network or VW conditions. A lot of our code isn't concerned with scraping data so much as with determining whether the server is up, whether a request is taking too long, or whether the addon is wedged.
Our scripts are very bad at dealing with situations a human would handle without thinking. Writing software that stays resilient through connectivity and game issues is hard. Good grief, we have software fragments which try to dynamically optimize how long to wait between receiving one set of /who results and sending the next request.
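To give a flavor of those fragments, here is a minimal sketch of one way to adapt that pacing, assuming a send_who() that raises TimeoutError when the server is slow. The constants are illustrative, not our production values:

```python
import time

MIN_WAIT, MAX_WAIT = 2.0, 60.0   # bounds on the inter-request delay

def run_queries(queries, send_who, handle):
    """Issue queries with a delay that creeps down on success
    and backs off sharply on timeouts."""
    wait = 10.0
    for query in queries:
        try:
            handle(send_who(query))
            wait = max(MIN_WAIT, wait - 0.5)   # success: speed up a little
        except TimeoutError:
            wait = min(MAX_WAIT, wait * 2.0)   # timeout: back off hard
        time.sleep(wait)
```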
We deal with a semi-supported part of the game. The best documentation for the WoW API has been created by the modding community, but it is still spotty in places and wrong in others. Someone writing scraping code can easily become the expert -- outside Blizzard, anyway -- on some arcane portion of the game.
By far the most difficult problem we've faced is that the game itself is not static; it presents a moving target. Even our interest in, and understanding of, certain data changes over time. Patch day is a mad scramble to get the scraping software working again. Maybe the login code changed. Or, as in the most recent release, the information on whether a character was grouped disappeared from the API, breaking our software in the process.
Several of these problems leak over from the data scraping into the analysis. It's simply a fact of life that we have holes in our data: any analysis we set up has to cope with a day or a week missing here, a few hours missing there. It's surprising how much this complicates matters. And however it came about, the analysis parsers must deal with slightly different data from different eras.
We can only analyze what is available to the players, so we often have to settle for a reasonable proxy. For example, we would like to know when people group up to quest together. The best estimate we've managed is to spot guildmates in the same zone who are each flagged as being in some group. We know it's not accurate, but we make do. More troublesome are the cases where we have no good proxy: we can develop predictors of character abandonment, but we have no way to know whether the player switched to another character, or to another game.
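For what it's worth, here is a sketch of that proxy over a single census snapshot, guessing (and it is a guess) that the one-letter flag in our records marks "in a group"; sightings are parsed records like the ones shown earlier:

```python
from collections import defaultdict

def grouped_guildmates(sightings):
    """Return (guild, zone) pairs where two or more guildmates were
    sighted together while flagged as grouped -- our questing proxy."""
    together = defaultdict(list)
    for s in sightings:
        if s['guild'] and s['flag'] == 'y':   # 'y' assumed to mean grouped
            together[(s['guild'], s['zone'])].append(s['name'])
    return {key: names for key, names in together.items() if len(names) > 1}
```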
Tips for the Impenitent
If my confession has not dissuaded you from scraping and analyzing intelligence from virtual worlds, here are some suggestions to help:
- Expect to spend much of the software development time handling exceptional cases: downed servers, lag, and time-outs.
- For something that runs repeatedly, try very hard to have another process detect when the scraper is wedged, so it can be killed and restarted (see the watchdog sketch after this list).
- Be patient. Set your timeouts, as much as possible, for the worst times of day, even if that leaves you running slower than necessary the rest of the time.
- As a new scraper comes online, start analyzing its data early, even though you may need weeks of data before the results are truly meaningful. Beginning the analysis early exposes problems in the data collection while they can still be corrected.
- Files of comma-separated values (.csv) are easy to write as plain text and easy to read into spreadsheet programs. Including column headings or scraper version info at the top lets you change the scrapers later with minimal changes to the analysis parsers (see the version-header sketch after this list).
- Convince Nick Yee that he wants to help analyze your data. This is important.
- Generate log files to record significant events in the scrapers, such as logging on or off, dealing with time-outs, reaching major milestones, etc. These can be invaluable for tracking long-term issues.
- If your log files carry enough information, consider writing a monitor script that emails you when your scrapers have been offline for too long (a sketch follows this list).
- Iterate between your game intuition and your analysis of the data to determine what to analyze next, what changes to make to the scraper, and what information you can proxy for information not yet available.
- Have fun. It’s a brave new world.
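A few of those tips deserve sketches. First, the watchdog: a minimal sketch of the external babysitter we mean, assuming the scraper touches a heartbeat file whenever it makes progress (the file name and timings are placeholders, and the heartbeat file is assumed to exist):

```python
import os
import subprocess
import time

HEARTBEAT = 'scraper.heartbeat'
STALE_AFTER = 15 * 60   # seconds of silence before we call it wedged

def babysit(start_cmd):
    """Run the scraper, killing and restarting it whenever the
    heartbeat file goes stale."""
    proc = subprocess.Popen(start_cmd)
    while True:
        time.sleep(60)
        if time.time() - os.path.getmtime(HEARTBEAT) > STALE_AFTER:
            proc.kill()                          # scraper looks wedged
            proc.wait()
            proc = subprocess.Popen(start_cmd)   # start over fresh
```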
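Second, the version header: a sketch of the convention, where the '#version' line is our own invention rather than any CSV standard:

```python
import csv

SCRAPER_VERSION = 7   # illustrative
COLUMNS = ['realm', 'timestamp', 'name', 'level', 'race', 'class',
           'flag', 'zone', 'guild']

def open_sighting_log(path):
    """Open a new output file tagged with the scraper version and
    column headings, so parsers can tell eras of data apart."""
    f = open(path, 'w', newline='')
    f.write(f'#version,{SCRAPER_VERSION}\n')
    csv.writer(f).writerow(COLUMNS)
    return f
```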
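And third, the offline alert: a sketch assuming a local SMTP server, with the log path, addresses, and the one-hour threshold all placeholders:

```python
import os
import smtplib
import time
from email.message import EmailMessage

LOG = 'scraper.log'
QUIET_LIMIT = 60 * 60   # an hour without log activity means trouble

def check_and_alert():
    """Mail us if the scraper log has gone quiet for too long."""
    if time.time() - os.path.getmtime(LOG) > QUIET_LIMIT:
        msg = EmailMessage()
        msg['Subject'] = 'WoW scraper offline?'
        msg['From'] = 'scraper@example.com'
        msg['To'] = 'us@example.com'
        msg.set_content(f'No activity in {LOG} for over an hour.')
        with smtplib.SMTP('localhost') as smtp:
            smtp.send_message(msg)
```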