
Project Onix: System Architecture

July 20, 2016

So in my last post I announced plans to rewrite my Smogon usage stats scripts as “Project Onix,” a robust, performant and extensible platform for performing Pokemon analyses. Today I’d like to go into a little more detail about what “Project Onix” is actually going to look like.

 

[Figure: Architecture diagram]

At its core, Onix’s goal is the same as for the original Smogon Usage Stats project: take logs from Pokemon Showdown (or Pokemon Online or NetBattle or whatever simulator we’re using on a given day) and process them into monthly reports. The old Smogon Usage Stats project did this in two stages:

  1. Read in logs, pull out relevant information, calculate derived quantities (stalliness and team-tags), structure it and dump it into a collection of processed files
  2. Read in those processed files and count stuff up to produce the monthly usage stats (including moveset statistics, metagame analyses and checks/counters analyses).

This wasn’t a bad design, but working with files meant that there was a tradeoff between performance and flexibility—anything that got pulled from the logs, or any quantities that were derived would slow down the stat-counting. Consequently, I ended up only pulling information that was going to find its way into a report. That meant, for example, no record of actual move usage or when a Pokemon mega-evolved. And doing pre-processing the way I did meant that if I wanted to change the way something was calculated (say, change the threshold for what constitutes a “baton pass” team), the only way to generate an updated report would be to go back to the logs and start from scratch.

Moving forward, my plan is to segment the workflow more cleanly and, in doing so, add significant flexibility while (hopefully) improving performance. Onix will consist of three subsystems:

1. Collection

The role of this portion is solely to read in data, at this point from simulator logs, but one could imagine alternative data sources (other sims, battle videos…). The aim is to perform little-to-no analysis. There will be some data cleansing here (combining appearance-only formes and equivalent nature/IV/EV spreads), but primarily the goal is just to transform the data into structures that will be easier to process later on.
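As a rough illustration of the forme-combining step, here is a minimal Python sketch. The mapping table and the function name `normalize_species` are hypothetical, and the three entries are only examples of appearance-only formes; a real table would be much longer and would live wherever the collection code keeps its reference data.

```python
# Hypothetical lookup from appearance-only formes to one canonical name.
# Only a few illustrative entries are listed here.
COSMETIC_FORMES = {
    "Gastrodon-East": "Gastrodon",
    "Shellos-East": "Shellos",
    "Vivillon-Fancy": "Vivillon",
}


def normalize_species(species: str) -> str:
    """Collapse appearance-only formes; leave every other name untouched."""
    return COSMETIC_FORMES.get(species, species)


if __name__ == "__main__":
    print(normalize_species("Gastrodon-East"))
    print(normalize_species("Landorus-Therian"))
```

Formes that actually change how a Pokemon battles (Landorus-Therian, for instance) are deliberately absent from the table, so they pass through unchanged.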

The collection system will output to a set of “databases,” though I use that term loosely. It could be SQLite tables, it could be MongoDB collections, or I could still be doing file I/O, just in a more segmented way. The goal is to keep the data segregated, so individual analyses can be performed by accessing only the data they need.
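To make the idea of segregated data a bit more concrete, here is one way it could look if the SQLite option were chosen. This is only a sketch: every table and column name below is an assumption made for illustration, not a settled design.

```python
import sqlite3

# Hypothetical schema. Each kind of data gets its own table so that an
# analysis only has to read the tables it actually needs.
SCHEMA = """
CREATE TABLE battles (
    battle_id INTEGER PRIMARY KEY,
    metagame  TEXT NOT NULL,
    month     TEXT NOT NULL
);
CREATE TABLE team_members (
    battle_id INTEGER NOT NULL REFERENCES battles(battle_id),
    side      INTEGER NOT NULL,  -- 1 or 2
    species   TEXT NOT NULL,
    ability   TEXT,
    item      TEXT
);
CREATE TABLE moves (
    battle_id INTEGER NOT NULL REFERENCES battles(battle_id),
    side      INTEGER NOT NULL,
    species   TEXT NOT NULL,
    move      TEXT NOT NULL
);
"""


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and set up the empty tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With a layout along these lines, a usage ranking would only need `team_members`, while a moveset analysis would also read `moves`.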

The collection system has another focus: completeness. Instead of just pulling the information from the logs that I know I’ll need later on, the goal here is to pull all the battle-relevant data (read: not nicknames, not cosmetic choices, and not chat logs) to structure and process, whether I think I’ll need it or not. Basically, one should, in principle, be able to re-create a Pokemon Showdown battle log (or replay) from the data in the databases (up to nicknames and chat logs).

Why am I doing this? Why take up the CPU cycles and the disk space generating data that I don’t have any plans to use? If I come up with an analysis later on, why not just worry about it then and generate it from the logs? The answer is this: hopefully, this system will be not just for me. Every so often, I get a request from a university researcher or a hobbyist programmer wanting to do some sort of analysis. For the most part, the reports I generate are not sufficient (nor are the intermediate processed files). So right now, if I want to support their projects, I have to give them the PS logs, which are not optimally structured, and which are not anonymized, meaning I have to worry about privacy concerns each and every time I share the data. With this new system, I could give researchers controlled access to the database, letting them access only what they need while ensuring anonymity by design.
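One possible way to get anonymity by design is to replace player names with keyed hashes at collection time, so the raw names never reach the database. The sketch below assumes that approach; the function name `anonymize_player`, the name normalization and the 16-character truncation are all illustrative choices rather than decisions I have made.

```python
import hashlib
import hmac


def anonymize_player(name: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a player name using HMAC-SHA256.

    The same name and key always give the same pseudonym, so a player's
    battles can still be grouped together without storing the name itself.
    """
    normalized = name.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)
```

The key would have to stay private: anyone holding it could hash a known username and look that player up, so strictly speaking this is pseudonymization, and whether it is strong enough depends on who gets access to what.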

2. Enrichment

The collection system will do a little bit of data cleansing (mainly in the name of anonymizing and normalizing), the idea being that the steps the collection system performs are steps that will need to be done regardless of use-case. The enrichment subsystem, on the other hand, is geared specifically towards supporting reports. This is where stuff like stalliness and team tags will get computed. It’s also where megas would be combined with base formes, if we went back to counting the old way, and where “matchups” will get parsed from the structured battle logs. Note that the original databases created by the Collection system are left alone: any new information will go in a new table (or collection or file…).
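Here is a small sketch of what "leave the originals alone" could look like in practice, again assuming SQLite. The `movesets` and `team_tags` tables, the tag name and the threshold of three Baton Pass users are all hypothetical; the point is only that the derived table can be dropped and rebuilt with a different threshold without touching the collected data or going back to the logs.

```python
import sqlite3

# Assumption: a team is tagged "batonpass" when at least this many of its
# Pokemon carry Baton Pass. The value is illustrative, not the real cutoff.
BATON_PASS_THRESHOLD = 3


def enrich_team_tags(conn: sqlite3.Connection,
                     threshold: int = BATON_PASS_THRESHOLD) -> None:
    """Rebuild the derived team_tags table without modifying movesets."""
    conn.execute("DROP TABLE IF EXISTS team_tags")
    conn.execute(
        "CREATE TABLE team_tags (battle_id INTEGER, side INTEGER, tag TEXT)"
    )
    conn.execute(
        """
        INSERT INTO team_tags (battle_id, side, tag)
        SELECT battle_id, side, 'batonpass'
        FROM movesets
        WHERE move = 'Baton Pass'
        GROUP BY battle_id, side
        HAVING COUNT(DISTINCT species) >= ?
        """,
        (threshold,),
    )
    conn.commit()


def build_demo_store() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for the collection output."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movesets "
        "(battle_id INTEGER, side INTEGER, species TEXT, move TEXT)"
    )
    rows = [
        (1, 1, "Espeon", "Baton Pass"),
        (1, 1, "Scolipede", "Baton Pass"),
        (1, 1, "Vaporeon", "Baton Pass"),
        (1, 2, "Scolipede", "Baton Pass"),
        (1, 2, "Heatran", "Stealth Rock"),
    ]
    conn.executemany("INSERT INTO movesets VALUES (?, ?, ?, ?)", rows)
    return conn
```

Calling `enrich_team_tags(conn)` and later `enrich_team_tags(conn, threshold=2)` on the same store simply regenerates `team_tags` under the new definition.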

There’s a very real question with enrichment: when do you do it? The old way was to do it at the collection step, but you could just as easily do it at report-time. There are definite advantages to performing enrichment as late as possible, namely that it gives you longer to change anything, but the trade-off is that waiting until report-time means it takes longer to generate the reports (no one likes it when the stats go up over a week after the month ends). It’s possible that with efficient DB structures (and by leveraging parallel or cluster computing, more on this another day), report-generation might not be very time-consuming, but we’ll have to see, and so it makes sense to keep this subsystem independent.

3. Reporting

The final step is actually generating the reports. Currently this is done by reading gigantic processed intermediate files that contain not just the data needed for a specific report, but the data that will go into all the reports (though the intermediate files for the detailed moveset reports are housed separately). This means that, to avoid iterating through the files multiple times, all the reports for a given metagame have to be generated together, resulting in a much-larger-than-necessary memory footprint. Under the Onix architecture, each report will only access the resources it needs. Ideally, report-generation will also be fast, thanks to optimizations done at the Collection and Enrichment steps. If we go the database route, then an entire usage ranking report could be generated from a single, fast-running SQL query.
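To give a feel for the single-query idea, here is a sketch against a hypothetical `team_members` table in SQLite. It counts each species once per team and divides by the total number of teams. This is a deliberate simplification: it is unweighted, so it ignores any rating-based weighting, and the table layout is an assumption.

```python
import sqlite3

# Usage % = teams containing the species / all teams, where a team is
# identified by the pair (battle_id, side). Unweighted on purpose.
USAGE_QUERY = """
SELECT species,
       COUNT(*) AS teams,
       100.0 * COUNT(*) / (
           SELECT COUNT(*)
           FROM (SELECT DISTINCT battle_id, side FROM team_members)
       ) AS usage_pct
FROM (SELECT DISTINCT battle_id, side, species FROM team_members)
GROUP BY species
ORDER BY usage_pct DESC, species ASC
"""


def usage_ranking(conn: sqlite3.Connection) -> list:
    """Return (species, teams, usage_pct) rows, most-used first."""
    return conn.execute(USAGE_QUERY).fetchall()


def build_demo_store() -> sqlite3.Connection:
    """Two made-up battles, i.e. four teams, with two Pokemon listed each."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE team_members "
        "(battle_id INTEGER, side INTEGER, species TEXT)"
    )
    rows = [
        (1, 1, "Heatran"), (1, 1, "Latios"),
        (1, 2, "Heatran"), (1, 2, "Ferrothorn"),
        (2, 1, "Heatran"), (2, 1, "Landorus-Therian"),
        (2, 2, "Latios"), (2, 2, "Landorus-Therian"),
    ]
    conn.executemany("INSERT INTO team_members VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for species, teams, pct in usage_ranking(build_demo_store()):
        print(f"{species:<18} {teams:>3} {pct:6.2f}%")
```

In the demo data Heatran appears on three of the four teams, so it tops the ranking at 75%.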

There’s one other addition to the reporting subsystem that I’m really excited about (assuming I can pull it off): rather than simply relying on static reports like I do now, what I’d really like to do is expose a public API (and maybe set up a simple web app) to provide much more specific usage stats than I currently offer. Imagine an interface, for example, where you can ask, “What percentage of Sceptiles that have the ability Contrary carry the move Leaf Storm?” or “What percentage of Heatrans that are on the same team as a Landorus-Therian carry Stealth Rock? Oh, and use a baseline of 1760 instead of 1695,” or even, “What percentage of Latios switch out against a Ferrothorn?” This kind of tool could be incredibly powerful and would encourage exactly the sort of analytical thinking I’d love to see more of in the Pokemon community. Plus, it sounds like a really fun project.
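As a very rough sketch of the kind of question such an API might answer, here is the Heatran example written as a plain Python function over in-memory teams. The `Member` class and `pct_with_move` are hypothetical names, and the calculation is unweighted, so the "baseline" part of the question is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Member:
    species: str
    moves: frozenset


def pct_with_move(teams: List[List[Member]], species: str, move: str,
                  teammate: Optional[str] = None) -> Optional[float]:
    """Percentage of `species` carrying `move`, optionally restricted to
    teams that also contain `teammate`. Returns None if nothing matches."""
    total = 0
    hits = 0
    for team in teams:
        if teammate is not None and not any(m.species == teammate for m in team):
            continue  # this team does not satisfy the teammate condition
        for member in team:
            if member.species == species:
                total += 1
                if move in member.moves:
                    hits += 1
    return 100.0 * hits / total if total else None


# Made-up teams, trimmed to two members each for brevity.
DEMO_TEAMS = [
    [Member("Heatran", frozenset({"Stealth Rock", "Lava Plume"})),
     Member("Landorus-Therian", frozenset({"Earthquake"}))],
    [Member("Heatran", frozenset({"Fire Blast", "Earth Power"})),
     Member("Landorus-Therian", frozenset({"Stealth Rock"}))],
    [Member("Heatran", frozenset({"Stealth Rock"})),
     Member("Latios", frozenset({"Draco Meteor"}))],
]
```

Here `pct_with_move(DEMO_TEAMS, "Heatran", "Stealth Rock", teammate="Landorus-Therian")` looks only at the first two teams, where one of the two Heatrans carries Stealth Rock.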

 

Next time I’ll dig into data types and talk specifically about how Onix will represent battles.
