eric the espeon, a Smogon user and stats junkie from way back when, made a detailed post in the stalliness discussion thread, and I just posted a reply to his questions and criticisms.
In the end, I revised the metric a bit further, but before I get into that, I want to direct your attention to my GitHub repository, where I now host my team analyzer (which contains the stalliness algorithm) as a separate file. If you navigate your way over to this folder, you can find an example of how to use the team analyzer script. Feel free to fork my repository, modify my team analyzer, and tell me if you come up with better results. If you ask me nicely, I’ll even provide you with importables of the RMT archive.
After some careful thought and a LOT of testing and re-testing, I made some revisions to my stalliness metric (namely adjusting some key moveset modifications), and the end result is something that I’m pretty happy with.
Before I get into the nitty-gritty of exactly what I changed, I’d like to show off the results:
From the feedback I got after posting my previous results, I started to wonder if stalliness wasn’t working better simply because of an outlier problem. Even full stall teams usually have one offensive member, and offensive teams will often have some utility Pokemon. Do these “outliers” throw off the combined stalliness? Easy enough to check.
In the Smogon forums thread where I discuss my stalliness metric, I asked users to submit their own teams to be analyzed by my metric. A user by the name of alkinesthetase linked me to Smogon’s RMT Archive index, which contains importable versions of dozens of teams in various tiers and playstyles. I ran my algorithm against this dataset in an attempt to come up with “cutoffs” for stall vs. semi-stall vs. balance/bulky offense vs. offense vs. heavy offense. Below are the results, both for bias (Innocent Criminal’s metric) and stalliness (my own).
As nice as it was to define a metric for stall that made physical sense (at least to me), what would be even NICER would be to see that this metric actually *predicts* something.
So what should my stall score predict? How about the length of a battle?
I’ve spent a lot of time thinking about Challenge Cup.
Central to generating random Pokemon for CC is giving each Pokemon a random EV spread. But generating random EVs is a lot harder than generating random IVs because of the requirement that total EVs cannot exceed 510 (let’s also assume that you don’t want to allow any under-trained sets, so make 510 the minimum EV count as well).
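To make the constraint concrete, here is a minimal sketch of one way to draw a spread that sums to exactly 510. It is not necessarily the approach used for Challenge Cup. It uses a "stars and bars" draw, which picks uniformly among all ways to split 510 points across six stats, and then rejects any draw where a single stat goes over a per-stat cap. The cap of 255 and the names `random_ev_spread`, `TOTAL_EVS`, `STAT_CAP` and `NUM_STATS` are assumptions made for illustration.

```python
import random

TOTAL_EVS = 510   # from the text: every spread uses exactly 510 EVs
STAT_CAP = 255    # assumption: per-stat maximum, not stated in the text
NUM_STATS = 6     # HP, Atk, Def, SpA, SpD, Spe


def random_ev_spread(rng=random):
    """Return six non-negative integers that sum to TOTAL_EVS, none above STAT_CAP."""
    slots = TOTAL_EVS + NUM_STATS - 1  # 510 "stars" plus 5 "bars"
    while True:
        # Choose distinct positions for the 5 bars among all slots.
        bars = sorted(rng.sample(range(slots), NUM_STATS - 1))
        # Sentinel edges make the first and last gaps easy to compute.
        edges = [-1] + bars + [slots]
        # The number of stars between consecutive bars is one stat's EVs.
        spread = [edges[i + 1] - edges[i] - 1 for i in range(NUM_STATS)]
        # Rejection step: redraw if any stat exceeds the cap.
        if max(spread) <= STAT_CAP:
            return spread


if __name__ == "__main__":
    print(random_ev_spread())
```

Because at most one stat can exceed 255 when the total is 510, most draws pass the cap check, so the retry loop usually finishes in one or two attempts.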
