Measuring Stall

August 16, 2012

A while back, fellow Smogonite Innocent Criminal coded up a script for me that pulled data from some specialized Pokemon Online logs he’d had us keep and generated moveset analyses and metagame analyses. Now we get most of our logs from PS, and I have to re-create his work.

So by moveset analyses, I mean things like what moves were used most frequently, most common EV spreads, that sort of thing, and that’s about 50% done and fairly straightforward.

His metagame analyses were a bit more subjective. A large component was things like identifying weather teams, pseudo-weather (Trick Room) teams and Baton Pass teams, but another component was figuring out the breakdown of offense vs. stall.

Looking through Innocent Criminal’s code, it looks like his method for measuring offensiveness/stalliness was to look solely at the EV spreads of each Pokemon. His metric, called “bias,” was calculated as follows:

$bias=EV_{atk}+EV_{spa}-(EV_{hp}+EV_{def}+EV_{spd}).$

Note that speed plays no role in bias.

A team is then categorized by summing the bias of each pokemon. If the team has a collective bias…

…of at least 1200 (or at least 600 with at least one screener on the team), it’s classified as “heavy offense;”
…between 600 and 1199–inclusive–(without a screener), it’s classified as “offense;”
…between -1500 and 599, it’s either “bulky offense” or “balanced” based on whether the team has more walls or bulky-setup Pokemon/tanks;
…between -2500 and -1501, it’s classified as “semi-stall;”
…less than -2500, it’s classified as “full stall.”
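The thresholds above can be sketched in a few lines of Python. This is not Innocent Criminal's actual script: the dictionary keys (`"atk"`, `"spa"`, `"hp"`, `"def"`, `"spd"`, `"spe"`) and the `has_screener` flag are names made up for illustration, the lowest cutoff is assumed to be -2500, and the bulky offense/balanced split is left as one label because it depends on role counts that aren't spelled out here.

```python
from typing import Dict, List

def bias(evs: Dict[str, int]) -> int:
    """Offensive EVs minus defensive EVs; Speed EVs ("spe") are ignored."""
    offense = evs.get("atk", 0) + evs.get("spa", 0)
    defense = evs.get("hp", 0) + evs.get("def", 0) + evs.get("spd", 0)
    return offense - defense

def classify_team(team: List[Dict[str, int]], has_screener: bool = False) -> str:
    """Map the summed bias of a team onto the categories listed above."""
    total = sum(bias(evs) for evs in team)
    if total >= 1200 or (total >= 600 and has_screener):
        return "heavy offense"
    if total >= 600:
        return "offense"
    if total >= -1500:
        # Splitting these two needs the wall/tank role counts, not modeled here.
        return "bulky offense or balanced"
    if total >= -2500:
        return "semi-stall"
    return "full stall"

sweeper = {"atk": 252, "spe": 252, "hp": 4}  # bias = 252 - 4 = 248
wall = {"hp": 252, "def": 252, "spd": 4}     # bias = -(252 + 252 + 4) = -508
print(classify_team([sweeper] * 6))  # heavy offense (total 1488)
print(classify_team([wall] * 6))     # full stall (total -3048)
```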

Using EVs alone is a nice rough way of determining the intent of the teambuilder–whether they were going for power or bulk–but it fails to take into account that there are some Pokemon that are just naturally more offensive, and others that are just naturally stallier: 252 HP / 252 Def Deoxys-A will still be much more effective as a sweeper than as a wall. There’s also more to stall than bulk. Innocent Criminal only used movesets to classify a Pokemon’s “role” (wall vs. tank, that sort of thing), and didn’t directly tie any specific moves (besides screens) to a team’s stalliness classification, but there are some moves, like Toxic, like Wish, like phazing moves, that are critical to stall strategies and should play some role in team-classification.

With all this in mind, I began working on my own “stalliness” metric.

My metric, which I’ll call “stalliness” is broken into two parts: a stat component and a moveset component.

Stat Component

Rather than look at EVs alone, I decided to look at the Pokemon’s actual battle-stats. Recall that the formula for calculating a Pokemon’s stat is

$stat=\left[\dfrac{lvl}{100}\times\left(IV+2*Base+EV/4\right)+x\right]\times Nature$

(x=5 unless the stat is HP, in which case x=lvl+10 and Nature does not apply).
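For anyone who wants to check numbers, here is a minimal Python sketch of the stat formula. The function name `calc_stat` and its parameters are invented for this example, IVs default to 31, and the HP branch uses the in-game constant lvl + 10 (at Lv. 100 that is the +110 needed to reach Mew's 341 HP). Nature is passed as a percentage (90, 100 or 110) so everything stays in integer arithmetic.

```python
def calc_stat(base, iv=31, ev=0, level=100, nature_pct=100, is_hp=False):
    """Battle stat from base stat, IV, EV, level and nature (rounded down)."""
    core = (2 * base + iv + ev // 4) * level // 100
    if is_hp:
        # HP ignores nature and adds level + 10 instead of 5.
        return core + level + 10
    return (core + 5) * nature_pct // 100

# Base-100 Mew at Lv. 100
print(calc_stat(100, is_hp=True))               # 341 (no HP EVs)
print(calc_stat(100))                           # 236 (neutral, no EVs)
print(calc_stat(100, ev=252, nature_pct=110))   # 328 (max EVs, +nature)
print(calc_stat(100, ev=252, is_hp=True))       # 404 (max HP EVs)
```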

Also recall the damage formula (assuming Single-battle, non-crit):

$-\Delta HP = \left[\dfrac{2 lvl + 10}{250}\times\dfrac{atk}{def}\times power+2\right]\times STAB\times Type \times rand,$

where Type is the resisted/super-effective modifier, and rand is a random number between 0.85 and 1 (inclusive). It turns out that this formula is actually only an approximation, but it’ll do for our purposes.

I propose to measure “stalliness” based on the number of hits of a (non-STAB) base-120* neutrally effective move it would take for a Pokemon to KO itself, or, more precisely, its mirror (ignoring items, abilities, status and actual movesets, and assuming the Pokemon is using its stronger attack stat against its stronger defense stat).

*or, if you prefer, base-80 STAB

As an example, consider Mew (100-base Pokemon are always fun for these kinds of calcs). With no EV investment and a neutral nature, Lv. 100 Mew ends up with stats of 341/236/236/236/236/236.

Now we have Mew use Psyshock (or, if you prefer, Fire Blast) against itself, and we assume that we get mid-damage ( $rand = (1+0.85)/2=0.925$ ):

$-\dfrac{\Delta HP}{HP}=\dfrac{(210/250\times 120 +2)\times 0.925}{341} = \dfrac{95.09}{341}\simeq 0.279$

(we’re also not worrying about rounding–again, this is supposed to be an approximate figure).

So that means it would take four Psyshocks (on average, barring crits) for no-EV Mew to KO itself.

Now let’s look at Modest 252-SpA Mew (SpA stat of 328):

$-\dfrac{\Delta HP}{HP}=\dfrac{(210/250\times 328/236 \times 120+2) \times 0.925}{341} \simeq \dfrac{131.4}{341}\simeq 0.385,$

and we’re down to three-hit KO range.

Finally, let’s consider 252-HP/252-Def Impish (+Def nature) Mew (HP=404):

$-\dfrac{\Delta HP}{HP}=\dfrac{(210/250\times 236/328 \times 120+2) \times 0.925}{404} \simeq \dfrac{68.94}{404}\simeq 0.171.$

It now will take six hits to deliver the KO.
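The three Mew calculations can be reproduced with a short Python sketch. The helper names `damage_fraction` and `hits_to_ko` are mine, and like the hand calculations this skips all in-game rounding and uses mid-roll damage (rand = 0.925).

```python
import math

def damage_fraction(hp, atk, dfn, level=100, power=120, rand=0.925):
    """Fraction of max HP removed by one neutral, non-STAB hit."""
    damage = ((2 * level + 10) / 250 * atk / dfn * power + 2) * rand
    return damage / hp

def hits_to_ko(fraction):
    """Whole number of hits needed to remove all HP."""
    return math.ceil(1 / fraction)

mews = {
    "no-EV Mew": (341, 236, 236),
    "Modest 252-SpA Mew": (341, 328, 236),
    "252-HP/252-Def Impish Mew": (404, 236, 328),
}
for label, (hp, atk, dfn) in mews.items():
    frac = damage_fraction(hp, atk, dfn)
    print(f"{label}: {frac:.3f} of HP per hit, {hits_to_ko(frac)} hits to KO")
```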

Now that you have a general idea of how these calculations work, let me go ahead and explicitly define an initial version of my stat-based stall metric:

$m=-\log_2\left[\dfrac{\left(\dfrac{2 lvl+10}{250}\times \dfrac{\max\{Atk,SpA\}}{\max\{Def,SpD\}}\times 120 + 2\right)\times 0.925}{HP}\right].$

Note the lack of rounding. I’ve thrown in a logarithm for reasons that will become more clear when I move to looking at full teams (the negative out front is so that higher score indicates greater stalliness). Returning to our examples, Modest max-SpA Mew has a metric of 1.38, and Max Defense Mew has a metric of 2.55. Some other example scores:

Max Attack +Nature Deoxys-A: -1.36
Max SpD Blissey: 4.00
Adamant Offensive Dragonite: 1.01
Bulky Dragon Dance Dragonite (Adamant): 1.48
Scarf Little Cup Mienfoo: 0.40
Shedinja breaks this metric, so we must manually assign it a score of 0.0 (Shedinja OHKOes itself)
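Here is a hedged Python sketch of the stat component m. The function name `stat_stalliness` is made up, and detecting Shedinja by its 1 HP is just one convenient way to apply the manual override. The Mew stats come from the examples above; the Deoxys-A stats (241 HP, 504 Atk, 396 SpA, 76 Def and SpD) are my own calculation from its base stats with a neutral-Def, neutral-SpD nature and are an assumption, not something taken from the logs.

```python
import math

def stat_stalliness(hp, atk, spa, dfn, spd, level=100):
    """Stat-based stalliness: log2 of the (unrounded) hits needed to KO a mirror."""
    if hp == 1:
        # Shedinja: the formula breaks down, so assign 0.0 by hand.
        return 0.0
    damage = ((2 * level + 10) / 250 * max(atk, spa) / max(dfn, spd) * 120 + 2) * 0.925
    return -math.log2(damage / hp)

print(round(stat_stalliness(341, 236, 328, 236, 236), 2))  # Modest max-SpA Mew: 1.38
print(round(stat_stalliness(404, 236, 236, 328, 236), 2))  # Max Defense Mew: 2.55
print(round(stat_stalliness(241, 504, 396, 76, 76), 2))    # Max Attack Deoxys-A: -1.36
```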

The first thing I notice is that it’s very rare to get a metric less than zero (which corresponds to the Pokemon being able to OHKO itself). Even non-Eviolite Mienfoo–one of the most offensive Pokemon in Little Cup, which is itself a very offense-heavy tier–stays above zero (note that one Scarf Mienfoo *will* OHKO another non-Eviolite Mienfoo with Hi Jump Kick, which has 62.5% more power than this hypothetical 120-BP non-STAB move I’m forcing everyone to use). So clearly I’ll need to do some renormalization.

The second thing I notice is that going from a purely offensive set to a bulky offensive set raises Dragonite’s stalliness by about 0.5 (going from Scarf to Bulky Mienfoo raises its stalliness by another 0.4 without factoring in Eviolite). From Mew, I further note that going from purely offensive to purely defensive raises stalliness by about 1. This gives me some idea of the “scale” of the metric–how much certain changes affect the score. With that in mind, I’m ready to proceed to the next part.

(Before I do, let me just throw up one more value: a metric of 1.58 corresponds to a three-hit KO using my hypothetical move)

Moveset Component

By “moveset,” what I actually mean is everything non-stat related. That means items, abilities and moves. The theory is that each move/item/ability a Pokemon has will raise or lower its stalliness. I’m going to assume that each move/item/ability does its modification independently (“non-interaction,” if you will) and that they modify the metric by adding or subtracting from it (remember that the metric is the base-2 logarithm of how many hits it takes for a Pokemon to KO itself, so adding to or subtracting from this value is the equivalent of multiplying or dividing that number of hits).

Some of these modifications will be couched in theory, which I will explain. The rest will be judgement calls.

The abilities Pure Power and Huge Power subtract 1 from the metric. These abilities double the user’s attack. If we factored this into our calculations of the initial metric, it would roughly halve the number of turns required to KO. Taking the log (base-2), this would result in a metric of 1 less than the reported metric.
Choice items subtract 0.5 from the metric. Similar to the above, the base-2 log of 3/2 is 0.585. I’m rounding to 0.5 for simplicity. You’ll also note that I’m lumping Choice Scarf in with Band and Specs. This is purely a judgement call, based on my experience that Choice Scarfed Pokemon are about as offensive as other Choiced Pokemon.
Life Orb subtracts 0.5 from the metric. $\log_21.3=0.379$ , which is a little more than half as much as what I calculated for Choice items. So why does Life Orb yield the same modification as Choice Band/Specs? Because being able to switch up moves means you’re more likely to be able to choose a super-effective move. Plus, recoil is pretty much antithetical to stall.
Leftovers do nothing to the metric. I based this decision on my observations of the differences in metric between bulky- and fully-offensive sets. In some ways, Leftovers are the “anti-Life Orb” in that it adds health where Life Orb takes it away, but the difference between a Life Orb and Leftovers set shouldn’t be a whopping 1.0. Fine then, you might suggest, split the difference and have it be Life Orb -0.25, Leftovers +0.25. The two problems with this are that (1) I truly believe that Life Orb should have the same effect as Choice items, and (2) in my experience, Leftovers is the item you throw on your Pokemon when you don’t have anything better to give it. I see plenty of Leftovers Pokemon who run offensive (even heavily offensive) sets. On the other hand, you rarely see a bulky Pokemon go with Life Orb.
Eviolite adds 0.5 to the metric, using the reverse calculation as for Choice items.
The move Stealth Rock does nothing to the metric. Offensive teams like Rocks because they break sashes, make possible certain 1HKOs and 2HKOs and are quick to set up. Stall teams like Rocks because they’re an entry hazard that hit all (non-Magic Guard) Pokemon, which makes them great for shuffle-heavy strategies.
The move Spikes adds 0.5 to the metric. Spikes take a while to set up and work out better if you’re in it for the long haul.
The move Toxic Spikes adds 0.5 to the metric. Toxic Spikes are really anti-stall more than they are stall, but you’re really only going to worry about setting up Toxic Spikes if you’re planning for this battle to take a while. The reason I’m not assigning more of an effect to Toxic Spikes is that one layer of the hazard only results in regular poisoning, which isn’t very stally.
The move Rapid Spin does nothing to the metric. Offensive teams benefit from having hazards removed as much as stall teams.
The move Toxic adds 1 to the metric. Unlike Toxic Spikes, going for straight-up Toxic indicates that you’re in this for the long haul.
The move Will-o-Wisp adds 1 to the metric. Even though I said above, under Toxic Spikes, that regular poisoning isn’t worthy of a full point, burn has the added effect of crippling offensive threats, stopping sweeps and often allowing one’s Pokemon to recover-stall. Speaking of…
Any moves that do nothing but restore health (Recover, Wish, Synthesis, Rest, Leech Seed but not Drain Punch or Pain Split) add 1 to the metric. This should be fairly self-explanatory. In a similar vein,
The ability Regenerator adds 0.5 to the metric. It’s less simply because it recovers less health.
Heal Bell and Aromatherapy add 0.5 to the metric. It’s true that hyper-offense teams use these moves to get rid of otherwise-crippling status ailments, but it’s rare that you see a Pokemon with a cleric move sweeping on its own.
The abilities Chlorophyll, Flare Boost, Guts, Hustle, Moxie, Reckless, Sand Rush, Solar Power, Speed Boost, Swift Swim, Technician, Tinted Lens, Toxic Boost, and Moody (where allowed) subtract 0.5 from the metric. These abilities all either boost stats, assist in setting up sweeps (Sand Veil), raise net damage, or similar.
The abilities Arena Trap, Magnet Pull, and Shadow Tag subtract 1 from the metric. Trapping is essential for dealing with annoying walls or taking out threats to sweeping.
The abilities Dry Skin, Filter, Hydration, Ice Body, Intimidate, Iron Barbs, Marvel Scale, Natural Cure, Magic Guard, Multiscale, Poison Heal, Rain Dish, Rough Skin, Solid Rock, Thick Fat, and Unaware add 0.5 to the metric. These abilities either reduce damage, restore health or deal passive damage. Pressure would be on this list as well, except most of the Pokemon that get it don’t really have a choice about it.
The abilities Slow Start and Truant add 1 to the metric. It’s not so much about adding stalliness as about subtracting offensiveness.
The item Light Clay subtracts 1 from the metric. Screens are immensely important for hyper offense and, due to their temporary nature, are rarely used in stall.
Any move boosting attack, special attack, speed or evasion (where allowed) subtracts 1 from the metric. Which is more offensive? Banded Dragonite or DD Dragonite? I’d say Dragon Dance, since choice-locked Pokemon are usually more for revenging than for sweeping. Set-up sweepers are the heart of heavy offense, and here is where I try to emphasize that. Note that, based on the above section, Bulky DDnite ends up having about the same metric as Banded Dragonite. I think this is correct.
Substitute subtracts 0.5 from the metric. Substitute is mostly used to allow its user to set up for a sweep, and the 25% health cost means that it doesn’t really work great with stall (which relies on a lot of switching anyways). There are Prankster-Sub-Recover strategies, but in that case, the net effect is in favor of stall (+0.5).
The move Protect (and variants) adds 1 to the metric. From a mathematical standpoint, it’ll take you at least twice as many turns to KO this Pokemon.
The move Endeavor subtracts 1 from the metric. The thing with FEAR is that someone’s getting KOed.
The move Super Fang subtracts 0.5 from the metric. Great move for wallbreaking.
The move Trick subtracts 0.5 from the metric, as Trick is usually used for shutting down walls.
The move Psycho Shift adds 0.5 to the metric. Psycho Shift is almost always used to get a burn or Toxic poison onto an opponent. Perhaps then it should add 1 to the metric, but the extra 0.5 comes from Magic Guard (any Pokemon using Psycho Shift without Magic Guard should not get a full boost).
Phazing moves (including Haze, Roar, Dragon Tail and Circle Throw) and Paralysis moves that do not directly do damage (Thunder Wave and Stun Spore but not Body Slam) and Confusion moves that do not directly do damage (Swagger but not DynamicPunch) add 0.5 to the metric. Phazing is key to prevent heavy offense teams from demolishing their opponents. This is especially true for stall teams. Stall teams also often use shuffling moves to rack up residual damage, and thus you might think that such moves should have more of an effect on the metric, but if I did that, I’d essentially be counting the hazards twice. Paralysis shuts down most sweepers, and Confusion usually forces a switch.
Sleep-inducing moves that do nothing else (read: not Relic Song) subtract 0.5 from the metric. Sleep is a temporary status. While not uncommon on stall teams, it works best on offensive teams to allow a Pokemon to set up while the sleeping Pokemon is switched out (or napping).
The item Red Card adds 0.5 to the metric. Red Card is a bit gimmicky, but it is useful for phazing. Still, it’s a one-time use item and thus it really doesn’t help out with stall in the long-term. So why am I including it here? So that I can add the following rule:
One-time use items subtract 0.5 from the metric. The idea here is that consumption is antithetical to stall. Stall teams are often in pretty much the exact same position 50 turns in as they are 25 turns in. It’s what makes stall so annoying. There is an exception to this reasoning: Harvest and Recycle. See below. Note that this negates the effect of Red Card, which I believe is well and good.
Harvest and Recycle add 1 to the metric. This more than negates the above rule for Pokemon with Harvest or Recycle. If you’re using this ability to get back a Sitrus Berry, that puts you essentially on par with Regenerator Pokemon (+0.5), and if you’re using it for Chesto-Resto (God, what an annoying strategy), the boost from Rest means that the net effect is +1.5.
Damaging moves with negative additional effects for the user (such as recoil, stat drops, confusion…) subtract 0.5 from the metric. Stat-drop moves are usually counter to hyper-offense but are very present in regular offense, due to their immense base powers (usually the only reason a user uses such moves).
Suicide moves (e.g. Explosion, Final Gambit, Healing Wish) subtract 1 from the metric.
In tiers where they’re allowed, OHKO moves subtract 1 from the metric.

One further thing to note: individual modifications do not stack (that is, if a Pokemon has Rest and Recover–God knows why–or Agility and Swords Dance, the modification only applies once).
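To make the "applies once" rule concrete, here is a small Python sketch covering only a handful of the rules above. The `RULES` table layout and the `moveset_modifier` name are invented for illustration, the trigger sets are deliberately incomplete, and Roost is counted as a pure recovery move even though it isn't named in the list.

```python
# (rule name, modifier, names that trigger the rule). Only a subset of the rules.
RULES = [
    ("choice item", -0.5, {"Choice Band", "Choice Specs", "Choice Scarf"}),
    ("life orb", -0.5, {"Life Orb"}),
    ("one-time item", -0.5, {"Focus Sash"}),
    # Roost is an assumption here: it only restores health, so it is grouped with Recover.
    ("recovery move", +1.0, {"Recover", "Wish", "Synthesis", "Rest", "Leech Seed", "Roost"}),
    ("regenerator", +0.5, {"Regenerator"}),
    ("cleric move", +0.5, {"Heal Bell", "Aromatherapy"}),
    ("defensive ability", +0.5, {"Natural Cure", "Multiscale", "Poison Heal"}),
    ("toxic", +1.0, {"Toxic"}),
    ("protect", +1.0, {"Protect", "Detect"}),
    ("boosting move", -1.0, {"Swords Dance", "Dragon Dance", "Agility"}),
    ("drawback move", -0.5, {"Outrage", "Psycho Boost", "Hi Jump Kick"}),
]

def moveset_modifier(traits):
    """Sum the modifiers of every rule triggered by a Pokemon's moves, item and ability.

    Each rule counts at most once, no matter how many of its triggers are present.
    """
    traits = set(traits)
    return sum((delta for _, delta, triggers in RULES if traits & triggers), 0.0)

print(moveset_modifier(["Rest", "Recover"]))  # 1.0, not 2.0
print(moveset_modifier(["Natural Cure", "Wish", "Protect", "Heal Bell", "Toxic"]))  # 4.0
```

Keying the table by rule rather than by individual move is what keeps Rest + Recover from being counted twice.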

Phew.

Putting it All Together

Let’s return to our examples to see how the moveset component changes those scores.

Let’s have Deoxys-A know Psycho Boost and be holding a Life Orb. That lowers its stall score so far to -2.36
Blissey has Natural Cure and knows Wish, Protect, Heal Bell and Toxic. Its score is now 8.00
Adamant Offensive Dragonite has Multiscale and knows Dragon Dance and Outrage. Its score is now 0.01
Bulky Offensive Dragonite runs the same set, except with Roost. Its score holds steady at 1.48
Scarf LC Mienfoo knows Hi Jump Kick. Regenerator + Scarf + HJK lowers its score to -0.10
Shedinja is holding a Focus Sash and runs Swords Dance and Protect. It ends up with a score of -0.5

The Bulky DDnite result leads me to my last modification: to get the final stall score, subtract 1.58, which normalizes the score to be centered around the three-hit KO rather than the one-hit KO (which is what the initial score of zero meant).

All that remains now is to combine individual Pokemon scores into a team score and classify the team as a whole. Combining into teams is easy–linearly average the scores. This is where the log nature of this measure comes in handy, as otherwise a single stally Pokemon (who would, say, take 16 turns to KO itself *cough cough Blissey*) would dominate the team metric (the equivalent would be to do a multiplicative, i.e. geometric, mean of the non-log’d scores).
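A quick Python sketch of the normalization and averaging steps, using the six example scores as a stand-in "team" (they aren't a legal team, just convenient numbers). The names `final_score` and `team_stalliness` are made up for this example, and the offset is the rounded 1.58 rather than an exact log2(3).

```python
import math
from statistics import mean

THREE_HKO_OFFSET = 1.58  # log2(3), rounded

def final_score(raw_score):
    """Shift a raw score so that 0 means a three-hit KO instead of a one-hit KO."""
    return raw_score - THREE_HKO_OFFSET

def team_stalliness(raw_scores):
    """Linear average of the normalized per-Pokemon scores."""
    return mean([final_score(s) for s in raw_scores])

# Deoxys-A, Blissey, offensive Dragonite, bulky Dragonite, Mienfoo, Shedinja
raw = [-2.36, 8.00, 0.01, 1.48, -0.10, -0.5]
hits = [2 ** s for s in raw]  # un-logged scores ("hits to KO"); Blissey alone is 256

print(team_stalliness(raw))                   # average of the log scores
print(mean(hits))                             # arithmetic mean, dominated by Blissey
print(math.prod(hits) ** (1 / len(hits)))     # geometric mean of the hit counts
```

The geometric mean of the hit counts is exactly 2 raised to the average raw score, which is why averaging the log scores keeps one Blissey from swamping the rest of the team.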

But now where are the cutoffs for hyper offense vs. offense vs. balanced vs. semi-stall vs. full stall? I’m actually going to take a cop-out and not define them. Instead, I’d rather present these results as a spectrum, a nice graph that shows the distribution of stall scores for a given tier. I’m curious to see if it really is a spectrum or whether there are clumps around particular scores. Should be interesting! Expect it in the next month or two!


4 Comments

DoughBoy

Good Stuff Antar. It is interesting to note that your metric would work better in the hackmons tier, where 252 EV’s in every stat would make everything more defensive (2 x’s vs 3 -x’s) when in reality it is a more offensive tier. How do you think your metric would fair in the 6 moves metagame: http://www.smogon.com/forums/showthread.php?t=3470224.

- antar1011
  
  Hmm… interesting question! I don’t really know too much about how the six-move metagame plays, but presumably this metric should still work. I’ll have to do some testing once I finish coding up everything.
  
