Predicting The International 2017

STRATZ
13 min read · Jul 28, 2017

As the capstone of each season of professional DotA 2, The International 2017 has once again delivered an unprecedented prize pool of over $22 million. The impact this tournament continues to have on this generation of eSports enthusiasts cannot be overstated, and neither can the determination of each of the 18 participating teams.

Competition in DotA 2 is characterized by change. This year, following the groundbreaking release of patch 7.00 and beyond, the eSporting arena itself has completely evolved. Strategies that carry some teams to an early exit, and others into the history books, would have been unthinkable even a year ago.

Keeping up with the ever-changing competitive trends [meta] of DotA 2 can be both thrilling and intense. Thankfully, for the first time ever, comprehensive international gameplay analysis is made possible via the new STRATZ API for DotA 2.

Powered by an exclusive data center, the STRATZ API was able to analyze every one of the 26,671 professional DotA 2 matches in minutes. Using that data, we’ve been able to create a dynamically updated, fully simulated TI7 prediction engine. As the tournament progresses, so will our model.

Outcomes are forever uncertain, but stats never lie. With that in mind, the team at Stratz.com is pleased to present the current results of our tournament simulations.

Top 5

Team Liquid: Having won all of their last 8 series in a row, Team Liquid is on fire. It’s no surprise that our simulations have them coming away with the Aegis more than 20% of the time.

One curiosity that remains is Team Liquid’s decision to play in DreamLeague, which ended just 2 weeks prior to the start of The International. With that win worth less than 1% of TI7’s grand prize, revealing so much of their strategy this close to the event may prove risky.

VP: With one of the most dynamic hero rosters in the professional scene [with 81 heroes played in 17 matches at The Summit 7], VP is the only team that’s statistically on par with Liquid at the moment.

The only shadow cast on this powerhouse is their squandering of advantages in the Kiev Major finals against OG.

EG: The boys in blue have unsurprisingly fared well in our simulations, holding the 3rd highest championship probability. Having claimed the Aegis in 2015, and finishing top 3 last year and in 2014, EG are among the highest rated TI performers of all time.

Still, EG’s success has long been driven by the drafting prowess of ex-captain-turned-CEO PPD. Their new captain Cr1t- has led many great performances, but fell out of last year’s TI [while playing for OG] without winning a single series in the Main Event.

LGD.Forever Young: Perhaps the most surprising of our top results is LGD-FY, the 2nd place China Region qualifier. This team has been phenomenal this year, though its performances haven’t been as celebrated as some others.

LGD-FY won every group game at Epicentre, before being knocked out only by the eventual winners, Liquid. They repeated that impressive performance at the Mars Dota League, narrowly losing the final to their sister team LGD Gaming, whom they’d defeated in the upper-bracket finals. With their Boston Major appearance hampered by visa issues and stand-ins, plus no Kiev Major qualification, this team hasn’t had many opportunities to shine on Dota 2’s biggest stages, until now.

OG: How can the highest performing team in this year’s Majors walk away with the Aegis in less than 6% of our TI7 simulations? By severely under-performing in all other recent LAN events.

OG is unquestionably one of the most OP teams on DotA 2’s biggest stages today. Even so, their worst upset came at last year’s TI, losing straight series after being directly invited to the event. With serious nerfs to illusion heroes, will we see a repeat of TI6?

Behind the Stats

Before we even began to think about running TI simulations to evaluate probable winners, we needed a rating system that could reliably rank the strength of each team.

DotA 2’s Matchmaking Rating (MMR) is an iteration of what’s known as the ELO system. Basically, the higher rated your opponent is, the more MMR points you’ll gain by winning against them, and the less they’d earn from defeating you. But can this type of system be used to reliably rank professional DotA 2 teams?
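To make that concrete, here’s a minimal sketch (in Python, and not our production code) of the standard ELO expected-score and update step. The 400-point scale is the conventional one, and K = 30 is just a placeholder we’ll come back to later:

def elo_expected(rating_a, rating_b):
    """Probability that A beats B under the standard ELO model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=30):
    """Return the new ratings after a single match between A and B."""
    expected_a = elo_expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)       # A gains more for beating a stronger B
    return rating_a + delta, rating_b - delta

# Beating a higher-rated opponent is worth more:
print(elo_update(1000, 1200, a_won=True))    # the underdog gains about +22.8
print(elo_update(1200, 1000, a_won=True))    # the favorite gains only about +7.2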

Let’s try to use ELO for DotA 2 team rankings!

After applying an MMR-like ELO system to rank DotA 2 teams, an unexpected pattern emerges.

All of the teams that made it through the closed qualifiers for big tournaments were being ranked suspiciously high, often above teams that had been directly invited. For example, Team Secret consistently came out as the 1st or 2nd highest rated team in the world. While Secret are undoubtedly a strong team, this was a surprising result.

So why was this happening?

The open qualifiers are typically won by newly formed teams, or teams who do not have significant pro experience.

With no previous professional matches, these new teams earn a default professional MMR (e.g. 1000). Their true professional rating at this point is unknown.

When Team Secret beat CoolBeans, or Fnatic beat Skatemasters, in closed qualifier matches, an ELO-based MMR system treats it as though they are defeating teams that genuinely earned a professional 1000 MMR rating, and rewards them with points that reflect that result.

By contrast, teams like Planet Dog and Hellraisers fought their way through the open and closed qualifiers, lending some credibility to their professional MMR; many new teams have no such track record.

These ‘free points’, handed out right before TI to teams playing in the closed qualifiers, were responsible for their unrealistically high ratings. So how did we solve this?

Introducing Glicko(2) for Dota 2 rankings!

The Glicko(2) rating system is similar to ELO, except that it attempts to address the issues we’ve noted.

In a Glicko-based rating system, new teams start with a high rating uncertainty. This means that a newcomer like Skatemasters will experience rapid rating changes early on, quickly calibrating them to a more appropriate level. At the same time, Skatemasters’ opponents will receive relatively minor rating shifts, due to the uncertainty associated with Skatemasters’ rating at that time.
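To illustrate the principle, here’s a simplified, Glicko-1-style single-game update (it omits Glicko’s step of inflating uncertainty over time, and the ratings and deviations below are made up). Notice how a low-uncertainty team barely moves after beating a high-uncertainty newcomer, while the newcomer’s own rating swings dramatically:

import math

Q = math.log(10) / 400  # Glicko scaling constant

def g(rd):
    """Dampening factor: the more uncertain the opponent, the less a result tells us."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def glicko_update(rating, rd, opp_rating, opp_rd, score):
    """One-game Glicko-1 update. score = 1 for a win, 0 for a loss."""
    expected = 1.0 / (1.0 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400.0))
    d_sq = 1.0 / ((Q ** 2) * (g(opp_rd) ** 2) * expected * (1.0 - expected))
    new_rating = rating + (Q / (1.0 / rd ** 2 + 1.0 / d_sq)) * g(opp_rd) * (score - expected)
    new_rd = math.sqrt(1.0 / (1.0 / rd ** 2 + 1.0 / d_sq))
    return new_rating, new_rd

# An established team (RD 50) beats a brand-new team (RD 350):
print(glicko_update(1500, 50, 1500, 350, score=1))   # gains only a few points
print(glicko_update(1500, 350, 1500, 50, score=0))   # the newcomer drops by roughly 175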

Despite this improvement, we discovered a new problem with this rating system.

Glicko is designed for formats with a consistent distribution of matches, like ladders or leagues. Professional DotA 2 is almost exclusively comprised of tournaments. A team like EG, who rarely have to play through qualifiers, can go 2 weeks without a single pro-game, then play 10 in 3 days.

If we set our rating period to span an entire tournament, then a team like Newbee, who might come into a Major as the highest rated team, could find that, due to a new meta, they suddenly lose most of their group stage matches (reminiscent of their post-TI4 slump).

If, for example, TNC then beat Newbee in the 1st round of the lower bracket, a sensible model should treat it as TNC beating a mediocre team. However, due to still being in the same rating period, Glicko treats this as if TNC beat the best team in the world.

If we shorten the rating period to address this, we risk over-estimating the uncertainty in teams’ ratings. Glicko works best with at least 10–15 games per team in a rating period, far more than we would get if we split LAN events into multiple rating periods.

ELO is back!

Why did we discard ELO?

Many brand new teams in the model had too large an impact on the ratings of the teams they played, handing out ‘free points’.

After further analysis, our analytics expert pointed out that this could be solved using a dynamic K-factor. The K-factor in ELO is a system constant that determines how far ratings move after each result. A larger K = more points gained from wins, and more points dropped from losses.

By assigning a small K-factor to matches against brand new teams, winning against them leaves the closed qualifier teams’ ratings relatively unchanged.

The precise values we chose to mean ‘high K’, ‘low K’ and ‘medium K’ came from an analysis of the values that international chess organizations use. Reliable ratings are a crucial part of competitive chess, which is also what the ELO and Glicko systems were originally designed for. If you’re going to trust anyone when developing ranking systems, you trust chess-people.

After committing to implementing a dynamic K-factor, we discovered we could use it to make the rankings more accurate in some interesting ways. For example:

Surely matches at the main event of The International and Valve Majors mean more than minor LANs?

Of course they do! So we gave them a high K-factor.

Surely matches on the current patch should be more significant than previous patches?

They should! So we increased the K-factor for 7.06e matches.

Below is the precise definition for K depending on match conditions:
Opponent played < 3 games: k = 10
Opponent played < 10 games: k = 20
Opponent played >= 10 games: k = 30
TI/Major main event match: k = 40
Patch 7.06e match: k = k_above + 10
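In code, that rule set looks something like the sketch below (one assumption on our part: the main-event value takes priority over the opponent-experience tiers, with the 7.06e bonus stacked on top):

def dynamic_k(opponent_games_played, is_ti_or_major_main_event=False, is_patch_706e=False):
    """K-factor for a single match, following the rules listed above."""
    if is_ti_or_major_main_event:
        k = 40
    elif opponent_games_played < 3:
        k = 10
    elif opponent_games_played < 10:
        k = 20
    else:
        k = 30
    if is_patch_706e:
        k += 10                              # current-patch matches count for more
    return k

print(dynamic_k(25))                                              # 30: established opponent
print(dynamic_k(25, is_ti_or_major_main_event=True,
                is_patch_706e=True))                              # 50: Major main event on 7.06e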

Hurdles we had to jump

So we finally had a sensible looking team ranking system after applying the dynamic K-factor above. What else did we then encounter?

1. Roster Switching

Think about the recent roster moves: Onyx → DC, and DC → Thunderbirds. The Onyx players are now credited with the team MMR that the old DC roster earned, while the former DC players, now playing as Thunderbirds, go back to having the MMR of a new team.

Glicko would be okay in this situation, as high uncertainty lets ratings quickly adjust to actual performance. Unfortunately, since we are using ELO, recalibration is a slower process. Because of this, we decided to hard-code recent player-organization changes, and when they occurred, into the model.

2. The OG factor

Oddly enough, any TI7 predictions based on purely results-generated ratings will tell you that OG are a middle-of-the-pack team, with relatively slim odds of winning. While our simulations don’t predict OG as favorites, other ranking systems can place them below Cloud 9, Team Secret, IG, Team Empire, etc.


Stats don’t lie, but the interpretation of statistics is often wrong. If we processed match results alone, our ratings wouldn’t reflect OG’s incredible performances at all recent Majors, nor weigh those properly against their lackluster showings at minor LANs. Regardless of the reasons for OG’s success at big events, the fact remains that they’ve been utterly dominant. Because of this, OG are likely to be contenders at TI7.

So how did we take the “OG factor” into account in our simulations?

We decided to offset each team’s MMR by the amount it had increased or decreased across the Boston and Kiev Majors, for teams who played at both events. We ignored TI6, because a year is a very long time in Dota 2. We also ignored teams that attended both Majors but had late replacements or visa issues, which could have adversely affected their performance. This left us with the following adjustments:

Professional MMR shifts based on previous Majors:

OG: +38
VP: +39
EG: -10
Newbee: -83
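Applying these offsets is then a one-line adjustment per team. A tiny sketch (the 1400 base rating is hypothetical, not one of our actual pre-TI numbers):

MAJOR_OFFSETS = {"OG": +38, "VP": +39, "EG": -10, "Newbee": -83}

def adjusted_rating(team, base_rating):
    """Shift a team's rating by its net MMR change across the Boston and Kiev Majors."""
    return base_rating + MAJOR_OFFSETS.get(team, 0)   # teams not listed are left untouched

print(adjusted_rating("Newbee", 1400))                # 1317 from a hypothetical 1400 base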

These MMR shifts are interesting, especially considering that VP actually gained more MMR at the previous Majors than OG did. After everything we said above, how is this possible?

As it turns out, despite often under-performing outside of Majors, OG had solid finishes immediately before each one: runner-up at The Summit 6 just before the Boston Major, and runner-up at DAC just before the Kiev Major. This meant they actually entered those tournaments as one of the top ranked teams, and therefore received less substantial MMR increases. Statistically, they’re going into TI7 in considerably worse form than they had before Kiev or Boston.

In contrast, VP came into those tournaments ranked lower, while Newbee were the highest ranked team before the Boston Major. VP defeated them 2–0, and Newbee then suffered a first-round exit at the hands of Ad Finem, which helps explain both VP’s large gain and Newbee’s massive loss.

3. Regions

Teams tend to play against opponents from their own region far more than against those outside it. While the ratings of teams within a region are still accurate relative to each other, an EU team with a 1200 professional MMR may be stronger than a South American team with a 1300 MMR. This is because SA teams predominantly play against each other, and therefore exist on an almost entirely separate ranking ladder. The average SA team might be far weaker than the average EU team, and until they play more international matches, our rating system has difficulty knowing whether this is the case.

International tournaments do allow MMR to flow between regions; however, (a) it takes time for that MMR to be redistributed among a region’s teams, and (b) not enough international matches occur to make it fair for all regions.

It seems like our dynamic K-factor, which reduces point gains against new teams, somewhat lowered the region bias, but this is a highly subjective judgment. While the rankings shown below are still slightly biased, we’re able to quantify the inaccuracy in cross-region rankings and take it into account when simulating individual matches (see Estimating win probabilities below).

After these hurdles, we finally had a team rating system we felt was a good enough model of reality, which we could use for predicting the outcomes of individual matches, thus simulating TI7.

Here are the final rankings:

Planet Odd would be on 1295 points, just above Team Secret. They were clearly the strongest team to miss out on TI7 qualification.

Estimating win probabilities

The estimated win chance for a team is already built into the ELO model. That’s how it knows how many points to award a team when they win.

Fig 1. Converting Rating Difference to Expected Win Probability

Still, we were uncertain whether these ELO estimates were overconfident, so we ran logistic regression models (i.e. checking how closely our guesses align with reality) on the ELO predictions against existing match results.

Across all pro matches, the results showed ELO was pretty accurate in its predictions. When ELO said that Team A, rated 200 points higher than Team B, had a 75% chance to win, Team A did in fact win very close to 75% of the time. See Fig 1.
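A sketch of the kind of calibration check we mean, using synthetic stand-in data (in real life the rating differences and outcomes come from the full pro-match history):

import numpy as np
from sklearn.linear_model import LogisticRegression

def elo_prob(diff):
    """ELO's own win probability for a given rating difference."""
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

# Synthetic stand-in data: rating differences and outcomes drawn from the ELO model itself.
rng = np.random.default_rng(0)
diff = rng.normal(0, 150, size=20_000)
won = rng.random(20_000) < elo_prob(diff)

# Fit a logistic regression of outcome on rating difference...
model = LogisticRegression().fit(diff.reshape(-1, 1), won)

# ...and compare its fitted probability to ELO's claim at a 200-point gap.
print("ELO says:     ", elo_prob(200.0))                       # about 0.76
print("Fitted model: ", model.predict_proba([[200.0]])[0, 1])  # agrees here by construction

On real matches, the interesting question is whether the fitted curve sits above or below the ELO curve, i.e. whether ELO is over- or under-confident.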

This calibration doesn’t mean our model is perfect; in most matches, ELO can only predict the winner with relatively low confidence. For example, VP (1497) vs EG (1430) predicts a VP win with only 59% certainty. It’s possible that a more complex model could increase this certainty.

Cross-region accuracy

We re-ran the above tests, but only on cross-region matches. These came out noticeably closer to 50/50 than the ELO predictions, probably because regions with undervalued MMR ratings score upsets against so-called ‘favorites’ more often than the ELO model expects. Cross-region games in the simulation were therefore played out with correspondingly reduced certainty.
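We haven’t reproduced the exact adjustment values here, but conceptually it amounts to blending the ELO probability toward a coin flip for cross-region games. A minimal sketch, where the 0.7 blend factor is purely illustrative (the real value comes from the regression fit described above):

def cross_region_prob(elo_prob, shrink=0.7):
    """Blend an ELO win probability toward 50/50 to reflect cross-region uncertainty.

    shrink = 1.0 trusts ELO fully; shrink = 0.0 treats the match as a coin flip.
    """
    return 0.5 + shrink * (elo_prob - 0.5)

print(cross_region_prob(0.75))   # 0.675: noticeably less confident than raw ELO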

In the future, we might use similar techniques to quantify by how much ELO ratings should be skewed in favor of each region.

Why not use neural networks to predict match results?

When it comes to developing a prediction model, there are two common ways it might fail.

First is the OG factor described above, where the model oversimplifies reality. At first, our ELO model ignored how well teams performed at big events, pretending that performance at much smaller LANs was just as indicative of TI7 form as something like the Kiev Major.

The second issue is effectively the opposite: making the model too complex for the data available. We suspect that this was the issue Bing encountered when they predicted the Kiev Major brackets.

Bing predicted a TNC vs DC grand-final. Anybody who knows the DotA 2 pro-scene can understand that this was an unusual prediction.

If you try to make your model too complex, taking into account too many variables, you reach a point where there is not enough data for everything to average itself out. It’s like flipping a coin only 3 times and getting 3 heads. A model based on those results would claim flipping a coin always gets heads.

While a less complex model might produce greater uncertainty, it can also provide more sensible results. As we move forward, we’ll be expanding on the inputs we consider, at a slow enough pace to ensure the models always stay sensible, and importantly, understandable (which can be another potential downside when using advanced machine learning techniques).

Simulating TI

Monte-Carlo simulation is the standard way to play out a tournament probabilistically. We ran 10,000,000 individual instances of TI7, letting each match result be determined by the ELO win probability: if VP has a 59% chance to beat EG, roll a random number from 1 to 100; if it’s 59 or less, VP wins, otherwise EG wins.

The simulation fully plays out all of the group stage matches and the main event’s double elimination bracket.

We’ve assumed that each group will contain 3 of the invited teams. We also allowed ELO rankings to update during each tournament simulation. We believe this will better mirror reality, since we’re using ELO to simulate the entire tournament, as opposed to just single matches.
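As a rough sketch of the core loop, here’s a toy Monte-Carlo over a four-team single-elimination bracket, with ratings updating inside each simulated run just as described above. The seedings and ratings are placeholders, and TI7’s real format adds the group stages and a lower bracket:

import random
from collections import Counter

def elo_prob(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def play_match(ratings, a, b, k=30):
    """Resolve one match probabilistically and update the in-simulation ratings."""
    p_a = elo_prob(ratings[a], ratings[b])
    a_wins = random.random() < p_a
    delta = k * ((1.0 if a_wins else 0.0) - p_a)
    ratings[a] += delta
    ratings[b] -= delta
    return a if a_wins else b

def simulate_bracket(base_ratings, bracket):
    ratings = dict(base_ratings)    # fresh copy, so every simulated TI starts from scratch
    finalists = [play_match(ratings, a, b) for a, b in bracket]
    return play_match(ratings, finalists[0], finalists[1])

base = {"Liquid": 1520, "VP": 1497, "EG": 1430, "OG": 1410}   # placeholder ratings
bracket = [("Liquid", "OG"), ("VP", "EG")]

wins = Counter(simulate_bracket(base, bracket) for _ in range(100_000))
for team, n in wins.most_common():
    print(f"{team}: {n / 100_000:.1%}")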

What Next?

1) Once group stages are announced, we’ll re-run the simulations on the precise groups [A and B], producing the most common whole brackets.

2) Attempt to simulate other compendium predictions, since we’ll soon have a better estimate of how many matches each team/player will play.

The number of games each team plays will greatly affect compendium predictions such as “Team with the most kills in a game” or “Player with the most last hits in a game”. The more games a team plays, the more chances they get to win these predictions.
We also need to analyse how many of those games they should win; it’s unlikely a team will record the highest kills, or the highest kill average, if they are constantly losing.

3) Carefully expand our model’s complexity, especially by looking at regions.

Many thanks for taking the time to read. This is STRATZ’s first implementation of tournament prediction and ranking. Any feedback or suggestions are more than welcome! :D

STRATZ_ThePianoDentist
