League of Legends Pro Game Analysis

Published:

Authors: Samuel Lee, Nian-Nian Wang

Introduction

The data set we are working with is professional League of Legends match data from 2022. The main question we explore throughout this EDA is what makes a team win a match. We look at two aspects: the team’s side (blue or red) and the champions the team picked and banned. One of our original motivations is the widely known rumor that the blue side wins more often than the red side, which is why teams tend to choose the blue side when they have priority. We want to check whether the data supports this rumor. We are also curious how much the champions that are picked or banned affect the outcome of a match.

Our data set comes from the website Oracle’s Elixir. The original data consists of 148992 rows. Every 12 rows describe a single match: 10 rows hold player data and 2 rows hold team data. In this project, we focus on Tier 1 leagues and split the data into two sets: player data and team data.

The main columns we work with are gameid, url, league, split, playoffs, patch, side, ban1, ban2, ban3, ban4, ban5, pick1, pick2, pick3, pick4, pick5, gamelength, result. Of these, side, ban1, ban2, ban3, ban4, ban5, pick1, pick2, pick3, pick4, pick5, and result are the ones we mainly use to explore what helps a team win. Most of the others are used for early exploratory data analysis and data cleaning, to better understand the data set.

The relevant columns are described below:

  • gameid: unique game identification number
  • url: URL to the game data, or NaN if unavailable
  • league: the region/league
  • split: the split/season the game was in
  • playoffs: 0 if the game was in regular season; 1 if the game was in playoffs
  • patch: version of game
  • side: the team’s side, either red or blue
  • ban1, ban2, ban3, ban4, ban5: the 5 champions banned by the opposing team, meaning the champions that cannot be used
  • pick1, pick2, pick3, pick4, pick5: the 5 champions being picked by the team
  • gamelength: time the game took in seconds
  • result: 1 if the team won; 0 if the team lost

Data Cleaning and Exploratory Data Analysis

Data Cleaning

For data cleaning, we first separate the original dataset into two tables, tier1_player and tier1_team. Within every block of 12 rows, the first 10 rows (10 players per match) go to tier1_player and the last 2 rows (2 teams per match) go to tier1_team. We do this because the rows in the original dataset fall into two categories, players and teams; if we kept them together, many values would be missing by design.
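A minimal sketch of this split, assuming the rows of each match are stored contiguously as 10 player rows followed by 2 team rows. The toy frame, the names raw, player_rows, and team_rows, and the row order within a match are illustrative assumptions, not the project’s actual code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data: 2 matches x 12 rows each (made-up values).
raw = pd.DataFrame({
    "gameid": np.repeat(["game_a", "game_b"], 12),
    "side": np.tile(["Blue"] * 5 + ["Red"] * 5 + ["Blue", "Red"], 2),
})

# Position of each row inside its 12-row match block (0..11).
row_in_match = np.arange(len(raw)) % 12

# Assumption: the first 10 rows of a block are players, the last 2 are teams.
player_rows = raw[row_in_match < 10].reset_index(drop=True)
team_rows = raw[row_in_match >= 10].reset_index(drop=True)

print(len(player_rows), len(team_rows))
```

A positional split like this only works if no match has missing or reordered rows, so it is worth checking that the row count is a multiple of 12 before relying on it.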

A snippet of tier1_player:

gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | ban3 | ban4 | ban5 | pick1 | pick2 | pick3 | pick4 | pick5 | gamelength | result
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Caitlyn | Jayce | Camille | NaN | NaN | NaN | NaN | NaN | 1365 | 1
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Caitlyn | Jayce | Camille | NaN | NaN | NaN | NaN | NaN | 1365 | 1
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Caitlyn | Jayce | Camille | NaN | NaN | NaN | NaN | NaN | 1365 | 1
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Caitlyn | Jayce | Camille | NaN | NaN | NaN | NaN | NaN | 1365 | 1
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Caitlyn | Jayce | Camille | NaN | NaN | NaN | NaN | NaN | 1365 | 1

A snippet of tier1_team:

gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | ban3 | ban4 | ban5 | pick1 | pick2 | pick3 | pick4 | pick5
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Caitlyn | Jayce | Camille | Jinx | Jarvan IV | Nautilus | Syndra | Gwen
8401-8401_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 09:24:26 | 1 | 12.01 | Akali | LeBlanc | Rumble | Xin Zhao | Thresh | Aphelios | Vex | Jax
8401-8401_game_2 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 10:09:22 | 2 | 12.01 | Thresh | Jayce | Camille | Jinx | Xin Zhao | Rakan | Rumble | Corki
8401-8401_game_2 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8401 | LPL | 2022 | Spring | 0 | 2022-01-10 10:09:22 | 2 | 12.01 | Jarvan IV | LeBlanc | Akali | Lee Sin | Leona | Ziggs | Gangplank | Twisted Fate
8402-8402_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=8402 | LPL | 2022 | Spring | 0 | 2022-01-10 11:26:11 | 1 | 12.01 | Aphelios | Nautilus | Leona | Jinx | Viego | Thresh | Corki | Graves

Univariate Analysis

Number of Games Played in Each Patch

We can see that no games were played in tier 1 leagues on patch 12.07, and only a few games were played on 12.06, 12.08, and 12.16. This is because most league spring playoffs took place on 12.05 and 12.06, so no games were played on 12.07. The Mid-Season Invitational (MSI) was held during 12.08, so there were few tier 1 league games on that patch, as we do not include international competitions. The same reasoning applies to 12.16: most league summer playoffs took place on 12.15 and 12.16, and Worlds then took place on 12.18.

Distribution of Game Length

In the following plot, we can see that the distribution of game length is skewed to the right, with a median of 31.39 minutes. We converted game length from seconds to minutes for easier interpretation.
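As a small illustration of the unit conversion, the sketch below converts a gamelength column from seconds to minutes and compares the mean with the median. The values are made up for the example and are not taken from the real data.

```python
import pandas as pd

# Hypothetical game lengths in seconds (not from the real data set).
gamelength = pd.Series([1365, 1800, 1950, 2100, 3000], name="gamelength")

# Convert seconds to minutes for easier reading.
minutes = gamelength / 60

print(minutes.median())
# A mean above the median is one simple hint of right skew.
print(minutes.mean() > minutes.median())
```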

Bivariate Analysis

Average Game Length in Each Tier 1 League

From the following boxplot and barplot, we can see that there is no significant difference in game length between leagues. One interesting observation is that LCK, known for strategic play with fewer teamfights, has the longest average game length, while VCS, known for its bloody and frequent teamfights, has the shortest.

Blue / Red Team Win Rate in Each Tier 1 League

From the table and grouped barplot below, we can see that in most leagues (all except LCS, LLA, and VCS), the blue side has a higher win rate than the red side. The gap is largest in PCS, where the blue side win rate exceeds the red side win rate by about 0.21, a difference large enough to affect the result of a game.

league | Blue     | Red
CBLOL  | 0.539095 | 0.460905
LCK    | 0.507495 | 0.492505
LCO    | 0.514151 | 0.485849
LCS    | 0.493464 | 0.506536
LEC    | 0.534979 | 0.465021
LJL    | 0.514019 | 0.485981
LLA    | 0.491979 | 0.508021
LPL    | 0.549618 | 0.450382
PCS    | 0.605166 | 0.394834
VCS    | 0.495356 | 0.504644
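A table like this one could be produced with a pivot table over the team-level rows. The sketch below uses a made-up toy frame and assumes one row per team per game, with the league, side, and result columns described earlier.

```python
import pandas as pd

# Toy team-level rows (made-up values): one row per team per game.
team = pd.DataFrame({
    "league": ["LCK", "LCK", "LCK", "LCK", "LPL", "LPL"],
    "side":   ["Blue", "Red", "Blue", "Red", "Blue", "Red"],
    "result": [1, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 result column is the win rate for that group.
win_rate = team.pivot_table(
    index="league", columns="side", values="result", aggfunc="mean"
)
print(win_rate)
```

Because every game has exactly one winner, the Blue and Red win rates within a league add up to 1.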

Interesting Aggregates

Number of Games Played in Each Tier 1 League in 2022

From the plot below, we can see that LPL played the most games in 2022. This is because it had 17 teams playing a single round robin of BO3 (best of three) series in the regular season. (Source) LCK played the second most, with 10 teams playing a double round robin of BO3 series in the regular season. (Source)

Most Picked Champions in Each Position

We first create a pivot table that counts how many times each champion was picked in each role.

championTopJungleMidBotSupport
Aatrox2240100
Ahri0088600
Akali144020300
Akshan210200
Alistar0000270
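One way to build such a count table is a cross-tabulation over the player-level rows. This is only a sketch: the champion and position column names and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy player-level rows; `champion` and `position` are assumed column names.
picks = pd.DataFrame({
    "champion": ["Aatrox", "Aatrox", "Ahri", "Akali", "Akali", "Alistar"],
    "position": ["Top", "Top", "Mid", "Top", "Mid", "Support"],
})

# Count how often each champion appears in each position.
pick_counts = pd.crosstab(picks["champion"], picks["position"])

# Most picked champions for one position, largest count first.
top_picks = pick_counts["Top"].sort_values(ascending=False).head(10)
print(top_picks)
```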

From the plots below, we can see the 10 most picked champions in each position. A long list of picked champions carries little meaning on its own, but the pivot table condenses it and shows which champions are popular in each position, and potentially which ones perform better.

Assessment of Missingness

NMAR Analysis

NMAR (not missing at random) occurs when the probability of a value being missing depends on unobserved information. Focusing on the tier 1 data, we notice that the url column is missing for some rows.

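Rows with a non-missing url, by league:

league | Count
LPL    | 1572
LEC    | 84
LCS    | 4
LCO    | 4
LCK    | 2
PCS    | 2

All rows, by league:

league | Count
LPL    | 1572
LCK    | 934
VCS    | 646
LCS    | 612
PCS    | 542
LEC    | 486
CBLOL  | 486
LJL    | 428
LCO    | 424
LLA    | 374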
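Counts of this kind can be obtained by comparing the number of non-missing url values with the number of rows per league. The sketch below uses a made-up toy frame with placeholder URLs; the summary column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): one league always has a url, one never does.
df = pd.DataFrame({
    "league": ["LPL", "LPL", "LCK", "LCK", "LEC", "LEC"],
    "url": ["https://example.com/1", "https://example.com/2",
            np.nan, np.nan, "https://example.com/3", np.nan],
})

with_url = df.groupby("league")["url"].count()  # count() skips NaN
total = df.groupby("league").size()             # size() counts every row

summary = pd.DataFrame({"with_url": with_url, "total": total})
summary["missing_share"] = 1 - summary["with_url"] / summary["total"]
print(summary)
```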

From the tables above, we see that LPL is not missing any URLs, while the other leagues have either no URLs at all or only a few. Since LPL consistently provides the URL while the other leagues vary, the missingness appears to be tied to the league itself. We consider this NMAR because whether a URL linking to match information exists depends on how each league reports its matches, and the missing values cannot be recovered from the other columns.

This may indicate differences in how leagues report match information. It could also reflect differences in resources, priorities, or organizational policies among the leagues.

Missingness Dependency

We carry out permutation tests to check whether the missingness of the ban1 to ban5 columns depends on the side column. We use the total variation distance (TVD) as the test statistic and a p-value cutoff of 0.05.

From the graph above, we can see that the p-values of the permutation tests for all 5 ban columns are much larger than 0.05, so we fail to reject the null hypothesis: we find no evidence that the missingness in the ban columns depends on the team’s side.
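The missingness test can be sketched as follows. This is a minimal version under our own assumptions, not the exact code behind the results: the helper names tvd_by_missingness and missingness_perm_test are hypothetical, and the toy data is built so that missing bans are split evenly between the two sides.

```python
import numpy as np
import pandas as pd

def tvd_by_missingness(is_missing, category):
    """TVD between the category distributions of missing and non-missing rows."""
    dist = pd.crosstab(category, is_missing, normalize="columns")
    return (dist.iloc[:, 0] - dist.iloc[:, 1]).abs().sum() / 2

def missingness_perm_test(df, missing_col, cat_col, n_perm=200, seed=0):
    """Shuffle cat_col and compare simulated TVDs with the observed TVD."""
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    category = df[cat_col].to_numpy()
    observed = tvd_by_missingness(is_missing, category)
    sims = np.array([
        tvd_by_missingness(is_missing, rng.permutation(category))
        for _ in range(n_perm)
    ])
    # p-value: share of shuffles with a TVD at least as large as observed.
    return observed, float(np.mean(sims >= observed))

# Toy data: 20 missing bans, 10 on each side (balanced by construction).
ban1 = ["Jinx"] * 200
for i in range(20):
    ban1[i] = None
toy = pd.DataFrame({"side": ["Blue", "Red"] * 100, "ban1": ban1})

observed, p_value = missingness_perm_test(toy, "ban1", "side")
print(observed, p_value)
```

In this toy case the missing and non-missing rows have identical side distributions, so the observed TVD is 0 and every shuffle is at least as extreme, which gives a p-value of 1.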

However, is the missingness of the ban1 column related to the missingness of ban2, ban3, ban4, and/or ban5? We can perform another permutation test to answer this question.

From the above plots, it is clear that the missingness of ban1 is related to the missingness of the other ban columns.

Hypothesis Testing

  • Null hypothesis: The win rates of teams on the Blue and Red sides are equal.
  • Alternative hypothesis: The win rate of teams on the Blue side is higher than the win rate of teams on the Red side.
  • Test statistic: The difference in the proportion of games won between the Blue and Red sides.
  • Significance level (p-value cutoff): 0.05
  • We perform a permutation test to test the hypothesis.

The observed test statistic is 0.0553 and the p-value is 0.0. Since the p-value is less than the significance level (0.05), we reject the null hypothesis. Our test suggests that the win rate of teams on the Blue side is higher than that of teams on the Red side.
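A one-sided permutation test of this kind can be sketched as below. This simple version shuffles the side labels across all team rows and uses made-up toy data; the function names are hypothetical and the numbers are unrelated to the results reported above.

```python
import numpy as np

def win_rate_gap(side, result):
    """Blue win rate minus Red win rate."""
    return result[side == "Blue"].mean() - result[side == "Red"].mean()

def side_perm_test(side, result, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    side = np.asarray(side)
    result = np.asarray(result, dtype=float)
    observed = win_rate_gap(side, result)
    sims = np.array([
        win_rate_gap(rng.permutation(side), result) for _ in range(n_perm)
    ])
    # One-sided p-value: shuffles with a gap at least as large as observed.
    return observed, float(np.mean(sims >= observed))

# Toy data (made up): Blue wins 60 of 100 games, Red wins 40 of 100.
side = np.array(["Blue"] * 100 + ["Red"] * 100)
result = np.array([1] * 60 + [0] * 40 + [1] * 40 + [0] * 60)

observed, p_value = side_perm_test(side, result)
print(observed, p_value)
```

A p-value reported as 0.0 from a test like this means that none of the simulated gaps reached the observed one; with a finite number of shuffles, the true p-value is small but not exactly zero.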

Framing a Prediction Problem

Our prediction problem is to predict the result of a game at the end of the pick/ban phase, so it is a binary classification problem. We use a DecisionTreeClassifier to predict 0 or 1 for the result column. We use precision as our metric because false positives matter most to us: we do not want to mislead a team into a strategy that will not make it win.

Baseline Model

In the baseline model, we use the pick1 to pick5 columns as features. They are all nominal variables holding champion names as strings, so the baseline model has 5 nominal features. We use a customized one-hot encoder to handle champions that do not appear in the training data but may appear in the test data. We use DecisionTreeClassifier() with a max depth of 5 to predict.
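The sketch below shows one way to set up a comparable baseline. It is not our customized encoder: it swaps in scikit-learn’s built-in OneHotEncoder with handle_unknown="ignore", which encodes champions unseen during training as all zeros. The result values in the toy training frame are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

pick_cols = ["pick1", "pick2", "pick3", "pick4", "pick5"]

# Toy training rows; the result column is invented for this example.
train = pd.DataFrame(
    [["Jinx", "Jarvan IV", "Nautilus", "Syndra", "Gwen", 1],
     ["Xin Zhao", "Thresh", "Aphelios", "Vex", "Jax", 0],
     ["Jinx", "Xin Zhao", "Rakan", "Rumble", "Corki", 1],
     ["Lee Sin", "Leona", "Ziggs", "Gangplank", "Twisted Fate", 0]],
    columns=pick_cols + ["result"],
)

# A draft containing champions (Viego, Graves) never seen in training.
new_draft = pd.DataFrame(
    [["Jinx", "Viego", "Thresh", "Corki", "Graves"]], columns=pick_cols
)

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen champions -> all zeros
    DecisionTreeClassifier(max_depth=5, random_state=0),
)
model.fit(train[pick_cols], train["result"])

train_precision = precision_score(train["result"], model.predict(train[pick_cols]))
pred = model.predict(new_draft)
print(train_precision, pred)
```

With only four training rows the tree memorizes the data, so the training precision here says nothing about real performance; the point is only that prediction does not fail on unseen champions.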

Our train score is 0.5244 and our test score is 0.4771. Both are very low, and the test score is even lower than random chance (p = 0.50). Therefore, we need to add new features and change our feature engineering process.

Final Model

For this part, we add the side column as a new feature for predicting the result of the game, because we think side selection can affect the result. As we demonstrated earlier, the blue side has a higher win rate than the red side.

For feature engineering, we add the pick rate of each champion, because it can provide more information about the strength of individual champions. We therefore build another customized one-hot transformer that one-hot encodes the champions and also includes their pick rates. As in the baseline model, we use DecisionTreeClassifier().
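As a rough illustration of a pick-rate feature, the sketch below assumes that a champion’s pick rate is the share of training team rows in which it was picked, and that champions unseen in training get a rate of 0. Both the definition and the helper name add_pick_rates are assumptions for this example, not necessarily what our transformer does.

```python
import pandas as pd

pick_cols = ["pick1", "pick2", "pick3", "pick4", "pick5"]

# Toy training drafts (one row per team per game).
train = pd.DataFrame(
    [["Jinx", "Jarvan IV", "Nautilus", "Syndra", "Gwen"],
     ["Xin Zhao", "Thresh", "Aphelios", "Vex", "Jax"],
     ["Jinx", "Xin Zhao", "Rakan", "Rumble", "Corki"],
     ["Lee Sin", "Leona", "Ziggs", "Gangplank", "Twisted Fate"]],
    columns=pick_cols,
)

# Assumed definition: picks of a champion divided by the number of drafts.
all_picks = pd.Series(train[pick_cols].to_numpy().ravel())
pick_rate = all_picks.value_counts() / len(train)

def add_pick_rates(df):
    """Add one <pick>_rate column per pick column; unseen champions get 0."""
    out = df.copy()
    for col in pick_cols:
        out[col + "_rate"] = df[col].map(pick_rate).fillna(0.0)
    return out

new_draft = pd.DataFrame(
    [["Jinx", "Viego", "Thresh", "Corki", "Graves"]], columns=pick_cols
)
features = add_pick_rates(new_draft)
print(features.filter(like="_rate"))
```

Computing the rates from the training rows only keeps information from the test set out of the features.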

The train and test scores of our new model are 0.5564 and 0.5159, a small improvement. We then search for the best hyperparameters for our model. After trying different combinations, the best hyperparameters are a max_depth of 4 and a min_samples_split of 8, which give a test score of 0.5545.
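A hyperparameter search over max_depth and min_samples_split can be sketched with scikit-learn’s GridSearchCV, scored by precision. The candidate values, the number of folds, and the synthetic data below are illustrative assumptions and not the grid we actually searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 0/1 features and labels, only to make the example runnable.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = rng.integers(0, 2, size=200)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        "max_depth": [2, 4, 6, 8],            # illustrative candidates
        "min_samples_split": [2, 4, 8, 16],   # illustrative candidates
    },
    scoring="precision",  # same metric as the rest of the analysis
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The best cross-validated score is computed on held-out folds of the training data, so it should still be confirmed on a separate test set.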

The final model’s test score is about 0.08 higher than our baseline model’s (0.5545 versus 0.4771). Although the score is still not very high, we believe this is reasonable, because the players and teams matter more than the champions they pick and the side they choose.

Fairness Analysis

For the fairness analysis, we want to check whether the model performs equally well for the blue and red sides. Since our hypothesis test suggested that the blue side wins more often, we want to compare the model’s performance between these two groups.

[Figure: confusion matrix of the final model]

Because teams may treat champion selection as an important factor in winning a match, the cost of a false positive is high. It is worse than a false negative, where we predict a team to lose when it actually won. The grid above also shows that out of 183 times, 76 teams actually lost when our model predicted them to win. Therefore, we decide to evaluate our model’s performance on precision, which is 0.5815.

We run a hypothesis test to check if the difference of our model performance between blue and red sides is statistically significant.

  • Null Hypothesis: The precision of the model for teams on the blue side and on the red side is roughly the same, and any differences are due to random chance.
  • Alternative Hypothesis: The precision of the model for teams on the blue side is greater than its precision for teams on the red side.
  • Cutoff: 0.05
  • We perform a permutation test to test our hypothesis.

After performing the permutation test, our p-value is 0.034, which is smaller than our cutoff. Therefore, we reject the null hypothesis, which suggests that the model has a higher precision for the blue side than for the red side.
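The fairness test can be sketched in the same way as the earlier permutation tests, with the precision gap between the two sides as the test statistic. The helper names and the toy predictions below are made up for illustration and are unrelated to the p-value reported above.

```python
import numpy as np

def precision(y_true, y_pred):
    """Share of predicted wins that were actual wins (0.0 if none predicted)."""
    predicted_win = y_pred == 1
    if not predicted_win.any():
        return 0.0
    return float(y_true[predicted_win].mean())

def precision_gap(y_true, y_pred, side):
    """Precision on Blue rows minus precision on Red rows."""
    blue = side == "Blue"
    return precision(y_true[blue], y_pred[blue]) - precision(y_true[~blue], y_pred[~blue])

def fairness_perm_test(y_true, y_pred, side, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = precision_gap(y_true, y_pred, side)
    sims = np.array([
        precision_gap(y_true, y_pred, rng.permutation(side))
        for _ in range(n_perm)
    ])
    # One-sided p-value: shuffles with a gap at least as large as observed.
    return observed, float(np.mean(sims >= observed))

# Toy data (made up): the model predicts a win for every row.
side = np.array(["Blue"] * 40 + ["Red"] * 40)
y_pred = np.ones(80, dtype=int)
y_true = np.array([1] * 30 + [0] * 10 + [1] * 20 + [0] * 20)

observed, p_value = fairness_perm_test(y_true, y_pred, side)
print(observed, p_value)
```

Shuffling the side labels while keeping each row’s prediction and true result together simulates a world in which the model’s precision does not depend on side.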