Detecting outlying statlines in the NBA

November 28, 2017

We all know what happens on social media after - insert name of random/unknown/boring/funny/weird/forgotten player - all of a sudden makes all 8 of his 3-point attempts, scores thousands of points and wins that very important mid-January game for his team. You know, the monthly-Jeff-Green-can-be-a-factor-in-the-playoffs game.

We have seen the Brandon-Jennings-scoring-55-as-a-rookie game. That Kendrick-Perkins-owns-Nuggets-in-the-paint-with-16pts performance (that, for me as a Nuggets fan, was heartbreaking). All those Dwight-Howard-after-years-of-struggle-is-finally-back games. A Jonas Valanciunas pump fake finally fools a defender. Jahlil Okafor enters the game. Twitter is buzzing. Blog posts are being written. Video analyses are conducted. Bruno Caboclo becomes the new Kevin Durant. You can watch every single highlight from all 28 angles on YouTube.

Things are no better on the other side of the basketball world. When a superstar ends a game with just a couple of buckets and a miserable field goal percentage, he may read a lot about his actual health, his relationship with teammates, the minutes he plays, and whether it is time for a panic trade to save the season.

But tomorrow they play the next game and everything goes back to normal. We all forget about the guy and move on to the next big interesting thing.

Those situations are called outliers, and they are an integral part of any repetitive process that can be described in numbers. The NBA regular season consists of 1230 games, and in every one of those games there are 25 players working hard to fill the stat sheets - which sounds like accounting. But this is what we nerds get in the end.

For me personally there is only one thing worse than overreacting to an outlying performance - and that is using SAS.

It’s not true.

But using SAS is still a bad idea. The thing that is really worse than outlier fever is the Small sample size mania at the start of each season. You know, all those guys looking through the stats after the first 2 games and then tweeting:

Ish Smith is leading the league in that advanced stat per game, with some ridiculous, almost made-up number. He is on pace to accumulate gazillions of that stat after 82 games! It is sooo craaazy!

But you know why?

You know exactly why.

HASHTAG SMALL SAMPLE SIZE

So you know that number is not real! Ha!

It probably annoys me more than it should and you don’t have to agree with me. It may even be funny. Who knows. Anyway, let’s go back to the main topic.

We are almost 25% of the way into the 2017/18 regular season and most of the players have already played around 20 games. I decided to go through the data and search for performances that for some reason were out of place.

(This is the part of the blog post where I explain the method I use. I will let you know when I am done.)

So there is a very good tool for identifying outlying observations, and that is the z-score:

\[ Z = \frac{X - mean(X)}{stdev(X)}\]

It is very simple to use and explain. You take a value, subtract the mean of all observations from it, and then divide the result by the standard deviation of all observations. If Z is larger than 3 or smaller than -3, we have an outlier.
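To make that concrete, here is a minimal sketch in base R. The `pts` values are invented for illustration and do not come from any real box score.

```r
# Hypothetical points-per-game log, invented for illustration only
pts <- c(22, 25, 19, 28, 24, 21, 26, 23, 20, 27, 24, 22, 25, 23, 60)

# Classic z-score: distance from the mean in units of standard deviation
z <- (pts - mean(pts)) / sd(pts)

# Flag games that are more than 3 standard deviations away from the mean
outliers <- pts[abs(z) > 3]

print(round(z, 2))
print(outliers)
```

Only the 60-point game crosses the cutoff here. Every other game sits well within one standard deviation of the mean.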

But this time I am going to use a modified version of it, and not just because I want to have a fancy blog. The z-score itself is unfortunately sensitive to outliers. If there is already a very high value in the data, it inflates both the mean and the standard deviation, which in turn makes the z-score smaller for incoming observations.

To avoid that situation I will use the modified z-score, which is based on the median and the median absolute deviation (MAD). Both are much more stable statistics.

Let’s see the King’s stats. LeBron played in 20 games, averaging 28.55 points per game. Measured by z-score there was only one outlying game (Z = 3.1), and that was his 57-point game against the Washington Wizards on 3rd November. Here is how much that one game changes the statistics when it is removed:

The mean drops by 6% while the median stays exactly the same. The standard deviation goes down by 30%, while the median absolute deviation is reduced by just 12%. The conclusion is simple: using the modified z-score will give a much more stable outcome across the whole season.
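The same kind of comparison can be sketched on any sample by dropping the extreme game and checking how far each statistic moves. The numbers below are made up and are not LeBron's actual game log, so the percentages will differ from the ones quoted above. `rel_change` is a hypothetical helper introduced only for this example.

```r
# Hypothetical scoring log, invented for illustration (not real box scores)
pts <- c(22, 25, 19, 28, 24, 21, 26, 23, 20, 27, 24, 22, 25, 23, 60)

# Same log with the single most extreme game removed
without_peak <- pts[pts != max(pts)]

# Relative change of a statistic once the extreme game is dropped
rel_change <- function(f) (f(without_peak) - f(pts)) / f(pts)

changes <- c(
  mean   = rel_change(mean),
  sd     = rel_change(sd),
  median = rel_change(median),
  mad    = rel_change(mad)
)

# Percentage change per statistic
print(round(100 * changes, 1))
```

On this toy sample the mean and standard deviation move much more than the median and MAD, which is the behaviour the modified z-score relies on.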

\[ Z = \frac{X - median(X)}{MAD(X)}\]

I just can’t help writing MAD in uppercase. It is just more fitting that way.

THEORY PART IS FINISHED

So I quickly wrote the modified z-score as an R function (I could not find a package that provides it):

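modified_z <- function(x){
    # mad() is the median absolute deviation (scaled by 1.4826 by default)
    ma <- mad(x)
    me <- median(x)
    # distance from the median, measured in units of MAD
    zscore <- (x-me)/ma
    return(zscore)
}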
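A short usage sketch may help here. It repeats the function so that it runs on its own, and the `pts` and `blk` vectors are invented. Two details are worth checking. First, base R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default. Second, the function divides by zero when more than half of a player's values are identical, which can easily happen for low-count stats such as blocks.

```r
# Copy of the function above so that this snippet is self-contained
modified_z <- function(x) (x - median(x)) / mad(x)

# Hypothetical points log, invented for illustration
pts <- c(22, 25, 19, 28, 24, 21, 26, 23, 20, 27, 24, 22, 25, 23, 60)
mz <- modified_z(pts)
print(round(mz, 2))

# mad() applies a default scaling constant of 1.4826 to the raw MAD
raw_mad <- median(abs(pts - median(pts)))
print(all.equal(mad(pts), 1.4826 * raw_mad))

# Caveat: if more than half of the values are equal, mad() returns 0
# and the division produces NaN (0/0) or Inf (nonzero/0)
blk <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 2)
mz_blk <- modified_z(blk)
print(mz_blk)
```

Because 1/1.4826 is roughly 0.6745, this should agree with the commonly cited form of the modified z-score that multiplies by 0.6745 and divides by the unscaled MAD. How to treat the zero-MAD case (skip the stat, or fall back to another scale) is a design decision that this sketch leaves open.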

Then I queried all the box scores with traditional stats that can be found on NBA.com. If you are an R user, you can easily download that data yourself thanks to my NBAr package. If you are a Python user, there is a small chance I will rewrite those functions for Python as well. If you prefer Stata or SPSS, don’t ask for .csv files and start using a proper tool; it’s almost 2018. And if you prefer SAS for some reason, you probably don’t care about fun things like basketball.

Below you will find the code for finding outliers. Keep in mind that I have just started my transition to the tidyverse way of coding, and I am pretty sure some parts could have been written better.

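library(tidyverse)  # provides %>%, the dplyr verbs and tidyr::drop_na()

# threshold for the absolute modified z-score
p <- 5

trad <- traditional %>%
            # game IDs from 21700000 up, and only rows with an empty COMMENT
            filter(GAME_ID >= 21700000 & COMMENT == '') %>%
            # join the schedule table by game
            left_join(schedule, by = 'GAME_ID') %>%
            # the opponent is whichever team is not the player's own
            mutate(VERSUS = ifelse(TEAM_ABBREVIATION == HOME, VISITOR, HOME)) %>%
            select(PLAYER_NAME, GAME_DATE, VERSUS, PTS, AST, REB, BLK, STL, TO,
                   FGA, FTA, FG3A, MINS, FGM, FG3M, FTM) %>%
            group_by(PLAYER_NAME) %>%
            # number each player's games and keep players with at least 10
            mutate(RN = row_number()) %>%
            filter(max(RN) >= 10) %>%
            # modified z-score of every stat, computed within each player
            mutate(mPTS = modified_z(PTS), mAST = modified_z(AST),
                   mREB = modified_z(REB), mBLK = modified_z(BLK),
                   mSTL = modified_z(STL), mTO = modified_z(TO),
                   mFGA = modified_z(FGA), mFTA = modified_z(FTA),
                   mFG3A = modified_z(FG3A), mMINS = modified_z(MINS)
                   ) %>%
            # keep games where at least one stat reaches the threshold
            filter(abs(mPTS) >= p | abs(mAST) >= p |
                   abs(mREB) >= p | abs(mBLK) >= p |
                   abs(mSTL) >= p | abs(mTO) >= p |
                   abs(mFGA) >= p | abs(mFTA) >= p |
                   abs(mFG3A) >= p | abs(mMINS) >= p) %>%
            drop_na()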
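For readers on a recent dplyr, the repeated `mutate()` and `filter()` arguments could probably be collapsed with `across()` and `if_any()`. This is only a sketch of that alternative, assuming dplyr 1.0.4 or newer. It is not the code used for the results in this post, and the `toy` data frame and its values are invented.

```r
library(dplyr)

# Same modified z-score as in the post, repeated so the snippet is self-contained
modified_z <- function(x) (x - median(x)) / mad(x)

# Toy box scores, invented for illustration; the real table has many more columns
toy <- data.frame(
  PLAYER_NAME = rep(c("Player A", "Player B"), each = 10),
  PTS = c(10, 12, 9, 11, 13, 10, 12, 11, 9, 40,
          20, 22, 19, 21, 23, 20, 22, 21, 19, 24),
  AST = c(3, 4, 2, 5, 3, 4, 2, 3, 5, 4,
          7, 8, 6, 9, 7, 8, 6, 7, 9, 8),
  stringsAsFactors = FALSE
)

stats <- c("PTS", "AST")        # columns to score
z_cols <- paste0("m", stats)    # names of the derived z-score columns
p <- 5                          # threshold for the absolute modified z-score

flagged <- toy %>%
  group_by(PLAYER_NAME) %>%
  # keep players with at least 10 games
  filter(n() >= 10) %>%
  # one modified z-score column per stat, named mPTS, mAST, ...
  mutate(across(all_of(stats), modified_z, .names = "m{.col}")) %>%
  # keep rows where any z-score column reaches the threshold
  filter(if_any(all_of(z_cols), ~ abs(.x) >= p)) %>%
  ungroup()

print(flagged)
```

Selecting the z-score columns by explicit name avoids helpers such as `starts_with("m")`, which ignore case by default and would also pick up a column like `MINS`.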

I picked some basic basketball stats for that. Simple things that have been filling stat sheets for years: points, rebounds, assists, steals, turnovers, blocks, minutes, 3-point attempts, field goal attempts and free throw attempts. That’s it. Let me know if there is a particularly interesting advanced stat that should be taken into consideration.

I also added a filter to analyse only players who appeared in at least 10 games. The threshold for the modified z-score could stay at 3, which gives us 151 outliers (2.3% of all observations). But I would rather pick the most extreme ones to avoid excessive dullness, so I set the value to 5, which resulted in the 18 most outlying player performances so far in the 2017/18 season.
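The effect of the threshold can be explored with a small experiment. The sketch below uses simulated values with a few hand-placed extremes, not NBA data, so the counts will not match the 151 and 18 reported above.

```r
# Same modified z-score as in the post, repeated so the snippet is self-contained
modified_z <- function(x) (x - median(x)) / mad(x)

set.seed(42)
# Simulated stat line: ordinary games plus three hand-placed extremes (hypothetical)
x <- c(rnorm(200, mean = 15, sd = 4), 45, 50, -10)
mz <- modified_z(x)

# Count how many observations are flagged at each threshold
cutoffs <- c(3, 5)
flagged <- sapply(cutoffs, function(p) sum(abs(mz) >= p))
share <- flagged / length(x)

print(data.frame(cutoff = cutoffs, flagged = flagged, share = round(share, 3)))
```

A higher threshold can only flag the same number of observations or fewer, so moving from 3 to 5 trades coverage for a shorter list of more extreme games.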

Biggest so far:

22 points from Raul Neto against the Brooklyn Nets (Z=12)

Raul Neto

Classic Lance Stephenson start of the season (Z=8.7)

Lance

The Lonzo Ball one! (Z=5.4)

Lonzo

An actual low point (Z=-6.4)

PG

Aaron Gordon during Orlando’s hot start (Z=5.6)

AG

15 rebounds for a guard is an achievement (Z=7)

Oladipo

And that is it! I hope you find it interesting. If you have any comments, you can leave them below or talk to me on Twitter. Feel free to share the article if you like it!

btw. code for plotting:

plot_player <- function(row, ds, p){
  require(ggplot2)
  require(ggthemes)
  require(tidyverse)

  # reshape the outlier row to long format; keep the identifying fields and
  # the modified z-scores whose absolute value reaches the threshold p
  ro <- row %>%
    gather() %>%
    filter(key %in% c('PLAYER_NAME','GAME_DATE','VERSUS') |
             (grepl('^m',key) & abs(as.numeric(value)) >= p)) %>%
    mutate(kpi = sub('^m','',key))

  player <- unique(ro$value[ro$key == 'PLAYER_NAME'])
  gd <- unique(ro$value[ro$key == 'GAME_DATE'])
  vs <- unique(ro$value[ro$key == 'VERSUS'])
  # names of the stats that were flagged in this game
  kpis <- ro$kpi[!ro$kpi %in% c('PLAYER_NAME','GAME_DATE','VERSUS')]

  # raw values of the flagged stats in the outlying game
  sv <- row[,kpis] %>% gather(kpi,spv)

  # all games of this player, one boxplot per flagged stat,
  # with the outlying game drawn as a larger, differently coloured point
  plt <- ds %>%
    filter(GAME_ID >= 21700000 & COMMENT == '' & PLAYER_NAME == player) %>%
    select(PLAYER_NAME, GAME_DATE, all_of(kpis)) %>%
    gather('kpi', 'value', all_of(kpis)) %>%
    left_join(sv, by = 'kpi') %>%
    mutate(clr = ifelse(GAME_DATE == gd, 'red','blue')) %>%
    mutate(sz = ifelse(GAME_DATE == gd, '5','3')) %>%
    ggplot(aes(kpi, value)) +
    geom_boxplot(outlier.shape = NA) +
    geom_jitter(aes(fill=clr, size = sz), shape = 21, width = 0.1, height = 0.1) +
    theme_wsj() +
    theme(legend.position="none") +
    ggtitle(paste(player, ' vs ', vs,'\n',gd,sep=''))
  print(plt)
}

Luckily this is mostly tidyverse code, but then I used a for loop to plot all the examples, keeping this project as inconsistent as it can get.