How to do cumulative unique count with R and dplyr
Couple of days ago I had to find a way to visualize in R cumulative number of unique customers over period of time. Having data with dates and customer ids ready I felt pretty confident that I will finish the task in seconds and most of the time will be eaten by loading tidyverse package. Damn I could have felt the coffee break LOOMING behind my screen.
Little did I know back then but apparently there is no function yet to do just that one task. I am always a bit shocked or/and sad when I am not able to do a task with purrr dplyr and this seemed like a case.
So checked cumstats…. none.
Maybe TTR?…… none.
…Plyr?
DataTable? Probably, but I want to keep code sort of in one format.
All of a sudden instead LOOMING break I was facing a danger of me LOOPING the data. Which is dreadful.
I searched and then I searched, then I tried some things and then tried some more. At the end of the day (not literally, it did not take me so much time…) I got result which satisfied my needs:
This is the solution with minimal example:
suppressPackageStartupMessages(library(tidyverse))
df <- tibble(
item = c(1,2,3,4,5,6,1,2,3,7,4,8,9),
day = c(1,1,1,1,1,1,2,2,2,2,3,3,3)
)
df
## # A tibble: 13 x 2
## item day
## <dbl> <dbl>
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 1 2
## 8 2 2
## 9 3 2
## 10 7 2
## 11 4 3
## 12 8 3
## 13 9 3
Dataset has date-alike column (day number) and unique item id which we want to accumulate, first we need to group-nest ids over date (I have no clue how well or badly it scales…).
suppressPackageStartupMessages(library(tidyverse))
stage <- df %>%
group_by(day) %>%
summarise(item = list(unique(item)))
stage
## # A tibble: 3 x 2
## day item
## <dbl> <list>
## 1 1 <dbl [6]>
## 2 2 <dbl [4]>
## 3 3 <dbl [3]>
Then map function written below…
cumulative_n_distinct = function(x){
require(purrr)
require(dplyr)
y <- x %>%
accumulate(append) %>%
map(.,n_distinct) %>%
as.numeric
return(y)
}
..to our nested dataset:
stage %>%
mutate(n_uniqe_cum = cumulative_n_distinct(item)) %>%
select(day, n_uniqe_cum)
## # A tibble: 3 x 2
## day n_uniqe_cum
## <dbl> <dbl>
## 1 1 6
## 2 2 7
## 3 3 9
Voila!
Take in mind that this example is quite minimal and my task also wasn’t including very big dataset (around 1000 customers). Still feel free to copy-paste, leave a comment or maybe improve the solution!
Cheers