Putting up with Julia #1 - reading JSON from a web API
Introduction
This blog post is definitely not about love, I can assure you of that.
I have to admit to myself that after being a fanatical R user for more than 3 years, it’s time to learn something new and add cool stuff to my little green toolbox. Any time I needed to do stuff with data outside of databases, I launched RStudio and was ready to go in no time with the well-known and comfortable tidy libraries, pipelines and Rprojects.
The most appalling tasks became easy and almost enjoyable to conquer, with tons of documentation piling up on the internet and the army of the R community ready to answer any question on Stack. CRAN’s libraries contain more than I think they do. Wanna get a boxplot? ggplot. Filter data? dplyr. Impute NAs? We got you. Connect to Keras? Do speech recognition? Write music? Plot chess boards? We have it all. Plot-xkcd-like-charts-all.
Of course I am not abandoning R. “Patrick we need that for today” - no worries, let me just spend 20 minutes searching for how to convert Ints to Strings, the report will be ready before lunch. But it’s cool to be fluent in an alternative tool.
Why Julia then? Well, I already know bits of Python and - it’s just an unpopular opinion - I find those bits annoying. Pandas feels like base R. No matter what I want to do, I always have the wrong version. Dedicated Python IDEs are all wrong - PyCharm is intimidatingly complicated and Spyder never works properly. Please don’t get me started on Jupyter. I also suppose there wouldn’t be much improvement in speed.
Scala would be an ok idea, but it’s a bit too serious for now. Julia seems like a good choice. Last year the Julia developers released version 1.0, promising more stability and readiness for production use in the foreseeable future. Its code is nice to look at. It’s fast. It has an interesting name that makes no sense and a disappointing lack of backstory about it.
This intro is already too long, so let’s move on. It is a bigger challenge to start learning a tool that is not that popular yet. There are a few scattered blog posts here and there, some Stack questions, bits of documentation, and that’s it. I wasn’t able to find many use cases (or I just searched the wrong way). Anyway, I decided to start working on a close cousin of NBAr, the juNBA package.
The idea is simple - learn Julia by figuring out how to rebuild my own R package in it, and describe the code I write to share it with other poor souls looking for tips on the internet.
Read JSON file from web source
I want to download play-by-play data from stats.nba.com. After some research I found out I need three libraries: JSON, HTTP and DataFrames. Below you will find the code with comments. I tried to comment the parts I got stuck on along the way (I started with no Julia knowledge whatsoever, so there are plenty of comments).
using JSON
using HTTP
using DataFrames
### glue link:
game_id = 21800100
season_id = parse(Int,"20"*SubString("$(game_id)",2:3))
pbp_link = "https://data.nba.com/data/10s/v2015/json/mobile_teams/nba/$(season_id)/scores/pbp/00$(game_id)_full_pbp.json"
### Sending HTTP request to read data:
resp = HTTP.get(pbp_link)
### Convert response body to string
str = String(resp.body)
### String can be parsed to a proper
### JSON object (its Julia type is a dictionary)
jobj = JSON.parse(str)
### I used function 'get' to obtain values
### from dicts (not a big fan of indexes)
game_info = get(jobj, "g", 0)
game_code = get(game_info,"gcode",0)
### 'convert' function is unable to turn
### Strings into Ints, so 'parse' has to be used instead
game_date_id = parse(Int,split(game_code,"/")[1])
### just some data manipulation
visit_team = split(game_code,"/")[2][1:3]
home_team = split(game_code,"/")[2][4:6]
periods_list = get(game_info, "pd",0)
number_of_periods = length(periods_list)
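The string handling above can be tried out without hitting the API. Below is a minimal offline sketch using only Base Julia. The helper names (`season_from_game_id`, `pbp_url`, `split_game_code`) and the sample game code `"20181101/AAABBB"` are made up for illustration; only the URL pattern and the "date/teams" layout come from the code above.

```julia
# Hypothetical helpers, not part of juNBA.

# Characters 2-3 of the game id hold the two-digit season year.
season_from_game_id(game_id::Integer) = parse(Int, "20" * string(game_id)[2:3])

# Build the link piece by piece, so no line break can sneak into the URL.
pbp_url(game_id::Integer) = string(
    "https://data.nba.com/data/10s/v2015/json/mobile_teams/nba/",
    season_from_game_id(game_id),
    "/scores/pbp/00", game_id, "_full_pbp.json")

# Split a game code of the form "yyyymmdd/VISHOM" into its parts.
function split_game_code(game_code::AbstractString)
    date_part, teams = split(game_code, "/")
    return (game_date_id = parse(Int, date_part),
            visit_team = teams[1:3],
            home_team = teams[4:6])
end

# Made-up sample game code, only to show the shape of the result.
info = split_game_code("20181101/AAABBB")
```

Returning a named tuple keeps the three values together, so they can be read as `info.home_team` and so on.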
So far so good. I glued the link, read the file, and did some minor tweaks. Now the problem is that I have nested dicts (one game has a dict for each period, and each period has around 100 dicts - one for each row) and I want to convert them into one data frame. I was searching for something simple but hadn’t seen anything useful, so I wrote a small function and mapped it across all periods.
function parse_game_period(periods_list, p_number)
    period_number = get(periods_list[p_number],"p",0)
    plays = get(periods_list[p_number],"pla",0)
    ## here I use 'map' to convert each row to DataFrame type
    ## (the DataFrame constructor turns a dict of scalars into a one-row data frame).
    ## 'vcat' binds those dataframes into one.
    ## I want to believe that there is a better way to convert dicts into dataframes
    period_df = vcat(map(play_row -> DataFrame(play_row), plays)...)
    ## adding new column ('.=' fills every row with the same value)
    period_df[!, :period_number] .= period_number
    return period_df
end
### here I map the above function to all game periods
pbp_full = vcat(map(period_number ->
        parse_game_period(periods_list, period_number),
    1:number_of_periods)...)
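To see the flattening step in isolation, here is a minimal sketch on made-up data that mimics the nested layout (a list of periods, each with a `p` number and a `pla` list of plays). The sample values and the helper name `flatten_period` are assumptions for illustration, not part of juNBA. It uses `reduce(vcat, ...; cols = :union)` instead of splatting, which should also cope with rows that do not share exactly the same keys.

```julia
using DataFrames

# Made-up sample: two periods, three plays in total.
sample_periods = [
    Dict("p" => 1, "pla" => [Dict("cl" => "12:00", "de" => "Start Period"),
                             Dict("cl" => "11:42", "de" => "Jump Ball")]),
    Dict("p" => 2, "pla" => [Dict("cl" => "12:00", "de" => "Start Period")]),
]

# Hypothetical helper: one period's plays -> one data frame.
function flatten_period(period)
    # each play dict becomes a one-row data frame
    rows = [DataFrame(play) for play in period["pla"]]
    # stack the rows; cols = :union fills absent columns with missing
    df = reduce(vcat, rows; cols = :union)
    # tag every row with its period number
    df[!, :period_number] .= period["p"]
    return df
end

sample_pbp = reduce(vcat, [flatten_period(p) for p in sample_periods]; cols = :union)
```

The result has one row per play and a `period_number` column telling the periods apart.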
We have the final data frame. Time for some cleaning:
colnames = ["clock",
"description",
"secondary_player_id",
"event_type",
"event_id",
"home_score",
"loc_x",
"loc_y",
"message_type",
"offensive_team_id",
"opponent_player_id",
"opt1",
"opt2",
"order_no",
"player_id",
"team_id",
"visitor_score",
"period_number"]
### column renaming (by position, so 'colnames' has to follow the column order):
rename!(pbp_full, Symbol.(colnames))
pbp_full[!, :game_id] .= game_id
pbp_full[!, :game_date_id] .= game_date_id
pbp_full[!, :home_team] .= home_team
pbp_full[!, :visit_team] .= visit_team
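Renaming by position only works as long as the columns arrive in the expected order. A possible alternative is to rename by name with `old => new` pairs. The sketch below assumes the raw columns are called `cl` and `de` (I am guessing at the short keys here) and uses made-up rows.

```julia
using DataFrames

# Made-up rows; the raw column names "cl" and "de" are assumptions.
raw = DataFrame("cl" => ["12:00", "11:42"],
                "de" => ["Start Period", "Jump Ball"])

# Rename by name, so the column order no longer matters.
rename!(raw, "cl" => "clock", "de" => "description")

# Add a constant column; '.=' repeats the value for every row.
raw[!, :game_id] .= 21800100
```

With pairs, a missing or misspelled source column raises an error instead of silently mislabeling the data.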
Done! We have the table ready. Below you can see the whole function:
using HTTP
using JSON
using DataFrames
function get_playbyplay(game_id)
    season_id = parse(Int,"20"*SubString("$(game_id)",2:3))
    pbp_link = "https://data.nba.com/data/10s/v2015/json/mobile_teams/nba/$(season_id)/scores/pbp/00$(game_id)_full_pbp.json"
    resp = HTTP.get(pbp_link)
    str = String(resp.body)
    jobj = JSON.parse(str)
    game_info = get(jobj, "g", 0)
    game_code = get(game_info,"gcode",0)
    game_date_id = parse(Int,split(game_code,"/")[1])
    visit_team = split(game_code,"/")[2][1:3]
    home_team = split(game_code,"/")[2][4:6]
    periods_list = get(game_info, "pd",0)
    number_of_periods = length(periods_list)
    pbp_full = vcat(map(period_number ->
            parse_game_period(periods_list, period_number),
        1:number_of_periods)...)
    colnames = ["clock",
        "description",
        "secondary_player_id",
        "event_type",
        "event_id",
        "home_score",
        "loc_x",
        "loc_y",
        "message_type",
        "offensive_team_id",
        "opponent_player_id",
        "opt1",
        "opt2",
        "order_no",
        "player_id",
        "team_id",
        "visitor_score",
        "period_number"]
    rename!(pbp_full, Symbol.(colnames))
    pbp_full[!, :game_id] .= game_id
    pbp_full[!, :game_date_id] .= game_date_id
    pbp_full[!, :home_team] .= home_team
    pbp_full[!, :visit_team] .= visit_team
    return pbp_full
end
using juNBA
pbp = get_playbyplay(21800100)
If you are a beginner Julia learner like me, then I hope this helps you in some way. Hopefully the next part will arrive sooner than in a couple of months. Cheers!