Introduction
Pandas groupbys are some of the most useful functions in a data scientist's toolkit. And yet, time and again I have found that colleagues do not realize the flexibility these ubiquitous functions can grant them. In the following post, I will demonstrate some of my favorite uses of groupbys in the hope that it will help others in the future.
The Data
Let's start with a simple example and work our way up in difficulty. We'll start with reading in Covid19 data that Johns Hopkins University aggregated from the World Health Organization. The dataset contains approximately 300,000 observations from different countries and regions of the world on Covid19 cases, recoveries, and deaths.
```python
import pandas as pd
from pandas import read_csv, Grouper
from datetime import timedelta

covid_ts = read_csv("covid_19_data.csv").dropna()
covid_ts['ObservationDate'] = pd.to_datetime(covid_ts['ObservationDate'])
```
Now that we've loaded our data, we can take a look.
```python
covid_ts.head()
```
| SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|
| 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 | 1.0 | 0.0 | 0.0 |
| 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 | 14.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
Here we can confirm that we have seven columns of regional and national-level disease data. ObservationDate and Last Update don't differ by much, so for our purposes we'll simply stick with ObservationDate, dropping the other.
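For completeness, a minimal one-line sketch of the drop (assuming the column name "Last Update" exactly as shown in the table above):

```python
# Drop the redundant "Last Update" column in favor of ObservationDate
covid_ts = covid_ts.drop(columns=["Last Update"])
```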
Since we'll be leveraging the temporality of our dataset, let's quickly check the range of our data.
```python
start = min(covid_ts['ObservationDate'])
end = max(covid_ts['ObservationDate'])
range_ = end - start + timedelta(days=1)  # inclusive range

print(f"Observations range from {start.date()} to {end.date()}, or {range_.days} days")
```
Observations range from 2020-01-22 to 2021-05-29, or 494 days.
Okay, with that let's jump into some groupbys.
Advanced Groupbys
Being Covid timeseries data, the first thing we can do is check how the disease progressed over the weeks. Enter pandas.Grouper, a groupby-specific class that lets users control how their data is grouped at a time-based level. All we have to do is invoke Grouper within a typical groupby call, provide an offset alias (e.g. D for daily, W for weekly, Y for yearly), and apply an aggregation (e.g. sum, mean, count), as in the following lines of code:
(Note: multiline method chains can be wrapped in parentheses for readability.)
```python
(
    covid_ts
    .groupby(Grouper(key="ObservationDate", freq="1W"))
    [['Confirmed']]
    .sum()
    .reset_index()
)
```
Resulting in the Weekly Aggregate Covid19 Cases table below:
| ObservationDate | Confirmed |
|---|---|
| 2020-01-26 | 89,332 |
| 2020-02-02 | 304,681 |
| 2020-02-09 | 621,653 |
| ... | ... |
By grouping "Confirmed" cases in covid_ts using Grouper, the offset alias 1W, and sum, we have easily aggregated weekly confirmed cases over the date range of our data. If we wanted more granular aggregations we could simply change our offset to nD, where n is any number of days, as in the sketch below. But this is only step one of what we can achieve with Grouper, and groupby aggregations in general.
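For instance (a quick variation on the snippet above, not a new result), three-day buckets are just a change of alias:

```python
# Same aggregation as before, but in three-day buckets
(
    covid_ts
    .groupby(Grouper(key="ObservationDate", freq="3D"))
    [['Confirmed']]
    .sum()
    .reset_index()
)
```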
The convenience of Grouper extends to aggregating subgroups by the offsets it is given; all that matters is the placement of Grouper relative to the other groupby columns. For example, in the groupby snippet below, "Country/Region" is placed in a list before our Grouper at the 1M (one-month) offset, producing the National/Regional Covid Cases By Month table that follows.
```python
(
    covid_ts
    .groupby(["Country/Region", Grouper(key="ObservationDate", freq="1M")])
    [['Confirmed']]
    .sum()
    .reset_index()
)
```
| Country/Region | ObservationDate | Confirmed |
|---|---|---|
| Afghanistan | 2020-02-29 | 5 |
| Afghanistan | 2020-03-31 | 4,438 |
| Afghanistan | 2020-04-30 | 21,139 |
| ... | ... | ... |
By contrast, placing "Country/Region" in a list after Grouper yields a similar, but slightly different, view of the data: Monthly Covid Cases by Country/Region.
```python
(
    covid_ts
    .groupby([Grouper(key="ObservationDate", freq="1M"), "Country/Region"])
    [['Confirmed']]
    .sum()
    .reset_index()
)
```
| ObservationDate | Country/Region | Confirmed |
|---|---|---|
| 2020-01-31 | Australia | 248 |
| 2020-01-31 | Belgium | 1 |
| 2020-01-31 | Cambodia | 1 |
| ... | ... | ... |
Note that as we move down this table, we now pass through every reporting country's total for a given month, whereas previously we passed through the entire monthly history of cases for each country in turn. As ever, the order in which our data is listed is determined by the placement of columns in our groupby. With Grouper, our options simply extend to aggregations of date values, and the extensibility does not end there.
Complementary Functions
After grouping our data, we often want to operate on the values we have derived. Luckily, pandas provides us with assign, a method that appends newly derived columns within a method chain (note that it returns a new DataFrame rather than modifying one in place). To use assign, most often one will also want to be comfortable with lambda expressions, so a quick refresher is in order.
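As a minimal sketch (using a toy DataFrame invented here, not our Covid data), assign takes keyword arguments; when the value is a lambda, the lambda receives the DataFrame and returns the new column:

```python
import pandas as pd

toy = pd.DataFrame({"deaths": [1, 2], "cases": [10, 40]})

# assign returns a new DataFrame with the derived column appended
toy = toy.assign(ratio=lambda df: df["deaths"] / df["cases"] * 100)
print(toy)  # ratio column: 10.0, 5.0
```

With that refresher in hand, let's return to our National/Regional Covid Cases By Month for a use-case.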
First, we'll rename our columns with some named aggregations for clarity, replacing "Confirmed" with the more accurate "Total_Cases".
```python
(
    covid_ts
    .groupby(["Country/Region",
              pd.Grouper(key="ObservationDate", freq="1M")])
    .agg(Total_Cases=("Confirmed", "sum"))
    .reset_index()
    .head()
)
```
| Country/Region | ObservationDate | Total_Cases |
|---|---|---|
| Afghanistan | 2020-02-29 | 5 |
| Afghanistan | 2020-03-31 | 4,438 |
| Afghanistan | 2020-04-30 | 21,139 |
| ... | ... | ... |
Next, we'll add additional data to our table by inserting a named aggregation for "Total_Deaths" per country per month.
```python
(
    covid_ts
    .groupby(["Country/Region",
              pd.Grouper(key="ObservationDate", freq="1M")])
    .agg(Total_Deaths=("Deaths", "sum"),
         Total_Cases=("Confirmed", "sum"))
    .reset_index()
    .head()
)
```
| Country/Region | ObservationDate | Total_Deaths | Total_Cases |
|---|---|---|---|
| Afghanistan | 2020-02-29 | 0 | 5 |
| Afghanistan | 2020-03-31 | 148 | 4,438 |
| Afghanistan | 2020-04-30 | 572 | 21,139 |
| ... | ... | ... | ... |
Finally, we'll leverage assign by referencing our new "Total_Deaths" and "Total_Cases" in-line, using them to create an entirely new column of data: "Death_Case_Ratio", or "Total_Deaths" divided by "Total_Cases". Multiplying our new ratio by 100 yields an informative metric: the percentage of infected individuals who die each month in a given country/region. Lastly, we'll rename our columns to more readable titles, as spaces aren't allowed in named aggregations.
```python
(
    covid_ts
    .groupby(["Country/Region",
              pd.Grouper(key="ObservationDate", freq="1M")])
    .agg(Total_Deaths=("Deaths", "sum"),
         Total_Cases=("Confirmed", "sum"))
    .assign(Death_Case_Ratio=lambda x: round(x['Total_Deaths'] / x['Total_Cases'] * 100, 2))
    .reset_index()
    .rename({'Total_Deaths': 'Total Deaths',
             'Total_Cases': 'Total Cases',
             'Death_Case_Ratio': 'Death/Case Ratio (%)'},
            axis=1)
)
```
And voilà, we have the following National/Regional Death/Case Ratio By Month table.
| Country/Region | ObservationDate | Total Deaths | Total Cases | Death/Case Ratio (%) |
|---|---|---|---|---|
| Afghanistan | 2020-02-29 | 0 | 5 | 0.00 |
| Afghanistan | 2020-03-31 | 148 | 4,438 | 3.34 |
| Afghanistan | 2020-04-30 | 572 | 21,139 | 2.71 |
| ... | ... | ... | ... | ... |
Aggregating Text and Filtering
Groupbys are not simply convenient for aggregating numerical data; they are also useful for summarizing text. Let's return once more to our National/Regional Death/Case Ratio by Month table. This time, we'll leverage a lambda function within our agg call, stretching its flexibility to its fullest extent.
```python
(
    covid_ts
    .groupby(["Country/Region",
              pd.Grouper(key="ObservationDate", freq="1M")])
    .agg(Total_Deaths=("Deaths", "sum"),
         Total_Cases=("Confirmed", "sum"),
         City_States=('Province/State', lambda x: ', '.join(set(x))))
    .assign(Death_Case_Ratio=lambda x: round(x['Total_Deaths'] / x['Total_Cases'] * 100, 2))
    .reset_index()
    .rename({'Total_Deaths': 'Total Deaths',
             'Total_Cases': 'Total Cases',
             'City_States': 'City/States',
             'Death_Case_Ratio': 'Death/Case Ratio (%)'},
            axis=1)
    .query("`Country/Region` == 'US'")
)
```
Here we generate a "City_States" column (subsequently renamed to "City/States") in which we aggregate "Province/State" text data to the monthly and country level. By joining inside the lambda, and deduplicating with set, we instruct pandas to list out which cities and states comprise our data. To narrow the view, a filter was also added in the form of pandas' powerful query method, which lets us filter rows using boolean expressions. Here we filtered "Country/Region" to include only the United States. The result of our work can be seen in the US Death/Case Ratio by Month table below.
| Country/Region | ObservationDate | Total Deaths | Total Cases | City/States | Death/Case Ratio (%) |
|---|---|---|---|---|---|
| US | 2020-01-31 | 0 | 7 | Washington, Chicago, ... | 0.00 |
| US | 2020-02-29 | 1 | 68 | California, Washington, New York, ... | 1.47 |
| US | 2020-03-31 | 4,081 | 190,089 | All 50 states + territories | 2.15 |
| ... | ... | ... | ... | ... | ... |
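Because query accepts arbitrary boolean expressions, the filter above composes easily. As a hypothetical extension (assuming we stored the result of the chain, before the final rename, in a DataFrame called monthly, a name introduced here purely for illustration), we could keep only the US months where the ratio exceeded 2%:

```python
# Hypothetical: "monthly" holds the grouped result before the .rename step,
# so Death_Case_Ratio is still a valid Python identifier inside query
us_severe = monthly.query("`Country/Region` == 'US' and Death_Case_Ratio > 2")
```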
Summary
As we can see, pandas groupbys are far more flexible than they are typically given credit for. When we leverage functions such as Grouper, we can aggregate timeseries data using offset aliases. By implementing the agg function with named aggregations, we can reference and manipulate those new columns by appending an assign call to our chain. We have also seen that even text need not be ignored, thanks to our ability to use lambda functions within agg as well. To top it off, pandas' powerful query method lets us filter our data to whatever granularity we'd like using boolean expressions.