Aggregating some observations in a data frame

A common task when working on spatio-temporal data frames is aggregating the observations of some variables, either in space, or in time. This can be done easilty with the aggregate_by function of the dplyrx package. Let’s load it:

library(dplyrx)

To understand it better what the function can help do let’s have a look at the following meteorological data from the dplyrx package:

data(meteo)
meteo
#>    year     month station temperature humidity rainfall sunshine
#> 1  2013   January Can Tho   26.145161     26.2     15.1    209.5
#> 2  2013       May Can Tho   28.941935     32.3    169.1    232.3
#> 3  2013 September Can Tho   27.076667     30.5    336.7    155.9
#> 4  2014   January Can Tho   24.738710     23.1      1.4    223.2
#> 5  2014       May Can Tho   29.348387     32.1    197.1    251.0
#> 6  2014 September Can Tho   27.316667     30.2    298.1    199.5
#> 7  2015   January Can Tho   25.200000     24.3      0.0    227.7
#> 8  2015       May Can Tho   29.500000     31.4    168.3    278.5
#> 9  2015 September Can Tho   27.700000     30.9    288.6    207.8
#> 10 2013   January  Da Lat   16.103226     14.6      5.0    224.8
#> 11 2013       May  Da Lat   20.070968     19.9    268.6    194.2
#> 12 2013 September  Da Lat   18.300000     18.8    389.7     83.5
#> 13 2014   January  Da Lat   14.793548     13.0      5.9    237.6
#> 14 2014       May  Da Lat   19.851613     20.0    325.6    185.3
#> 15 2014 September  Da Lat   18.760000     19.6    341.5    127.8
#> 16 2015   January  Da Lat   15.500000     13.9      0.1    260.6
#> 17 2015       May  Da Lat   20.100000     20.1    282.2    174.5
#> 18 2015 September  Da Lat   19.200000     19.4    252.9    119.2
#> 19 2013   January Da Nang   21.922581     21.9     17.8    126.1
#> 20 2013       May Da Nang   29.161290     30.6     43.3    288.3
#> 21 2013 September Da Nang   27.126667     30.2    750.8    145.3
#> 22 2014   January Da Nang   20.300000     19.5     86.6    104.3
#> 23 2014       May Da Nang   29.335484     30.4      5.5    280.2
#> 24 2014 September Da Nang   28.686667     31.0    111.5    225.2
#> 25 2015   January Da Nang   21.200000     20.4     24.2    151.5
#> 26 2015       May Da Nang   29.900000     30.9     34.6    281.7
#> 27 2015 September Da Nang   28.900000     31.3    416.0    240.0
#> 28 2013   January   Sa Pa    7.441935      9.9     44.8     62.1
#> 29 2013       May   Sa Pa   19.377419     18.9    401.1    165.9
#> 30 2013 September   Sa Pa   17.056667     17.7    300.0     95.3
#> 31 2014   January   Sa Pa    7.790323      9.4     11.6    128.0
#> 32 2014       May   Sa Pa   20.151613     18.4    118.3    208.0
#> 33 2014 September   Sa Pa   18.533333     20.1    346.2     99.4
#> 34 2015   January   Sa Pa    9.200000      9.9    112.1    143.0
#> 35 2015       May   Sa Pa   21.300000     18.1    297.6    250.6
#> 36 2015 September   Sa Pa   19.300000     19.8    301.4     72.2

This data set contains the values of temperature (°C), absolute humidity (g / m\(^3\)), rainfall (mm) and hours of sunshine from 3 meteorological stations (Can Tho, Da Lat and Da Nang) for the months of January, May and September in 2013, 2014 and 2015. From here, imagine that you want to aggretate all the observations of the stations Can Tho and Da Lat. An obvious way to do so is probably to take the mean. This can be done so in two step with the package dplyr and with the aggregate_by function of the dplyrx package: First, we need to recode the stations selected with the same value and then used the aggregate_by function on the output.

library(dplyr)
meteo %>% 
  mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>% 
  aggregate_by(station, month, year, .funs = mean)
#> # A tibble: 27 x 7
#> # Groups:   station, month [?]
#>    station    month      year temperature humidity rainfall sunshine
#>    <fct>      <ord>     <int>       <dbl>    <dbl>    <dbl>    <dbl>
#>  1 group_prov January    2013        21.1     20.4    10.0      217.
#>  2 group_prov January    2014        19.8     18.0     3.65     230.
#>  3 group_prov January    2015        20.4     19.1     0.05     244.
#>  4 group_prov May        2013        24.5     26.1   219.       213.
#>  5 group_prov May        2014        24.6     26.0   261.       218.
#>  6 group_prov May        2015        24.8     25.8   225.       226.
#>  7 group_prov September  2013        22.7     24.6   363.       120.
#>  8 group_prov September  2014        23.0     24.9   320.       164.
#>  9 group_prov September  2015        23.4     25.2   271.       164.
#> 10 Da Nang    January    2013        21.9     21.9    17.8      126.
#> # ... with 17 more rows

The second argument are listed all the variables that should be used to identify unique observations to aggregate in each of the stations Can Tho and Da Lat. Here we want aggregate by space but keep the temporal information contained in the variables month and year.

The last argument, .funs specifies the function that should be used to perform the aggregation. By default it’s sum but here it makes more sense to do mean.

We can naturally aggregate the observations and apply the function on temperature and humidity only, the argument .funs need to be specify as a list:

meteo %>% 
  mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>% 
  aggregate_by(station, month, year, .funs = list(mean(temperature, humidity)))
#> # A tibble: 27 x 5
#> # Groups:   station, month [?]
#>    station    month      year `mean(temperature)` `mean(humidity)`
#>    <fct>      <ord>     <int>               <dbl>            <dbl>
#>  1 group_prov January    2013                21.1             20.4
#>  2 group_prov January    2014                19.8             18.0
#>  3 group_prov January    2015                20.4             19.1
#>  4 group_prov May        2013                24.5             26.1
#>  5 group_prov May        2014                24.6             26.0
#>  6 group_prov May        2015                24.8             25.8
#>  7 group_prov September  2013                22.7             24.6
#>  8 group_prov September  2014                23.0             24.9
#>  9 group_prov September  2015                23.4             25.2
#> 10 Da Nang    January    2013                21.9             21.9
#> # ... with 17 more rows

A more complex situation

Let’s now consider the case where, with the meteo data set, instead of performing a spatial aggregation we want to perform a temporal aggregation. Let’s imagine for example that we want to aggregate the months of January and May for all the provinces. What’s apparent here is that the proper way of performing such an aggregation for the variables temperature and humidity would probably to use the mean function whereas it would certainly make more sense to use the sum function for the variables rainfall and sunshine. The pipeline below shows how to do so:

meteo %>% 
  mutate(month = recode(month, `January` = "group_month", `May`= "group_month")) %>% 
  aggregate_by(station, month, year, .funs = list(mean(temperature, humidity), sum(rainfall, sunshine)))
#> # A tibble: 24 x 7
#> # Groups:   station, month [?]
#>    station month    year `mean(temperatu… `mean(humidity)` `sum(rainfall)`
#>    <fct>   <ord>   <int>            <dbl>            <dbl>           <dbl>
#>  1 Can Tho group_…  2013             27.5             29.2            184.
#>  2 Can Tho group_…  2014             27.0             27.6            198.
#>  3 Can Tho group_…  2015             27.4             27.8            168.
#>  4 Can Tho Septem…  2013             27.1             30.5            337.
#>  5 Can Tho Septem…  2014             27.3             30.2            298.
#>  6 Can Tho Septem…  2015             27.7             30.9            289.
#>  7 Da Lat  group_…  2013             18.1             17.2            274.
#>  8 Da Lat  group_…  2014             17.3             16.5            332.
#>  9 Da Lat  group_…  2015             17.8             17              282.
#> 10 Da Lat  Septem…  2013             18.3             18.8            390.
#> # ... with 14 more rows, and 1 more variable: `sum(sunshine)` <dbl>

Programming with aggregate_by

The function aggregate_by has been written to make it possible to program with. The following first examples here are reduced to very simple components to make them easier to understand.

Writing a function with aggregate_by is possible as the arguments can be specify with or without quotation. For these examples, we will just use the temperature and rainfall columns of the meteo data set, to have a simplier visualisation

meteo2 <- meteo %>% select(year, month, station, temperature, rainfall)

Let’s start with a simple case: we want to aggregate the station Can Tho and Da Lat by month and year.

meteo2 %>% 
  mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>% 
  aggregate_by(station, month, year)
#> # A tibble: 27 x 5
#> # Groups:   station, month [?]
#>    station    month      year temperature rainfall
#>    <fct>      <ord>     <int>       <dbl>    <dbl>
#>  1 group_prov January    2013        42.2    20.1 
#>  2 group_prov January    2014        39.5     7.30
#>  3 group_prov January    2015        40.7     0.1 
#>  4 group_prov May        2013        49.0   438.  
#>  5 group_prov May        2014        49.2   523.  
#>  6 group_prov May        2015        49.6   450.  
#>  7 group_prov September  2013        45.4   726.  
#>  8 group_prov September  2014        46.1   640.  
#>  9 group_prov September  2015        46.9   542.  
#> 10 Da Nang    January    2013        21.9    17.8 
#> # ... with 17 more rows

In a function:

my_aggregate_by <- function(df, group_var) {
  aggregate_by(df, station, group_var)
}

group_var <- c("year", "month")
meteo2 %>% 
  mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>% 
  my_aggregate_by(group_var)
#> # A tibble: 27 x 5
#> # Groups:   station, year [?]
#>    station     year month     temperature rainfall
#>    <fct>      <int> <ord>           <dbl>    <dbl>
#>  1 group_prov  2013 January          42.2    20.1 
#>  2 group_prov  2013 May              49.0   438.  
#>  3 group_prov  2013 September        45.4   726.  
#>  4 group_prov  2014 January          39.5     7.30
#>  5 group_prov  2014 May              49.2   523.  
#>  6 group_prov  2014 September        46.1   640.  
#>  7 group_prov  2015 January          40.7     0.1 
#>  8 group_prov  2015 May              49.6   450.  
#>  9 group_prov  2015 September        46.9   542.  
#> 10 Da Nang     2013 January          21.9    17.8 
#> # ... with 17 more rows

It’s the same principle with the parameter col_name (here: station). We can complexify a little bit our function:

my_aggregate_by <- function(df, col_name, group_var) {
  aggregate_by(df, col_name, group_var)
}

sel <- "station"
sel_var <- c("year", "month")

meteo2 %>% 
  mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>% 
  my_aggregate_by(sel, sel_var)
#> # A tibble: 27 x 5
#> # Groups:   station, year [?]
#>    station     year month     temperature rainfall
#>    <fct>      <int> <ord>           <dbl>    <dbl>
#>  1 group_prov  2013 January          42.2    20.1 
#>  2 group_prov  2013 May              49.0   438.  
#>  3 group_prov  2013 September        45.4   726.  
#>  4 group_prov  2014 January          39.5     7.30
#>  5 group_prov  2014 May              49.2   523.  
#>  6 group_prov  2014 September        46.1   640.  
#>  7 group_prov  2015 January          40.7     0.1 
#>  8 group_prov  2015 May              49.6   450.  
#>  9 group_prov  2015 September        46.9   542.  
#> 10 Da Nang     2013 January          21.9    17.8 
#> # ... with 17 more rows

The next challenge is to also integrate the .funs parameters in our function to be able to filter the values used for aggregation.

my_aggregate_by <- function(df, col_name, group_var, functs) {
  aggregate_by(df, col_name, group_var, .funs = functs)
}

The .funs parameter should be provide as a list of vector character when used in programming to avoid problem of evaluation.

sel <- "station"
sel_var <- c("year", "month")
lst_fun <-  list("mean(temperature)", "sum(rainfall)")

meteo2 %>% 
  mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>% 
  my_aggregate_by(sel, sel_var, functs = lst_fun)
#> # A tibble: 27 x 5
#> # Groups:   station, year [?]
#>    station     year month     `mean(temperature)` `sum(rainfall)`
#>    <fct>      <int> <ord>                   <dbl>           <dbl>
#>  1 group_prov  2013 January                  21.1           20.1 
#>  2 group_prov  2013 May                      24.5          438.  
#>  3 group_prov  2013 September                22.7          726.  
#>  4 group_prov  2014 January                  19.8            7.30
#>  5 group_prov  2014 May                      24.6          523.  
#>  6 group_prov  2014 September                23.0          640.  
#>  7 group_prov  2015 January                  20.4            0.1 
#>  8 group_prov  2015 May                      24.8          450.  
#>  9 group_prov  2015 September                23.4          542.  
#> 10 Da Nang     2013 January                  21.9           17.8 
#> # ... with 17 more rows

A more complex situation

Programming with aggregate_by

Contents