vignettes/aggregating_observations.Rmd
aggregating_observations.Rmd
A common task when working on spatio-temporal data frames is aggregating the observations of some variables, either in space, or in time. This can be done easilty with the aggregate_by
function of the dplyrx
package. Let’s load it:
library(dplyrx)
To understand it better what the function can help do let’s have a look at the following meteorological data from the dplyrx
package:
data(meteo)
meteo
#> year month station temperature humidity rainfall sunshine
#> 1 2013 January Can Tho 26.145161 26.2 15.1 209.5
#> 2 2013 May Can Tho 28.941935 32.3 169.1 232.3
#> 3 2013 September Can Tho 27.076667 30.5 336.7 155.9
#> 4 2014 January Can Tho 24.738710 23.1 1.4 223.2
#> 5 2014 May Can Tho 29.348387 32.1 197.1 251.0
#> 6 2014 September Can Tho 27.316667 30.2 298.1 199.5
#> 7 2015 January Can Tho 25.200000 24.3 0.0 227.7
#> 8 2015 May Can Tho 29.500000 31.4 168.3 278.5
#> 9 2015 September Can Tho 27.700000 30.9 288.6 207.8
#> 10 2013 January Da Lat 16.103226 14.6 5.0 224.8
#> 11 2013 May Da Lat 20.070968 19.9 268.6 194.2
#> 12 2013 September Da Lat 18.300000 18.8 389.7 83.5
#> 13 2014 January Da Lat 14.793548 13.0 5.9 237.6
#> 14 2014 May Da Lat 19.851613 20.0 325.6 185.3
#> 15 2014 September Da Lat 18.760000 19.6 341.5 127.8
#> 16 2015 January Da Lat 15.500000 13.9 0.1 260.6
#> 17 2015 May Da Lat 20.100000 20.1 282.2 174.5
#> 18 2015 September Da Lat 19.200000 19.4 252.9 119.2
#> 19 2013 January Da Nang 21.922581 21.9 17.8 126.1
#> 20 2013 May Da Nang 29.161290 30.6 43.3 288.3
#> 21 2013 September Da Nang 27.126667 30.2 750.8 145.3
#> 22 2014 January Da Nang 20.300000 19.5 86.6 104.3
#> 23 2014 May Da Nang 29.335484 30.4 5.5 280.2
#> 24 2014 September Da Nang 28.686667 31.0 111.5 225.2
#> 25 2015 January Da Nang 21.200000 20.4 24.2 151.5
#> 26 2015 May Da Nang 29.900000 30.9 34.6 281.7
#> 27 2015 September Da Nang 28.900000 31.3 416.0 240.0
#> 28 2013 January Sa Pa 7.441935 9.9 44.8 62.1
#> 29 2013 May Sa Pa 19.377419 18.9 401.1 165.9
#> 30 2013 September Sa Pa 17.056667 17.7 300.0 95.3
#> 31 2014 January Sa Pa 7.790323 9.4 11.6 128.0
#> 32 2014 May Sa Pa 20.151613 18.4 118.3 208.0
#> 33 2014 September Sa Pa 18.533333 20.1 346.2 99.4
#> 34 2015 January Sa Pa 9.200000 9.9 112.1 143.0
#> 35 2015 May Sa Pa 21.300000 18.1 297.6 250.6
#> 36 2015 September Sa Pa 19.300000 19.8 301.4 72.2
This data set contains the values of temperature (°C), absolute humidity (g / m\(^3\)), rainfall (mm) and hours of sunshine from 3 meteorological stations (Can Tho, Da Lat and Da Nang) for the months of January, May and September in 2013, 2014 and 2015. From here, imagine that you want to aggretate all the observations of the stations Can Tho and Da Lat. An obvious way to do so is probably to take the mean. This can be done so in two step with the package dplyr
and with the aggregate_by
function of the dplyrx
package: First, we need to recode
the stations selected with the same value and then used the aggregate_by
function on the output.
library(dplyr)
meteo %>%
mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>%
aggregate_by(station, month, year, .funs = mean)
#> # A tibble: 27 x 7
#> # Groups: station, month [?]
#> station month year temperature humidity rainfall sunshine
#> <fct> <ord> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 group_prov January 2013 21.1 20.4 10.0 217.
#> 2 group_prov January 2014 19.8 18.0 3.65 230.
#> 3 group_prov January 2015 20.4 19.1 0.05 244.
#> 4 group_prov May 2013 24.5 26.1 219. 213.
#> 5 group_prov May 2014 24.6 26.0 261. 218.
#> 6 group_prov May 2015 24.8 25.8 225. 226.
#> 7 group_prov September 2013 22.7 24.6 363. 120.
#> 8 group_prov September 2014 23.0 24.9 320. 164.
#> 9 group_prov September 2015 23.4 25.2 271. 164.
#> 10 Da Nang January 2013 21.9 21.9 17.8 126.
#> # ... with 17 more rows
The second argument are listed all the variables that should be used to identify unique observations to aggregate in each of the stations Can Tho and Da Lat. Here we want aggregate by space but keep the temporal information contained in the variables month
and year
.
The last argument, .funs
specifies the function that should be used to perform the aggregation. By default it’s sum
but here it makes more sense to do mean
.
We can naturally aggregate the observations and apply the function on temperature
and humidity
only, the argument .funs
need to be specify as a list:
meteo %>%
mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>%
aggregate_by(station, month, year, .funs = list(mean(temperature, humidity)))
#> # A tibble: 27 x 5
#> # Groups: station, month [?]
#> station month year `mean(temperature)` `mean(humidity)`
#> <fct> <ord> <int> <dbl> <dbl>
#> 1 group_prov January 2013 21.1 20.4
#> 2 group_prov January 2014 19.8 18.0
#> 3 group_prov January 2015 20.4 19.1
#> 4 group_prov May 2013 24.5 26.1
#> 5 group_prov May 2014 24.6 26.0
#> 6 group_prov May 2015 24.8 25.8
#> 7 group_prov September 2013 22.7 24.6
#> 8 group_prov September 2014 23.0 24.9
#> 9 group_prov September 2015 23.4 25.2
#> 10 Da Nang January 2013 21.9 21.9
#> # ... with 17 more rows
Let’s now consider the case where, with the meteo
data set, instead of performing a spatial aggregation we want to perform a temporal aggregation. Let’s imagine for example that we want to aggregate the months of January and May for all the provinces. What’s apparent here is that the proper way of performing such an aggregation for the variables temperature
and humidity
would probably to use the mean
function whereas it would certainly make more sense to use the sum
function for the variables rainfall
and sunshine
. The pipeline below shows how to do so:
meteo %>%
mutate(month = recode(month, `January` = "group_month", `May`= "group_month")) %>%
aggregate_by(station, month, year, .funs = list(mean(temperature, humidity), sum(rainfall, sunshine)))
#> # A tibble: 24 x 7
#> # Groups: station, month [?]
#> station month year `mean(temperatu… `mean(humidity)` `sum(rainfall)`
#> <fct> <ord> <int> <dbl> <dbl> <dbl>
#> 1 Can Tho group_… 2013 27.5 29.2 184.
#> 2 Can Tho group_… 2014 27.0 27.6 198.
#> 3 Can Tho group_… 2015 27.4 27.8 168.
#> 4 Can Tho Septem… 2013 27.1 30.5 337.
#> 5 Can Tho Septem… 2014 27.3 30.2 298.
#> 6 Can Tho Septem… 2015 27.7 30.9 289.
#> 7 Da Lat group_… 2013 18.1 17.2 274.
#> 8 Da Lat group_… 2014 17.3 16.5 332.
#> 9 Da Lat group_… 2015 17.8 17 282.
#> 10 Da Lat Septem… 2013 18.3 18.8 390.
#> # ... with 14 more rows, and 1 more variable: `sum(sunshine)` <dbl>
The function aggregate_by
has been written to make it possible to program with. The following first examples here are reduced to very simple components to make them easier to understand.
Writing a function with aggregate_by
is possible as the arguments can be specify with or without quotation. For these examples, we will just use the temperature and rainfall columns of the meteo data set, to have a simplier visualisation
meteo2 <- meteo %>% select(year, month, station, temperature, rainfall)
Let’s start with a simple case: we want to aggregate the station Can Tho
and Da Lat
by month and year.
meteo2 %>%
mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>%
aggregate_by(station, month, year)
#> # A tibble: 27 x 5
#> # Groups: station, month [?]
#> station month year temperature rainfall
#> <fct> <ord> <int> <dbl> <dbl>
#> 1 group_prov January 2013 42.2 20.1
#> 2 group_prov January 2014 39.5 7.30
#> 3 group_prov January 2015 40.7 0.1
#> 4 group_prov May 2013 49.0 438.
#> 5 group_prov May 2014 49.2 523.
#> 6 group_prov May 2015 49.6 450.
#> 7 group_prov September 2013 45.4 726.
#> 8 group_prov September 2014 46.1 640.
#> 9 group_prov September 2015 46.9 542.
#> 10 Da Nang January 2013 21.9 17.8
#> # ... with 17 more rows
In a function:
my_aggregate_by <- function(df, group_var) {
aggregate_by(df, station, group_var)
}
group_var <- c("year", "month")
meteo2 %>%
mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>%
my_aggregate_by(group_var)
#> # A tibble: 27 x 5
#> # Groups: station, year [?]
#> station year month temperature rainfall
#> <fct> <int> <ord> <dbl> <dbl>
#> 1 group_prov 2013 January 42.2 20.1
#> 2 group_prov 2013 May 49.0 438.
#> 3 group_prov 2013 September 45.4 726.
#> 4 group_prov 2014 January 39.5 7.30
#> 5 group_prov 2014 May 49.2 523.
#> 6 group_prov 2014 September 46.1 640.
#> 7 group_prov 2015 January 40.7 0.1
#> 8 group_prov 2015 May 49.6 450.
#> 9 group_prov 2015 September 46.9 542.
#> 10 Da Nang 2013 January 21.9 17.8
#> # ... with 17 more rows
It’s the same principle with the parameter col_name
(here: station
). We can complexify a little bit our function:
my_aggregate_by <- function(df, col_name, group_var) {
aggregate_by(df, col_name, group_var)
}
sel <- "station"
sel_var <- c("year", "month")
meteo2 %>%
mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>%
my_aggregate_by(sel, sel_var)
#> # A tibble: 27 x 5
#> # Groups: station, year [?]
#> station year month temperature rainfall
#> <fct> <int> <ord> <dbl> <dbl>
#> 1 group_prov 2013 January 42.2 20.1
#> 2 group_prov 2013 May 49.0 438.
#> 3 group_prov 2013 September 45.4 726.
#> 4 group_prov 2014 January 39.5 7.30
#> 5 group_prov 2014 May 49.2 523.
#> 6 group_prov 2014 September 46.1 640.
#> 7 group_prov 2015 January 40.7 0.1
#> 8 group_prov 2015 May 49.6 450.
#> 9 group_prov 2015 September 46.9 542.
#> 10 Da Nang 2013 January 21.9 17.8
#> # ... with 17 more rows
The next challenge is to also integrate the .funs
parameters in our function to be able to filter the values used for aggregation.
my_aggregate_by <- function(df, col_name, group_var, functs) {
aggregate_by(df, col_name, group_var, .funs = functs)
}
The .funs
parameter should be provide as a list of vector character when used in programming to avoid problem of evaluation.
sel <- "station"
sel_var <- c("year", "month")
lst_fun <- list("mean(temperature)", "sum(rainfall)")
meteo2 %>%
mutate(station = recode(station, `Can Tho` = "group_prov", `Da Lat`= "group_prov")) %>%
my_aggregate_by(sel, sel_var, functs = lst_fun)
#> # A tibble: 27 x 5
#> # Groups: station, year [?]
#> station year month `mean(temperature)` `sum(rainfall)`
#> <fct> <int> <ord> <dbl> <dbl>
#> 1 group_prov 2013 January 21.1 20.1
#> 2 group_prov 2013 May 24.5 438.
#> 3 group_prov 2013 September 22.7 726.
#> 4 group_prov 2014 January 19.8 7.30
#> 5 group_prov 2014 May 24.6 523.
#> 6 group_prov 2014 September 23.0 640.
#> 7 group_prov 2015 January 20.4 0.1
#> 8 group_prov 2015 May 24.8 450.
#> 9 group_prov 2015 September 23.4 542.
#> 10 Da Nang 2013 January 21.9 17.8
#> # ... with 17 more rows