# Arithmetic between dataframes with varying numbers of rows in R

I have objects containing monthly data on plant growth. Each object is a fixed number of columns, and the number of rows is equal to the number of months the plant survives. I would like to take the mean of these objects so that the mean considers only plants surviving at a given timestep. Here is example data:

``````df1 <- data.frame(GPP = 1:10, NPP = 1:10)
df2 <- data.frame(GPP = 2:8, NPP = 2:8)
df3 <- data.frame(GPP = 3:9, NPP = 3:9 )
``````

In this scenario, the maximum timesteps is 10, and the 2nd and 3rd plants did not survive this long. To take the mean, my initial thought was to replace empty space with `NA` to make the dimensions the same, such as this:

``````na <- matrix( , nrow = 3, ncol = 2)
colnames(na) <- c("GPP","NPP")
df2 <- rbind(df2, na)
df3 <- rbind(df3, na)
``````

This is not desirable because the `NA` does not simply ignore the value as I had hoped, but nullifies the field, leading to all outputs of arithmetic with `NA` becoming `NA`, such as this:

``````(df1 + df2 + df3) / 3
``````
``````   GPP NPP
1    2   2
2    3   3
3    4   4
4    5   5
5    6   6
6    7   7
7    8   8
8   NA  NA
9   NA  NA
10  NA  NA
``````

I can NOT just fill `na` with 0s because I want to see the mean of every plant that is living at a given timestep while completely ignoring those that have died. Replacing with 0s would skew the mean, and not achieve this. For my example data here, this is the desired outcome

``````(df1 + df2 + df3) / 3
``````
``````   GPP NPP
1    2   2
2    3   3
3    4   4
4    5   5
5    6   6
6    7   7
7    8   8
8    8   8
9    9   9
10  10  10
``````

Here rows 8-10 are replaced with the values from `df1` because there are only 7 rows in both `df2` and `df3`.

## Solution

I’ll restate my comment: it is generally much safer to encode the month in the original data before you do anything else; it is explicit and will insulate you from mistakes later in the pipeline that might inadvertently change the order of rows (which completely breaks any valid significance you hope to attain). Additionally, since I’m going to recommend putting all data into one frame, let’s encode the plant number as well (even if we don’t use it immediately here).

For that, then:

``````df1 <- data.frame(plant = "A", month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(plant = "B", month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(plant = "C", month = 1:7, GPP = 3:9, NPP = 3:9)
``````

From this, I’m a huge fan of having all data in one frame. This is well-informed by https://stackoverflow.com/a/24376207/3358227, where a premise is that if you’re going to do the same thing to a bunch of frames, it should either be a list-of-frames or one combined frame (that keeps the source id encoded):

``````dfs <- do.call(rbind, list(df1, df2, df3))
### just a sampling, for depiction
dfs[c(1:2, 10:12, 17:19),]
#    plant month GPP NPP
# 1      A     1   1   1
# 2      A     2   2   2
# 10     A    10  10  10
# 11     B     1   2   2
# 12     B     2   3   3
# 17     B     7   8   8
# 18     C     1   3   3
# 19     C     2   4   4
``````

## base R

``````aggregate(cbind(GPP, NPP) ~ month, data = dfs, FUN = mean, na.rm = TRUE)
#    month GPP NPP
# 1      1   2   2
# 2      2   3   3
# 3      3   4   4
# 4      4   5   5
# 5      5   6   6
# 6      6   7   7
# 7      7   8   8
# 8      8   8   8
# 9      9   9   9
# 10    10  10  10
``````

## dplyr

``````library(dplyr)
dfs %>%
group_by(month) %>%
summarize(across(c(GPP, NPP), mean))
# # A tibble: 10 x 3
#    month   GPP   NPP
#    <int> <dbl> <dbl>
#  1     1     2     2
#  2     2     3     3
#  3     3     4     4
#  4     4     5     5
#  5     5     6     6
#  6     6     7     7
#  7     7     8     8
#  8     8     8     8
#  9     9     9     9
# 10    10    10    10
``````

Side point: two bits of data you are "losing" in this summary is the size of data and the variability of each month. You might include them with:

``````dfs %>%
group_by(month) %>%
summarize(across(c(GPP, NPP), list(mu = ~ mean(.), sigma = ~ sd(.), len = ~ length(.))))
# # A tibble: 10 x 7
#    month GPP_mu GPP_sigma GPP_len NPP_mu NPP_sigma NPP_len
#    <int>  <dbl>     <dbl>   <int>  <dbl>     <dbl>   <int>
#  1     1      2         1       3      2         1       3
#  2     2      3         1       3      3         1       3
#  3     3      4         1       3      4         1       3
#  4     4      5         1       3      5         1       3
#  5     5      6         1       3      6         1       3
#  6     6      7         1       3      7         1       3
#  7     7      8         1       3      8         1       3
#  8     8      8        NA       1      8        NA       1
#  9     9      9        NA       1      9        NA       1
# 10    10     10        NA       1     10        NA       1
``````

In this case, an average of `8` may be meaningful, but noting that it is a `len`gth of 1 is also informative of the "strength" of that statistic (i.e., weak).

﻿