Arithmetic between dataframes with varying numbers of rows in R

I have objects containing monthly data on plant growth. Each object is a fixed number of columns, and the number of rows is equal to the number of months the plant survives. I would like to take the mean of these objects so that the mean considers only plants surviving at a given timestep. Here is example data:

df1 <- data.frame(GPP = 1:10, NPP = 1:10)
df2 <- data.frame(GPP = 2:8, NPP = 2:8)
df3 <- data.frame(GPP = 3:9, NPP = 3:9 )

In this scenario, the maximum timesteps is 10, and the 2nd and 3rd plants did not survive this long. To take the mean, my initial thought was to replace empty space with NA to make the dimensions the same, such as this:

na <- matrix( , nrow = 3, ncol = 2)
colnames(na) <- c("GPP","NPP")
df2 <- rbind(df2, na)
df3 <- rbind(df3, na)

This is not desirable because the NA does not simply ignore the value as I had hoped, but nullifies the field, leading to all outputs of arithmetic with NA becoming NA, such as this:

(df1 + df2 + df3) / 3
GPP NPP
1    2   2
2    3   3
3    4   4
4    5   5
5    6   6
6    7   7
7    8   8
8   NA  NA
9   NA  NA
10  NA  NA

I can NOT just fill na with 0s because I want to see the mean of every plant that is living at a given timestep while completely ignoring those that have died. Replacing with 0s would skew the mean, and not achieve this. For my example data here, this is the desired outcome

(df1 + df2 + df3) / 3
GPP NPP
1    2   2
2    3   3
3    4   4
4    5   5
5    6   6
6    7   7
7    8   8
8    8   8
9    9   9
10  10  10

Here rows 8-10 are replaced with the values from df1 because there are only 7 rows in both df2 and df3.

Solution

I’ll restate my comment: it is generally much safer to encode the month in the original data before you do anything else; it is explicit and will insulate you from mistakes later in the pipeline that might inadvertently change the order of rows (which completely breaks any valid significance you hope to attain). Additionally, since I’m going to recommend putting all data into one frame, let’s encode the plant number as well (even if we don’t use it immediately here).

For that, then:

df1 <- data.frame(plant = "A", month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(plant = "B", month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(plant = "C", month = 1:7, GPP = 3:9, NPP = 3:9)

From this, I’m a huge fan of having all data in one frame. This is well-informed by https://stackoverflow.com/a/24376207/3358227, where a premise is that if you’re going to do the same thing to a bunch of frames, it should either be a list-of-frames or one combined frame (that keeps the source id encoded):

dfs <- do.call(rbind, list(df1, df2, df3))
### just a sampling, for depiction
dfs[c(1:2, 10:12, 17:19),]
#    plant month GPP NPP
# 1      A     1   1   1
# 2      A     2   2   2
# 10     A    10  10  10
# 11     B     1   2   2
# 12     B     2   3   3
# 17     B     7   8   8
# 18     C     1   3   3
# 19     C     2   4   4

base R

aggregate(cbind(GPP, NPP) ~ month, data = dfs, FUN = mean, na.rm = TRUE)
#    month GPP NPP
# 1      1   2   2
# 2      2   3   3
# 3      3   4   4
# 4      4   5   5
# 5      5   6   6
# 6      6   7   7
# 7      7   8   8
# 8      8   8   8
# 9      9   9   9
# 10    10  10  10

dplyr

library(dplyr)
dfs %>%
group_by(month) %>%
summarize(across(c(GPP, NPP), mean))
# # A tibble: 10 x 3
#    month   GPP   NPP
#    <int> <dbl> <dbl>
#  1     1     2     2
#  2     2     3     3
#  3     3     4     4
#  4     4     5     5
#  5     5     6     6
#  6     6     7     7
#  7     7     8     8
#  8     8     8     8
#  9     9     9     9
# 10    10    10    10

Side point: two bits of data you are "losing" in this summary is the size of data and the variability of each month. You might include them with:

dfs %>%
group_by(month) %>%
summarize(across(c(GPP, NPP), list(mu = ~ mean(.), sigma = ~ sd(.), len = ~ length(.))))
# # A tibble: 10 x 7
#    month GPP_mu GPP_sigma GPP_len NPP_mu NPP_sigma NPP_len
#    <int>  <dbl>     <dbl>   <int>  <dbl>     <dbl>   <int>
#  1     1      2         1       3      2         1       3
#  2     2      3         1       3      3         1       3
#  3     3      4         1       3      4         1       3
#  4     4      5         1       3      5         1       3
#  5     5      6         1       3      6         1       3
#  6     6      7         1       3      7         1       3
#  7     7      8         1       3      8         1       3
#  8     8      8        NA       1      8        NA       1
#  9     9      9        NA       1      9        NA       1
# 10    10     10        NA       1     10        NA       1

In this case, an average of 8 may be meaningful, but noting that it is a length of 1 is also informative of the "strength" of that statistic (i.e., weak). ﻿