This question is a follow-up to my previous question, "Efficient recursive random sampling". The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, consider a dataset like the following:

   ID1 ID2
1    A   1
2    A   6
3    B   1
4    B   2
5    B   3
6    C   4
7    C   5
8    C   6
9    D   6
10   D   7
11   D   8
12   D   9

Here we want to randomly sample up to n ID2 values for each ID1, and to do so recursively. "Recursively" means that we move from the first ID1 to the last, and an ID2 that was already sampled for one ID1 must not be used for any subsequent ID1. With n = 2, the expected result would be as follows:

   ID1 ID2
1    A   1
2    A   6
4    B   2
5    B   3
6    C   4
7    C   5
11   D   8
12   D   9
  • For ID1 = "A", there are exactly two potential ID2, so both are selected.
  • For ID1 = "B", there are two potential ID2 left to select, so both are selected.
  • For ID1 = "C", there are two potential ID2 left to select, so both are selected.
  • For ID1 = "D", there are three potential ID2 left to sample from, so two are randomly selected from those.

What can happen beyond the situation shown in the example:

  • Every ID1 always has a non-zero number of ID2 in the data; however, it is possible that all of them were already used. In that case, that ID1 should simply be left out.
  • It is possible that no ID1 has the specified n ID2 available. In that case, as many ID2 as are available (the number closest to n) should be retrieved.
  • ID2 doesn't have to be seq(ID1).
  • ID2 could also be a character vector, like ID1.

Sample df:

df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C", 
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 
9)), class = "data.frame", row.names = c(NA, -12L))
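
The rules listed above can be turned into a small check for any candidate result. This is only a sketch: `is_valid_sample` is a hypothetical helper name, and it assumes that ID1 and ID2 contain no tab characters (a tab is used to build a row key). It verifies that every returned row exists in the input, that no ID2 is reused, and that no ID1 has more than n rows. It does not verify that each ID1 received as many ID2 as were available.

```r
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8,
9)), class = "data.frame", row.names = c(NA, -12L))

# Hypothetical helper: TRUE if `out` respects the constraints for `df` and `n`.
is_valid_sample <- function(df, out, n) {
  # Row key built from both columns (assumes no tab characters in the IDs)
  key <- function(d) paste(d$ID1, d$ID2, sep = "\t")
  all(key(out) %in% key(df)) &&   # every sampled row exists in the input
    !anyDuplicated(out$ID2) &&    # no ID2 is used more than once
    all(table(out$ID1) <= n)      # at most n ID2 per ID1
}

# The expected result from the question (rows 1, 2, 4, 5, 6, 7, 11, 12)
expected <- df[c(1, 2, 4, 5, 6, 7, 11, 12), ]
is_valid_sample(df, expected, 2)  # TRUE
is_valid_sample(df, df, 2)        # FALSE: ID2 values repeat and groups exceed n
```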
  • Is ID2 always seq(ID1), or can you just have 1, 6, for example? Commented Jan 27, 2022 at 17:33
  • @Onyambu it could be any number. Also, ID2 could be theoretically even a string. Commented Jan 27, 2022 at 17:37
  • Just saying, the example you gave might be misleading. If possible, please switch up some ID2 values so that they do not form a sequence. For example, my first thought was to drop all groups of size 2, since they cannot be sampled once A, which has size 2, has already been sampled. You know what I mean? E.g., if E had size 2, we could not sample from it. This idea is wrong because E can have ID2 = 10, 20. So please try to make the data more intriguing. Hope this helps. Commented Jan 27, 2022 at 17:41
  • Does it have to be a data.frame? Commented Jan 27, 2022 at 17:44
  • @Onyambu thanks for your remarks, I updated it accordingly. I also agree that it's actually a mix of random sampling and selection of all available values, depending on the group size. Commented Jan 27, 2022 at 17:48

3 Answers

The following function seems to give what you are after. It loops through each group of ID1 and keeps the rows whose ID2 has not been sampled yet. It then samples from those rows and keeps only the distinct ones (in case some group of ID1 has duplicate ID2 values). The sample size is the minimum of n and the number of rows available for that group.

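library(dplyr)

# Note: this definition masks base::sample() in the calling environment.
sample <- function(df, n) {
  `%notin%` <- Negate(`%in%`)
  groups <- unique(df$ID1)
  # Accumulates the sampled rows across groups
  out <- data.frame(ID1 = character(), ID2 = character())

  for (group in groups) {
    # Rows of the current group whose ID2 has not been sampled yet
    options <- df %>%
      filter(ID1 == group,
             ID2 %notin% out$ID2)

    # Sample up to n rows, then drop duplicated rows
    chosen <- sample_n(options,
                       size = min(n, nrow(options))) %>%
      distinct()

    out <- rbind(out, chosen)
  }

  out
}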

set.seed(123)
sample(df, 2)

  ID1 ID2
1   A   1
2   A   6
3   B   2
4   B   3
5   C   4
6   C   5
7   D   8
8   D   9

Case where a group of ID1 has ID2s that were already used up.

Input:

# A tibble: 10 × 2
   ID1     ID2
   <chr> <dbl>
 1 A         1
 2 A         3
 3 B         1
 4 B         3
 5 C         5
 6 C         6
 7 C         7
 8 C         7
 9 D        10
10 D        20

Output:

sample(df2, 2)
# A tibble: 6 × 2
  ID1     ID2
  <chr> <dbl>
1 A         3
2 A         1
3 C         6
4 C         7
5 D        20
6 D        10
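
The same loop can also be sketched in base R, without dplyr. This is a variant under stated assumptions, not the function above: `sample_unused` is a hypothetical name (chosen so that `base::sample()` is not masked), duplicated ID2 values within a group are dropped before sampling rather than after, and rows are returned in their original order.

```r
# Hypothetical base R variant; groups are processed in order of first appearance.
sample_unused <- function(df, n) {
  used <- df$ID2[0]      # ID2 values taken so far (same type as df$ID2)
  picked <- integer(0)   # row indices of the chosen rows
  groups <- split(seq_len(nrow(df)), factor(df$ID1, levels = unique(df$ID1)))
  for (rows in groups) {
    # Candidate rows: ID2 not used yet and not duplicated within the group
    rows <- rows[!(df$ID2[rows] %in% used) & !duplicated(df$ID2[rows])]
    # sample.int() on positions avoids sample()'s special handling of a
    # single number, and also works when no candidates are left
    take <- rows[sample.int(length(rows), min(n, length(rows)))]
    picked <- c(picked, take)
    used <- c(used, df$ID2[take])
  }
  df[sort(picked), , drop = FALSE]
}

df <- data.frame(
  ID1 = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)
)
set.seed(123)
sample_unused(df, 2)
```

Deduplicating before sampling means that a group with repeated ID2 values can still return n distinct ID2 whenever that many are available.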

1 Comment

This seems very promising, thanks a lot :) I will leave it open so that I can also award the bounty.

Here is a base R option that walks through a contingency table of ID1 against ID2 and zeroes out each ID2 column once it has been sampled:

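d <- table(df)          # contingency table: rows are ID1, columns are ID2
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
  idx <- which(d[i, ] > 0)   # ID2 columns still available for this ID1
  # Groups with fewer than 2 available ID2 are skipped entirely
  if (length(idx) >= 2) {
    j <- sample(idx, 2)
    res[[i]] <- nms$ID2[j]
    d[, j] <- 0              # mark the sampled ID2 as used for later groups
  }
}
# Reshape the list into a data frame with the original column names
dfout <- type.convert(
  setNames(rev(stack(res)), names(df)),
  as.is = TRUE
)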

which gives

  ID1 ID2
1   A   6
2   A   1
3   B   2
4   B   3
5   C   4
6   C   5
7   D   7
8   D   8

For a case where some ID2 values are already used up, e.g.,

> (df <- structure(list(ID1 = c(
+   "A", "A", "B", "B", "B", "C", "C", "C",
+   "D", "D", "D", "D"
+ ), ID2 = c(
+   1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, .... [TRUNCATED]
   ID1 ID2
1    A   1
2    A   3
3    B   1
4    B   2
5    B   3
6    C   3
7    C   4
8    C   5
9    D   4
10   D   5
11   D   6
12   D   1

we will obtain

  ID1 ID2
1   A   1
2   A   3
3   C   5
4   C   4
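
In the output above, B and D are dropped because each has only one unused ID2 left, while the question asks for up to n per group. A possible adjustment of the same table-based loop is sketched below. The function name `sample_by_table` and the `n` argument are assumptions added for illustration, and `table()` orders ID1 by sorted level, so this assumes the sorted order is the intended processing order.

```r
# Hypothetical variant of the table-based loop that takes up to n ID2 per ID1.
sample_by_table <- function(df, n) {
  d <- table(df)               # contingency table: rows are ID1, columns are ID2
  nms <- dimnames(d)
  res <- list()
  for (i in nms$ID1) {
    idx <- which(d[i, ] > 0)   # ID2 columns still available for this ID1
    if (length(idx) > 0) {
      # Index via sample.int() so that a single remaining column is handled
      # correctly (sample(5, 1) would draw from 1:5 instead)
      j <- idx[sample.int(length(idx), min(n, length(idx)))]
      res[[i]] <- nms$ID2[j]
      d[, j] <- 0              # mark the sampled ID2 as used for later groups
    }
  }
  type.convert(setNames(rev(stack(res)), names(df)), as.is = TRUE)
}

# The data printed above
df <- data.frame(
  ID1 = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  ID2 = c(1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, 1)
)
sample_by_table(df, 2)
```

With this data and n = 2 the selection is fully determined up to row order within a group: A gets 1 and 3, B gets 2, C gets 4 and 5, and D gets 6.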

I don't know whether I am oversimplifying the problem. Take a look at the following and see whether it works in your case:

library(tidyverse)


df %>%
  group_split(ID1) %>%
  reduce(~ bind_rows(.x, .y) %>%
           filter(!duplicated(ID2)) %>%
           group_by(ID1) %>%
           slice_sample(n = 2) %>%
           ungroup(),
         .init = slice_sample(.[[1]], n = 2))

# A tibble: 8 x 2
  ID1     ID2
  <chr> <dbl>
1 A         1
2 A         6
3 B         2
4 B         3
5 C         4
6 C         5
7 D         9
8 D         8

Disclaimer: not vectorized, thus inefficient.
