3

Here is a problem I am trying to solve. Say, I have two data frames like the following:

observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
    time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
    measurement = rnorm(100,5,7))

sampletimes <- data.frame(location = letters[1:20], 
    id = rep(1:4,5),
    time1 = rep(c(2,7,12,17,22), each=4), 
    time2 = rep(c(4,9,14,19,24), each=4))

They both contain a column named id, which links the data frames. I want to have the measurements from observationss for whichtimeis betweentime1andtime2from thesampletimesdata frame. Additionally, I'd like to connect the appropriatelocation` to each measurement.

I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.

However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.

For our example, the data frame I would hope to get back would be as follows:

id time measurement letters[1:20]
1    3  10.5163892             a
2    3   5.5774119             b
3    3  10.5057060             c
4    3  14.1563179             d
1    8   2.2653761             e
2    8  -1.0905546             f
3    8  12.7434161             g
4    8  17.6129261             h
1   13  10.9234673             i
2   13   1.6974481             j
3   13  -0.3664951             k
4   13  13.8792198             l
1   18   6.5038847             m
2   18   1.2032935             n
3   18  15.0889469             o
4   18   0.8934357             p
1   23   3.6864527             q
2   23   0.2404074             r
3   23  11.6028766             s
4   23  20.7466908             t
0

2 Answers 2

4

Here's a proposal with merge:

# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]

The result:

    id time measurement location
11   1    3    7.086246        a
141  2    3    6.893162        b
251  3    3   16.052627        c
376  4    3   -6.559494        d
47   1    8   11.506810        e
137  2    8   10.959782        f
267  3    8   11.079759        g
402  4    8   11.082015        h
83   1   13    5.584257        i
218  2   13   -1.714845        j
283  3   13  -11.196792        k
418  4   13    8.887907        l
99   1   18    1.656558        m
234  2   18   16.573179        n
364  3   18    6.522298        o
454  4   18    1.005123        p
125  1   23   -1.995719        q
250  2   23   -6.676464        r
360  3   23   10.514282        s
490  4   23    3.863357        t
Sign up to request clarification or add additional context in comments.

Comments

3

Not efficient , but do the job :

 subset(merge(observations,sampletimes), time > time1 & time < time2)
        id time measurement location time1 time2
    11   1    3    3.180321        a     2     4
    47   1    8    6.040612        e     7     9
    83   1   13   -5.999317        i    12    14
    99   1   18    2.689414        m    17    19
    125  1   23   12.514722        q    22    24
    137  2    8    4.420679        f     7     9
    141  2    3   11.492446        b     2     4
    218  2   13    6.672506        j    12    14
    234  2   18   12.290339        n    17    19
    250  2   23   12.610828        r    22    24
    251  3    3    8.570984        c     2     4
    267  3    8   -7.112291        g     7     9
    283  3   13    6.287598        k    12    14
    360  3   23   11.941846        s    22    24
    364  3   18   -4.199001        o    17    19
    376  4    3    7.133370        d     2     4
    402  4    8   13.477790        h     7     9
    418  4   13    3.967293        l    12    14
    454  4   18   12.845535        p    17    19
    490  4   23   -1.016839        t    22    24

EDIT

Since you have more than 5 millions rows, you should give a try to a data.table solution:

library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]

4 Comments

Initially, this solution worked. However, when I increased the size of the observations data frame to roughly 5 million rows, it did not work. I got the following error: Error: evaluation nested too deeply: infinite recursion / options(expressions=)?. Every other command I tried produced the same error message.
@Frank I edit my solution to provide a data.table solution. let me know if you get the same error.
My times were in POSIXlt format, and it didn't work, as data.table doesn't support POSIXlt. Once I converted my time columns to POSIXct, it worked.
@Frank works for the 5 millions rows? better to use POSIXct, it is a numerric. POSIXlt is a list , use it only if you want to subset your datetimes (get hour, days,months..)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.