I have a data manipulation problem that I hope someone can help me with. I have a big dataset with 10 columns: "id" and 3 sets of similar variables, "type", "startdate", and "enddate". An example is shown below.

  id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3
1  1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18
2  2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15

    enddate3
1 2007-09-27
2 2008-02-07

I would like to obtain the following cleaned and sorted dataset:

  id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3
1  1     W 2006-08-01 2007-08-29     A 2006-08-20 2006-12-06     P 2007-08-18
2  2     A 2006-01-05 2007-07-02     Q 2008-01-15 2008-02-07    NA         NA

    enddate3
1 2007-09-27
2         NA

Within every row/observation, I would like to sort the sets in ascending order of "startdate". For row 1, the second set of variables has an earlier "startdate" (2006-08-01) than the first set (2006-08-20), so it should move to the first position.

As for row 2, I would like to push all the NAs to the end.

Any tips on how I can do this efficiently?

Should I convert "startdate" and "enddate" to numeric? If so, how should I handle "NA"?

Is it wise to apply paste() to the (type, startdate, enddate) values of each of the 3 sets?

Appreciate any help! Thank you in advance!

3 Answers

2

Same approach as Mikko Marttila's answer, but using only base R (no add-on packages):

> ## use vectors of class Date
> df[c(3,4,6,7,9,10)] <- lapply(df[c(3,4,6,7,9,10)], as.Date)

> ## reshape to long format
> df.1 <- reshape(df, idvar=1,
+                 varying=list(c(2,5,8), c(3,6,9), c(4,7,10)),
+                 v.names=c('type', 'startdate', 'enddate'),
+                 times=c(1,2,3), timevar='group', direction='long')
> df.1
#     id group type  startdate    enddate
# 1.1  1     1    A 2006-08-20 2006-12-06
# 2.1  2     1    A 2006-01-05 2007-07-02
# 1.2  1     2    W 2006-08-01 2007-08-29
# 2.2  2     2 <NA>       <NA>       <NA>
# 1.3  1     3    P 2007-08-18 2007-09-27
# 2.3  2     3    Q 2008-01-15 2008-02-07

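> ## reset group variable to the rank of startdate within each id
> ## (order(order(x)) gives ranks with NA last; a plain order() returns the
> ## sorting permutation, which matches the rank only in special cases)
> df.1$group <- with(df.1, unsplit(lapply(split(startdate, id),
+                                         function(x) order(order(x))), id))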
> df.1
#     id group type  startdate    enddate
# 1.1  1     2    A 2006-08-20 2006-12-06
# 2.1  2     1    A 2006-01-05 2007-07-02
# 1.2  1     1    W 2006-08-01 2007-08-29
# 2.2  2     3 <NA>       <NA>       <NA>
# 1.3  1     3    P 2007-08-18 2007-09-27
# 2.3  2     2    Q 2008-01-15 2008-02-07

> ## back to wide format
> df.2 <- reshape(df.1[order(df.1$group), ], idvar=1,
+                 v.names=c('type', 'startdate', 'enddate'), timevar='group',
+                 direction='wide')

> ## sort by id
> df.2 <- df.2[order(df.2$id), ]

> df.2
#     id type.1 startdate.1  enddate.1 type.2 startdate.2  enddate.2 type.3
# 1.2  1      W  2006-08-01 2007-08-29      A  2006-08-20 2006-12-06      P
# 2.1  2      A  2006-01-05 2007-07-02      Q  2008-01-15 2008-02-07   <NA>
#     startdate.3  enddate.3
# 1.2  2007-08-18 2007-09-27
# 2.1        <NA>       <NA>
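
A note on the group-renumbering step: it needs, for every row, that row's position in the sorted order (its rank), which is not in general the same vector as the permutation returned by order(). The small sketch below uses made-up dates (they are not from the question) to show the difference; order(order(x)) is one way to get ranks with NAs placed last.

```r
# Made-up example dates, including one missing value
x <- as.Date(c("2008-03-01", "2006-01-05", NA, "2007-07-02"))

perm  <- order(x)          # indices that would sort x; NA goes last by default
ranks <- order(order(x))   # position of each element in the sorted vector

perm    # 2 4 1 3
ranks   # 3 1 4 2
```

For the two rows in the question, both vectors happen to coincide, so the difference only shows up on other data.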

1

Here's a solution using dplyr and tidyr that relies on converting the dataset into long format, reordering as desired, and then converting back to wide format. The conversion to long format coerces values to character, so column types need to be reapplied.

library(tidyr)
library(dplyr)

df <- read.table(header = TRUE, text = "
id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
 1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18 2007-09-27
 2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15 2008-02-07
")

df %>%
    gather(key, value, -id) %>%  # convert to long format
    extract(key, c("var", "seq"), "(.*)(\\d)") %>%  # extract sequence number
    spread(var, value) %>%  # spread to wide format by id and sequence
    group_by(id) %>%
    arrange(startdate) %>%  # sort seq by startdate in id groups
    mutate(seq = 1:n()) %>%  # calculate new sequence order
    gather(key, value, -id, -seq) %>%  # convert to long format
    transmute(var = paste0(key, seq), value) %>%  # generate wide format names
    spread(var, value) %>%  # spread back to wide format
    select(one_of(names(df))) %>%  # restore original column order
    mutate(across(contains("date"), as.Date))
        # reapply date type to the date columns; across() replaces the
        # deprecated mutate_each() and needs dplyr >= 1.0.0

#     Source: local data frame [2 x 10]
#     Groups: id [2]
#     
#          id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
#       (int) (chr)     (date)     (date) (chr)     (date)     (date) (chr)     (date)     (date)
#     1     1     W 2006-08-01 2007-08-29     A 2006-08-20 2006-12-06     P 2007-08-18 2007-09-27
#     2     2     A 2006-01-05 2007-07-02     Q 2008-01-15 2008-02-07    NA       <NA>       <NA>
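
With newer package versions, the same long-sort-wide idea can probably be written more compactly with pivot_longer() and pivot_wider(). This is only a sketch, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0 and that the dates are read in as character strings; the result name `out` is just for illustration.

```r
library(dplyr)
library(tidyr)

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
 1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18 2007-09-27
 2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15 2008-02-07
")

out <- df %>%
    # one row per (id, set); ".value" keeps type/startdate/enddate as columns
    pivot_longer(-id, names_to = c(".value", "seq"),
                 names_pattern = "(.*)(\\d)") %>%
    mutate(across(ends_with("date"), as.Date)) %>%
    arrange(id, startdate) %>%        # arrange() puts NA last
    group_by(id) %>%
    mutate(seq = row_number()) %>%    # new sequence number within each id
    ungroup() %>%
    pivot_wider(names_from = seq,
                values_from = c(type, startdate, enddate),
                names_sep = "") %>%   # builds names like type1, startdate1
    select(all_of(names(df)))         # restore original column order
```

Because the ".value" sentinel keeps each variable in its own column, the values are not forced into a single character column, and the dates only need to be converted once.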

1

We can use rbind.fill from the plyr package. That function matches columns by name, which on its own would leave the NAs where they are. So for each row we first drop the NAs and then relabel the remaining values with the leading column names of the original data frame; rbind.fill then pads the shorter rows with NA on the right.

library(plyr)

df <- data.frame("obs" = seq(3),
                 type1 = c(2,2,NA),date1 = c("date11","date21",NA), 
                 type2 = c(3,NA,5),date2 = c("date12",NA,"date31"),
                 type3 = c(4,3,1), date3 = c("date13","date22","date32"),
                 type4 = c(4,4,NA),date4 = c("date14","date23",NA))
df
#    obs type1  date1 type2  date2 type3  date3 type4  date4
#    1   1     2 date11     3 date12     4 date13     4 date14
#    2   2     2 date21    NA   <NA>     3 date22     4 date23
#    3   3    NA   <NA>     5 date31     1 date32    NA   <NA>

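## lapply always returns a list of one-row data frames, which is what
## rbind.fill expects (sapply may try to simplify the result)
newdf <- lapply(seq_len(nrow(df)), function(i){
    newrow <- df[i, !is.na(df[i,]), drop = FALSE]  ## Remove NA's
    names(newrow) <- names(df)[seq_along(newrow)]  ## Apply names

    newrow                                         ## Output
})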

rbind.fill(newdf)
#    obs type1  date1 type2  date2 type3  date3 type4  date4
#    1   1     2 date11     3 date12     4 date13     4 date14
#    2   2     2 date21     3 date22     4 date23    NA   <NA>
#    3   3     5 date31     1 date32    NA   <NA>    NA   <NA>

Caution: this only works if, within each group, the type and the date are either both observed or both NA.
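
Pushing and sorting can also be combined by moving each (type, startdate, enddate) group as a unit, which avoids the problem of partially missing groups. Below is a minimal base R sketch on the question's data. The helper name `sort_groups` and its arguments are made up for illustration, and it assumes the columns are character (not factor) and named with suffixes 1..k. The row loop is written for clarity and may be slow on a very large dataset.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
 1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18 2007-09-27
 2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15 2008-02-07
")

## Hypothetical helper: reorder the groups of every row by startdate,
## with groups whose startdate is NA pushed to the right.
sort_groups <- function(d, vars = c("type", "startdate", "enddate"), k = 3) {
    start_cols <- paste0("startdate", seq_len(k))
    out <- d
    for (i in seq_len(nrow(d))) {
        ## order() places NA last by default
        o <- order(as.Date(unlist(d[i, start_cols])))
        for (v in vars) {
            ## copy the groups into positions 1..k in sorted order
            out[i, paste0(v, seq_len(k))] <- d[i, paste0(v, o)]
        }
    }
    out
}

res <- sort_groups(df)
```

Here `res` keeps the original column names and classes; the dates stay character and can be converted with as.Date afterwards if needed.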

2 Comments

I JUST saw that you want the push to be dependent on the dates. I think essentially you are asking two questions - 1): how to push and 2): how to sort. I have only responded to the first question.
Thanks a lot! This seems very useful as my dataset is very sparse, and I would really need to push the NAs to the right.
