I have a vector of strings in R that stores file names.

Each file name contains a date in the format 'YYYYMMDD'. Sample file names:

"ext-SM_OPER_MIR_CLF31A_20150506T000000_20150506T235959_300_002_7_1.DBL.nc" "ext-SM_RE04_MIR_CLF31A_20150505T000000_20150505T235959_300_001_7_1.DBL.nc"

I would like to sort the vector by the date in each file name, so that the files with the earliest date come first. Unfortunately, the sort function in R has no parameter for a regex-based sorting criterion. How can I do this?

My sample data:

files <- c("ext-SM_OPER_MIR_CLF31A_20150506T000000_20150506T235959_300_002_7_1.DBL.nc", 
"SMAP_L3_SM_AP_20150422_R13080_001.h5.tif","SMAP_L3_SM_AP_20150606_R13080_001.h5.tif",
"ext-SM_OPER_MIR_CLF31A_20150530T000000_20150530T235959_300_003_7_1.DBL.nc",
"ext-SM_RE04_MIR_CLF31A_20150418T000000_20150418T235959_300_001_7_1.DBL.nc", 
"ext-SM_RE04_MIR_CLF31A_20150419T000000_20150419T235959_300_001_7_1.DBL.nc")
  • You will have to extract the date, which can then be used to order the original vector. Commented Nov 19, 2016 at 13:51
  • In Python I would do that as 'import re; dates = re.findall(r'(\d{8})', FileName); dates[0]'. I have no idea how to do that in R (I've tried grep). And what should I do next once I have a vector of extracted dates? Commented Nov 19, 2016 at 13:58

3 Answers

This should work:

files[order(as.Date(regmatches(files, regexpr("(?<=_)[0-9]{8}", files, perl = TRUE)), format = "%Y%m%d"))]

Edit: this is the same approach as the other answers: extract the dates, convert them to Date objects, then use them to reorder files.
The idea behind the regex is to extract a run of 8 digits ([0-9]{8}) that occurs right after a _ character ((?<=_)). regexpr returns only the first such match in each file name.
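
The one-liner can also be unpacked into separate steps, which makes each stage easier to inspect. This is only a minimal sketch on a three-file subset of the sample data; the intermediate names m, date_strings, dates and sorted_files are made up for illustration, and the length check is an added safeguard that is not part of the original answer.

```r
files <- c(
  "ext-SM_OPER_MIR_CLF31A_20150506T000000_20150506T235959_300_002_7_1.DBL.nc",
  "SMAP_L3_SM_AP_20150422_R13080_001.h5.tif",
  "ext-SM_RE04_MIR_CLF31A_20150418T000000_20150418T235959_300_001_7_1.DBL.nc"
)

# Step 1: find the first run of 8 digits that directly follows an underscore.
# The lookbehind (?<=_) requires perl = TRUE.
m <- regexpr("(?<=_)[0-9]{8}", files, perl = TRUE)

# Step 2: pull out the matched text. regmatches() drops elements without a
# match, so verify that every file name yielded a date (added safeguard).
date_strings <- regmatches(files, m)
stopifnot(length(date_strings) == length(files))

# Step 3: parse the strings as dates and reorder the file names by them.
dates <- as.Date(date_strings, format = "%Y%m%d")
sorted_files <- files[order(dates)]
sorted_files
```

If a file name without a date slips into the vector, the stopifnot call fails instead of letting order() silently work on a shorter vector.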


You can use stringi to extract the dates and sort, i.e.

library(stringi)
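# Take the first 8-digit run from each file name: one date per file,
# even when two files share a date or the year is not 2015.
v1 <- stri_extract_first_regex(files, '[0-9]{8}')
ind <- order(as.POSIXct(v1, format = '%Y%m%d'))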
files[ind]
#[1] "ext-SM_RE04_MIR_CLF31A_20150418T000000_20150418T235959_300_001_7_1.DBL.nc"
#[2] "ext-SM_RE04_MIR_CLF31A_20150419T000000_20150419T235959_300_001_7_1.DBL.nc"
#[3] "SMAP_L3_SM_AP_20150422_R13080_001.h5.tif"                                 
#[4] "ext-SM_OPER_MIR_CLF31A_20150506T000000_20150506T235959_300_002_7_1.DBL.nc"
#[5] "ext-SM_OPER_MIR_CLF31A_20150530T000000_20150530T235959_300_003_7_1.DBL.nc"
#[6] "SMAP_L3_SM_AP_20150606_R13080_001.h5.tif"  


Here is what you can do:

temp <- as.Date(sub('^\\S+\\_([0-9]{8})[T\\_][0A-Z]\\S+', '\\1', files), "%Y%m%d")
files[order(temp)]

# [1] "ext-SM_RE04_MIR_CLF31A_20150418T000000_20150418T235959_300_001_7_1.DBL.nc"
# [2] "ext-SM_RE04_MIR_CLF31A_20150419T000000_20150419T235959_300_001_7_1.DBL.nc"
# [3] "SMAP_L3_SM_AP_20150422_R13080_001.h5.tif"                                 
# [4] "ext-SM_OPER_MIR_CLF31A_20150506T000000_20150506T235959_300_002_7_1.DBL.nc"
# [5] "ext-SM_OPER_MIR_CLF31A_20150530T000000_20150530T235959_300_003_7_1.DBL.nc"
# [6] "SMAP_L3_SM_AP_20150606_R13080_001.h5.tif"  

The first step is to extract the dates from the file names and store them in the variable temp; the second is to sort the vector files according to the order of those dates.

The regex works like this:

Anchor at the start of the file name (^), match one or more non-whitespace characters (\\S+), then an underscore (\\_), then the eight digits of the date, captured in a group by the parentheses (([0-9]{8})). After that, match a T or an underscore ([T\\_]), followed by a 0 or an uppercase letter ([0A-Z]), and then the rest of the name (\\S+). The [0A-Z] check keeps the greedy \\S+ from settling on the second date in the ext- files, which is followed by T2. Because the pattern matches the whole file name, sub replaces it with the captured group (\\1), leaving only the date.
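
To see the capture group at work, the pattern can be tried on a single file name. This is a small sketch, not part of the original answer: the names pattern, x, date_string and unmatched are illustrative, and the underscore is written without the backslash because it needs no escaping. The second call shows what happens when a name does not fit the pattern.

```r
pattern <- "^\\S+_([0-9]{8})[T_][0A-Z]\\S+"

x <- "ext-SM_OPER_MIR_CLF31A_20150506T000000_20150506T235959_300_002_7_1.DBL.nc"

# The pattern matches the whole name, so sub() replaces all of it
# with the captured 8-digit group.
date_string <- sub(pattern, "\\1", x)
as.Date(date_string, "%Y%m%d")

# When nothing matches, sub() returns its input unchanged,
# and as.Date() then yields NA for it.
unmatched <- sub(pattern, "\\1", "notes.txt")
is.na(as.Date(unmatched, "%Y%m%d"))
```

Since order() places NA values last by default, file names without a recognisable date would end up at the end of the sorted vector rather than raising an error.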
