Grab data from strings in R using regular expression

Question

My string looks like this:

"Interest.USD,Vol=[Integrated,(0,0.101),(0.2,0.108),(1,0.110),(2,0.106),
(3,0.102),(4,0.09),(5,0.091),(6,0.09128272)],Drift=[Integrated,(0.002,0.09),
(0.24,0.0007),(0.4,0.007),(1,-0.033),(2,-0.005),(3,-0.0041),
(4,-0.3505),(5,-0.65),(7,-0.08346),(8,-0.049),(9,-0.0613),(10,-0.019)],
Risk_Neutral=YES,Lambda=0.09,FX_Volatility=0.01,FX_Correlation=0.9"

I want to grab the data following the "Vol" and "Drift" in a matrix format like:

Vol matrix:

0,0.101
0.2,0.108
1,0.110
2,0.106
3,0.102
4,0.09
5,0.091
6,0.09128272

and also a single value such as 0.09 for Lambda. I guess I should use a regular expression, but I'm not that familiar with them. Any suggestions? :)

P.S. I tried using:

str_extract_all(text,'[ .+? ]')

to try to get the data between [ and ], but it returns "."

You should use regular expressions. Have you tried learning how to use them?
– Señor O, Commented Jun 19, 2014 at 14:25
@SeñorO Hi, thank you for the comment. I edited my question to show what I have tried. Any suggestions for the code are welcome :)
– Louisyan, Commented Jun 19, 2014 at 14:33
@AvinashRaj Sorry, that was a mistake. There are no newlines in my input.
– Louisyan, Commented Jun 19, 2014 at 14:46

MrFlick · Accepted Answer · 2014-06-19 15:05:21Z

Here's a way to extract those values in R. Let's assume that the string you posted is stored in a variable named a. To make things easier, I'm going to use a helper function, regcapturedmatches(). It is not part of base R, so it has to be defined or sourced first. Then you can do

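# Pass 1: capture the section name and the text inside [Integrated,...]
expr <- "(Vol|Drift)=\\[Integrated,([^\\]]*)\\]"
mm <- regcapturedmatches(a, gregexpr(expr, a, perl=TRUE))[[1]]
# Pass 2: capture both numbers of every "(x,y)" pair in each section
expr <- "\\(([^,]+),([^,]+)\\)"
vv <- regcapturedmatches(mm[,2], gregexpr(expr, mm[,2], perl=TRUE))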

First we do a pass to extract the Vol and Drift sections into mm, and then we split each section's comma-delimited list of pairs into vv. Now we can combine the data into one large data.frame

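# Pair each section name with its numeric matrix, one data.frame per section
tt <- Map(data.frame, col=mm[,1], val=lapply(vv, 
    function(x) {class(x)<-"numeric"; x}))
# Stack the per-section data.frames into one
dd <- do.call(rbind, unname(tt))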

In the end dd will look like

     col  val.1       val.2
1    Vol  0.000  0.10100000
2    Vol  0.200  0.10800000
3    Vol  1.000  0.11000000
4    Vol  2.000  0.10600000
5    Vol  3.000  0.10200000
6    Vol  4.000  0.09000000
7    Vol  5.000  0.09100000
8    Vol  6.000  0.09128272
9  Drift  0.002  0.09000000
10 Drift  0.240  0.00070000
11 Drift  0.400  0.00700000
12 Drift  1.000 -0.03300000
13 Drift  2.000 -0.00500000
14 Drift  3.000 -0.00410000
15 Drift  4.000 -0.35050000
16 Drift  5.000 -0.65000000
17 Drift  7.000 -0.08346000
18 Drift  8.000 -0.04900000
19 Drift  9.000 -0.06130000
20 Drift 10.000 -0.01900000

This method allows for any number of repeated values in each of those sections.
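
Since regcapturedmatches() is a helper rather than a base R function, the same two passes can also be sketched with base R only. This is a minimal sketch, not the code from this answer: it assumes the string layout shown in the question, uses regmatches() with gregexpr() and regexec(), and the names sec_expr, pair_expr, to_matrix and mats are made up for illustration.

```r
# The string from the question, joined without newlines
a <- paste0(
  "Interest.USD,Vol=[Integrated,(0,0.101),(0.2,0.108),(1,0.110),(2,0.106),",
  "(3,0.102),(4,0.09),(5,0.091),(6,0.09128272)],Drift=[Integrated,(0.002,0.09),",
  "(0.24,0.0007),(0.4,0.007),(1,-0.033),(2,-0.005),(3,-0.0041),",
  "(4,-0.3505),(5,-0.65),(7,-0.08346),(8,-0.049),(9,-0.0613),(10,-0.019)],",
  "Risk_Neutral=YES,Lambda=0.09,FX_Volatility=0.01,FX_Correlation=0.9"
)

# Pass 1: find every "Name=[Integrated,...]" section, then split each match
# into its capture groups (column 2 = name, column 3 = text inside brackets)
sec_expr <- "(Vol|Drift)=\\[Integrated,([^]]*)\\]"
sections <- regmatches(a, gregexpr(sec_expr, a))[[1]]
parts <- do.call(rbind, regmatches(sections, regexec(sec_expr, sections)))

# Pass 2: turn one section body such as "(0,0.101),(0.2,0.108)" into a
# two-column numeric matrix (hypothetical helper, for illustration only)
pair_expr <- "\\(([^,]+),([^,)]+)\\)"
to_matrix <- function(body) {
  pairs <- regmatches(body, gregexpr(pair_expr, body))[[1]]
  m <- do.call(rbind, regmatches(pairs, regexec(pair_expr, pairs)))
  matrix(as.numeric(m[, 2:3]), ncol = 2)
}

# Named list with one matrix per section
mats <- setNames(lapply(parts[, 3], to_matrix), parts[, 2])
```

With this sketch, mats$Vol and mats$Drift hold one row per (x,y) pair, however many pairs each section contains.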

If you just want simple matrices, then

Map(function(a,b) {class(b)<-"numeric"; b}, mm[,1], 
    lapply(vv, function(x) {class(x)<-"numeric"; x}))

will give you a named list of the matrices.
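
The question also asks for single values such as Lambda, which the code above does not extract. A minimal base R sketch, assuming each scalar field has the form key=value and ends at the next comma; get_value is a hypothetical helper name, not something defined elsewhere in this thread.

```r
# The string from the question, joined without newlines
a <- paste0(
  "Interest.USD,Vol=[Integrated,(0,0.101),(0.2,0.108),(1,0.110),(2,0.106),",
  "(3,0.102),(4,0.09),(5,0.091),(6,0.09128272)],Drift=[Integrated,(0.002,0.09),",
  "(0.24,0.0007),(0.4,0.007),(1,-0.033),(2,-0.005),(3,-0.0041),",
  "(4,-0.3505),(5,-0.65),(7,-0.08346),(8,-0.049),(9,-0.0613),(10,-0.019)],",
  "Risk_Neutral=YES,Lambda=0.09,FX_Volatility=0.01,FX_Correlation=0.9"
)

# Hypothetical helper: return the text after "key=" up to the next comma,
# or NA if the key is absent. Intended for scalar fields only, not for the
# bracketed Vol/Drift sections.
get_value <- function(key, x) {
  m <- regmatches(x, regexec(paste0("(^|,)", key, "=([^,]*)"), x))[[1]]
  if (length(m) == 0) NA_character_ else m[3]
}

lambda <- as.numeric(get_value("Lambda", a))
risk_neutral <- get_value("Risk_Neutral", a)
```

Anchoring the key on the start of the string or a preceding comma keeps a short key from matching inside a longer field name.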

@Louisyan if you find this answer helpful, don't forget to accept it.

Avinash Raj · Answer · 2014-06-19 14:39:48Z

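You could try this regex. The values inside each pair of parentheses are captured into separate groups, which can then be referenced by group number. Note that the pattern hard-codes eight groups, so it captures exactly eight pairs from the Vol section.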

Vol=.*\(([\d,.]+)\).*\(([\d,.]+)\).*\(([\d,.]+)\).*\(([\d,.]+)\).*\(([\d,.]+)\).*\(([\d,.]+)\).*\(([\d,.]+)\).*\(([\d,.]+)\).*(?=,Drift)

DEMO

See the captured groups on the right-hand side of the demo page.
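
To use this pattern from R, the backslashes have to be doubled inside the string literal, and the lookahead (?=,Drift) needs perl = TRUE. The following is only a sketch of one way to apply it: the pattern is assembled with rep() to avoid typing the group eight times, the capture groups are read from the "capture.start" and "capture.length" attributes that regexpr() returns when perl = TRUE, and the variable names are illustrative. It inherits the hard-coded eight groups of the pattern.

```r
# The string from the question, joined without newlines
a <- paste0(
  "Interest.USD,Vol=[Integrated,(0,0.101),(0.2,0.108),(1,0.110),(2,0.106),",
  "(3,0.102),(4,0.09),(5,0.091),(6,0.09128272)],Drift=[Integrated,(0.002,0.09),",
  "(0.24,0.0007),(0.4,0.007),(1,-0.033),(2,-0.005),(3,-0.0041),",
  "(4,-0.3505),(5,-0.65),(7,-0.08346),(8,-0.049),(9,-0.0613),(10,-0.019)],",
  "Risk_Neutral=YES,Lambda=0.09,FX_Volatility=0.01,FX_Correlation=0.9"
)

# One capture group for the contents of a "(x,y)" pair, repeated eight times
pair <- "\\(([\\d,.]+)\\)"
expr <- paste0("Vol=.*", paste(rep(pair, 8), collapse = ".*"), ".*(?=,Drift)")

# With perl = TRUE, regexpr() reports where each capture group matched
m <- regexpr(expr, a, perl = TRUE)
starts <- as.vector(attr(m, "capture.start"))
lens <- as.vector(attr(m, "capture.length"))
groups <- substring(a, starts, starts + lens - 1)

# Split each "x,y" group on the comma and stack the rows into a numeric matrix
vol <- do.call(rbind, lapply(strsplit(groups, ",", fixed = TRUE), as.numeric))
```

Because the number of groups is fixed, this approach needs a different pattern for the Drift section, which has twelve pairs and negative values.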
