RegEx and Replace

Question

I'm writing a PowerShell script that goes through a poorly formatted XML file and looks for any nodes that have the word "Date" as part of the node name. For example:

<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>

The above pattern is repeated hundreds of times throughout the file, for about 70 MB worth of data.

The real file has lots more nodes and no linefeeds or anything... so it all appears on one line.

What I need to do is scan the file, look for any nodes whose name ends in "Date" where the value is not 4 digits, and replace that value with a 4-digit value.

Here is what I have so far, but it looks like the replace is only changing the first occurrence and not the other matches after it.

Using the example above, it should find the closing </SystemDate> and closing </FileDate> node and see that the digit is only 3 characters and replace with 9999.

 $infile=get-content z:\system.txt
 write-host $infile.Length
 $regex = New-Object System.Text.RegularExpressions.Regex ">\d\d\d</(.*Date)"
 $replace = $regex.Replace($infile,"9999")
 write-host $infile.Length
 write-host $replace.Length
 set-content -Value $replace z:\new_system.txt

Any help would be appreciated!

Zac Thompson · Accepted Answer · 2011-06-16 06:00:50Z

1

( I think you've oversimplified your code ... e.g. you probably mean to say $regex.Replace($infile,">9999</$1") )

Leaving that aside, the first thing I would do would be to make the matching regex more precise: ">\d\d\d</([^>]*Date)" ... I'm assuming that PowerShell's regex implementation is greedy as with other implementations. This might solve the problem right away.

If not, I think the natural thing to do would be to loop over the Matches. But the Replace method claims to replace them all, so I think it should be possible to avoid that.
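As a minimal sketch of the tightened pattern suggested above, the snippet below runs it against an in-memory string modeled on the question's sample (with the `SystemName` closing tag matched in case, which is an assumption about the real data). One PowerShell detail worth hedging on: inside double quotes, `$1` is expanded as a PowerShell variable before the regex engine sees it, so the replacement text is written in single quotes here.

```powershell
# Sample text modeled on the question's snippet: one line, no line breaks.
$text = '<System><SystemName>Acme</SystemName><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>'

# [^>]* cannot cross a '>', so each match stays inside a single closing tag.
$regex = New-Object System.Text.RegularExpressions.Regex '>\d\d\d</([^>]*Date)'

# Single quotes stop PowerShell from expanding $1 as a variable;
# the regex engine then substitutes capture group 1 (the tag name).
$fixed = $regex.Replace($text, '>9999</$1')
$fixed
```

With this input, both `SystemDate` and `FileDate` end up as `9999`, while `SystemNumber` is left alone because its value is not exactly three digits.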

4 Comments

user500741 Over a year ago

I'm not sure I follow what you're saying. The way I read the regex, it finds 3 digits between the closing bracket of an opening tag (from the previous element) and the start of the ending tag of the next element, for anything that has *Time within the tag.

Zac Thompson Over a year ago

The regex is not looking "within the tag" it is looking at your whole text. .* means any text at all. So >\d\d\d</.*Date applied against <Date>123</Date><a>a</a>...otherstuff...<Date>2011</Date> will match everything from >123 up to the second-last character. This is assuming that PowerShell regexes are "greedy".

Elroy Flynn Over a year ago

Kinda late to the party, but I agree with Zac's solution, except that it would match on ">321</Date" within text that is ">321</DateMePlease>". Fix that by using pattern ">\d\d\d</([^>]*Date>)" and adjusting the replacement text accordingly.

Zac Thompson Over a year ago

@Elroy Yah, the OP says both "part of the node name" and "ends in ...", so it's quite possible that your fix would be closer to their actual needs. But it's not clear from the question.
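The points made in these comments can be checked with a short sketch. The first half uses the example string from the comment above to compare `.*` with `[^>]*`; the second half uses a made-up string (`$tricky`, an assumption for illustration) to show what the trailing `>` in the pattern changes.

```powershell
# Example text from the comment thread.
$text = '<Date>123</Date><a>a</a>...otherstuff...<Date>2011</Date>'

# Greedy .* runs on to the LAST "Date" in the text.
$greedy = [regex]::Match($text, '>\d\d\d</.*Date').Value
# [^>]* stops at the end of the first closing tag.
$narrow = [regex]::Match($text, '>\d\d\d</[^>]*Date').Value

# Hypothetical input with a tag name that merely starts with "Date".
$tricky = '<DateMePlease>321</DateMePlease><FileDate>394</FileDate>'
# Without the trailing '>', "DateMePlease" is also rewritten.
$loose  = $tricky -replace '>\d\d\d</([^>]*Date)',  '>9999</$1'
# With the trailing '>', only names that END in "Date" are rewritten.
$strict = $tricky -replace '>\d\d\d</([^>]*Date>)', '>9999</$1'

$greedy; $narrow; $loose; $strict
```

Because the closing `>` is inside the capture group in the stricter pattern, the same `'>9999</$1'` replacement text still reproduces the full closing tag.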

meggar · Accepted Answer · 2011-06-16 05:53:38Z

0

$xmlDocument = [XML](get-content z:\system.txt)

Do it XML style
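One way the XML approach might look, as a rough sketch rather than a tested solution for the real file: it assumes the data is well-formed (the `<Root>` wrapper and the matching `SystemName` tags are assumptions), that "Date" nodes are leaf elements, and that loading roughly 70 MB into an `XmlDocument` is acceptable.

```powershell
# Well-formed stand-in for the question's data; the <Root> wrapper is an assumption.
$xml = [xml]'<Root><System><SystemName>Acme</SystemName><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System></Root>'

# Copy the node list into an array first so edits do not disturb the enumeration.
$elements = @($xml.SelectNodes('//*'))

foreach ($node in $elements) {
    # Only leaf elements whose name ends in "Date" (case-sensitive, like XML names).
    $isDateLeaf = ($node.LocalName -clike '*Date') -and ($node.SelectNodes('*').Count -eq 0)

    # Replace any value that is not exactly four digits.
    if ($isDateLeaf -and $node.InnerText -notmatch '^\d{4}$') {
        $node.InnerText = '9999'
    }
}

$result = $xml.OuterXml
$result
```

To write the result back out, `$xml.Save()` with a full path such as the question's `z:\new_system.txt` would be the usual call. Unlike the three-digit regex, this version covers any value that is not exactly four digits.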

2 Comments

Zac Thompson Over a year ago

I like this idea, but it might be harder for the OP because of the requirement to match all nodes that match *Date*. However, if this is a small known set then I think using XML methods would be much simpler.

user500741 Over a year ago

I was using [XML] initially, but the regex was not working at all, and displaying the file just showed blanks. I can, however, iterate through the nodes successfully using a foreach.
