2

I'm trying to write a PowerShell script that goes through a poorly formatted XML file and looks for any nodes that have the word "Date" as part of the node name, e.g.:

<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>

The above pattern is repeated hundreds of times throughout the file, for about 70 MB worth of data.

The real file has lots more nodes and no linefeeds or anything... so it all appears on one line.

What I need to do is scan the file and look for any nodes that end in "Date" where the value is not 4 digits and replace with a 4 digit value.

Here is what I have so far, but it looks like the replace is only changing the first occurrence and not the other matches after it.

Using the example above, it should find the closing </SystemDate> and closing </FileDate> node and see that the digit is only 3 characters and replace with 9999.

 $infile=get-content z:\system.txt
 write-host $infile.Length
 $regex = New-Object System.Text.RegularExpressions.Regex ">\d\d\d</(.*Date)"
 $replace = $regex.Replace($infile,"9999")
 write-host $infile.Length
 write-host $replace.Length
 set-content -Value $replace z:\new_system.txt

Any help would be appreciated!

2 Answers

1

(I think you've oversimplified your code; e.g. you probably mean to say $regex.Replace($infile, '>9999</$1'), with single quotes so that PowerShell does not expand $1 as a variable.)

Leaving that aside, the first thing I would do is make the matching regex more precise: ">\d\d\d</([^>]*Date)". I'm assuming that PowerShell's regex implementation is greedy, as with other implementations (it uses .NET regular expressions, whose quantifiers are greedy by default). This might solve the problem right away.

If not, I think the natural thing to do would be to loop over the Matches. But the Replace method claims to replace them all, so I think it should be possible to avoid that.
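
To make the difference concrete, here is a minimal sketch that runs both patterns against the sample line from the question. The variable names ($text, $greedy, $precise, $fixed) are only illustrative, and it assumes the real file behaves like this sample.

```powershell
# Sample line copied from the question; it stands in for the real file contents.
$text = '<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>'

# Original pattern: the greedy .* runs on to the last "Date" in the text,
# so everything from the first hit onward is swallowed by a single match.
$greedy = [regex]'>\d\d\d</(.*Date)'

# Tightened pattern: [^>]* cannot cross the end of a tag,
# so each closing *Date tag produces its own match.
$precise = [regex]'>\d\d\d</([^>]*Date)'

"Greedy matches:  $($greedy.Matches($text).Count)"
"Precise matches: $($precise.Matches($text).Count)"

# Single quotes keep PowerShell from expanding $1 before the regex engine sees it.
$fixed = $precise.Replace($text, '>9999</$1')
$fixed
```

For the real file, reading the contents as one string (for example with Get-Content -Raw on PowerShell 3.0 or later) should avoid passing an array of lines to Replace.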


4 Comments

I'm not sure I follow what you're saying. The way I read the regex, it finds 3 digits between the closing bracket of an opening tag and the start of the following ending tag, for anything that has *Time within the tag.
The regex is not looking "within the tag"; it is looking at your whole text. .* means any text at all. So >\d\d\d</.*Date applied against <Date>123</Date><a>a</a>...otherstuff...<Date>2011</Date> will match everything from >123 up to the second-last character. This is assuming that PowerShell regexes are "greedy".
Kinda late to the party, but I agree with Zac's solution, except that it would match on ">321</Date" within text that is ">321</DateMePlease>". Fix that by using pattern ">\d\d\d</([^>]*Date>)" and adjusting the replacement text accordingly.
@Elroy Yah, the OP says both "part of the node name" and "ends in ...", so it's quite possible that your fix would be closer to their actual needs. But it's not clear from the question.
0

$xmlDocument = [XML](get-content z:\system.txt)

Do it XML style
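
A rough sketch of what the XML approach could look like follows. It assumes the content parses as well-formed XML, so the sample here uses a corrected </SystemName> closing tag, and it assumes only leaf elements should be changed. The XPath expression and the variable names are illustrative choices, not something taken from the question.

```powershell
# Well-formed variant of the sample from the question (closing tag case corrected).
[xml]$doc = '<System><SystemName>Acme</SystemName><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>'

# Select leaf elements whose name contains "Date".
# @() copies the results into an array before the document is modified.
$nodes = @($doc.SelectNodes("//*[contains(local-name(), 'Date') and not(*)]"))

foreach ($node in $nodes) {
    # Replace any value that is not exactly four digits.
    if ($node.InnerText -notmatch '^\d{4}$') {
        $node.InnerText = '9999'
    }
}

$result = $doc.OuterXml
$result
```

Unlike the three-digit regex, this checks for "not exactly four digits", which is closer to how the question is worded. Loading a 70 MB file into an XmlDocument will likely use noticeably more memory than a plain string replace.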

2 Comments

I like this idea, but it might be harder for the OP because of the requirement to match all nodes that match *Date*. However, if this is a small known set then I think using XML methods would be much simpler.
I was using [XML] initially and the REGEX was not working at all and displaying the file was just showing blanks. Although I can iterate through the nodes using a foreach successfully.