Make change operations on substring only

Question

I have a file in which arbitrary garbled text appears before and after a section marked by the patterns START and END (specific strings that each occur only once per line, in the correct order and on the same line). I would like to do some string manipulation ONLY on the part between START and END.

Example input:

aomodi3hriq32| ¶³r 0q93aoiSTART_this_is_to_be_modified_ENDaqsdofuha23uru| ²23i ii3uhfia
oawpo3<9"§ A hSTART_this_also_needs_modification_ENDqa 032/a237(°1Q"§ >A_this_
START changeme ENDnot_this_modias

In terms of sed-operations, the substring (and the substring only) between START and END should be modified as if I used sed 's/_this_// ; s/modi/MODI/ ; y/as/45/'.

Example output:

aomodi3hriq32| ¶³r 0q93aoiSTARTi5_to_be_MODIfied_ENDaqsdofuha23uru| ²23i ii3uhfia
oawpo3<9"§ A hSTART4l5o_need5_MODIfic4tion_ENDqa 032/a237(°1Q"§ >A_this_
START ch4ngeme ENDnot_this_modias

awk with FS="START|END" fails as the OFS cannot be set to multiple values at different positions.

I tried using sed with a nested command substitution and a different separator (~), but that failed, and I also fear that characters before START or after END (e.g. a /) might interfere with the command. The idea was to select only the "inner" substring, do the operations on it, and then use the result as part of the replacement:

sed "s/^\(.*\)START.*END\(.*\)$/\1$(sed 's~^.*START~~
                                         s~END.*~~
                                         s~_this_~~
                                         s~modi~MODI~
                                         y~as~45~' infile)\2/" infile

I am not familiar with e.g. perl .... but whatever it takes.

Is there any way to make a set of sed-operations apply to a REGEX-matched substring of a line only?

Do you know if there's a character that won't appear in the text?
– schrodingerscatcuriosity, Commented Jan 12, 2022 at 22:47
@schrodingerscatcuriosity well, NUL, CR ... say most control characters except for newline and tab. But anything may appear.
– FelixJN, Commented Jan 12, 2022 at 22:57

choroba · Accepted Answer · 2022-01-13 17:50:07Z

perl -CSD -ne '
    if (my ($before, $between, $after) = /^(.*START)(.*)(END.*)/) {
        s/_this_//, s/modi/MODI/, tr/as/45/ for $between;
        print "$before$between$after\n";
    } else { print; }' -- file

-CSD decodes the input from UTF-8 and encodes the output to UTF-8.
Instead of populating the three variables $before, $between, and $after, we could have used /p with ${^PREMATCH} and ${^POSTMATCH}, but I don't find that solution any nicer:
```
if (my ($s) = /START(.*)END/p) {
    s/_this_//, s/modi/MODI/, tr/as/45/ for $s;
    print "${^PREMATCH}START${s}END${^POSTMATCH}";
} else { print; }
```

If START...END parts can be repeated on a single line, you need to split each line into parts and loop over them:

for my $part (split /(START.*?END)/) {
    if ($part =~ /^START.*END$/) {
        s/_this_//, s/modi/MODI/, tr/as/45/ for $part;
    }
    print "$part";
}

edited Jan 13, 2022 at 17:50

answered Jan 12, 2022 at 22:39


I clearly need to introduce myself to perl for advanced string manipulations. However, characters such as ° seem to be corrupted. Not using -Cio is fine in my case. Looks neat and powerful. Does using /p set the pre-/postmatch variables?
– FelixJN, Commented Jan 12, 2022 at 23:12
Yes, it does. Maybe the file is not UTF-8 in your case but some other encoding?
– choroba, Commented Jan 13, 2022 at 0:30
The file command returns "UTF-8 Unicode text", and I am using perl v5.32.1. As a matter of fact, using the UTF-8 input file and redirecting the output with -Cio creates an ISO-8859 text file for me, with the corruptions.
– FelixJN, Commented Jan 13, 2022 at 9:52
OK, use -CSD instead (updated), it works for all types of input/output.
– choroba, Commented Jan 13, 2022 at 10:00
I'll grant the "accepted answer" here because it seems like perl was made for this: sieving out a substring, manipulating it, and putting it back into place. It is also the most intuitive one: it does not rely on counting characters or manually introducing separators, and it is quite straightforward to follow.
– FelixJN, Commented Jan 13, 2022 at 14:07

Kusalananda · Answer · 2022-01-13 08:58:41Z

Using standard sed and assuming every line contains exactly one START and one END substring (in that order):

# Skip (pass through) lines that do not have START followed by END.
/.*START\(.*\)END.*/ !b

# Save the original line in the hold space.
h

# Remove the start and the end from the line.
# This leaves the bit of the line that we want to modify.
# (This reuses the previous regular expression.)
s//\1/

# Modify what's left.
s/_this_//
s/modi/MODI/
y/as/45/

# Append the original line from the hold space,
# with a newline as delimiter.
G

# Move the modified bit into the correct spot with a substitution,
# while deleting the old substring between START and END.
s/\(.*\)\n\(.*START\).*\(END.*\)/\2\1\3/

Testing:

$ cat file
aomodi3hriq32| ¶³r 0q93aoiSTART_this_is_to_be_modified_ENDaqsdofuha23uru| ²23i ii3uhfia
oawpo3<9"§ A hSTART_this_also_needs_modification_ENDqa 032/a237(°1Q"§ >A_this_
START changeme ENDnot_this_modias

$ sed -f script file
aomodi3hriq32| ¶³r 0q93aoiSTARTi5_to_be_MODIfied_ENDaqsdofuha23uru| ²23i ii3uhfia
oawpo3<9"§ A hSTART4l5o_need5_MODIfic4tion_ENDqa 032/a237(°1Q"§ >A_this_
START ch4ngeme ENDnot_this_modias

In-line, on the command line:

sed -e '/.*START\(.*\)END.*/!b' -e h -e 's//\1/' \
    -e 's/_this_//' -e 's/modi/MODI/' -e 'y/as/45/' \
    -e G -e 's/\(.*\)\n\(.*START\).*\(END.*\)/\2\1\3/' file
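
To make the hold-space juggling easier to follow, here is a small sketch that runs the same commands with POSIX sed's l command inserted after the main steps, so the pattern space is printed as it evolves. The function name trace_hold_space and the sample line are assumptions made for this illustration only.

```shell
# Hypothetical tracing variant of the script: "l" prints the pattern space
# unambiguously (an embedded newline shows as \n, the end of the buffer as $).
trace_hold_space() {
    sed -e '/.*START\(.*\)END.*/!b' -e h -e 's//\1/' -e l \
        -e 's/_this_//' -e 's/modi/MODI/' -e 'y/as/45/' -e l \
        -e G -e l \
        -e 's/\(.*\)\n\(.*START\).*\(END.*\)/\2\1\3/' "$@"
}

printf '%s\n' 'xSTARTmodi_aENDy' | trace_hold_space
# modi_a$                      <- after s//\1/: only the inner part is left
# MODI_4$                      <- after the three edits
# MODI_4\nxSTARTmodi_aENDy$    <- after G: edited part, newline, original line
# xSTARTMODI_4ENDy             <- final line, printed at the end of the cycle
```

The annotations after the arrows are explanatory and are not part of the real output.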

So it can be done in sed, and in a single script only. Very interesting.
– FelixJN, Commented Jan 13, 2022 at 9:56

αғsнιη · Answer · 2022-01-13 10:41:02Z

You can always build your own multiple OFS:

awk -v FS='START|END' -v OFS= -v map='_this_\r\rmodi\rMODI\ra\r4\rs\r5' '
  BEGIN{ split(FS, mOFS, "|") }
  { n=split(map, tr, "\r"); for(i=1; i<n; i+=2) gsub(tr[i], tr[i+1], $2);
  print $1, mOFS[1], $2, mOFS[2], $3
}' infile

Note that the first argument of gsub() is a regex, so be careful when defining map=.... Likewise, the right-hand side of each mapping should not contain characters that are special in a replacement, such as &, back-references like \1, etc. Since you are writing the mapping by hand, you can escape any such characters so that gsub() does not interpret them specially.

I used CR (\r) to separate the mapping entries because you mentioned that it will not occur in your input file. The other candidate, \0, cannot be used with split() and other awk functions (and perhaps not in other programming languages either), because awk does not reliably handle a \0 inside a string. Each left-hand regex tr[i] (plain strings here) is replaced with the right-hand value that follows it, tr[i+1], from the tr array.

Doing it this way saves you from writing a separate gsub() call for every pair.

FelixJN · Answer · 2022-01-13 14:20:11Z

Maybe with awk and string functions:

awk 'BEGIN{FS="START|END"}
     /START.+END/ {gsub(/_this_/,"",$2)
     gsub(/modi/,"MODI",$2)
     gsub(/a/,"4",$2)
     gsub(/s/,"5",$2)
     print $1"START"$2"END"$3 ; next}
     1' infile
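
The script above handles one START...END pair per line. As a rough sketch of a generalization (not part of the original answer), the line can be consumed from left to right with index(), so that every pair on a line is processed and an unmatched START is left alone. The function name modify_every_pair is made up for illustration, and sub() is used for _this_ and modi to mirror the sed commands in the question.

```shell
# Hypothetical helper: apply the edits inside every START...END pair of a line.
modify_every_pair() {
    awk '
    {
        out = ""
        rest = $0
        # Consume the line left to right, one START...END pair at a time.
        while ((i = index(rest, "START")) > 0) {
            tail = substr(rest, i + 5)      # text after this START
            j = index(tail, "END")
            if (j == 0) break               # unmatched START: keep the remainder as is
            tgt = substr(tail, 1, j - 1)    # text strictly between the markers
            sub(/_this_/, "", tgt); sub(/modi/, "MODI", tgt)
            gsub(/a/, "4", tgt); gsub(/s/, "5", tgt)
            out = out substr(rest, 1, i + 4) tgt "END"
            rest = substr(tail, j + 3)      # continue after this END
        }
        print out rest
    }' "$@"
}

printf '%s\n' 'asSTARTasENDasSTARTmodiENDas' | modify_every_pair
# asSTART45ENDasSTARTMODIENDas
```

Each START is paired with the first END that follows it, so a START nested inside another pair is treated as ordinary text.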

edited Jan 13, 2022 at 14:20

answered Jan 12, 2022 at 22:17


This works only on lines whose first separator is a "START" and whose second (and last) separator is an "END". It wouldn't work if the first separator is an "END", nor if there are more than 2 separators per line. (Not sure if those cases are possible, but maybe this should be stated as a limitation?) Cycling over each field and tracking if we are currently "behind a START and (up to now) behind an END" would solve this and make it more general?
– Olivier Dulac, Commented Jan 13, 2022 at 14:10
@OlivierDulac Originally, I expected START/END to be present in every line, always as a pair and in the correct order. But you are right: I have now added a check so that both must be present before doing the operation, otherwise the line is just returned. This makes it a more general solution.
– FelixJN, Commented Jan 13, 2022 at 14:17
I added a solution that works for every occurrence of 'START...END' that has neither 'START' nor 'END' inside '...', and where '...' can span several lines.
– Olivier Dulac, Commented Jan 13, 2022 at 15:05

Ed Morton · Answer · 2022-01-13 01:57:41Z

Using any awk in any shell on every Unix box:

$ cat tst.awk
match($0,/START.*END/) {
    tgt = substr($0,RSTART+5,RLENGTH-8)
    sub(/_this_/,"",tgt)
    sub(/modi/,"MODI",tgt)
    gsub(/a/,"4",tgt)
    gsub(/s/,"5",tgt)
    $0 = substr($0,1,RSTART+4) tgt substr($0,RSTART+RLENGTH-3)
}
{ print }

$ awk -f tst.awk file
aomodi3hriq32| ¶³r 0q93aoiSTARTi5_to_be_MODIfied_ENDaqsdofuha23uru| ²23i ii3uhfia
oawpo3<9"§ A hSTART4l5o_need5_MODIfic4tion_ENDqa 032/a237(°1Q"§ >A_this_
START ch4ngeme ENDnot_this_modias
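
The offsets 5, 8, 4 and 3 in tst.awk come from the marker lengths: length("START") is 5 and length("END") is 3. Below is a sketch of the same approach with the markers passed in as variables and the offsets computed from them. It assumes the markers contain no regex metacharacters or backslashes, since they are used to build a dynamic regex. The function name modify_between and the BEGIN/FIN markers in the usage line are hypothetical.

```shell
# Hypothetical helper, same logic as tst.awk but with computed marker lengths.
# Usage: modify_between START_MARKER END_MARKER [file...]
modify_between() {
    start=$1 end=$2
    shift 2
    awk -v s="$start" -v e="$end" '
    match($0, s ".*" e) {
        ls = length(s); le = length(e)
        # Text strictly between the markers.
        tgt = substr($0, RSTART + ls, RLENGTH - ls - le)
        sub(/_this_/, "", tgt)
        sub(/modi/, "MODI", tgt)
        gsub(/a/, "4", tgt)
        gsub(/s/, "5", tgt)
        # Prefix through the start marker, edited text, end marker onwards.
        $0 = substr($0, 1, RSTART + ls - 1) tgt substr($0, RSTART + RLENGTH - le)
    }
    { print }' "$@"
}

printf '%s\n' 'asBEGINasFINas' | modify_between BEGIN FIN
# asBEGIN45FINas
```

With START and END as arguments this should behave like tst.awk, including the greedy match up to the last END on the line.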

answered Jan 13, 2022 at 1:57


schrodingerscatcuriosity · Answer · 2022-01-13 08:42:54Z

This GNU sed pipeline gives the desired result:

$ sed 's/\(.\)\(START\|END\)/\1\n\2\n/g' file | \
  sed -ne '/START/,/END/s/_this_//' \
  -ne '/START/,/END/y/as/45/' \
  -ne '/START/,/END/s/modi/\U&/g;p' | \
  sed -z 's/\n\(START\|END\)\n/\1/g'
aomodi3hriq32| ¶³r 0q93aoiSTARTi5_to_be_MODIfied_ENDaqsdofuha23uru| ²23i ii3uhfia
oawpo3<9"§ A hSTART4l5o_need5_MODIfic4tion_ENDqa 032/a237(°1Q"§ >A_this_
START ch4ngeme ENDnot_this_modias

edited Jan 13, 2022 at 8:42 by Kusalananda

answered Jan 12, 2022 at 23:25 by schrodingerscatcuriosity

Nice approach. Why the hashes and not just 's/START\|END/\n&\n/g', then operate on the range /^START$/,/^END$/, then -z 's/\n\(START\|END\)\n/\1/g'?
– FelixJN, Commented Jan 12, 2022 at 23:54
@FelixJN you are right, I tried many things and the hash was just left over. The \(.\)\(START... is there to prevent adding a newline if the line begins with START or END.
– schrodingerscatcuriosity, Commented Jan 13, 2022 at 0:11

Olivier Dulac · Answer · 2022-01-13 15:12:25Z

I present a solution that also

works only between a START and an END, whatever is in between (but ONLY if there is no further START or END in between)
works even if the in-between part spans several lines

Constraint: I assume your file doesn't use 4 particular characters. I chose the 'often unused' "\001" to "\004" (but any other 4 unused characters could be used instead).

How they are used: \001 makes every START begin on a new line and every END be followed by a newline, which forces any combination other than "START(nonSTARTnorEND)END" onto separate lines so that it is not considered. \004 "saves" the original newlines of the file so that they can be recovered at the end. \002 represents a START and \003 represents an END, which lets me check that neither occurs in between (and that the string to be replaced begins with a START and ends with an END). All those checks are made possible by these substitutions.

One could do:

sed -e "s/START/$(printf '\001\002')/g" -e "s/END/$(printf '\003\001')/g" INPUT \
| tr '\001\n' '\n\004' \
| gawk '
  /^\002[^\002\003]*\003$/ {
    # we know we are STRICTLY between a START (\002) and an END (\003), with neither occurring inside
    # gensub() returns the new string and leaves its target untouched, so assign the result back
    $0 = gensub("_this_", "", "g", $0) # remove all occurrences of _this_ between START and END
    $0 = gensub("a", "4", "g", $0) ; $0 = gensub("s", "5", "g", $0) # "a" -> "4", "s" -> "5"
    $0 = gensub("modi", "MODI", "g", $0)
  }
  1 # print every line
 ' \
| tr '\n\004' '\001\n' \
| tr -d '\001' \
| sed -e "s/$(printf '\002')/START/g" -e "s/$(printf '\003')/END/g" > OUTPUT

Note: this could be further simplified. There is no need to replace START with \002 or END with \003; I did that first so that I could also use [^\002\003]* to ensure the in-between string contains neither, but the \001->\n translation ensures that already.

There may be some limitation on the length of the part between START and END (there was on old awks, but as I use "gensub" (and thus gawk), I'm sure the limit, if any, is very big).
– Olivier Dulac, Commented Jan 13, 2022 at 15:06
Wouldn't awk need RS="\0" here? (Possible for gawk, not mawk, AFAIK.) Especially since you seem to also want to cover multi-line pattern range matches.
– FelixJN, Commented Jan 13, 2022 at 15:06
@FelixJN: I am not sure why that would be needed, please elaborate. Note that I "conceal" the original '\n' as '\004', and add new '\n' via a temporary '\001' that is then translated to '\n' (to ensure that, in the intermediate state, START is always at the beginning of a line and END is always followed by a newline). I.e., the START...END is on 1 line only (no \n inside, as they were replaced with \004).
– Olivier Dulac, Commented Jan 13, 2022 at 15:09
Ah yes, I missed this double translation and the use of \004 as a placeholder for the original newlines.
– FelixJN, Commented Jan 13, 2022 at 15:30

guest_7 · Answer · 2022-01-15 05:34:43Z

You can do what you were trying, provided you have GNU sed with the /e flag in the s/// command:

sed -Ee "
  s/'/&\\\\&&/g
  s/(.*START)(.*)(END.*)/printf %s '\\1' \"\$(printf '%s\\\\n' '\\2'|sed -e 's:_this_::;s:modi:MODI:;y:as:45:')\" '\\3'/e
" infile

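The above can be broken up into functions to make it look cleaner. Here we define helper functions and variables to declutter (export -f is a bash feature, so this presumably requires that the shell sed runs for the e flag is bash):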

xform() {
  printf '%s\n' "$1" |
  sed -e '
    s/_this_//
    s/modi/MODI/
    y/as/45/
  '
}

fx() {
  printf %s "$1" "$(xform "$2")" "$3"
}

export -f fx xform

bre=$(printf '\\(%s\\)'  '.*START' '.*' 'END.*')

sed -e "
  s/'/&\\\\&&/g
  s/$bre/fx '\\1' '\\2' '\\3'/e
" infile

With Perl, this comes naturally:

perl -lpe '
  s{(?<=START)(.*?)(?=END)}
   [
     local $_=$1;
     s/_this_//;
     s/modi/MODI/;
     tr/as/45/r;
   ]e;
' infile

Or, POSIXly we can partition the pattern space into 3 parts, store in hold, then transform the middle portion and stitch them back.

sed -e '
  s/\n.*//;ta
  s/START.*END/\
&\
/;h;D;:a
  s/_this_//;s/modi/MODI/;y/as/45/
  G;s/\(.*\)\n\(.*\)\n.*\n/\2\1/
' infile
