Delete text between parentheses, but never past empty line

Question

Consider a text file, with lines of text gathered in many blocks, where each block is separated by at least one empty line. Using a Bash one-liner, how do I delete all text from < to either > or \n\n?

To put it differently: Delete everything between each pair of < and >. If a < has no closing >, delete everything until the end of the block (an empty line), but never, ever delete outside the block!

Conceptually, should I physically separate the blocks into objects in a list before parsing for safety, or is this a straightforward linear text parsing job as long as you know what you are doing?

Example text:

This is the first
block of text.
                             <-- empty line
<delete me>
This is the second block.
<delete
here>
<delete this, but
                             <-- empty line
do not delete this>
<delete this too>
Third block here.

(more blocks)

The result should be:

This is the first
block of text.
                             <-- empty line
This is the second block.
                             <-- empty line
do not delete this>
Third block here.

What exactly do you mean by "empty line"? Is that meant to be a literal empty line where you say ...(empty line)? Or do you just mean a line without a >? If the former, perhaps just edit the question to show a literal empty line; we will understand.
– Sparhawk, Commented Apr 7, 2018 at 10:25
@RomanPerekhrest: I can easily solve it with some lines of PHP, but I'm looking for something short and UNIX-y that fits snugly inside a Bash script.
– forthrin, Commented Apr 7, 2018 at 11:00
Is this okay? awk -v RS= -v ORS= '{gsub(/<[^>]+>?/, "")}1' It doesn't preserve the empty lines, and the newline character after > remains, so it may not suit your real sample.
– Sundeep, Commented Apr 7, 2018 at 11:44
You got it almost right: it should be perl -0777pe 's/<.*?(>|(?=\n\n))//sg'
– wolfrevokcats, Commented Apr 7, 2018 at 12:18

Sundeep · Accepted Answer · 2018-04-07 13:18:27Z

Try awk's paragraph mode:

$ awk -v RS= -v ORS='\n\n' '{gsub(/<[^>]+>?\n?/, "")}1' ip.txt 
This is the first
block of text.

This is the second block.


do not delete this>
Third block here.

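-v RS= enables paragraph mode: one or more consecutive empty lines are used as the input record separator, so each block becomes one record
-v ORS='\n\n' sets the output record separator to two newline characters
gsub(/<[^>]+>?\n?/, "") deletes a < followed by one or more non-> characters, then an optional > and an optional newline; since [^>] also matches newlines, an unclosed < is deleted up to the end of the record, never beyond the block
1 is the idiomatic way to print the input record contents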
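The output above shows two empty lines after the second block, because the deletion leaves a trailing newline in that record before ORS is appended. A minimal sketch of one way to tidy that up, assuming the sample from the question; the extra sub() call and the variable names sample and result are additions for illustration, not part of the answer's command:

```shell
# Sample text from the question, fed through a pipe instead of ip.txt.
sample=$(printf '%s\n' \
  'This is the first' \
  'block of text.' \
  '' \
  '<delete me>' \
  'This is the second block.' \
  '<delete' \
  'here>' \
  '<delete this, but' \
  '' \
  'do not delete this>' \
  '<delete this too>' \
  'Third block here.')

# Same gsub as in the answer, plus a sub() that trims a newline left dangling
# at the end of the record, so ORS adds exactly one empty line per block.
result=$(printf '%s\n' "$sample" | awk -v RS= -v ORS='\n\n' '{
  gsub(/<[^>]+>?\n?/, "")   # delete <...>, or <... up to the end of the block
  sub(/\n$/, "")            # drop a trailing newline left by the deletion
} 1')

printf '%s\n' "$result"
```

A block that consists only of deleted text would still be printed as an empty record; whether to skip such records depends on the real data.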

The same thing with perl:

perl -00 -lpe 'BEGIN{$\="\n\n"} s/<[^>]+>?\n?//g' ip.txt

RomanPerekhrest · Answer · 2018-04-07 11:07:13Z

GNU Awk solution:

awk -v RS='[<>]' '/\n\n/{ sub(/^[^\n]+\n/, ""); print $0 RT }' file

RS='[<>]' - treat < and > as record separators
/\n\n/ - if the current record contains two consecutive linebreaks:
- sub(/^[^\n]+\n/, "") - remove everything up to and including the first newline
- print $0 RT - print the current record followed by RT (i.e. >)
- RT - the record terminator. Gawk sets RT to the input text that matched the character or regular expression specified by RS.
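To see how RS and RT interact, here is a small sketch on a made-up input string (a<b>c is purely illustrative). It assumes GNU awk is installed as gawk, since RT is a gawk extension; the guard skips the demo otherwise.

```shell
# RT holds whichever separator text ended the current record.
if command -v gawk >/dev/null 2>&1; then
  rt_demo=$(printf 'a<b>c' | gawk -v RS='[<>]' '{ printf "[%s][%s]\n", $0, RT }')
  printf '%s\n' "$rt_demo"
fi
```

With gawk this prints [a][<], [b][>] and [c][] on separate lines: the last record has no terminator, so RT is empty there.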

The output:

<empty line>   
don't delete this>
