
I have a file (FILE1) with some repeated sections, as in this example:

LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD

I have several such files. The lines to be removed are stored in a pattern file (PATTERN):

REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13

I want to write sed, awk (bash), or Perl code to remove all the sections of FILE that match the content of the file PATTERN. Another requirement is to remove all occurrences but leave only the first one.

  • Is the data in a well-known format, such as XML, JSON, YAML, or a similar format? It would be nice to see real data, as we could otherwise just assume that the lines matching REMOVE should be removed, to simplify the problem. Commented Aug 6 at 6:18
  • Shall every single line from the pattern file be removed, or only consecutive lines that are the same as the whole pattern file? Commented Aug 6 at 7:08
  • And what part of this is giving you trouble? Please show us what you have so far so we don't waste your time with approaches you have already tried. Commented Aug 6 at 9:13
  • Never just use the word "pattern" when discussing pattern matching. There is a major distinction between whether the "pattern" should be treated as a literal string or a regular expression (and how it's delimited). See how-do-i-find-the-text-that-matches-a-pattern for details on the issue, then edit your question to tell us if your "pattern" is to be treated as a regexp or a literal, whether full-line or partial-line matching is desired, etc. Commented Aug 6 at 12:13
  • Regarding "Another requirement is to remove all but leave the first occurrence only." - don't ask for help with 2 problems in 1 question. Get an answer to the first problem, then try to solve the second problem yourself given that, then ask a new question if you can't solve it yourself. Commented Aug 6 at 12:26

5 Answers

3

Assuming you're good with full-record (i.e. matching on the whole multi-line block), full-line regexp matching (see my comment), you could do this with GNU awk for multi-char RS:

$ awk -v RS='^$' 'NR==FNR{RS="\n("$0")?"; next} 1' pat file1
LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD
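
To see what that first command is doing: reading pat with RS='^$' (a separator that can't match inside a non-empty file) slurps the whole pattern file, including its trailing newline, into $0, and RS is then rebuilt from it, so for the example pat file it should end up as something like:

\n(REMOVE LINE11\n  REMOVE LINE12\n    REMOVE LINE13\n)?

i.e. every newline in file1, optionally followed by the entire pattern block, is treated as a record separator, so each matching block gets consumed along with the newline that precedes it.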

or, for full-record, full-line literal string matching, we could just escape all of the characters from pat to disable any potential regexp metachars:

$ awk -v RS='^$' 'NR==FNR{gsub(/[^^\\]/,"[&]"); gsub(/[\\^]/,"\\\\&"); RS="\n("$0")?"; next} 1' pat file1
LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD

The above was run using this input:

$ head -50 pat file1
==> pat <==
REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13

==> file1 <==
LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD
3

Depending on the source file, you can try something like:

grep -v -f pat input_file

P.S. This will remove ANY matching line from the source file, regardless of whether it's part of the full pattern or not.
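
If instead you want the pattern lines treated as literal strings that must match entire lines (see the comments below), the standard -F and -x grep flags can be combined with the same idea; a minimal sketch:

grep -v -x -F -f pat input_file

This is still line-by-line (partial-record) matching though: each pattern line is removed wherever it occurs, not only when the whole block appears together.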

  • @Pratap just to again demonstrate my point that you need to tell us what kind of "pattern matching" you actually want to do: this is doing yet another type of "pattern matching", different from the 2 in my answer (this time it's partial-line regexp matching), which will once again produce the output you show from the input you show in your question. You really should come up with a better example that'll actually test your requirements. Commented Aug 6 at 12:43
  • .... and if you add -F then it'll become a partial-record, partial-line string match solution, and if you also add -x it'll become a partial-record, full-line string match solution, etc. - all of which will produce the output in the question but would all behave differently given different input. Commented Aug 6 at 13:08
  • @EdMorton, right, but the OP did not clarify what exactly they want... Commented Aug 6 at 13:26
  • Right, I'm not saying there's any issue with your answer, just pointing out to the OP that their requirements aren't clear and their example needs to be updated to really test whatever their requirements are. Commented Aug 6 at 14:35
2

You mentioned "bash code" in your question ...

So, in pure bash utilizing its extended test syntax [[ ... ]] with its RegEx operator =~ ...

Read the lines from the pattern file pat into array elements with readarray (AKA mapfile), removing the trailing newline from each line with the -t option, like so:

readarray -t pat <"pat"

... then make those array elements into a RegEx like so:

match=$(IFS='|'; printf '(%s)' "${pat[*]}")
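
For the example pat file in the question, the regex stored in match should look like this (shown here just for illustration):

$ echo "$match"
(REMOVE LINE11|  REMOVE LINE12|    REMOVE LINE13)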

... then process the lines from the data file file in a loop like so:

while IFS= read -r line; do
  [[ $line =~ $match ]] || echo "$line"
done <"file"

So, the complete code will look like this:

{
  readarray -t pat <"pat"
  match=$(IFS='|'; printf '(%s)' "${pat[*]}")
  while IFS= read -r line; do
    [[ $line =~ $match ]] || echo "$line"
  done <"file"
}

... outputting this:

LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD
1

With perl:

$ perl -0777 -pe '
    BEGIN{$pat = <STDIN>}
    s{\Q$pat\E}{$n++ ? "" : $&}ge
  ' FILE < PATTERN
LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
REMOVE LINE11
  REMOVE LINE12
    REMOVE LINE13
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD

Where:

  • -p: sed-like mode where the expression is evaluated for each input record (stored in $_), and $_ is printed afterwards.
  • -0777 changes the input record separator to an impossible byte value, so the record will be the whole input. Same as -g in newer versions.
  • BEGIN{$pat = <STDIN>}: stdin (here opened by the shell on the PATTERN file) is stored in the $pat variable at the BEGINning.
  • s{pattern}{replacement}flags: substitution like in sed, where the flags are:
    • e: the replacement is evaluated as perl code
    • g: global like in sed: replace all occurrences
  • the pattern is \Q$pat\E: $pat enclosed in \Q...\E so that its content is taken literally and not as a regular expression. Remove the \Q and \E for the contents of PATTERN to be taken as a (perl) regular expression (but beware that . in the regular expression won't match the newline character unless the s flag is added, which can also be done inside the regexp with (?s)).
  • the replacement is $n++ ? "" : $&. We increment $n, which will be 0 for the first substitution and greater than 0 for the following ones, so we replace with $& (what was matched, i.e. leave the first occurrence alone as requested) the first time, and with "" (i.e. remove it) afterwards.

Add the -i option to edit FILE in-place.
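
For example, keeping a .bak backup of the original (the file names are the same as above):

$ perl -0777 -i.bak -pe '
    BEGIN{$pat = <STDIN>}
    s{\Q$pat\E}{$n++ ? "" : $&}ge
  ' FILE < PATTERN

Use plain -i instead of -i.bak if you don't want a backup copy.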

1
#!/usr/bin/perl

use strict;

use Getopt::Long;
my %opts;
$opts{'patterns-file'} = './patterns';

GetOptions(\%opts, 'in-place-edit|i',
                   'fixed-strings|F', 
                   'line-regexp|x',
                   'strip-spaces|s', 
                   'patterns-file|p=s',
                   'help|h',
);

if ($opts{'help'}) {
print <<__EOF__;
$0 [options] [file(s) to process...]

 --patterns-file, -p   Name of patterns file, defaults to ./patterns

 --in-place-edit, -i   In-place edit mode. Use with caution!
 --fixed-strings, -F   Treat patterns as fixed-text strings, not regexes
 --line-regexp,   -x   Patterns only match entire input lines.
 --strip-spaces,  -s   Strip leading and trailing white space from patterns

 --help, -h            This help message

__EOF__
exit 0;
};

my @patterns; # array to hold the patterns to remove
my $re;       # @patterns array converted to regex

my @files = @ARGV; # save a copy of the args

@ARGV = $opts{'patterns-file'};

while(<<>>) { # First, read the patterns file
  chomp;            # remove end-of-line char(s) - \n, \r\n, etc

  if ($opts{'strip-spaces'}) {
    s/^\s*|\s*$//g;
  };

  next if m/^\s*$/; # ignore empty lines

  # Add line to @patterns array...
  if ($opts{'fixed-strings'}) {
    # Treat each pattern as fixed text, even
    # regex "special" characters like . or *
    push @patterns, quotemeta($_);
  } else {
    # Treat each pattern as a regular expression:
    push @patterns, $_;
  }
};

$re = join("|", @patterns); # convert @patterns array to regex string
if ($opts{'line-regexp'}) {
  $re = '^(' . $re . ')$';
};
$re = qr/$re/;              # pre-compile the regex

if ($opts{'in-place-edit'}) {
  # script was run with `-i` option, turn on in-place
  # editing, with .bak extension for backup copies
  our $^I = '.bak';
};

@ARGV = @files; # restore the arg list

while(<<>>) {     # read and process the remaining file(s)
  next if m/$re/; # skip lines matching the regex
  print
}

This script can process multiple input files and can optionally edit the files in-place if -i or --in-place-edit is used on the command line. The other options are documented by the -h or --help option.

The original version I wrote didn't support any options, but had a LOT of comments saying things like "uncomment the next line if you want to do X"...I decided it was easier and better to just use the Getopt::Long module. Both long and short options are supported.

The patterns file defaults to patterns in the current directory, but can be overridden by using the -p or --patterns-file option, which requires a filename argument.

All other args are the filename(s) to process. If the -i option is NOT used, all output goes to stdout.

Save the script as, e.g. remove-patterns.pl, make it executable with chmod +x remove-patterns.pl and run it like so:

$ ./remove-patterns.pl file1 
LINE 1 ABCD
LINE 2 EFGA
LINE 3 HCJK
LINE 4 ABCDH
LINE 5 EFGAG
LINE 6 HCJKD
LINE 7 ABCDH
LINE 8 EFGAG
LINE 9 HCJKD
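
And to treat the patterns as fixed strings that must match entire lines (roughly grep's -Fx behaviour, see the comments on the question), something like:

$ ./remove-patterns.pl -F -x file1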

Or, with in-place editing of the input file(s) and using a different patterns file:

$ ./remove-patterns.pl -i -p different_patterns.txt file*

There will be no visible output if -i is used; the files are edited instead.

  • Yes, this is over-engineered for a simple problem, but a) I wanted to make answering the question interesting to myself, and b) I'm kind of sick of quick-and-dirty one-liner hacks and wanted to show that making a flexible and re-usable tool is not difficult to do. Tool-making isn't hard. Option processing isn't hard. Help messages aren't hard. Adding more options when you think of them isn't hard either. The first time may be a little difficult, but once you've written a few tools like this, it's extremely easy and most of the script is just comments, help text, and boilerplate code. Commented Aug 6 at 13:35
  • Also, it's easier to write a script that has options to work for a variety of cases, without having to extract a detailed description of the problem from the OP. Commented Aug 6 at 13:40
  • BTW, apart from -i (which is copied from perl, sed, etc.), -F/--fixed-strings and -x/--line-regexp are copied from GNU grep. -h/--help is (or should be) universal. -s/--strip-spaces and -p/--patterns-file are custom to this script. It's generally a good idea to re-use option names from commonly used programs where possible; it makes the tool easier to use. I probably should add a -I/--ignore-case option too. Or maybe make in-place-edit use the capital I and require a mandatory backup suffix (even if it's just an empty string, like BSD sed), since it's the "dangerous" option. Commented Aug 6 at 13:45
