Multiple line pattern/ Data extraction

Question

I have the header below in about 100,000 files. I have already extracted each line separately and combined the records in Excel, so my time crunch is over, and I am now looking for an expedient method of data extraction.

X-RSMF-Generator: RSMF Generator Sample Library

X-RSMF-Version: 1.0.0

X-RSMF-EventCount: 53

X-RSMF-BeginDate: 2022-09-20T04:33:11-04:00

X-RSMF-EndDate: 2022-09-20T16:47:56-04:00

X-RSMF-GroupID: GRP000000118

X-RSMF-SecondaryGroupID: GRP000000118_D_20220920

X-RSMF-ContainsDeleted: False

X-RSMF-Application: Native Messages

X-RSMF-Participants: Person One <5156242756> Person two, Person

three [email protected] <21243210277> Person four *** <345278652345>

MIME-Version: 1.0

Not all lines are present in all files, and the last field can span more than one line. I think we can use MIME-Version: 1.0 as the stop marker. I only need the data from each line; everything before the ": " (colon space) can be ignored, as those are the field headings.

I started out using sed, thinking I could concatenate the lines and pipe the result to awk to split it into columns.

#!/bin/sh

shopt -s nullglob
FILES=/mnt/c/Temp/rsmf/*.rsmf

for f in $FILES

do
    #echo "Processing $f"
    sed -rn \
    -e '/^X-RSMF-BeginDate:/{
        s/X-RSMF-BeginDate: //
        s/T/ /
        s/-0[45]:00/ /
        s/X-RSMF-Application://
        h
        #p
        }' \
    -e '/^X-RSMF-EndDate:/{
        s/X-RSMF-EndDate: //
        s/T/ /
        s/-0[45]:00/ /
        H
        #p
        }' \
     -e '/^X-RSMF-GroupID:/{
        s/X-RSMF-GroupID: //
        H
        x
        s/\r\n//gp
        }' \
         $f
done

Results -

2022-10-05 12:54:27 2022-10-05 12:54:27 GRP000000001
2022-10-05 11:48:18 2022-10-05 11:48:18 GRP000000002

Before spending time on this, I wanted to seek recommendations on the best approach and practice for this particular project.

Thoughts??

When asking questions about text processing, please use block code formatting for verbatim reproduction of your input example. Also be sure to provide the desired output for the included input. That way, contributors can test proposed solutions before posting them as answers.
– AdminBee, Commented Nov 10, 2022 at 9:50

Gilles Quénot · Accepted Answer · 2022-11-09 19:10:04Z


With awk:

awk -F': ' 'BEGIN{ORS=" "}$1=="MIME-Version"{exit}{print $2}END{print "\n"}' file
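The one-liner splits each line on ": ", prints the second field of every line separated by spaces, and exits at the MIME-Version line, so it covers one file per invocation. A minimal sketch of the same idea for many files in one awk run follows. The helper name `rsmf_values` is hypothetical, and the CRLF handling is an assumption based on the `\r\n` substitution in the question's sed attempt. Continuation lines of the participants field are appended as-is.

```shell
#!/bin/sh
# rsmf_values FILE...  (hypothetical helper name)
# Prints one line per input file: the header values up to, but not
# including, the MIME-Version line, separated by single spaces.
rsmf_values() {
    awk '
        FNR == 1 {                      # first line of each input file
            if (NR > 1) printf "\n"     # finish the previous file record
            done = 0; sep = ""
        }
        { sub(/\r$/, "") }              # assumption: files may use CRLF endings
        done || $0 == "" { next }       # skip blank lines and the message body
        /^MIME-Version:/ { done = 1; next }
        {
            val = $0
            sub(/^[A-Za-z-]+: /, "", val)   # drop the field name; continuation lines have none
            printf "%s%s", sep, val
            sep = " "
        }
        END { if (NR > 0) printf "\n" }
    ' "$@"
}
```

Called as `rsmf_values /mnt/c/Temp/rsmf/*.rsmf`, this should give one record per file. Because values contain spaces, a space-separated record is only useful for eyeballing; a tab or another separator is safer for import.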

edited Nov 9, 2022 at 19:10

answered Nov 9, 2022 at 18:51

Gilles Quénot


Not sure how this does what it does. We would need all data on one line for every file, and for the date/time fields it cuts off the time after 2 digits.
– user68650, Commented Nov 9, 2022 at 18:57
Post edited accordingly. For the date, you can use sub() or substr().
– Gilles Quénot, Commented Nov 9, 2022 at 19:01
Only got "01" on my end, on the field that I tried it on.
– user68650, Commented Nov 9, 2022 at 19:11
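For the date fields, a sketch of what the sub() suggestion could look like, producing the same "date time" form as the sed attempt in the question. It assumes the line is split on ": " (colon plus space), so the colons inside the time stay in the value, and that the UTC offset sits at the end of the value in ±HH:MM form. The helper name `rsmf_dates` is hypothetical.

```shell
#!/bin/sh
# rsmf_dates FILE  (hypothetical helper name)
# Prints the BeginDate and EndDate values of one file with the "T"
# replaced by a space and the trailing UTC offset removed.
rsmf_dates() {
    awk -F': ' '
        { sub(/\r$/, "") }                    # assumption: possible CRLF endings
        $1 == "MIME-Version" { exit }         # headers end here
        $1 == "X-RSMF-BeginDate" || $1 == "X-RSMF-EndDate" {
            val = $2
            sub(/T/, " ", val)                # date and time separated by a space
            sub(/[+-][0-9][0-9]:[0-9][0-9]$/, "", val)   # drop an offset such as -04:00
            print val
        }
    ' "$1"
}
```

Note that dropping the offset discards time zone information, which is what the question's sed script does as well.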


cas · Accepted Answer · 2022-11-10 04:01:58Z

The following is a pretty hacky, brute-force method, and there's probably a better way of doing it (but I'd need to know more about your data and why you're even mentioning Excel: if the data was originally in a spreadsheet, there are perl modules for extracting data directly from Excel or Open/Libre Office etc.), but it does work with the sample data you provided.

It can process any number of input files.

It has been written to use TAB (\t or Ctrl-I or ^I) as the output field separator, rather than a space because your field data can contain spaces.

#!/usr/bin/perl

while (<>) {
  chomp;
  s/^\s*|\s*$//g;  # strip any leading and trailing whitespace
  next if /^$/;    # ignore all blank lines

  # split input line into @F array
  # $F[0] will contain the field name and
  # $F[1] will contain the field data
  # The field separator is a quite-forgiving zero-or-more spaces followed by
  # a colon followed by one-or-more spaces. This should cope with most minor
  # variants caused by manual extraction from Excel. 
  my @F = split /\s*:\s+/;

  # print the data at end of each input record (file)
  if (/^MIME-Version/) {
    # add space-separated @participants array to end of @record array
    push @record, join(" ", @participants);

    # print @record array, tab-separated
    print join("\t", @record), "\n";

    # clear both arrays, ready for next input file
    @record=();
    @participants=();
    next;
  };

  # fix up the date format
  if (/^X-RSMF-(Begin|End)Date/) {
    $F[1] =~ s/T/ /;
    $F[1] =~ s/-0[45]:00$//;
  };

  if (/^X-RSMF-Participants/) {
    # participants need to be handled differently because this field can
    # be multi-line.  Store in a separate @participants array
    push @participants, $F[1];

  } elsif ($#F == 0) {
    # lines without a field name get added to @participants array
    push @participants, $_;

  } else {
    # all other fields get added to @record array
    push @record, $F[1];
  }
}

Save it to a file, e.g. rsmf2tab.pl, make it executable with chmod +x rsmf2tab.pl, and run it as, e.g.

./rsmf2tab.pl /mnt/c/Temp/rsmf/*.rsmf

or if your .rsmf files are in multiple subdirectories:

find /mnt/c/Temp/rsmf/ -name '*.rsmf' -exec /path/to/rsmf2tab.pl {} +

Sample output with two copies of your sample data (as file1.rsmf and file2.rsmf) as input, piped through cat -A to show the tabs as ^I:

$ ./rsmf2tab.pl *.rsmf | cat -A
RSMF Generator Sample Library^I1.0.0^I53^I2022-09-20 04:33:11^I2022-09-20 16:47:56^IGRP000000118^IGRP000000118_D_20220920^IFalse^INative Messages^IPerson One <5156242756> Person two, Person three [email protected] <21243210277> Person four <345278652345>$
RSMF Generator Sample Library^I1.0.0^I53^I2022-09-20 04:33:11^I2022-09-20 16:47:56^IGRP000000118^IGRP000000118_D_20220920^IFalse^INative Messages^IPerson One <5156242756> Person two, Person three [email protected] <21243210277> Person four <345278652345>$
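One caveat, given that the question says not every header line is present in every file: output that appends values in the order they occur will shift the later columns left whenever a header is missing. A sketch of one way to keep the columns aligned, written in awk rather than perl, is below. The column list is taken from the sample header in the question, the helper name `rsmf_table` is hypothetical, and CRLF line endings are assumed to be possible. Dates are left untouched here, and headers not in the column list are ignored.

```shell
#!/bin/sh
# rsmf_table FILE...  (hypothetical helper name)
# Prints one tab-separated row per file with a fixed column order, so a
# header that is missing from a file leaves an empty cell instead of
# shifting the remaining columns.
rsmf_table() {
    awk '
        BEGIN {
            # column order taken from the sample header in the question
            cols = "Generator Version EventCount BeginDate EndDate GroupID"
            cols = cols " SecondaryGroupID ContainsDeleted Application Participants"
            n = split(cols, col, " ")
        }
        function emit(   i, row) {
            for (i = 1; i <= n; i++) row = row (i > 1 ? "\t" : "") val[col[i]]
            print row
            split("", val); name = ""       # reset state for the next file
        }
        FNR == 1 && NR > 1 { emit() }       # previous file is complete
        FNR == 1 { done = 0 }
        { sub(/\r$/, "") }                  # assumption: possible CRLF endings
        done || $0 == "" { next }
        /^MIME-Version:/ { done = 1; next }
        /^X-RSMF-[A-Za-z]+: / {
            name = $0; sub(/^X-RSMF-/, "", name); sub(/: .*/, "", name)
            val[name] = substr($0, index($0, ": ") + 2)
            next
        }
        name != "" { val[name] = val[name] " " $0 }   # continuation line
        END { if (NR > 0) emit() }
    ' "$@"
}
```

Every row then has ten tab-separated cells regardless of which headers a file contains, which should make a spreadsheet import line up without manual realignment.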

BTW, you really do not want to do your FILES=/mnt/c/Temp/rsmf/*.rsmf followed by for f in $FILES. The unquoted $FILES is word-split before the glob is expanded, so it breaks as soon as the directory path contains whitespace, and the unquoted $f passed to sed breaks on any filename that contains whitespace. It's not necessary, anyway: just run for f in /mnt/c/Temp/rsmf/*.rsmf and quote "$f", or (depending on what you're running) pass all filename args to the command you're running without using a loop.
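A sketch of the loop form suggested above, with the glob written directly in the for statement and the loop variable quoted. The helper name `list_rsmf` is hypothetical, and the `printf` stands in for whatever per-file command you actually run. The `[ -e ... ]` test replaces bash's `shopt -s nullglob`, which is not available under `#!/bin/sh`.

```shell
#!/bin/sh
# list_rsmf DIR  (hypothetical helper name)
# Loops over the .rsmf files in DIR by globbing directly in the for
# statement, so names containing spaces stay intact.
list_rsmf() {
    for f in "$1"/*.rsmf; do
        [ -e "$f" ] || continue    # no match: the pattern is left unexpanded
        printf '%s\n' "$f"         # replace with the real per-file command
    done
}
```

With around 100,000 files, passing every name to one external command may exceed the system's argument-length limit, in which case the `find ... -exec ... {} +` form batches the names for you.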

Cas, I was avoiding a full script for the sake of expediency, but if it is the only method, I guess it will have to be done. Thanks for the feedback. Since I used grep to export each X-RSMF line, I needed a method of combining all fields back into records for all files. I exported the file name and used Excel to combine everything back into a table, and used that for importing. This also took care of aligning records if a line was missing from the text file.
– user68650, Commented Nov 10, 2022 at 13:15
I wrote an even hackier perl one-liner, but I didn't like it (it made too many assumptions and minor variations in input broke it). Turned it into the simple script above instead. BTW, it's the joining of the multiple lines of participants that made me go with this design; it was just easier (conceptually) to use multiple arrays. BTW, I don't really see the distinction you're making about "avoiding a full script". sed scripts are scripts. awk scripts are scripts. perl scripts are scripts. Some tasks require, or are done better with, something more complex than just regex search & replace ops.
– cas, Commented Nov 11, 2022 at 1:41
Also BTW, what was your original input file (the one you "used grep to export each X-RSMF line" from)? It would probably be easier to write a script to convert directly from that to the tab-separated output above.
– cas, Commented Nov 11, 2022 at 1:44
