I have the header below in about 100,000 files. I have already extracted each line separately and combined the records in Excel, so my time crunch is over, and I am now looking for an expedient method of data extraction.
X-RSMF-Generator: RSMF Generator Sample Library
X-RSMF-Version: 1.0.0
X-RSMF-EventCount: 53
X-RSMF-BeginDate: 2022-09-20T04:33:11-04:00
X-RSMF-EndDate: 2022-09-20T16:47:56-04:00
X-RSMF-GroupID: GRP000000118
X-RSMF-SecondaryGroupID: GRP000000118_D_20220920
X-RSMF-ContainsDeleted: False
X-RSMF-Application: Native Messages
X-RSMF-Participants: Person One <5156242756> Person two, Person
three [email protected] <21243210277> Person four *** <345278652345>
MIME-Version: 1.0
Not all lines are present in all files, and the last field can span more than one line. I think we can use MIME-Version: 1.0 as the stop marker. I also only need the data for each line entry; everything before the ": " (colon space) can be ignored, as those are the field headings.
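To make the rules above concrete, here is a minimal sketch of one way they could be expressed with find and awk. It is an illustration under assumptions, not a tested solution for the real data: the function name extract_rsmf_headers is hypothetical, the column order is simply the order of the sample header, values are kept verbatim (no date reformatting), and folded lines are joined with a single space. A field missing from a file becomes an empty column so the columns stay aligned.

```shell
#!/bin/sh
# Sketch: print one tab-separated row of header values per .rsmf file.
# extract_rsmf_headers is a hypothetical helper name; $1 is the directory to scan.
extract_rsmf_headers() {
    # "-exec ... {} +" batches the file names, so 100,000 files do not
    # overflow the command-line length limit.
    find "$1" -type f -name '*.rsmf' -exec awk '
        BEGIN {
            OFS = "\t"
            # Assumed column order, taken from the sample header.
            n = split("Generator Version EventCount BeginDate EndDate GroupID SecondaryGroupID ContainsDeleted Application Participants", names, " ")
        }
        # Print the row collected for the previous file, then reset.
        function emit(   i, line) {
            if (!seen) return
            line = val[names[1]]
            for (i = 2; i <= n; i++) line = line OFS val[names[i]]
            print line
            split("", val)      # portable way to clear the array
            seen = 0
        }
        FNR == 1 { emit(); done = 0; key = "" }    # first line of a new file
        done     { next }                          # already past the stop marker
                 { sub(/\r$/, "") }                # tolerate CRLF line endings
        /^MIME-Version:/ { done = 1; next }        # stop marker
        /^X-RSMF-[A-Za-z]+: / {
            key = $0; sub(/^X-RSMF-/, "", key); sub(/:.*/, "", key)
            v = $0;   sub(/^[^:]*: /, "", v)       # drop the field heading
            val[key] = v; seen = 1
            next
        }
        key != "" {                                # continuation of the previous field
            sub(/^[ \t]+/, "")
            val[key] = val[key] " " $0
        }
        END { emit() }
    ' {} +
}
```

It would be called as extract_rsmf_headers /mnt/c/Temp/rsmf > headers.tsv, and the resulting file opens directly in Excel. Header names not listed in the names string are read but not printed, and the row order follows whatever order find returns the files in.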
I started out using sed, thinking I could just concatenate each line and pipe the result to awk to make each column.
#!/bin/bash
shopt -s nullglob
FILES=/mnt/c/Temp/rsmf/*.rsmf
for f in $FILES
do
#echo "Processing $f"
sed -rn \
-e '/^X-RSMF-BeginDate:/{
s/X-RSMF-BeginDate: //
s/T/ /
s/-0[45]:00/ /
h
#p
}' \
-e '/^X-RSMF-EndDate:/{
s/X-RSMF-EndDate: //
s/T/ /
s/-0[45]:00/ /
H
#p
}' \
-e '/^X-RSMF-GroupID:/{
s/X-RSMF-GroupID: //
H
x
s/\r\n//gp
}' \
"$f"
done
Results:
2022-10-05 12:54:27 2022-10-05 12:54:27 GRP000000001
2022-10-05 11:48:18 2022-10-05 11:48:18 GRP000000002
Before spending time on this, I wanted to seek recommendations on the best approach and practice for this particular project.
Thoughts??