Find block pattern in many distinct files and selectively extract certain lines from it

Question

I have tens of thousands of directories. Each directory is named by number, like 1, 2, 3,... Each directory contains a large .dat file called data.dat and each file has a section that looks like this:

Configurations for Sm:

  Sm Nd H  O 

  0  1  4  0          1.00          7.14%
  1  0  3  0          3.00          7.14%
  0  0  5  0          1.00          7.14%

I care about the first two numbers on each line. I want:

All of the lines that start with 0 1 (in this example, that's the first line of numbers) to end up in a new file called 0-1.dat with the file name (number) at the start of the line. An example is below, called "example."
Likewise, all of the lines that begin with 1 0 (here the second line) should end up in a file called 1-0.dat with the file number at the start of the line.
All lines that begin with 0 0 (here the third line) should go to a file called 0-0.dat.

Complications for finding the lines I need are:

Sometimes one of the lines might be missing or the lines might be in different order.
Also, each file has many sections called Configurations for X, where X is some string. So I do need to somehow use the identifier Configurations for Sm: and search the first set of numbers below it.

Example of what I want to achieve, where the first number on the line is the directory name/number containing the file from which the line was extracted:

Example
In file called 0-1.txt:
1    0  1  4  0          1.00          7.14%
2    0  1  7  1          1.00          7.14%
3    0  1 ....

In file called 1-0.txt:
1    1  0  1  0          1.00          7.14%
2    1  0  4  2          1.00          7.14%
3    1  0 ....

I currently have:

find . -name data.dat -exec grep "Configurations for Sm:" {} + > 0-1.txt

All this does, though, is put anything that would come after Configurations for Sm: in a separate file. I just cannot figure out how to do what I need: find the lines below Configurations for Sm: by their number contents. If anyone has any hints or could direct me to an online resource, I would be very grateful. Thank you.

Can you have more than one section that starts with Configurations for Sm: in any of your initial data.dat files?
– Cbhihe, Commented Dec 14, 2019 at 8:39
Thanks for your interest, @Cbhihe. No, that particular expression is unique. There are other expressions like it, but not identical. For example, "Configurations for Nd:" — user3292696
– user3292696, Commented Dec 14, 2019 at 12:47
Thank you so much, lgpasquale and bu5hman for your help. I am on the road and will try your solutions in a few hours as soon as I get home! — user3292696
– user3292696, Commented Dec 14, 2019 at 14:08

lgpasquale · Accepted Answer · 2019-12-18 09:02:15Z

I think you can use a combination of sed and grep.

Assuming all your directories 0,1,2,... are in /your/path (e.g. /your/path/0/data.dat):

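for dir in /your/path/*; do
    idx=$(basename "${dir}")
    # Keep only the block from "Configurations for Sm:" to the next "Configurations for" header,
    # then the lines whose first two fields are 0 and 1, and prefix the directory name.
    # The trailing space in the grep pattern stops "0 1" from also matching e.g. "0 10".
    sed -n '/Configurations for Sm:/,/Configurations for/p' "${dir}/data.dat" | \
        grep '^ \+0 \+1 ' | \
        sed "s/^/${idx}/" >> "0-1.dat"
done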

The first sed should extract only the portion of the file that is of interest: everything from the line matching Configurations for Sm: up to the next line matching Configurations for.

grep keeps the lines that begin with 0 1 (with one or more spaces before and between the two numbers).

The second sed adds the "index" (the directory name) at the beginning of the line.

The output is appended (>>) to "0-1.dat".

You could add an outer loop to test different combinations of 0 and 1.
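The outer loop suggested above might look roughly like the following. This is only a sketch under a few assumptions: the two-directory sample tree under a temporary directory, the variables root, out, a and b, and the sample numbers in the Nd and H sections are all made up for illustration, and the patterns use the portable "  *" form instead of the GNU-only \+.

```shell
#!/bin/sh
# Build a tiny hypothetical tree so the sketch can be run on its own.
root=$(mktemp -d)
for n in 1 2; do
    mkdir -p "$root/$n"
    cat > "$root/$n/data.dat" <<'EOF'
Configurations for Nd:

  0  1  9  9          9.00          9.99%

Configurations for Sm:

  Sm Nd H  O 

  0  1  4  0          1.00          7.14%
  1  0  3  0          3.00          7.14%
  0  0  5  0          1.00          7.14%

Configurations for H:

  0  1  8  8          8.00          8.88%
EOF
done

out=$root/out
mkdir -p "$out"

# Only numbered directories are visited, so "out" itself is skipped.
for dir in "$root"/[0-9]*; do
    idx=$(basename "$dir")
    # Outer loops: try every combination of the first two columns.
    for a in 0 1; do
        for b in 0 1; do
            # Restrict to the Sm block, keep lines starting with "a b",
            # prefix the directory name, and append to a-b.dat.
            sed -n '/Configurations for Sm:/,/Configurations for/p' "$dir/data.dat" |
                grep "^ *$a  *$b " |
                sed "s/^/$idx  /" >> "$out/$a-$b.dat"
        done
    done
done

cat "$out/0-1.dat"
```

Because the output is appended with >>, a combination that never occurs (1 1 in this sample) simply leaves an empty file, and old output files need to be removed before a second run. Note that this rereads every data.dat once per combination, which may matter with tens of thousands of files.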

Note: I haven't properly tested this.

bu5hman · Answer · 2019-12-14 13:51:43Z

How about an awk solution?

awk '/^ *[0-1] +[0-1]/{
    n=split(FILENAME,d,"/");print d[n-1], $0 > $1"-"$2".txt"
}' $(find . -name "*.dat")

First, find all of your .dat files and feed them to awk, but only process the lines that start (^) with 0 or 1 as their first two non-whitespace fields:

/^ *[0-1] +[0-1]/

Then split the filename on / into an array, storing the number of elements in the array in n

n=split(FILENAME,d,"/")

Finally, print the directory name/number (which is the d[n-1] element of the array) followed by the line from your dat file ($0) to a file whose name is built from the first two values:

print d[n-1], $0 > $1"-"$2".txt"
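As written, the pattern matches lines in every section of the file, not only those under Configurations for Sm:. One possible way to restrict it is a flag that is switched on by the Sm header and off by any other Configurations for header. The sketch below is an assumption-laden illustration, not part of the original answer: the flag name in_sm, the directory name 7 and the sample numbers in the Nd and H sections are invented, and the output file name is parenthesised because an unparenthesised concatenation after > is not handled the same way by every awk.

```shell
#!/bin/sh
# One hypothetical directory "7" with an Sm block between two other blocks.
root=$(mktemp -d)
mkdir -p "$root/7"
cat > "$root/7/data.dat" <<'EOF'
Configurations for Nd:

  0  1  9  9          9.00          9.99%

Configurations for Sm:

  Sm Nd H  O 

  0  1  4  0          1.00          7.14%
  1  0  3  0          3.00          7.14%

Configurations for H:

  1  0  8  8          8.00          8.88%
EOF

cd "$root" || exit 1
awk '
    # Any section header updates the flag: 1 for the Sm header, 0 otherwise.
    /Configurations for/ { in_sm = ($0 ~ /Configurations for Sm:/); next }
    # Inside the Sm block, keep lines whose first two fields are 0 or 1.
    in_sm && /^ *[01] +[01] / {
        n = split(FILENAME, d, "/")
        print d[n-1], $0 > ($1 "-" $2 ".txt")
    }
' 7/data.dat

cat 0-1.txt 1-0.txt
```

Only the two lines from the Sm block end up in 0-1.txt and 1-0.txt; the similar-looking lines under the Nd and H headers are skipped.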

If you have tens of thousands of files, the overhead of splitting FILENAME for every matching line may be too much. In that case you could loop over the files, feed each one to awk separately, and append to the collating files with >> instead of >.

maybe....

find . -iname "*.dat" -print0 | xargs -0 -n 1 -P 0 awk 'NR==1{n=split(FILENAME,d,"/"); dir=d[n-1]} /^ *[0-1] +[0-1]/{print dir, $0 >> ($1"-"$2".txt")}'
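A possible middle ground, sketched here as an assumption rather than as something from the answer, is to let find pass the files to awk in large batches with -exec ... {} + and to recompute the directory name only when a new file starts (FNR == 1). The three sample directories and their contents are made up; >> is used because find may start more than one awk process, and each new process would truncate a file opened with >. Like the commands above, this sketch does not restrict matches to the Sm block.

```shell
#!/bin/sh
# Hypothetical tree: directories 1, 2, 3, each with a one-line Sm block.
root=$(mktemp -d)
for n in 1 2 3; do
    mkdir -p "$root/$n"
    printf '%s\n' 'Configurations for Sm:' '' \
        "  0  1  $n  0          1.00          7.14%" > "$root/$n/data.dat"
done

cd "$root" || exit 1
rm -f ./*-*.txt   # appended output must not pile up across runs

find . -name data.dat -exec awk '
    # First line of each input file: split the path once and remember the directory.
    FNR == 1 { n = split(FILENAME, d, "/"); dir = d[n-1] }
    # Lines whose first two fields are 0 or 1 go to <first>-<second>.txt.
    /^ *[01] +[01] / { print dir, $0 >> ($1 "-" $2 ".txt") }
' {} +

# find does not guarantee any order, so sort by directory number for display.
sort -n 0-1.txt
```

Compared with xargs -n 1, this starts far fewer awk processes, at the cost of running them one after another instead of in parallel.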

I am going to try your solution right now. I marked the earlier answer as correct because I tried it first, being less familiar with Awk. However, I want to learn (particularly as I heard that it's faster--would be interested to know if that's true) and really appreciate that you took time to provide this. Thank you. — user3292696
– user3292696, Commented Dec 15, 2019 at 16:40
