Filter text from each file and turn it into a list of comma-separated values

Question

I'm trying to extract some information from multiple files and create a CSV-type file. So far I have the extracting and the writing to a file working, but I don't know how to add a comma between the outputs or strip the newline at the end.

#!/bin/bash
for file in folder/*.txt; do
  grep 'sometext:' "$file" | sed 's/^.*:\s*//' >> list.txt
  # doing similar stuff with other lines in the current file
done

I tried to use echo -n to strip the newline, but this did not return anything useful.

What the code should do:
For each file in the folder find lines beginning with some patterns (ex. sometext:, someothertext: etc) and append the rest of the line and a , to a single line, corresponding to that file in list.txt.

Example of content of the file in the folder:

randomtext: ...
sometext: Hello
randomtext: ...
someothertext: World
somedifferenttext: !
randomtext:

Would result in one single line in the output file: Hello,World,!,

Why would grepping for sometext also match the World and ! lines? — terdon
– terdon ♦, Commented May 3, 2016 at 11:15
Possible duplicate of Turn list into single line with delimiter so in your case sed '/some.*:/{s/.*: //;H};$!d;x;s/\n//;s/\n/,/g' "$FILE" — don_crissti
– don_crissti, Commented May 3, 2016 at 11:17
This would be done by similar lines, see the comment in the script part. — Pit
– Pit, Commented May 3, 2016 at 11:17
@don_crissti I don't think that's a dupe. The OP needs to filter the text, not just replace \n with ,. — terdon
– terdon ♦, Commented May 3, 2016 at 11:18

terdon · Accepted Answer · 2016-05-03 11:56:08Z

OK, first of all do not use a for loop! That is very inefficient. Just give grep all the file names at once:

grep 'sometext:' folder/*.txt

In this case, however, I would use awk instead of grep. I made 10 copies of your input file to test:

$ awk '{
        if($1~/sometext|someothertext|somedifferenttext/){
            printf "%s,",$2
        }
        if(FNR==1 && NR>1){
            print ""
        }
    }
    END{ print "" }' folder/*txt 
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,

Explanation

awk is a scripting language that reads its input line by line and splits each line on whitespace (by default, you can change that with -F) into fields. The first field will be $1, the second $2 etc.

if($1~/sometext|someothertext|somedifferenttext/){ : if the first field matches sometext or someothertext or somedifferenttext. Note that this will also match foosometext. If you want to limit to exact matches, change this to:
```
if($1=="sometext:" || $1=="someothertext:" || $1=="somedifferenttext:"){
```
printf "%s,",$2 : if the condition above is met, print the 2nd field followed by a comma.
if(FNR==1 && NR>1){ print "" } : NR is the current input line number and FNR is the current file's line number. So, print a newline (awk's print call adds a newline by default, so printing nothing is like printing a newline) each time the file's line number is 1 but not if the total number of lines processed is also one. In other words, print a newline each time we start reading a new file.
END{ print "" } : also print a newline after processing all files.
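The NR/FNR test can also be placed before the printing rule, so that a match on the very first line of a file lands on the new output line rather than on the previous one. The following is only a sketch under stated assumptions: the two sample files and the temporary directory are hypothetical, and the exact-match labels are the ones from the question.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
dir=$(mktemp -d)
printf 'randomtext: ...\nsometext: Hello\nsomeothertext: World\nsomedifferenttext: !\n' > "$dir/1.txt"
# The second file has a match on its very first line.
printf 'sometext: Foo\nrandomtext: ...\nsomedifferenttext: Bar\n' > "$dir/2.txt"

result=$(awk '
    # Starting a new file (but not the first one): finish the previous output line.
    FNR == 1 && NR > 1 { print "" }
    # Exact label match: print the second field followed by a comma.
    $1 == "sometext:" || $1 == "someothertext:" || $1 == "somedifferenttext:" {
        printf "%s,", $2
    }
    # Terminate the last output line, unless there was no input at all.
    END { if (NR > 0) print "" }
' "$dir"/*.txt)

printf '%s\n' "$result"
rm -rf "$dir"
```

With these two sample files the script prints one line per file: Hello,World,!, followed by Foo,Bar,.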

Note that this assumes you only have 2 fields per line. If you need to print the rest of the line instead, you can use (using the version that only prints exact matches to illustrate):

awk '{
    if($1=="sometext:" || 
       $1=="someothertext:" || 
       $1=="somedifferenttext:"){
        $1="";          # drop the label; awk rebuilds $0 with a leading space
        sub(/^ /,"");   # remove that leading space
        printf "%s,",$0
    }
    if(FNR==1 && NR>1){print ""}
    }END{print ""}' folder/*txt

The difference is that we use $0 (the full line) instead of $2 and set $1 to the empty string before printing. Because the empty $1 is still considered a field, the rebuilt line starts with an extra space. That space would appear in front of every value, not just at the start of the output line, so we remove it with sub() before each printf.
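Assigning to $1 also makes awk rebuild the line with single spaces between the fields. If the original spacing inside a value matters, one alternative is to strip the label with sub() on a copy of the line. This is a minimal sketch, assuming a hypothetical sample file and a label that ends at the first colon:

```shell
#!/bin/sh
# Hypothetical sample file; note the two spaces inside the first value.
dir=$(mktemp -d)
printf 'randomtext: ...\nsometext: Hello  there\nsomeothertext: big World\n' > "$dir/1.txt"

result=$(awk '
    FNR == 1 && NR > 1 { print "" }
    $1 == "sometext:" || $1 == "someothertext:" {
        line = $0
        # Remove everything up to the first colon plus the blanks after it.
        sub(/^[^:]*:[ \t]*/, "", line)
        printf "%s,", line
    }
    END { if (NR > 0) print "" }
' "$dir"/*.txt)

printf '%s\n' "$result"
rm -rf "$dir"
```

Because sub() works on the copy in line, $0 is left untouched and the value keeps its internal whitespace.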

Alternatively, you could also do the whole thing in Perl:

 $ perl -lane '
    if($F[0]=~/(sometext|someothertext|somedifferenttext):/){
        push @k,@F[1..$#F]
    } 
    if(eof){
        print join ",", @k; @k=();
    }' folder/file*
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!

Or, to also have the trailing ,:

 $ perl -lane '
    if($F[0]=~/^(sometext|someothertext|somedifferenttext):$/){
        push @k,@F[1..$#F]
    } 
    if(eof){
        print join ",", @k , ""; @k=();
    }' folder/file*
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,

Explanation

The basic idea here is the same. Perl's -a switch makes it behave like awk, splitting each input line into the array @F. Then, if the 1st element of the array is one of the desired strings, the rest of the fields (@F[1..$#F]) are added to the array @k. When we reach the end of a file (if(eof)), we join the contents of the @k array with commas, print the resulting string and empty @k for the next file.

Finally, here's one way to do it in the way you were attempting (assuming GNU grep):

$ for f in folder/*; do 
    grep -hoP '^(sometext|someothertext|somedifferenttext): \K.*' "$f" | 
        perl -pe 's/\n/,/; END{print "\n"}'; 
  done
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
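If GNU grep's -P is not available, a similar loop can be sketched with sed -E and paste. This is an illustration rather than the method above: the sample files and the output name list.csv are hypothetical, and paste joins the values without a trailing comma.

```shell
#!/bin/sh
# Hypothetical sample data: two copies of the example file from the question.
dir=$(mktemp -d)
printf 'randomtext: ...\nsometext: Hello\nrandomtext: ...\nsomeothertext: World\nsomedifferenttext: !\nrandomtext:\n' > "$dir/1.txt"
cp "$dir/1.txt" "$dir/2.txt"

for f in "$dir"/*.txt; do
    # sed -n with the p flag prints only the lines where the label was stripped;
    # paste -s joins those lines into one, separated by commas.
    sed -nE 's/^(sometext|someothertext|somedifferenttext):[[:blank:]]*//p' "$f" |
        paste -sd, -
done > "$dir/list.csv"    # redirect the whole loop once instead of appending per file

result=$(cat "$dir/list.csv")
printf '%s\n' "$result"
rm -rf "$dir"
```

Redirecting after done writes the output of every iteration to the same file, which is also how any of the earlier commands can be saved to a file.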

@Pit sure. I also added another approach that uses the (inefficient) for loop. That might be easier for you to understand. — terdon
– terdon ♦, Commented May 3, 2016 at 11:56
I understand most of it, I just can't get any of the examples to write to a file. — Pit
– Pit, Commented May 3, 2016 at 12:16
@Pit just add a > file.csv at the end. Come into /dev/chat if you're having trouble. — terdon
– terdon ♦, Commented May 3, 2016 at 12:54
I have a strange bug that duplicates each line in the output, but apart from that it seems to work, and I understand enough to expand the code to match all my needs. (I'm using the perl version) — Pit
– Pit, Commented May 3, 2016 at 13:37

don_crissti (community wiki) · Answer · 2016-05-03 13:04:57Z

With GNU sed:

sed -Es '/pattern1|pattern2|pattern3/{
s/.*:[[:blank:]]*//;H}
$!d;s/.*//;x;/^$/d;s/\n(.*)/\1,/;s/\n/,/g' folder/*.txt > list.txt

where list.txt content will be something like:

file1match1,file1match2,
file2match1,
file4match1,file4match2,file4match3,

so file3 is missing from the output as there was no line matching pattern*.
How it works: with -s, sed processes each file separately, so $ matches the last line of every file. On lines that match pattern* it removes the unneeded part (via s/.*:[[:blank:]]*//) and appends the result to the hold buffer. It deletes every line except the last one. On the last line it empties the pattern space and exchanges the buffers, which also leaves the hold buffer empty for the next file (the hold buffer is not reset between files, even with -s). If the pattern space is then empty, no line in that file matched pattern*, so it deletes the pattern space. Otherwise it removes the leading newline, adds the trailing comma and replaces the remaining newlines with commas.

With other seds you'll have to loop. Each sed invocation starts with an empty hold buffer, so the clearing step is not needed here:

for file in folder/*.txt; do
sed '/pattern1\|pattern2\|pattern3/{
s/.*:[[:blank:]]*//
H
}
$!d
x
/^$/d
s/\n\(.*\)/\1,/
s/\n/,/g' "$file"
done > list.txt
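To see how a file without any match is skipped, the loop can be tried on a few throwaway files. This is a sketch under stated assumptions: the file names, their contents and the labels are hypothetical, and sed -E is used in place of \| so that the alternation is written in extended syntax.

```shell
#!/bin/sh
# Hypothetical sample data: b.txt contains no matching line.
dir=$(mktemp -d)
printf 'noise: x\nsometext: Hello\nsomeothertext: World\n' > "$dir/a.txt"
printf 'noise: y\nnoise: z\n' > "$dir/b.txt"
printf 'sometext: Bye\nnoise: w\n' > "$dir/c.txt"

for file in "$dir"/*.txt; do
    # Collect the stripped values in the hold buffer, then build the
    # comma-separated line when the last line of the file is reached.
    sed -E '/^(sometext|someothertext):/{
s/.*:[[:blank:]]*//
H
}
$!d
x
/^$/d
s/\n(.*)/\1,/
s/\n/,/g' "$file"
done > "$dir/list.csv"

result=$(cat "$dir/list.csv")
printf '%s\n' "$result"
rm -rf "$dir"
```

The output has two lines, Hello,World, and Bye, with nothing printed for b.txt.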

edited May 3, 2016 at 13:04 · community wiki · 2 revs · don_crissti

I can't get it to work properly; it does not seem to add the comma every time. @terdon's solution is way easier for me to understand.
– Pit, Commented May 3, 2016 at 13:39
@Pit - no problem. Keep in mind that people can only guess why it doesn't work for you. It's up to you to add details to your question and make it clearer.
– don_crissti, Commented May 3, 2016 at 13:41
I understand that, but due to the nature of the data I can't release too much of it, so I can't make it any clearer.
– Pit, Commented May 3, 2016 at 13:44
