2

I'm trying to extract some information from multiple files and create a csv-type file. Until now I got the extracting and writing to file part working but don't know how I could add a comma between each output or strip the newline at the end.

#!/bin/bash
for file in folder/*.txt; do
  grep 'sometext:' "$file" | sed 's/^.*:\s*//' >> list.txt
  # doing similar stuff with other lines in the current file
done

I tried to use echo -n to strip the newline, but this did not return anything useful.

What the code should do:
For each file in the folder find lines beginning with some patterns (ex. sometext:, someothertext: etc) and append the rest of the line and a , to a single line, corresponding to that file in list.txt.

Example of content of the file in the folder:

randomtext: ...
sometext: Hello
randomtext: ...
someothertext: World
somedifferenttext: !
randomtext:

Would result in one single line in the output file: Hello,World,!,

9
  • I updated the question. Commented May 3, 2016 at 11:10
  • Why would grepping for sometext also match the World and ! lines? Commented May 3, 2016 at 11:15
  • Possible duplicate of Turn list into single line with delimiter so in your case sed '/some.*:/{s/.*: //;H};$!d;x;s/\n//;s/\n/,/g' "$FILE" Commented May 3, 2016 at 11:17
  • This would be done by similar lines; see the comment in the script part. Commented May 3, 2016 at 11:17
  • @don_crissti I don't think that's a dupe. The OP needs to filter the text, not just replace \n with ,. Commented May 3, 2016 at 11:18

2 Answers

4

OK, first of all do not use a for loop! That is very inefficient. Just give grep all the file names at once:

grep 'sometext:' folder/*.txt
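
Note that when grep is given more than one file, it prefixes every match with the file name. A minimal sketch of the difference, assuming two hypothetical sample files created in a temporary directory (the -h option, which suppresses the prefix, is widely supported but is an assumption about your grep):

```shell
#!/bin/sh
# Hypothetical sample data, only for illustration.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'sometext: Hello\n' > "$dir/a.txt"
printf 'sometext: World\n' > "$dir/b.txt"

# With several files, each match is prefixed with "filename:".
with_names=$(grep 'sometext:' "$dir"/*.txt)

# -h drops the file name prefix, leaving only the matching lines.
without_names=$(grep -h 'sometext:' "$dir"/*.txt)

printf '%s\n' "$with_names"
printf '%s\n' "$without_names"
```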

In this case, however, I would use awk instead of grep. I made 10 copies of your input file to test:

$ awk '{
        if($1~/sometext|someothertext|somedifferenttext/){
            printf "%s,",$2
        }
        if(FNR==1 && NR>1){
            print ""
        }
    }
    END{ print "" }' folder/*txt 
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,

Explanation

awk is a scripting language that reads its input line by line and splits each line on whitespace (by default, you can change that with -F) into fields. The first field will be $1, the second $2 etc.

  • if($1~/sometext|someothertext|somedifferenttext/){ : if the first field matches sometext or someothertext or somedifferenttext. Note that this will also match foosometext. If you want to limit to exact matches, change this to:

    if($1=="sometext:" || $1=="someothertext:" || $1=="somedifferenttext:"){
    
  • printf "%s,",$2 : if the condition above is met, print the 2nd field followed by a comma.

  • if(FNR==1 && NR>1){ print "" } : NR is the current input line number and FNR is the current file's line number. So, print a newline (awk's print call adds a newline by default, so printing nothing is like printing a newline) each time the file's line number is 1 but not if the total number of lines processed is also one. In other words, print a newline each time we start reading a new file.

  • END{ print "" }' : also print a newline after processing all files.
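
One caveat with the order of the two tests: if the very first line of a file matches, its value is printed before the line break that starts the new file's row. A minimal sketch of a variant that checks FNR first, assuming two hypothetical sample files in a temporary directory (the collect function name is made up for this example):

```shell
#!/bin/sh
# Hypothetical sample data; the first line of each file is a match.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'sometext: Hello\nrandomtext: x\nsomeothertext: World\n' > "$dir/a.txt"
printf 'sometext: Foo\nsomedifferenttext: Bar\n' > "$dir/b.txt"

collect() {
  awk '
    # Start a new output line before handling the first line of
    # every file except the very first one.
    FNR == 1 && NR > 1 { print "" }
    # Exact matches on the label field; print the second field.
    $1 == "sometext:" || $1 == "someothertext:" || $1 == "somedifferenttext:" {
      printf "%s,", $2
    }
    # Terminate the last output line.
    END { print "" }
  ' "$@"
}

out=$(collect "$dir"/*.txt)
printf '%s\n' "$out"
```

With the sample files this prints one line per file: Hello,World, and then Foo,Bar,.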

Note that this assumes you only have 2 fields per line. If you need to print the entire line instead, you can use (using the version that only prints exact matches to illustrate):

awk '{
    if($1=="sometext:" || 
       $1=="someothertext:" || 
       $1=="somedifferenttext:"){
        $1=""; 
        printf "%s,",$0
    }
    if(FNR==1 && NR>1){print ""}
    }END{print ""}' folder/*txt | sed 's/^ //'

The difference is that we use $0 (the full line) instead of $2 and set $1 to the empty string before printing. This results in an extra space printed at the beginning (because the empty $1 is still considered a field), so we pass that through sed to remove it.
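
If you would rather avoid the extra sed pass, one possible variation is to strip the label with awk's sub() instead of emptying $1. This is only a sketch: the strip_label name and the sample input are made up, and it assumes the label is everything up to the first colon.

```shell
#!/bin/sh
strip_label() {
  awk '
    # New output line for every file after the first.
    FNR == 1 && NR > 1 { print "" }
    $1 == "sometext:" || $1 == "someothertext:" || $1 == "somedifferenttext:" {
      # Remove everything up to the first colon plus following blanks,
      # so no leading space is left behind.
      sub(/^[^:]*:[ \t]*/, "")
      printf "%s,", $0
    }
    END { print "" }
  ' "$@"
}

# Hypothetical input with a multi-word value, read from stdin.
out=$(printf 'sometext: Hello big world\nrandomtext: x\nsomeothertext: again\n' | strip_label)
printf '%s\n' "$out"
```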


Alternatively, you could also do the whole thing in Perl:

 $ perl -lane '
    if($F[0]=~/(sometext|someothertext|somedifferenttext):/){
        push @k,@F[1..$#F]
    } 
    if(eof){
        print join ",", @k; @k=();
    }' folder/file*
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!
Hello,World,!

Or, to also have the trailing ,:

 $ perl -lane '
    if($F[0]=~/^(sometext|someothertext|somedifferenttext):$/){
        push @k,@F[1..$#F]
    } 
    if(eof){
        print join ",", @k , ""; @k=();
    }' folder/file*
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,

Explanation

The basic idea here is the same. Perl's -a switch makes it behave like awk, splitting each input line into the array @F. Then, if the 1st element of the array is one of the desired strings, the rest of the fields (@F[1..$#F]) are added to the array @k. If we reach the end of a file (if(eof)), we join the contents of the @k array with commas and print the resulting string.


Finally, here's one way to do it in the way you were attempting (assuming GNU grep):

$ for f in folder/*; do 
    grep -hoP '^(sometext|someothertext|somedifferenttext): \K.*' "$f" | 
        perl -pe 's/\n/,/; END{print "\n"}'; 
  done
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
Hello,World,!,
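
If your grep has no -P, a loop along the same lines can be sketched with sed and paste instead. This assumes a sed that accepts -E for extended regular expressions; the join_matches name and the sample file are made up for the example. Unlike the versions above, paste does not add a trailing comma.

```shell
#!/bin/sh
# Hypothetical sample file, only for illustration.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'randomtext: ...\nsometext: Hello\nsomeothertext: World\nsomedifferenttext: !\n' > "$dir/a.txt"

join_matches() {
  for f in "$@"; do
    # Print only lines that start with a wanted label, minus the label,
    # then join those lines into one, separated by commas.
    sed -E -n 's/^(sometext|someothertext|somedifferenttext):[[:blank:]]*//p' "$f" |
      paste -sd, -
  done
}

out=$(join_matches "$dir"/*.txt)
printf '%s\n' "$out"
```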
6
  • I need some time to read/test all of that Commented May 3, 2016 at 11:51
  • @Pit sure. I also added another approach that uses the (inefficient) for loop. That might be easier for you to understand. Commented May 3, 2016 at 11:56
  • I understand most of it, I just can't get any of the examples to write to a file. Commented May 3, 2016 at 12:16
  • @Pit just add a > file.csv at the end. Come into /dev/chat if you're having trouble. Commented May 3, 2016 at 12:54
  • I have a strange bug which duplicates each line in the output, but apart from that it seems to work, and I understand enough to expand the code to match all my needs. (I'm using the perl version) Commented May 3, 2016 at 13:37
2

With gnu sed:

sed -Es '/pattern1|pattern2|pattern3/{
s/.*:[[:blank:]]*//;H}
$!d;x;/^\n$/d;s/\n(.*)/\1,/;s/\n/,/g' folder/*.txt > list.txt

where list.txt content will be something like:

file1match1,file1match2,
file2match1,
file4match1,file4match2,file4match3,

so file3 is missing from the output as there was no line matching pattern*.
How it works: it processes each file separately (-s), removing (via s/.*:[[:blank:]]*//) the unneeded part on lines that match pattern* and appending the result to the hold buffer (H). It deletes each line except the last one ($!d); on the last line it exchanges the buffers (x). If there's only a newline (\n) in the pattern space, it means no line in that file matched pattern*, so it deletes the pattern space. Otherwise it removes the leading newline, replaces the remaining ones with commas and adds the trailing comma.
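
The hold-buffer steps can be followed on a single input with a small sketch. It reads made-up sample lines from stdin, so it leaves out -s, and it assumes a sed that accepts -E; the join_one name is only for this example.

```shell
#!/bin/sh
join_one() {
  sed -E '/sometext|someothertext/{
# strip the label, then append the rest to the hold buffer
s/.*:[[:blank:]]*//
H
}
# every line but the last is dropped from the output
$!d
# on the last line, bring the collected values into the pattern space
x
/^\n$/d
# drop the leading newline and add the trailing comma
s/\n(.*)/\1,/
# turn the remaining newlines into commas
s/\n/,/g'
}

out=$(printf 'sometext: Hello\nrandomtext: ...\nsomeothertext: World\n' | join_one)
printf '%s\n' "$out"
```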

With other seds you'll have to loop:

for file in folder/*.txt; do
sed '/pattern1\|pattern2\|pattern3/{
s/.*:[[:blank:]]*//
H
}
$!d
x
/^\n$/d
s/\n\(.*\)/\1,/
s/\n/,/g' "$file"
done > list.txt
3
  • I can't get it to work properly; it seems to not add the comma all the time. @terdon's solution is way easier for me to understand. Commented May 3, 2016 at 13:39
  • @Pit - no problem. Keep in mind that people can only guess why it doesn't work for you. It's up to you to add details to your question and make it clearer. Commented May 3, 2016 at 13:41
  • I understand that, but due to the nature of the data, I can't release too much of it. So I can't make it any clearer. Commented May 3, 2016 at 13:44
