I have a tricky, complex program that pre-processes text before sending it to machine learning software.
To make a long story short:
The bash script enters a folder where thousands of text files are waiting, opens each one with cat, cleans it by deleting superfluous lines and then, before sending the files to the machine learning process, writes a CSV to disk with some info for later human checking.
It's very important to keep each line number beside its content, because the order in which words appear is key for the ML process.
So my approach is to add a line number to every line this way (a one-liner with many piped commands):
for file in *.txt
do
cat -v "$file" | nl -nrz -w4 -s$'\t' | .......
Then I get rid of undesired lines this way (sample):
 ...... | sed '/^$/d'| grep -vEi 'unsettling|aforementioned|ruled' 
and finally keep two lines for further processing this way:
........ | grep -A 1 -Ei 'university|institute|trust|college'
The output is something like this (sampling two files):
file 1.txt
0098  university of Goteborg is downtown and is one of the
0099  most beautiful building you can visit
0123  the institute of Oslo for advanced investigation
0124  is near the central station and keeps
0234  most important college of Munich
0235  and the most acclaimed teachers are
file 2.txt
0023  there is no trust or confidence
0024  in the counselor to accomplish the new
0182  usually the college is visited
0183  every term for the president but
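For anyone who wants to reproduce these first steps, here is a minimal sketch that runs the same commands on a throwaway file. The file name sample.txt, its contents and the temporary directory are invented for the demo; only the piped commands come from the script.

```shell
#!/usr/bin/env bash
# Sketch only: sample.txt and its contents are made up for this demo.
workdir=$(mktemp -d)
printf '%s\n' \
  'some intro text' \
  'university of Goteborg is downtown and is one of the' \
  'most beautiful building you can visit' \
  'this aforementioned line is dropped' \
  'most important college of Munich' \
  'and the most acclaimed teachers are' > "$workdir/sample.txt"

numbered=$(
  cat -v "$workdir/sample.txt" |
    nl -nrz -w4 -s$'\t' |                                # prefix each line with NNNN<TAB>
    sed '/^$/d' |                                        # delete empty lines
    grep -vEi 'unsettling|aforementioned|ruled' |        # drop unwanted lines
    grep -A 1 -Ei 'university|institute|trust|college'   # keep each match plus the next line
)
printf '%s\n' "$numbered"
rm -r "$workdir"
```

At this point every surviving line still starts with its number and a tab. One thing that may matter: by default nl numbers only non-empty lines, so the numbers can drift from the real positions when a file contains blank lines; nl -ba numbers every line.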
[EDITED] I missed this step, which was on the wrong line. Sorry.
Then, the text is stacked into "paragraphs" this way:
tr '\n\r' ' '| grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}'
[END EDIT]
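To make the effect of this step visible, here is a small sketch that feeds two numbered lines from the sample output through the same tr and grep commands. The variable names numbered and stacked are only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the two numbered lines are copied from the sample output.
numbered=$'0098\tuniversity of Goteborg is downtown and is one of the\n0099\tmost beautiful building you can visit'

stacked=$(printf '%s\n' "$numbered" |
  tr '\n\r' ' ' |                                            # join all lines into one
  grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}')  # print only the matched text
printf '%s\n' "$stacked"
```

Because grep -o prints only the matched part, and the pattern starts at the keyword, the 0098 prefix is not part of what comes out.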
This output is saved in the variable "CLEANED_TXT" and fed into a while loop, this way:
while read everyline; do
    if [[ -n "${everyline// }" ]]; then
        echo "$file;$linenumber;$everyline" >> output.csv
    fi
done <<< "$CLEANED_TXT"
done  # for every text file
FINAL DESIRED OUTPUT
file 1.txt;0098;university of Goteborg
file 1.txt;0123;the institute of Oslo
file 1.txt;0234;college of Munich
My issue is that the line number is lost at this last step because of the grep just before the loop. Keep in mind that I need the original line number; re-numbering inside the loop is not an option.
I'm stuck. Any help would be much appreciated.
Regards
awk '{if (match($0, "university|college")) print $1, substr($0,RSTART,RLENGTH+25)}'
grep -n help at all?
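A rough sketch of how the awk suggestion might be applied, assuming it runs on the numbered lines before they are joined with tr, so that $1 is still the number added by nl. The variables file, numbered and rows, and the use of OFS to get semicolons, are assumptions made for this demo and not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch only: the file name and numbered lines are taken from the sample output.
file='file 1.txt'
numbered=$'0098\tuniversity of Goteborg is downtown and is one of the\n0099\tmost beautiful building you can visit\n0234\tmost important college of Munich\n0235\tand the most acclaimed teachers are'

rows=$(printf '%s\n' "$numbered" |
  awk -v f="$file" -v OFS=';' '
    # $1 is the number added by nl; match() sets RSTART and RLENGTH.
    # Unlike grep -i, this match is case-sensitive.
    match($0, /university|college/) {
      print f, $1, substr($0, RSTART, RLENGTH + 25)
    }')
printf '%s\n' "$rows"
```

Each row has the form file;number;text, so it could be appended to output.csv directly. Since awk works line by line here, the 25 characters of context stop at the end of the matching line instead of running into the following one.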