3

I am using the grep command to extract the required information from a file. I am using two grep statements like the ones below:

XXXX=`grep XXXX FILE A|sort|uniq|wc -l`
grep YYYY FILE A|uniq| > FILE B

Now the file is being traversed twice. I want to know whether I can do these two steps in a single file traversal, i.e. whether I can use something similar to egrep to search for two strings, storing the result for one string in a variable and writing the output for the other string to a file.

3 Answers

1

You can use the following code. It searches the file only once for lines containing XXXX or YYYY and stores the matching lines in a shell variable. The lines containing XXXX and the lines containing YYYY are then selected from that variable instead of from the file.

# Read the file once and keep only the lines matching either pattern.
filtered=$(grep -E 'XXXX|YYYY' FILE A)
# Quote "$filtered" so lines containing spaces are not split into words.
XXXX=$(printf '%s\n' "$filtered" | grep XXXX | sort -u | wc -l)
printf '%s\n' "$filtered" | grep YYYY | uniq > FILE B

So the file is not traversed twice!
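
To make the idea concrete, here is a minimal sketch of the same approach run on a small sample. The temporary files and the sample lines are hypothetical stand-ins for the question's input and output files, and the variable name xxxx_count is only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the question's input file.
infile=$(mktemp)
outfile=$(mktemp)
printf '%s\n' 'a XXXX 1' 'b YYYY 1' 'b YYYY 1' 'a XXXX 1' 'c XXXX 2' 'unrelated' > "$infile"

# Single traversal of the input: keep only lines matching either pattern.
filtered=$(grep -E 'XXXX|YYYY' "$infile")

# Both results are derived from the in-memory copy, not from the file.
xxxx_count=$(printf '%s\n' "$filtered" | grep XXXX | sort -u | wc -l)
printf '%s\n' "$filtered" | grep YYYY | uniq > "$outfile"

echo "distinct XXXX lines: $xxxx_count"
cat "$outfile"
```

The quoted "$filtered" matters here: left unquoted, the shell would split the stored text on every space and the later grep calls would see single words rather than whole lines.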

2 Comments

This method will quickly blow up if the input size becomes larger than the available memory and only makes sense for small data batches.
If the purpose is to store data in a variable (as is the case in this question), a large input can always fill up memory.
0

Or use egrep with a disjunction:

egrep '(XXXX|YYYY)' FILE A | sort | uniq | ...

Or awk:

awk '/XXXX|YYYY/' FILE A | sort | uniq | ...

4 Comments

Thank you for your answer. I understand your point, but how can I store the results of the two grep statements in two variables?
How big is your input data? This only makes sense for small data volumes. Have a look at associative arrays in awk.
The input data is in the range of 200 MB. It's a large file.
Most machines nowadays have more than 200 MB of RAM, so you may be fine. If the input data outgrows your available memory, you need to resort to the pipes-and-filters processing as above.
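
As a rough sketch of the associative-array idea mentioned above, the following reads the input once, captures the XXXX count in a shell variable, and streams the YYYY lines straight to a file. It assumes the goal is to mimic `sort | uniq | wc -l` for XXXX and plain `uniq` (adjacent duplicates only) for YYYY. The temporary files, the sample lines, and the names xxxx_count, seen, prev and out are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; replace with the real input and output paths.
infile=$(mktemp)
outfile=$(mktemp)
printf '%s\n' 'a XXXX 1' 'b YYYY 1' 'b YYYY 1' 'a XXXX 1' 'c YYYY 2' 'b YYYY 1' > "$infile"

xxxx_count=$(awk -v out="$outfile" '
  # Count each distinct XXXX line once (like sort | uniq | wc -l).
  /XXXX/ && !seen[$0]++ { n++ }
  # Write YYYY lines, dropping adjacent repeats (like uniq without sort).
  /YYYY/ && $0 != prev  { print > out; prev = $0 }
  END { print n + 0 }
' "$infile")

echo "distinct XXXX lines: $xxxx_count"
cat "$outfile"
```

Only the distinct XXXX lines are held in memory; the YYYY lines are written as they are read, so memory use does not grow with the number of YYYY matches.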
0

There is a trailing '|' symbol in your question, and perhaps you intended the YYYY lines to also be piped to sort (or use sort -u!), in which case you could simply do:

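awk '/XXXX/ { if( !x[$0]++ ) xcount += 1 }   # count each distinct XXXX line once
     /YYYY/ { if( !y[$0]++ ) ycount += 1 }   # count each distinct YYYY line once
  END { print "XXXX:", xcount + 0            # + 0 prints 0 instead of an empty string
        print "YYYY:", ycount + 0
        for( i in y ) print i | "sort > FILEB"   # write the distinct YYYY lines, sorted
  }' FILE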

This scans the file once, incrementing the counter whenever a unique line containing the appropriate pattern is seen. Note that the order of iteration over the array of YYYY lines is not well defined, so the sort is necessary. Some versions of awk can sort the array without relying on an external utility, but not all do. Use perl if you want to do that.
