I have 90 GB of data culled from 13.5 TB.
I have tried sort -u | uniq on data that was awk'd from the 13.5 TB of syslog data.
Some malformed data was apparent, so I reran the parse with awk and 'seen' like so:
awk -F, '!seen[$1]++' inputfile > outputfile
This turned out to be the most time-efficient approach, but the output still included some malformed data. Maybe there are malformed log entries, or some lines got munged during the sorting, uniq'ing, and awk'ing. I am not looking for a better way of parsing the original data: the sample is large enough that losing a little data out of 13.5 TB is OK.
There are 3 IP addresses per valid line.
Since there are 3 periods in an IP address, I need something that will keep only the lines that have exactly 9 periods (".").
20.20.20.20 foo.bar.baz 300.300.300.300?
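
A minimal sketch of one way this could be done, assuming a POSIX-style awk: with "." as the field separator, a line containing exactly 9 periods splits into 10 fields, so testing NF is enough. The function name keep_nine_dots and the sample lines are made up for illustration; this only counts periods and does not check that each group of digits is a valid IP address.

```shell
# Keep only lines containing exactly nine literal periods.
# With "." as the field separator, 9 periods yield 10 fields.
# keep_nine_dots is a hypothetical helper name, not an existing tool.
keep_nine_dots() {
    awk -F. 'NF == 10' "$@"
}

# Demonstration on made-up sample lines (not real log data):
# only the first line has 9 periods, so only it is printed.
printf '%s\n' \
    '10.0.0.1 10.0.0.2 10.0.0.3' \
    '10.0.0.1 foo.bar.baz 10.0.0.3' \
    '10.0.0.1 10.0.0.2' |
    keep_nine_dots
```

On the real data this would presumably be run as keep_nine_dots inputfile > outputfile. It reads the file once and keeps no per-line state, so memory use should stay flat even on 90 GB.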