deleted 22 characters in body; edited tags; edited title
Jeff Schaller

Parsing only lines that have 9 periods

I have 90 GB of data culled from 13.5 TB.

I tried sort -u | uniq on the data, which was extracted with awk from the 13.5 TB of syslog data.

Some malformed data was apparent, so I reran the parse with awk and the 'seen' idiom, like so:

    awk -F, '!seen[$1]++' inputfile > outputfile
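For anyone unfamiliar with the idiom, here is a minimal sketch of what that command does. The function name `dedupe_first_field` and the sample lines are made up for illustration; they are not from the real data.

```shell
# seen[$1]++ evaluates to the count *before* the increment, so
# !seen[$1]++ is true only the first time a given first field appears.
# Net effect: keep the first line for each distinct first comma-separated field.
dedupe_first_field() {
    awk -F, '!seen[$1]++' "$@"
}

# Hypothetical sample: the third line repeats the first field of line one
# and is therefore dropped.
printf '%s\n' \
    '10.0.0.1,a' \
    '10.0.0.2,b' \
    '10.0.0.1,c' |
    dedupe_first_field
```

Note that the `seen` array keeps every distinct key in memory, so memory use grows with the number of unique first fields.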

This turned out to be the most time-efficient approach, but the output still included some malformed data. Maybe the original logs contain malformed entries, or some lines got munged during the sort, uniq, and awk passes. I do not need a better way of parsing the original data: the sample is large enough that losing a little data out of 13.5 TB is OK.

There are 3 IP addresses per valid line.

Since there are 3 periods in an IP address, I need something that will keep only the lines that contain exactly 9 periods.
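One possible way to express that filter, as a rough sketch rather than a tested solution for the real data: split each line on "." and keep it only when that yields 10 fields, which means exactly 9 periods. The function name `keep_nine_periods` and the sample lines are hypothetical. This only counts periods; it does not check that the fields are valid IP addresses, and a period anywhere else on the line (a hostname, a fractional timestamp) would change the count.

```shell
# With "." as the field separator, a line containing exactly 9 periods
# splits into exactly 10 fields, so NF == 10 selects those lines.
keep_nine_periods() {
    awk -F. 'NF == 10' "$@"
}

# Hypothetical sample: only the first line has three complete IPv4
# addresses (9 periods); the second has 8 periods, the third has 12.
printf '%s\n' \
    '10.0.0.1,10.0.0.2,10.0.0.3' \
    '10.0.0.1,10.0.0.2,10.0.3' \
    '10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4' |
    keep_nine_periods
```

On the real file this would be run as `keep_nine_periods inputfile > outputfile`, or directly as `awk -F. 'NF == 10' inputfile > outputfile`.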

