
Here is a part of a large access.log file which I want to analyze:

4.3.2.1 - - [22/Sep/2016:14:27:18 -0500]  "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5.4; http://my.example.com; verifying pingback from 127.0.0.1"-
4.3.2.1 - - [22/Sep/2016:14:27:18 -0500] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5.4; http://my.example.com; verifying pingback from 127.0.0.1"
3.2.1.4 - - [22/Sep/2016:14:27:18 -0500]  "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://somedomain.com; verifying pingback from 1.2.3.4"-
3.2.1.4 - - [22/Sep/2016:14:27:18 -0500] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://somedomain.com; verifying pingback from 1.2.3.4"
5.4.3.2 - - [22/Sep/2016:14:27:18 -0500]  "GET / HTTP/1.0" 301 184 "-" "WordPress/4.4.2; http://demo.otherdomain.com/blog; verifying pingback from 1.2.3.4"

I'm wondering how to extract unique domains from the file. The result should be:

http://my.example.com
http://somedomain.com;
http://demo.otherdomain.com/blog;

3 Answers


In situations like this, I am a big fan of grep with Perl lookarounds:

grep -oP '(?<=http://).*(?=;)' access.log | sort -u

Using your sample, it returns the following list:

$ grep -oP '(?<=http://).*(?=;)' access.log | sort -u
demo.otherdomain.com/blog
my.example.com
somedomain.com
  • This is elegant and easy to remember, but I noticed that sometimes the URLs in access.log do not end with ";", and this causes problems. Do you have a solution for this? Commented Dec 25, 2016 at 10:18
  • I would need an example of a log entry that does not end in ";". Commented Dec 25, 2016 at 14:30
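If some entries do lack the trailing ";", one workaround (a sketch, not part of the original answer) is to match the URL characters directly instead of anchoring on the semicolon. The sample log below is made up to show both cases:

```shell
# Minimal sample in the same format as the question; the second URL has
# no trailing ";":
printf '%s\n' \
  '1.1.1.1 - - [x] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://a.example.com; verifying pingback from 1.2.3.4"' \
  '2.2.2.2 - - [x] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://b.example.com verifying pingback"' > access.log

# Match http:// or https:// followed by any run of characters that are
# not ";", a double quote, or a space, so the match stops cleanly
# whether or not a ";" follows:
grep -oP 'https?://[^;" ]+' access.log | sort -u
```

Unlike the lookaround version, this keeps the http:// prefix in the output, which matches the result format shown in the question.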
awk '{for(i=1;i<=NF;i++)if($i ~ /^http:\/\//)print $i}' access.log | sort -u

If you want to match https as well:

awk '{for(i=1;i<=NF;i++)if($i ~ /^http(s)?:\/\//)print $i}' access.log | sort -u

You may also use tr to remove the trailing semicolons:

awk '{for(i=1;i<=NF;i++)if($i ~ /^http(s)?:\/\//)print $i}' access.log | tr -d ';' | sort -u
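Equivalently, the trailing ";" can be stripped inside awk itself with sub(), avoiding the extra tr process. This is a minor variation on the answer above, shown here with a made-up sample line:

```shell
# Sample line with a trailing ";" after the URL:
printf '%s\n' 'x "WordPress/4.5; http://my.example.com; verifying pingback"' > access.log

# sub(/;$/, "", $i) deletes a trailing ";" from the matched field before
# it is printed:
awk '{for(i=1;i<=NF;i++) if($i ~ /^https?:\/\//) {sub(/;$/,"",$i); print $i}}' access.log | sort -u
```

One difference worth noting: tr -d ';' deletes every ";" anywhere in the output, while sub(/;$/,"",$i) only removes one at the end of the field, which is safer if a URL can legitimately contain a semicolon.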
  • This is good, but I also don't want ";" at the end of the URLs. Commented Dec 25, 2016 at 10:13
  • Updated the answer. Commented Dec 25, 2016 at 10:25
awk '{ print $13 }' access.log | sort -u

As a basic attempt: awk picks the 13th field of each line, using whitespace as the delimiter, and pipes it into sort, which sorts the URLs and removes duplicates (-u for unique).

If only some lines contain the information, or they are not all in this format, you will need to grep the file first to select the lines this is applicable to.
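As a sketch of that prefiltering step, assuming "verifying pingback" is the string that marks the relevant lines (an assumption based on the sample log; adjust it to whatever identifies your lines of interest):

```shell
# Two sample lines; only the first is a pingback-verification entry:
printf '%s\n' \
  '4.3.2.1 - - [22/Sep/2016:14:27:18 -0500] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5.4; http://my.example.com; verifying pingback from 127.0.0.1"' \
  '9.9.9.9 - - [22/Sep/2016:14:27:19 -0500] "GET /robots.txt HTTP/1.0" 200 64 "-" "SomeBot/1.0"' > access.log

# Keep only the matching lines, then take field 13 (the URL) as above
# and deduplicate. Note the URL still carries its trailing ";":
grep 'verifying pingback' access.log | awk '{print $13}' | sort -u
```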

  • It does not produce the desired output. Commented Dec 25, 2016 at 9:47
  • Hopefully it works now; I hadn't noticed the mix of tabs and spaces. Commented Dec 25, 2016 at 10:02
