
Here is a part of a large access.log file which I want to analyze:

4.3.2.1 - - [22/Sep/2016:14:27:18 -0500]  "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5.4; http://my.example.com; verifying pingback from 127.0.0.1"-
4.3.2.1 - - [22/Sep/2016:14:27:18 -0500] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5.4; http://my.example.com; verifying pingback from 127.0.0.1"
3.2.1.4 - - [22/Sep/2016:14:27:18 -0500]  "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://somedomain.com; verifying pingback from 1.2.3.4"-
3.2.1.4 - - [22/Sep/2016:14:27:18 -0500] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://somedomain.com; verifying pingback from 1.2.3.4"
5.4.3.2 - - [22/Sep/2016:14:27:18 -0500]  "GET / HTTP/1.0" 301 184 "-" "WordPress/4.4.2; http://demo.otherdomain.com/blog; verifying pingback from 1.2.3.4"

I'm wondering how to extract unique domains from the file. The result should be:

http://my.example.com
http://somedomain.com;
http://demo.otherdomain.com/blog;

3 Answers


In situations like this, I am a big fan of grep with Perl lookarounds:

grep -oP '(?<=http://).*(?=;)' access.log | sort -u

Using your sample, it returns the following list:

$ grep -oP '(?<=http://).*(?=;)' access.log | sort -u
demo.otherdomain.com/blog
my.example.com
somedomain.com
  • This is elegant and easy to remember, but I noticed that sometimes the URLs in access.log do not end with ";", and this causes problems. Do you have a solution for this? Commented Dec 25, 2016 at 10:18
  • I would need an example of a log entry that does not end in ";". Commented Dec 25, 2016 at 14:30
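If some entries do lack the trailing ";", one workaround (a sketch, not part of the original answer) is to match the URL characters directly instead of anchoring on the semicolon. The sample log below is made up to show both cases:

```shell
# Minimal sample in the same format as the question; the second URL has
# no trailing ";":
printf '%s\n' \
  '1.1.1.1 - - [x] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://a.example.com; verifying pingback from 1.2.3.4"' \
  '2.2.2.2 - - [x] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5; http://b.example.com verifying pingback"' > access.log

# Match http:// or https:// followed by any run of characters that are
# not ";", a double quote, or a space, so the match stops cleanly
# whether or not a ";" follows:
grep -oP 'https?://[^;" ]+' access.log | sort -u
```

Unlike the lookaround version, this keeps the http:// prefix in the output, which matches the result format shown in the question.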
awk '{for(i=1;i<=NF;i++)if($i ~ /^http:\/\//)print $i}' access.log | sort -u

If you want to match https as well:

awk '{for(i=1;i<=NF;i++)if($i ~ /^http(s)?:\/\//)print $i}' access.log | sort -u

You may also use tr to remove the trailing semicolons:

awk '{for(i=1;i<=NF;i++)if($i ~ /^http(s)?:\/\//)print $i}' access.log | tr -d ';' | sort -u
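Equivalently, the trailing ";" can be stripped inside awk itself with sub(), avoiding the extra tr process. This is a minor variation on the answer above, shown here with a made-up sample line:

```shell
# Sample line with a trailing ";" after the URL:
printf '%s\n' 'x "WordPress/4.5; http://my.example.com; verifying pingback"' > access.log

# sub(/;$/, "", $i) deletes a trailing ";" from the matched field before
# it is printed:
awk '{for(i=1;i<=NF;i++) if($i ~ /^https?:\/\//) {sub(/;$/,"",$i); print $i}}' access.log | sort -u
```

One difference worth noting: tr -d ';' deletes every ";" anywhere in the output, while sub(/;$/,"",$i) only removes one at the end of the field, which is safer if a URL can legitimately contain a semicolon.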
  • This is good, but I also don't want ";" at the end of the URLs. Commented Dec 25, 2016 at 10:13
  • Updated the answer. Commented Dec 25, 2016 at 10:25
awk '{ print $13 }' access.log | sort -u

As a basic attempt: awk picks the 13th field of each line, using whitespace as the delimiter, and pipes it into sort, which sorts the URLs and removes duplicates (-u for unique).

If only some lines contain the information, or they are not all in this format, you will need to grep the file first to select the lines this is applicable to.
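As a sketch of that prefiltering step, assuming "verifying pingback" is the string that marks the relevant lines (an assumption based on the sample log; adjust it to whatever identifies your lines of interest):

```shell
# Two sample lines; only the first is a pingback-verification entry:
printf '%s\n' \
  '4.3.2.1 - - [22/Sep/2016:14:27:18 -0500] "GET / HTTP/1.0" 301 184 "-" "WordPress/4.5.4; http://my.example.com; verifying pingback from 127.0.0.1"' \
  '9.9.9.9 - - [22/Sep/2016:14:27:19 -0500] "GET /robots.txt HTTP/1.0" 200 64 "-" "SomeBot/1.0"' > access.log

# Keep only the matching lines, then take field 13 (the URL) as above
# and deduplicate. Note the URL still carries its trailing ";":
grep 'verifying pingback' access.log | awk '{print $13}' | sort -u
```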

  • It does not produce the desired output. Commented Dec 25, 2016 at 9:47
  • Hopefully it works now; I hadn't noticed the mix of tabs and spaces. Commented Dec 25, 2016 at 10:02
