PM 2Ring

As I said in my comment, it's generally not a good idea to parse HTML with Regular Expressions, but you can sometimes get away with it if the HTML you're parsing is well-behaved.

In order to get only the URLs that are in the href attribute of <a> elements, I find it easiest to do it in multiple stages. From your comments, it looks like you only want the scheme and domain, not the full URL. In that case you can use something like this:

grep -Eoi '<a [^>]+>' source.html |
grep -Eo 'href="[^\"]+"' | 
grep -Eo '(http|https)://[^/"]+'

where source.html is the file containing the HTML code to parse.
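To see how the three stages compose, here's each one run on a single hypothetical line of HTML (the tag attributes and URL are made up for illustration):

```shell
line='<p><a class="x" href="https://example.com/page?q=1">link</a></p>'

# Stage 1: pull out just the opening <a ...> tags.
echo "$line" | grep -Eoi '<a [^>]+>'
# <a class="x" href="https://example.com/page?q=1">

# Stage 2: pull out just the href="..." attribute.
echo "$line" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"'
# href="https://example.com/page?q=1"

# Stage 3: pull out just the scheme and domain, stopping at the first / or ".
echo "$line" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep -Eo '(http|https)://[^/"]+'
# https://example.com
```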

This code will print the scheme and domain of every URL that occurs as the href attribute of an <a> element, one per line; note that each <a> tag has to fit on a single line, since grep matches line by line. The -i option to the first grep command is to ensure that it will work on both <a> and <A> elements. I guess you could also give -i to the 2nd and 3rd greps to capture upper-case HREF attributes and HTTP:// schemes, OTOH, I'd prefer to ignore such broken HTML. :)
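Here's a small demo of that case-sensitivity behaviour on a hand-made file (the file name and its contents are hypothetical, just for illustration):

```shell
# One well-formed lowercase link and one shouty upper-case one.
cat > sample.html <<'EOF'
<p>See <a href="https://example.com/page">here</a> and
<A HREF="HTTP://EXAMPLE.ORG/other">there</A>.</p>
EOF

grep -Eoi '<a [^>]+>' sample.html |
grep -Eo 'href="[^\"]+"' |
grep -Eo '(http|https)://[^/"]+'
# https://example.com
```

The first grep's -i accepts the <A ...> tag, but the case-sensitive second stage then drops its HREF attribute, so only https://example.com survives. Note that adding -i to the 2nd grep alone wouldn't be enough to keep the upper-case link: the 3rd grep's (http|https) is also case-sensitive, so it would need -i as well.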

To process the contents of http://google.com/

wget -qO- http://google.com/ |
grep -Eoi '<a [^>]+>' | 
grep -Eo 'href="[^\"]+"' | 
grep -Eo '(http|https)://[^/"]+'

Output:

http://www.google.com.au
http://maps.google.com.au
https://play.google.com
http://www.youtube.com
http://news.google.com.au
https://mail.google.com
https://drive.google.com
http://www.google.com.au
http://www.google.com.au
https://accounts.google.com
http://www.google.com.au
https://www.google.com
https://plus.google.com
http://www.google.com.au

My output is a little different from the other examples as I get redirected to the Australian Google page.
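If the repeated domains bother you, you can squash them by appending sort -u to the pipeline. Here's a sketch on a small hypothetical file with two links on the same host:

```shell
# Two links pointing at the same domain (made-up file, for illustration).
printf '%s\n' '<a href="http://www.google.com.au/maps">Maps</a>' \
              '<a href="http://www.google.com.au/news">News</a>' > dupes.html

grep -Eoi '<a [^>]+>' dupes.html |
grep -Eo 'href="[^\"]+"' |
grep -Eo '(http|https)://[^/"]+' |
sort -u
# http://www.google.com.au
```

The same `| sort -u` tacked onto the end of the wget pipeline above would collapse the repeated www.google.com.au lines into one.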
