4

I have an HTML document that looks (when oversimplified) like this:

<html>
  <body>
    <a href="...">...</a>
    <a href="...">...</a>
    <a href="...">...</a>
    ...
  </body>
</html>

I'd like to extract the URLs as line-delimited output. Enter xmllint:

$ xmllint --html --xpath //a/@href
href="..." href="..." href="..."

It returns each whole attribute, name included, and prints them space-delimited. How can I get just the values of the href attributes, one per line? I want output like this:

...
...
...

where ... is the URL found in the href attribute of each a element.

How can I format this output properly?

  • Does this help? Commented Jul 31, 2015 at 4:28
  • don't do that thing in that link! Commented Jan 15, 2016 at 19:19

2 Answers

1

Given input.html:

<html>
  <body>
    <a href="url1">link text 1</a>
    <a href="url2">link text 2</a>
    <a href="url3">link text 3</a>
    ...
  </body>
</html>

We can use a Unix pipe to send xmllint's existing output to sed and get this result:

$ xmllint --html --xpath //a/@href input.html | sed 's/ href="\([^"]*\)"/\1\n/g'
url1
url2
url3

Explanation

With xmllint alone, we only get:

$ xmllint --html --xpath //a/@href input.html
 href="url1" href="url2" href="url3"%
  • the trailing % is printed by the shell (zsh does this) to mark output that does not end with a newline; it is not part of xmllint's output

One benefit of Unix-like systems is Doug McIlroy's pipes feature: no single program has to do everything, and we are in fact encouraged to combine programs to suit our needs.

So, finding xmllint's output unsatisfactory, we pipe it into our sed command, which:

  • searches for each  href="URL" unit, including the leading space
  • uses \( \) grouping to capture the URL part
  • replaces the whole match with \1\n, that is, the captured URL (\1) followed by a newline

In this way we combine xmllint and sed to obtain the desired line-delimited output, one URL per line.
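
As a hedged aside: the \n in the sed replacement is understood by GNU sed, but other sed implementations (BSD/macOS, for example) may print a literal n instead. A minimal sketch of the same pipe idea without that dependency is below. The helper name hrefs_to_lines is hypothetical, and the printf line only simulates xmllint's output so the sketch runs without xmllint installed.

```shell
# hrefs_to_lines: hypothetical helper, not part of xmllint or sed.
# Reads xmllint-style ` href="..." href="..."` text on stdin and
# prints one attribute value per line.
hrefs_to_lines() {
  # grep -o emits each href="..." match on its own line;
  # cut then keeps the text between the first pair of double quotes.
  grep -o 'href="[^"]*"' | cut -d'"' -f2
}

# Simulated xmllint output (assumption: same shape as shown above,
# with no trailing newline).
printf ' href="url1" href="url2" href="url3"' | hrefs_to_lines
```

With the real tool this would be used as xmllint --html --xpath //a/@href input.html | hrefs_to_lines. It assumes double-quoted attribute values, which is the form xmllint printed above.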

0

Have you considered using sed on the HTML directly?

sed -n 's/.*href="\([^"]*\).*/\1/p'
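
A small sketch of how this command behaves when one input line holds several links, followed by a more conservative variant. The sample string and the helper name links_per_line are made up for illustration. The variant is still regex-based rather than a real HTML parser, and it assumes double-quoted href values and a tag that does not span lines.

```shell
# Made-up sample: two links on a single input line.
sample='<a href="u1">x</a> <a href="u2">y</a>'

# The leading .* is greedy, so it swallows everything up to the last
# href=" on the line; only the final URL of that line is printed.
printf '%s\n' "$sample" | sed -n 's/.*href="\([^"]*\).*/\1/p'
# prints: u2

# links_per_line: hypothetical helper. grep -o isolates each
# <a ... href="..." match first, then sed strips everything except
# the attribute value.
links_per_line() {
  grep -o '<a [^>]*href="[^"]*"' | sed 's/.*href="//; s/"$//'
}

printf '%s\n' "$sample" | links_per_line
# prints: u1
#         u2
```

Anchoring the match on <a also avoids picking up href=" text that happens to appear inside another element's attribute.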

  • that gets only one per line - and i mean input lines not output lines. and so you lose <a href="http...."> </a> <a href="http...."> the first. it also gets <div text_attribute="bla...href="... all of this ... " Commented Jan 15, 2016 at 19:13
  • You are right. I used the sample from above which has each href on separate lines. @mikeserv Commented Jan 15, 2016 at 19:19
  • you can do it - you just need to be more conservative. the .* is usually an issue for these kinds of things. Commented Jan 15, 2016 at 19:20
  • this might help as an example. and there's this. Commented Jan 15, 2016 at 19:27
  • no problem. let me know if you work it out so i can upvote a worthy edit. Commented Jan 15, 2016 at 19:38
