Strip attribute name from result set?

Question

I have an HTML document that looks (when oversimplified) like this:

<html>
  <body>
    <a href="...">...</a>
    <a href="...">...</a>
    <a href="...">...</a>
    ...
  </body>
</html>

I'd like to extract the URLs as line-delimited output. Enter xmllint:

$ xmllint --html --xpath //a/@href
href="..." href="..." href="..."

It returns each whole attribute, name included, and prints them space-delimited. How can I get just the values of the href attributes, one per line? I want output like this:

...
...
...

where ... is the URL found in the href attribute of each a element.

How can I format this output properly?

Does this help?
– eyoung100, Commented Jul 31, 2015 at 4:28

don't do that thing in that link!
– mikeserv, Commented Jan 15, 2016 at 19:19

clarity123 · Accepted Answer · 2016-01-15 22:51:23Z

Given input.html:

<html>
  <body>
    <a href="url1">link text 1</a>
    <a href="url2">link text 2</a>
    <a href="url3">link text 3</a>
    ...
  </body>
</html>

We can pipe xmllint's existing output to sed and get this result:

$ xmllint --html --xpath //a/@href input.html | sed 's/ href="\([^"]*\)"/\1\n/g'
url1
url2
url3

Explanation

With xmllint alone, we only get:

$ xmllint --html --xpath //a/@href input.html
 href="url1" href="url2" href="url3"%

The trailing % is the shell's marker (zsh prints it) that the output did not end with a newline.

One of the benefits of Unix-like systems is Doug McIlroy's pipes feature: one program doesn't have to try to do everything, and we are in fact encouraged to combine programs to suit our needs.

So, finding xmllint's output unsatisfactory, we pipe it into our sed command, which:

searches for each  href="URL" unit (including the space in front of it),
uses \( \) grouping to capture the URL part,
and replaces the whole unit with \1\n, where \1 refers to the captured URL and \n adds a newline after it.

In this way we combine xmllint and sed to obtain the desired line-delimited output, one URL per line.
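
As a minimal sketch of the substitution step on its own: the helper name hrefs_to_lines is hypothetical, and the printf line only simulates the xmllint output shown above so the example runs without xmllint or input.html. It also assumes you may not be on GNU sed. GNU sed accepts \n in the replacement, but other sed implementations may not, so this variant ends the replacement with a backslash followed by a literal newline instead.

```shell
# Hypothetical helper: reads xmllint-style attribute output on stdin
# and prints each href value on its own line.
hrefs_to_lines() {
  # The replacement is \1 followed by backslash + a literal newline,
  # which stands in for the \n used in the answer's command.
  sed 's/ href="\([^"]*\)"/\1\
/g'
}

# Simulated xmllint output (no trailing newline), standing in for:
#   xmllint --html --xpath //a/@href input.html
printf ' href="url1" href="url2" href="url3"' | hrefs_to_lines
```

With the real file, the equivalent would be to pipe the xmllint command from the answer into hrefs_to_lines.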

cesar · Accepted Answer · 2016-01-15 19:08:26Z

Have you considered using sed?

sed -n 's/.*href="\([^"]*\).*/\1/p'


That gets only one per line, and I mean input lines, not output lines. So with <a href="http...."> </a> <a href="http...."> on one line you lose the first. It also gets <div text_attribute="bla...href="... all of this ... "
– mikeserv, Commented Jan 15, 2016 at 19:13

You are right. I used the sample from above, which has each href on a separate line. @mikeserv
– cesar, Commented Jan 15, 2016 at 19:19

You can do it; you just need to be more conservative. The .* is usually an issue for these kinds of things.
– mikeserv, Commented Jan 15, 2016 at 19:20

This might help as an example. And there's this.
– mikeserv, Commented Jan 15, 2016 at 19:27

No problem. Let me know if you work it out so I can upvote a worthy edit.
– mikeserv, Commented Jan 15, 2016 at 19:38
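
One possible way to be "more conservative" than .*, as the comments above suggest, is sketched below. This is not anyone's posted solution. It assumes a grep that supports -o (GNU and BSD grep do, though POSIX does not require it), double-quoted lowercase href attributes, and a hypothetical helper name a_hrefs. It only accepts href when it appears inside an <a ...> tag, and it emits every match on a line rather than one per input line.

```shell
# Hypothetical helper: prints the href value of every <a ...> tag on stdin,
# one per line, even when several anchors share one input line.
a_hrefs() {
  # grep -o prints each match separately; [^>]* keeps the match inside one tag.
  grep -o '<a [^>]*href="[^"]*"' |
    # Strip everything up to href=" and the closing quote.
    sed 's/.*href="//; s/"$//'
}

# Sample input: two anchors on one line, a non-anchor line, then one more anchor.
printf '%s\n' \
  '<a href="url1">one</a> <a class="c" href="url2">two</a>' \
  '<p>plain text mentioning href="nope"</p>' \
  '<a href="url3">three</a>' | a_hrefs
```

Regular expressions remain fragile against real-world HTML (single quotes, uppercase tags, attributes split across lines), so parsing with xmllint first is likely the sturdier route.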
