bash - extract filenames from html file containing multiple links

Question

I have downloaded an HTML file autogenerated by a script on a webpage. The file contains multiple links, including links to images. I am trying to extract the full names of the images. For example, given

<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>

From the above I want to get "Image name.jpg" stored in a file. Since there are hundreds of these, I parse the file and store each name as it comes up with the following command:

grep -i -E -o "target=\"_blank\">([[:graph:]]*)\.(jpg|png|gif|webm)" "$thread" | cut -f 2 -d '>' | sed 's/ /_/g' - > "$names"

where "$thread" is the name of the html file, "$names" is the list of filenames as output. I use "cut" to remove the 'target="_blank">' portion, then convert the spaces to underscores.

Since there are several other links in the file, I specify the extensions to grab (images and webm); everything else should be ignored. I got it to the point where it grabs only these links, but it misses some.

Some file names contain spaces and non-alphanumeric characters. If I use [[:print:]], which should cover all these cases, I get nothing, or I get a bit of the <head> portion of the HTML and nothing else. If I use [[:graph:][:space:]], I also get nothing. If I just use [[:graph:]] (as above) or [[:alnum:][:punct:]], I can get files with alphanumeric and other characters (like "filenamewith(parenthesis).jpg"), but not spaces. The reverse is also true: [[:alnum:][:space:]] works but omits the other printable characters ("file name with spaces.jpg" works but not "with(parenthesis,comma or other.jpg").

Supposedly [[:print:]] covers all cases, but I don't get what I need. If I'm understanding correctly, grep -E -o should only match (per the case above) *.jpg, *.png, *.gif or *.webm.

I have tried grep with and without -E/-o/-e in different variations.

Any ideas? I am using Arch Linux, grep 2.20, bash 4.3.18

glenn jackman · Accepted Answer · 2014-07-11 20:56:37Z

The best strategy would be to use a proper html parser that can spit out the value of all <a> tags.

Here, xmlstarlet is specifically an XML parser, and your HTML may not be well-formed XML, but you might get the idea:

echo '<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>' | xmlstarlet sel -t -v //a

Image name.jpg
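
If xmlstarlet rejects the file because it is not well-formed XML, the same idea can be sketched with the HTML parser in Python's standard library, which is more tolerant of sloppy markup. This is only an illustration of "print the text of every <a> tag"; the class name LinkTextCollector and the helper link_texts are made up for this example.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text content of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # text of each finished <a> element
        self._parts = None   # text chunks of the <a> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.links.append("".join(self._parts).strip())
            self._parts = None


def link_texts(html):
    """Return the text of all <a> elements found in the given HTML string."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '''<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>'''
print(link_texts(sample))  # prints ['Image name.jpg']
```

Filtering the resulting list by extension is then a plain string test instead of a regular expression over raw markup.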

answered Jul 11, 2014 at 20:56

glenn jackman

Community · Accepted Answer · 2017-05-23 12:40:01Z

Your regular expression is

target="_blank">([[:graph:]]*)\.(jpg|png|gif|webm)

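This matches the literal text target="_blank">, followed by any number of non-whitespace characters, followed by one of the four strings .jpg, .png, .gif or .webm. For example, given the following lines:

<a … target="_blank">something.jpg</a>
<a … target="_blank">a.gifted.child.txt</a>
<a … target="_blank">something else.jpg</a>
<a … target="_blank">something.jpg</a>+more.jpg

the grep command would output:

target="_blank">something.jpg
target="_blank">a.gif
target="_blank">something.jpg</a>+more.jpg

The third line produces no output, because the space after "something" ends the run of non-whitespace characters before any extension is reached.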

And if you use [[:print:]] instead of [[:graph:]], then it would match something like

<a … target="_blank">something.jpg</a> wibble wobble <a … target="_blank">something else.jpg</a>

Everything between the first matching target … bit and the last matching extension on the line is a match.

What you need is to exclude HTML markup characters from the match.

target="_blank">[^<>]*\.(jpg|png|gif|webm)</a>
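
The difference between the three character classes can be checked with a small sketch. It uses Python's re module rather than grep, and since re has no POSIX bracket classes, [!-~] and [ -~] stand in as ASCII-only approximations of [[:graph:]] and [[:print:]]. The href values in the sample line are made up for illustration.

```python
import re

# ASCII-only stand-ins for the POSIX classes (an assumption of this sketch).
GRAPH = r"[!-~]"   # roughly [[:graph:]]: visible characters, no space
PRINT = r"[ -~]"   # roughly [[:print:]]: visible characters plus space
EXT = r"\.(?:jpg|png|gif|webm)"
PREFIX = r'target="_blank">'

# Two links on one line, as in the example above (href values are invented).
line = ('<a href="1.jpg" target="_blank">something.jpg</a> wibble wobble '
        '<a href="2.jpg" target="_blank">something else.jpg</a>')


def matches(pattern):
    """Return every non-overlapping match, like grep -o -i would print."""
    return [m.group(0) for m in re.finditer(pattern, line, re.IGNORECASE)]


# Stops at the first space, so the second name is missed entirely.
graph_hits = matches(PREFIX + GRAPH + "*" + EXT)

# Greedy and allowed to cross markup: one match swallowing both links.
print_hits = matches(PREFIX + PRINT + "*" + EXT)

# Markup characters excluded: one clean match per link.
clean_hits = matches(PREFIX + r"[^<>]*" + EXT + "</a>")

for name, hits in [("graph", graph_hits), ("print", print_hits), ("clean", clean_hits)]:
    print(name, len(hits), hits)
```

Only the third pattern yields one match per link, and each match still carries the target="_blank"> prefix and the </a> suffix.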

With GNU grep, you can use the -P option to get constructs from Perl regular expressions, and in particular zero-width assertions that let you specify that something is preceded or followed by some constant text without including that text in the matched portion.

grep -o -P '(?<=target="_blank">)[^<>]*\.(jpg|png|gif|webm)(?=</a>)'
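
The same lookbehind and lookahead can be tried out in Python's re module, which accepts this syntax as long as the lookbehind has a fixed width. The following is a sketch, not a replacement for the grep command; the function name image_names and the sample links are invented for the example.

```python
import re

NAME = re.compile(
    r'(?<=target="_blank">)'          # must be preceded by the end of the opening tag
    r'[^<>]*\.(?:jpg|png|gif|webm)'   # the name itself, without markup characters
    r'(?=</a>)',                      # must be followed by the closing tag
    re.IGNORECASE,                    # same effect as grep -i
)


def image_names(html):
    """Return only the link texts, without the surrounding markup."""
    return NAME.findall(html)


# Sample input; the second and third links are made up for illustration.
sample = (
    '<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>\n'
    '<a href="1.png" target="_blank">with(parenthesis,comma or other.PNG</a>\n'
    '<a href="notes.txt" target="_blank">notes.txt</a>'
)

names = image_names(sample)
print(names)

# Same post-processing as the sed step in the question: spaces to underscores.
print([n.replace(" ", "_") for n in names])
```

Because the assertions are zero-width, the matched text is the bare file name, so the cut step from the question is no longer needed.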

This can still fail if there is unexpected whitespace (like a newline between the <a> tag and the closing </a>, or ). You would do better to use a proper HTML parser.

For example, in Python with BeautifulSoup (untested):

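import re
import sys

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

soup = BeautifulSoup(sys.stdin, 'html.parser')
for hit in soup.find_all('a', target='_blank'):
    # hit.string is None when the link wraps nested tags, so check it first
    if hit.string and re.search(r'\.(jpg|png|gif|webm)\Z', hit.string):
        print(hit.string)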

Similar code can be written with HTML::Parser in Perl, Nokogiri in Ruby, etc.

Anthon · Accepted Answer · 2016-04-15 14:26:15Z

I ended up doing this:

w3m -dump -T text/html "$thread" | grep -i -E -o 'File\:+([[:print:]]*)\.(jpg|png|webm|gif)'

w3m cleans the code and then I can grep for the file names. (I need the literal "File:" part to distinguish a linked file from its title). I do need [[:print:]] because it catches most whitespace, unicode chars and other printables.

This works as I intended (though I still have to figure out how to prevent overwriting files with the same name, but that's another day's battle).

edited Apr 15, 2016 at 14:26

Anthon

answered Jul 17, 2014 at 18:13

CLos
