1

I'm trying to extract two strings from this string using regular expressions:

'<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

I want the URL after src and the text after alt (that is, the URL and Organic Chemistry I (as Second Language)).

I've tried ('<img src=(\w+)" width'), ('<img src="(\w+)"') and ('src="(\w+)"\swidth') for the URL, and all return empty.

I've also tried ('alt="(\w+)"') for the name and again, no luck.

Can anyone help?

4 Answers

3

Use lxml.

import lxml.html

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

img = lxml.html.fromstring(html_string)

print("src:", img.get("src"))
print("alt:", img.get("alt"))

Gives:

src: http://images.efollett.com/books/978/047/012/9780470129296.gif
alt: Organic Chemistry I (as Second Language)

2

Although you should not be parsing HTML with regexes, I can point out a common error in your regexes, which is the use of \w. It only matches word characters (letters, digits, and underscores), so it stops at the colon, slashes, and dots in the URL and at the spaces and parentheses in the alt text. If you are trying to pull data out of attributes, use "([^"]*)" or "(.*?)".
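As a minimal sketch of that fix, using only Python's standard re module and the string from the question (the variable names are just for illustration), the \w+ attempt finds nothing while the negated character class captures both values:

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# \w+ cannot get past the ':' after "http", so there is no match at all.
failed = re.search(r'src="(\w+)"', html_string)
print(failed)  # None

# [^"]* accepts every character up to the closing double quote.
src = re.search(r'src="([^"]*)"', html_string).group(1)
alt = re.search(r'alt="([^"]*)"', html_string).group(1)

print("src:", src)
print("alt:", alt)
```

This assumes attribute values are always double-quoted and contain no escaped quotes, which holds for the string in the question but not for HTML in general.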

2 Comments

Two questions: first, how else would I extract the information that I want (I'm using Beautiful Soup, and the string above is also available as a BeautifulSoup tag)? Second, what regex can I use to get what I want?
Oh apologies then, I did not know you were using Beautiful Soup, which is an HTML parser! There are hints in this SO question.
1

You can try r'<img[^>]*\ssrc="(.*?)"' and r'<img[^>]*\salt="(.*?)"'.

I don't know if you are dealing with HTML. [^>]* keeps the match inside the tag's angle brackets. \s before the attribute name avoids matching attributes such as "xxxsrc", and it also matches a newline in front of the attribute.
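A rough sketch of how these two patterns behave in Python. The input below is a hypothetical variation of the tag from the question (attributes reordered, split across lines, with an extra data-src attribute) chosen to exercise the [^>]* and \s parts:

```python
import re

# Hypothetical input: "data-src" must not be mistaken for "src",
# and the real attributes sit on separate lines.
html_string = '''<img data-src="lazy.gif"
     alt="Organic Chemistry I (as Second Language)"
     src="http://images.efollett.com/books/978/047/012/9780470129296.gif" />'''

# The two patterns are independent, so attribute order does not matter.
src_match = re.search(r'<img[^>]*\ssrc="(.*?)"', html_string)
alt_match = re.search(r'<img[^>]*\salt="(.*?)"', html_string)

src = src_match.group(1)
alt = alt_match.group(1)

print("src:", src)
print("alt:", alt)
```

Because [^>] also matches newlines, no re.DOTALL flag is needed for the multi-line tag.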

1 Comment

This works but backtracks. Probably okay for small img tags. +1 for correctness.
0

I don't know Python, but maybe this regular expression helps?

<img.*?src="([^"]*)".*?alt="([^"]*)".*?>
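For Python readers, here is one way this expression could be used with the standard re module. The second tag is a made-up example with alt before src, included to show where the single-pattern approach stops working:

```python
import re

pattern = re.compile(r'<img.*?src="([^"]*)".*?alt="([^"]*)".*?>')

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# Both attributes are captured in one pass, in the order they appear.
src, alt = pattern.search(html_string).groups()
print("src:", src)
print("alt:", alt)

# Hypothetical tag with alt first: the pattern requires src before alt.
reordered = '<img alt="Cover" src="cover.gif" />'
reordered_match = pattern.search(reordered)
print(reordered_match)  # None
```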

2 Comments

This works provided the src comes before the alt. Also a tip for efficiency: don't use .* in the middle of a regex. .*? is more appropriate in this case.
Thanks, updated. You're right, this regex only makes sense if the string is as described in the question (alt after the src attribute).
