I would like to extract a number from a large HTML file with Python. My idea was to use a regex like this:

import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    found = ''

found

Unfortunately, I'm not used to regex and I can't adapt this example to extract 0,54125 from:

(...)<div class="vk_ans vk_bk">0,54125 count id</div>(...)

Is there another way to extract the number, or could someone help me with the regex?

  • 2
    Extract the contents of the tag you need with BeautifulSoup and then just split the string and get Item #0. Commented Apr 27, 2018 at 9:15
  • 1
    Do not use regex for HTML parsing: there are enough tools more suitable for this purpose, e.g. BeautifulSoup, lxml.html... Commented Apr 27, 2018 at 9:23
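
As a rough sketch of the parser-based approach the comments suggest: the example below uses the standard library's html.parser rather than BeautifulSoup or lxml, only so that it runs without extra packages. The class name DivTextExtractor and the choice to match on the vk_ans class are assumptions made for illustration, not something taken from the question.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0    # > 0 while inside a matching <div>
        self.texts = []   # one entry per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth or self.wanted in classes:
            if not self.depth:
                self.texts.append('')
            self.depth += 1   # also track nested <div>s

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


html = '(...)<div class="vk_ans vk_bk">0,54125 count id</div>(...)'
parser = DivTextExtractor('vk_ans')
parser.feed(html)
parser.close()

# "Split the string and get item #0", as the first comment puts it
number_text = parser.texts[0].split()[0]
# Assumes the comma is a decimal separator
value = float(number_text.replace(',', '.'))
print(number_text, value)
```

With BeautifulSoup the same idea would be shorter, but the principle is the same: let a parser find the tag, then handle the text with ordinary string methods.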

2 Answers

1

If you want the output to be 0,54125 (i.e. a match for \d+,\d+), you need to anchor the pattern to some context around the number.

From the following input,

 (...)<div class="vk_ans vk_bk">0,54125 count id</div>(...)

To extract 0,54125, you can try a regex such as

(?<=\>)\d+,\d+

or, more specifically,

(?<=\<div class=\"vk_ans vk_bk\"\>)\d+,\d+
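
For illustration, here is a minimal sketch of how such a lookbehind pattern could be used from Python. It assumes the div's attributes appear exactly as in the sample; the variable names are arbitrary. The backslashes before <, > and " are not required in Python's re module, so they are left out here.

```python
import re

html = '(...)<div class="vk_ans vk_bk">0,54125 count id</div>(...)'

# The lookbehind pins the match to the position right after the opening tag.
# Python requires lookbehinds to be fixed-width, which a literal string is.
pattern = re.compile(r'(?<=<div class="vk_ans vk_bk">)\d+,\d+')

match = pattern.search(html)
# The number is the whole match, so no capture group is needed
found = match.group(0) if match else ''
print(found)
```

Testing the match object directly avoids the try/except AttributeError from the question, since re.search returns None when nothing matches.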


1

You can replace some characters in your text before searching it. For example, to capture numbers like 12,34 you can do this:

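import re

text = 'gfgfdAAA12,34ZZZuijjk'
try:
    # Drop the comma so the digits form one uninterrupted run
    text = text.replace(',', '')
    found = re.search(r'AAA(\d+)ZZZ', text).group(1)
except AttributeError:
    # re.search returned None, i.e. no match
    found = ''

print(found)
# 1234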

If you need to capture the digits inside a line, you can make your pattern more general, like this:

text = '<div class="vk_ans vk_bk">0,54125 count id</div>'
text = text.replace(',', '')
# The first run of digits in the line
found = re.search(r'(\d+)', text).group(1)

print(found)
# 054125
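
Note that removing the comma also discards the position of the decimal separator, so 0,54125 comes out as 054125. If the numeric value is needed, one possible variation (a sketch, assuming the comma is a decimal comma) is to match the comma as part of the pattern and convert afterwards:

```python
import re

text = '<div class="vk_ans vk_bk">0,54125 count id</div>'

# Capture the integer and fractional digits on either side of the comma
match = re.search(r'(\d+),(\d+)', text)
if match:
    as_text = match.group(0)   # the number as written, comma included
    as_float = float(f'{match.group(1)}.{match.group(2)}')
else:
    as_text, as_float = '', None

print(as_text, as_float)
```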
