
I need to extract the first number from a string, but I don't know the exact format of the number.

The number could be in any of the following formats: 1.224 (a decimal), 3,455,000 (a number with an unknown number of commas), 45% (a percentage), or just an integer such as 5.

The input would be something like blah blah $ 2,400 or blah blah 45% or blah blah $1.23 or blah blah 7.

It would be interesting if it were intelligent enough to handle word numbers too, like blah blah seven.

I don't need the dollar sign, just the number.

What is the "first number" (the first digit or the entire number)? Should the commas or the decimals also be extracted? Commented Jun 30, 2018 at 5:28

4 Answers


To extract the first number from a string in any of these formats, you could use re.findall() and take the first match:

import re

strings = ['45% blah 43%', '1.224 blah 3.2', '3,455,000 blah 4,3', '$1.2 blah blah $ 2,400', '3 blah blah 7']

for string in strings:
    first_match = re.findall(r'[0-9$,.%]+\d*', string)[0]
    print(first_match)

Which outputs:

45%
1.224
3,455,000
$1.2
3

1 Comment

Can you please modify your answer to include the treatment of negative numbers?
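One possible way to cover negative numbers is to switch from a character class to a structured pattern with an optional leading minus sign. This is only a sketch, not the answerer's code: the pattern and the name `first_signed_number` are assumptions, and a hyphen directly before a digit (as in "pre-2018") would be read as a minus sign.

```python
import re

# Assumed pattern: optional "-", optional "$", digits with optional
# comma-separated groups, optional decimal part, optional "%".
SIGNED_NUMBER_RE = re.compile(r'-?\$?\d+(?:,\d{3})*(?:\.\d+)?%?')


def first_signed_number(text):
    """Return the first number-like token in text, or None if there is none."""
    match = SIGNED_NUMBER_RE.search(text)
    return match.group(0) if match else None


for s in ['blah -1.224 blah', '45% blah 43%', 'blah -$2,400 blah']:
    print(first_signed_number(s))
```

Using re.search() instead of re.findall()[0] also avoids an IndexError when the string contains no number at all.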

While this problem has many cases, here is a solution which solves most of them using some regex and the re module:

import re

def extractVal(s):
    return re.sub(r'^[^0-9$\-]*| .*$', '', s)

(1) It removes all leading characters that are not a digit (0-9), $, or -

(2) It removes everything from the first remaining space to the end of the string (after (1))

Here's some data in action:

>>> data = ['blah $50,000 10', 'blah -1.224 blah', 'blah 3,455,000 blah', 'blah 45% 10 10 blah', '5 6 4']
>>> print(list(map(extractVal,data)))
['$50,000', '-1.224', '3,455,000', '45%', '5']

This solution assumes that the first number ends in a space.

As others have suggested, we can go further and convert these strings into numbers:

def valToInt(s):
    if '%' in s:
        a = float(s[:-1])/100
    else:
        a =  float(re.sub(r'[,$]','',s))
    return int(a) if a == int(a) else a

This results in (using the map() function again):

[50000, -1.224, 3455000, 0.45, 5]

7 Comments

One last thing... the text for a percentage sometimes looks like blah blah 45 % blah instead of blah blah 45% blah, with an extra space between the number and the percent sign, so it's giving me 45 instead of 45%. Is this easy to fix?
You could simply call str.replace(' %', '%') at the very beginning.
Oh... this might be tough... what about blahh (2)% blah for an output of -0.02?
Oh, this also only works for positive numbers currently... You could replace the parentheses with an empty string.
The solution has been updated to work with negative numbers. You can remove parentheses with re.sub(r'\(|\)', '', s)
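Pulling the comment thread together, the two extra cases ("45 %" with a space, and "(2)%" meaning a negative value) could be handled with a normalisation step before the extraction. The sketch below is an assumption-laden variant rather than the answer's final code: `normalize`, `extract_val` and `to_number` are hypothetical names, and treating "(2)" as -2 assumes accounting-style notation, which is what the requested output of -0.02 implies.

```python
import re


def normalize(s):
    # "45 %" -> "45%", as suggested in the comments
    s = s.replace(' %', '%')
    # Assumption: "(2)" is an accounting-style negative, so rewrite it as "-2"
    return re.sub(r'\((\d[\d,.]*)\)', r'-\1', s)


def extract_val(s):
    # Same regex as the answer, applied to the normalised string
    return re.sub(r'^[^0-9$\-]*| .*$', '', normalize(s))


def to_number(s):
    is_percent = s.endswith('%')
    value = float(re.sub(r'[,$%]', '', s))
    if is_percent:
        value /= 100
    return int(value) if value.is_integer() else value


for text in ['blah blah 45 % blah', 'blahh (2)% blah', 'blah $50,000 10']:
    print(to_number(extract_val(text)))
```

Like the original, this still assumes the number ends at a space, and to_number() raises ValueError if the string contains no number.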

If you insist on a regex, then this should work (limited to the cases you mentioned):

import re

rgx = re.compile(r'\d+(,|\.)?\d*')
assert rgx.search("blah blah $ 2,400")
assert rgx.search("blah blah 45%")
assert rgx.search("blah blah $1.23")
assert rgx.search("blah blah 7")

As for blah blah seven, I do not think a regex would cut it (at least not for anything more complex than a single digit).
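For a small, fixed vocabulary, a lookup table combined with a regex can go a little further than single digits. The following is a rough sketch under that assumption: it only knows the words zero through nineteen, the names `WORD_VALUES` and `first_number_or_word` are made up for illustration, and compound forms such as "twenty-one" would need a real parser.

```python
import re

# Hypothetical vocabulary: number words from zero to nineteen
WORD_VALUES = {
    word: value
    for value, word in enumerate(
        "zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen".split()
    )
}
WORD_RE = re.compile(r'\b(?:' + '|'.join(WORD_VALUES) + r')\b', re.IGNORECASE)
DIGIT_RE = re.compile(r'\d+(?:,\d{3})*(?:\.\d+)?')


def first_number_or_word(text):
    """Return whichever comes first in text: a numeral or a known number word."""
    digit_match = DIGIT_RE.search(text)
    word_match = WORD_RE.search(text)
    candidates = [m for m in (digit_match, word_match) if m]
    if not candidates:
        return None
    first = min(candidates, key=lambda m: m.start())
    if first is word_match:
        return WORD_VALUES[first.group(0).lower()]
    return float(first.group(0).replace(',', ''))


print(first_number_or_word("blah blah seven"))
print(first_number_or_word("blah blah $ 2,400"))
```

The word boundaries (`\b`) keep "one" from matching inside "someone" and let "seventeen" win over "seven".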


Assuming you want an actual number, and that percents should be converted to a decimal:

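import re

str_ = "blah blah $ 2,400"
match = re.search(r"([0-9,.]+)\s*(%?)", str_)
# re.search returns None when nothing matches, so guard before calling .groups()
number, is_percent = match.groups() if match else (None, None)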
if number is not None:
    number = float(number.replace(",", ""))
    if is_percent:
        number /= 100
