
I have a very long JSON string in a DataFrame column and need to extract all the numbers from it, meaning only values made up entirely of digits. Identifiers such as AW7S23211 and 7P0145 at the end should not be included.

sample data:

id  rate
1   {"mileage": "42331", "pricing": [{"fees_tax_cents": 700, "start_fee_cents": 203159, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 75500}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 17776, "dealer_reserve_cents": 0, "monthly_payment_cents": 29033, "non_taxable_fees_cents": 78400, "expected_annual_mileage": 10000, "monthly_tax_payment_cents": 2540, "total_drive_off_tax_cents": 21017, "total_drive_off_cost_cents": 318592, "micro_ownership_premium_cents": 203159, "cost_per_additional_mile_cents": 13, "start_fee_without_cpo_premium_cents": 203159}, {"fees_tax_cents": 700, "start_fee_cents": 203159, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 75500}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 17776, "dealer_reserve_cents": 0, "monthly_payment_cents": 34450, "non_taxable_fees_cents": 78400, "expected_annual_mileage": 15000, "monthly_tax_payment_cents": 3014, "total_drive_off_tax_cents": 21491, "total_drive_off_cost_cents": 324009, "micro_ownership_premium_cents": 203159, "cost_per_additional_mile_cents": 13, "start_fee_without_cpo_premium_cents": 203159}], "stock_number": "AW7S23211"}
2   {"mileage": "3343", "pricing": [{"fees_tax_cents": 700, "start_fee_cents": 766343, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 0}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 67055, "dealer_reserve_cents": 0, "monthly_payment_cents": 101106, "non_taxable_fees_cents": 2900, "expected_annual_mileage": 12500, "monthly_tax_payment_cents": 8847, "total_drive_off_tax_cents": 76602, "total_drive_off_cost_cents": 878349, "micro_ownership_premium_cents": 766343, "cost_per_additional_mile_cents": 46, "start_fee_without_cpo_premium_cents": 766343}, {"fees_tax_cents": 700, "start_fee_cents": 766343, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 0}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 67055, "dealer_reserve_cents": 0, "monthly_payment_cents": 89436, "non_taxable_fees_cents": 2900, "expected_annual_mileage": 7500, "monthly_tax_payment_cents": 7826, "total_drive_off_tax_cents": 75581, "total_drive_off_cost_cents": 866679, "micro_ownership_premium_cents": 766343, "cost_per_additional_mile_cents": 46, "start_fee_without_cpo_premium_cents": 766343}], "stock_number": "7P0145"}

expected output

id   rate   
1    42331 700 203159 2900 75500 ......
2    3343  700 766343 2900 0 ......

The code below only works for a simple string, not for this very long one. Please advise.

import pandas as pd
df= pd.read_csv('C:/Users/Desktop/items.csv')
df=pd.DataFrame(df)
from ast import literal_eval
df['rate'] = df['rate'].apply(literal_eval)
s=df.rate.apply(pd.Series).set_index('id').stack().apply(pd.Series)

If I use the regex approach instead, I get "error: look-behind requires fixed-width pattern". Why?

import re
import pandas as pd
df= pd.read_csv('C:/Users/Desktop/items.csv')
p = re.compile(r'(?<=\s+|")\d+(?!\w+)')
df.rate.apply(lambda x: re.findall(p, x))

2 Answers


Use a recursive generator to walk the nested dictionary object.

import json
from itertools import chain

def gnum(d):
    if str(d).isdigit():
        yield int(d)
    elif isinstance(d, dict):
        for i in chain(*map(gnum, d.values())):
            yield i
    elif isinstance(d, list):
        for i in chain(*map(gnum, d)):
            yield i

df.assign(rate=df.rate.apply(lambda x: list(gnum(json.loads(x)))))

   id                                               rate
0   1  [42331, 700, 203159, 2900, 75500, 0, 8000, 177...
1   2  [3343, 700, 766343, 2900, 0, 0, 8000, 67055, 0...
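A self-contained variant of the generator above, as a rough sketch: it uses `yield from` instead of `chain`, and joins the numbers into a space-separated string to match the expected output in the question. The shortened `sample` row and the helper name `numbers_as_text` are made up for illustration and are not part of the original answer.

```python
import json


def gnum(d):
    """Recursively yield every all-digit value found in nested dicts and lists."""
    if isinstance(d, dict):
        for value in d.values():
            yield from gnum(value)
    elif isinstance(d, list):
        for item in d:
            yield from gnum(item)
    elif str(d).isdigit():
        # Covers non-negative ints (700) and digit-only strings ("42331").
        yield int(d)


def numbers_as_text(raw):
    """Hypothetical helper: parse one JSON string, return its numbers joined by spaces."""
    return " ".join(str(n) for n in gnum(json.loads(raw)))


# Shortened, made-up row in the same shape as the question's data.
sample = (
    '{"mileage": "42331", "pricing": [{"fees_tax_cents": 700, '
    '"non_taxable_fees": [{"name": "Smog", "value_cents": 2900}]}], '
    '"stock_number": "AW7S23211"}'
)

print(numbers_as_text(sample))  # 42331 700 2900
```

On the question's DataFrame this could presumably be applied as `df['rate'] = df['rate'].apply(numbers_as_text)`.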

1 Comment

It would probably be better to use json.loads over literal_eval; this looks like JSON-encoded data.
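To illustrate the point in this comment, here is a small sketch of where the two parsers differ. JSON literals such as `true` and `null` are not Python literals, so `literal_eval` rejects them while `json.loads` converts them. The keys `certified` and `notes` are invented for the example; the question's sample rows happen not to contain such values.

```python
import json
from ast import literal_eval

# Made-up row containing JSON-only literals (true, null).
raw = '{"mileage": "42331", "certified": true, "notes": null}'

try:
    literal_eval(raw)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    # literal_eval only accepts Python literals, and `true` / `null` are not.
    literal_eval_ok = False

parsed = json.loads(raw)  # JSON true -> True, null -> None

print(literal_eval_ok)
print(parsed)
```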

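Treat the JSON as a string and use the regex '(?<=\s|")\d+(?!\w+)' to extract all numbers. The look-behind has to be fixed-width, so it uses \s rather than \s+.

import re
p = re.compile(r'(?<=\s|")\d+(?!\w+)')
df.rate.apply(lambda x: re.findall(p, x))

This finds all runs of digits, excluding those embedded in tokens such as AW7S23211, 1237P or 1234ABD342. Note that for a decimal such as 123.23, the integer part 123 still matches when it follows a space or a quote; only the fractional part is skipped. The result is a list of digit strings for each row of the series df.rate.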
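The "look-behind requires fixed-width pattern" error quoted in the question can be reproduced with a short sketch. Python's built-in `re` module requires every alternative inside `(?<=...)` to match a fixed number of characters, and `\s+` can match one or more. The `sample` string below is a shortened, made-up row, not the question's full data.

```python
import re

# Variable-width look-behind: \s+ may match one or many characters,
# so compiling this pattern is expected to raise re.error.
try:
    re.compile(r'(?<=\s+|")\d+(?!\w+)')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Fixed-width look-behind: both alternatives match exactly one character.
pattern = re.compile(r'(?<=\s|")\d+(?!\w+)')

# Shortened, made-up row for illustration.
sample = (
    '{"mileage": "42331", "fees_tax_cents": 700, '
    '"stock_number": "AW7S23211", "other": "7P0145"}'
)

print(pattern.findall(sample))  # ['42331', '700']
```

The matches come back as strings; `[int(m) for m in pattern.findall(sample)]` would convert them if integers are needed.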

3 Comments

Hi Haleemur, please see my updated question with your regex code. Why didn't it work?
You get an error because you've added a space between r and '(?.... There shouldn't be a space. Also, are you trying to apply a regex after loading the data as JSON? That approach is bound to fail.
Hi Haleemur, I revised it and now get a new error: "look-behind requires fixed-width pattern". Please advise.
