Replace strings in a pandas column

Question

Background

I have the following sample df, which contains PHYSICIAN in the Text column followed by the physician's name (all names below are made up):

import pandas as pd

df = pd.DataFrame({'Text': ['PHYSICIAN: Jon J Smith was here today',
                            'And Mary Lisa Rider found here',
                            'Her PHYSICIAN: Jane A Doe is also here',
                            ' She was seen by  PHYSICIAN: Tom Tucker '],
                   'P_ID': [1, 2, 3, 4],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})

# rearrange columns
df = df[['Text', 'N_ID', 'P_ID']]
df

    Text                                      N_ID  P_ID
0   PHYSICIAN: Jon J Smith was here today     A1    1
1   And Mary Lisa Rider found here            A2    2
2   Her PHYSICIAN: Jane A Doe is also here    A3    3
3   She was seen by PHYSICIAN: Tom Tucker     A4    4

Goal

1) Replace the names that follow the word PHYSICIAN (e.g. PHYSICIAN: Jon J Smith) with PHYSICIAN: **BLOCK**

2) Create a new column named Text_Phys

Desired Output

    Text                                      N_ID  P_ID  Text_Phys
0   PHYSICIAN: Jon J Smith was here today     A1    1     PHYSICIAN: **BLOCK** was here today
1   And Mary Lisa Rider found here            A2    2     And Mary Lisa Rider found here
2   Her PHYSICIAN: Jane A Doe is also here    A3    3     Her PHYSICIAN: **BLOCK** is also here
3   She was seen by PHYSICIAN: Tom Tucker     A4    4     She was seen by PHYSICIAN: **BLOCK**

I have tried the following:

1) df['Text_Phys'] = df['Text'].replace(r'ABC.*', 'ABC: ***BLOCK***', regex=True)

2) df['Text_Phys'] = df['Text'].replace(r'ABC\s+', 'ABC: ***BLOCK***', regex=True)

But neither of them quite works.

Question

How do I achieve my desired output?

It should work as df['Text'] = df['Text'].replace(r'PHYSICIAN', 'PHYSICIAN: ***PHI***', regex=True) and df['Text'] = df['Text'].replace(r'Physician', 'Physician: ***PHI***', regex=True)
– Karn Kumar, Commented Jul 15, 2019 at 2:06
How about import re, then df['Text_Phys'] = df['Text'].str.replace('PHYSICIAN', 'PHYSICIAN: ***PHI***', flags=re.I)? It will convert the matched word to upper case, though. The earlier version works fine for me, however. What version of pandas are you using?
– Karn Kumar, Commented Jul 15, 2019 at 2:12
How do you identify which part of the text is the physician's name?
– Andy L., Commented Jul 15, 2019 at 2:13
It is easy to get the substring after PHYSICIAN: . However, it is almost impossible to identify Jon J Smith, Jane A Doe, and Tom Tucker within that substring. How do you know they are the names to replace unless you have some rules to identify them?
– Andy L., Commented Jul 15, 2019 at 2:25
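
To make the point about needing a rule concrete, here is a minimal sketch of one possible rule. It assumes (this is not stated in the question) that a name is the run of capitalized words directly after "PHYSICIAN:". The pattern name NAME_AFTER_KEYWORD and the helper block_capitalized_name are made up for illustration.

```python
import re

# Assumed rule (not from the question): the name is the run of capitalized
# words that immediately follows "PHYSICIAN:".
NAME_AFTER_KEYWORD = re.compile(
    r"PHYSICIAN:\s*[A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*"
)


def block_capitalized_name(text: str) -> str:
    """Replace the capitalized words after 'PHYSICIAN:' with **BLOCK**."""
    return NAME_AFTER_KEYWORD.sub("PHYSICIAN: **BLOCK**", text)


samples = [
    "PHYSICIAN: Jon J Smith was here today",
    "And Mary Lisa Rider found here",
    "Her PHYSICIAN: Jane A Doe is also here",
    " She was seen by  PHYSICIAN: Tom Tucker ",
]

for sample in samples:
    # Rows without the keyword are returned unchanged.
    print(repr(block_capitalized_name(sample)))
```

On the question's frame this could be applied with df['Text_Phys'] = df['Text'].map(block_capitalized_name), which leaves Text untouched. The rule is fragile: if the sentence continues with a capitalized word right after the name, that word is blocked as well.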

SFC · Accepted Answer · 2019-10-11 20:53:18Z

Try this: use a regex to define the word you want to match and where you want the match to stop. For the sake of time I hard-coded the stop words as "found |was |is "; to automate this further, you could generate a list of all the words that can follow a name.

Code below:

import pandas as pd

df = pd.DataFrame({'Text': ['PHYSICIAN: Jon J Smith was here today',
                            'And his Physician: Mary Lisa Rider found here',
                            'Her PHYSICIAN: Jane A Doe is also here',
                            ' She was seen by  PHYSICIAN: Tom Tucker '],
                   'P_ID': [1, 2, 3, 4],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})

df = df[['Text', 'N_ID', 'P_ID']]
df
    Text                                             N_ID  P_ID
0   PHYSICIAN: Jon J Smith was here today            A1    1
1   And his Physician: Mary Lisa Rider found here    A2    2
2   Her PHYSICIAN: Jane A Doe is also here           A3    3
3   She was seen by PHYSICIAN: Tom Tucker            A4    4

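import re

word_before = r'PHYSICIAN:'
# lazily match everything up to (but not including) the first stop word
words_after = r'.*?(?=found |was |is )'
# fallback: match all word characters and whitespace after the keyword
words_all = r'PHYSICIAN:[\w\s]+'

pattern = re.compile(word_before + words_after, re.IGNORECASE)
pattern2 = re.compile(words_all, re.IGNORECASE)

# note: this overwrites the Text column (column 0) in place
for i in range(len(df)):
    text = df.iloc[i, 0]
    # first try the stop-word pattern
    new_text = pattern.sub("PHYSICIAN: **BLOCK** ", text)
    if 'PHYSICIAN: **BLOCK**' not in new_text:
        # no stop word follows the name, so block everything after the keyword
        new_text = pattern2.sub("PHYSICIAN: **BLOCK** ", text)
    df.iloc[i, 0] = new_text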

df
    Text                                       N_ID  P_ID
0   PHYSICIAN: **BLOCK** was here today        A1    1
1   And his PHYSICIAN: **BLOCK** found here    A2    2
2   Her PHYSICIAN: **BLOCK** is also here      A3    3
3   She was seen by PHYSICIAN: **BLOCK**       A4    4
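
As a possible variation on the approach above, the two patterns can be folded into one by letting the lookahead also accept the end of the text, and the stop words can come from a list. This is only a sketch: STOP_WORDS, PHYSICIAN_NAME and block_physician are hypothetical names, and the stop-word list is the same hard-coded guess as above. Requiring whitespace before a stop word and a word boundary after it keeps the match from stopping inside a name such as "Travis", which contains "is".

```python
import re

# Hypothetical stop-word list; extend it with whatever words can follow a
# name in your data. These three are the ones hard-coded in the answer.
STOP_WORDS = ["found", "was", "is"]

stop_alternation = "|".join(re.escape(word) for word in STOP_WORDS)

# Lazily match from "PHYSICIAN:" up to the first whole stop word, or up to
# the end of the text (ignoring trailing whitespace) if no stop word follows.
PHYSICIAN_NAME = re.compile(
    rf"PHYSICIAN:.*?(?=\s+(?:{stop_alternation})\b|\s*$)",
    re.IGNORECASE,
)


def block_physician(text: str) -> str:
    """Replace whatever follows 'PHYSICIAN:' (up to a stop word) with **BLOCK**."""
    return PHYSICIAN_NAME.sub("PHYSICIAN: **BLOCK**", text)


samples = [
    "PHYSICIAN: Jon J Smith was here today",
    "And his Physician: Mary Lisa Rider found here",
    "Her PHYSICIAN: Jane A Doe is also here",
    " She was seen by  PHYSICIAN: Tom Tucker ",
]

for sample in samples:
    print(repr(block_physician(sample)))
```

To get the Text_Phys column from the question without overwriting Text, this could be applied as df['Text_Phys'] = df['Text'].map(block_physician). If the column can contain missing values, passing na_action='ignore' to map skips them instead of handing a non-string to the regex.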
