Background

I have the following sample DataFrame df, whose Text column contains the word PHYSICIAN followed by the physician's name (all names below are made up):

import pandas as pd
df = pd.DataFrame({'Text' : ['PHYSICIAN: Jon J Smith was here today', 
                                   'And Mary Lisa Rider found here', 
                                   'Her PHYSICIAN: Jane A Doe is also here',
                                ' She was seen by  PHYSICIAN: Tom Tucker '], 

                      'P_ID': [1,2,3,4],
                      'N_ID' : ['A1', 'A2', 'A3', 'A4']

                     })

#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df

                                     Text         N_ID  P_ID
0   PHYSICIAN: Jon J Smith was here today           A1  1
1   And Mary Lisa Rider found here                  A2  2
2   Her PHYSICIAN: Jane A Doe is also here          A3  3
3   She was seen by PHYSICIAN: Tom Tucker           A4  4

Goal

1) Replace the names that follow the word PHYSICIAN (e.g. PHYSICIAN: Jon J Smith) with PHYSICIAN: **BLOCK**

2) Create a new column named Text_Phys

Desired Output

                                  Text            N_ID P_ID  Text_Phys
0   PHYSICIAN: Jon J Smith was here today           A1  1   PHYSICIAN: **BLOCK** was here today
1   And Mary Lisa Rider found here                  A2  2   And Mary Lisa Rider found here
2   Her PHYSICIAN: Jane A Doe is also here          A3  3   Her PHYSICIAN: **BLOCK** is also here
3   She was seen by PHYSICIAN: Tom Tucker           A4  4   She was seen by PHYSICIAN: **BLOCK**

I have tried the following

1) df['Text_Phys'] = df['Text'].replace(r'ABC.*', 'ABC: ***BLOCK***', regex=True)

2) df['Text_Phys'] = df['Text'].replace(r'ABC\s+', 'ABC: ***BLOCK***', regex=True)

But they don't seem to quite work

Question

How do I achieve my desired output?

  • It should be working as df['Text'] = df['Text'].replace(r'PHYSICIAN', 'PHYSICIAN: ***PHI***', regex=True) and df['Text'] = df['Text'].replace(r'Physician', 'Physician: ***PHI***', regex=True) Commented Jul 15, 2019 at 2:06
  • I tried but that doesn't quite work Commented Jul 15, 2019 at 2:10
  • How about import re, then df['Text_Phys'] = df['Text'].str.replace('PHYSICIAN', 'PHYSICIAN: ***PHI***', flags=re.I)? That will turn the matched word into upper case, though. The earlier version works fine for me. What version of pandas are you using? Commented Jul 15, 2019 at 2:12
  • how do you identify which part of the text is the name of the physician? Commented Jul 15, 2019 at 2:13
  • It is easy to get the substring after PHYSICIAN: . However, it is almost impossible to identify Jon J Smith, Jane A Doe, and Tom Tucker within that substring. How do you know they are the names to replace unless you have some rules to identify them? Commented Jul 15, 2019 at 2:25

1 Answer

Try this: use a regex that defines the word you want to match and where you want the search to stop. For the sake of time I hard-coded the stop words as "found |was |is "; to automate the code further, you could instead generate a list of all words occurring after "** ".
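To make the stop-word idea concrete, here is a minimal sketch on a single string, using only the standard re module. The sample sentence is taken from the question, and the stop words are the same hard-coded ones, so this is an illustration under those assumptions rather than a general name detector.

```python
import re

# Lazy match after "PHYSICIAN:" that stops right before a stop word.
# The lookahead (?=...) checks for the stop word without consuming it,
# so the stop word stays in the output.
pattern = re.compile(r'PHYSICIAN:.*?(?=found |was |is )', re.IGNORECASE)

text = 'Her PHYSICIAN: Jane A Doe is also here'
masked = pattern.sub('PHYSICIAN: **BLOCK** ', text)
print(masked)
```

Because the match is lazy, it ends at the first stop word after PHYSICIAN:, so a name that itself ends in "is " or "was " would be cut short.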


code below:

import pandas as pd
df = pd.DataFrame({'Text' : ['PHYSICIAN: Jon J Smith was here today', 
                                   'And his Physician: Mary Lisa Rider found here', 
                                   'Her PHYSICIAN: Jane A Doe is also here',
                                ' She was seen by  PHYSICIAN: Tom Tucker '], 

                      'P_ID': [1,2,3,4],
                      'N_ID' : ['A1', 'A2', 'A3', 'A4']

                     })

df = df[['Text','N_ID', 'P_ID']]
df
    Text    N_ID    P_ID
0   PHYSICIAN: Jon J Smith was here today   A1  1
1   And his Physician: Mary Lisa Rider found here   A2  2
2   Her PHYSICIAN: Jane A Doe is also here  A3  3
3   She was seen by PHYSICIAN: Tom Tucker   A4  4

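import re

# Text that marks the start of the span to mask
word_before = r'PHYSICIAN:'
# Lazily match up to (but not including) one of the hard-coded stop words
words_after = r'.*?(?=found |was |is )'
# Fallback: everything after PHYSICIAN: when no stop word follows the name
words_all = r'PHYSICIAN:[\w\s]+'

pattern = re.compile(word_before + words_after, re.IGNORECASE)
pattern2 = re.compile(words_all, re.IGNORECASE)

# Note: this overwrites the Text column (column 0) in place
for i in range(len(df)):
    text = df.iloc[i, 0]
    masked = pattern.sub("PHYSICIAN: **BLOCK** ", text)
    if 'PHYSICIAN: **BLOCK**' not in masked:
        # No stop word after the name, so use the broader pattern
        masked = pattern2.sub("PHYSICIAN: **BLOCK** ", text)
    df.iloc[i, 0] = masked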

df
    Text    N_ID    P_ID
0   PHYSICIAN: **BLOCK** was here today A1  1
1   And his PHYSICIAN: **BLOCK** found here A2  2
2   Her PHYSICIAN: **BLOCK** is also here   A3  3
3   She was seen by PHYSICIAN: **BLOCK**    A4  4
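The loop above overwrites Text, whereas the question asks for a new Text_Phys column. A loop-free variant might look like the sketch below. It assumes the same hard-coded stop words (found, was, is), adds word boundaries and an end-of-string alternative so a single pattern covers both cases, and uses the sample rows from this answer; treat it as one possible approach, not a general way to detect names.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Text': ['PHYSICIAN: Jon J Smith was here today',
             'And his Physician: Mary Lisa Rider found here',
             'Her PHYSICIAN: Jane A Doe is also here',
             ' She was seen by  PHYSICIAN: Tom Tucker '],
    'N_ID': ['A1', 'A2', 'A3', 'A4'],
    'P_ID': [1, 2, 3, 4],
})

# Match "PHYSICIAN:" plus the following text, stopping either before a
# whole stop word (assumed list: found, was, is) or at the end of the string.
pattern = r'PHYSICIAN:.*?(?=\s+(?:found|was|is)\b|\s*$)'

# Write the masked text to a new column and leave Text untouched.
df['Text_Phys'] = df['Text'].str.replace(
    pattern, 'PHYSICIAN: **BLOCK**', flags=re.IGNORECASE, regex=True
)
print(df[['Text', 'Text_Phys']])
```

Rows without PHYSICIAN: pass through unchanged, and leading or trailing whitespace in the original text is preserved.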