I have a dataset with over 100,000 rows and 300 columns.

Here is the sample dataset:

import numpy as np
import pandas as pd

pd.options.display.max_colwidth = 1000

df = pd.DataFrame({'EVENT_DTL':['1. Name : John Johns \n2. Date : 05 March 2013 \n3. founded : 75075 Plano, Dallas Texas \n4. Charactor : Impersive \n5. Corona corelation : Cannot be found',
                               '1. Name : Mark Dwaine \n2. Date : 13 January 2020 \n3. founded : 45184 Miami, Florida \n4. Charactor : Slow learner \n5. Corona corelation : Suicide because of the economic difficulty',
                               '1. Name : Janny chung \n2. Date : 11 December 2011 \n3. founded : 77543 Bay area, San Fransisco \n4. Charactor : Always ambitious \n5. Corona corelation : Cannot be found but probably related to epidemic',
                               '1. Name : Sally \n2. Date : 11 December 2021 \n3. founded : 75074 Saginow, Fort Worth \n4. Charactor : energetic \n5. Corona corelation : Her friends guess it is because of corona'],
                   'EVENT_DTL_2':['He is always fast mover','He is brillient, smart','she is kind of person who is always eager to learn new subejct','he was a lunatic, his neighber said']})
df.loc[2,'EVENT_DTL_2'] = np.nan

df

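I'm trying to insert the value of 'EVENT_DTL_2' into 'EVENT_DTL', right after the '\n4. Charactor : xxx' substring (that is, just before '\n5.'). Rows where 'EVENT_DTL_2' is NaN should be left unchanged.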

The desired output is:

df2 = pd.DataFrame({'EVENT_DTL':['1. Name : John Johns \n2. Date : 05 March 2013 \n3. founded : 75075 Plano, Dallas Texas \n4. Charactor : Impersive He is always fast mover\n5. Corona corelation : Cannot be found',
                               '1. Name : Mark Dwaine \n2. Date : 13 January 2020 \n3. founded : 45184 Miami, Florida \n4. Charactor : Slow learner He is brillient, smart\n5. Corona corelation : Suicide because of the economic difficulty',
                               '1. Name : Janny chung \n2. Date : 11 December 2011 \n3. founded : 77543 Bay area, San Fransisco \n4. Charactor : Always ambitious \n5. Corona corelation : Cannot be found but probably related to epidemic',
                               '1. Name : Sally \n2. Date : 11 December 2021 \n3. founded : 75074 Saginow, Fort Worth \n4. Charactor : energetic he was a lunatic, his neighber said\n5. Corona corelation : Her friends guess it is because of corona'],
                   'EVENT_DTL_2':['He is always fast mover','He is brillient, smart',np.nan,'he was a lunatic, his neighber said']})
df2

I need an efficient way to do this, since I have to apply the method to a very large dataset.

1 Answer

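You can split each string just before '\n5.' and join the pieces back together with 'EVENT_DTL_2' in between. Fill NaN with an empty string first; otherwise the concatenation turns the whole row into NaN:

# Zero-width lookahead: split once, right before "\n5.", keeping the marker in the second part
parts = df['EVENT_DTL'].str.split(r'(?=\n5\.)', n=1, expand=True)
df['EVENT_DTL'] = parts[0] + df['EVENT_DTL_2'].fillna('') + parts[1]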
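
As an alternative to splitting, the insertion can also be done row by row with plain string operations, which may be worth benchmarking on the full dataset. This is only a minimal sketch: `insert_before_item5` is a hypothetical helper name, the sample strings are shortened from the question, and it assumes missing values are `None` or a float NaN and that text without a '\n5.' marker should stay unchanged.

```python
def insert_before_item5(text, extra, marker="\n5."):
    """Return `text` with `extra` inserted just before the first `marker`."""
    # None and float NaN (which is not equal to itself) mean "nothing to insert".
    if extra is None or extra != extra:
        return text
    head, sep, tail = text.partition(marker)
    # Assumption: if the marker is absent, leave the text unchanged.
    if not sep:
        return text
    return head + str(extra) + sep + tail


# Shortened versions of two rows from the question's sample data.
texts = [
    "1. Name : John Johns \n4. Charactor : Impersive \n5. Corona corelation : Cannot be found",
    "1. Name : Janny chung \n4. Charactor : Always ambitious \n5. Corona corelation : Cannot be found",
]
extras = ["He is always fast mover", float("nan")]

result = [insert_before_item5(t, e) for t, e in zip(texts, extras)]

# With the question's DataFrame, the same comprehension would be applied as:
# df['EVENT_DTL'] = [insert_before_item5(t, e)
#                    for t, e in zip(df['EVENT_DTL'], df['EVENT_DTL_2'])]
```

The helper relies on the trailing space that already precedes '\n5.' in the data, so it adds no separator of its own.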