extract substrings using python regex

Question

I would like to use a regular expression that matches any text between two strings:

   sample_string= "Message ID: SM9MatRNTnMAYaylR0QgOH///qUUveBCbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    [email protected]:  
     [EVENT] 347376954900491 ([email protected]) created room
    (roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
    READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
    roomType=PRIVATE conversationScope=internal owningCompany=X Y
    Bank)
    
    Message ID: nsabNaqeXfuEj9mBEhvS0n///qUUveAhbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    [email protected]  
     [EVENT] 347376954900491 ([email protected]) invited 347376954900486
    ([email protected]) to room (CSTest|john s|16091907435583)
    
    Message ID: Nu/EYTkTQ5qdbqzZ0Rig8n///qUUvQ42dA==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    [email protected]  
    
    Catchyou later
    
      
    
    Message ID: dy2yaByqhm+n88Gd3VQOhH///qUUrz8odA==  
    2021-07-10T20:48:23.997Z kerren n (X Y Bank) -
    [email protected]  
    
    KeywordContent_ Cricket is a bat-and-ball game played between two teams of
    eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
    with a wicket at each end, each comprising two bails balanced on three stumps.
    The batting side scores runs by striking the ball bowled at the wicket with
    the bat, while the bowling and fielding side tries to prevent this and dismiss
    each player (so they are "out").
    
      
    
    * * *
    
    Generated by Content Export Service | Stream Type: SymphonyPost |
    Stream ID: ZZo5pRRPFC18uzlonFjya3///qUUveBHdA== | Room Type: Private |
    Conversation Scope: internal | Owning Company: X Y Bank | File
    Generated Date: 2021-07-10T20:48:23.997Z | Content Start Date:
    2021-07-10T20:48:23.997Z | Content Stop Date: 2021-07-10T20:48:23.997Z  
    
    * * *
    
    *** (780787) Disclaimer: 
    (incorporated in paris with Ref. No. ZC18, is authorised by Prudential Regulation
    Authority (PRA) and regulated by Financial Conduct Authority and PRA. oyp and
    its affiliates (We) monitor this confidential message meant for your
    information only. We make no recommendation or offer. You should get
    independent advice. We accept no liability for loss caused hereby. See market
    commentary disclaimers (
    http://wholesalebanking.com/en/utility/Pages/d-mkt.aspx ),
    Dodd-Frank and EMIR disclosures (
    http://wholesalebanking.com/en/capabilities/financialmarkets/Pages/default.aspx
    ) "

In this example, I would like to extract everything between each email ID and the next Message ID keyword, so the expected output would be:

extracted_list =[':  
 [EVENT] 347376954900491 ([email protected]) created room
(roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
roomType=PRIVATE conversationScope=internal owningCompany=X Y
Bank)','says  
 [EVENT] 347376954900491 ([email protected]) invited 347376954900486
([email protected]) to room (CSTest|john s|16091907435583)','says Catchyou later','says 
KeywordContent_ Cricket is a bat-and-ball game played between two teams of
eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with
the bat, while the bowling and fielding side tries to prevent this and dismiss
each player (so they are "out").']

Note: everything after the *** at the end is not part of the text to extract.

What I tried so far is:

import re

text = re.findall(r'\S+@\S+\s+(.*)Message ID', sample_string)
print(text)
# output: []

So, basically your question is: how do I extract a part of a text (string), starting from the email ID up to Message ID? Always try to provide a minimal example, not a big wall of text.
– Markus Weninger, Commented Oct 14, 2021 at 6:13
@MarkusWeninger Yes, sorry, I have just started using this platform.
– newbie, Commented Oct 14, 2021 at 6:17
I think you mean [^\s@]+@[^\s@]+\s(.*?)\bMessage ID\b (regex101.com/r/zd5w8v/1), but you have to add re.DOTALL as the last parameter of re.findall.
– The fourth bird, Commented Oct 14, 2021 at 7:41
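
To illustrate the point made in the last comment, here is a minimal sketch comparing the original attempt with the suggested pattern plus re.DOTALL. The shortened sample text and the example.com addresses are made up for illustration (the real addresses are hidden in the question), so treat the output as a property of this sketch only.

```python
import re

# Hypothetical, shortened sample; the addresses are invented for illustration.
sample = (
    "Message ID: AAA==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john@example.com\n"
    "\n"
    "Catchyou later\n"
    "\n"
    "Message ID: BBB==\n"
    "2021-07-10T20:48:23.997Z kerren n (X Y Bank) -\n"
    "kerren@example.com\n"
    "\n"
    "Cricket is a bat-and-ball game\n"
)

# Original attempt: without re.DOTALL, "." stops at line breaks, so "(.*)"
# can never reach a "Message ID" that sits on a later line.
no_dotall = re.findall(r'\S+@\S+\s+(.*)Message ID', sample)

# Pattern from the comment, with re.DOTALL so "." also matches newlines.
with_dotall = re.findall(
    r'[^\s@]+@[^\s@]+\s(.*?)\bMessage ID\b', sample, re.DOTALL
)
cleaned = [part.strip() for part in with_dotall]

print(no_dotall)
print(cleaned)
```

In this sketch the last message is not returned by the second pattern, because no Message ID follows it; handling the final * * * separator needs an extra alternative in the pattern.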

Wiktor Stribiżew · Accepted Answer · 2021-10-14 08:40:54Z

You can use

(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)

See the regex demo.

Details:

(?s) - same as re.DOTALL, inline modifier to make . match across line breaks
\S+ - one or more non-whitespace chars (can be replaced with [^\s@]+)
@ - a @ char
\S+? - one or more non-whitespace chars as few as possible
((?:says?|:)?\s.*?) - Group 1: an optional says/say/: and then a whitespace and then any zero or more chars as few as possible
\s+ - one or more whitespaces
(?:Message ID|\* +\* +\*) - either Message ID or * * * like substring.
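
As a rough sketch of how the pattern above could be applied with re.findall, assuming a shortened sample in the same layout as the question. The example.com addresses and the abbreviated message bodies are hypothetical stand-ins, not the asker's real data.

```python
import re

# Hypothetical, shortened sample; the addresses are invented for illustration.
sample = (
    "Message ID: AAA==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john@example.com:\n"
    " [EVENT] created room\n"
    "\n"
    "Message ID: BBB==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john@example.com\n"
    "\n"
    "Catchyou later\n"
    "\n"
    "Message ID: CCC==\n"
    "2021-07-10T20:48:23.997Z kerren n (X Y Bank) -\n"
    "kerren@example.com\n"
    "\n"
    "Cricket is a bat-and-ball game\n"
    "\n"
    "* * *\n"
    "\n"
    "Generated by Content Export Service\n"
)

# The pattern from the answer; (?s) at the start acts like re.DOTALL.
pattern = re.compile(
    r'(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)'
)

# With a single capturing group, findall returns the Group 1 strings.
# strip() only trims leading/trailing whitespace left around each capture.
extracted = [part.strip() for part in pattern.findall(sample)]

for item in extracted:
    print(repr(item))
```

The last message is captured here because the alternation also accepts the * * * separator as an end marker, not only Message ID.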
