
I would like to use a regular expression that matches any text between two strings:

   sample_string= "Message ID: SM9MatRNTnMAYaylR0QgOH///qUUveBCbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    [email protected]:  
     [EVENT] 347376954900491 ([email protected]) created room
    (roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
    READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
    roomType=PRIVATE conversationScope=internal owningCompany=X Y
    Bank)
    
    Message ID: nsabNaqeXfuEj9mBEhvS0n///qUUveAhbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    [email protected]  
     [EVENT] 347376954900491 ([email protected]) invited 347376954900486
    ([email protected]) to room (CSTest|john s|16091907435583)
    
    Message ID: Nu/EYTkTQ5qdbqzZ0Rig8n///qUUvQ42dA==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    [email protected]  
    
    Catchyou later
    
      
    
    Message ID: dy2yaByqhm+n88Gd3VQOhH///qUUrz8odA==  
    2021-07-10T20:48:23.997Z kerren n (X Y Bank) -
    [email protected]  
    
    KeywordContent_ Cricket is a bat-and-ball game played between two teams of
    eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
    with a wicket at each end, each comprising two bails balanced on three stumps.
    The batting side scores runs by striking the ball bowled at the wicket with
    the bat, while the bowling and fielding side tries to prevent this and dismiss
    each player (so they are "out").
    
      
    
    * * *
    
    Generated by Content Export Service | Stream Type: SymphonyPost |
    Stream ID: ZZo5pRRPFC18uzlonFjya3///qUUveBHdA== | Room Type: Private |
    Conversation Scope: internal | Owning Company: X Y Bank | File
    Generated Date: 2021-07-10T20:48:23.997Z | Content Start Date:
    2021-07-10T20:48:23.997Z | Content Stop Date: 2021-07-10T20:48:23.997Z  
    
    * * *
    
    *** (780787) Disclaimer: 
    (incorporated in paris with Ref. No. ZC18, is authorised by Prudential Regulation
    Authority (PRA) and regulated by Financial Conduct Authority and PRA. oyp and
    its affiliates (We) monitor this confidential message meant for your
    information only. We make no recommendation or offer. You should get
    independent advice. We accept no liability for loss caused hereby. See market
    commentary disclaimers (
    http://wholesalebanking.com/en/utility/Pages/d-mkt.aspx ),
    Dodd-Frank and EMIR disclosures (
    http://wholesalebanking.com/en/capabilities/financialmarkets/Pages/default.aspx
    ) "

In this example, I would like to extract everything between the email ID and the next Message ID keyword, so the expected output would be:

extracted_list =[':  
 [EVENT] 347376954900491 ([email protected]) created room
(roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
roomType=PRIVATE conversationScope=internal owningCompany=X Y
Bank)','says  
 [EVENT] 347376954900491 ([email protected]) invited 347376954900486
([email protected]) to room (CSTest|john s|16091907435583)','says Catchyou later','says 
KeywordContent_ Cricket is a bat-and-ball game played between two teams of
eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with
the bat, while the bowling and fielding side tries to prevent this and dismiss
each player (so they are "out").']

Note: everything from the * * * separator onward is not part of the text to extract.

What I tried so far is:

import re

text = re.findall(r'\S+@\S+\s+(.*)Message ID', sample_string)
print(text)
# output: []
  • So, basically your question is: how to extract part of a text (string), starting from emailID up until Message ID? Always try to provide a minimal example, not a big wall of text. Commented Oct 14, 2021 at 6:13
  • @MarkusWeninger Yes, sorry, I have just started to use this platform. Commented Oct 14, 2021 at 6:17
  • Is there supposed to be an emailID somewhere in there? Commented Oct 14, 2021 at 6:29
  • @Jesper emailID is right after (X Y Bank) - Commented Oct 14, 2021 at 6:32
  • I think you mean [^\s@]+@[^\s@]+\s(.*?)\bMessage ID\b (regex101.com/r/zd5w8v/1), but you have to pass re.DOTALL as the last argument of re.findall. Commented Oct 14, 2021 at 7:41
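
To illustrate the re.DOTALL point from the last comment, here is a minimal sketch on a shortened, made-up sample (the address and the message IDs are placeholders, not the asker's data). Without re.DOTALL, . does not match line breaks, so the original attempt finds nothing; with the flag and a lazy (.*?), each body up to the next Message ID is captured.

```python
import re

# Made-up, shortened sample; the address and IDs are placeholders.
sample = (
    "john.s@example.com:\n"
    " [EVENT] created room\n"
    "\n"
    "Message ID: BBB==\n"
    "john.s@example.com\n"
    "\n"
    "Catch you later\n"
    "\n"
    "Message ID: CCC==\n"
)

# Original attempt: '.' cannot cross line breaks, so nothing matches.
attempt = re.findall(r'\S+@\S+\s+(.*)Message ID', sample)

# Pattern from the comment, with re.DOTALL so '.' also matches newlines.
matches = re.findall(r'[^\s@]+@[^\s@]+\s(.*?)\bMessage ID\b', sample, re.DOTALL)

# Trim the surrounding whitespace that the lazy group picks up.
fixed = [m.strip() for m in matches]

print(attempt)
print(fixed)
```

Note that this pattern only captures bodies that are followed by another Message ID. In the export from the question, the last message is followed by a * * * separator instead, which this pattern does not handle.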

1 Answer

You can use

(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)

See the regex demo.

Details:

  • (?s) - same as re.DOTALL, inline modifier to make . match across line breaks
  • \S+ - one or more non-whitespace chars (can be replaced with [^\s@]+)
  • @ - a @ char
  • \S+? - one or more non-whitespace chars as few as possible
  • ((?:says?|:)?\s.*?) - Group 1: an optional says/say/: and then a whitespace and then any zero or more chars as few as possible
  • \s+ - one or more whitespaces
  • (?:Message ID|\* +\* +\*) - either Message ID or * * * like substring.
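
As a rough sketch of how this pattern could be used from Python, the snippet below runs it with re.findall on a shortened, made-up sample shaped like the export in the question (the address, message IDs, and bodies are placeholders). The strip() step is an addition for readability, not part of the pattern itself.

```python
import re

# Made-up, shortened sample in the same shape as the export in the question.
sample = (
    "Message ID: AAA==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john.s@example.com:\n"
    " [EVENT] created room\n"
    "\n"
    "Message ID: BBB==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john.s@example.com\n"
    "\n"
    "Catch you later\n"
    "\n"
    "* * *\n"
    "\n"
    "Generated by Content Export Service\n"
)

# The pattern from the answer; (?s) at the start is the inline form of re.DOTALL.
pattern = re.compile(r"(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)")

# With a single capturing group, findall returns the Group 1 text of each match.
extracted = pattern.findall(sample)

# Optional cleanup: drop leading/trailing whitespace around each body.
cleaned = [body.strip() for body in extracted]

print(cleaned)
```

Because the end delimiter accepts either Message ID or * * *, the last message before the separator is captured as well, and nothing after the separator is included.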