1

I'm trying to extract a substring from a large string that matches my pattern.

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

pattern = 'dumbweb.com'

Here I'm trying to find the string that matches the pattern:

import re
theLink = re.findall(pattern, text)
print(theLink)  # output: ['dumbweb.com']

But I'm only able to find the exact text that I'm searching for. I'm trying to get the full whitespace-delimited string that contains it.

desired output:

theLink  # www.dumbweb.com/Dumbo

I tried searching for similar questions, but I'm not able to phrase it right. I even looked up the Python regex documentation and still couldn't achieve what I'm looking for.

1
  • 2
    You literally mentioned split by space, so try: print([k for k in text.split() if 'dumbweb.com' in k]) Commented Jun 23, 2021 at 7:11

5 Answers

4

You may consider this approach:

import re
text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'
pattern = 'dumbweb.com'

rex = re.compile(r'\b' + r'\S*' + re.escape(pattern) + r'\S*')
print(rex.findall(text))

Output:

['www.dumbweb.com/Dumbo']

Explanation:

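  • re.compile(...): compiles the pattern string into a reusable regex object
  • r'\b': word boundary
  • r'\S*': matches 0 or more non-whitespace characters (the part of the token before the pattern)
  • re.escape(pattern): escapes regex metacharacters so the given string is matched literally
  • r'\S*': matches 0 or more non-whitespace characters (the rest of the token after the pattern)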
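The pieces listed above can be wrapped in a small helper. This is only a sketch: the function name find_tokens_containing is made up for illustration and is not part of the re module.

```python
import re

def find_tokens_containing(text, literal):
    """Return every whitespace-delimited token in text that contains literal.

    Hypothetical helper, named here only for illustration.
    """
    # re.escape makes '.' and other metacharacters match themselves.
    token_re = re.compile(r'\b' + r'\S*' + re.escape(literal) + r'\S*')
    return token_re.findall(text)

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'
links = find_tokens_containing(text, 'dumbweb.com')
print(links)  # ['www.dumbweb.com/Dumbo']
```

findall returns a list, so take links[0] if only the first hit is wanted, after checking that the list is not empty.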

1

You could try this:

[^ ]*dumbweb\.com[^ ]*

Note that in a regex an unescaped . matches any character (except a newline, by default). You need to use \. to match only a literal period.
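A quick way to see the difference is to compare an unescaped and an escaped version of the same pattern. The sample string 'dumbweb$com' below is invented purely for the demonstration.

```python
import re

# '.' left unescaped: matches any character except a newline.
unescaped = re.compile(r'dumbweb.com')
# re.escape turns the '.' into '\.', so only a literal period matches.
escaped = re.compile(re.escape('dumbweb.com'))

sample = 'dumbweb$com'  # made-up input that is not the real domain
print(bool(unescaped.search(sample)))                 # True
print(bool(escaped.search(sample)))                   # False
print(bool(escaped.search('www.dumbweb.com/Dumbo')))  # True
```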

1

Try this:

re.search(r'dumbweb\.com\S*', text).group()
# matches your string followed by any run of non-whitespace characters;
# returns 'dumbweb.com/Dumbo' here (the leading 'www.' is not included)
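One thing to watch with this form: re.search returns None when nothing matches, so calling .group() directly raises AttributeError on text that does not contain the pattern. A minimal guarded sketch, where the 'no links here' string is an invented example:

```python
import re

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

# Check the match object before calling .group().
match = re.search(r'dumbweb\.com\S*', text)
link = match.group() if match else None
print(link)  # dumbweb.com/Dumbo

# No match: re.search returns None instead of a match object.
missing = re.search(r'dumbweb\.com\S*', 'no links here')
print(missing)  # None
```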

1

Probably not the cleanest solution:

text = 'This is a large subsring. bla bla bla AND www.dumbweb.com/Dumbo and www.otherLinks.com...'

pattern = 'dumbweb.com'

for word in text.split():
    if pattern in word:  # find() > 0 would miss a word that starts with the pattern
        print(word)

1

Your pattern should be

pattern = r"www\.dumbweb\.com\S*"

This matches the link starting at www.dumbweb.com and continuing up to the next whitespace character.

2 Comments

This will also match wwwwdumbweb$com
can you please check my new answer, is it fine?
