I have an HTML file and I want to extract the string between a pattern. The file looks like this:

<span>aghahan.com</span>
<span>pouyamannequin.com</span>

I need the domains inside the span tags: aghahan.com, pouyamannequin.com

I tried this command:

sed -e 's/>!\(.*\)>.com<\/span>/\1/' domain.txt

but I get the wrong result. I would be thankful if anybody could help me.

4 Answers

As each line begins with <span> and ends with </span>:

sed 's|<span>\(.*\)</span>|\1|' domain.txt

You can also do it with awk by setting the field separator to either < or > and printing the third field:

awk -F '[<>]' '{print $3}' domain.txt

Output:

aghahan.com
pouyamannequin.com

These are probably the simplest ways to do it. Both also cope with trailing white space after </span>: awk drops it from the output, while the sed command leaves it at the end of the line.
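
To see why the third field is the one to print, here is a minimal Python sketch that imitates the awk split. Python is used purely for illustration, and the function name third_field is hypothetical, not part of the answer:

```python
import re


def third_field(line):
    """Split on '<' or '>' like awk -F '[<>]' and return what awk calls $3."""
    fields = re.split(r"[<>]", line)
    # fields[0] is the empty text before the first '<', so awk's $3 is fields[2]
    return fields[2] if len(fields) > 2 else ""


line = "<span>aghahan.com</span>"
print(re.split(r"[<>]", line))  # shows every field the separator produces
print(third_field(line))
```

The split gives an empty first field, then span, the domain, /span, and a last field holding whatever follows the closing tag, which is why trailing white space never ends up in the third field.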

With sed, using three greedy groups and keeping only the second one:

sed 's/\(.*\)>\(.*\)<\(.*\)/\2/g' domain.txt
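
A rough Python equivalent shows what each group captures. This is only a sketch: Python's re module stands in for sed here, and the names PATTERN and between_tags are hypothetical. For a line with a single span, the two engines should pick the same groups:

```python
import re

# Same three greedy groups as the sed expression, written in Python regex syntax
PATTERN = re.compile(r"(.*)>(.*)<(.*)")


def between_tags(line):
    """Return the second group, i.e. the part sed keeps with backreference 2."""
    match = PATTERN.match(line)
    # Like sed, leave lines that do not match unchanged
    return match.group(2) if match else line


print(PATTERN.match("<span>aghahan.com</span>").groups())
print(between_tags("<span>aghahan.com</span>"))
```

The first group has to back off to the > that closes the opening tag, because a < must still follow it. That leaves the domain in the second group.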

With python and BeautifulSoup:

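python -c '
from bs4 import BeautifulSoup

# Parse the file with the html.parser backend
with open("domain.txt") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# get_text(strip=True) drops the white space around multi-line span contents
for span in soup.find_all("span"):
    print(span.get_text(strip=True))
'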

This might be a bit overkill for your simple task, but a real HTML parser is more robust and easier to extend to harder cases, e.g. if the HTML is spread over several lines like this:

<span>
 aghahan.com
</span>
<span>
 pouyamannequin.com
</span>
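
If installing BeautifulSoup is not an option, the same idea can be sketched with Python's standard-library html.parser module. This swaps in a different parser from the one used above, and the names SpanTextParser and extract_domains are hypothetical:

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the stripped text of every <span> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # greater than 0 while inside a <span>
        self.buffer = []   # text pieces of the span being read
        self.domains = []  # finished span contents

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.buffer).strip()
                if text:
                    self.domains.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_domains(html):
    """Return the text of each span, with surrounding white space removed."""
    parser = SpanTextParser()
    parser.feed(html)
    parser.close()
    return parser.domains


sample = "<span>\n aghahan.com\n</span>\n<span>\n pouyamannequin.com\n</span>\n"
print("\n".join(extract_domains(sample)))
```

Because the parser works on the whole document rather than line by line, it does not matter whether the span contents sit on the same line as the tags or on lines of their own.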

With awk piped into sed:

awk -F '>' '{print $2}' domain.txt | sed 's/<.*//'

Output:

aghahan.com
pouyamannequin.com
