Grep text with spaces

Question

I searched around and found these two topics, but they differ from my case: there the number of spaces is fixed, while my sample doesn't have a fixed number of spaces.

https://stackoverflow.com/questions/47428445/i-want-grep-to-grep-one-word-which-is-having-spaces-it

https://askubuntu.com/questions/949326/how-to-include-a-space-character-with-grep

Sample text:

<span>Section 1: Plan your day, write out your plan</span>

Desired Output:

Section 1: Plan your day, write out your plan

I would like to grep the text only, without the HTML tags. Here is my attempt.

wolf@linux:~$ cat file.txt 
<span>Section 1: Plan your day, write out your plan</span>
wolf@linux:~$ 

wolf@linux:~$ grep -oP 'S\S+ \d: \S+' file.txt 
Section 1: Plan
wolf@linux:~$ 

wolf@linux:~$ grep -oP 'S\S+ \d: \S+ \S+' file.txt 
Section 1: Plan your
wolf@linux:~$

Is there a better solution than adding \S+ one at a time, given that the length of the text varies?

I wouldn't try to use space/non-space as the determiner here (what happens when you get to plan</span>, for example, which is all non-space?). Instead, consider using lookarounds, e.g. grep -oP '(?<=<span>).*(?=</span>)' or even just grep -oP '(?<=>).*(?=<)'
– steeldriver, Commented Feb 21, 2021 at 13:00
Are the tags always <span></span> or are they ever something else?
– Nasir Riley, Commented Feb 21, 2021 at 13:33
@NasirRiley, this is actually from an HTML file, so there are tons of HTML tags in it. The text that I'm looking for always starts with Section \d:
– Wolf, Commented Feb 21, 2021 at 14:40
For grep, it does not matter whether there are HTML tags if you search for strings like Section 1:. As for the HTML tags, I would use a program for stripping them (w3m, html2text) after grep has found the text. You could also strip the HTML tags first and then search for your strings.
– nobody, Commented Feb 21, 2021 at 16:01

ilkkachu · Accepted Answer · 2021-02-21 15:25:16Z

3

With extended regexes, anchoring on the Section keyword and taking everything after it that's not a <:

$ grep -E -o 'Section [0-9]+:[^<]*' < file.txt
Section 1: Plan your day, write out your plan

I find that anchoring on the surrounding parts is easiest done with Perl, so if that's an option:

$ perl -lne 'print $1 if m,<span>(Section \d+:.*?)</span>,' < file.txt
Section 1: Plan your day, write out your plan

(There are ways to do a similar thing with grep -P, but I find them somewhat hard to read.)
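
For comparison, one grep -P form of the same idea might look like the sketch below. It assumes GNU grep built with PCRE support; the temporary file and its contents are made up for the demonstration. \K discards the opening tag from the reported match, and the lookahead stops the match before the closing tag.

```shell
# Throwaway sample file (hypothetical contents modelled on the question).
sample=$(mktemp)
printf '%s\n' \
  '<p>Intro</p>' \
  '<span>Section 1: Plan your day, write out your plan</span>' > "$sample"

# \K drops everything matched so far (the opening tag) from the output;
# (?=</span>) requires the closing tag without including it in the match.
result=$(grep -oP '<span>\KSection \d+:.*?(?=</span>)' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Whether this reads better than the Perl one-liner is a matter of taste; the capture group in Perl arguably states the intent more directly.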


Chris Davies · Accepted Answer · 2021-02-21 18:03:56Z

3

If your HTML is valid XML you can use xmlstarlet to pick out the appropriate element value.

xmlstarlet sel -t -v '//span' -n file.html
Section 1: Plan your day, write out your plan

Without knowing more of your page structure I can't offer a better XPath than //span, but, for example, if you know the span is inside a div you could use //div/span. There are many more selection options available.


Stéphane Chazelas · Accepted Answer · 2021-02-21 15:21:22Z

2

Sounds to me like you want to match sequences of characters other than < and > that contain <number>:, so:

grep -Po '[^<>]* \d+:[^<>]*'
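
To see how this behaves on a page with other tags, here is a small run against a made-up three-line sample (the file contents are an assumption, not taken from the question). The pattern keys on "space, digits, colon" inside any text node rather than on the word Section, so other text of that shape matches too; the second command shows one possible way to narrow it.

```shell
# Throwaway sample file (hypothetical contents).
sample=$(mktemp)
printf '%s\n' \
  '<h1>My notes</h1>' \
  '<div><span>Section 1: Plan your day, write out your plan</span></div>' \
  '<p>Call at 10: bring slides</p>' > "$sample"

# Any run of non-tag characters that contains " <digits>:".
any=$(grep -Po '[^<>]* \d+:[^<>]*' "$sample")

# Narrower variant: only text that starts with "Section <digits>:".
section_only=$(grep -Po 'Section \d+:[^<>]*' "$sample")

printf 'any:\n%s\n\nsection only:\n%s\n' "$any" "$section_only"

rm -f "$sample"
```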


Jason Croyle · Accepted Answer · 2021-02-21 17:49:33Z

2

A little off topic, but it could be useful: there is a great open source project called "pup" (https://github.com/EricChiang/pup). It is a command-line parser for HTML that uses CSS selectors. It's a small project that deserves more attention than it gets; it is basically "jq" for HTML.

Jason C.


I hadn't heard of pup; it sounds like a good tool, especially for grepping based on the tags!
– αғsнιη, Commented Feb 21, 2021 at 18:26

DanieleGrassini · Accepted Answer · 2021-02-21 17:58:37Z

0

Perl look(ahead|behind) assertions can be helpful:

grep -Po "(?<=>).+(?=</)" yourfile

This matches anything between the HTML tags and strips the tags themselves.
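
One thing to watch: .+ is greedy, so on a line with more than one tag pair it runs from the first > to the last </ and pulls inner tags into the output. The sketch below, using a made-up one-line sample and assuming GNU grep with PCRE support, contrasts that with a variant that forbids < and > inside the match.

```shell
# Throwaway sample with nested and sibling tags (hypothetical contents).
sample=$(mktemp)
printf '%s\n' '<li><span>Section 1: Plan your day</span> <b>now</b></li>' > "$sample"

# Greedy form: spans from the first ">" to the last "</" on the line.
greedy=$(grep -Po '(?<=>).+(?=</)' "$sample")

# Restricted form: a text run with no "<" or ">" in it, followed by "</".
narrow=$(grep -Po '(?<=>)[^<>]+(?=</)' "$sample")

printf 'greedy:\n%s\n\nnarrow:\n%s\n' "$greedy" "$narrow"

rm -f "$sample"
```

The restricted form prints each text node on its own line, which may or may not be what you want if the text you are after is itself split by inline tags.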


Praveen Kumar BS · Accepted Answer · 2021-02-21 18:28:04Z

0

Command:

awk -F "[<>]" '{print $3}' filename

Output:

Section 1: Plan your day, write out your plan
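
Note that $3 relies on the text always being the third field, that is, on the line starting with exactly one opening tag. If the nesting varies, a sketch along these lines (the sample contents are made up) could scan every field and print those that start with Section <number>: instead.

```shell
# Throwaway sample with differing nesting depth (hypothetical contents).
sample=$(mktemp)
printf '%s\n' \
  '<span>Section 1: Plan your day, write out your plan</span>' \
  '<div><span>Section 2: Review the list</span></div>' > "$sample"

# Split on "<" and ">", then print any field that begins with "Section <digits>:".
sections=$(awk -F '[<>]' '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^Section [0-9]+:/) print $i
}' "$sample")
printf '%s\n' "$sections"

rm -f "$sample"
```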

