Revisions to How to use grep and cut in script to obtain website URLs from an HTML file

add more URL chars

Source Link

edit approved Jun 10, 2020 at 10:35

107
3

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_=_%:-]*" | sort -u

grep -Egrep -E : is the same as egrep
grep -ogrep -o : only outputs what has been grepped
(http|https)(http|https) : is an either / or
a-za-z : is all lower case
A-ZA-Z : is all uperupper case
.. : is dot
// : is the slash
?? : is ?
*= : is equal sign

_ : is underscore

% : is percentage sign

: : is colon

- : is dash

*: is repeat the [...] group
sort -usort -u : will sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
/ : is the slash
? : is ?
*: is repeat the [...] group
sort -u : will sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all upper case
. : is dot
/ : is the slash
? : is ?
= : is equal sign

_ : is underscore

% : is percentage sign

: : is colon

- : is dash

*: is repeat the [...] group
sort -u : will sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

fix description

Source Link

edited Jan 8, 2020 at 7:26

AdminBee

23.6k
25
55
77

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
/ : is the slash

? : is ?
*: is repeat the [...] group
sort -u : will remove sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
/?: is ?
*: is repeat the [...] group
sort -u : will remove sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
/ : is the slash

? : is ?
*: is repeat the [...] group
sort -u : will sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

fix description

Source Link

edit approved Jan 8, 2020 at 7:26

bbodenmiller

107
3

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
/?: is ?
*: is repeat the [...] group
uniqsort -u : will remove sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
?: is ?
*: is repeat the [...] group
uniq : will remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

Not sure if you are limited on tools:

But regex might not be the best way to go as mentioned, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u

grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all uper case
. : is dot
/?: is ?
*: is repeat the [...] group
sort -u : will remove sort & remove any duplicates

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You can also add in \d to catch other numeral types.

update to working url, update uniq

Source Link

edited Oct 21, 2019 at 14:15

jmunsch

4.5k
3
21
31

Loading

added 3 characters in body

Source Link

edited Jan 27, 2015 at 13:18

jmunsch

4.5k
3
21
31

Loading

Source Link

answered Jan 27, 2015 at 5:02

jmunsch

4.5k
3
21
31

Loading

Stack Exchange Network

Return to Answer