
Not sure if you are limited on tools.

As mentioned, regex might not be the best way to go, but here is an example that I put together:

cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" | sort -u
  • grep -E : is the same as egrep (extended regular expressions)
  • grep -o : prints only the part of each line that matched
  • (http|https) : is an either / or
  • a-z : is all lower case
  • A-Z : is all upper case
  • 0-9 : is the digits
  • . : is dot
  • / : is the slash
  • ? : is ?
  • = : is equal sign
  • _ : is underscore
  • % : is percentage sign
  • : : is colon
  • - : is dash (placed last inside the brackets so it is taken literally)
  • * : repeats the [...] group zero or more times
  • sort -u : will sort & remove any duplicates
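
To see what the pipeline above does without needing a real urls.html, here is a minimal sketch that feeds it a few made-up lines of HTML. The sample text and the example.com URLs are assumptions for illustration only; the grep and sort parts are the same as in the command above.

```shell
# Hypothetical sample input standing in for urls.html
sample='<a href="https://example.com/a?x=1">one</a>
<a href="http://example.com/b_c">two</a>
<img src="https://example.com/a?x=1">'

# Same grep/sort pipeline as above, reading from the variable instead of a file.
# Each match stops at the first character outside the bracket list (here the closing quote).
urls=$(printf '%s\n' "$sample" | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" | sort -u)

printf '%s\n' "$urls"
```

The repeated https://example.com/a?x=1 should appear only once, since sort -u drops the duplicate.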

Output:

bob@bob-NE722:~s$  wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

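The 0-9 range already catches digits. \d is a Perl-style shorthand that is not part of POSIX extended regex, so grep -E may not honor it; with GNU grep it needs -P instead.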
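
If a named digit class is wanted while staying with grep -E, the POSIX character classes are one option. The sketch below is only an illustration: the input line is made up, and swapping a-zA-Z0-9 for [[:alpha:]] and [[:digit:]] is my assumption about what would be wanted, not part of the original command.

```shell
# Hypothetical input line containing a URL with digits in it
line='see http://example.com/item42?id=7 for details'

# [[:alpha:]] and [[:digit:]] are POSIX classes usable inside a bracket expression,
# so this works with grep -E and does not rely on \d
match=$(printf '%s\n' "$line" | grep -Eo "(http|https)://[[:alpha:][:digit:]./?=_%:-]*")

printf '%s\n' "$match"
```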

fix description
Source Link
AdminBee

update to working url, update uniq
Source Link
jmunsch
added 3 characters in body
Source Link
jmunsch
Source Link
jmunsch