Not sure if you are limited on tools, and as mentioned regex might not be the best way to go, but here is an example that I put together:
cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" | sort -u
- grep -E: is the same as egrep
- grep -o: only outputs what has been grepped
- (http|https): is an either / or
- a-z: is all lower case
- A-Z: is all upper case
- .: is dot
- /: is the slash
- ?: is ?
- =: is equal sign
- _: is underscore
- %: is percentage sign
- :: is colon
- -: is dash
- *: is repeat the [...] group
- sort -u: will sort & remove any duplicates
Output:
bob@bob-NE722:~s$ wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...
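If you want to try the pattern without a file or a network call, here is a minimal sketch that feeds the same pipeline from a shell variable. The sample HTML and the example.com / example.org links are made up for illustration; only the grep and sort part comes from the command above.

```shell
# Hypothetical sample input: three links, one of them a duplicate.
html='<a href="https://example.com/a?x=1">one</a>
<a href="http://example.org/b_c">two</a>
<a href="https://example.com/a?x=1">dup</a>'

# Same pipeline as in the answer, reading from the variable instead of urls.html.
# The closing double quote is not in the bracket group, so each match stops there.
urls=$(printf '%s\n' "$html" | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" | sort -u)

printf '%s\n' "$urls"
```

The duplicate link should be collapsed by sort -u, leaving one line per distinct URL.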
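You can also add \d to catch digits, but \d is Perl syntax: it needs grep -P (GNU grep) rather than grep -E. With grep -E, 0-9 (already in the group) or [[:digit:]] does the job.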
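For the digit part, a small sketch using POSIX character classes, which grep -E understands inside a bracket group. Here [[:alnum:]] stands in for a-zA-Z0-9; the sample line and the `match` variable are made up for illustration.

```shell
# Hypothetical input line containing a URL that ends in digits.
line='see http://example.com/page42 now'

# [[:alnum:]] covers letters and digits; the rest of the group is unchanged.
match=$(printf '%s\n' "$line" | grep -Eo "(http|https)://[[:alnum:]./?=_%:-]*")

printf '%s\n' "$match"
```

The match ends at the first character outside the group, which here is the space after page42.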