I have a file that mixes ordinary text, which I need, with HTML tags. I know that a regular expression can match HTML tags and that sed can replace the matches with an empty string, but I do not know how to apply this in practice.
3 Answers
If you do not insist on sed, the best tool for this is lynx.
lynx --dump <filename>.html
This prints the content of the HTML file as plain text, laid out the way the markup was meant to be displayed. The only condition is that the filename has a .html or .htm extension.
As long as your HTML tags are confined to a single line, the following will work:
sed 's/<[^>]*>//g'
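For a concrete run, here is a minimal sketch. The function names strip_tags and strip_tags_multiline are made up for illustration. The second variant is not part of the answer above: it keeps appending input lines while an unclosed < remains, and it assumes GNU sed, which prints the pattern space when N reaches the end of the input.

```shell
#!/bin/sh
# strip_tags: the one-liner from the answer, reading stdin and writing stdout.
# (Hypothetical helper name, used only for this example.)
strip_tags() {
    sed 's/<[^>]*>//g'
}

# strip_tags_multiline: sketch for tags that span several lines.
# After removing complete tags, if a '<' is still present, append the
# next input line (N) and jump back to label 'a' to try again.
# Caveat: a literal '<' in the text makes sed keep joining lines.
strip_tags_multiline() {
    sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e 'N' -e 'ba' -e '}'
}

# Tags confined to one line
printf '<p>Hello <b>world</b></p>\n' | strip_tags
# prints: Hello world

# A tag broken across two lines
printf '<a\nhref="x">link</a>\n' | strip_tags_multiline
# prints: link
```

To process a file instead of a pipe, redirect it: strip_tags < input.html > output.txt (both filenames are placeholders).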
What will happen with <tag attribute="legal use of >" foo=bar>? html.spec.whatwg.org/multipage/… (goldilocks, Feb 16, 2015 at 13:31)
This will not handle comments correctly. Example: <!-- > -->. (Thom Smith, Feb 16, 2015 at 18:38)
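The two cases raised in these comments can be reproduced directly. A small sketch using the same sed expression follows; the helper name strip_tags and the surrounding sample text are made up for the example.

```shell
#!/bin/sh
# Same expression as in the answer, wrapped in a hypothetical helper.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# A '>' inside a quoted attribute value ends the match too early,
# so the tail of the tag is left behind.
printf '%s\n' '<tag attribute="legal use of >" foo=bar>text' | strip_tags
# prints: " foo=bar>text

# A '>' inside a comment has the same effect.
printf '%s\n' 'before<!-- > -->after' | strip_tags
# prints: before -->after
```

In both cases the pattern [^>]* stops at the first >, whether or not that character actually closes the tag.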
sed is not a good tool for removing HTML tags.