3

I want to process some HTML code and remove the tags as in the example:

"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph."

I'm using Python. Do you know of a library I could use to remove the HTML tags?

Thanks!

5 Answers

5

This question may help you: Strip HTML from strings in Python

No matter what solution you choose, I'd recommend avoiding regular expressions. They can be slow when processing large strings, they might not work due to invalid HTML, and stripping HTML with regex isn't always secure or reliable.
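As a rough illustration of the parser-based route, here is a minimal sketch that uses only the standard library's `html.parser`. The names `TagStripper` and `strip_tags` are made up for this example, and it assumes Python 3; treat it as a starting point rather than a hardened sanitizer.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Illustrative helper (hypothetical name): collects only the text nodes."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never stored
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags dropped."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
# This is a very interesting paragraph.
```

Note that this keeps the text inside elements like `<script>` and `<style>` too, so it is not a substitute for a real sanitizer when the input is untrusted.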

2 Comments

It's not merely the case that parsing HTML with regexen is difficult, slow, or inadvisable. The problem is that parsing HTML with regexen is literally impossible.
@Antal - Good point :) I've changed "parsing" to "stripping" in my answer to make it accurate.
4

BeautifulSoup
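For what it's worth, a minimal sketch of how that might look, assuming the `bs4` package (Beautiful Soup 4) is installed. `strip_tags` is just an illustrative name, not part of the library.

```python
from bs4 import BeautifulSoup


def strip_tags(html):
    """Illustrative helper (hypothetical name): text of an HTML fragment."""
    # "html.parser" selects the parser bundled with Python, so no extra
    # parser dependency is needed beyond bs4 itself
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates every text node in document order
    return soup.get_text()


print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
# This is a very interesting paragraph.
```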

1
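import libxml2

text = "<p><b>This</b> is a very interesting paragraph.</p>"
# parseDoc is an XML parser, so the input must be well-formed markup
root = libxml2.parseDoc(text)
print(root.content)
# This is a very interesting paragraph.
root.freeDoc()  # libxml2 documents have to be freed explicitly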

1

Depending on your needs, you could just use the regular expression /<(.|\n)*?>/ and replace all matches with empty strings. This is fine for one-off, manual cleanup of markup you trust, but if you're building this as an application feature then you'll need a more robust and secure option.
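In Python, that pattern could be applied roughly as below. This is only a sketch: `TAG_RE` and `strip_tags_regex` are names invented for the example, and the second call shows one way the approach can break on legal HTML.

```python
import re

# Python spelling of the /<(.|\n)*?>/ pattern from the answer above
TAG_RE = re.compile(r"<(.|\n)*?>")


def strip_tags_regex(html):
    """Illustrative helper (hypothetical name): delete anything tag-shaped."""
    return TAG_RE.sub("", html)


print(strip_tags_regex("<p><b>This</b> is a very interesting paragraph.</p>"))
# This is a very interesting paragraph.

# A ">" inside an attribute value ends the match early and leaks markup
print(repr(strip_tags_regex('<a title="a > b">link</a>')))
# ' b">link'
```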

1

You can use lxml.
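A minimal sketch of one way to do it, assuming the `lxml` package is installed. `strip_tags` is an illustrative name, and `lxml.html.fromstring` raises an error on empty input, which this sketch does not handle.

```python
import lxml.html


def strip_tags(html):
    """Illustrative helper (hypothetical name): text of an HTML fragment."""
    # fromstring() parses the fragment with lxml's lenient HTML parser
    element = lxml.html.fromstring(html)
    # text_content() returns the element's text including all descendants
    return element.text_content()


print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
# This is a very interesting paragraph.
```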
