HTML code processing

Question

I want to process some HTML code and remove the tags as in the example:

"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph."

I'm using Python; do you know of a library I can use to remove the HTML tags?

Thanks!

Community · Accepted Answer · 2017-05-23 11:56:09Z

5

This question may help you: Strip HTML from strings in Python

No matter what solution you choose, I'd recommend avoiding regular expressions. They can be slow when processing large strings, they might not work due to invalid HTML, and stripping HTML with regex isn't always secure or reliable.
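As one regex-free option, here is a minimal sketch built on the standard library's `html.parser` module. The names `TextExtractor` and `strip_tags` are made up for this illustration and do not come from the linked question. The parser collects only text nodes, so tags are dropped and character references such as `&amp;` are decoded. Note that it keeps the contents of `<script>` and `<style>` elements as text, and it is not an HTML sanitizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes of a document."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(html):
    """Return the text content of an HTML fragment with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


if __name__ == "__main__":
    print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
```

Run against the example from the question, this prints the paragraph text without the `<p>` and `<b>` tags.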

edited May 23, 2017 at 11:56

CommunityBot

answered Oct 22, 2010 at 15:11

Colin O'Dell

2 Comments

Antal Spector-Zabusky Over a year ago

It's not merely the case that parsing HTML with regexen is difficult, slow, or inadvisable. The problem is that parsing HTML with regexen is literally impossible.

Colin O'Dell Over a year ago

@Antal - Good point :) I've changed "parsing" to "stripping" in my question to make it accurate.

kevingessner · Accepted Answer · 2010-10-22 15:11:27Z

4

BeautifulSoup

answered Oct 22, 2010 at 15:11

kevingessner

eumiro · Accepted Answer · 2010-10-22 15:14:02Z

1

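import libxml2

text = "<p><b>This</b> is a very interesting paragraph.</p>"
# parseDoc expects well-formed XML, so this only works for markup that is also valid XML.
root = libxml2.parseDoc(text)
# .content is the concatenated text of all nodes in the document.
print(root.content)

# 'This is a very interesting paragraph.'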

answered Oct 22, 2010 at 15:14

eumiro

Daniel Mendel · Accepted Answer · 2010-10-22 15:16:06Z

1

Depending on your needs, you could just use the regular expression /<(.|\n)*?>/ and replace all matches with empty strings. This works well enough for one-off manual cleanup of markup you trust, but if you're building this as an application feature then you'll need a more robust and secure option.
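In Python, the regular expression above could be applied with `re.sub`, roughly as follows. This is only a sketch; the function name `strip_tags_regex` is invented for the example. The second call shows one of the cases where the pattern goes wrong: a `>` inside a quoted attribute value ends the match early and leaves part of the tag behind. Entities such as `&amp;` are also left undecoded.

```python
import re

# Lazily match "<", then any characters (including newlines), up to the first ">".
TAG_RE = re.compile(r"<(.|\n)*?>")


def strip_tags_regex(html):
    """Hypothetical helper: delete everything that looks like a tag."""
    return TAG_RE.sub("", html)


if __name__ == "__main__":
    # Simple, trusted markup: the tags are removed cleanly.
    print(strip_tags_regex("<p><b>This</b> is a very interesting paragraph.</p>"))

    # A ">" inside an attribute value cuts the match short,
    # so a fragment of the tag survives in the output.
    print(repr(strip_tags_regex('<a title="a > b">link</a>')))
```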

answered Oct 22, 2010 at 15:16

Daniel Mendel

ghostdog74 · Accepted Answer · 2010-10-22 15:26:25Z

1

You can use lxml.

answered Oct 22, 2010 at 15:26

ghostdog74
