Processing HTML files Python

Question

I don't know much about HTML. How do you extract just the text from a page? For example, if the HTML page reads:

<meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers">
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>

I just want to extract this:

How can I make money at home online? No gimmicks please? - Yahoo! Answers

I am using the re module:

import re

def striphtml(data):
    # Replace every tag (anything between < and >) with a space
    p = re.compile(r'<.*?>')
    return p.sub(' ', data)

but it's still not doing what I intend.

The above function is called as:

for lines in filehandle.readlines():

        #k = str(section[6].strip())
        myFile.write(lines)

        lines = striphtml(lines)
        content.append(lines)

Possible duplicate of Parsing HTML in Python, Processing a HTML file using Python
– Sathyajith Bhat, Commented Jan 9, 2012 at 2:45

Fabián Heredia Montiel · Accepted Answer · 2012-01-09 02:54:25Z

2

Don't use regular expressions for HTML/XML parsing. Try BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) instead.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('Your resource<title>hi</title>')
soup.title.string # Your title string.
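
If installing a third-party package is not an option, the standard library's html.parser module can pull out the title as well. The following is only a minimal sketch: the names TitleParser and extract_title are made up for illustration and are not part of BeautifulSoup or of the original answer.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while we are between <title> and </title>
        self._done = False       # True once the first title has been closed
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    """Hypothetical helper: return the page title, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


sample = (
    '<meta name="title" content="ignored">\n'
    "<title>How can I make money at home online? "
    "No gimmicks please? - Yahoo! Answers</title>"
)
print(extract_title(sample))
```

Unlike the regex in the question, this ignores the meta tag entirely and returns only what sits between the title tags.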

edited Jan 9, 2012 at 2:54

answered Jan 9, 2012 at 2:47


1 Comment

volvox Over a year ago

Update: try from bs4 import BeautifulSoup

soulcheck · Accepted Answer · 2012-01-09 02:58:21Z

2

Use an HTML parser for that. One option is BeautifulSoup.

To get text content of the page:

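from BeautifulSoup import BeautifulSoup  # with BeautifulSoup 4: from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html)
# text=True matches every text node in the document
text_nodes = soup.findAll(text=True)
result = ' '.join(text_nodes)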
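
For comparison, roughly the same idea can be sketched with only the standard library. This is an assumption-laden illustration, not the answer's method: TextCollector and visible_text are hypothetical names, and, unlike the BeautifulSoup snippet, this version deliberately skips text inside script and style elements and drops whitespace-only nodes.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gathers the non-empty text nodes of a document."""

    # Elements whose contents are code, not readable text
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text outside skipped elements, ignoring pure whitespace
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html_text):
    """Hypothetical helper: join all collected text nodes with single spaces."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()
    return " ".join(collector.chunks)


page = (
    "<html><head><title>hi</title><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(visible_text(page))
```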

answered Jan 9, 2012 at 2:58


Arthur Neves · Accepted Answer · 2012-01-09 02:56:31Z

1

I usually use lxml (http://lxml.de/) for HTML parsing. It is really easy to use, and you can select tags with XPath, which makes things both easy and fast.

I have a usage example in a script I wrote to read an XML feed and count the words:

https://gist.github.com/1425228

Also you can find more examples in the documentation: http://lxml.de/lxmlhtml.html

answered Jan 9, 2012 at 2:56

