python html parsing

Question

I need to do some HTML parsing with Python. Suppose I have an HTML file like the one below:

<body>
   <div class="mydiv">
      <p>i want got it</p>
      <div>
           <p> good </p>
           <a> boy  </a>
      </div>
   </div>
</body>

How can I get the content of <div class="mydiv">? That is, I want to get:

      <p>i want got it</p>
      <div>
           <p> good </p>
           <a> boy </a>
      </div>

I have tried HTMLParser, but I found that it can't do this. Is there any other way? Thanks!

I'm looking at the Related section on the right, and...

– Ignacio Vazquez-Abrams, Commented Jun 1, 2011 at 8:10

gennad · Accepted Answer · 2011-06-01 08:41:11Z

5

With BeautifulSoup it is as simple as:

    from BeautifulSoup import BeautifulSoup

    html = """
      <body>
        <div class="mydiv">
          <p>i want got it</p>
          <div>
            <p> good </p>
            <a> boy  </a>
          </div>
        </div>
      </body>
    """

    soup = BeautifulSoup(html)
    # findAll returns a list of every <div> whose class is "mydiv"
    result = soup.findAll('div', {'class': 'mydiv'})
    tag = result[0]
    # tag.contents is a list of the tag's direct children
    print tag.contents

Output:

    [u'\n', <p>i want got it</p>, u'\n', <div>
    <p> good </p>
    <a> boy  </a>
    </div>, u'\n']
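
The snippet above is Python 2 with the old `BeautifulSoup` module. For Python 3, a minimal sketch of the same lookup might look like the following. It assumes the `beautifulsoup4` package (imported as `bs4`) is installed and uses the standard library's `html.parser` backend; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

html = """
<body>
  <div class="mydiv">
    <p>i want got it</p>
    <div>
      <p> good </p>
      <a> boy  </a>
    </div>
  </div>
</body>
"""

# "html.parser" is the parser bundled with the standard library
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches
tag = soup.find("div", class_="mydiv")

# All direct children, including the whitespace text nodes between tags
print(tag.contents)

# Direct child elements only, skipping the text nodes
children = tag.find_all(recursive=False)
print([child.name for child in children])
```

Whitespace handling differs between parsers, so the text nodes in `tag.contents` may not match the Python 2 output character for character.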


2 Comments

mike Over a year ago

But what I get is a list. How can I convert this list to an HTML text file?

gennad Over a year ago

Something like this: from BeautifulSoup import Tag; st = ''.join([str(t) for t in tag if type(t) == Tag]). Then write it to a file: with open('somename.html', 'w') as f: f.write(st).
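
A rough Python 3 sketch of the list-to-file step, assuming the `beautifulsoup4` package is installed. Instead of joining the children by hand, it relies on `decode_contents()`, which renders everything inside the tag as one string. The output path is hypothetical (the file name from the comment, placed in the temp directory).

```python
import os
import tempfile

from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

html = '<div class="mydiv"><p>i want got it</p><div><p> good </p><a> boy </a></div></div>'
tag = BeautifulSoup(html, "html.parser").find("div", class_="mydiv")

# decode_contents() renders everything inside the tag as one string,
# leaving out the enclosing <div class="mydiv"> itself
inner_html = tag.decode_contents()

# Hypothetical output location: the file name from the comment, in the temp directory
out_path = os.path.join(tempfile.gettempdir(), "somename.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(inner_html)
```

Unlike the `type(t) == Tag` filter, this keeps any text that sits directly inside the div as well as the child tags.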

Fred Foo · Accepted Answer · 2011-06-01 08:12:25Z

4

Use lxml. Or BeautifulSoup.


taijirobot2 · Accepted Answer · 2011-06-01 09:29:42Z

1

I would prefer lxml.html.

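import lxml.html as H

# html is the markup string from the question
doc  = H.fromstring(html)
# xpath() returns a list of matching elements; node[0] is the first match
node = doc.xpath("//div[@class='mydiv']")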
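
To go from the matched element to the markup inside it, one possible sketch is shown below. It assumes the third-party `lxml` package is installed; `inner_html` and `target` are illustrative names, not part of the lxml API.

```python
import lxml.html  # assumption: the third-party lxml package is installed

html = """
<body>
  <div class="mydiv">
    <p>i want got it</p>
    <div>
      <p> good </p>
      <a> boy  </a>
    </div>
  </div>
</body>
"""

doc = lxml.html.fromstring(html)

# xpath() always returns a list, even for a single match
nodes = doc.xpath("//div[@class='mydiv']")
target = nodes[0]

# Inner markup: the element's leading text plus each serialized child.
# tostring() also includes the text that follows each child (its "tail") by default.
inner_html = (target.text or "") + "".join(
    lxml.html.tostring(child, encoding="unicode") for child in target
)
print(inner_html)
```

Serializing `target` itself with `tostring()` would include the outer `<div class="mydiv">` tag, which is why the children are joined separately here.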
