Python - HTML Parsing with Tidy

Question

This code takes a bit of bad html, uses the Tidy library to clean it up and then passes it to an HtmlLib.Reader().

import tidy
options = dict(output_xhtml=1, 
                add_xml_decl=1, 
                indent=1, 
                tidy_mark=0)

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

It seems I'm not passing the right type to fromString, because I get this traceback:

Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found

What should I do differently? Thanks!

Which tidy module are you importing? PyPI shows at least two, and I'm not sure whether the one included with the tidy source distribution (for Ubuntu's tidy package) is one of those.
– intuited, Commented Oct 15, 2010 at 9:55

AndiDog · Accepted Answer · 2010-10-15 09:55:22Z

tidy's parseString function returns a _Document instance which implements __str__ but not a buffer interface. Therefore HtmlLib.Reader().fromString cannot create a StringIO object out of it.

This should be fairly simple. Change:

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
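To illustrate why the str() call matters, here is a minimal sketch that does not need tidy installed. FakeDocument is a hypothetical stand-in for tidy's _Document (it only mimics "has __str__, is not a string"), and Python 3's io.StringIO is used in place of the cStringIO from the traceback, so the exact error message will differ from the one in the question.

```python
import io


class FakeDocument:
    """Hypothetical stand-in for tidy's _Document: defines __str__ but is not a str."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


parsed = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object directly fails, because StringIO wants a real string.
try:
    io.StringIO(parsed)
    rejected = False
except TypeError:
    rejected = True

# Converting with str() first gives StringIO something it can wrap.
stream = io.StringIO(str(parsed))
text = stream.read()
```

The same reasoning applies to reader.fromString: it builds a string stream internally, so it needs the serialized markup rather than the document object.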

intuited · Accepted Answer · 2011-09-18 13:24:23Z

I haven't used the Python tidy module and am not sure where to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert your parsed document back into XHTML.

For a different approach, you could consider using lxml.html, which is decent at parsing broken markup and provides you with a great ElementTree API for working with the result. It can also pretty-print *ML, which makes it sort of a superset of tidy, though perhaps not with quite the same ability to navigate incoherent markup.

Also, lxml, like the Python tidy module(s), wraps a C library (libxml2), so it's much faster than some of the other Python modules for working with XML.
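A minimal sketch of the lxml.html approach, assuming lxml is installed (for example via pip install lxml). The input string is the one from the question; the variable names are only illustrative.

```python
import lxml.html

# Parse the broken markup; lxml fills in the missing structure.
doc = lxml.html.document_fromstring("<Html>Bad Html.")

# Serialize the repaired tree back to a string, pretty-printed.
cleaned = lxml.html.tostring(doc, pretty_print=True, encoding="unicode")

# The result is an ElementTree-style element, so the usual API is available.
root_tag = doc.tag
body_text = doc.text_content()
```

With this route there is no separate tidy step: the parser repairs the markup and hands back a tree you can query directly.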
