This code takes a bit of bad HTML, uses the Tidy library to clean it up, and then passes the result to HtmlLib.Reader().

import tidy
options = dict(output_xhtml=1,
               add_xml_decl=1,
               indent=1,
               tidy_mark=0)

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

It seems I'm not passing the right type to fromString. I get this traceback:

Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found

What should I do differently? Thanks!

Which tidy module are you importing? PyPI lists at least two, and I'm not sure whether the one included with the tidy source distribution (for Ubuntu's tidy package) is one of them. Commented Oct 15, 2010 at 9:55

2 Answers

tidy's parseString function returns a _Document instance, which implements __str__ but not the buffer interface. HtmlLib.Reader().fromString therefore cannot create a StringIO object from it.

The fix should be fairly simple. Change:

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
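To see why the str() call matters, here is a minimal sketch that needs neither tidy nor PyXML. It assumes a modern Python and uses io.StringIO as a rough analogue of the Python 2 cStringIO.StringIO from the traceback. FakeDocument is a hypothetical stand-in for tidy's _Document, not the real class: like the real one, it can render itself through __str__ but is not itself a string.

```python
import io


class FakeDocument:
    """Hypothetical stand-in for tidy's _Document (an assumption for illustration)."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        # The only way to get the markup out is to ask for the string form.
        return self._markup


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object itself fails, much like the TypeError in the question.
try:
    io.StringIO(doc)
    accepted_raw_object = True
except TypeError:
    accepted_raw_object = False

# Converting to a string first gives StringIO something it can wrap.
stream = io.StringIO(str(doc))
recovered = stream.read()
```

The same idea applies to the original code: the stream constructor wants string data, so the conversion has to happen before the document reaches fromString.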

I haven't used the Python tidy module and am not sure where to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert your parsed document back into XHTML.

For a different approach, you could consider using lxml.html, which is decent at parsing broken markup and provides you with a great ElementTree API for working with the result. It can also pretty-print *ML, which makes it sort of a superset of tidy, though perhaps not with quite the same ability to navigate incoherent markup.

Also, lxml wraps C libraries (libxml2 and libxslt), much as the Python tidy module(s) wrap a C library, so it's much faster than some of the other Python modules for working with XML.
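As a rough sketch of the lxml.html route, assuming the third-party lxml package is installed: the helper name clean_markup is made up for this example, while lxml.html.fromstring and lxml.html.tostring are part of lxml's public API. The exact markup lxml produces depends on the underlying libxml2 version, so no output is shown.

```python
import lxml.html  # third-party; install with "pip install lxml"


def clean_markup(bad_html):
    """Parse possibly broken HTML and return it pretty-printed as a string.

    clean_markup is a hypothetical helper name used only for this sketch.
    """
    # fromstring() recovers from problems such as unclosed or stray tags.
    root = lxml.html.fromstring(bad_html)
    # encoding="unicode" asks for a str result rather than bytes.
    return lxml.html.tostring(root, pretty_print=True, encoding="unicode")


cleaned = clean_markup("<Html>Bad Html.</b>")
print(cleaned)
```

The parsed root element also supports the ElementTree-style API (for example root.iter() or root.findall()), so the pretty-printing step is optional if you only need to navigate the tree.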
