I'm using the RDF::RDFa::Parser module in Perl to parse RDF data out of a website. On a website with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> it works, but on sites using XHTML, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, there is no output.

test website -> http://www.filmstarts.de/kritiken/186918.html

use RDF::RDFa::Parser;

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new_from_url($url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

1 Answer

(I'm the author of RDF::RDFa::Parser.)

It looks like the HTML parser used by the RDFa parser is failing on that page. (I'm also the maintainer of the HTML parser in question, so I can't shift the blame onto anyone else!) Thus, by the time the RDFa parsing starts, all it sees is an empty DOM tree.

The page is quite hideously invalid XHTML, yet I would still have expected the HTML parser to do a reasonable job. I've filed a bug report for you.

In the meantime, a workaround might be to build the XML::LibXML DOM tree outside of RDF::RDFa::Parser (perhaps using libxml's built-in HTML parser?). You could then pass that tree directly to the RDFa parser:

use RDF::RDFa::Parser;
use LWP::Simple qw(get);

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $xhtml   = get($url);
my $dom     = somehow_build_a_dom_tree($xhtml);  # hand-waving!!
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new($dom, $url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

I hope that helps!

Update: here's a possible implementation of somehow_build_a_dom_tree...

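use XML::LibXML;

sub somehow_build_a_dom_tree {
    my ($html) = @_;
    my $p = XML::LibXML->new;
    $p->recover_silently(1);   # recover from invalid markup without printing warnings
    return $p->load_html( string => $html );
}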
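To check offline that libxml's recovering HTML parser copes with broken markup before wiring it into the RDFa parser, something like the following sketch may help. The inline markup and the example.com URL are made up for illustration and merely stand in for the real page; only XML::LibXML is used here, and the XPath lookup is just a sanity check, not a substitute for RDFa parsing.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, deliberately sloppy markup (unclosed <p>, mismatched tags,
# missing </html>) standing in for the real page.
my $html = <<'HTML';
<html><head>
<meta property="og:image" content="http://example.com/poster.jpg">
<title>Test</title>
</head>
<body><p>unclosed paragraph<div>stray</p></body>
HTML

my $parser = XML::LibXML->new;
$parser->recover_silently(1);                      # keep going on errors, quietly
my $dom = $parser->load_html( string => $html );   # XML::LibXML::Document

# If the tree were empty, this list would be empty too.
my @meta = $dom->findnodes('//meta[@property]');
printf "%s = %s\n", $_->getAttribute('property'), $_->getAttribute('content')
    for @meta;
```

If the list of meta elements comes back non-empty, the DOM was built and can be handed to RDF::RDFa::Parser->new as in the workaround above.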
1 Comment

Thanks, but how should I build a DOM tree? Here is a working link with HTML 4.01 -> videoworld.de/DVD~179029~vw~30615/DVD-Verleih-Elysium.html; that works perfectly...
