corpus

I have tested trafilatura mostly on English, German and French web pages that I came across while browsing or during web crawls. There are certainly other web pages, and cases in other languages, for which the extraction does not work yet.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in [xpaths.py](https://github.com
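To illustrate what a bug report phrased as XPath expressions could look like, here is a minimal sketch. It does not reproduce the contents of xpaths.py or trafilatura's internals: the expressions in `DISCARD_XPATHS`, the sample markup and the helper `strip_and_extract` are all hypothetical, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for a full HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules, NOT taken from trafilatura's xpaths.py:
# elements that should be discarded before the text is extracted.
DISCARD_XPATHS = [".//nav", ".//div[@class='comments']"]

# Hypothetical, well-formed sample page used only for this sketch.
SAMPLE = """<html><body>
<nav>Home | About</nav>
<div class="article"><p>Main text.</p></div>
<div class="comments"><p>First!</p></div>
</body></html>"""


def strip_and_extract(markup, discard_xpaths):
    """Remove every element matched by the expressions, then return the text."""
    root = ET.fromstring(markup)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for expr in discard_xpaths:
        for node in root.findall(expr):
            parents[node].remove(node)
    # Collapse the remaining whitespace into single spaces.
    return " ".join("".join(root.itertext()).split())


result = strip_and_extract(SAMPLE, DISCARD_XPATHS)
print(result)
```

A report in this spirit would name the page where extraction fails and the expression that selects the unwanted (or missing) element, so that the rule can be checked against the markup directly.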



Here are 703 public repositories matching this topic...

brightmart / nlp_chinese_corpus

dariusk / corpora

Add "feeling cold, warm, cool, hot" in the activities.json file

Add Scents

Plural and invariable nouns?

wainshine / Chinese-Names-Corpus

CLUEbenchmark / CLUE

endymecy / awesome-deeplearning-resources

CLUEbenchmark / CLUEDatasetSearch

jinfagang / weibo_terminater

fendouai / Awesome-Chatbot

candlewill / Dialog_Corpus

gunthercox / chatterbot-corpus

wainshine / Company-Names-Corpus

chatopera / insuranceqa-corpus-zh

tensorlayer / seq2seq-chatbot

quanteda / quanteda

CLUEbenchmark / CLUEPretrainedModels

OYE93 / Chinese-NLP-Corpus

mhbashari / awesome-persian-nlp-ir

soskek / bookcorpus

BLKSerene / Wordless

adbar / trafilatura

List of smaller extraction bugs (text & metadata)

nonamestreet / weixin_public_corpus

crownpku / Small-Chinese-Corpus

ko-nlp / Korpora

CLUEbenchmark / CLUECorpus2020

MozillaSecurity / fuzzdata

CBLUEbenchmark / CBLUE

several27 / FakeNewsCorpus

NiuTrans / Classical-Modern

chatopera / efaqa-corpus-zh

louisowen6 / NLP_bahasa_resources
