The Wayback Machine - https://web.archive.org/web/20211226185622/https://github.com/topics/article-extractor
article-extractor

Here are 33 public repositories matching this topic...

adbar commented Jan 9, 2020

I have mostly tested trafilatura on a set of English, German and French web pages that I came across while browsing or during web crawls. There are certainly more web pages, and cases in other languages, for which the extraction does not work yet.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in [xpaths.py](https://github.com

This program scrapes the content of a web article, given either a single URL or a text file containing a set of URLs. The project uses the newspaper3k and python-docx libraries. The output is a cleanly formatted Word document in '.docx' format containing the contents of the article.

  • Updated Aug 5, 2020
  • Python
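
The workflow described for this repository (URLs in, one Word document out) might look roughly like the sketch below. This is not the repository's actual code: the function names `parse_url_lines` and `save_article_as_docx`, the one-URL-per-line file layout, and the skipping of `#` comment lines are all assumptions made for illustration. Only the `Article` class from newspaper3k and the `Document` class from python-docx are real library calls.

```python
def parse_url_lines(text: str) -> list[str]:
    """Return the URLs found in a text blob, one per line.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def save_article_as_docx(url: str, out_path: str) -> None:
    """Download one article and write its title and body to a .docx file."""
    # Imported lazily so the URL parsing above works without these packages.
    from newspaper import Article  # provided by newspaper3k
    from docx import Document      # provided by python-docx

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title and main text

    doc = Document()
    doc.add_heading(article.title or url, level=1)
    # newspaper3k separates paragraphs with blank lines in article.text.
    for paragraph in article.text.split("\n\n"):
        if paragraph.strip():
            doc.add_paragraph(paragraph.strip())
    doc.save(out_path)
```

A possible way to use it, assuming a hypothetical `urls.txt` in the working directory: read the file, pass its contents to `parse_url_lines`, and call `save_article_as_docx(url, f"article_{i}.docx")` for each result.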
