
If one downloads a webpage with curl or wget, it comes down as HTML.

But if I wish to download it as plain text (i.e. with no HTML markup whatsoever), exactly or almost exactly as it would read in a web browser (with any images/video/audio omitted, of course), what would be a way to do that?

  • Since the question is different, no. It might have data that would have prevented me from asking the current question. Commented Mar 14, 2022 at 12:54
  • A question doesn't have to be identical in every respect to be a duplicate. You're expected to read and understand and extrapolate, then apply what you've learned to your specific situation. The linked question is similar enough to yours to be a dupe. Commented Mar 15, 2022 at 4:44

2 Answers


You can't download that; it doesn't exist on the server. The server sends the HTML, and the browser's job is to display it. Part of that can be showing the text.

In fact, many web pages arrive rather empty and load the relevant content as you read along.

So what you'll need is a working browser that displays the text, and then a way to get that text out of it.

You'd usually do that by remote-controlling a browser from a scripting language: you start the browser in a special "daemon" mode, connect to it, and, through a purpose-built browser control interface (WebDriver), tell it to go to a URL, wait a moment so the browser can render what you'd normally see on screen, and then tell it to save the result as a plain text file.
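
A lighter-weight variant of that idea, skipping the WebDriver scripting, is to let a headless browser render the page (JavaScript included) and dump the resulting DOM, then flatten that to text. A rough sketch, assuming a Chromium-based browser is installed (the binary may be chromium, chromium-browser or google-chrome depending on the system) and using a placeholder URL:

# Render the page, JavaScript and all, and dump the resulting DOM as HTML
chromium --headless --disable-gpu --dump-dom 'https://example.com/something/' > rendered.html

# Flatten the rendered DOM to plain text (any HTML-to-text tool would do here)
lynx -dump -nolist rendered.html > output.txt

Note this does no waiting or scrolling, so content that only loads on interaction will still be missing; for that you'd need the full WebDriver approach described above.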

  • There's always good old lynx. Commented Mar 14, 2022 at 11:28
  • @Martin feel free to try to get the text of any AMP site, or Twitter, or basically any modern website with lynx. Commented Mar 14, 2022 at 11:37
  • @MarcusMüller shouldn't lynx be fine for such an all-textual webpage? mediawiki.org/w/… (see the lynx sketch after this comment thread) Commented Mar 14, 2022 at 11:44
  • @Lahor sure. MediaWiki itself even offers a special format that makes it easier to extract the plain text through their REST API. However, in general, things don't work that smoothly. Content is generally meant to be displayed in a modern browser, and "what you can read in a browser" is not what you requested from a single URL as HTML stripped of all formatting, but the result of the browser running the program that a modern website is. Look at Twitter as an example: there's no single HTML page you get; as you scroll, things get loaded and rendered. Quite often, that rendering is not done in terms of text-carrying HTML elements, but rather custom elements, or even drawn directly onto a canvas of sorts. Commented Mar 14, 2022 at 11:51
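
As the comments note, for a simple, mostly static page like the MediaWiki one above, lynx on its own is enough. A minimal sketch with a placeholder URL:

lynx -dump -nolist 'https://example.com/some-static-page/' > output.txt

-dump prints the page as formatted plain text instead of starting the interactive browser, and -nolist suppresses the numbered link list normally appended at the end.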

Personally, I'd use pandoc for that.

pandoc -t plain 'https://example.com/something/'

To save to a file:

pandoc -t plain 'https://example.com/something/' -o output.txt

Obviously, this is only going to work well for mostly-text websites that don't rely on JavaScript to populate the page.
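
If a page does rely on JavaScript, one hedged workaround (assuming a Chromium-based browser is available, as in the sketch under the other answer) is to let a headless browser render the page first and pipe the resulting HTML into pandoc:

chromium --headless --disable-gpu --dump-dom 'https://example.com/something/' | pandoc -f html -t plain -o output.txt

Here pandoc reads the rendered HTML from standard input, so the result reflects what the browser actually built, minus anything that only loads on scrolling or interaction.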
