
If one downloads a webpage with curl or wget, it comes down as HTML.

But if I wish to download it as plain text (i.e. with no HTML markup whatsoever), exactly or almost exactly as it would read in a web browser (with any images/video/audio omitted, of course), what would be a way to do that?

  • Since the question is different, no. It might have data that would have prevented me from asking the current question. Commented Mar 14, 2022 at 12:54
  • A question doesn't have to be identical in every respect to be a duplicate. You're expected to read and understand and extrapolate, then apply what you've learned to your specific situation. The linked question is similar enough to yours to be a dupe. Commented Mar 15, 2022 at 4:44

2 Answers


You can't download that; it doesn't exist on the server. The server sends the HTML, and the browser's job is to display it. Part of that can be showing the text.

In fact, many web pages arrive rather empty and load the relevant content as you read along.

So what you'll need is a working browser that displays the text, and then a way to get that text out of it.

You'd usually do that by remote-controlling a browser from a scripting language: you start the browser in a special "daemon" mode, connect to it, and, through a purpose-built browser control interface (WebDriver), tell it to go to a URL, wait a moment so the browser can render what you'd normally see on screen, and then tell it to save the result as a plain text file.
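
A lighter-weight variant of that idea, skipping the WebDriver scripting, is to let a headless browser render the page (JavaScript included) and dump the resulting DOM, then flatten that to text. A rough sketch, assuming a Chromium-based browser is installed (the binary may be chromium, chromium-browser or google-chrome depending on the system) and using a placeholder URL:

# Render the page, JavaScript and all, and dump the resulting DOM as HTML
chromium --headless --disable-gpu --dump-dom 'https://example.com/something/' > rendered.html

# Flatten the rendered DOM to plain text (any HTML-to-text tool would do here)
lynx -dump -nolist rendered.html > output.txt

Note this does no waiting or scrolling, so content that only loads on interaction will still be missing; for that you'd need the full WebDriver approach described above.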

  • There's always good old lynx. Commented Mar 14, 2022 at 11:28
  • @Martin feel free to try to get the text of any AMP site, or Twitter, or basically any modern website with lynx. Commented Mar 14, 2022 at 11:37
  • @MarcusMüller shouldn't lynx be fine for such an all-textual webpage? mediawiki.org/w/… (see the lynx sketch after this comment thread) Commented Mar 14, 2022 at 11:44
  • @Lahor sure. MediaWiki itself even offers a special format that makes it easier to extract the plain text through their REST API. However, in general, things don't work that smoothly. Content is generally meant to be displayed in a modern browser, and "what you can read in a browser" is not what you requested from a single URL as HTML stripped of all formatting, but the result of the browser running the program that a modern website is. Look at Twitter as an example: there's no single HTML page you get; as you scroll, things get loaded and rendered. Quite often, that rendering is not done in terms of text-carrying HTML elements, but rather custom elements, or even drawn directly onto a canvas of sorts. Commented Mar 14, 2022 at 11:51
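
As the comments note, for a simple, mostly static page like the MediaWiki one above, lynx on its own is enough. A minimal sketch with a placeholder URL:

lynx -dump -nolist 'https://example.com/some-static-page/' > output.txt

-dump prints the page as formatted plain text instead of starting the interactive browser, and -nolist suppresses the numbered link list normally appended at the end.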

Personally, I'd use pandoc for that.

pandoc -t plain 'https://example.com/something/'

To save to a file:

pandoc -t plain 'https://example.com/something/' -o output.txt

Obviously, this is only going to work well for mostly-text websites that don't rely on JavaScript to populate the page.
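
If a page does rely on JavaScript, one hedged workaround (assuming a Chromium-based browser is available, as in the sketch under the other answer) is to let a headless browser render the page first and pipe the resulting HTML into pandoc:

chromium --headless --disable-gpu --dump-dom 'https://example.com/something/' | pandoc -f html -t plain -o output.txt

Here pandoc reads the rendered HTML from standard input, so the result reflects what the browser actually built, minus anything that only loads on scrolling or interaction.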
