Timeline for Python project to scrape webpages and build text datasets for ML purposes
Current License: CC BY-SA 4.0
8 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Oct 31 at 15:22 | comment | added | Ben Voigt | | Further evidence in favor of "cannot be done correctly without using an actual HTML parser" is what the code currently does when there are multiple, non-contiguous script or style tags (see the parser sketch after this table). |
| Oct 30 at 17:38 | history | edited | Booboo | CC BY-SA 4.0 | deleted 2 characters in body |
| Oct 30 at 9:00 | comment | added | Stef | | Although I don't know how much this actually saves: it still cuts the text into n+1 strings, and presumably the last string, which we don't need, will be very long; I don't know whether it is copied or whether Python is smart enough to reuse the same underlying character array as the original text. |
| Oct 29 at 14:57 | history | edited | Booboo | CC BY-SA 4.0 | More efficient splitting of the text. |
| Oct 29 at 14:54 | comment | added | Booboo | | @Stef Good point! Thanks. |
| Oct 29 at 13:08 | comment | added | Stef | | Regarding `test_fingerprint`: `words = text.lower().split()[:n]` can be replaced with `words = text.lower().split(maxsplit=n)[:n]` to avoid splitting the whole text when only the first few words are wanted (see the `maxsplit` sketch after this table). |
| Oct 29 at 11:55 | history | edited | Booboo | CC BY-SA 4.0 | added 112 characters in body |
| Oct 29 at 11:49 | history | answered | Booboo | CC BY-SA 4.0 | |
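
The `maxsplit` suggestion from the Oct 29 comments, as a minimal self-contained demo (the sample `text` and `n` below are invented for illustration; they are not taken from the code under review). It also touches on Stef's follow-up point: in CPython, `str.split` always builds new string objects, so the trailing remainder returned by `split(maxsplit=n)` is a copy of the tail of the text rather than a view into it.

```python
# Demo of the suggested change: split(maxsplit=n) performs at most n splits,
# so only the first n words are materialised individually; everything after
# them stays lumped together in one trailing string, which [:n] then drops.
text = "Lorem ipsum dolor sit amet consectetur adipiscing elit " * 1000
n = 5

words_full = text.lower().split()[:n]            # splits the entire text
words_lazy = text.lower().split(maxsplit=n)[:n]  # stops after the first n words

assert words_full == words_lazy
print(words_lazy)  # ['lorem', 'ipsum', 'dolor', 'sit', 'amet']

# Note: the maxsplit call still returns n + 1 strings; in CPython the last
# element is a new str object (a copy of the remaining text), not a view.
```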
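
To illustrate Ben Voigt's point about multiple, non-contiguous `script`/`style` tags, here is a minimal sketch using the standard-library `html.parser`; it is not the code under review, and the `TextExtractor` class and the sample HTML are invented for the example. A depth counter suppresses text while the parser is inside any `<script>` or `<style>` element, so any number of such tags, wherever they appear, is handled the same way.

```python
# Sketch: strip <script>/<style> content with a real HTML parser (stdlib only).
# Handles any number of non-contiguous script/style tags, unlike string slicing.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping everything inside <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script/style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


sample_html = """
<p>one</p><script>var x = 1;</script>
<p>two</p><style>p { color: red; }</style>
<p>three</p><script>var y = 2;</script>
"""
parser = TextExtractor()
parser.feed(sample_html)
print(parser.text().split())  # ['one', 'two', 'three']
```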