Timeline for Python project to scrape webpages and build text datasets for ML purposes
Current License: CC BY-SA 4.0
8 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Oct 31 at 15:22 | comment | added | Ben Voigt | | Further evidence in favor of "cannot be done correctly without using an actual HTML parser" is what the code currently does when there are multiple, non-contiguous script or style tags (see the parser sketch after this table). |
| Oct 30 at 17:38 | history | edited | Booboo | CC BY-SA 4.0 | deleted 2 characters in body |
| Oct 30 at 9:00 | comment | added | Stef | | Although I don't know how much this actually saves: it still cuts the text into n+1 strings, and presumably the last string, which we don't need, will be very long; I don't know whether it is copied or whether Python is smart enough to reuse the same underlying character array as the original text. |
| Oct 29 at 14:57 | history | edited | Booboo | CC BY-SA 4.0 | More efficient splitting of the text. |
| Oct 29 at 14:54 | comment | added | Booboo | | @Stef Good point! Thanks. |
| Oct 29 at 13:08 | comment | added | Stef | | Regarding `test_fingerprint`: `words = text.lower().split()[:n]` can be replaced with `words = text.lower().split(maxsplit=n)[:n]` to avoid splitting the whole text when only the first few words are wanted (see the `maxsplit` sketch after this table). |
| Oct 29 at 11:55 | history | edited | Booboo | CC BY-SA 4.0 | added 112 characters in body |
| Oct 29 at 11:49 | history | answered | Booboo | CC BY-SA 4.0 | |
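
The `maxsplit` suggestion from the Oct 29 comments, as a minimal self-contained demo (the sample `text` and `n` below are invented for illustration; they are not taken from the code under review). It also touches on Stef's follow-up point: in CPython, `str.split` always builds new string objects, so the trailing remainder returned by `split(maxsplit=n)` is a copy of the tail of the text rather than a view into it.

```python
# Demo of the suggested change: split(maxsplit=n) performs at most n splits,
# so only the first n words are materialised individually; everything after
# them stays lumped together in one trailing string, which [:n] then drops.
text = "Lorem ipsum dolor sit amet consectetur adipiscing elit " * 1000
n = 5

words_full = text.lower().split()[:n]            # splits the entire text
words_lazy = text.lower().split(maxsplit=n)[:n]  # stops after the first n words

assert words_full == words_lazy
print(words_lazy)  # ['lorem', 'ipsum', 'dolor', 'sit', 'amet']

# Note: the maxsplit call still returns n + 1 strings; in CPython the last
# element is a new str object (a copy of the remaining text), not a view.
```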
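
To illustrate Ben Voigt's point about multiple, non-contiguous `script`/`style` tags, here is a minimal sketch using the standard-library `html.parser`; it is not the code under review, and the `TextExtractor` class and the sample HTML are invented for the example. A depth counter suppresses text while the parser is inside any `<script>` or `<style>` element, so any number of such tags, wherever they appear, is handled the same way.

```python
# Sketch: strip <script>/<style> content with a real HTML parser (stdlib only).
# Handles any number of non-contiguous script/style tags, unlike string slicing.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping everything inside <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script/style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


sample_html = """
<p>one</p><script>var x = 1;</script>
<p>two</p><style>p { color: red; }</style>
<p>three</p><script>var y = 2;</script>
"""
parser = TextExtractor()
parser.feed(sample_html)
print(parser.text().split())  # ['one', 'two', 'three']
```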