comments
Lines of code: ~200
I'm sure that was true when you wrote it. Looks like you've added a hundred or so lines since then. Comments tend to bit rot as the codebase evolves, which is why we might be reluctant to put too many specifics into them.
deps
Dependencies: ZERO
This is certainly true.
I get it that we had a design goal.
I'm not sure I agree with the goal.
It's the rare project that doesn't need its own .venv/
Someone who uses this will likely want to pull in
the good old familiar requests package before long.
And I can't imagine why you wouldn't want some help
with throttling request rate or obeying /robots.txt,
since those are essential and yet not this project's core concerns.
It's just table stakes -- a big crawler has to be a good netizen.
I get the sense that you wished to avoid depending upon
lazynlp.
But I wish that you had.
More on that below.
Then we would have an up-to-date TLDextract dep, plus justext,
which I think is the big thing you wanted to evict from the deps
and which seems a nice enough library to me.
BTW, though it's not published on pypi,
you can still use a GitHub repo URL to depend on lazynlp.
You can even bake in a particular immutable commit hash.
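For example, a PEP 508 direct reference in pyproject.toml could look like this
(assuming the upstream repo is chiphuyen/lazynlp; substitute whichever commit
you vet for the placeholder hash):

[project]
dependencies = [
    "lazynlp @ git+https://github.com/chiphuyen/lazynlp@<commit-hash>",
]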
pypi
Recommend you publish version 0.1.0 sooner than later, to reserve the project name.
Recommend that you git rm setup.py.
In the modern packaging ecosystem
we try to rely solely on the pyproject.toml config,
which looks good to me.
Keeping redundant boilerplate in two files instead of one
seems undesirable.
ASCII art
Recommend you avoid long decorative ========== lines.
Plus, most of those comments are redundant with
the (well chosen) function names.
The only place where they're helpfully organizing things is for dedup.
Recommend that you take advantage of the language's ability
to organize code concepts,
by evicting text_fingerprint() and is_duplicate()
to a new dedup.py module.
BTW thank you, the utils.py module looks well organized.
type annotation
def download_text(url, timeout=30):
    """Download and extract text from a URL.

    Returns:
        str or None: Cleaned text content, or None if failed
    """
Rather than offer some "str or None" narrative text
for humans to read, prefer to annotate the return type as -> str | None
right in the signature.
Then everyone can read it, including mypy and pyrefly.
Down in clean_html() it's enough to say we accept and return str values.
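A sketch of what I mean (parameter types are assumed from the current defaults;
str | None needs Python 3.10+, otherwise use Optional[str]):

def download_text(url: str, timeout: float = 30) -> str | None:
    """Download and extract text from a URL."""
    ...

def clean_html(html: str) -> str:
    ...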
design of Public API
It's unclear why return None is an advantage to the caller.
Sounds like it's just one more thing to check, one more possible pitfall.
The single call site already does essentially:
if text is None or len(text)==0: # report download failure
But of course a URL could plausibly return an empty 200 OK document.
I'm just not seeing how distinguishing None from "" helps us.
request pool
We don't appear to be holding open a port 443 connection
when e.g. we ask Gutenberg for Tom Sawyer and then for Huck Finn.
Given that this project is fundamentally about interacting
with web servers, I have to disagree with your decision to
jettison the familiar and helpful requests package.
The only reason it's not in the Batteries Included standard library
is that it needs to release more often than the interpreter's annual cadence,
given how quickly Internet protocols and conditions keep changing.
Creating lots of SSL contexts seems needlessly painful
and nitty gritty. Just pass in a verify=False parameter
and let the network library sweat those details.
Consider using import protego
for help with robots.txt and request rate.
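A rough sketch of both ideas together (the user-agent string and the url list
are placeholders, and error handling is omitted):

import time
import requests
from protego import Protego

USER_AGENT = "corpus-builder/0.1"   # placeholder; use your real project name
urls: list[str] = []                # the Gutenberg (or other) URLs you plan to fetch

session = requests.Session()        # keeps the HTTPS connection alive across requests

robots_txt = session.get("https://www.gutenberg.org/robots.txt", timeout=30).text
robots = Protego.parse(robots_txt)
delay = robots.crawl_delay(USER_AGENT) or 1.0

for url in urls:
    if not robots.can_fetch(url, USER_AGENT):
        continue                    # good netizen: honor robots.txt
    response = session.get(url, timeout=30)
    time.sleep(delay)               # crude per-request throttle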
async
Ship version 0.1.0 to pypi using the current design.
But consider relying on import httpx in a subsequent iteration,
so you can have e.g. a connection to Reddit and a connection to Gutenberg
getting useful download work done concurrently.
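Something like this sketch, where the two sites download concurrently
(URLs are placeholders; feed the results into your existing clean_html step):

import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    response = await client.get(url, timeout=30)
    response.raise_for_status()
    return response.text

async def fetch_all(urls: list[str]) -> list[str]:
    async with httpx.AsyncClient(follow_redirects=True) as client:
        # e.g. the Reddit and Gutenberg downloads proceed concurrently
        return await asyncio.gather(*(fetch(client, url) for url in urls))

texts = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))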
error handler
except Exception:
    return None
Consider logging elapsed time, url, and the numeric 400- or 500-level status that came back. For example, we may want to know that the site throttled us with a 429 Too Many Requests.
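A sketch of the kind of handler I mean, assuming the zero-dependency code is
fetching via urllib (adjust the exception types to whatever you actually raise):

import logging
import time
import urllib.error

logger = logging.getLogger(__name__)

def download_text(url, timeout=30):
    start = time.monotonic()
    try:
        ...  # existing fetch + clean_html logic
    except urllib.error.HTTPError as exc:
        # e.g. exc.code == 429 means the site is throttling us
        logger.warning("HTTP %d from %s after %.1fs", exc.code, url, time.monotonic() - start)
        return None
    except Exception as exc:
        logger.warning("download failed for %s after %.1fs: %s", url, time.monotonic() - start, exc)
        return None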
slop comments
# Unescape HTML entities
...
# Remove script and style tags
...
# Remove HTML tags
...
# Normalize whitespace
...
# Remove leading/trailing whitespace
Yeah, yeah, we were vibing with an LLM, I get it.
But those remarks are vacuous and are redundant with
what the source code assignments eloquently state.
Delete such remarks prior to a git commit.
unit tests
I didn't see any. Automated tests should be exercising those regexes, to verify they behave as you think they behave.
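Even a couple of pytest cases would help; a sketch, assuming clean_html() strips
tags and collapses whitespace the way the comments claim (the import path is
hypothetical, and the exact expected strings depend on your normalization rules):

from crawler.utils import clean_html  # adjust to the real package name

def test_strips_script_and_style():
    html = "<script>alert(1)</script><p>keep me</p><style>p {}</style>"
    assert clean_html(html) == "keep me"

def test_normalizes_entities_and_whitespace():
    assert clean_html("<p>a&amp;b   \n  c</p>") == "a&b c"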
fingerprints
Summarizing a blog post or a novel by just its eight initial words is, ummmm, surprising. I'm especially worried that boilerplate website navbar text, or a repeated license/copyright notice, will make many documents on a given site appear "identical".
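To make the worry concrete, a sketch assuming text_fingerprint() keys on the
first eight words:

page_a = "Home About Contact Privacy Terms Blog Search Login ... Tom Sawyer chapter text ..."
page_b = "Home About Contact Privacy Terms Blog Search Login ... Huck Finn chapter text ..."
# Both reduce to the fingerprint "Home About Contact Privacy Terms Blog Search Login",
# so the second page would be discarded as a "duplicate" even though its body differs.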
unused param
def is_duplicate(text, seen_fingerprints, threshold=0.8):
Remove the unused threshold, please.
No need to have an IDE auto-complete that for some hapless app author.
one-liner
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
This is not at all easy to read. Prefer to let black -S *.py worry about laying out code within some reasonable line width:
exclude_extensions = (
    set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
) | set(DEFAULT_EXCLUDE_EXTENSIONS)
design of Public API
Please elide the use_default_excludes parameter.
Better to let caller pass in DEFAULT_EXCLUDE_DOMAINS
and/or DEFAULT_EXCLUDE_EXTENSIONS.
Also, I wish we were pulling one or both of those in from lazynlp, rather than copy-n-pasting them into this project.
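On the parameter question, I mean a signature more like this sketch
(crawl_urls is a stand-in for whatever the real entry point is called):

def crawl_urls(urls,
               exclude_domains=DEFAULT_EXCLUDE_DOMAINS,
               exclude_extensions=DEFAULT_EXCLUDE_EXTENSIONS):
    ...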
pathlib
... = open(os.path.join(output_dir, 'success.txt'), 'w')
... = open(os.path.join(output_dir, 'failed.txt'), 'w')
The ancient os.path module works well enough.
But when authoring new code, prefer from pathlib import Path.
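e.g. (same behavior, just Path-based):

from pathlib import Path

out = Path(output_dir)
success_log = (out / 'success.txt').open('w')
failed_log = (out / 'failed.txt').open('w')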
imports at top
# Check exclusion filters
from urllib.parse import urlparse
parsed = urlparse(url)
No.
Put the import where it belongs.
Also, the comment suggests that we should extract a helper function,
so that the body of the for loop fits in a single screenful
without vertical scrolling.
def main():
    """Command-line interface."""
    import sys
    import argparse
Again, no.
Those two belong at the top of the module.
import logging
failed_log.write(f"{url}\texcluded_domain\n")
Consider doing that with a logger,
so you consistently get automatic timestamped log entries.
That will help you understand why a given crawl was fast or slow.
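A sketch of the setup and the call site (the format string is just a suggestion):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

...
logger.info("excluded_domain url=%s", url)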
cracking argv
if len(sys.argv) >= 2 and sys.argv[1] == 'stats':
    if len(sys.argv) < 3:
        ...
if len(sys.argv) >= 2 and sys.argv[1] == 'merge':
    if len(sys.argv) < 4:
I imagine those work properly?
But it seems like you're working too hard.
Why not let argparse handle the subcommands and optional arguments for you?
When I import typer I always get appropriate CLI diagnostics
displayed automatically, without jumping through such hoops.
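With argparse subparsers the same dispatch might look like this sketch
(the prog name and the per-command arguments are guesses based on the
sys.argv arithmetic above):

import argparse

parser = argparse.ArgumentParser(prog="crawler")
subcommands = parser.add_subparsers(dest="command", required=True)

stats = subcommands.add_parser("stats")
stats.add_argument("corpus_dir")        # the one positional arg the < 3 check implies

merge = subcommands.add_parser("merge")
merge.add_argument("input_dir")         # the two positional args the < 4 check implies
merge.add_argument("output_file")

args = parser.parse_args()              # argparse prints usage and errors for you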
be lazy
After you (quickly) ship version 0.1.0,
I urge you to consider using uv to manage dependencies
listed in pyproject.toml, such as httpx.
Consider adding a Makefile with a make install target, or a shell script,
that shows how to pull in deps and assemble a small text corpus.
This project should focus on its core value-add, which is managing large text datasets. To the extent that you can outsource any of the network minutiae to some well tested library that has already worked out the details, I encourage you to do so.