comments

Lines of code: ~200

I'm sure that was true when you wrote it. Looks like you've added a hundred or so lines since then. Comments tend to bit rot as the codebase evolves, which is why we might be reluctant to put too many specifics into them.

deps

Dependencies: ZERO

This is certainly true. I get it that we had a design goal. I'm not sure I agree with the goal. It's the rare project that doesn't need its own .venv/

Someone who uses this will likely want to pull in the good old familiar requests package before long. And I can't imagine why you wouldn't want some help with throttling request rate or obeying /robots.txt, since they're essential and they are not this project's core concerns. It's just table stakes -- a big crawler has to be a good netizen.

I get the sense that you wished to avoid depending upon lazynlp. But I wish that you had. More on that below. That would give us an up-to-date tldextract dep, plus justext, which I think is the big thing you wanted to evict from the deps and which seems a nice enough library to me. BTW, though it's not published on pypi, you can still depend on lazynlp via a GitHub repo URL. You can even pin a particular immutable commit hash.
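
In pyproject.toml that dependency is a one-liner; the owner and hash below are placeholders, so substitute the real repo URL and whichever commit you tested against:

    [project]
    dependencies = [
        # no pypi release exists, so pin one immutable commit from GitHub
        "lazynlp @ git+https://github.com/<owner>/lazynlp@<commit-sha>",
    ]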

pypi

Recommend you publish version 0.1.0 sooner rather than later, to reserve the project name.

Recommend that you git rm setup.py. In the modern packaging ecosystem the convention is to rely solely on the pyproject.toml config, which looks good to me. Keeping redundant boilerplate in two files instead of one seems undesirable.

ASCII art

Recommend you avoid long decorative ========== lines. Plus, most of those comments are redundant with the (well chosen) function names.

The only place where they're helpfully organizing things is for dedup. Recommend that you take advantage of the language's ability to organize code concepts, by evicting text_fingerprint() and is_duplicate() to a new dedup.py module.

BTW thank you, the utils.py module looks well organized.

type annotation

def download_text(url, timeout=30):
    """Download and extract text from a URL.

    Returns:
        str or None: Cleaned text content, or None if failed
    """

Rather than offer some "str or None" narrative text for humans to read, prefer to put the types in the signature itself: url: str, plus a -> str | None return annotation. Then everyone can read them, including mypy and pyrefly.

Down in clean_html() it's enough to say we accept and return str values.
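
A sketch of both signatures; the parameter types are inferred from the defaults and the docstring, so double-check them:

    def download_text(url: str, timeout: int = 30) -> str | None:
        """Download and extract text from a URL."""
        ...

    def clean_html(html: str) -> str:
        ...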

design of Public API

It's unclear why return None is an advantage to the caller. Sounds like it's just one more thing to check, one more possible pitfall.

The single call site already does essentially:

    if text is None or len(text) == 0:  # report download failure

But of course a URL could plausibly give back an empty 200 Success document. I'm just not seeing how distinguishing None from "" helps us.

request pool

We don't appear to be holding open a port 443 connection when e.g. we ask Gutenberg for Tom Sawyer and then for Huck Finn. Given that this project is fundamentally about interacting with web servers, I have to disagree with your decision to jettison the familiar and helpful requests package. The only reason it's not in Batteries Included is that it needs to release more often than the interpreter's annual cadence, given how quickly Internet protocols and conditions keep changing.

Creating lots of SSL contexts seems needlessly painful and nitty gritty. Just pass in a verify=False parameter and let the network library sweat those details.

Consider using import protego for help with robots.txt and request rate.
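
Roughly, and only as a sketch: this assumes requests and protego become deps, the host is just an example, and the user-agent string is made up.

    import requests
    from protego import Protego

    USER_AGENT = 'my-crawler/0.1'               # placeholder identity
    session = requests.Session()                # reuses the TLS connection per host

    robots = session.get('https://www.gutenberg.org/robots.txt', timeout=30)
    rules = Protego.parse(robots.text)
    delay = rules.crawl_delay(USER_AGENT) or 1.0   # polite pause between requests

    url = 'https://www.gutenberg.org/some/page'    # placeholder
    if rules.can_fetch(url, USER_AGENT):
        response = session.get(url, timeout=30)
        response.raise_for_status()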

async

Ship version 0.1.0 to pypi using the current design. But consider relying on import httpx in a subsequent iteration, so you can have e.g. a connection to Reddit and a connection to Gutenberg getting useful download work done concurrently.
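
A rough sketch of that, with placeholder URLs:

    import asyncio
    import httpx

    async def fetch_all(urls):
        # one client, many concurrent GETs sharing its connection pool
        async with httpx.AsyncClient(timeout=30) as client:
            responses = await asyncio.gather(*(client.get(url) for url in urls))
        return [response.text for response in responses]

    texts = asyncio.run(fetch_all([
        'https://www.gutenberg.org/some/page',   # placeholders
        'https://www.reddit.com/some/page',
    ]))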

error handler

    except Exception:
        return None

Consider logging the elapsed time, the url, and the numeric 4xx or 5xx status that came back. For example, we may want to know the site throttled us with 429 Too Many Requests.
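
Something along these lines, assuming the fetch stays on stdlib urllib for the moment:

    import logging
    import time
    import urllib.error
    import urllib.request

    log = logging.getLogger(__name__)

    def download_text(url, timeout=30):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                html = response.read().decode('utf-8', errors='replace')
        except urllib.error.HTTPError as exc:
            # a 429 here tells us the site throttled the crawl
            log.warning('%.1fs  HTTP %d  %s', time.monotonic() - start, exc.code, url)
            return None
        except Exception as exc:
            log.warning('%.1fs  %s  %s', time.monotonic() - start, exc, url)
            return None
        return clean_html(html)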

slop comments

    # Unescape HTML entities
    ...
    # Remove script and style tags
    ...
    # Remove HTML tags
    ...
    # Normalize whitespace
    ...
    # Remove leading/trailing whitespace

Yeah, yeah, we were vibing with an LLM, I get it. But those remarks are vacuous and are redundant with what the source code assignments eloquently state. Delete such remarks prior to a git commit.

unit tests

I didn't see any. Automated tests should be exercising those regexes, to verify they behave as you think they behave.
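
A couple of pytest cases would be a start. The import path is hypothetical and the expected strings encode my guess at the intended behavior, so adjust both to match reality:

    from crawler.utils import clean_html   # hypothetical import path

    def test_strips_tags_and_normalizes_whitespace():
        assert clean_html('<p>Hello,\n   <b>world</b>!</p>') == 'Hello, world!'

    def test_drops_script_content():
        assert clean_html('<script>alert(1)</script><p>kept</p>') == 'kept'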

fingerprints

Summarizing a blog post or a novel by just its eight initial words is, ummmm, surprising. I'm especially worried that a boilerplate website navbar, or a repeated license / copyright notice, will make many documents on a given site appear "identical".

unused param

def is_duplicate(text, seen_fingerprints, threshold=0.8):

Remove the unused threshold, please. No need to have an IDE auto-complete that for some hapless app author.

one-liner

        exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)

This is not at all easy to read. Prefer to let black -S *.py worry about laying out code within some reasonable line width:

        exclude_extensions = (
            set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
        ) | set(DEFAULT_EXCLUDE_EXTENSIONS)

design of Public API

Please elide the use_default_excludes parameter. Better to let caller pass in DEFAULT_EXCLUDE_DOMAINS and/or DEFAULT_EXCLUDE_EXTENSIONS.

Also, I wish we were pulling one or both of those in from lazynlp, rather than copy-n-pasting them into this project.
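
The caller then opts in explicitly. This is only a sketch: crawl() and its argument names are stand-ins for whatever the real entry point looks like.

    from crawler import DEFAULT_EXCLUDE_DOMAINS, DEFAULT_EXCLUDE_EXTENSIONS, crawl

    crawl(
        'urls.txt',
        output_dir='corpus/',
        exclude_domains=DEFAULT_EXCLUDE_DOMAINS,
        exclude_extensions=DEFAULT_EXCLUDE_EXTENSIONS,
    )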

pathlib

    ... = open(os.path.join(output_dir, 'success.txt'), 'w')
    ... = open(os.path.join(output_dir, 'failed.txt'), 'w')

The ancient os.path module works well enough. But when authoring new code, prefer from pathlib import Path.
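
The Path spelling of those two lines:

    from pathlib import Path

    success_log = (Path(output_dir) / 'success.txt').open('w')
    failed_log = (Path(output_dir) / 'failed.txt').open('w')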

imports at top

        # Check exclusion filters
        from urllib.parse import urlparse
        parsed = urlparse(url)

No.

Put the import where it belongs.

Also, the comment suggests that we should Extract Helper Function, in the hopes of letting the body of the for loop fit in a single screenful without vertical scrolling.
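
A sketch of that helper; the name and the filter details are my guesses at what the loop body currently does:

    from pathlib import PurePosixPath
    from urllib.parse import urlparse

    def is_excluded(url, exclude_domains, exclude_extensions):
        """Return True when the URL's domain or file extension is filtered out."""
        parsed = urlparse(url)
        if parsed.netloc.lower() in exclude_domains:
            return True
        extension = PurePosixPath(parsed.path).suffix.lstrip('.').lower()
        return extension in exclude_extensions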

def main():
    """Command-line interface."""
    import sys
    import argparse

Again, no.

Those two belong at the top of the module.

import logging

            failed_log.write(f"{url}\texcluded_domain\n")

Consider doing that with a logger, so you consistently get automatic timestamped log entries. That will help you understand why a given crawl was fast or slow.
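
Minimal setup, so every record carries a timestamp automatically:

    import logging

    logging.basicConfig(
        format='%(asctime)s %(levelname)s %(message)s',
        level=logging.INFO,
    )
    log = logging.getLogger(__name__)

    # instead of failed_log.write(f"{url}\texcluded_domain\n")
    log.info('excluded_domain %s', url)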

cracking argv

    if len(sys.argv) >= 2 and sys.argv[1] == 'stats':
        if len(sys.argv) < 3:
            ...
    if len(sys.argv) >= 2 and sys.argv[1] == 'merge':
        if len(sys.argv) < 4:

I imagine those work properly? But it seems like you're working too hard. Why not let argparse deal with those subcommands and their arguments for you? When I import typer I always get appropriate CLI diagnostics displayed automatically, without jumping through such hoops.
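
Sticking with argparse, subparsers hand you those diagnostics for free; the argument names below are guesses at what stats and merge actually need:

    import argparse

    def main():
        parser = argparse.ArgumentParser(description='Build and manage a text corpus.')
        subcommands = parser.add_subparsers(dest='command', required=True)

        stats = subcommands.add_parser('stats', help='report statistics for a corpus directory')
        stats.add_argument('corpus_dir')

        merge = subcommands.add_parser('merge', help='merge downloaded files into one corpus')
        merge.add_argument('corpus_dir')
        merge.add_argument('output_file')

        args = parser.parse_args()   # a missing argument now yields a usage message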


be lazy

After you (quickly) ship version 0.1.0, I urge you to consider using uv to manage dependencies listed in pyproject.toml, such as httpx. Consider adding a Makefile with a make install target, or a shell script, that shows how to pull in deps and assemble a small text corpus.
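
For example, a tiny shell script; the module name and file names here are placeholders:

    #!/bin/sh
    set -e
    uv sync                                        # install deps from pyproject.toml
    uv run python -m crawler urls_example.txt corpus/
    uv run python -m crawler stats corpus/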

This project should focus on its core value-add, which is managing large text datasets. To the extent that you can outsource any of the network minutiae to some well tested library that has already worked out the details, I encourage you to do so.
