Simple Word-Based Text Truncator

Question

I created a Python 3.11 utility that truncates an input string to a fixed word count—splitting on any whitespace, collapsing runs, and dropping trailing stop-words—so you get clean, concise snippets ready for downstream NLP tasks.

What it should do:

Truncate an input string to at most max_words words.
Split on any whitespace (collapsing runs of spaces, tabs, newlines).
If the last retained word is a common stop-word (e.g. “of”, “the”), drop it so you don’t end on an article/preposition.
Return a single string of the truncated words.
Handle edge cases: empty input, exact fits, max_words = 1, and invalid parameters (max_words < 1).

Environment & background

Python 3.11
Preprocessing predicates or short text snippets before feeding into downstream logic (e.g. building knowledge-graph edges).
Not a homework or interview question—just looking for best practices and bug-checks.

from typing import Set


_STOP_WORDS: Set[str] = {
    "a", 
    "an",
    "the",
    "of",
    "with",
    "by",
    "to",
    "from",
    "in",
    "on",
    "for",
}


def truncate(
    text: str,
    max_words: int = 3
) -> str:
    """
    Truncate `text` to at most `max_words` whitespace-separated words,
    dropping a trailing common stop-word if present.

    Splits on any whitespace (spaces, tabs, newlines), collapsing runs
    into single separators.

    Args:
        text: Input string to truncate.
        max_words: Maximum number of words to retain (must be ≥1).

    Returns:
        A string consisting of up to `max_words` words joined by single spaces.

    Raises:
        ValueError: if `max_words < 1`.

    Examples:
        >>> truncate("run in the park", 3)
        'run in'

        >>> truncate("of the", 2)
        'of the'
    """
    if max_words < 1:
        raise ValueError("max_words must be ≥ 1")

    words = text.strip().split()
    if len(words) <= max_words:
        return " ".join(words)

    head = words[:max_words]
    # Drop trailing stop-word so we don’t end on “of”, “the”, etc.
    if head and head[-1].lower() in _STOP_WORDS:
        head.pop()

    return " ".join(head)

Unit Tests (basic)

import pytest


def test_error_on_invalid_max_words():
    try:
        truncate("some text", 0)
        assert False, "Expected ValueError for max_words < 1"
    except ValueError as e:
        assert "max_words" in str(e)

def test_no_truncation_if_shorter():
    assert truncate("one two", 3) == "one two"
    assert truncate("a b c", 3) == "a b c"

def test_simple_truncation():
    assert truncate("one two three four", 2) == "one two"

def test_drop_trailing_stopword():
    # 'in' is a stopword, should be removed
    assert truncate("alpha beta in the", 3) == "alpha beta"
    # last kept word not a stopword, so stays
    assert truncate("alpha in beta gamma", 3) == "alpha in beta"

def test_strip_whitespace_and_split():
    # leading/trailing spaces collapse
    assert truncate("   hello   world  ", 1) == "hello"

def test_mixed_case_and_stopword():
    # stopword removal is case-insensitive
    assert truncate("Run In The Park", 3) == "Run In"


if __name__ == "__main__":
    pytest.main()

Small potatoes: Defaulting the length (to 3) seems absolutely arbitrary. Can't think of a usage where any particular default value would be useful... Seems an unnecessary wrinkle, (imho)... None of the presented "test cases" capitalise on the default being there....
– user272752, Commented Jun 13 at 2:55

J_H · Accepted Answer · 2025-06-13 02:12:17Z

Thank you for the unit tests, they are helpful.

The requirements didn't comment on punctuation; I will assume those characters have already been filtered out.

special casing

    if len(words) <= max_words:
        return " ".join(words)

I have trouble believing that is correct, in the sense of implementing your stated requirements. Consider an input of "I have nothing to write with" where max_words is at least six. My reading of the requirements is the final "with" stop word should be trimmed, but the OP code won't do that.

Immediately afterward we assign head = words[:max_words]. I would prefer to have just a single path through the code, so a unit test doesn't have to worry about exercising that early return.

In the OP code your head list is shorter than words. I'm suggesting that there's no need for that distinction, and it would suffice to always work with words:
words = text.strip().split()[:max_words]

If ignoring the end of a long text makes a difference in your running time, then using the maxsplit parameter might be of interest.
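To make the single-path idea concrete, here is a minimal sketch. The name truncate_single_path is hypothetical, and the stop-word set is copied from the question. It assumes my reading of the requirements, namely that a trailing stop-word should be dropped even when the input already fits within max_words, so it deliberately differs from the OP code on short inputs (for example "of the" with max_words=2 would come back as "of").

```python
# Stop-word list copied from the question.
_STOP_WORDS = {"a", "an", "the", "of", "with", "by", "to", "from", "in", "on", "for"}


def truncate_single_path(text: str, max_words: int = 3) -> str:
    """Hypothetical single-path variant: no early return for short inputs."""
    if max_words < 1:
        raise ValueError("max_words must be >= 1")

    # maxsplit=max_words yields at most max_words + 1 pieces; the final
    # piece is the unsplit remainder of a long text, and the slice drops it.
    words = text.split(maxsplit=max_words)[:max_words]

    # Drop one trailing stop-word, whether or not truncation happened.
    if words and words[-1].lower() in _STOP_WORDS:
        words.pop()

    return " ".join(words)


print(truncate_single_path("I have nothing to write with", 6))
print(truncate_single_path("one two three four", 2))
```

With maxsplit, str.split stops scanning after max_words separators, so the tail of a very long text is never broken into words. Whether that matters for your inputs is something only profiling can tell you.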

implicit typing

I'm glad to see that you're linting and type checking!

This is very clear, but a little more verbose than needed.

from typing import Set


_STOP_WORDS: Set[str] = {

Old interpreters needed that import, but since Python 3.9 the built-in lowercase set[str] works fine as an annotation.

And, at least with pyright and mypy --strict, I would expect that with a simple assignment
_STOP_WORDS = { "a", ... "for" }
the linter would infer the type. It's not like we have to worry about inheritance, here.

Imagine we looped to read those words from a text file. In that case you'd need to give the linter some help:
stop_words: set[str] = set()
Why? Because an empty container doesn't give the linter much to go on, so you may need to spell it out.
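As a rough illustration of the file-reading case, here is a small sketch. The helper name load_stop_words and the one-word-per-line format are assumptions for the example, not anything from the OP code. An in-memory io.StringIO stands in for a real file so the snippet runs on its own.

```python
import io
from collections.abc import Iterable


def load_stop_words(lines: Iterable[str]) -> set[str]:
    """Hypothetical loader: one stop-word per line, blank lines skipped."""
    # An empty set must be written set(); a bare {} is an empty dict.
    # The annotation tells the type checker what the elements will be.
    stop_words: set[str] = set()
    for line in lines:
        word = line.strip().lower()
        if word:
            stop_words.add(word)
    return stop_words


# A file object opened with open(path) would work the same way.
sample = io.StringIO("a\nan\nThe\n\nof\n")
print(sorted(load_stop_words(sample)))
```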

toolic · Accepted Answer · 2025-06-12 12:13:43Z

Tests

You could add a few tests to verify the default max_words input works, such as:

def test_default():
    assert truncate("Run In The Park") == "Run In"

Documentation

Consider amending the docstring to mention the special cases where the returned string can still end in a stop-word: when the last two retained words are both stop-words (only one is dropped), and when the input is not truncated at all. Both examples show the returned value ending in a stop-word.
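To see both cases in action, here is a small demonstration. The function below is a compact copy of the logic posted in the question, reproduced only so the snippet runs on its own. The input "walk out of the house" is an invented example.

```python
# Stop-word list copied from the question.
_STOP_WORDS = {"a", "an", "the", "of", "with", "by", "to", "from", "in", "on", "for"}


def truncate(text: str, max_words: int = 3) -> str:
    """Compact copy of the question's logic, for demonstration only."""
    if max_words < 1:
        raise ValueError("max_words must be >= 1")
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    head = words[:max_words]
    if head[-1].lower() in _STOP_WORDS:
        head.pop()
    return " ".join(head)


# Two stop-words at the cut point: only the last one ("the") is dropped,
# so the result still ends in "of".
print(truncate("walk out of the house", 4))  # walk out of

# Input that already fits takes the early return, so nothing is dropped.
print(truncate("of the", 2))  # of the
```

If that behavior is intended, a sentence in the docstring saying so would save readers from wondering whether it is a bug.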

Naming

It is common to pluralize array variables, as you did with words. However, head does not comply with that policy. I suggest you change:

words = text.strip().split()

to:

all_words = text.strip().split()

Then, use words to mean the truncated words:

words = all_words[:max_words]

Portability

I'm not a big fan of fancy Unicode characters in source code, like the symbol for "greater than or equal to". Sometimes they don't render well in editors, and other times they don't render well in output generated by the code.

This is an alternative, for example:

max_words: Maximum number of words to retain (must be > 0).

raise ValueError("max_words must be > 0")
