This post is also available in Portuguese: Read in Portuguese
In the previous post, I shared the motivation and idea behind Project Insight — an open source project I'm developing with the goal of building an intelligent assistant capable of understanding and documenting source code interactively.
In this post, we'll take a look at the first technical component of the project: the crawler.
What is the crawler and why is it important?
The crawler is the core of Project Insight’s static analysis phase. It’s responsible for:
- Navigating through Java project files.
- Identifying and extracting key information such as:
  - Class and method names.
  - Modifiers (public, private, etc.).
  - Return types.
  - The line where each item appears in the code.
These details are stored in a local database and will later be used by the AI to answer project-related questions in a contextualized way.
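To make that more concrete, here is a rough sketch of the kind of record the crawler produces for a single method. The field names are my own illustration of the idea, not necessarily the actual models used in the project:

```python
from dataclasses import dataclass


@dataclass
class JavaMethod:
    """One extracted metadata record for a single Java method."""
    class_name: str
    method_name: str
    modifiers: list[str]
    return_type: str
    line_number: int


# Hypothetical example of what the crawler might record for a getter:
record = JavaMethod(
    class_name="UserService",
    method_name="getUserById",
    modifiers=["public"],
    return_type="User",
    line_number=42,
)
```

Stored this way, each record can later be looked up by the AI layer when it needs to answer a question about a specific class or method.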
Project structure overview
To keep the project clean and maintainable from the start, I structured the crawler into separate modules, each with a clear responsibility:
project-insight-crawler/
├── crawler/
│ ├── database/ # Database creation and connection (SQLite)
│ ├── logger/ # Centralized logger for the project
│ ├── models/ # Models for Java classes and methods
│ ├── parser/ # Parser that extracts data from .java files
│ ├── use_cases/ # Use cases like saving data to the database
│ └── __init__.py
├── tests/ # Unit tests for main modules
├── crawler.db # Local SQLite database
├── main.py # Project CLI interface
├── runner.py # Main script to execute the crawler
├── LICENSE
├── Makefile
├── poetry.lock
├── pyproject.toml
└── README.md
This structure allows each part to evolve independently, keeping the codebase modular and testable.
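As a small example of what "a clear responsibility" means here, the logger module's whole job is to hand out one consistently configured logger to the rest of the code. This is a minimal sketch assuming a standard-library logging setup, not necessarily the project's actual implementation:

```python
import logging


def get_logger(name: str) -> logging.Logger:
    """Return a logger with the project-wide format and level."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Keeping this in its own module means the parser, use cases, and database code can all log the same way without repeating configuration.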
Initial architecture decisions
From the beginning, I made some decisions to simplify development and ensure quality:
- Language: Python, for its familiarity and fast prototyping capabilities.
- Database: SQLite, lightweight and easy to set up, ideal for MVPs (a minimal schema sketch follows at the end of this section).
- Dependency management: poetry, for streamlined installation and packaging.
- Code quality tools: already set up with:
  - ruff (linter and formatter)
  - mypy (type checking)
  - pre-commit hooks
- Automation and maintenance:
  - GitHub Actions: for running tests on every push.
  - Dependabot: to keep dependencies safely up to date.
All of this helps maintain consistent code quality and reduces future headaches.
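To illustrate the SQLite choice mentioned above: the whole persistence layer can run on the Python standard library, with no server to install. The schema below is an assumption of mine for illustration, not the project's actual one:

```python
import sqlite3

# Hypothetical table for the extracted method metadata; the real schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS java_methods (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    class_name  TEXT NOT NULL,
    method_name TEXT NOT NULL,
    modifiers   TEXT,
    return_type TEXT,
    line_number INTEGER
);
"""

with sqlite3.connect("crawler.db") as conn:
    conn.execute(SCHEMA)
```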
What’s next
In the next post (Part 2.2), I’ll dive deeper into how the parser works — reading .java
files line by line, detecting relevant code blocks, and extracting the information that feeds the database.
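Until then, here is a deliberately oversimplified preview of the idea. This is my own illustration rather than the actual parser: a regular expression applied line by line is already enough to spot many method signatures.

```python
import re
from pathlib import Path

# Very rough sketch: matches signatures like "public User getUserById(Long id) {"
METHOD_PATTERN = re.compile(
    r"^\s*(?P<modifiers>(?:(?:public|private|protected|static|final)\s+)+)"
    r"(?P<return_type>[\w<>\[\]]+)\s+"
    r"(?P<name>\w+)\s*\("
)


def preview(java_file: str) -> None:
    """Print the line number, return type, and name of each detected method."""
    for line_number, line in enumerate(Path(java_file).read_text().splitlines(), start=1):
        match = METHOD_PATTERN.match(line)
        if match:
            print(line_number, match.group("return_type"), match.group("name"))
```

A real parser also has to handle constructors, annotations, generics, and multi-line signatures, which is part of why it deserves a post of its own.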
What do you think?
If you enjoyed this post so far, feel free to follow me here on Dev.to to keep up with the rest of the Project Insight series. I'm building this in real-time, so any feedback, questions, or suggestions are more than welcome in the comments!
Link to the project: https://github.com/gustavogutkoski/project-insight-crawler
⚙️ This post was written with the help of an AI assistant for writing and editing. All ideas, project structure, and technical implementations are my own.