Henry Dioniz

🛡️ A Swahili SMS Scam Dataset and a Machine Learning Tool to Use It

In Tanzania 🇹🇿, scammers are getting smarter. They often pretend to be someone you know or trust: a relative, a friend, a landlord, or even a job recruiter. Their goal? To trick you into sending them money.

You've probably seen texts like:

  • “Ni tumie kwa namba hii Jina litakuja SALOME KALUNGA, hiyo ni namba yangu mpya ya Halotel”
  • “Utanitumia kwenye ii 0615810764 airtel jina MARIAM NDUGAI namba yangu inadeni usiitumie”
  • “MZEE LUKA KIMBANGU tiba asili biashala kazi masomo utajili kesi kuludisha mke&mume piga (0787-406-889)(0787-406-889)”
  • “666,KARIBU FREEMASON UTIMIZE NDOTO KATIKA BIASHARA, KILIMO,UFUGAJI,MACHI MBO,MICHEZO N.K KWAMHITAJI KUJIUNGA PG: 0786543210 AU 0786543210”

These messages are dangerous, deceptive, and sadly, very common.

As a Tanzanian tech enthusiast and developer, I wanted to do something about it.
So I created the BongoScam dataset, an open dataset of over 1,500 labeled Swahili SMS messages (scam and legitimate), and a basic machine learning model to help detect the scams.

📊 The Dataset: Swahili SMS Detection

I collected and labeled 1,508 real Swahili messages, split into two categories:

  • scam: Suspicious, misleading, or fraudulent messages.
  • trust: Legitimate or safe messages.

Example entries:

category | sms
scam     | "IYO PESA ITUME KWENYE NAMBA HII 0657538690 JINA ITALETA Magomba Maila"
trust    | "Nashukuru kwa kupokea simu yangu. Tutalifanyia kazi."
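As a rough sketch of how the two columns might be read, the snippet below parses a tiny in-memory CSV and counts the labels. It assumes the dataset is distributed as a CSV whose header row is `category,sms`; the actual file name and layout on Kaggle are not stated here, and `load_messages` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Two rows copied from the examples above. Assumption: the real file
# is a CSV with the same `category` and `sms` columns.
SAMPLE_CSV = '''category,sms
scam,"IYO PESA ITUME KWENYE NAMBA HII 0657538690 JINA ITALETA Magomba Maila"
trust,"Nashukuru kwa kupokea simu yangu. Tutalifanyia kazi."
'''


def load_messages(handle):
    """Return (category, sms) pairs from a CSV with `category` and `sms` columns."""
    reader = csv.DictReader(handle)
    return [(row["category"].strip().lower(), row["sms"].strip()) for row in reader]


rows = load_messages(io.StringIO(SAMPLE_CSV))

# Count how many messages carry each label.
label_counts = Counter(category for category, _ in rows)
print(label_counts)
```

To read a downloaded file instead, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) in place of the `io.StringIO` object.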

➡️ Download the dataset on Kaggle:
📥 swahili-sms-detection

🧠 The Model: Simple but Effective

To demonstrate what's possible, I built a lightweight machine learning model using:

  • 🧹 CountVectorizer for converting text to numeric features
  • 🤖 Multinomial Naive Bayes classifier
  • 📈 98.7% accuracy on test data
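To make the two steps concrete, here is a minimal from-scratch sketch of the same idea: word counts as features, then a multinomial Naive Bayes decision with add-one smoothing. It is not the project's code. In scikit-learn these steps correspond to `CountVectorizer` and `MultinomialNB`; the class `TinyMultinomialNB` is hypothetical, the training set is a toy one built from messages quoted in this post, and the last two "trust" messages are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters,
    which mirrors CountVectorizer's default tokenization."""
    return re.findall(r"\w\w+", text.lower())


class TinyMultinomialNB:
    """Bag-of-words multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = add-one)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # messages per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_texts = [
    "Ni tumie kwa namba hii Jina litakuja SALOME KALUNGA, hiyo ni namba yangu mpya ya Halotel",
    "Utanitumia kwenye ii 0615810764 airtel jina MARIAM NDUGAI namba yangu inadeni usiitumie",
    "IYO PESA ITUME KWENYE NAMBA HII 0657538690 JINA ITALETA Magomba Maila",
    "Nashukuru kwa kupokea simu yangu. Tutalifanyia kazi.",
    "Habari za asubuhi, tutaonana kesho ofisini.",   # invented example
    "Asante sana, nimefika salama nyumbani.",        # invented example
]
train_labels = ["scam", "scam", "scam", "trust", "trust", "trust"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
scam_guess = model.predict("Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA")
trust_guess = model.predict("Nashukuru sana, tutaonana kesho")
print(scam_guess, trust_guess)
```

Six training messages are far too few to say anything about accuracy; the point is only to show how shared tokens such as "jina" and "kwenye" pull a new message toward the scam class.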

The model is wrapped in a Flask API and deployed as a simple website for public use.

You can test it live here:
👉 bongoscam.vercel.app

📦 Project Structure

You can explore or contribute via GitHub:

🔗 GitHub: BongoScamDetection

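# Clone the repo (git creates a directory named after the repository)
git clone https://github.com/Henryle-hd/BongoScamDetection
cd BongoScamDetection

# Install frontend dependencies
cd frontend
npm install

# Install backend dependencies
# (assumes backend/ sits next to frontend/ at the repo root)
cd ../backend
pip install -r requirements.txt

# Run the backend (from the backend directory)
python main.py

# Run the frontend in a second terminal, starting from the repo root
cd frontend
npm run dev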

🔌 API Example

Endpoint: POST /api/predict
Request:

{
  "sms": "Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA"
}

Response:

{
  "prediction": "scam",
  "sms": "Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA"
}
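The request and response above could be exercised from Python roughly as follows, using only the standard library. This is a sketch under assumptions: `BASE_URL` points at a locally running backend on port 5000 (the port the project actually uses is not stated here), and `build_predict_request`, `parse_prediction`, and `predict` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the backend is running locally on port 5000.
# Adjust this to wherever the API is actually served.
BASE_URL = "http://localhost:5000"


def build_predict_request(sms, base_url=BASE_URL):
    """Build a POST /api/predict request with a JSON body {"sms": ...}."""
    payload = json.dumps({"sms": sms}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body):
    """Extract the "prediction" field from the JSON response text."""
    return json.loads(body)["prediction"]


def predict(sms, base_url=BASE_URL, timeout=10):
    """Send the message to the API and return the predicted label."""
    request = build_predict_request(sms, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read().decode("utf-8"))


# Example call (requires the backend to be reachable):
# label = predict("Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA")
```

Splitting request construction and response parsing from the network call keeps both pieces easy to check without a running server.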

🔄 Why This Matters

This project isn't just about coding. It's about digital safety.
Millions of people in East Africa rely on SMS for communication.
Without strong tools or education, they're vulnerable.

By:

  • Open-sourcing the data
  • Making the model public
  • Supporting Swahili language

...I'm hoping this becomes a starting point for more localized ML solutions: in Swahili, for Africa, by Africans.

✍️ Final Thoughts

The BongoScam dataset is a small step toward fighting digital fraud in Tanzania, but I believe it can grow with your input.
If you're a:

  • Developer 🧑‍💻
  • Linguist 🌍
  • Security researcher 🔐
  • Student 📚

…there's something in this project for you.

👉 Test the tool at bongoscam.vercel.app
👉 Explore the dataset on Kaggle
👉 Contribute code via GitHub

💬 Got feedback or want to collaborate? Drop a comment or find me on LinkedIn or GitHub.

Let's build AI that speaks Swahili and protects people, not just data.
