Anna Karnaukh

How We Built UI Bug Detection from Scratch: What Worked and What Didn't

When we first started planning our own test automation product, our core goal was full end-to-end testing — a system that could test any website automatically, with minimal manual setup. Ideally, it would be as simple as providing a link and letting the system handle the rest. To move faster, we decided to start with what looked like low-hanging fruit: UI bug detection.

It sounded simple enough. But once we got into it, we realized just how tricky it was. We explored multiple approaches, ran into licensing and model limitations, spent weeks generating datasets, and rebuilt parts of the system more than once.

This is a step-by-step look behind the scenes at how we designed and developed our UI bug detection system — and what we learned along the way.

Challenges on Our Path to UI Bug Detection

Designing the System Architecture

From the beginning, we aimed to keep the architecture lightweight: a modular system built from small, simple functional pieces. Using a cloud API showed potential, but it was too expensive for production use. The next logical step was to train an object detection model on our own custom dataset, cutting costs while keeping performance.

Licensing: The Roadblock We Didn’t Expect

We chose YOLO as our base model for detecting visual bugs. It was fast, well-documented, and great for object detection tasks — exactly what we needed.

At first, we focused on YOLO NAS, since it was one of the few variants with a business-friendly Apache 2.0 license. Everything looked good, so we integrated it into our pipeline and started working with the pre-trained model it provided.

Later, when we took a closer look at the full licensing terms, things got tricky.

While the core YOLO NAS framework had a permissive license, the pre-trained model weights were licensed differently. According to the terms, using them in a product required us to open-source our own code — something we clearly couldn’t do.

This wasn’t obvious at the start, and it wasn’t mentioned front and center. But once we read the fine print in the documentation, the problem became clear: we couldn’t legally use those pretrained models in a commercial product.

So we had to change direction. We retrained the model from scratch using only our own data and infrastructure, with no third-party weights involved. We expected that giving up the pretrained weights would hurt output quality, but the results actually came out slightly better.
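For illustration, here is a minimal sketch of what that change amounts to in code. It assumes the super_gradients package; the class list is a placeholder and the full training configuration is omitted.

```python
from super_gradients.training import models

# Placeholder class list; the taxonomy we actually ended up with is listed later in the post.
UI_BUG_CLASSES = ["broken_image", "overlapping_content", "empty_layout"]

# pretrained_weights=None initializes the network randomly, so no restrictively
# licensed third-party checkpoints are pulled into the product.
model = models.get(
    "yolo_nas_s",
    num_classes=len(UI_BUG_CLASSES),
    pretrained_weights=None,
)
```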

Takeaway: When working with any open-source model, check every layer — not just the framework, but the weights, datasets, and any dependencies. Licensing issues can sneak in where you least expect them.

Building the Dataset from Scratch

Once we decided to train the model on our own dataset, we faced the hardest part: creating a dataset from scratch, because no ready-to-use data existed for the kinds of UI bugs we planned to detect.

At first, we thought it would be simple: just break some styles, take screenshots, and start training. But of course, it turned out to be much more complex.

So we built our own crawler. It could automatically browse websites, inject scripts, modify elements, and generate labeled screenshots. We even made sure it could pick up where it left off if something crashed, which happened a lot. Still, the entire process was slow and fragile.
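The resume-after-crash behaviour boiled down to checkpointing which URLs had already been processed. A minimal sketch of the idea, assuming a plain-text checkpoint file; the file name and the process_site stub are hypothetical:

```python
from pathlib import Path

DONE_FILE = Path("processed_urls.txt")  # hypothetical checkpoint file

def load_done() -> set[str]:
    # Read the URLs that were fully processed before the last crash.
    return set(DONE_FILE.read_text().splitlines()) if DONE_FILE.exists() else set()

def process_site(url: str) -> None:
    ...  # browse the site, inject scripts, capture labeled screenshots

def crawl(urls: list[str]) -> None:
    done = load_done()
    for url in urls:
        if url in done:
            continue  # already handled in a previous run
        process_site(url)
        # Append immediately so a crash never loses completed work.
        with DONE_FILE.open("a") as f:
            f.write(url + "\n")
```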

We wrote scripts that used JavaScript and Selenium to manipulate live websites. We disabled images, shifted elements so they overlapped, and tweaked layouts in weird ways. After that, we captured screenshots and recorded the exact coordinates of each visual change. That gave us the raw materials for training, but it was painfully slow. It was also error-prone: we mutated pages randomly with JavaScript, and because every site's markup reacted differently, many mutations produced broken or unusable samples.
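A minimal sketch of one such mutation, assuming Selenium with a local Chrome driver; the URL, selector, and offsets are arbitrary examples, not our production scripts:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Shift the first image so it overlaps neighbouring content.
img = driver.find_element(By.TAG_NAME, "img")
driver.execute_script(
    "arguments[0].style.position = 'relative';"
    "arguments[0].style.left = '-120px';"
    "arguments[0].style.top = '40px';",
    img,
)

# Record the element's on-screen box as the label for this mutation.
box = driver.execute_script(
    "const r = arguments[0].getBoundingClientRect();"
    "return [r.x, r.y, r.width, r.height];",
    img,
)

driver.save_screenshot("sample_0001.png")
print("overlapping_content", box)  # in practice, written next to the screenshot
driver.quit()
```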

Each sample took about three seconds to generate. And we needed thousands. Tens of thousands. At that rate, 50,000 samples works out to more than 40 hours of nonstop generation, before counting crashes and retries. The more we tried to scale it, the more we realized: this was going to eat up time, memory, and patience.

Synthetic Data Generation

Eventually, we hit a wall. Scraping real websites and breaking them on the fly was too slow and inconsistent, and the real screenshots still had to be filtered manually to weed out unusable sets. That's when we added synthetic data.

In addition to using existing websites, we began creating simple UI layouts on canvas from scratch. We manually placed overlapping texts, “broke” images by overlaying error graphics, and created fake popups with randomized elements. We started simulating UI bugs ourselves, in a fully controlled environment.
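A minimal sketch of that kind of generator, using Pillow; the layout, label, and file naming here are illustrative rather than our actual generation code:

```python
import random
from PIL import Image, ImageDraw

WIDTH, HEIGHT = 1280, 800

def make_sample(idx: int) -> None:
    img = Image.new("RGB", (WIDTH, HEIGHT), "white")
    draw = ImageDraw.Draw(img)

    # Draw a fake header bar and a block of body text.
    draw.rectangle([0, 0, WIDTH, 80], fill="#2d6cdf")
    x, y = random.randint(40, 200), random.randint(120, 300)
    draw.text((x, y), "Lorem ipsum dolor sit amet", fill="black")

    # Deliberately overlap a second text block with the first one and
    # record its box as an "overlapping_content" label.
    ox, oy = x + random.randint(-30, 30), y + random.randint(-8, 8)
    draw.text((ox, oy), "Consectetur adipiscing elit", fill="black")
    bbox = (ox, oy, ox + 220, oy + 16)

    img.save(f"synthetic_{idx:05d}.png")
    with open(f"synthetic_{idx:05d}.txt", "w") as f:
        f.write(f"overlapping_content {bbox}\n")

for i in range(10):
    make_sample(i)
```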

With synthetic data, we didn’t have to worry about waiting for page loads, dealing with broken links, or handling unpredictable website structures. We could generate examples quickly, with the exact bugs we wanted, and in the right format for training.

It wasn’t just faster — it was also cleaner. The model got better training inputs, and we spent way less time cleaning up bad screenshots or fixing crawler bugs.
The issue was that synthetic data only covered a limited range of UI distortions, roughly 60%. The dataset elements were also too similar to each other; we needed more variety.

From then on, we used a mixed approach: part real websites, part modified sites, and part synthetic data. And that’s when we finally started making real progress.

Making the Model Smarter About UI Bugs

As the project developed, our definition of a "UI bug" naturally expanded. We started with the most visible problems — unreadable text, overlapping elements, and broken layouts. But soon we realized that there were many other subtle yet impactful issues worth catching.

Things like inconsistent letter spacing, unnecessary scrollbars caused by layout shifts, or mismatched font sizes across components began to surface as meaningful categories. Popups — such as cookie banners and modal dialogs — also became part of our scope, since they often interfere with user interaction.
To detect these, we generated custom synthetic data. We built simplified UI layouts, layered elements in different ways, and added visual details like shadows to mimic real-world styles. This gave the model a wide variety of examples to learn from.
We also recognized that not all bugs need bounding boxes. Some problems, like missing content or font inconsistencies, affect the entire page rather than a specific area. These worked better as image classification tasks, assigning a single label to the whole screenshot.

In the end, we built a two-track system:

  • Object detection for localized, visual issues like overlapping elements or broken images;
  • Page-level classification for broader layout or content problems.
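
As a rough sketch of how the two tracks fit together at inference time; the detector and classifier interfaces here are assumed placeholders, not our actual API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    label: str
    box: tuple[int, int, int, int] | None  # None for page-level findings

def analyze_screenshot(image_path: str, detector, page_classifier) -> list[Finding]:
    findings: list[Finding] = []

    # Track 1: localized issues (overlapping content, broken images, ...)
    for det in detector.predict(image_path):            # assumed interface
        findings.append(Finding(det.label, det.box))

    # Track 2: page-level issues (missing content, inconsistent fonts, ...)
    for label in page_classifier.predict(image_path):   # assumed interface
        findings.append(Finding(label, None))

    return findings
```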

This combined approach gave us more flexibility and accuracy. It allowed us to match the right detection method to each bug type, which turned out to be crucial for building something reliable. So, the final list of UI bugs looks as follows:

  • Broken image
  • Missing content
  • Unnecessary scroll
  • Letter spacing issue
  • Inconsistent font size
  • Outdated style
  • Inconsistent color scheme
  • Empty layout
  • Broken layout
  • Overlapping content
  • Unnecessary horizontal scroll

Lessons Learned

  • Data is the real challenge — Model training is easy compared to generating a diverse, high-quality dataset. Most of our time went into building and refining the data pipeline.
  • Licensing matters more than you think — Even with open-source tools, you can run into restrictions. Always check licenses for models, weights, and datasets before integrating them.
  • Synthetic + real = best results — A mix of real websites, synthetic layouts, and manual edge cases gave us the most reliable coverage and flexibility.
  • UI bug detection isn’t just one feature — it’s a system. Without the right data, even the best model won’t help.

Curious how this all turned out?
We’re turning these ideas into a real tool at Treegress — a no-code platform for end-to-end testing with built-in visual bug detection.
