Overcoming vision deep learning challenges: why bigger models aren’t always better

February 19, 2026 · 6 minute read

We're launching a new Engineering Blog series, where Helin engineers share practical insights from building real-world AI systems. Modern vision AI may look like plug-and-play, but high-performing models still depend on one critical factor: data quality. In this article, Jason, Tech Lead at Helin, explores the impact of “dirty” data, the limits of automated labeling, and how targeted dataset clean-up can turn a good model into a great one.


Integration vs. innovation in modern vision deep learning

Most applications of computer vision and machine learning today are more an issue of integration than innovation. Tried and tested architectures and pre-trained models dominate the scene, and most applications follow a familiar flow: identify the objects to be detected or classified, find an existing pre-trained model, and integrate it into the system.

This is becoming even more true in 2026. With tech giants pumping out foundation models such as Meta’s Segment Anything Model (SAM), it’s only a matter of time until training your own vision model becomes a relic of the past… right?

Well, part of me is waiting for the day that happens.

Each time I read news of a new method or model being released, I rush to set up a quick trial. Each time, I’m left wanting more, as the more difficult examples I have on hand continue to slip through the “powerful” detection capabilities of these models. It seems, at least for now, that application-specific models are here to stay, along with all the challenges they bring.

Why data quality still defines model performance

Anyone who has spent time in the trenches knows that one of the most important aspects of developing an accurate model is none other than the word much of the modern world revolves around: data.

There is certainly a mental shift when moving from academia to industry in that regard. If you have ever wondered why the same datasets show up again and again in published experiments, the reason is simple: creating a dataset worth experimenting on is tedious work. You must be confident that the data you have created is of high enough quality that it will not distort your results when evaluating how well an architecture learns.

The general consensus seems to be that the more data, the better.

Think of it in human terms. The more experience you have with a certain task, the better you become at performing it. If you are shown the difference between an orange and a mandarin enough times, you learn how to distinguish the two.

We often equate the machine learning process with natural human learning. This analogy works to an extent, but it breaks down if we overlook the fact that our own learning rests on a foundation of years of experience and interaction with countless objects, sensations, and teachings.

In a sense, the data we consume as humans is always “perfect” because we interact directly with reality.

If someone holds up two red balls and claims they are different colors, I can refute that. I know from the data my eyes provide that they are the same.

Neural networks, on the other hand, cannot refute information during training.

If you feed a model the same image twice and label one “red ball” and the other “green ball,” it will be no closer to discovering what is real and what is not.
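Here is a minimal sketch of that failure mode, using PyTorch. The linear layer and constant tensor are toy stand-ins for a real classifier and the duplicated image; the point is only that gradient descent has no mechanism for rejecting one of two contradictory labels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)          # toy stand-in for a classifier
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.ones(2, 4)             # the "same image", fed twice
y = torch.tensor([0, 1])         # labeled "red ball" once, "green ball" once

for _ in range(500):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Both rows end up near 50/50: the network has no way to tell
# which of the two contradictory labels reflects reality.
print(torch.softmax(model(x), dim=1))
```

Cross-entropy is minimized by splitting probability evenly across the conflicting labels, so the contradiction is simply baked into the model.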

The hidden cost of dirty data

Now extend that example to thousands, or hundreds of thousands, of images. A few mislabeled examples surely would not matter too much, right? After all, there are many other correct examples.

Just let the model learn.

Well, learn it does.

You may very well end up with a good model.

But removing those few dirty data points can turn a good model into a great one.
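If you want to measure the effect yourself, a toy harness like the one below makes the experiment concrete. This is a sketch using scikit-learn on synthetic data; real detection data with structured, correlated noise tends to degrade far more than this benign random-flip setup suggests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for noise in (0.00, 0.02, 0.06):
    rng = np.random.default_rng(0)
    flip = rng.random(len(y_tr)) < noise        # pick a small fraction...
    y_noisy = np.where(flip, 1 - y_tr, y_tr)    # ...and mislabel it
    clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_noisy)
    print(f"label noise {noise:.0%}: test accuracy {clf.score(X_te, y_te):.3f}")
```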

The labeling challenge at scale

So we want perfect data, and we want a lot of it.

How do we get there?

People often assume it is infeasible to label hundreds of thousands of frames by hand, especially when the task is not just classification but object detection within complex scenes. The truth is that most shortcuts and automated approaches to labeling do not lead to a better model.

If you use a model to automatically label your dataset, any model trained on that data will struggle to outperform the model that generated the labels. New labels must be introduced by humans to eliminate bias and add genuinely new information. There is a very good reason entire industries exist solely to produce labels for machine learning.
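In pseudo-labeling terms, the trap looks something like this. The sketch below is illustrative only: `teacher.predict` and the scored-box objects are hypothetical stand-ins for whatever detector produces the automatic labels.

```python
def auto_label(teacher, unlabeled_images, threshold=0.8):
    """Keep only detections the teacher model is confident about.

    `teacher.predict` and the scored-box objects are hypothetical;
    substitute whatever detector produces your automatic labels.
    """
    pseudo_labeled = []
    for image in unlabeled_images:
        boxes = [b for b in teacher.predict(image) if b.score >= threshold]
        pseudo_labeled.append((image, boxes))
    return pseudo_labeled

# Everything the teacher misses is now implicitly labeled "background"
# in the student's training data, so the student learns the teacher's
# blind spots as if they were ground truth.
```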

That said, using a model to enhance already human-labeled data can be extremely effective. A model trained on high-quality human annotations can detect objects that humans occasionally miss, which in turn can further improve overall dataset quality.
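One practical shape this takes is a label audit: run the trained model over the human-labeled set and flag confident detections that no annotated box overlaps. The sketch below assumes a hypothetical `model.predict` returning scored boxes in (x1, y1, x2, y2) format; every flagged frame still goes to a human for the final call.

```python
def iou(a, b):
    """Intersection-over-union for (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def flag_missed_labels(model, dataset, min_score=0.9, iou_thresh=0.5):
    """Yield confident detections that no human-drawn box overlaps."""
    for image, human_boxes in dataset:
        for pred in model.predict(image):      # hypothetical detector API
            if pred.score < min_score:
                continue
            if all(iou(pred.box, hb) < iou_thresh for hb in human_boxes):
                yield image, pred              # candidate missing label
```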

Using models to improve dataset quality

Our friends at 3LC recognized this and understood how significant the problem of dirty data is in real-world machine learning applications. One of the tools they provide lets us iterate on datasets that have grown too large to manage manually: trained models flag potential flaws, and humans review the flags to improve overall data quality.

We wanted to find out exactly how dirty our data was and how much cleaning it up would affect our model performance. After running several experiments, we were somewhat surprised to discover that nearly 6% of our frame data contained missing labels or incorrect bounding boxes. That realization changed our perspective on data and reinforced how important it is to ensure labels are sourced and validated correctly.

After some tool-assisted but still tedious spring cleaning, early results show that a new model trained on a dataset an order of magnitude smaller than our previous ones performs just as well as, if not slightly better than, our previous best model.

How do we know it is better? That is a conversation for another day. For now, this has opened a new path in our pursuit of continuous improvement, so watch this space.

In the meantime, if you are walking a similar path, think twice about your data and how good it really is.

Until then, keep learning and keep building cool things!

Jason
