ML Training Signals Are Better Predictors Than Controllers

What I learned trying to use per-sample data quality signals to control machine learning training — and what they appear to do well instead.

There's a common idea in machine learning that goes something like this:

If we could measure what each data point contributes to learning, we could train better models — by reweighting, reordering, or removing the samples that aren't pulling their weight.

This idea has a long history. It runs from influence functions and Data Shapley through curriculum learning and importance sampling, all the way to modern data-centric AI. The intuition is reasonable: not all data is equally useful, so we should be able to do something about it.

I'd been kicking this idea around for a while, and finally sat down to test whether it works.

I built a small library of per-sample data quality signals — measures of redundancy, informativeness, and prediction stability — and asked the obvious follow-up question: can these signals be used to improve training?

I tried five mechanisms: static scheduling that front-loads samples scored as high-value, dynamic rescoring that updates value estimates as the model learns, entropy-based rescoring, mild gradient weighting, and aggressive gradient weighting.

None of them beat uniform training — not in any setup I tested.

Not in convergence speed. Not in final accuracy. Not in area under the learning curve. The signals themselves were real — they correlated with the things you'd expect them to correlate with, and they characterized the dataset accurately. But the moment I tried to use them as control levers, the effect disappeared.
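To make the three comparison metrics concrete, here is a minimal sketch of how they could be computed from a per-epoch accuracy curve. It is not the code behind the experiments: the function names, the 0.80 threshold, and the two example curves are made up for illustration.

```python
def area_under_curve(acc):
    """Trapezoidal area under a per-epoch accuracy curve (unit epoch spacing)."""
    return sum((a + b) / 2 for a, b in zip(acc, acc[1:]))


def epochs_to_threshold(acc, threshold):
    """First epoch (1-indexed) whose accuracy reaches the threshold, else None."""
    for epoch, value in enumerate(acc, start=1):
        if value >= threshold:
            return epoch
    return None


def summarize(acc, threshold=0.80):
    """Collect the three metrics for one training run."""
    return {
        "final_accuracy": acc[-1],
        "epochs_to_threshold": epochs_to_threshold(acc, threshold),
        "aulc": area_under_curve(acc),
    }


# Made-up curves for illustration only, not measured results.
uniform = [0.50, 0.70, 0.80, 0.84, 0.85]
scheduled = [0.55, 0.72, 0.80, 0.84, 0.85]

for name, curve in [("uniform", uniform), ("scheduled", scheduled)]:
    print(name, summarize(curve))
```

Comparing runs on all three numbers at once matters because an intervention can nudge the early part of the curve, and so the area under it, without changing where the run ends up.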

Across two text classification datasets and several intervention schemes, any apparent improvement collapsed back to the uniform-training baseline. In some cases the shape of the learning curve shifted slightly under intervention, but never enough to produce reliable gains.

I went back and checked for the usual suspects — embedding bugs, normalization mistakes, baseline errors. The result held. Across three classifier scales, multiple seeds, and five different intervention designs, the same flat line.

So I sat with it and tried to understand why.

Why does this happen?

I don't have a complete mechanistic story. But here's the picture that fits the evidence.

Training dynamics are robust to data-level intervention in a way that's easy to underestimate. The optimizer is doing a lot of work that the per-sample view misses.

Consider gradient averaging. Inside a mini-batch, the per-sample gradients are averaged before any parameter update happens. Per-sample weights turn that average into a weighted average, and unless the weights are extreme or strongly aligned with particular gradient directions, the weighted average points almost the same way as the plain one. Much of the intended effect is smoothed away before the optimizer ever sees it.
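A toy illustration of the averaging point, using synthetic gradients rather than anything from the actual experiments. The batch size, dimension, noise level, and the "mild" weight range of 0.8 to 1.2 are all assumptions chosen for the example.

```python
import math
import random


def mean_gradient(grads, weights=None):
    """Weighted mean of per-sample gradients; weights are rescaled to average 1."""
    n = len(grads)
    if weights is None:
        weights = [1.0] * n
    scale = n / sum(weights)
    dim = len(grads[0])
    return [
        sum(w * scale * g[d] for w, g in zip(weights, grads)) / n
        for d in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)
batch_size, dim = 64, 16

# Synthetic per-sample gradients: a shared direction plus independent noise.
grads = [[1.0 + rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]

# Hypothetical mild quality weights, drawn independently of the gradients.
weights = [rng.uniform(0.8, 1.2) for _ in range(batch_size)]

uniform_step = mean_gradient(grads)
weighted_step = mean_gradient(grads, weights)
similarity = cosine(uniform_step, weighted_step)
print(f"cosine(uniform, weighted) = {similarity:.4f}")
```

Once the weights are normalized to average 1, the weighted batch gradient differs from the plain one only by the covariance between the weights and the per-sample gradients. In this toy setup that covariance is small, so the two update directions nearly coincide. Weights that track gradient direction closely, or much more aggressive weights, would move the update further.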

Then there's the question of direction. Whether a sample's gradient actually pushes the model somewhere useful depends on the model's current position. A sample that would help from one starting point can be useless from another, even if the sample itself hasn't changed.

And there's a temporal issue. A sample that looks redundant in embedding space at epoch 3 may be exactly what stabilizes a decision boundary at epoch 12. The same data point plays different roles at different points in training.

The model's current state determines what each sample contributes, and that state changes faster than any static or slowly-updated valuation can track.

Data value isn't a property of data. It's a property of the interaction between data and model. And the interaction is moving.

On this picture, static scheduling fails because it commits before knowing what the model will need. Dynamic rescoring fails because the rescoring always lags the model. Gradient weighting fails because batch averaging absorbs most of the reweighting. The control levers exist, but they're pushing on a system that has its own dynamics.

A different question

After sitting with the null result, I asked a different question:

If these signals describe the dataset accurately but can't control training, what can they tell us about the training run itself?

This turned out to be more productive.

Per-sample signals — combined with how a run is unfolding — carry information about where it's going. Not how to push it somewhere, but where it's heading on its own. Things like: is the model still in the rapid-learning regime, or already deep into diminishing returns? Will another ten epochs help, or is this run approximately at its ceiling?

I didn't expect this to work. I started looking at it as a way to get some use out of the signal infrastructure I'd built. It worked better than the control experiments ever did, by a wide margin, and the result held up under the validation checks I ran.

Same signals. Different question. Different answer.
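To give a flavor of what "is this run approximately at its ceiling?" can look like as an estimation problem, here is a deliberately simplified sketch. It uses only the accuracy curve and assumes gains shrink geometrically from epoch to epoch (Aitken's delta-squared extrapolation). It is not the predictor described in this post, which also draws on per-sample signals; the function name and the example curve are hypothetical.

```python
def forecast_ceiling(acc):
    """Extrapolate the plateau of a learning curve from its last three points.

    Assumes per-epoch gains decay geometrically. Returns None when the last
    two gains do not look like geometric decay.
    """
    if len(acc) < 3:
        return None
    x0, x1, x2 = acc[-3:]
    d1, d2 = x1 - x0, x2 - x1
    if d1 <= 0 or d2 <= 0 or d2 >= d1:
        return None  # flat, noisy, or still accelerating: no stable estimate
    ratio = d2 / d1
    # Sum of the remaining geometric series of gains, added to the last value.
    return x2 + d2 * ratio / (1 - ratio)


# Made-up curve that approaches 0.9 with gains halving every epoch.
curve = [0.9 - 0.4 * 0.5 ** t for t in range(6)]
ceiling = forecast_ceiling(curve)
remaining_gain = ceiling - curve[-1]
print(f"estimated ceiling: {ceiling:.4f}, remaining gain: {remaining_gain:.4f}")
```

A real curve is noisier than this, so a three-point extrapolation would be fragile in practice. The point is only that the question "how much is left?" can be answered from the trajectory without intervening in it.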

A distinction worth making

Optimization frames training as a problem to be solved better. It produces things like better samplers, schedulers, curricula, data selection.

Estimation frames training as a process to be observed and predicted. It produces things like trajectory predictors, plateau forecasts, ROI estimates.

These look similar. They use overlapping signals. But the failure mode of the first is exactly what makes the second work: training is robust to interventions because it has consistent, structured dynamics, and structured dynamics are what make prediction possible.

They're easy to confuse, and I spent a fair amount of time pushing on the wrong lever before realizing I'd mixed them up.

What I want to look at next

A few threads I'd like to follow up on:

What training trajectories look like, and why their shape carries predictive signal.
How to validate prediction results on small datasets without fooling yourself — permutation tests, counter-tests for confounds, leave-one-out as a sanity check on k-fold optimism. This is the part I was most worried about getting wrong, so it's where I want to be most careful.
What this reframe means for early stopping, which looks like a solved version of the same problem but isn't.
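As a preview of the validation thread, here is a minimal sketch of a permutation test for a prediction result. It asks how often shuffled targets score at least as well as the real ones. The choice of Pearson correlation as the score, the function names, and the eight example runs are assumptions for illustration, not results from the experiments.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def permutation_p_value(predicted, actual, n_permutations=2000, seed=0):
    """One-sided p-value: how often shuffled targets score at least as well."""
    rng = random.Random(seed)
    observed = pearson(predicted, actual)
    shuffled = list(actual)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pearson(predicted, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed ordering as one of the permutations.
    return (hits + 1) / (n_permutations + 1)


# Made-up numbers: predicted vs. actual final accuracy for eight runs.
predicted = [0.71, 0.74, 0.78, 0.80, 0.83, 0.85, 0.88, 0.90]
actual = [0.70, 0.76, 0.77, 0.82, 0.81, 0.86, 0.87, 0.91]

p_value = permutation_p_value(predicted, actual)
print(f"permutation p-value: {p_value:.4f}")
```

On small datasets the number of distinct permutations is itself limited, which caps how small the p-value can get. That is one reason to pair a test like this with the confound counter-tests and leave-one-out checks mentioned above.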

If you've seen similar behavior — signals that look good descriptively but fail when you try to use them as control variables — I'd be interested to hear about it. Especially in domains other than supervised text classification, which is where I've been working.

Short version of what I took from this: when something measures your system well but won't change it, that's not a failure. It's a sign you're asking the wrong question.
