Amrit Kumar

The operating envelope of data pruning: what holds, and what breaks

Notes from a multi-dataset NLP study on how aggressively training data can be trimmed before accuracy starts to slip. Anyone who's trained a model at production scale has done some version of the math: GPU-hours times dollars per hour times how many runs you'll

ML Training Signals Are Better Predictors Than Controllers

What I learned trying to use per-sample data quality signals to control machine learning training — and what they appear to do well instead. There's a common idea in machine learning that goes something like this: If we could measure what each data point contributes to learning, we