How I think about feature leakage
A short checklist I run through to catch data leakage before it inflates my validation scores.
- machine learning
- evaluation
- best practices
Leakage is when information that wouldn’t be available at prediction time sneaks into training, making validation scores look great and production results look terrible. My quick checklist:
- Split first, transform second. Fit scalers, encoders, and imputers on the training split only, then apply to validation/test.
- Watch time. For anything temporal, split by time, not randomly: train on earlier rows and validate on later ones. No peeking at the future.
- Group leakage. If rows share an entity (a patient, a user), keep all of an entity’s rows in the same split.
- Target-derived features. Be suspicious of any feature computed from the target (target encoding, aggregates over labels) unless it is computed out-of-fold, so a row’s own label never feeds its own feature value.
- Suspiciously high scores. If a metric looks too good, assume leakage until proven otherwise.
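To make the time and group items concrete, here is a minimal standard-library sketch. The helper names (`group_split`, `time_split`), the row layout, and the toy data are all hypothetical and only for illustration. A real project would more likely lean on its ML library's own splitters.

```python
import random


def group_split(rows, group_key, test_fraction=0.2, seed=0):
    """Split rows so each group lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


def time_split(rows, time_key, cutoff):
    """Train on rows strictly before the cutoff, evaluate on rows at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


# Hypothetical toy data: 6 users, 4 events each, one event per day.
rows = [
    {"user": f"u{u}", "day": d, "label": (u + d) % 2}
    for u in range(6)
    for d in range(4)
]

train, test = group_split(rows, "user", test_fraction=0.33)
# No user appears on both sides of the split.
assert not {r["user"] for r in train} & {r["user"] for r in test}

past, future = time_split(rows, "day", cutoff=3)
# Every training row precedes every evaluation row.
assert max(r["day"] for r in past) < min(r["day"] for r in future)
```

The two assertions at the end are the useful part: they are cheap checks I can drop into any project to confirm that a split respects entities and time, whatever tool produced it.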
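The cheapest fix for preprocessing leakage is wrapping the transforms in a pipeline, so they are refit on the training portion of each fold and never see the held-out rows. It does not replace time-aware or group-aware splitting.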
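Here is a rough sketch of that pipeline idea, assuming scikit-learn (the library, the synthetic data, and the choice of imputer, scaler, and model are my assumptions, not requirements). The point is only that cross-validation refits the whole pipeline inside each fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of values

# Imputer and scaler live inside the pipeline, so each fold fits them
# on that fold's training rows only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())
```

The leaky version of this would call `fit_transform` on the full `X` before cross-validating, which lets medians and scaling statistics from the held-out rows shape the training data.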