How I think about feature leakage

A short checklist I run through to catch data leakage before it inflates my validation scores.

  • machine learning
  • evaluation
  • best practices

Leakage is when information that wouldn’t be available at prediction time sneaks into training, making validation scores look great and production results look terrible. My quick checklist:

  1. Split first, transform second. Fit scalers, encoders, and imputers on the training split only, then apply to validation/test.
  2. Watch time. For anything temporal, split by time, not randomly. No peeking at the future.
  3. Group leakage. If rows share an entity (a patient, a user), keep all of an entity’s rows in the same split.
  4. Target-derived features. Be suspicious of any feature computed using the target (target encoding, aggregates over labels) unless it is computed out-of-fold, so each row's value comes only from labels in other rows' folds.
  5. Suspiciously high scores. If a metric looks too good, assume leakage until proven otherwise.
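
Items 2 and 3 come down to how the split itself is made. Here is a minimal sketch using only the standard library, assuming rows are plain dicts; the function names (`group_split`, `time_split`) and the toy `user`/`day` fields are made up for illustration, not taken from any library.

```python
import random


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so every group lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


def time_split(rows, time_key, cutoff):
    """Train on rows strictly before the cutoff, evaluate on everything after."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


# Hypothetical toy data: 4 users, each observed on days 0, 1, and 2.
rows = [
    {"user": u, "day": d, "label": (u + d) % 2}
    for u in range(4)
    for d in range(3)
]

train, test = group_split(rows, "user")
# No user appears on both sides of the split.
shared_users = {r["user"] for r in train} & {r["user"] for r in test}

past, future = time_split(rows, "day", cutoff=2)
# Every training row is older than every evaluation row.
latest_train_day = max(r["day"] for r in past)
earliest_test_day = min(r["day"] for r in future)
```

In a real project I'd probably reach for a library splitter instead of hand-rolling this, but the invariants to check are the same: no shared entities across splits, and no training row dated after an evaluation row.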

The cheapest fix for the first item is wrapping preprocessing in a pipeline, so transforms are refit on the training portion of each fold and never see the held-out rows.
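
A rough sketch of what that looks like, assuming scikit-learn is the library in use (the post itself does not name one). The synthetic data, the choice of imputer and scaler, and the logistic regression model are all placeholders for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in some missing values

# Imputer and scaler live inside the pipeline, so cross_val_score refits
# them on the training portion of each fold and only applies them to the
# held-out portion.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

The leaky version of this would call `fit_transform` on all of `X` before cross-validating, which lets the held-out rows influence the imputation medians and scaling statistics.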