How I think about feature leakage

Leakage is when information that wouldn’t be available at prediction time sneaks into training, making validation scores look great and production results look terrible. My quick checklist:

Split first, transform second. Fit scalers, encoders, and imputers on the training split only, then apply to validation/test.
Watch time. For anything temporal, split by time, not randomly, so the model never trains on rows that come after the rows it is evaluated on. No peeking at the future.
Group leakage. If rows share an entity (a patient, a user), keep all of an entity’s rows in the same split.
Target-derived features. Be suspicious of any feature computed using the target (target encoding, aggregates over labels) unless it is computed out-of-fold, using only the labels of each fold’s training rows.
Suspiciously high scores. If a metric looks too good, assume leakage until proven otherwise.
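
To make the time and group items concrete, here is a minimal library-free sketch. It assumes rows are plain dicts; the helper names group_split and time_split, the column names, and the toy data are all made up for illustration, not taken from any particular project.

```python
import random


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so every entity lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out whole entities, never individual rows.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


def time_split(rows, time_key, cutoff):
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


# Hypothetical toy data: 4 users, 3 visits each, integer day as the timestamp.
rows = [{"user": u, "day": d, "y": (u + d) % 2} for u in range(4) for d in (1, 5, 9)]

train, test = group_split(rows, "user")
train_t, test_t = time_split(rows, "day", cutoff=9)

# No user appears on both sides of the group split.
print({r["user"] for r in train} & {r["user"] for r in test})
# Every training day precedes every test day in the time split.
print(max(r["day"] for r in train_t), min(r["day"] for r in test_t))
```

For cross-validation rather than a single holdout, scikit-learn’s GroupKFold and TimeSeriesSplit give the same guarantees per fold.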

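The cheapest fix for preprocessing leakage is wrapping the transforms and the model in a single pipeline, so the transforms are refit on the training portion of each fold and never see held-out rows. A pipeline does not fix time or group leakage; those still need a splitter that respects them.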
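
A minimal sketch of that pipeline idea, assuming scikit-learn and NumPy are installed. The synthetic data, the choice of median imputation, and the logistic regression model are placeholders for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): 200 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
# Knock out about 5% of the values so the imputer has something to do.
X[rng.random(X.shape) < 0.05] = np.nan

# Imputer, scaler, and model travel together as one estimator.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)

# cross_val_score clones the pipeline for each fold and fits it on that
# fold's training rows only, so medians, means, and variances are never
# computed from held-out rows.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

The leaky version to avoid is calling fit_transform on all of X before cross-validating, because the fitted statistics would then include the held-out rows.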