This applies to transformations such as normalization, standardization, and TF-IDF.
For example, suppose you have numerical data and apply MinMax normalization before splitting. The scaler has now seen all of the data, so every test value is guaranteed to fall in the range 0-1, a guarantee that may not hold for real, unseen data.
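The point above can be shown with a minimal sketch. The text does not name a library, so scikit-learn's MinMaxScaler is an assumed stand-in, and the sample values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# All available data (train and test together).
data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Leaky approach: fit the scaler on ALL data before splitting.
scaler = MinMaxScaler().fit(data)
scaled = scaler.transform(data)

# Every value, including the rows we would later call "test", now sits in [0, 1].
print(scaled.min(), scaled.max())  # 0.0 1.0

# A genuinely new value outside the fitted range breaks that assumption.
new_point = scaler.transform(np.array([[80.0]]))
print(new_point)  # [[1.75]] -- outside 0-1
```

The test set looking perfectly scaled here is an illusion created by fitting on data the model should never have seen.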
Consider another situation where you have text data and apply TF-IDF before splitting into train and test sets. The fitted TF-IDF has now seen every word in your data, including the test portion you will later evaluate on. In real life, upcoming text can contain entirely new words, or word frequencies can differ from the data you trained on.
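A quick way to see this leakage is to inspect the fitted vocabulary. This sketch assumes scikit-learn's TfidfVectorizer, and the documents and the word "llama" are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "the dog ran away"]
test_docs = ["an unexpected llama appeared"]

# Leaky: fitting on train + test lets the vocabulary include test-only words.
leaky = TfidfVectorizer().fit(train_docs + test_docs)
print("llama" in leaky.vocabulary_)  # True -- the model has "seen" test words

# Correct: fit only on training documents.
correct = TfidfVectorizer().fit(train_docs)
print("llama" in correct.vocabulary_)  # False

# Transforming test data still works; unknown words simply get zero weight.
X_test = correct.transform(test_docs)
```

The leaky vectorizer's IDF values are also computed from test-set frequencies, which quietly inflates evaluation scores.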
This is called data leakage: information from the test set leaks into training through the transformation, so you end up testing on data the model has effectively already seen. The correct workflow is:
- Split the data into train and test sets.
- Fit the transformation on the training data only (e.g. fit_transform()).
- Apply the same fitted instance to the test data (e.g. transform()).
- Apply the same fitted instance to real upcoming data at inference time (e.g. transform()).
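The steps above can be sketched end to end. scikit-learn is an assumed choice here (the text only names fit_transform()/transform()), and the data is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(-1, 1)

# 1. Split first, before any transformation touches the data.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# 2. Fit the transformation on the training data only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the SAME fitted instance on the test data.
X_test_scaled = scaler.transform(X_test)

# 4. Reuse it again on real upcoming data at inference time.
X_new = np.array([[42.0]])
X_new_scaled = scaler.transform(X_new)
```

Note that only the training data is guaranteed to land exactly in [0, 1]; test or new values may fall outside that range, which is exactly the honest behaviour you want your evaluation to reflect.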