Data Leakage: Do Not Apply Transformations Before Splitting into Training and Test Sets

This applies to any transformation that learns parameters from the data, such as normalization, standardization, and TF-IDF.

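For example, suppose you have numerical data and apply min-max normalization before splitting. The normalization computes its minimum and maximum from all of the data, including the test rows, so every test value is guaranteed to fall in the range 0 to 1. In a real deployment that guarantee does not hold: new values can fall outside the range seen during training.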
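The effect can be seen with a few numbers. This is a minimal sketch in plain Python; the helper names `fit_min_max` and `apply_min_max` and the sample values are made up for illustration and are not from any library.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the given values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values using bounds that were learned earlier."""
    return [(v - lo) / (hi - lo) for v in values]


train = [10.0, 20.0, 30.0, 40.0]
test = [5.0, 50.0]  # illustrative values outside the training range

# Leaky: bounds are learned from train and test together.
lo_all, hi_all = fit_min_max(train + test)
leaky_test = apply_min_max(test, lo_all, hi_all)

# Correct: bounds are learned from the training data only.
lo_train, hi_train = fit_min_max(train)
clean_test = apply_min_max(test, lo_train, hi_train)

print(leaky_test)  # both values sit inside [0, 1]
print(clean_test)  # one value below 0, one above 1
```

With the leaky version the test values look perfectly in range, which hides the fact that the model will meet out-of-range inputs in practice.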

Consider another situation where you have text data and fit TF-IDF before splitting it into train and test sets. The fitted TF-IDF has now seen every word in your data, including the words in the test set you will later evaluate on. In real life, incoming text can contain entirely new words, and word frequencies can differ from those in the data you trained on.
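A small sketch of the same problem for text, again in plain Python. The function `fit_idf` is a hypothetical helper, the documents are invented, and the smoothed IDF formula is one common variant chosen here only for illustration.

```python
import math


def fit_idf(docs):
    """Learn a vocabulary and smoothed IDF weights from the given docs only."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each word once per document.
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        word: math.log((1 + n_docs) / (1 + count)) + 1
        for word, count in doc_freq.items()
    }


train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the parrot sat"]  # "parrot" never appears in training

idf_leaky = fit_idf(train_docs + test_docs)  # fitted before splitting
idf_clean = fit_idf(train_docs)              # fitted on training data only

print("parrot" in idf_leaky)
print("parrot" in idf_clean)
print(idf_leaky["sat"], idf_clean["sat"])
```

Fitting on everything puts the test-only word into the vocabulary and also shifts the weight of words such as "sat", so the training features already carry information about the test set.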

This is called data leakage: information from the test set leaks into the training process, so you are effectively testing on data your pipeline has already seen.

Right way:

  • Split the data into train and test sets.
  • Fit the transformation on the training data only (e.g. fit_transform()).
  • Apply that same fitted instance to the test data (e.g. transform()).
  • Use the same instance fitted at training time on real incoming data (e.g. transform()).
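The steps above can be sketched as follows. `Standardizer` is a hypothetical stand-in written for this example; it only mimics the fit_transform() / transform() convention mentioned in the list, and the data, the split ratio, and the new value 25.0 are assumptions.

```python
import random
import statistics


class Standardizer:
    """Hypothetical scaler that follows the fit / transform convention."""

    def fit(self, values):
        # Parameters are learned here and nowhere else.
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Reuses the stored parameters; learns nothing new.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


random.seed(0)
data = [float(x) for x in range(1, 21)]
random.shuffle(data)

# 1. Split first.
train, test = data[:15], data[15:]

# 2. Fit on the training data only.
scaler = Standardizer()
train_scaled = scaler.fit_transform(train)

# 3. Apply the same fitted instance to the test data.
test_scaled = scaler.transform(test)

# 4. Apply the same instance to new, incoming data.
new_scaled = scaler.transform([25.0])
```

In practice the fitted instance is usually saved together with the model, so that production inputs go through exactly the transformation the model was trained with.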
