Daily Data Science Tip #11

Why do we shuffle the dataset?

--

During training, if a neural network is fed unshuffled data (for example, data sorted by class), it tends to learn features that are closely tied to the class it sees first. This makes it harder for the optimisation algorithm to find a solution that works well across the entire dataset.

By shuffling the dataset, we ensure two key things:

1. There is enough variety in the order the data is presented, and so within each batch, for each data point in the training data to have an independent effect on the network.

2. Our validation partition is usually split off from the training data; if we fail to shuffle the dataset before the split, the validation set may not be representative of the training data. For example, with data sorted by class, it could contain samples from only one class.
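
The second point can be illustrated with a minimal sketch in Python using NumPy. The toy dataset, the helper name `split_train_validation`, and its parameters are assumptions made up for this illustration, not part of any library. The sketch compares the classes that end up in the validation partition with and without shuffling before the split.

```python
import numpy as np

# Hypothetical toy dataset, sorted by class:
# 50 samples of class 0 followed by 50 samples of class 1.
features = np.arange(100, dtype=np.float32).reshape(100, 1)
labels = np.repeat([0, 1], 50)


def split_train_validation(x, y, validation_fraction=0.2, shuffle=True, seed=0):
    """Illustrative helper: split x and y into training and validation partitions."""
    n = len(x)
    indices = np.arange(n)
    if shuffle:
        # Permute the indices so features and labels stay aligned.
        rng = np.random.default_rng(seed)
        indices = rng.permutation(n)
    n_val = int(n * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]


# Without shuffling, the validation set is just the first 20 rows of the sorted data.
_, _, _, y_val_sorted = split_train_validation(features, labels, shuffle=False)

# With shuffling, validation samples are drawn from across the whole dataset.
_, _, _, y_val_shuffled = split_train_validation(features, labels, shuffle=True)

print("Unshuffled validation classes:", np.unique(y_val_sorted))
print("Shuffled validation classes:", np.unique(y_val_shuffled))
```

In the unshuffled case the validation partition holds only the class that appears first in the sorted data, so any validation metric would say nothing about the other class. Most training frameworks also reshuffle the training data at every epoch, which addresses the first point in the same way.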

--

Richmond Alake

Machine Learning Content Creator with 1M+ views. Computer Vision Engineer. Interested in gaining and sharing knowledge on Technology and Finance.