The Environment Shifts (Machine Learning and Shifting Variables)

in machinelearning •  6 years ago 

Organisations developing machine learning models will, in general, have a business objective in mind when they set about creating them. Normally, the end goal for the data science team and the model is relatively clear, so the task becomes one of ensuring that the data is of sufficient quality and quantity to train the model, leaving it to the data scientist to determine the most appropriate tools and frameworks to use.

As best practice, a portion of the data should also be held back to support validation and testing, perhaps using a k-fold cross-validation technique, in which the data is split into k parts and each part takes a turn as the held-back set. The aim of validation and testing before deployment should be obvious to most, but for non-data scientists, it is worth mentioning that models can suffer from a problem called “overfitting”, which is when a model or function is fit too closely to a limited set of data points. In effect, the model has memorised the training data and is great at recalling it, but fails when it comes to making useful predictions on new data.
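To make the overfitting point concrete, here is a minimal sketch of k-fold cross-validation using only the Python standard library. The synthetic data, the "memoriser" model and all function names are assumptions made up for illustration; in practice a data science team would more likely reach for an established library.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b

def fit_memoriser(xs, ys):
    """Hypothetical overfit model: a lookup table of the training points.
    Unseen inputs fall back to the training mean."""
    table = dict(zip(xs, ys))
    fallback = mean(ys)
    return lambda x: table.get(x, fallback)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def cross_validate(fit, xs, ys, k=5):
    """Return (mean training MSE, mean held-out MSE) across k folds."""
    train_scores, test_scores = [], []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        train_scores.append(mse(model, [xs[i] for i in train], [ys[i] for i in train]))
        test_scores.append(mse(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return mean(train_scores), mean(test_scores)

# Synthetic data (an assumption for this sketch): y = 2x plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

for name, fit in [("memoriser", fit_memoriser), ("line", fit_line)]:
    train_mse, test_mse = cross_validate(fit, xs, ys)
    print(f"{name}: train MSE={train_mse:.3f}, held-out MSE={test_mse:.3f}")
```

The memoriser recalls its training points perfectly but does far worse on the held-back folds, whereas the straight-line fit scores about the same on both. A large gap between training and held-out error is the typical warning sign of overfitting.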

Assuming your data science team has carried out the right steps to verify and test the model, you then face the next challenge of machine learning: what you are trying to predict might sit within a changing environment, or be subject to a relationship that changes over time. These moving targets are commonly called “shifts.”

Models can typically suffer from three forms of shift, any of which means that the model will require some retraining. These are:

  • Covariate Shift: This refers to a change in the distribution of the input variables used in the model, while the relationship between those inputs and the outcome stays the same. In the real world, an example might be a medical researcher trying to predict the health needs of a town and finding that the population make-up of the town has shifted, because new house building has brought in incomers with different age profiles, ethnicities, dietary preferences and incomes. For this reason, this type of shift is also called population drift. A similar example can arise when filtering spam email: over time, the format and nature of these emails may change in much the same way as the town’s population.

  • Prior Probability Shift: This is when there is a change in the distribution of the outcome (the target variable) rather than in the inputs. It may be observed as a difference in the outcomes between the training data set and the test data set. An example of this is where a health model is trained with data for one town and is then applied to another, but it turns out that, despite the towns having similar input distributions (e.g. age, ethnicity, wealth, education levels etc.), the outcome distribution is different. This may be due to hidden features not captured in the data.

  • Concept Drift: This type of shift is when the relationship between the inputs and the target variable changes over time, so that the same inputs no longer lead to the same outcomes. These changes can often impact the distribution in unforeseen ways. The clearest examples are changes in consumer behaviour due to changes in fashion, but it can also be due to cyclical matters such as the run-up to Black Friday, or to another event such as the World Cup.
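As an illustration of how the first of these shifts might be spotted in practice, the sketch below compares one input feature in the training data with the same feature in newly arriving data, using a two-sample Kolmogorov-Smirnov statistic written with the standard library only. The town "ages", the sample sizes and the function names are assumptions invented for this example, and the threshold is the usual large-sample approximation rather than an exact test.

```python
import bisect
import math
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_threshold(n, m, alpha=0.05):
    """Approximate critical value for large samples of sizes n and m."""
    return math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))

# Hypothetical data: ages in the town when the model was trained,
# and ages after younger incomers have arrived.
rng = random.Random(7)
train_ages = [rng.gauss(45, 12) for _ in range(500)]
live_ages = [rng.gauss(35, 12) for _ in range(500)]

shifted_gap = ks_statistic(train_ages, live_ages)
threshold = ks_threshold(len(train_ages), len(live_ages))

print(f"KS statistic: {shifted_gap:.3f}, threshold: {threshold:.3f}")
if shifted_gap > threshold:
    print("Input distribution appears to have shifted; consider retraining.")
```

A check like this only looks at the inputs, so it speaks to covariate shift. Prior probability shift and concept drift involve the outcomes, and detecting them generally requires fresh labelled data to compare against the model's predictions.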

For the data scientist, it is important to understand the above shifts, as well as how any handling of the data might impact results: for instance, selection bias, or errors in the input data, such as movie reviews on a website being contaminated with computer game reviews.

Solving the above challenges is very much the role of the data scientist, but what the commissioning company or organisation needs to acknowledge is that machine learning and deep learning models are not static pieces of code to be run for years without change. The changing nature of the data, and of the distributions within it, means that investment in data science and the supporting infrastructure and software is an ongoing activity and commitment. But for those that succeed, the rewards are significant!
