
The dream of democratising ML has never seemed closer! A tutorial on how to get labelled data for your data-hungry models.
Note: Snorkel offers labelling functions and models that could easily be replaced by a few lines of Python code. The main advantages of using Snorkel are its optimisation and its simple, flexible interface.
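To make the "few lines of Python" point concrete, here is a minimal sketch of the same idea without Snorkel at all. Everything in it is an assumption made up for illustration: the label values (ABSTAIN, SPAM, HAM), the two rules, the `apply_lfs` helper and the toy messages. Each function votes for a label or abstains, and the votes are collected into a matrix with one row per data point and one column per function.

```python
from types import SimpleNamespace

# Hypothetical label values chosen for this sketch
ABSTAIN = -1
HAM = 0
SPAM = 1


def lf_contains_offer(x):
    # Vote SPAM when the text mentions "offer", otherwise abstain
    return SPAM if "offer" in x.text.lower() else ABSTAIN


def lf_short_message(x):
    # Vote HAM for very short messages, otherwise abstain
    return HAM if len(x.text.split()) < 4 else ABSTAIN


def apply_lfs(rows, lfs):
    # Build a label matrix: one row per data point, one column per function
    return [[lf(x) for lf in lfs] for x in rows]


# Toy data invented for illustration; each object stands in for one data row
rows = [
    SimpleNamespace(text="Special OFFER just for you, click now"),
    SimpleNamespace(text="See you tomorrow"),
    SimpleNamespace(text="Can we move the meeting to next Friday afternoon?"),
]

label_matrix = apply_lfs(rows, [lf_contains_offer, lf_short_message])
print(label_matrix)
```

What this sketch does not give you is any principled way of combining conflicting or missing votes into a single training label, nor any optimisation for large datasets. Those are the parts where a dedicated library earns its place.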
To come up with nicely tuned and reasonably generalisable models we need tons of labelled data, which is very expensive and time-consuming to obtain. As a result, for many years only large corporations, governments and research institutes were able to get enough data to train ML models. Fortunately, in some domains there are public datasets that have contributed to groundbreaking advances in ML, but they are usually useful only for academic research and basic business applications.
These times are over.
In this tutorial I will show how you can create labelled data for your supervised model with little effort.
In 2016 a project called Snorkel started at Stanford. Its authors made a simple technical bet: that it would increasingly be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed.
Today the project is a success and is widely used by many organizations, most probably because of its simple interface and good optimisation.
In this tutorial I will introduce the labelling functions offered by Snorkel.
The official (and very nice) tutorial on labeling functions is available here: snorkel tutorial link.
The goal here is to show you an intuitive way of thinking about how to apply labels in any scenario. Snorkel offers a very flexible interface, but my personal advice is to start with the basics.
A basic Snorkel labelling function looks as follows:
from snorkel.labeling import labeling_function

@labeling_function()
def labelling_function_name(x):
    # Return LABEL if the condition holds, otherwise ABSTAIN
    return LABEL if "condition" in x.column_name else ABSTAIN
where:
- x — a single data point, i.e. one row of the pandas dataframe
- LABEL — means that this data point will be labeled for this…