A Quick Machine Learning Modelling Tutorial with Python and Scikit-Learn

This notebook goes through a range of common and useful features of the Scikit-Learn library.

It’s long, but it’s called quick because of how vast the Scikit-Learn library is. Covering everything takes [full-blown documentation](https://scikit-learn.org/stable/user_guide.html), which you should read if you ever get stuck.

What is Scikit-Learn (sklearn)?

[Scikit-Learn](https://scikit-learn.org/stable/index.html), also referred to as `sklearn`, is an open-source Python machine learning library.

It’s built on top of NumPy (a Python library for numerical computing) and Matplotlib (a Python library for data visualization).

<img src="../images/sklearn-6-step-ml-framework-tools-scikit-learn-highlight.png" alt="a 6 step machine learning framework along with tools you can use for each step" width="700"/>

Why Scikit-Learn?

Although the field of machine learning is vast, the main goal is finding patterns within data and then using those patterns to make predictions.

And there are certain categories which a majority of problems fall into.

If you’re trying to create a machine learning model to predict whether an email is spam or not spam, you’re working on a classification problem (predicting whether something is one thing or another).

If you’re trying to create a machine learning model to predict the price of houses given their characteristics, you’re working on a regression problem (predicting a number).
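To make the difference concrete, here is a minimal sketch of both problem types. The synthetic datasets from `make_classification` and `make_regression` are stand-ins for real emails or houses, and the choice of `LogisticRegression` and `LinearRegression` is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data stands in for real emails or houses (illustrative assumption)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Classification: the model predicts a discrete label (here 0 or 1)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
class_preds = clf.predict(X_clf[:5])

# Regression: the model predicts a continuous number
reg = LinearRegression().fit(X_reg, y_reg)
number_preds = reg.predict(X_reg[:5])

print("Predicted classes:", class_preds)
print("Predicted numbers:", number_preds)
```

The classifier returns one of a fixed set of labels, while the regressor returns floating-point values.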

Once you know what kind of problem you’re working on, there are similar steps you’ll take for each: splitting the data into different sets (one for your machine learning algorithm to learn on and another to test it on), choosing a machine learning model, and then evaluating whether or not your model has learned anything.

Scikit-Learn offers Python implementations for all of these kinds of tasks, saving you from having to build them from scratch.
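As a rough sketch of those steps, the example below splits a dataset, fits a model and evaluates it. The built-in iris dataset, the 80/20 split and the `RandomForestClassifier` are assumptions chosen for illustration, not the only options.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: X holds the features, y holds the labels
X, y = load_iris(return_X_y=True)

# Split the data: 80% for the model to learn on, 20% held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Choose a model and fit it to the training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on data the model has not seen (mean accuracy for classifiers)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Scoring on the held-out test set, rather than the training set, is what tells you whether the model has learned patterns that generalize.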

What does this notebook cover?

The Scikit-Learn library is very capable. However, learning everything off by heart isn’t necessary. Instead, this notebook focuses on some of the main use cases of the library.

More specifically, we’ll cover:

<img src="../images/sklearn-workflow-title.png" alt="a 6 step scikit-learn workflow"/>

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right machine learning estimator/algorithm/model for your problem
3. Fitting your chosen machine learning model to data and using it to make a prediction
4. Evaluating a machine learning model
5. Improving predictions through experimentation (hyperparameter tuning)
6. Saving and loading a pretrained model
7. Putting it all together in a pipeline
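As a small preview of steps 6 and 7, the sketch below chains preprocessing and a model in a `Pipeline`, saves it with `joblib` and loads it back. The iris dataset, the `StandardScaler` and `LogisticRegression` steps, and the file name `iris_pipeline.joblib` are all assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A pipeline runs each step in order: scale the features, then fit the model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Save the fitted pipeline to a temporary folder (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "iris_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Load it back and check it makes the same predictions as the original
loaded_pipeline = joblib.load(model_path)
same_predictions = bool(
    (loaded_pipeline.predict(X_test) == pipeline.predict(X_test)).all()
)
print("Loaded pipeline matches original:", same_predictions)
```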

**Note:** all of the steps in this notebook are focused on **supervised learning** (having data and labels).

After going through it, you’ll have the base knowledge of Scikit-Learn you need to keep moving forward.

Where can I get help?
If you get stuck or think of something you’d like to do which this notebook doesn’t cover, don’t fear!

The recommended steps you take are:
1. **Try it**: Since Scikit-Learn has been designed with usability in mind, your first step should be to use what you know and try to figure out the answer to your own question (getting it wrong is part of the process). If in doubt, run your code.
2. **Press SHIFT+TAB**: You can see the docstring of a function (information on what the function does) by pressing **SHIFT + TAB** inside it. Doing this is a good habit to develop. It’ll improve your research skills and give you a better understanding of the library.
3. **Search for it**: If trying it on your own doesn’t work, try searching for your problem, since someone else has probably tried to do something similar. You’ll likely end up in one of two places:
* [Scikit-Learn documentation/user guide](https://scikit-learn.org/stable/user_guide.html): the most extensive resource you’ll find for Scikit-Learn information.
* [Stack Overflow](https://stackoverflow.com/): the developer Q&A hub, full of questions and answers on problems across a wide range of software development topics. Chances are there’s one related to your problem.

An example of searching for a Scikit-Learn solution might be:

> “how to tune the hyperparameters of a sklearn model”

Searching this on Google leads to the Scikit-Learn documentation for the `GridSearchCV` class: http://scikit-learn.org/stable/modules/grid_search.html

The next steps here are to read through the documentation, check the examples and see if they line up with the problem you’re trying to solve. If they do, **rewrite the code** to suit your needs, run it, and see what the outcomes are.
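For instance, adapting a documentation example to your own model might look something like the sketch below. The dataset, the `RandomForestClassifier` and the small parameter grid are assumptions picked for illustration; in practice you would swap in your own data and the hyperparameters you care about.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameter values to try (a deliberately small, illustrative grid)
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
}

# GridSearchCV fits a model for every combination using 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
)
grid_search.fit(X, y)

print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validated score:", grid_search.best_score_)
```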

4. **Ask for help**: If you’ve been through the three steps above and you’re still stuck, you might want to ask your question on [Stack Overflow](https://www.stackoverflow.com). Be as specific as possible and provide details on what you’ve tried.
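If you’re working outside a notebook, the docstring from step 2 is also available in plain Python. The sketch below uses `train_test_split` purely as an example function; calling `help(train_test_split)` would print the full docstring.

```python
import inspect

from sklearn.model_selection import train_test_split

# inspect.getdoc returns the cleaned-up docstring as a string
docstring = inspect.getdoc(train_test_split)

# The first line of a docstring is usually a one-sentence summary
summary_line = docstring.splitlines()[0]
print(summary_line)

# The signature shows which parameters the function accepts
signature = inspect.signature(train_test_split)
print(signature)
```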

Remember, you don’t have to learn all of the functions off by heart to begin with.

What’s most important is continually asking yourself, “what am I trying to do with the data?”.

Start by answering that question and then practice finding the code which does it.

Let’s get started.
