Getting log loss score in scikit-learn

According to this wiki, “Logarithmic loss measures the performance of a classification model where the prediction input is a probability value between 0 and 1”.

Log loss is a useful measure of how well a machine learning classifier performs. The goal is to minimize it: lower is better, and 0 is a perfect score, meaning the classifier assigned a probability of 1 to the correct class for every sample.

“Log Loss takes into account the uncertainty of your prediction based on how much it varies from the actual label. This gives us a more nuanced view into the performance of our model.”
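To make that concrete, here is a minimal sketch of binary log loss computed by hand. The helper name binary_log_loss and the sample probabilities are made up for illustration; this is not how scikit-learn implements it internally, just the same idea in plain Python.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1).

    p_pred holds the predicted probability of class 1 for each sample.
    Hypothetical helper for illustration only.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so we never take log(0)
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]

# Same labels, three levels of confidence in the predictions
confident_right = binary_log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
unsure = binary_log_loss(y_true, [0.6, 0.4, 0.6, 0.4])
confident_wrong = binary_log_loss(y_true, [0.1, 0.9, 0.1, 0.9])

print(round(confident_right, 3))  # 0.105
print(round(unsure, 3))           # 0.511
print(round(confident_wrong, 3))  # 2.303
```

Note that the "unsure" predictions would all count as correct under plain accuracy, yet they score noticeably worse than the confident ones. Being confidently wrong is punished hardest of all.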

It’s easy to get a log loss score in scikit-learn using sklearn.metrics.log_loss. However, it may not be obvious how to get the predictions from your classifier returned as probability values, which log_loss() needs.

Enter predict_proba(), a method that most scikit-learn classifiers implement. You get your predictions using predict_proba(), and use those to get the log loss score, like so:

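from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

clf = LogisticRegression().fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)
log_loss_score = log_loss(y_test, clf_probs)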

As simple as that!
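If you want something you can run end to end, here is a minimal sketch. The synthetic dataset from make_classification, the train/test split sizes, and max_iter=1000 are my own choices for illustration; swap in your own data and classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for your real data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
clf_probs = clf.predict_proba(X_test)

log_loss_score = log_loss(y_test, clf_probs)
print(log_loss_score)
```

The important detail is that log_loss() gets the output of predict_proba(), not predict(). Hard 0/1 labels from predict() throw away exactly the uncertainty that log loss is meant to measure.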

Installing pandas, scipy, numpy, and scikit-learn on AWS EC2

Most of the development and experimentation I was doing with scikit-learn’s machine learning algorithms was on my local development machine. But eventually I needed to do some heavy-duty model training and cross-validation, which would have taken weeks on my local machine. So I decided to make use of one of the cheaper compute-optimized EC2 instances that AWS offers.

Unfortunately, I had some trouble getting scikit-learn to install on a stock Amazon EC2 Linux instance, but I figured it out eventually. I’m sure others will run into this, so I thought I’d write about it.

Note: you can of course get an EC2 community image or an image from the EC2 marketplace that already has Anaconda or scikit-learn and related tools installed. This guide is for installing it on a stock Amazon EC2 Linux instance, in case you already have an instance set up that you want to use.

scikit-learn depends on numpy and scipy, and you’ll almost certainly want pandas alongside it, so we’ll install all three first. Fortunately, Amazon EC2 Linux comes with Python 2.7 already installed, so you don’t need to worry about that.

Start by ssh’ing into your box. Drop into a root shell with the following command (if you’re going to be typing “sudo” before every single command, you might as well be root by default anyway, right?):

sudo su

First you need to install some development tools, since you will literally be compiling some libraries in a bit. Run the following commands:

yum groupinstall 'Development Tools'
yum install python-devel

Next you’ll install the ATLAS and LAPACK libraries, which are needed by numpy and scipy:

yum install atlas-sse3-devel lapack-devel

Now you’re ready to install the Python libraries that scikit-learn needs, followed by scikit-learn itself:

pip install numpy
pip install scipy
pip install pandas
pip install scikit-learn

Congratulations. You now have scikit-learn installed on the EC2 Linux box!
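To double-check that everything actually imports, you can run a quick sanity check like the sketch below. The function name installed_versions is just something I made up for this example. Note that scikit-learn is imported under the name sklearn, not scikit-learn.

```python
import importlib

def installed_versions(names):
    """Try to import each module; map its name to a version string,
    or to None if the import fails. Hypothetical helper for illustration."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

# scikit-learn's import name is "sklearn"
report = installed_versions(["numpy", "scipy", "pandas", "sklearn"])
for name in sorted(report):
    print("%s: %s" % (name, report[name] or "NOT INSTALLED"))
```

If any line says NOT INSTALLED, rerun the matching pip install command and watch for compiler errors in its output.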