Installing pandas, scipy, numpy, and scikit-learn on AWS EC2

Most of the development/experimentation I was doing with scikit-learn’s machine learning algorithms was on my local development machine. But eventually I needed to do some heavy duty model training / cross validation, which would take weeks on my local machine. So I decided to make use of one of the cheaper compute optimized EC2 instances that AWS offers.

Unfortunately I had some trouble getting scikit-learn to install on a stock Amazon EC2 Linux instance, but I figured it out eventually. I’m sure others will run into this, so I thought I’d write about it.

Note: you can of course get an EC2 community image, or an image from the EC2 marketplace, that already has Anaconda or scikit-learn and related tools installed. This guide is for installing everything on a stock Amazon EC2 Linux instance, in case you already have an instance set up that you want to use.

scikit-learn depends on numpy and scipy, and you’ll almost certainly want pandas alongside it, so all of them get installed here. Fortunately Amazon EC2 Linux comes with Python 2.7 already installed, so you don’t need to worry about that.

Start by ssh’ing into your box. Drop into a root shell with the following command (if you’re going to be typing “sudo” before every single command, you might as well be root by default anyway, right?):

sudo su

First you need to install some development tools, since you will literally be compiling some libraries in a bit. Run the following commands:

yum groupinstall 'Development Tools'
yum install python-devel

Next you’ll install the ATLAS and LAPACK libraries, which are needed by numpy and scipy:

yum install atlas-sse3-devel lapack-devel

Now you’re ready to install the necessary Python libraries first, and finally scikit-learn itself:

pip install numpy
pip install scipy
pip install pandas
pip install scikit-learn
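
Before moving on, it can be worth checking that each package actually imports. The following is a minimal sketch of such a check, not part of the original steps: the helper name installed_versions is made up for illustration, and it only relies on the standard importlib module. Note that scikit-learn is imported under the name sklearn.

```python
import importlib

# Import names to check; scikit-learn is imported as "sklearn".
PACKAGES = ["numpy", "scipy", "pandas", "sklearn"]


def installed_versions(names=PACKAGES):
    """Return {import_name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back gracefully.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print("{:<8} {}".format(name, version or "NOT INSTALLED"))
```

If any line reports NOT INSTALLED, rerun the matching pip install command and read its output for compiler or missing-header errors.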

Congratulations. You now have scikit-learn installed on the EC2 Linux box!

Parallelize Pandas map() or apply()

Pandas is a very useful data analysis library for Python, especially when you need to handle large amounts of data.

Unfortunately Pandas runs on a single thread, and doesn’t parallelize for you. And if you’re doing lots of computation on lots of data, such as for creating features for Machine Learning, it can be pretty slow depending on what you’re doing.

To tackle this problem, you essentially have to break your data into smaller chunks, and compute over them in parallel, making use of the Python multiprocessing library.

Let’s say you have a large Pandas DataFrame:

import pandas as pd

data = pd.DataFrame(...) #Load data

And you want to apply() a function to the data like so:

def work(x):
    # Do something to x and return the result
    return x  # placeholder: returns x unchanged

data = data.apply(work)

What you can do is break the DataFrame into smaller chunks using numpy, and use a Pool from the multiprocessing library to do work in parallel on each chunk, like so:

import numpy as np
from multiprocessing import cpu_count, Pool

cores = cpu_count() #Number of CPU cores on your system
partitions = cores #Define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

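And that’s it. One thing to keep in mind: pool.map() hands each worker a whole chunk (a smaller DataFrame), so the function you pass in should take a chunk and call apply() or map() on it itself. Now you can call parallelize on your DataFrame like so: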

data = parallelize(data, work)

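Run it, and watch your system’s CPU utilization shoot up to 100%! It should finish much faster, depending on how many cores you have. In the ideal case 8 cores would be about 8x faster, though in practice you’ll get somewhat less, because each chunk has to be pickled, sent to a worker process, and the result sent back. Or you could fire up an AWS EC2 instance with 32 cores and get closer to 32x!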
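
To make the chunking idea above concrete, here is a small end-to-end sketch. The column name "text", the add_features function, and the split_frame helper are all made up for illustration. Unlike the snippet above, it splits row positions rather than the DataFrame itself, because passing a DataFrame straight to np.array_split can trigger deprecation warnings on some recent pandas versions (an assumption worth checking against your own version).

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count


def add_features(chunk):
    """Runs inside a worker process; receives one chunk (a DataFrame)."""
    chunk = chunk.copy()
    # Hypothetical feature: length of each string in the "text" column.
    chunk["text_length"] = chunk["text"].apply(len)
    return chunk


def split_frame(data, partitions):
    """Split a DataFrame into row-wise chunks using positional indices."""
    index_groups = np.array_split(np.arange(len(data)), partitions)
    return [data.iloc[idx] for idx in index_groups]


def parallelize(data, func, cores=None):
    """Apply func to each chunk in a separate process and stitch the results."""
    cores = cores or cpu_count()
    chunks = split_frame(data, cores)
    pool = Pool(cores)
    try:
        return pd.concat(pool.map(func, chunks))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    data = pd.DataFrame({"text": ["a", "bb", "ccc", "dddd", "eeeee"]})
    print(parallelize(data, add_features, cores=2))
```

The worker function is defined at module level so that multiprocessing can pickle it, and the __main__ guard keeps worker processes from re-running the driver code on platforms that start workers by re-importing the script.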

Amazon EC2 ssh timeout due to inactivity

Well, this applies to any remote Linux machine, not just EC2. Whether you get disconnected depends on how sshd is configured on the remote server and on how ssh is configured on your local (developer) machine. But in some setups the server drops idle connections pretty quickly, so you have to reconnect often.

This was bothering me for a while. I’m usually on and off a Linux shell on EC2 instances all day, and it seemed that every time I came back to it I’d been timed out and had to reconnect. Not a huge deal, just a nuisance.

To remedy this without changing the remote server’s sshd config, edit the ~/.ssh/config file on your local machine and add the following line. Put it at the top of the file, before any Host entries, so that it applies to every host you connect to:

ServerAliveInterval 50

And it’s as simple as that! It seems that AWS EC2s are set up to time you out at 60 seconds. So a 50 second keep-alive interval prevents you from getting timed out so aggressively.
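
If you set up new machines often, the edit can be scripted. The following is a rough sketch only: ensure_keepalive is a made-up helper, and it assumes that appending a catch-all "Host *" block at the end of the file is acceptable for your setup (settings under more specific Host entries earlier in the file still take precedence). It leaves the file alone if a ServerAliveInterval setting is already present anywhere in it.

```python
from pathlib import Path


def ensure_keepalive(config_path, interval=50):
    """Append a 'Host *' keep-alive block unless one is already configured.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(config_path).expanduser()
    existing = path.read_text() if path.exists() else ""

    # ssh_config keywords are case-insensitive, so compare in lower case.
    if "serveraliveinterval" in existing.lower():
        return False

    if existing and not existing.endswith("\n"):
        existing += "\n"
    block = "Host *\n    ServerAliveInterval {}\n".format(interval)

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(existing + block)
    return True


# Example call (commented out so nothing is modified by accident):
# ensure_keepalive("~/.ssh/config", interval=50)
```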