Parallelize Pandas map() and apply() while accounting for future records

A few blog posts ago, I covered how to parallelize Pandas map() and apply(). You can read more about it at http://blog.adeel.io/2016/11/06/parallelize-pandas-map-or-apply/ … Essentially it works by breaking the data into smaller chunks, and using Python’s multiprocessing capabilities you call map() or apply() on the individual chunks of data, in parallel.

This works great, but what if it’s time series data, and part of the data you need to process each record lies in a future record? For example, if you are tracking the change of price from one moment to what it will be in a moment in the future. In this case the approach I laid out about dividing it into chunks will not work, because as you reach the end of a chunk, you will not have the future records to use.

It turns out that there’s a relatively simple way to do this. Essentially you determine how much in the future you need to go, and include those extra records in each chunk (so some records at the edges are duplicated in chunks), and then drop them at the very end.

So let’s say for each record, you also need records from up to 30 seconds in the future, for your calculation. And each record in your data represents 1 second. So essentially you include 30 extra records in each chunk so they are available for the parallel calculations. And then drop them later.

You start by setting up your parallel processor function like so:

import pandas as pd
import multiprocessing

cpu_count = multiprocessing.cpu_count()

def parallelize(data, split_interval):
    splits = range(0, cpu_count)
    parallel_arguments = []
    for split in splits:
        parallel_arguments.append([split, data, split_interval])
    pool = multiprocessing.Pool(cpu_count)
    data_array = pool.map(worker, parallel_arguments)
    pool.close()
    pool.join()
    final_data = pd.concat(data_array)
    final_data = final_data.groupby(final_data.index).max() #This is where duplicates are dropped.
    return final_data.sort_index()

What you’ve done is defined an array of a tuple of arguments (parameters) that can are iterated over, to spawn each parallel worker. In the tuple we pass a reference to the Pandas DataFrame, and the data chunk the worker function should work on. Note that the worker function returns that chunk, and concatenates it back into a final DataFrame. After doing is, note the groupby() function that is called, this is where we drop the duplicate records at the edges that were included in each chunk.

Here’s what your worker would do to work on its chunk:

def worker(params):
    num = params[0]
    data = params[1]
    split_interval = params[2]
    split_start = num*split_interval
    split_end = ((num+1)*split_interval)+30
    this_data = data.iloc[split_start:split_end].copy()
    # ...do work on this_data chunk, which includes records from 30 seconds in the future
    # Add new columns to this_data, or whatever
    return this_data

Note this line: split_end = ((num+1)*split_interval)+30. In the chunk you’re working on, you’re including the next 30 records, which in this example represent the next 30 seconds that you need in your calculations.

And finally to tie it together, you do:

if __name__ == '__main__':
    data = pd.DataFrame(...) #Load data
    data_count = len(data)
    split_interval = data_count / cpu_count
    final_data = handler(data, split_interval) #This is the data with all the work done on it

Responsive bootstrap horizontal alignment

In order for a web application to render on a variety of devices (mobile, tablet, laptop, desktop), it needs to be “responsive”. Meaning the same website can have specific rules written into JavaScript, HTML, and CSS to make it adapt to the size of the screen the browser is running on.

In this post I’ll cover how to make something align on the screen according to what device the user is on. For example, on a desktop view, you may have some header text that you want to align right horizontally. But for the same page and header text rendered on a mobile phone, you may decide it looks better center aligned horizontally.

Let’s use the above example, and see how we can accomplish this. First let’s define the following CSS:

@media (max-width: 768px) {
  .responsive-text-align {
    text-align: center;
  }
}
@media (min-width: 768px) {
  .responsive-text-align {
    text-align: right;
  }
}

We’ve created a CSS class called responsive-text-align. You’ll notice that it has two settings: max-width: 768px and min-width: 768px. This tells your browser which setting to use when the screen size is either lass than 768px, or greater than 768px. If we assume that a standard mobile screen is 768px across, the max-width: 768px setting will get applied to the webpage when it is rendered on a mobile screen.

You’ll also notice that for the two different screen sizes, we specify different text-align properties. This allows the text alignment to be determined dynamically, or rather, “responsively”.

To use this CSS class, here’s what your HTML would look like (using bootstrap):

<div class="row">
   <div class="col-sm-12 responsive-text-align">Align this text responsively</div>
</div>

And voila! The text will start aligning according to the rules for the responsive-text-align CSS class, based on what screen size the webpage is rendered in.

JasperReports nuances

JasperReports is an engine that can allow you to generate reports in HTML, PDF, or many other formats. When I say reports here, I mean reports that are essentially pieces of paper that convey something meant to be reviewed by someone. Invoices, Statements, Forecasts, or anything that needs to be dynamically presented on paper for review is a good candidate.

One positive about JasperReports is that it’s all Java based, and plugs in well into your Java ecosystem. On the other hand, if you’re just now looking for a reporting engine, I may try to dissuade you from using JasperReports. It seems to have become outdated. There have been few updates in the last couple years (if any), and community support seems to be waning (you only find old blog posts, forums threads, etc). And documentation is also a bit lacking. But if you’re already using JasperReports, this post is for you.

It took me a while to figure how to get JasperReports to do certain things, because again the documentation is weak. So I figured I’d share these insights just in case someone else is struggling with the same thing.

  • Setting the foreground/background color of a field programatically
    This was not obvious at all. Sometimes you need to set the background color or text color of a field dynamically, based on some data value. Let’s say a data field literally has the color in it: for example $F{BACKCOLOR}, and you want to set the background color of a field to the value contained in $F{BACKCOLOR} (say “#0000FF”). In order to accomplish this, you’ll need to edit the properties of the TextField, and set this property:

    net.sf.jasperreports.style.backcolor

    …to this value:

    $F{BACKCOLOR}

And similarly to set the foreground color (text color), set this property for the TextField:

net.sf.jasperreports.style.forecolor

…to a field of your choosing that has the color in it. (Or hard code the color by typing into the value for this property, surrounded by double quotes to specify a constant).

  • JSON queries against your DataSet
    JasperReports seems to play pretty well with JSON. But something that isn’t obvious is a way to query/filter the JSON data within the JasperReports engine, which it is populated into a data set.To demonstrate how to do this, let’s take an example JSON data:

[{“letters”:[{“category”:”A to C”,”data”:[“a”,”b”,”c”]},{“category”:”D to F”,”data”:[“d”,”e”,”f”]}]}]

Or more visually friendly, like so:

JasperReportsJsonExample

Now let’s say you want to limit your JasperReports Data Set to “letters”, and furthermore a certain category. What you need to do is edit the query for the Data Set, and specify the following:

letters(category==A to C)

And that’s it!

  • More to come later

AmazonS3Client to loop through batches of S3 files objects

AWS provides the AmazonS3Client class, which is part of the AWS Java SDK. This class can be used to interact with files in S3.

An important feature to note of the AmazonS3Client is that it limits results to batches of 1000. If you have less than 1000 files, then all is good. You can use amazonS3Client.listObjects(bucketName); and it will provide all the objects in a bucket.

But if the bucket contains more than 1000 files, you will need to loop through the files in batches. This is not entirely obvious and can cause you to miss files (as I certainly did)!

To get started, you would initiate AmazonS3Client like so:

AmazonS3Client amazonS3Client = new AmazonS3Client(new BasicAWSCredentials(KEY, SECRET));

The approach I like to take is to first loop through and collect all the files up front like so:

ObjectListing objectListing = amazonS3Client.listObjects(bucketName);
List<S3ObjectSummary> s3ObjectSummaries = objectListing.getObjectSummaries();
while (objectListing.isTruncated()) 
{
   objectListing = amazonS3Client.listNextBatchOfObjects (objectListing);
   s3ObjectSummaries.addAll (objectListing.getObjectSummaries());
}

Note: if memory is a concern or you have an unlimited number of files, you can simply modify the approach to do whatever you need to with each file as you fetch it in batches from the API, instead of collecting them up front.

If you first collected them in a List up front, you can then loop through each file like so:

for(S3ObjectSummary s3ObjectSummary : s3ObjectSummaries)
{
	String s3ObjectKey = s3ObjectSummary.getKey();
	//Do whatever with s3ObjectSummary

 

Installing pandas, scipy, numpy, and scikit-learn on AWS EC2

Most of the development/experimentation I was doing with scikit-learn’s machine learning algorithms was on my local development machine. But eventually I needed to do some heavy duty model training / cross validation, which would take weeks on my local machine. So I decided to make use of one of the cheaper compute optimized EC2 instances that AWS offers.

Unfortunately I had some trouble getting scikit-learn to install on a stock Amazon’s EC2 Linux, but I figured it out eventually. I’m sure others will run into this, so I thought I’d write about it.

Note: you can of course get an EC2 community image or an image from the EC2 marketplace that already has Anaconda or scikit-learn and tools installed. This guide is for installing it on a stock Amazon EC2 Linux instance, in case you already have an instance setup you want to use.

In order to get scikit-learn to work, you’ll need to have pandas, scipy and numpy installed too. Fortunately Amazon EC2 Linux comes with python 2.7 already installed, so you don’t need to worry about that.

Start by ssh’ing into your box. Drop into rootshell with the following command (if you’re going to be typing “sudo” before every single command, might as well be root by default anyway, right?)

sudo su

First you need to install some development tools, since you will literally be compiling some libraries in a bit. Run the following commands:

yum groupinstall ‘Development Tools’
yum install python-devel

Next you’ll install the ATLAS and LAPACK libraries, which are needed by numpy and scipy:

yum install atlas-sse3-devel lapack-devel

Now you’re ready to install first all the necessary python libraries and finally scikit-learn:

pip install numpy
pip install scipy
pip install pandas
pip install scikit-learn

Congratulations. You now have scikit-learn installed on the EC2 Linux box!

Parallelize Pandas map() or apply()

Pandas is a very useful data analysis library for Python. It can be very useful for handling large amounts of data.

Unfortunately Pandas runs on a single thread, and doesn’t parallelize for you. And if you’re doing lots of computation on lots of data, such as for creating features for Machine Learning, it can be pretty slow depending on what you’re doing.

To tackle this problem, you essentially have to break your data into smaller chunks, and compute over them in parallel, making use of the Python multiprocessing library.

Let’s say you have a large Pandas DataFrame:

import pandas as pd

data = pd.DataFrame(...) #Load data

And you want to apply() a function to the data like so:

def work(x):
    # Do something to x
    # return something

data = data.apply(work)

What you can do is break the DataFrame into smaller chunks using numpy, and use a Pool from the multiprocessing library to do work in parallel on each chunk, like so:

import numpy as np
from multiprocessing import cpu_count, Parallel

cores = cpu_count() #Number of CPU cores on your system
partitions = cores #Define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

And that’s it. Now you can call parallelize on your DataFrame like so:

data = parallelize(data, work);

Run it, and watch your system’s CPU utilization shoot up to 100%! And it should finish much faster, depending on how many cores you have. 8 cores should theoretically be 8x faster. Or you could fire up an AWS EC2 instance with 32 cores and run it 32x faster!

Amazon EC2 ssh timeout due to inactivity

Well, this applies to any Linux instance that you may be remotely connected to, depending on how sshd is configured on the remote server. And depending on how your localhost (developer machine) ssh config is done. But essentially in some instances the sshd host you’re connecting to times you out pretty quickly, so you have to reconnect often.

This was bothering me for a while. I usually am off and on all day on Linux shell on EC2 instances. And it seemed every time I come back to it, I’d be timed out, causing me to have to reconnect. Not a huge deal, just a nuisance.

To remedy this, without changing the settings on the remote server’s sshd config, you can add the following line to your localhost ssh config. Edit ~/.ssh/config file and add the following line:

ServerAliveInterval 50

And it’s as simple as that! It seems that AWS EC2s are set up to time you out at 60 seconds. So a 50 second keep-alive interval prevents you from getting timed out so aggressively.