Pandas is a powerful data analysis library for Python, well suited to working with large amounts of data.
Unfortunately, Pandas runs on a single thread and doesn't parallelize for you. If you're doing heavy computation over a lot of data, such as creating features for Machine Learning, it can be quite slow depending on what you're doing.
To tackle this problem, you essentially have to break your data into smaller chunks and compute over them in parallel, using the Python multiprocessing library.
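The splitting step can be sketched on its own. Here I split the row positions with `np.array_split` and slice with `iloc` (a version-proof variant of splitting the DataFrame directly); the 10-row frame is purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # small example frame

# Split the row positions into 4 roughly equal groups
position_chunks = np.array_split(np.arange(len(df)), 4)

# Slice the DataFrame into chunks with iloc
chunks = [df.iloc[idx] for idx in position_chunks]

print([len(c) for c in chunks])  # [3, 3, 2, 2]
```

Note that `np.array_split` tolerates uneven division, so you don't need the row count to be a multiple of the chunk count.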
Let’s say you have a large Pandas DataFrame:
```python
import pandas as pd

data = pd.DataFrame(...)  # Load data
```
And you want to apply() a function to the data like so:
```python
def work(x):
    # Do something to x
    # and return something
    ...

data = data.apply(work)
```
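By default, `DataFrame.apply` passes each column (a Series) to the function. As a concrete stand-in for `work`, here is a min-max scaler applied column by column; the data and the transformation are made up for illustration:

```python
import pandas as pd

def work(col):
    # Illustrative transformation: scale each column to [0, 1].
    # Substitute your real feature logic here.
    return (col - col.min()) / (col.max() - col.min())

data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})
data = data.apply(work)
print(data)
```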
What you can do is break the DataFrame into smaller chunks using numpy, and use a Pool from the multiprocessing library to do work in parallel on each chunk, like so:
```python
import numpy as np
from multiprocessing import cpu_count, Pool

cores = cpu_count()  # Number of CPU cores on your system
partitions = cores   # Define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data
```
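Putting it together as a runnable script: on platforms where multiprocessing uses the spawn start method (Windows, and macOS by default on recent Pythons), worker functions must be defined at module level and the pool must be created under an `if __name__ == "__main__":` guard. In this sketch I also split row positions instead of the frame itself, which behaves the same but avoids relying on NumPy slicing DataFrames directly; the `work` function and the data are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

def work(col):
    # Stand-in per-column transformation (illustrative only)
    return col * 2

def process_chunk(chunk):
    # Runs in a worker process; applies work to one DataFrame chunk
    return chunk.apply(work)

def parallelize(data, func, partitions=None):
    cores = cpu_count()
    partitions = partitions or cores
    # Split row positions, then slice the chunks out of the frame
    splits = np.array_split(np.arange(len(data)), partitions)
    chunks = [data.iloc[idx] for idx in splits]
    with Pool(cores) as pool:
        result = pd.concat(pool.map(func, chunks))
    return result

if __name__ == "__main__":
    data = pd.DataFrame({"x": range(100)})
    data = parallelize(data, process_chunk)
    print(data["x"].iloc[0], data["x"].iloc[-1])  # 0 198
```

Using `Pool` as a context manager closes and joins it automatically, which is equivalent to the explicit `close()`/`join()` calls above.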
And that’s it. Now you can call parallelize on your DataFrame like so:
```python
data = parallelize(data, work)
```
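One subtlety worth noting: `pool.map` hands each worker an entire DataFrame chunk, not a single element, so the function you pass to `parallelize` must accept a DataFrame. If `work` is written to be used via `apply`, wrap it in a small helper (the wrapper name and the sample data are mine), which is what each worker would then invoke on its chunk:

```python
import pandas as pd

def work(col):
    # Stand-in for your real per-column logic (illustrative)
    return col + 1

def apply_work(chunk):
    # Each worker receives a whole DataFrame chunk, so call apply here
    return chunk.apply(work)

# Serial demonstration of the wrapper; in the parallel version you would
# call parallelize(data, apply_work) instead of parallelize(data, work).
chunk = pd.DataFrame({"x": [1, 2, 3]})
print(apply_work(chunk)["x"].tolist())  # [2, 3, 4]
```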
Run it, and watch your system's CPU utilization shoot up toward 100%! It should also finish much faster, depending on how many cores you have: with 8 cores the theoretical ceiling is an 8x speedup, though the overhead of splitting the data and recombining the results means you'll usually see somewhat less. Or you could fire up an AWS EC2 instance with 32 cores and scale even further.