Installing pandas, scipy, numpy, and scikit-learn on AWS EC2

Most of the development/experimentation I was doing with scikit-learn’s machine learning algorithms was on my local development machine. But eventually I needed to do some heavy duty model training / cross validation, which would take weeks on my local machine. So I decided to make use of one of the cheaper compute optimized EC2 instances that AWS offers.

Unfortunately I had some trouble getting scikit-learn to install on a stock Amazon’s EC2 Linux, but I figured it out eventually. I’m sure others will run into this, so I thought I’d write about it.

Note: you can of course get an EC2 community image or an image from the EC2 marketplace that already has Anaconda or scikit-learn and tools installed. This guide is for installing it on a stock Amazon EC2 Linux instance, in case you already have an instance setup you want to use.

In order to get scikit-learn to work, you’ll need to have pandas, scipy and numpy installed too. Fortunately Amazon EC2 Linux comes with python 2.7 already installed, so you don’t need to worry about that.

Start by ssh’ing into your box. Drop into rootshell with the following command (if you’re going to be typing “sudo” before every single command, might as well be root by default anyway, right?)

sudo su

First you need to install some development tools, since you will literally be compiling some libraries in a bit. Run the following commands:

yum groupinstall ‘Development Tools’
yum install python-devel

Next you’ll install the ATLAS and LAPACK libraries, which are needed by numpy and scipy:

yum install atlas-sse3-devel lapack-devel

Now you’re ready to install first all the necessary python libraries and finally scikit-learn:

pip install numpy
pip install scipy
pip install pandas
pip install scikit-learn

Congratulations. You now have scikit-learn installed on the EC2 Linux box!

Parallelize Pandas map() or apply()

Pandas is a very useful data analysis library for Python. It can be very useful for handling large amounts of data.

Unfortunately Pandas runs on a single thread, and doesn’t parallelize for you. And if you’re doing lots of computation on lots of data, such as for creating features for Machine Learning, it can be pretty slow depending on what you’re doing.

To tackle this problem, you essentially have to break your data into smaller chunks, and compute over them in parallel, making use of the Python multiprocessing library.

Let’s say you have a large Pandas DataFrame:

import pandas as pd

data = pd.DataFrame(...) #Load data

And you want to apply() a function to the data like so:

def work(x):
    # Do something to x
    # return something

data = data.apply(work)

What you can do is break the DataFrame into smaller chunks using numpy, and use a Pool from the multiprocessing library to do work in parallel on each chunk, like so:

import numpy as np
from multiprocessing import cpu_count, Parallel

cores = cpu_count() #Number of CPU cores on your system
partitions = cores #Define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

And that’s it. Now you can call parallelize on your DataFrame like so:

data = parallelize(data, work);

Run it, and watch your system’s CPU utilization shoot up to 100%! And it should finish much faster, depending on how many cores you have. 8 cores should theoretically be 8x faster. Or you could fire up an AWS EC2 instance with 32 cores and run it 32x faster!

Amazon EC2 ssh timeout due to inactivity

Well, this applies to any Linux instance that you may be remotely connected to, depending on how sshd is configured on the remote server. And depending on how your localhost (developer machine) ssh config is done. But essentially in some instances the sshd host you’re connecting to times you out pretty quickly, so you have to reconnect often.

This was bothering me for a while. I usually am off and on all day on Linux shell on EC2 instances. And it seemed every time I come back to it, I’d be timed out, causing me to have to reconnect. Not a huge deal, just a nuisance.

To remedy this, without changing the settings on the remote server’s sshd config, you can add the following line to your localhost ssh config. Edit ~/.ssh/config file and add the following line:

ServerAliveInterval 50

And it’s as simple as that! It seems that AWS EC2s are set up to time you out at 60 seconds. So a 50 second keep-alive interval prevents you from getting timed out so aggressively.

Removing neighboring (consecutive-only) duplicates in a Pandas DataFrame

Pandas, the Python Data Analysis Library, makes it easy to drop duplicates from a DataFrame, using the drop_duplicates() function (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html).

The problem with this is it removes ALL duplicates anywhere in your DataFrame. Depending on what you’re doing, you may not want to get rid of all duplicates everywhere, but only neighboring duplicates. That is, duplicates that are consecutive. But if there’s a duplicate after a non-duplicate row, that’s okay, for your purpose.

For example, you may have the following data:

1 2 3
1 2 3
1 5 5
1 2 3
1 5 5

You only want to get rid of consecutive duplicates (which in this case are only the first two rows), and get this result:

1 2 3
1 5 5
1 2 3
1 5 5

You can accomplish this using the pandas shift() function, which can be used to get the very next element, like so:

data = data.loc[data.shift() != data]

What this does is for every row in the DataFrame, it compares it to the next row. If all columns are equal to the columns in the next row, the row does not get repeated.

Note: this only works if you have simple elements in your DataFrame that can be checked to be equivalent (in the example above all elements are integers). Otherwise you’ll need to extend the type of element, and implement an equivalency function.

Get Apache to act as a gzip proxy

Let’s say you want to repeatedly get a large amount of data (text, or something not in an already compressed format) from a RESTful webservice. But the webserver doesn’t compress the data when transferred over HTTP, and you have a slow connection on your end machine (such as your development machine). And so it takes your end client a while to load the data on every iteration, thus slowing down whatever you’re doing.

In this scenario you can use a server in the middle which has a fast connection, to act as a proxy and gzip the data for you. This server could be hosted in the cloud somewhere.

First you’ll need to make sure that Apache is installed on this server, and the firewall allows access to port 80, or 443 if you’re going to be using HTTPS. You’ll need to make sure that the following Apache modules are installed: mod_proxy and mod_deflate.

Let’s say the data from the webservice you’re trying to get to is located at the URL http://www.webservice.com/data/json, and returns a giant block of json data. The configuration for this would be simple. Edit /etc/httpd/conf/httpd.conf (or wherever httpd.conf is according to your system) and add the following:

<IfModule mod_proxy.c>
    ProxyRequests On

    <Proxy *>
        Order deny,allow
        Allow from all
    </Proxy>

    ProxyPass /compressed http://www.webservice.com/data/json
    ProxyPassReverse /compressed http://www.webservice.com/data/json

    <Location /data>
        Order allow,deny
        Allow from all
        AddOutputFilterByType DEFLATE application/json
    </Location>
</IfModule>

And voila! Now when you visit the /compressed path on your middle-man Apache server, it will proxy and compress the json data from the upstream server before it ships it to you. So if your servers’s IP is 1.2.3.4, you’d use the URL http://1.2.3.4/compressed, and that will proxy to http://www.webservice.com/data/json and return the data compressed to your end client with the slow Internet connection, which will load much faster. Let’s say it’s 1mb of regular json text data, which should easily compress to 200kb or so, which will load 5x faster!

Note: you’ll of course need to make sure that the HTTP client you’re using supports gzip compression. If you’re doing this programatically, whatever HTTP API you’re using may allow you to do this. Or you’ll need to manually add the “Accept-Encoding: gzip” header, so the middle-man server knows to compress the data, and whatever content you get back, you’ll need to first decompress manually.

Installing MongoDB on AWS EC2 and turning on zlib compression

At this time AWS doesn’t provide an RDS type for MongoDB. So in order to have a MongoDB server on the AWS cloud, you have to install it manually on an EC2 instance.

The full documentation for installing a MongoDB instance on an AWS EC2 can be seen at: https://docs.mongodb.com/v3.0/tutorial/install-mongodb-on-amazon/. Here’s a quick summary though.

First you’ll need to create a Linux EC2 server. Once you have the server created, log in to the machine through secure shell. Drop into root shell using the following command:

sudo su

Next you’ll need to create the repository info for yum to use to download the prebuilt MongoDB packages. You’ll create a file at /etc/yum.repos.d/mongodb-org-3.0.repo:

vi /etc/yum.repos.d/mongodb-org-3.0.repo

And copy/paste the repository:

[mongodb-org-3.0]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/3.0/x86_64/
gpgcheck=0
enabled=1

Save and exit from vi. And type in the following command to install:

yum install -y mongodb-org

And that’s it! Now you have MongoDB installed on your EC2.

Next, to turn on compression, you’ll need to edit /etc/mongod.conf

vi /etc/mongod.conf

Scroll down to the “storage” directive, and add in this configuration:

engine: "wiredTiger"
wiredTiger:
  collectionConfig:
    blockCompressor: "zlib"

Now any collections you create will be compressed with zlib, which provides the best compression currently.

To turn on your MongoDB instance by typing in this command:

service mongod start

And of course you’ll want to custom configure your MongoDB instance (or not). You can find several guides and tutorials to do that online.

Sorting a JSON Array in Java

There are a number of approaches you can take. But a simple one is to first convert it to a Java Collection, and use Collections.sort() with a custom comparator.

The example I’ll follow consists of an org.json.JSONArray which is has (and only has) org.json.JSONObject’s. So a json array of json objects, which is pretty common. Say you want to sort the JSONObjects in the JSONArray, based on a key in the JSONObject.

Let’s start by converting a JSONArray to a Collection of JSONObjects, using the java List type:

List<JSONObject> myJsonArrayAsList = new ArrayList<JSONObject>();
for (int i = 0; i < myJsonArray.length(); i++)
    myJsonArrayAsList.add(myJsonArray.getJSONObject(i));

Now you can use Collections.sort() with a custom comparator. Let’s say you have a key named “key” in each json object, which maps to an int, and you want to sort on int value. You would use the following code:

Collections.sort(myJsonArrayAsList, new Comparator<JSONObject>() {
    @Override
    public int compare(JSONObject jsonObjectA, JSONObject jsonObjectB) {
    	int compare = 0;
    	try
    	{
    		int keyA = jsonObjectA.getInt("key");
    		int keyB = jsonObjectB.getInt("key");
    		compare = Integer.compare(keyA, keyB);
    	}
    	catch(JSONException e)
    	{
    		e.printStackTrace();
    	}
    	return compare;
    }
});

That’ll do. Now, let’s take it a step further. Let’s say the values in the key field of each json object are Strings, and you want to sort based on the Strings, but in a particular order. You want the string “oranges” to come first, “bananas” to come second, “pineapples” to come third, and “apples” to come last.

An easy way to go about this is to create a HashMap, and assign these strings integer values. Then use those integer values to compare. Here’s what the code would look like for that:

Collections.sort(myJsonArrayAsList, new Comparator<JSONObject>() {
    @Override
    public int compare(JSONObject jsonObjectA, JSONObject jsonObjectB) {
    	int compare = 0;
    	try
    	{
			HashMap<String,Integer> fruitTypeSorts = new HashMap<String,Integer>();
			fruitTypeSorts.put("orange", 1);
			fruitTypeSorts.put("bananas", 2);
			fruitTypeSorts.put("pineapples", 3);
			fruitTypeSorts.put("apples", 4);
			int valueA=fruitTypeSorts.get(jsonObjectA.getString("key"));
			int valueB=fruitTypeSorts.get(jsonObjectB.getString("key"));
			return Integer.compare(valueA, valueB);
    	}
    	catch(JSONException e)
    	{
    		e.printStackTrace();
    	}
    	return compare;
    }
});

And voila! Now you have the List sorted based on the key value, with objects with key values oranges being first, then bananas, and so on.

To tie it all together, you want to convert it back to a JSONArray. To put the sorted JSONObjects back into your original array, simply:

myJsonArray = new JSONArray();
for (int i = 0; i < myJsonArrayAsList.size(); i++) {
	myJsonArray.put(myJsonArrayAsList.get(i));
}

Set exact time for Alarm in Android Lollipop

I struggled with this for a little bit as I was porting my Android application over to use all the latest libraries after 5 years of no updates.

I used to use AlarmManager.setTime() to set my alarms. Particularly, this was for a timer application, where it was critical that the alarm is received by the application exactly at the right time. But once I upgraded to Lollipop, I realized that the alarm was not firing at the right time, and hence the timer app would not alert the user sometimes many seconds or minutes after it had expired.

It turns out that starting from Android Lollipop and later, AlarmManager.setTime() is no longer exact. This was mainly done to preserve battery. So now in Lollipop, you have to use AlarmManager.setExact() if you want the alarm to go off at the exact moment you set it for.

Note: it is discouraged to use setExact() everywhere, unless it’s justifiable for the purpose of the alarm you’re setting. For example in a timer countdown app, it’s important that the alarm goes of at exactly the correct moment (important in sports, for example). So battery life is slightly sacrificed for the purpose. Otherwise you should stick with setTime() and allow Android to figure out when to approximately fire the alarm (probably bundled together with other alarms at roughly the same time) so it can maximize battery. More information at: https://developer.android.com/reference/android/app/AlarmManager.html#setExact(int, long, android.app.PendingIntent)

In your code you can determine the OS version, and if it’s Lollipop or later, you can make it use setExact(), otherwise if older, use setTime() (since setExact wasn’t introduced till Lollipop):

AlarmManager alarmManager =
  (AlarmManager)
       context.getSystemService(Context.ALARM_SERVICE);

if(android.os.Build.VERSION.SDK_INT >= Build.VERSION_CODES.KITKAT)
   alarmManager.setExact(AlarmManager.RTC_WAKEUP,
         timer.stopTime,
         pendingIntent);
else
   alarmManager.set(AlarmManager.RTC_WAKEUP,
         timer.stopTime,
         pendingIntent);

And voila!

Recovering Java keystore password

So I hadn’t updated my Android app in 5 years. I had worked on an update for the past couple months, and was finally ready to upload it. I knew I had my Java keystore backed up, and the password written down. But come time to deploy the updated .apk, I couldn’t get into the keystore! The password I had written down 5 years ago was not it.

I was very frustrated, and disheartened after all the recent effort. Should I delete my old application, and upload the update as a completely new one? But then I’d lose all my current users.

I knew there are only a handful of passwords that I use, sometimes with slight variations, and sometimes multiple of my standard passwords put together. So I thought that I’d try to crack it.

Enter the Java based Android Keystore Password Recovery tool: http://maxcamillo.github.io/android-keystore-password-recover/. It’s a pretty configurable brute forcer. And it worked for me. Though it took several tries to get it just right so it could crack it, and I’ll explore how to best use it in this post.

The best way to use Android Keystore Password Recovery tool is with the Smart Dictionary mode, which tries variations of the words in a dictionary. It appends the words in the dictionary together in all orders, and adds numbers to it. It can also try upper case/lower case for the first letter of each word when trying variations. This is useful when you know roughly what your password was, but you don’t remember the case, or the numbers you added around it. And it’s especially useful if you concatenate two or more of your passwords together, but don’t remember in what order or what variation of your password (as was my case).

To run it, download the .jar (in my case I imported the source into Eclipse and ran it that way). Make sure you have java installed, and you can run it from the command line.

First you’ll want to create a dictionary of passwords that you usually use. Try to think of all variations and break them up into as little of pieces as possible. For example, if your password is CatManDu57, but you don’t remember if you variated it to 57DuManCat or something else, then break the dictionary into the following words:

Cat

Man

Du

57

Try to think of all the variations of cases that you use. A limitation in Android Keystore Password Recovery is that it is only capable of trying the first letter of the word with uppercase and lower case. So you may want to put in different combinations of your word, i.e. cAt, CaT, cAT (all added to the dictionary) should do it. And Android Keystore Password Recovery will try all the variations.

To run it, use the following command:

java -jar AndroidKeystoreBrute_v1.06.jar -m 3 -k “/path/to/mykeystore” -d “/path/to/wordlist.txt”

And voila! It will try all combinations in your wordlist, individually, or concatenated together.

Note that there’s a very useful and configurable feature called “pieces”, which isn’t very obvious from the documentation. You can define the number of pieces from the wordlist that can be concatenated together to form your password. This can narrow down the search for the password a lot so that the tool can go only up to a certain number of combinations. For example if we go back to the CatManDu57 example, that’s 4 total pieces. And since you know that there are at least 4 pieces together, and a maximum of 4 pieces, you’d use between “4 and 4” and it will try all combinations that way. I believe the default number of pieces it goes to is 1 to 64, and that will take much longer to run, so you’re better off limiting this if you can remember the number of pieces. Note: that a piece can be repeated within the same password combination it tries, so for our example with 4 pieces, one of the combinations will be CatCatCatCat, and one will be 57575757, and every combination in between, in every order. So you can see this can be very useful if you have some idea what your password was, especially if you remember pieces of it.

Some other notable parameters are as follow: -l sets the minimum password length. So if you knew it’s at least a certain number of characters, you should use it to narrow it down. -firstchars sets the first few characters of your password, and should be used if you remember for sure what the first few characters of the password were. -p uses common replacements such as “@” or “a”. And -onlylower makes it not mess with the case of the first letter of each word in your dictionary… very useful if you know exactly the case and that’s what you should use in the dictionary for each piece (which was my case).

The better and more precise the parameters you provide based on what password you use, obviously, the more likelihood of result. Try different combinations and try to narrow it down. I had to go through lots of iterations of wordlists and parameters (especially pieces) to get it just right.

Another mode that I found useful was the brute forcing method, but limited to a certain set of characters. This one will require editing the source code. If you knew you used some variation of CatManDu57, and want to simply brute force with just those characters, you can edit BrutePasswd.java in the source, and limit it to the characters:

‘C’,’c’,’A’,’a’,’T’,’t’,’M’,’m’,’N’,’n’,’D’,’d’,’U’,’u’,’5′,’7′.

And this will do a brute of all combinations of all permutations starting from whatever length you specify, and will most certainly crack your password eventually if you’re sure those were the only characters in your password. Of course this becomes a lot slower, and may take weeks or months, the more the characters you have and the longer the length of the password.

Another notable mention for a cracker is John The Ripper (http://www.openwall.com/john/). Most importantly, it has support for Java keystores. Download the community enhanced version, and use the “keystore2john” utility to convert it to John the Rippper (JtR) format. I tried this but unfortunately didn’t quite figure out how to configure it with my own dictionary. Nor could I get it to repeat pieces of the dictionary which I knew was important in my password. I did try it on a bogus keystore I created with a simple password (“test57”), just to test, and it did crack that very quickly, since it does have a built in dictionary.

Performance note: while Android Keystore Password Recovery works pretty fast (I was getting 200,000 combinations of passwords per second), John The Ripper beat that in several orders of magnitude. JtR was guessing over a million combinations of passwords per second! Holy batman.

Last notable mention is Hashcat (https://hashcat.net/hashcat/), which is supposed to be a very fast password cracker, and can also use GPU to do the hashing which will greatly speed it up even more than JtR. With this, you can set up a cluster of GPU-enabled EC2 machines in AWS (or anywhere else), and run it and it will eventually crack. Unfortunately, at the time of writing this post, hashcat does not support Java keystores, though I did see in a few places (such as a poll on what feature to add) that it’s a commonly requested feature in Hashcat. Maybe if I have time, I’ll contribute to that. But for now I’m happy because I have my lost keystore password recovered!

Good luck!

P.S. check out my Android app at: https://play.google.com/store/apps/details?id=com.synappze.stoptimer and https://github.com/AdeelMufti/StopTimer

InputFilter used on Android EditText to limit to a number range

In my Android app, I had a need to limit a number in an EditText widget between a certain range. Namely, 0-59, for time entry pertaining to minutes and seconds. Enter Android’s InputFilters, which are an easy and useful way to limit input in an EditText.

First of all, the EditText is defined in the layout XML as follows:

This ensures that the EditText should be a number, of 2 digit length, with initial value “0”.

Then in your code, you’ll need to define an InputFilter as follows:

public class InputFilterMinMax implements InputFilter {
   private int minimumValue;
   private int maximumValue;

   public InputFilterMinMax(int minimumValue, int maximumValue) {
      this.minimumValue = minimumValue;
      this.maximumValue = maximumValue;
   }

   @Override
   public CharSequence filter(CharSequence source, int start, int end, Spanned dest, int dstart, int dend) {
      try {
         int input = Integer.parseInt(dest.subSequence(0, dstart).toString() + source + dest.subSequence(dend, dest.length()));
         if (isInRange(minimumValue, maximumValue, input))
            return null;
      }
      catch (NumberFormatException nfe) {
      }
      return "";
   }

   private boolean isInRange(int a, int b, int c) {
      return b > a ? c >= a && c <= b : c >= b && c <= a;
   }

}

So now essentially you can define an InputFilter, that initializes with your desired minimum and maximum integer values. It can then be applied to an EditText to limit the entry to a certain range of integer values. So if you have a reference to your EditText in code, for this example it would be initialized in this way:

minsEditText.setFilters(new InputFilter[]{new InputFilterMinMax(0, 59)});

This will limit the entry from 0-59. Of course you can use any number range that you’d like.