October 2016 – Adeel's Corner

Removing neighboring (consecutive-only) duplicates in a Pandas DataFrame

Pandas, the Python Data Analysis Library, makes it easy to drop duplicates from a DataFrame, using the drop_duplicates() function (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html).

The problem is that it removes ALL duplicates anywhere in your DataFrame. Depending on what you’re doing, you may only want to get rid of neighboring duplicates, that is, rows that repeat the row immediately before them. A repeated row that shows up again after a different row is fine for your purpose.

For example, you may have the following data:

1 2 3
1 2 3
1 5 5
1 2 3
1 5 5

You only want to get rid of consecutive duplicates (which in this case are only the first two rows), and get this result:

1 2 3
1 5 5
1 2 3
1 5 5

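You can accomplish this using the pandas shift() function, which moves every row down by one position so that each row lines up with the row before it, like so:

data = data.loc[(data.shift() != data).any(axis=1)]

What this does is compare every row in the DataFrame to the previous row. A row is kept only if at least one of its columns differs from the previous row, so a row that exactly repeats the one before it is dropped. The first row is always kept, because there is nothing before it to compare against.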

Note: this only works if the elements in your DataFrame can be compared with != (in the example above all elements are integers). Also keep in mind that NaN never compares equal to NaN, so consecutive rows that hold NaN in the same column are not treated as duplicates. For more complex elements you’ll need to define your own equivalence check.
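Here’s a minimal, runnable sketch of the approach on the example data from above. The helper name drop_consecutive_duplicates and the column names a, b and c are made up for illustration; they are not part of pandas.

```python
import pandas as pd


def drop_consecutive_duplicates(df):
    """Keep a row only if it differs from the row directly above it."""
    # shift() moves every row down one position, so row i lines up with row i-1.
    # The first row is compared against NaN, so it always counts as different.
    differs_from_previous = (df != df.shift()).any(axis=1)
    return df.loc[differs_from_previous]


# Example data from the post; the column names are assumptions.
data = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 5, 5], [1, 2, 3], [1, 5, 5]],
    columns=["a", "b", "c"],
)

result = drop_consecutive_duplicates(data)
print(result.values.tolist())  # [[1, 2, 3], [1, 5, 5], [1, 2, 3], [1, 5, 5]]
```

Note that the surviving rows keep their original index labels (0, 2, 3, 4). Call reset_index(drop=True) on the result if you want them renumbered.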

Get Apache to act as a gzip proxy

Let’s say you want to repeatedly get a large amount of data (text, or something not in an already compressed format) from a RESTful webservice. But the webserver doesn’t compress the data when transferred over HTTP, and you have a slow connection on your end machine (such as your development machine). And so it takes your end client a while to load the data on every iteration, thus slowing down whatever you’re doing.

In this scenario you can use a server in the middle which has a fast connection, to act as a proxy and gzip the data for you. This server could be hosted in the cloud somewhere.

First you’ll need to make sure that Apache is installed on this server, and the firewall allows access to port 80, or 443 if you’re going to be using HTTPS. You’ll also need to make sure that the following Apache modules are installed and enabled: mod_proxy, mod_proxy_http (which handles proxying to http:// URLs) and mod_deflate.

Let’s say the data from the webservice you’re trying to get to is located at the URL http://www.webservice.com/data/json, and returns a giant block of JSON data. The configuration for this is simple. Edit /etc/httpd/conf/httpd.conf (or wherever httpd.conf lives on your system) and add the following:

<IfModule mod_proxy.c>
    # Reverse proxy only. Setting this to On would turn the server into an open forward proxy.
    ProxyRequests Off

    <Proxy *>
        Order deny,allow
        Allow from all
    </Proxy>

    ProxyPass /compressed http://www.webservice.com/data/json
    ProxyPassReverse /compressed http://www.webservice.com/data/json

    # Must match the local path used in ProxyPass above, or nothing gets compressed
    <Location /compressed>
        Order allow,deny
        Allow from all
        AddOutputFilterByType DEFLATE application/json
    </Location>
</IfModule>

And voila! Now when you visit the /compressed path on your middle-man Apache server, it will proxy and compress the JSON data from the upstream server before it ships it to you. So if your server’s IP is 1.2.3.4, you’d use the URL http://1.2.3.4/compressed, which proxies to http://www.webservice.com/data/json and returns the data compressed to your end client on the slow Internet connection. Let’s say it’s 1 MB of regular JSON text, which could plausibly compress to 200 KB or so. That’s a fifth of the bytes, so the transfer should take roughly a fifth of the time on a bandwidth-limited link.

Note: you’ll of course need to make sure that the HTTP client you’re using supports gzip compression. If you’re doing this programmatically, the HTTP API you’re using may handle this for you. Otherwise you’ll need to add the “Accept-Encoding: gzip” header yourself, so the middle-man server knows to compress the data, and then decompress whatever content you get back before using it.
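If your client doesn’t handle gzip for you, the manual route could look something like this rough sketch, which uses only the Python standard library. The function names fetch_compressed and decode_body are made up for illustration, and the URL in the usage comment is the placeholder address from above.

```python
import gzip
import urllib.request


def decode_body(body, content_encoding):
    """Return the plain bytes of a response body, un-gzipping it if needed."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch_compressed(url):
    """Request `url` with gzip enabled and return the decompressed bytes."""
    # urllib does not decompress responses on its own, so we ask for gzip
    # explicitly and undo it ourselves based on the Content-Encoding header.
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))


# Example call (placeholder address, not a real server):
# payload = fetch_compressed("http://1.2.3.4/compressed")
```

Checking the Content-Encoding header before decompressing means the same code keeps working if the proxy ever sends the data back uncompressed.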

Installing MongoDB on AWS EC2 and turning on zlib compression

At this time AWS doesn’t provide an RDS type for MongoDB. So in order to have a MongoDB server on the AWS cloud, you have to install it manually on an EC2 instance.

The full documentation for installing a MongoDB instance on an AWS EC2 can be seen at: https://docs.mongodb.com/v3.0/tutorial/install-mongodb-on-amazon/. Here’s a quick summary though.

First you’ll need to create a Linux EC2 server. Once you have the server created, log in to the machine through secure shell. Drop into root shell using the following command:

sudo su

Next you’ll need to create the repository info for yum to use to download the prebuilt MongoDB packages. You’ll create a file at /etc/yum.repos.d/mongodb-org-3.0.repo:

vi /etc/yum.repos.d/mongodb-org-3.0.repo

And copy/paste the repository:

[mongodb-org-3.0]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/3.0/x86_64/
gpgcheck=0
enabled=1

Save and exit from vi. And type in the following command to install:

yum install -y mongodb-org

And that’s it! Now you have MongoDB installed on your EC2 instance.

Next, to turn on compression, you’ll need to edit /etc/mongod.conf:

vi /etc/mongod.conf

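Scroll down to the “storage” directive, and add the following settings underneath it. The file is YAML, so the indentation matters and must use spaces, not tabs:

storage:
  engine: "wiredTiger"
  wiredTiger:
    collectionConfig:
      blockCompressor: "zlib"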

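Now any collections you create from this point on will be compressed with zlib, which gives the highest compression ratio of the block compressors available in MongoDB 3.0 (none, snappy and zlib), at the cost of some extra CPU.

Start your MongoDB instance by typing in this command: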

service mongod start

And of course you’ll want to custom configure your MongoDB instance (or not). You can find several guides and tutorials to do that online.

Sorting a JSON Array in Java

There are a number of approaches you can take. But a simple one is to first convert it to a Java Collection, and use Collections.sort() with a custom comparator.

The example I’ll follow consists of an org.json.JSONArray that contains only org.json.JSONObjects. So a JSON array of JSON objects, which is pretty common. Say you want to sort the JSONObjects in the JSONArray, based on a key in each JSONObject.

Let’s start by converting a JSONArray to a Collection of JSONObjects, using the java List type:

List<JSONObject> myJsonArrayAsList = new ArrayList<JSONObject>();
for (int i = 0; i < myJsonArray.length(); i++)
    myJsonArrayAsList.add(myJsonArray.getJSONObject(i));

Now you can use Collections.sort() with a custom comparator. Let’s say you have a key named “key” in each json object, which maps to an int, and you want to sort on int value. You would use the following code:

Collections.sort(myJsonArrayAsList, new Comparator<JSONObject>() {
    @Override
    public int compare(JSONObject jsonObjectA, JSONObject jsonObjectB) {
        int compare = 0;
        try
        {
            int keyA = jsonObjectA.getInt("key");
            int keyB = jsonObjectB.getInt("key");
            compare = Integer.compare(keyA, keyB);
        }
        catch(JSONException e)
        {
            e.printStackTrace();
        }
        return compare;
    }
});

That’ll do. Now, let’s take it a step further. Let’s say the values in the key field of each json object are Strings, and you want to sort based on the Strings, but in a particular order. You want the string “oranges” to come first, “bananas” to come second, “pineapples” to come third, and “apples” to come last.

An easy way to go about this is to create a HashMap, and assign these strings integer values. Then use those integer values to compare. Here’s what the code would look like for that:

Collections.sort(myJsonArrayAsList, new Comparator<JSONObject>() {
    // Build the rank lookup once, rather than on every comparison.
    private final HashMap<String,Integer> fruitTypeSorts = new HashMap<String,Integer>();
    {
        fruitTypeSorts.put("oranges", 1);
        fruitTypeSorts.put("bananas", 2);
        fruitTypeSorts.put("pineapples", 3);
        fruitTypeSorts.put("apples", 4);
    }

    @Override
    public int compare(JSONObject jsonObjectA, JSONObject jsonObjectB) {
        int compare = 0;
        try
        {
            // Note: a value that is missing from the map makes get() return null,
            // which throws a NullPointerException when unboxed to int.
            int valueA = fruitTypeSorts.get(jsonObjectA.getString("key"));
            int valueB = fruitTypeSorts.get(jsonObjectB.getString("key"));
            compare = Integer.compare(valueA, valueB);
        }
        catch(JSONException e)
        {
            e.printStackTrace();
        }
        return compare;
    }
});

And voila! Now you have the List sorted based on the key value, with the objects whose key value is “oranges” first, then “bananas”, and so on.

To tie it all together, you want to convert it back to a JSONArray. Simply build a new JSONArray from the sorted JSONObjects and assign it back to your original variable:

myJsonArray = new JSONArray();
for (int i = 0; i < myJsonArrayAsList.size(); i++) {
	myJsonArray.put(myJsonArrayAsList.get(i));
}

Set exact time for Alarm in Android Lollipop

I struggled with this for a little bit as I was porting my Android application over to use all the latest libraries after 5 years of no updates.

I used to use AlarmManager.set() to set my alarms. Particularly, this was for a timer application, where it was critical that the alarm is received by the application exactly at the right time. But once I upgraded to Lollipop, I realized that the alarm was not firing at the right time, and hence the timer app would sometimes alert the user many seconds or even minutes after it had expired.

It turns out that starting with Android 4.4 KitKat (API level 19), alarms scheduled with AlarmManager.set() are no longer exact for apps that target API 19 or higher. This was mainly done to preserve battery. So on KitKat, Lollipop and later, you have to use AlarmManager.setExact() if you want the alarm to go off at the exact moment you set it for.

Note: it is discouraged to use setExact() everywhere, unless it’s justifiable for the purpose of the alarm you’re setting. For example in a timer countdown app, it’s important that the alarm goes off at exactly the correct moment (important in sports, for example). So battery life is slightly sacrificed for the purpose. Otherwise you should stick with set() and allow Android to figure out when to approximately fire the alarm (probably bundled together with other alarms at roughly the same time) so it can maximize battery. More information at: https://developer.android.com/reference/android/app/AlarmManager.html#setExact(int, long, android.app.PendingIntent)

In your code you can determine the OS version, and if it’s KitKat (API level 19) or later, you can make it use setExact(), otherwise if older, use set() (since setExact() wasn’t introduced until API level 19):

AlarmManager alarmManager =
  (AlarmManager)
       context.getSystemService(Context.ALARM_SERVICE);

if(android.os.Build.VERSION.SDK_INT >= Build.VERSION_CODES.KITKAT)
   alarmManager.setExact(AlarmManager.RTC_WAKEUP,
         timer.stopTime,
         pendingIntent);
else
   alarmManager.set(AlarmManager.RTC_WAKEUP,
         timer.stopTime,
         pendingIntent);

And voila!