Removing neighboring (consecutive-only) duplicates in a Pandas DataFrame

Pandas, the Python Data Analysis Library, makes it easy to drop duplicate rows from a DataFrame using the drop_duplicates() method (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html).

The problem is that it removes ALL duplicate rows anywhere in your DataFrame, keeping only the first occurrence of each. Depending on what you’re doing, you may want to remove only neighboring duplicates, that is, rows that repeat the row directly before them. A row that reappears later, after a different row, should stay.

For example, you may have the following data:

1 2 3
1 2 3
1 5 5
1 2 3
1 5 5

You only want to get rid of consecutive duplicates (which in this case are only the first two rows), and get this result:

1 2 3
1 5 5
1 2 3
1 5 5
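
For contrast, here is a small sketch of what plain drop_duplicates() does with the same data. The variable names (data, deduped) are only illustrative.

```python
import pandas as pd

# The example rows from above
data = pd.DataFrame([[1, 2, 3],
                     [1, 2, 3],
                     [1, 5, 5],
                     [1, 2, 3],
                     [1, 5, 5]])

# By default, drop_duplicates() keeps the first occurrence of each distinct row
deduped = data.drop_duplicates()
print(deduped)
```

Only the rows at index 0 and 2 survive, because the later repeats are removed as well. Keeping those later repeats while dropping only the back-to-back one takes a different approach.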

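You can accomplish this using the pandas shift() function, which moves every row down by one position so that each row lines up with the row before it, like so:

data = data.loc[(data.shift() != data).any(axis=1)]

What this does is compare every row in the DataFrame to the previous row. If every column equals the corresponding column in the previous row, the row is a consecutive duplicate and is dropped; if any column differs, the row is kept. The first row has no previous row, so it is compared against missing values and is always kept.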
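
Here is a minimal runnable sketch of the approach on the example data. The names changed and result are illustrative. Note that shift() with its default argument lines each row up with the one before it, and any(axis=1) reduces the cell-by-cell comparison to a single True/False per row, which is what .loc needs to select rows.

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 3],
                     [1, 2, 3],
                     [1, 5, 5],
                     [1, 2, 3],
                     [1, 5, 5]])

# data.shift() moves every row down by one, so row i is compared with row i-1.
# The first row is compared against NaN, so it always counts as changed.
changed = (data != data.shift()).any(axis=1)

# Keep only the rows that differ from the row directly before them
result = data.loc[changed]
print(result)
```

The rows at index 0, 2, 3 and 4 are kept, and the original index labels are preserved. If you want a fresh 0..n-1 index, you can call reset_index(drop=True) on the result.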

Note: this only works if the elements in your DataFrame can be compared for equality with != (in the example above all elements are integers). For other kinds of elements, you’ll need to extend the element type with its own equality check, or write an equivalence function yourself.
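
One case where a custom equivalence check is needed even with plain numbers is missing data: NaN never compares equal to NaN, so two identical neighboring rows that contain NaN are not detected as duplicates. The following is only a sketch of one possible equivalence rule (treat two missing cells as equal); the names prev, same and keep are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1.0, 1.0, 2.0],
                     "b": [np.nan, np.nan, 3.0]})

prev = data.shift()

# Plain comparison: NaN != NaN is True, so the repeated row at index 1 is kept
plain = data.loc[(data != prev).any(axis=1)]

# Custom rule (an assumption of this sketch): two cells are equivalent
# if they are equal, or if both are missing
same = (data == prev) | (data.isna() & prev.isna())
keep = ~same.all(axis=1)

# The first row has no predecessor, so always keep it
if len(keep) > 0:
    keep.iloc[0] = True

result = data.loc[keep]
print(result)
```

With the plain comparison all three rows survive; with the custom rule the repeated row at index 1 is dropped.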
