Using AmazonS3Client to loop through batches of S3 objects

AWS provides the AmazonS3Client class, which is part of the AWS Java SDK. This class can be used to interact with files in S3.

An important thing to note about AmazonS3Client is that it returns listing results in batches of at most 1000. If you have 1000 files or fewer, all is good. You can use amazonS3Client.listObjects(bucketName); and it will return all the objects in the bucket in a single call.

But if the bucket contains more than 1000 files, you will need to loop through the files in batches. This is not entirely obvious and can cause you to miss files (as I certainly did)!

To get started, you would instantiate AmazonS3Client like so:

AmazonS3Client amazonS3Client = new AmazonS3Client(new BasicAWSCredentials(KEY, SECRET));

The approach I like to take is to first loop through and collect all the files up front like so:

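// The first call returns at most 1000 object summaries.
ObjectListing objectListing = amazonS3Client.listObjects(bucketName);
// Copy into our own list so we are not appending to the SDK's internal list.
List<S3ObjectSummary> s3ObjectSummaries = new ArrayList<>(objectListing.getObjectSummaries());
// isTruncated() is true while there are more batches to fetch.
while (objectListing.isTruncated())
{
   objectListing = amazonS3Client.listNextBatchOfObjects(objectListing);
   s3ObjectSummaries.addAll(objectListing.getObjectSummaries());
}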

Note: if memory is a concern or you have a very large number of files, you can simply modify the approach to do whatever you need to with each file as you fetch it in batches from the API, instead of collecting them all up front.
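The batch-at-a-time variant from the note can be sketched roughly as below. To keep the sketch runnable without the SDK, Page and PageSource are hypothetical stand-ins that I made up for ObjectListing and for the listObjects / listNextBatchOfObjects calls; BatchWalker is likewise just an illustrative name. With the real client, the loop would have the same shape, only the calls would differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for ObjectListing: one batch of keys plus a "more to come" flag.
class Page {
    final List<String> keys;
    final boolean truncated;
    final int nextOffset;

    Page(List<String> keys, boolean truncated, int nextOffset) {
        this.keys = keys;
        this.truncated = truncated;
        this.nextOffset = nextOffset;
    }
}

// Hypothetical stand-in for the client: serves an in-memory key list in fixed-size batches.
class PageSource {
    private final List<String> allKeys;
    private final int batchSize;

    PageSource(List<String> allKeys, int batchSize) {
        this.allKeys = allKeys;
        this.batchSize = batchSize;
    }

    // Plays the role of listObjects(bucketName).
    Page first() {
        return pageAt(0);
    }

    // Plays the role of listNextBatchOfObjects(previousListing).
    Page next(Page previous) {
        return pageAt(previous.nextOffset);
    }

    private Page pageAt(int offset) {
        int end = Math.min(offset + batchSize, allKeys.size());
        List<String> batch = new ArrayList<>(allKeys.subList(offset, end));
        return new Page(batch, end < allKeys.size(), end);
    }
}

class BatchWalker {
    // Handles every key as its batch arrives, so only one batch is held at a time.
    // Returns the number of keys handled.
    static int forEachKey(PageSource source, Consumer<String> handler) {
        int handled = 0;
        Page page = source.first();
        while (true) {
            for (String key : page.keys) {
                handler.accept(key);
                handled++;
            }
            if (!page.truncated) {
                break; // last batch reached
            }
            page = source.next(page);
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("file-" + i + ".txt");
        }
        int handled = forEachKey(new PageSource(keys, 1000), key -> {
            // Do whatever you need with each key here.
        });
        System.out.println("Handled " + handled + " keys");
    }
}
```

The point of this shape is that nothing accumulates between batches: each batch can be garbage collected as soon as the next one is fetched.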

If you first collected them in a List up front, you can then loop through each file like so:

for (S3ObjectSummary s3ObjectSummary : s3ObjectSummaries)
{
	String s3ObjectKey = s3ObjectSummary.getKey();
	//Do whatever with s3ObjectSummary
}

 

Setting up AWS CLI and dumping a S3 bucket

AWS CLI (command line interface) is very useful when you want to automate certain tasks. This post is about dumping a whole S3 bucket from the command line. This could be for any purpose, such as creating a backup.

First of all, if you don’t already have it installed, you’ll need to download and install the AWS CLI. More information here: http://docs.aws.amazon.com/cli/latest/userguide/installing.html

To configure AWS CLI, type the command:

aws configure

It will ask for credentials: the Access Key ID and the Secret Access Key, followed by a default region name and a default output format. More information on how to set up a key is here: http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html

And that’s it! You now have the power of manipulating your AWS environment from your command line.

In order to dump a bucket, you’ll need to first make sure that the account belonging to the AWS Key you generated has read access to the bucket. More on setting up permissions in S3 here: http://docs.aws.amazon.com/AmazonS3/latest/dev/s3-access-control.html

To dump the whole contents of an S3 bucket, you can use the following command:

aws s3 cp --quiet --recursive s3://<bucket-name>/ <local-directory>

This will copy the entire contents of the bucket to your local directory. As easy as that!

Encrypting already existing files in AWS S3 using the AWS Java API

In my last post I covered how to server-side encrypt files in S3 using the AWS Java API. Unfortunately, if you didn’t turn on encryption from the very first day when uploading to S3, you may have some files that are not encrypted. This post will cover an easy block of Java code which you can use to server-side encrypt any existing files that aren’t already, using the AWS Java API.

In summary, you need to loop through all existing files in a bucket and see which ones are not encrypted. For each file that is not encrypted, you set the metadata to turn on server-side encryption and save the file again in S3. Note: this may change the timestamps on your files, but this is essentially the only way through the API to save the metadata for a file to turn on encryption.

Here is the code:

public S3EncryptionMigrator(String bucketName) {
 // The AWS API outputs too much information, totally flooding the console. Turn it off.
 Logger.getLogger("com.amazonaws.http.AmazonHttpClient").setLevel(Level.OFF);

 AmazonS3Client amazonS3Client = new AmazonS3Client(...);

 // Listings come back in batches of at most 1000, so keep fetching until nothing is left.
 ObjectListing objectListing = amazonS3Client.listObjects(bucketName);
 List<S3ObjectSummary> s3ObjectSummaries = new ArrayList<>(objectListing.getObjectSummaries());
 while (objectListing.isTruncated()) {
  objectListing = amazonS3Client.listNextBatchOfObjects(objectListing);
  s3ObjectSummaries.addAll(objectListing.getObjectSummaries());
 }

 for (S3ObjectSummary s3ObjectSummary : s3ObjectSummaries) {
  String s3ObjectKey = s3ObjectSummary.getKey();
  // Read only the metadata; there is no need to open the object's content stream.
  ObjectMetadata meta = amazonS3Client.getObjectMetadata(bucketName, s3ObjectKey);
  // Constant-first equals() is null-safe: getSSEAlgorithm() is null for unencrypted files.
  if (ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION.equals(meta.getSSEAlgorithm()))
   continue; // Already encrypted, skip
  meta.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION); // Set encryption
  CopyObjectRequest copyObjectRequest = new CopyObjectRequest(bucketName, s3ObjectKey, bucketName, s3ObjectKey);
  copyObjectRequest.setNewObjectMetadata(meta);
  amazonS3Client.copyObject(copyObjectRequest); // Copy the file onto itself to save the new metadata
  System.out.println(">> '" + s3ObjectKey + "' encrypted.");
 }
}

Let’s examine the code. First you instantiate AmazonS3Client with the correct credentials. This should be tailored to your S3 authentication setup. You start by getting a list of all files in the bucket. Note that you have to keep calling listNextBatchOfObjects while objectListing.isTruncated() is true, because only 1000 results are returned at a time. In case you have more than 1000 files, you’ll need to loop through the rest until you get all of them.

Then you loop through the list of files. For each file you check if server-side encryption is already turned on by reading the existing metadata of the file. If not, you set the flag for encryption, and then essentially copy the file onto itself. This will save the new metadata, and will turn on server-side encryption.
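The skip-or-encrypt decision is small enough to pull out and try on its own. Here is a minimal sketch; SseCheck and needsEncryption are names I made up for illustration, and I'm assuming that ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION has the value "AES256" so that the sketch runs without the SDK.

```java
class SseCheck {
    // Assumption: this mirrors the value of ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION.
    static final String AES_256 = "AES256";

    // currentSseAlgorithm is whatever meta.getSSEAlgorithm() returned;
    // null means the file has no server-side encryption set.
    // Returns true when the file still needs the copy-onto-itself step.
    static boolean needsEncryption(String currentSseAlgorithm) {
        // Calling equals() on the constant avoids a NullPointerException for null input.
        return !AES_256.equals(currentSseAlgorithm);
    }

    public static void main(String[] args) {
        System.out.println("no SSE set: " + needsEncryption(null));
        System.out.println("AES256 set: " + needsEncryption("AES256"));
    }
}
```

Note that, like the code in this post, this check treats anything other than AES256 as "not encrypted". If some of your files use a different server-side encryption setting, you would probably want to handle that case separately rather than overwrite it.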

Encrypting files in AWS S3 using Java API

If you use AWS S3 Java API, and would like to see how you can encrypt files on S3, this post is for you.

First of all, there are two ways you can encrypt files in S3. One is to encrypt files on the server side, and the other is to encrypt files on the client side. With the server-side option, you don’t have to worry about much. S3 encrypts the files for you when they are written to disk, and decrypts them when they are read, seamlessly. With the client-side option, the client (your application) has to encrypt files before transmitting them to S3, and decrypt them after receiving them from S3.

In this post I’ll cover server side encryption. We opted to use this one because it’s just simpler, and seamless. You don’t have to worry about encrypting/decrypting files yourself, nor do you have to worry about the key.

I’m assuming that you’re already familiar with the AWS Java API. For most things related to S3, AWS provides a class called AmazonS3Client. Once you have AmazonS3Client instantiated with your configuration, you will need to enable encryption in the metadata for each file you upload.

Example:

File fileForUpload = new File(...);
AmazonS3Client amazonS3Client = new AmazonS3Client(...);
ObjectMetadata meta = new ObjectMetadata();
meta.setContentType(URLConnection.guessContentTypeFromName(fileForUpload.getName()));
meta.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION);
amazonS3Client.putObject(s3Bucket, s3FullDestinationPath, new FileInputStream(fileForUpload), meta);

Let’s examine. First you instantiate the File you want to upload, and AmazonS3Client. Next you set the metadata on the file. This includes setting the content type of the file (important because having the wrong content-type can cause issues down the line) and setting the encryption flag for the file. Then when you upload the file using AmazonS3Client.putObject(…), the file will be encrypted by S3 before it is stored, and automatically decrypted when it is retrieved, all by S3’s servers. And that’s it!
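One detail worth knowing about the content-type step: URLConnection.guessContentTypeFromName returns null when it doesn't recognize the file name, so you may want a fallback before passing the result to setContentType. A minimal sketch follows; the ContentTypes class and the choice of application/octet-stream as the fallback are my own assumptions, not something the AWS API requires.

```java
import java.net.URLConnection;

class ContentTypes {
    // Assumed fallback: a generic binary type for names the JDK cannot map.
    static final String FALLBACK = "application/octet-stream";

    // Guesses a content type from the file name, never returning null.
    static String guess(String fileName) {
        String guessed = URLConnection.guessContentTypeFromName(fileName);
        return guessed != null ? guessed : FALLBACK;
    }

    public static void main(String[] args) {
        // The exact mappings come from the JDK's built-in table.
        System.out.println(guess("notes.txt"));
        System.out.println(guess("data.zzzunknown"));
    }
}
```

With this in place, the upload code could call meta.setContentType(ContentTypes.guess(fileForUpload.getName())) instead of passing the raw guess.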

Note that according to AWS Java API documentation, AmazonS3Client uses SSL under the hood so you don’t have to worry about transmitting unencrypted files over the network.