Tuesday 7 June 2016

AWS Tip: Save S3 costs with abort multipart lifecycle policy


S3 multipart uploads provide a number of benefits -- better throughput, recovery from network errors -- and many tools automatically use them for larger uploads. The AWS CLI cp, mv, and sync commands all use multipart uploads, and their documentation notes that "If the process is interrupted by a kill command or system failure, the in-progress multipart upload remains in Amazon S3 and must be cleaned up manually..."

You want to clean up these failed multipart uploads because you are charged for the storage their uploaded parts use until the upload is either completed or aborted. This post provides some detail on how to find the incomplete uploads and options for removing them to save storage costs.

Finding incomplete multipart uploads

If you have relatively few buckets, or only want to check your biggest buckets (CloudWatch S3 metrics are useful for finding these), the AWS CLI s3api list-multipart-uploads command is a simple check:
aws s3api list-multipart-uploads --bucket [bucket-name]
No output indicates that the bucket does not contain any incomplete uploads. See the list-multipart-uploads documentation linked above for an example of the output on a bucket that does contain an incomplete upload. A simple bash script to check all your buckets:

This script will list your buckets and, for each one, the first page (of possibly many) of incomplete multipart upload keys along with the date each upload was initiated. The region lookup is required to handle bucket names containing dots and buckets in eu-central-1 (SigV4).
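For readers who prefer Python, the same check can be sketched with boto3. This is not the script described above; it is a minimal illustration that assumes a boto3 S3 client is passed in, and the function names incomplete_uploads and report are hypothetical. Unlike the first-page listing, it follows pagination to the end. It does not reproduce the region lookup, so buckets in other regions may need a client created for their own region.

```python
def incomplete_uploads(client, bucket):
    """Yield (key, upload_id, initiated) for every incomplete upload in a bucket."""
    # The paginator follows KeyMarker/UploadIdMarker, so all pages are read.
    paginator = client.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        # "Uploads" is absent from the response when nothing is pending.
        for upload in page.get("Uploads", []):
            yield upload["Key"], upload["UploadId"], upload["Initiated"]


def report(client):
    """Return {bucket_name: [(key, upload_id, initiated), ...]},
    including only buckets that have at least one incomplete upload."""
    found = {}
    for bucket in client.list_buckets()["Buckets"]:
        uploads = list(incomplete_uploads(client, bucket["Name"]))
        if uploads:
            found[bucket["Name"]] = uploads
    return found


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   for name, uploads in report(boto3.client("s3")).items():
#       for key, upload_id, initiated in uploads:
#           print(name, key, initiated)
```

Passing the client in, rather than creating it inside the functions, keeps the sketch easy to try against a stub before pointing it at a real account.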

Cleaning up

Once you have identified the buckets containing incomplete uploads, it is worth investigating some of the recent failed uploads to see whether there is an underlying issue that needs to be addressed, particularly if the uploads relate to backups or important log files. The most typical cause is instances being terminated before completing uploads (look at Lifecycle Hooks to fix this if you are using Auto Scaling), but they may also be the result of applications not cleaning up after failures or not handling errors correctly.

A multipart upload can be aborted with the abort-multipart-upload s3api command in the AWS CLI, using the object key and upload ID returned by the list-multipart-uploads command. This can be scripted, but it will take time to complete for buckets containing large numbers of incomplete uploads.

Fortunately there is an easier way. S3 now supports a bucket lifecycle policy to automatically delete incomplete uploads after a specified period of time. Enabling the policy in the AWS console is fairly quick and easy; see Jeff's blog post for details. A rather messy boto3 example script for enabling the policy on all buckets can be found here. It should work with most bucket configurations, but it comes with no guarantees and you use it at your own risk.
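Both approaches can be sketched in a few lines of boto3. This is a rough illustration, not the example script mentioned above: the function names, the rule ID, and the 7-day default are all assumptions, and a boto3 S3 client is assumed to be passed in. Note that put_bucket_lifecycle_configuration replaces the whole lifecycle configuration, so the sketch reads the existing rules first and writes them back together with the new one.

```python
from datetime import datetime, timedelta, timezone

RULE_ID = "abort-incomplete-multipart-uploads"  # hypothetical rule name


def abort_stale_uploads(client, bucket, older_than_days=7):
    """Abort uploads initiated more than older_than_days ago; return the count."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=older_than_days)
    aborted = 0
    paginator = client.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for upload in page.get("Uploads", []):
            # boto3 returns "Initiated" as a timezone-aware datetime.
            if upload["Initiated"] < cutoff:
                client.abort_multipart_upload(
                    Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
                )
                aborted += 1
    return aborted


def enable_abort_rule(client, bucket, days=7):
    """Add a bucket-wide rule that aborts incomplete uploads after `days`.

    Returns False (and changes nothing) if any existing rule already sets
    AbortIncompleteMultipartUpload, even one limited to a prefix.
    """
    try:
        rules = client.get_bucket_lifecycle_configuration(Bucket=bucket)["Rules"]
    except client.exceptions.ClientError as err:
        # A bucket with no lifecycle configuration raises this error code.
        if err.response["Error"]["Code"] != "NoSuchLifecycleConfiguration":
            raise
        rules = []
    if any("AbortIncompleteMultipartUpload" in rule for rule in rules):
        return False
    rules.append({
        "ID": RULE_ID,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    })
    # This call replaces the full configuration, so existing rules are re-sent.
    client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules}
    )
    return True
```

With a real client this would be called as, for example, enable_abort_rule(boto3.client("s3"), "my-bucket"), where my-bucket is a placeholder name. It is worth trying on a single non-critical bucket first and checking the resulting rules in the console before applying it more widely.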

Conclusion (why you should do this)

If you are using S3 to store large (> 5 MB) objects and you are spending more than a few dollars a month on S3 storage, there is a fairly good chance that you are paying unnecessarily for failed or incomplete multipart uploads. It should only take a few minutes to review your buckets, and doing so could yield significant monthly savings.