Saturday, 7 September 2019

AWS in the weeds - S3 CloudWatch metrics and lifecycle actions that are in progress

I was recently working with a customer on an issue that highlighted a poorly explained side effect of the reversibility of lifecycle actions. The purpose of this post is to explain this behaviour with the hope that it will save S3 customers unexpected costs. TLDR: S3 CloudWatch metrics don't accurately display metrics about lifecycle actions that are still in progress.

The customer use case was fairly simple, they had a bucket with a large amount of data that needed to be deleted. They added a lifecycle expiration action to the bucket and the following day noticed that the CloudWatch bucket size metric showed the bucket size to have reduced by the expected amount. An example of how this looks in CloudWatch is below, this graph is a reproduction in my own account, the customer had a significantly larger amount of data to delete (petabytes rather than terabytes):


The customer assumed (logically) that the drop in the graph meant that the lifecycle action had completed and they removed the expiration lifecycle action. A couple of weeks later the customer's finance team noticed that their S3 storage costs had not reduced despite the engineering team telling them to expect a significant decrease. Looking at the relevant bucket the CloudWatch graph showed that the bucket size reduction had been reverted and most of the supposedly deleted data was still present. Example graph:


After investigating with the customer it became clear that this was an unintended consequence of the way S3 lifecycle actions are implemented, specifically that:

"When you disable or delete a lifecycle rule, after a small delay Amazon S3 stops scheduling new objects for deletion or transition. Any objects that were already scheduled will be unscheduled and they won't be deleted or transitioned."

The documentation unfortunately neglects to mention that the CloudWatch bucket metrics will update immediately even when the lifecycle action has not actually completed processing. There is currently no way to view progress on the lifecycle action and in this case it took the better part of 2 weeks for the action to complete. This was a fairly expensive lesson for the customer as S3 was 'operating as designed' and the customer was charged for the data storage costs from the time they removed the lifecycle action until they added it back when they noticed the issue.

Takeaways to help avoid this problem:
1. Lifecycle actions can take a long time to complete, potentially weeks for large (petabyte) volumes of data.
2. Don't used CloudWatch bucket metrics as an indicator of lifecycle action progress or completion.
3. After removing or disabling a lifecycle action make sure you check the CloudWatch metrics the next day or two to confirm there is no unexpected reversion of the action.
4. You will be charged for data costs if the removal of the lifecycle action results in the action being reverted for some objects.

Saturday, 16 February 2019

AWS tip: Wildcard characters in S3 lifecycle policy prefixes

A quick word of warning regarding S3's treatment of asterisks (*) in object lifecycle policies. In S3 asterisks are valid 'special' characters and can be used in object key names, this can lead to a lifecycle action not being applied as expected when the prefix contains an asterisk.

Historically an asterisk is treated as a wildcard to pattern match 'any', so you would be able to conveniently match all files for a certain pattern: 'rm *' as an example, would delete all files. This is NOT how an asterisk behaves in S3 lifecycle prefixes. If you specify a prefix of '*' or '/*' it will only be applied to objects that start with an asterisk and not all objects. The '*' prefix rule would be applied to these objects:
example-bucket/*/key
example-bucket/*yek

But would not be applied to:
example-bucket/object
example-bucket/directory/key
example-bucket/tcejbo*

It is not an error to specify an asterisk and it will merely result in the policy not being applied so you may not even know this as an issue. Fortunately it is fairly easy to check for this configuration with the CLI, the following bash one liner will iterate through all buckets owned by the caller and check if the bucket has any policies with an asterisk in their name. It will print out the bucket name and policy if affected or a '.' (to show progress) if not.

You can also check the S3 console, it should display 'Whole bucket' under the 'Applied To' column for any lifecycle rules you intended to have applied to the entire bucket.