Saturday 7 September 2019

AWS in the weeds - S3 CloudWatch metrics and lifecycle actions that are in progress

I was recently working with a customer on an issue that highlighted a poorly explained side effect of the reversibility of lifecycle actions. The purpose of this post is to explain this behaviour with the hope that it will save S3 customers unexpected costs. TLDR: S3 CloudWatch metrics don't accurately display metrics about lifecycle actions that are still in progress.

The customer use case was fairly simple, they had a bucket with a large amount of data that needed to be deleted. They added a lifecycle expiration action to the bucket and the following day noticed that the CloudWatch bucket size metric showed the bucket size to have reduced by the expected amount. An example of how this looks in CloudWatch is below, this graph is a reproduction in my own account, the customer had a significantly larger amount of data to delete (petabytes rather than terabytes):


The customer assumed (logically) that the drop in the graph meant that the lifecycle action had completed and they removed the expiration lifecycle action. A couple of weeks later the customer's finance team noticed that their S3 storage costs had not reduced despite the engineering team telling them to expect a significant decrease. Looking at the relevant bucket the CloudWatch graph showed that the bucket size reduction had been reverted and most of the supposedly deleted data was still present. Example graph:


After investigating with the customer it became clear that this was an unintended consequence of the way S3 lifecycle actions are implemented, specifically that:

"When you disable or delete a lifecycle rule, after a small delay Amazon S3 stops scheduling new objects for deletion or transition. Any objects that were already scheduled will be unscheduled and they won't be deleted or transitioned."

The documentation unfortunately neglects to mention that the CloudWatch bucket metrics will update immediately even when the lifecycle action has not actually completed processing. There is currently no way to view progress on the lifecycle action and in this case it took the better part of 2 weeks for the action to complete. This was a fairly expensive lesson for the customer as S3 was 'operating as designed' and the customer was charged for the data storage costs from the time they removed the lifecycle action until they added it back when they noticed the issue.

Takeaways to help avoid this problem:
1. Lifecycle actions can take a long time to complete, potentially weeks for large (petabyte) volumes of data.
2. Don't used CloudWatch bucket metrics as an indicator of lifecycle action progress or completion.
3. After removing or disabling a lifecycle action make sure you check the CloudWatch metrics the next day or two to confirm there is no unexpected reversion of the action.
4. You will be charged for data costs if the removal of the lifecycle action results in the action being reverted for some objects.