Sunday 8 August 2021

Reducing VPC interface endpoint costs in dev/test environments

Two quick tips for potentially reducing VPC interface endpoint costs.

The first is an Athena query for finding unused VPC interface endpoints in cost and usage report data. This requires cost and usage reporting to be enabled (it should be already) and configured for querying in Athena, see this guide. The query searches for all VPC interface endpoints that have not accrued any data transfer charges during the query period. VPC endpoints incur both hourly and data transfer charges, so if you are not transferring any data it is very likely the endpoint is not being used and can be deleted. This will save around $7 a month ($0.01 x 24 hours x 30 days) in each availability zone configured for the unused endpoint (by default all AZs). The annual cost saving potential is around $260 for each (3 AZ) endpoint removed in each account.

select line_item_usage_account_id, line_item_resource_id
from cur
where line_item_usage_type like '%VpcEndpoint-Hour%'
and year='2021'
and month in ('7','07')
and line_item_resource_id not in 
 (select distinct(line_item_resource_id) 
  from cur
  where line_item_usage_type like '%VpcEndpoint-Byte%'
  and year='2021'
  and month in ('7','07'))
group by line_item_usage_account_id, line_item_resource_id
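
As a quick sanity check of the savings arithmetic, here is a minimal Python sketch. It assumes the $0.01 per AZ-hour rate quoted above and a 30-day month; the function name is just for illustration and your region's rate may differ.

```python
def unused_endpoint_saving(az_count, hourly_rate=0.01, hours=24 * 30):
    """Hourly charges avoided by deleting an endpoint provisioned in az_count AZs."""
    return az_count * hourly_rate * hours


# One AZ for a 30-day month, and three AZs for twelve 30-day months.
monthly = unused_endpoint_saving(1)
annual = unused_endpoint_saving(3, hours=24 * 30 * 12)
print(f"monthly per AZ: ${monthly:.2f}, annual for 3 AZs: ${annual:.2f}")
```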

The second approach is only recommended for dev/test environments where availability zone fault tolerance is not a requirement. When you create a VPC interface endpoint an ENI is created in each subnet (availability zone) unless you explicitly deselect the subnets. This is a best practice for interface endpoints as it provides continued availability in the case of an AZ failure - as long as you are using the (default) regional interface endpoint DNS name with private DNS. When using private DNS the regional endpoint DNS name is resolved to the private IP of the VPC interface endpoint ENI rather than the public IP of the service. In the case where one AZ is not reachable the request will be routed to an interface endpoint ENI in a different AZ, allowing your client applications to continue functioning. The downside of this is that you will incur intra-region data transfer charges (of $0.01/GB) for sending requests to the ENI in a different AZ. The potential cost saving opportunity is to remove this availability mechanism to reduce the hourly interface endpoint charges.

As long as the data transferred to and from the interface endpoint is less than around 1GB/hour it is cheaper to configure the VPC interface endpoint to only be available in one (or two) availability zones: each AZ removed saves $0.01 an hour, which is the cost of sending 1GB across AZs. This makes sense for any API specific endpoints (EC2 for example) but may not make sense for high throughput data intensive endpoints (such as S3/Kinesis/CloudWatch Logs). The cost and usage report data can once again be used to help identify interface endpoints where the data transfer is low enough to justify exposing the endpoint in only one availability zone. Note that the endpoint data charges are aggregated across all ENIs so the query will not correctly identify cases where traffic is not more or less equally balanced across AZs. The query assumes that the cost and usage report is configured for hourly granularity; the line_item_usage_amount threshold would need to be modified for daily/monthly reports and would be less accurate. The annual cost saving potential depends on data transfer and the number of AZs provisioned but should be around $175 for API endpoints reduced from 3 AZs to 1 AZ.

select line_item_usage_account_id, line_item_resource_id
from cur
where line_item_usage_type like '%VpcEndpoint-Byte%'
  and year='2021'
  and month in ('7','07')
  and line_item_usage_amount < 1.0
  and line_item_resource_id not in 
   (select distinct(line_item_resource_id) 
    from cur
    where line_item_usage_type like '%VpcEndpoint-Byte%'
      and year='2021'
      and month in ('7','07')
      and line_item_usage_amount > 1.0)
group by line_item_usage_account_id, line_item_resource_id
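
The trade-off between hourly charges and cross-AZ transfer can be sketched in a few lines of Python. This is only an illustration using the $0.01 per AZ-hour and $0.01/GB figures quoted above; the function names are made up for the example and it treats all traffic passed in as crossing AZs.

```python
HOURLY_RATE = 0.01    # USD per endpoint ENI (AZ) per hour, as quoted above
CROSS_AZ_RATE = 0.01  # USD per GB of intra-region transfer, as quoted above


def net_hourly_saving(azs_removed, cross_az_gb_per_hour):
    """Positive when dropping AZs saves more than the extra cross-AZ traffic costs."""
    return azs_removed * HOURLY_RATE - cross_az_gb_per_hour * CROSS_AZ_RATE


def break_even_gb_per_hour(azs_removed):
    """Cross-AZ traffic (GB/hour) at which the saving is cancelled out."""
    return azs_removed * HOURLY_RATE / CROSS_AZ_RATE


print(break_even_gb_per_hour(1))
print(round(net_hourly_saving(2, 0) * 24 * 365, 2))
```

With negligible traffic, going from 3 AZs to 1 AZ gives the roughly $175 a year mentioned above.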

Thursday 1 July 2021

Checking regional service availability with the AWS CLI

Using Systems Manager public parameters, here is a quick CLI one-liner for checking if an AWS service is available in a region:
aws ssm get-parameters-by-path --path /aws/service/global-infrastructure/services/[SERVICE]/regions/[REGION] --output text

As an example, checking if Athena is available in ap-northeast-3:
aws ssm get-parameters-by-path --path /aws/service/global-infrastructure/services/athena/regions/ap-northeast-3 --output text

An empty result indicates the service is not available. The list of service names can be retrieved as below (from the documentation linked above):
aws ssm get-parameters-by-path --path /aws/service/global-infrastructure/services --query 'Parameters[].Name | sort(@)'

Similarly the list of regions can be found using:
aws ssm get-parameters-by-path --path /aws/service/global-infrastructure/regions --query 'Parameters[].Name'
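
If you would rather do the check from Python, a rough equivalent is sketched below. It assumes a boto3 SSM client is passed in and that the parameters under .../services/[SERVICE]/regions carry the region code as their Value; the helper names are hypothetical. Rather than querying a single region path, it lists all regions for the service and checks membership, following NextToken because results are paginated.

```python
def regions_for_service(ssm_client, service):
    """Return the set of region codes the public parameters list for a service."""
    path = f"/aws/service/global-infrastructure/services/{service}/regions"
    regions, token = set(), None
    while True:
        kwargs = {"Path": path}
        if token:
            kwargs["NextToken"] = token
        page = ssm_client.get_parameters_by_path(**kwargs)
        regions.update(p["Value"] for p in page["Parameters"])
        token = page.get("NextToken")
        if not token:
            return regions


def is_available(ssm_client, service, region):
    """True if the region appears in the service's public parameter list."""
    return region in regions_for_service(ssm_client, service)
```

For example, is_available(boto3.client("ssm"), "athena", "ap-northeast-3") would mirror the CLI check above.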

Saturday 7 September 2019

AWS in the weeds - S3 CloudWatch metrics and lifecycle actions that are in progress

I was recently working with a customer on an issue that highlighted a poorly explained side effect of the reversibility of lifecycle actions. The purpose of this post is to explain this behaviour with the hope that it will save S3 customers unexpected costs. TLDR: S3 CloudWatch metrics don't accurately display metrics about lifecycle actions that are still in progress.

The customer use case was fairly simple, they had a bucket with a large amount of data that needed to be deleted. They added a lifecycle expiration action to the bucket and the following day noticed that the CloudWatch bucket size metric showed the bucket size to have reduced by the expected amount. An example of how this looks in CloudWatch is below, this graph is a reproduction in my own account, the customer had a significantly larger amount of data to delete (petabytes rather than terabytes):

The customer assumed (logically) that the drop in the graph meant that the lifecycle action had completed and they removed the expiration lifecycle action. A couple of weeks later the customer's finance team noticed that their S3 storage costs had not reduced despite the engineering team telling them to expect a significant decrease. Looking at the relevant bucket the CloudWatch graph showed that the bucket size reduction had been reverted and most of the supposedly deleted data was still present. Example graph:

After investigating with the customer it became clear that this was an unintended consequence of the way S3 lifecycle actions are implemented, specifically that:

"When you disable or delete a lifecycle rule, after a small delay Amazon S3 stops scheduling new objects for deletion or transition. Any objects that were already scheduled will be unscheduled and they won't be deleted or transitioned."

The documentation unfortunately neglects to mention that the CloudWatch bucket metrics will update immediately even when the lifecycle action has not actually completed processing. There is currently no way to view progress on the lifecycle action and in this case it took the better part of 2 weeks for the action to complete. This was a fairly expensive lesson for the customer as S3 was 'operating as designed' and the customer was charged for the data storage costs from the time they removed the lifecycle action until they added it back when they noticed the issue.

Takeaways to help avoid this problem:
1. Lifecycle actions can take a long time to complete, potentially weeks for large (petabyte) volumes of data.
2. Don't use CloudWatch bucket metrics as an indicator of lifecycle action progress or completion.
3. After removing or disabling a lifecycle action make sure you check the CloudWatch metrics over the next day or two to confirm there is no unexpected reversion of the action.
4. You will be charged for data costs if the removal of the lifecycle action results in the action being reverted for some objects.

Saturday 16 February 2019

AWS tip: Wildcard characters in S3 lifecycle policy prefixes

A quick word of warning regarding S3's treatment of asterisks (*) in object lifecycle policies. In S3 asterisks are valid 'special' characters and can be used in object key names, this can lead to a lifecycle action not being applied as expected when the prefix contains an asterisk.

Historically an asterisk is treated as a wildcard to pattern match 'any', so you would be able to conveniently match all files for a certain pattern: 'rm *' as an example, would delete all files. This is NOT how an asterisk behaves in S3 lifecycle prefixes. If you specify a prefix of '*' or '/*' it will only be applied to objects that start with an asterisk and not all objects. The '*' prefix rule would be applied to these objects:

But would not be applied to:

It is not an error to specify an asterisk and it will merely result in the policy not being applied, so you may not even know this is an issue. Fortunately it is fairly easy to check for this configuration with the CLI, the following bash one liner will iterate through all buckets owned by the caller and check if the bucket has any policies with an asterisk in their name. It will print out the bucket name and policy if affected or a '.' (to show progress) if not.

You can also check the S3 console, it should display 'Whole bucket' under the 'Applied To' column for any lifecycle rules you intended to have applied to the entire bucket.
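
To make the difference between a wildcard and a literal prefix concrete, here is a small Python sketch. The object keys are invented for illustration, and the prefix check simply models the literal comparison described above rather than calling S3.

```python
from fnmatch import fnmatchcase


def lifecycle_prefix_applies(prefix, key):
    """Model of a lifecycle prefix: compared literally, character by character."""
    return key.startswith(prefix)


def shell_glob_matches(pattern, key):
    """What a shell-style wildcard would do, shown only for contrast."""
    return fnmatchcase(key, pattern)


# Hypothetical object keys; the first two really do start with an asterisk.
keys = ["*.txt", "*/logs/a.log", "logs/a.log", "data.csv"]
print([k for k in keys if lifecycle_prefix_applies("*", k)])
print([k for k in keys if shell_glob_matches("*", k)])
```

The first list contains only the two keys that begin with an asterisk, while the glob matches every key. An empty prefix is what matches the whole bucket.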

Sunday 14 October 2018

Finding S3 API requests from previous versions of the AWS CLI and SDKs


Earlier this year the S3 team announced that S3 will stop accepting API requests signed using AWS Signature Version 2 after June 24th, 2019. Customers will need to update their SDKs, CLIs, and custom implementations to make use of AWS Signature Version 4 to avoid impact after this date. It might be difficult to find older applications or instances using outdated versions of the AWS CLI or SDKs that need to be updated. The purpose of this post is to explain how AWS CloudTrail data events and Amazon Athena can be used to help identify applications that may need to be updated. We will cover the setup of the CloudTrail data events, the Athena table creation, and some Athena queries to filter and refine the results to help with this process.

Update (January/February 2019)

S3 recently added a SignatureVersion item to the AdditionalEventData field in the S3 data events, this significantly simplifies the process of finding clients using SigV2. The SQL queries below have been updated to exclude events with a SigV4 signature (additionaleventdata NOT LIKE '%SigV4%'). You can equally search for only '%SigV2%' and skip the CLI version string munging entirely.
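
If you are processing the CloudTrail records outside Athena, the same check is easy to sketch in Python. The records below are invented for illustration; the sketch assumes the SignatureVersion item sits in additionalEventData, either as a parsed object (raw CloudTrail JSON) or as a JSON string (as a query result column).

```python
import json


def is_sigv2(record):
    """True when an S3 data event reports a SigV2-signed request."""
    extra = record.get("additionalEventData") or {}
    if isinstance(extra, str):  # handle the field arriving as a JSON string
        extra = json.loads(extra)
    return extra.get("SignatureVersion") == "SigV2"


# Hypothetical, heavily trimmed records.
records = [
    {"eventName": "GetObject", "additionalEventData": {"SignatureVersion": "SigV2"}},
    {"eventName": "PutObject", "additionalEventData": '{"SignatureVersion": "SigV4"}'},
    {"eventName": "GetObject"},  # older event without the new item
]
print([r["eventName"] for r in records if is_sigv2(r)])
```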

Setting up CloudTrail data events in the AWS console

The first step is to create a trail to capture S3 data events. This should be done in the region you plan on running your Athena queries in order to avoid unnecessary data transfer charges. In the CloudTrail console for the region, create a new trail specifying the trail name. The ‘Apply trail to all regions’ option should be left as ‘Yes’ unless you plan on running separate analyses for each region. Given that we are creating a data events trail, select ‘None’ under the Management Events section and check the “Select all S3 buckets in your account” checkbox. Finally select the S3 location where the CloudTrail data will be written, we will create a new bucket for simplicity:

Setting up CloudTrail data events using the AWS CLI

If you prefer to create the trail using the AWS CLI then you can use the create-subscription command to create the S3 bucket and trail with the correct permissions, updating it to be a global trail and then adding the S3 data event configuration:

A word on cost

Once the trail has been created, CloudTrail will start recording S3 data events and delivering them to the configured S3 bucket. Data events are currently priced at $0.10 per 100,000 events with the storage costs being the standard S3 data storage charges for the (compressed) events, see the CloudTrail pricing page for additional details. It is recommended that you disable the data event trail once you are satisfied that you have gathered sufficient request data, it can be re-enabled if further analysis is required at a later stage.

Creating the Athena table

The CloudTrail team simplified the process for using Athena to analyse CloudTrail logs by adding a feature to allow customers to create an Athena table directly from the CloudTrail console event history page by simply clicking on the ‘Run advanced queries in Amazon Athena’ link and selecting the corresponding S3 CloudTrail bucket:

An explanation of how to create the Athena table manually can be found in the Athena CloudTrail documentation.

Analysing the data events with Athena

We now have all the components needed to begin searching for clients that may need to be updated. Starting with a basic query that filters out most of the AWS requests (for example the AWS Console, CloudTrail, Athena, Storage Gateway, CloudFront):

These results should mostly be client API/CLI requests but the large number of requests can still be refined by only including regions that actually support AWS Signature Version 2. From the region and endpoint documentation for S3 we can see that we only need to check eight of the regions. We can safely exclude the AWS Signature Version 4 (SigV4) regions as clients would not work correctly against these regions if they did not already have SigV4 support. Let’s also look at distinct user agents and extract the version from the user agent string:

We are unfortunately not able to filter on the calculated ‘version’ column and as it is a string it is also difficult to perform direct numerical version comparison. We can use some arithmetic to create a version number that can be compared. Using the AWS CLI requests as an example for the moment and adding back the source IP address and user identity:

The version comparison number (10110108) translates to the version string 1.11.108 which is the first version of AWS CLI supporting SigV4 by default. This results in a list of clients accessing S3 objects in this account using a version of the AWS CLI that needs to be updated:

The same query can be applied to all the AWS CLI and SDK user agent strings by substituting the corresponding agent string and version number for SDK versions using SigV4 by default:

AWS Client
SigV4 default version
User Agent String
Version comparator
Python Botocore
Python Boto3

.NET35, .NET45, and CoreCLR only; PCL, Xamarin, and UWP platforms do not support SigV4 at all
All versions of Go and C++ SDKs support SigV4 by default

Additional Note:
There is no need to look at the client version number for new events which will automatically include the SignatureVersion.
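
For reference, the version comparator arithmetic described above can be sketched in Python. The weights (major x 10,000,000 + minor x 10,000 + patch) are inferred from the 10110108 example rather than taken from the original query, the user agent strings are made up, and the sketch assumes minor versions stay below 1,000 and patch versions below 10,000.

```python
import re


def version_number(version):
    """Turn 'major.minor.patch' into a single comparable integer."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major * 10_000_000 + minor * 10_000 + patch


def cli_version(user_agent):
    """Extract the AWS CLI version from a user agent string, if present."""
    match = re.search(r"aws-cli/(\d+\.\d+\.\d+)", user_agent)
    return match.group(1) if match else None


SIGV4_DEFAULT = version_number("1.11.108")  # 10110108, as in the query above


def needs_update(user_agent):
    """True for AWS CLI user agents older than the SigV4-by-default version."""
    version = cli_version(user_agent)
    return version is not None and version_number(version) < SIGV4_DEFAULT


print(needs_update("[aws-cli/1.10.66 Python/2.7.12 Linux/4.4.0 botocore/1.4.56]"))
print(needs_update("[aws-cli/1.16.30 Python/3.6.5 Linux/4.14.0 botocore/1.12.20]"))
```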

Tracing the source of the requests

The source IP address will reflect the private IP of the EC2 instance accessing S3 through a VPC endpoint or the public IP if accessing S3 directly. You can search for either of these IPs in EC2 AWS Console for the corresponding region. For non-EC2 or NAT access you should be able to use the ARN to track down the source of the requests.

Saturday 25 August 2018

AWS S3 event aggregation with Lambda and DynamoDB


S3 has had event notifications since 2014 and for individual object notifications these events work well with Lambda, allowing you to perform an action on every object event in a bucket. It is harder to use this approach when you want to perform an action a limited number of times or at an aggregated bucket level. An example use case would be refreshing a dependency (like Storage Gateway RefreshCache) when you are expecting a large number of object events in a bucket. Performing a relatively expensive action for every event is not practical or efficient in this case. This post provides a solution for aggregating these events using Lambda, DynamoDB, and SQS.

Update (June/July 2020)

Cache refresh can now be automated. The event aggregation approach below may still be useful depending on your requirements.

The problem

We want to call RefreshCache on our Storage Gateway (SGW) whenever the contents of the S3 bucket it exposes are updated by an external process. If the external process is updating a large number of (small) S3 objects then a large number of S3 events will be triggered. We don't want to overload our SGW with refresh requests so we need a way to aggregate these events to only send occasional refresh requests.

The solution

The solution is fairly simple and uses DynamoDB's Conditional Writes for synchronisation and SQS Message Timers to enable aggregation. When the Lambda function processes a new object event it first checks to see if the event falls within the window of the currently active refresh request. If the event is within the window it will automatically be included when the refresh executes and the event can be ignored. If the event occurred after the last refresh then a new refresh request is sent to an SQS queue with a message timer equal to the refresh window period. This allows for all messages received within a refresh window to be included in a single refresh operation.
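
The windowing decision can be sketched without any AWS services. This is an in-memory illustration only: the dictionary stands in for the DynamoDB table, the function name is hypothetical, and the real function relies on a conditional write so that concurrent invocations cannot both start a window. The timestamps are the ones from the test run shown later in this post.

```python
def should_send_refresh(state, bucket, event_ts_ms, window_ms):
    """Return True if this event should trigger a new delayed refresh request.

    state maps bucket name -> timestamp (ms) of the event that started the
    currently active refresh window.
    """
    window_start = state.get(bucket)
    if window_start is not None and event_ts_ms < window_start + window_ms:
        return False  # already covered by the pending refresh
    state[bucket] = event_ts_ms  # with DynamoDB this would be a conditional write
    return True


state = {}
events = [1534097642149, 1534097642938, 1534097643812, 1534097643371, 1534097642517]
decisions = [should_send_refresh(state, "s3-test-net", ts, 30_000) for ts in events]
print(decisions)  # [True, False, False, False, False]
```

Only the first event starts a window and sends a refresh request; the other four fall inside the 30 second window and are skipped.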


At a high level we need to create resources (SQS queue, DynamoDB table, Lambda functions), set up permissions (create and assign IAM roles), and apply some configuration (linking Lambda to S3 event notification and SQS queues). This implementation really belongs in a CloudFormation template (and I may actually create one) but I was interested to try and do this entirely via the AWS CLI, masochistic as that may be. If you are not interested in the gory implementation details then skip ahead to the 'Creation and deletion scripts' section.

Let's start with the S3 event aggregation piece. We need:
  1. A DynamoDB table to track state
  2. An SQS queue as a destination for aggregated actions
  3. A Lambda function for processing and aggregating the S3 events
  4. IAM permissions for all of the above
As the DynamoDB table and SQS queue are independent we can create these first:
aws dynamodb create-table --table-name S3EventAggregator --attribute-definitions AttributeName=BucketName,AttributeType=S --key-schema AttributeName=BucketName,KeyType=HASH --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=5

aws sqs create-queue --queue-name S3EventAggregatorActionQueue

Naturally this needs to be done with a user that has sufficient permissions and assumes your default region is set.

The Lambda function is a bit trickier as it requires a role to be created before the function can be created. So let's start with the IAM permissions. First let's create a policy allowing DynamoDB GetItem and UpdateItem to be performed on the DynamoDB table we created earlier. To do this we need a JSON file containing the necessary permissions. The dynamo-writer.json file looks like this:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:UpdateItem"],
            "Resource": "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/S3EventAggregator"
        }
    ]
}

We need to replace REGION and ACCOUNT_ID with the relevant values. As we are aiming at using the command line for this exercise, let's use STS to retrieve our account ID, set our region, and then use sed to substitute both variables:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) 
wget -O dynamo-writer.json 
sed -i "s/ACCOUNT_ID/$ACCOUNT_ID/g" dynamo-writer.json
sed -i "s/REGION/$AWS_DEFAULT_REGION/g" dynamo-writer.json

aws iam create-policy --policy-name S3EventAggregatorDynamo --policy-document file://dynamo-writer.json
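
If sed is not available, the same placeholder substitution is straightforward in Python. This is a minimal sketch with a trimmed-down template and a made-up account ID and region; it also parses the result so a broken substitution fails early rather than at create-policy time.

```python
import json


def render_policy(template, account_id, region):
    """Fill the ACCOUNT_ID and REGION placeholders and check the result is valid JSON."""
    rendered = template.replace("ACCOUNT_ID", account_id).replace("REGION", region)
    json.loads(rendered)  # raises ValueError if the substitution broke the document
    return rendered


# Trimmed-down template for illustration; the values below are hypothetical.
TEMPLATE = '{"Resource": "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/S3EventAggregator"}'
print(render_policy(TEMPLATE, "123456789012", "eu-west-1"))
```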

We now have a policy that allows the caller (our soon to be created Lambda function in this case) to update items in the S3EventAggregator DynamoDB table. Next we need to create a policy to allow the function to write messages to SQS. The sqs-writer.json policy file contents are similar to the DynamoDB policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sqs:SendMessage",
            "Resource": "arn:aws:sqs:REGION:ACCOUNT_ID:S3EventAggregatorActionQueue"
        }
    ]
}

Retrieving the file and substituting the ACCOUNT_ID and REGION using the environment variables we created for the DynamoDB policy:
wget -O sqs-writer.json 
sed -i "s/ACCOUNT_ID/$ACCOUNT_ID/g" sqs-writer.json
sed -i "s/REGION/$AWS_DEFAULT_REGION/g" sqs-writer.json
aws iam create-policy --policy-name S3EventAggregatorSqsWriter --policy-document file://sqs-writer.json

Having defined the two resource access policies let's create an IAM role for the Lambda function. 
wget -O lambda-trust.json
aws iam create-role --role-name S3EventAggregatorLambdaRole --assume-role-policy-document file://lambda-trust.json

The lambda-trust.json policy allows Lambda access to assume roles via STS and looks like this (no substitutions required for this one):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ""
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

We can now attach the SQS and DynamoDB policies to the newly created Lambda role. We also need the AWSLambdaBasicExecutionRole which is an AWS managed policy providing access to CloudWatch logs and Lambda function execution:
aws iam attach-role-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/S3EventAggregatorDynamo --role-name S3EventAggregatorLambdaRole
aws iam attach-role-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/S3EventAggregatorSqsWriter --role-name S3EventAggregatorLambdaRole
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole --role-name S3EventAggregatorLambdaRole

We are now finally ready to create the S3EventAggregator Lambda function. Starting by creating the Python deployment package:
wget -O

And then creating the function (using the AWS_DEFAULT_REGION and ACCOUNT_ID environment variables again):
aws lambda create-function --function-name S3EventAggregator --runtime python3.6 --role arn:aws:iam::$ACCOUNT_ID:role/S3EventAggregatorLambdaRole --zip-file fileb:// --handler s3_aggregator.lambda_handler --timeout 10 --environment "Variables={QUEUE_URL=https://sqs.$$ACCOUNT_ID/S3EventAggregatorActionQueue,REFRESH_DELAY_SECONDS=30,LOG_LEVEL=INFO}"
aws lambda put-function-concurrency --function-name S3EventAggregator --reserved-concurrent-executions 1

The function concurrency is set to 1 as there is no benefit to having the function processing S3 events concurrently and 'single threading' the function will limit the maximum concurrent DynamoDB request rate to reduce DynamoDB capacity usage and costs.

All that is left now is to give S3 permission to execute the Lambda function and link the bucket notification events to the S3EventAggregator function. Giving S3 permission on the specific bucket:
aws lambda add-permission --function-name S3EventAggregator --statement-id SID_$BUCKET --action lambda:InvokeFunction --principal --source-account $ACCOUNT_ID --source-arn arn:aws:s3:::$BUCKET

Interestingly, the --source-arn can be omitted to avoid needing to add permissions for each bucket you want the function to operate on but it is required (and must match a specific bucket) for the Lambda Console to display the function and trigger correctly. The S3 event.json configuration creates an event on any object creation or removal events:
{
    "LambdaFunctionConfigurations": [{
        "Id": "s3-event-aggregator",
        "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:S3EventAggregator",
        "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    }]
}

Once again substituting the relevant region and account IDs:
wget -O event.json
sed -i "s/ACCOUNT_ID/$ACCOUNT_ID/g" event.json
sed -i "s/REGION/$AWS_DEFAULT_REGION/g" event.json

And linking the event configuration to a bucket:
aws s3api put-bucket-notification-configuration --bucket $BUCKET --notification-configuration file://event.json

Thus concludes the event aggregation part of the solution. A quick test confirms the event aggregation is working as expected:

time for i in $(seq 1 5); do aws s3 cp test.txt s3://$BUCKET/test$i.txt; done
upload: ./test.txt to s3://s3-test-net/test1.txt                  
upload: ./test.txt to s3://s3-test-net/test2.txt                  
upload: ./test.txt to s3://s3-test-net/test3.txt                  
upload: ./test.txt to s3://s3-test-net/test4.txt                  
upload: ./test.txt to s3://s3-test-net/test5.txt                  

real 0m2.106s
user 0m1.227s
sys 0m0.140s

STREAM=$(aws logs describe-log-streams --log-group-name /aws/lambda/S3EventAggregator --order-by LastEventTime --descending --query 'logStreams[0].logStreamName' --output text); aws logs get-log-events --log-group-name /aws/lambda/S3EventAggregator --log-stream-name $STREAM --query 'events[*].{msg:message}' --output text | grep "^\[" | sed 's/\t/ /g'
[INFO] 2018-08-12T18:14:03.647Z 7da07415-9e5b-11e8-ab6d-8f962149ce24 Sending refresh request for bucket: s3-test-net, timestamp: 1534097642149
[INFO] 2018-08-12T18:14:04.207Z 7e1d6ca2-9e5b-11e8-ac9d-e1f0f9729f66 Refresh for bucket: s3-test-net within refresh window, skipping. S3 Event timestamp: 1534097642938
[INFO] 2018-08-12T18:14:04.426Z 7eefb013-9e5b-11e8-ab6d-8f962149ce24 Refresh for bucket: s3-test-net within refresh window, skipping. S3 Event timestamp: 1534097643812
[INFO] 2018-08-12T18:14:04.635Z 7e5b5feb-9e5b-11e8-8aa9-7908c99c450a Refresh for bucket: s3-test-net within refresh window, skipping. S3 Event timestamp: 1534097643371
[INFO] 2018-08-12T18:14:05.915Z 7ddb5a72-9e5b-11e8-80de-0dd6a15c3f62 Refresh for bucket: s3-test-net within refresh window, skipping. S3 Event timestamp: 1534097642517

From the 'within refresh window' log messages we can see 4 of the 5 events were skipped as they fell within the refresh aggregation window. Checking the SQS queue we can see the refresh request event:

aws sqs receive-message --queue-url https://$$ACCOUNT_ID/S3EventAggregatorActionQueue --attribute-names All --message-attribute-names All
{
    "Messages": [
        {
            "MessageId": "c0027dd2-30bc-48bc-b622-b5c85d862c92",
            "ReceiptHandle": "AQEB9DQXkIWsWn...5XU2a13Q8=",
            "MD5OfBody": "99914b932bd37a50b983c5e7c90ae93b",
            "Body": "{}",
            "Attributes": {
                "SenderId": "AROAI55PXBF63XVSEBNYM:S3EventAggregator",
                "ApproximateFirstReceiveTimestamp": "1534097653846",
                "ApproximateReceiveCount": "1",
                "SentTimestamp": "1534097642728"
            },
            "MD5OfMessageAttributes": "6f6eaf397811cbece985f3e8d87546c3",
            "MessageAttributes": {
                "bucket-name": {
                    "StringValue": "s3-test-net",
                    "DataType": "String"
                },
                "timestamp": {
                    "StringValue": "1534097642149",
                    "DataType": "Number"
                }
            }
        }
    ]
}

Moving onto the final part of the solution, we need a Lambda function that processes the events that the S3EventAggregator function sends to SQS. For the function's permissions we can reuse the S3EventAggregatorDynamo policy for DynamoDB access but will need to create a new policy for reading and deleting SQS messages and refreshing the Storage Gateway cache.

The sgw-refresh.json is as follows, note that SMB file shares are included but the current Lambda execution environment only supports boto3 1.7.30 which does not actually expose the SMB APIs (more on working around this later):
    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "*"

Creating the policy:
wget -O sgw-refresh.json
aws iam create-policy --policy-name StorageGatewayRefreshPolicy --policy-document file://sgw-refresh.json

The sqs-reader.json gives the necessary SQS read permissions on the S3EventAggregatorActionQueue:
    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "arn:aws:sqs:REGION:ACCOUNT_ID:S3EventAggregatorActionQueue"

Substituting and creating the policy:
wget -O sqs-reader.json
sed -i "s/ACCOUNT_ID/$ACCOUNT_ID/g" sqs-reader.json 
sed -i "s/REGION/$AWS_DEFAULT_REGION/g" sqs-reader.json
aws iam create-policy --policy-name S3EventAggregatorSqsReader --policy-document file://sqs-reader.json

And then creating the role and adding the relevant policies:
wget -O lambda-trust.json
aws iam create-role --role-name S3AggregatorActionLambdaRole --assume-role-policy-document file://lambda-trust.json
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole --role-name S3AggregatorActionLambdaRole
aws iam attach-role-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/S3EventAggregatorSqsReader --role-name S3AggregatorActionLambdaRole
aws iam attach-role-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/S3EventAggregatorDynamo --role-name S3AggregatorActionLambdaRole
aws iam attach-role-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/StorageGatewayRefreshPolicy --role-name S3AggregatorActionLambdaRole

Next we will create the Lambda function, this will however depend on whether or not you require SMB file share support. As mentioned earlier the current Lambda execution environment does not expose the new SMB file share APIs so if you have SMB shares mapped on your Storage Gateway you will have to include the latest botocore and boto3 libraries with your deployment. The disadvantage of this is that you are not able to view the code in the Lambda console (due to the deployment file size limitation). If you are only using NFS shares then you only need the code without the latest libraries but it will break if you add an SMB share before the Lambda execution environment supports it. Including the dependency in the deployment is the preferred option so that is what we are going to do:

mkdir deploy
pip install boto3 botocore -t deploy
wget -O deploy/
cd deploy
zip -r *

aws lambda create-function --function-name S3StorageGatewayRefresh --runtime python3.6 --role arn:aws:iam::$ACCOUNT_ID:role/S3AggregatorActionLambdaRole --zip-file fileb:// --handler s3_sgw_refresh.lambda_handler --timeout 5 --environment "Variables={LOG_LEVEL=INFO}"
aws lambda put-function-concurrency --function-name S3StorageGatewayRefresh --reserved-concurrent-executions 1

And finally create the event mapping to execute the S3StorageGatewayRefresh function when messages are received on the queue:

aws lambda create-event-source-mapping --function-name S3StorageGatewayRefresh --event-source arn:aws:sqs:$AWS_DEFAULT_REGION:$ACCOUNT_ID:S3EventAggregatorActionQueue --batch-size 1
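
With the mapping in place, Lambda delivers the queued messages to the function as an event containing a list of records. The sketch below is not the actual s3_sgw_refresh code, just an illustration of pulling the bucket-name message attribute out of such an event; the helper name and the trimmed sample event are hypothetical.

```python
def buckets_to_refresh(event):
    """Collect the distinct bucket names carried in the SQS records of a Lambda event."""
    buckets = []
    for record in event.get("Records", []):
        attrs = record.get("messageAttributes", {})
        name = attrs.get("bucket-name", {}).get("stringValue")
        if name and name not in buckets:
            buckets.append(name)
    return buckets


# Trimmed sample event using the attribute values from the queue message above.
sample_event = {
    "Records": [
        {
            "body": "{}",
            "messageAttributes": {
                "bucket-name": {"stringValue": "s3-test-net", "dataType": "String"},
                "timestamp": {"stringValue": "1534097642149", "dataType": "Number"},
            },
        }
    ]
}
print(buckets_to_refresh(sample_event))
```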

And that is the final solution. Verifying it works as expected, let's mount the NFS share and upload some files via the CLI and confirm the share is refreshed. Mounting the share:

$ sudo mount -t nfs -o nolock share
$ cd share/sgw
$ ls -l
total 0

Uploading the files through the CLI (from a different machine):

$ for i in $(seq 1 20); do aws s3 cp test.txt s3://$BUCKET/sgw/test$i.txt; done
upload: ./test.txt to s3://s3-test-net/test1.txt                
upload: ./test.txt to s3://s3-test-net/test2.txt                
upload: ./test.txt to s3://s3-test-net/test19.txt              
upload: ./test.txt to s3://s3-test-net/test20.txt    

And confirming the refresh of the share:

$ ls -l
total 10
-rw-rw-rw- 1 nobody nogroup 19 Aug 25 07:46 test10.txt
-rw-rw-rw- 1 nobody nogroup 19 Aug 25 07:46 test11.txt
-rw-rw-rw- 1 nobody nogroup 19 Aug 25 07:46 test8.txt
-rw-rw-rw- 1 nobody nogroup 19 Aug 25 07:46 test9.txt

Creation and deletion scripts

For convenience a script to create (and remove) this stack is provided on GitHub. Clone the s3-event-aggregator repository and run the and scripts respectively. You need to have the AWS CLI installed and configured and sed and zip must be available. Be sure to edit the BUCKET variable in the script to match your bucket name and change the REGION if appropriate.

Note that the delete stack script will not remove the S3 event notification configuration by default. There is no safe and convenient way to remove only the S3EventAggregator configuration (other than removing all configuration which may result in unintended loss of other event configuration). If you have other events configured on a bucket it is best to use the AWS Console to remove the s3-event-aggregator event configuration. If there are no other events configured on your bucket you can safely uncomment the relevant line in the deletion script.


The two Lambda functions both have a LOG_LEVEL environment variable to control the level of detail logged to CloudWatch Logs. The functions were created with the level set to INFO, but DEBUG may be useful for troubleshooting and WARN is probably appropriate for use in production.
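
As a rough illustration of how a Python Lambda function might honour LOG_LEVEL, here is a minimal sketch. The helper name configure_logger, the logger name and the fallback to INFO are assumptions made for illustration; the functions in the repository may well do this differently.

```python
import logging
import os


def configure_logger(name: str = "s3_event_aggregator") -> logging.Logger:
    """Return a logger whose level is read from the LOG_LEVEL environment variable."""
    # LOG_LEVEL is the variable named in the post; defaulting to INFO is an assumption.
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    # Fall back to INFO if the value is not a recognised logging level.
    if not isinstance(level, int):
        level = logging.INFO
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


logger = configure_logger()
logger.debug("Only emitted when LOG_LEVEL=DEBUG")
```

Reading the level once at start-up keeps the handler itself free of configuration logic, and an unrecognised value degrades to INFO rather than raising an error.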

The S3EventAggregator function also has an environment variable called REFRESH_DELAY_SECONDS for controlling the event aggregation window. It was initialised to 30 seconds when the function was created, but it may be appropriate to change it depending on your S3 upload pattern. If the uploads are mostly small and complete quickly, or if you need the Storage Gateway to reflect changes quickly, then this may be a reasonable value. If you are performing larger uploads, or the total upload process takes significantly longer, then the refresh window needs to be longer than the total expected upload time.

The DynamoDB table was created with 5 write capacity units and, as the entries are less than 1KB, this should be sufficient as long as you are not writing more than 5 objects a second to the Storage Gateway S3 bucket. Writing more than this will require additional write capacity to be provisioned (or auto scaling to be enabled).
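
To make the capacity reasoning concrete, here is a small sketch that estimates the write capacity needed for a given object rate. It assumes standard (non-transactional) writes, where one WCU covers one write per second of an item up to 1KB; the function name and the sample rates are illustrative only.

```python
import math


def required_wcu(writes_per_second: float, item_size_bytes: int) -> int:
    """Estimate the provisioned WCU needed for standard DynamoDB writes."""
    # Each standard write consumes one WCU per 1 KB of item size, rounded up.
    wcu_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(writes_per_second * wcu_per_write)


# Entries under 1 KB at 5 objects a second fit the 5 WCU provisioned in the post.
print(required_wcu(5, 800))
# A hypothetical burst of 20 objects a second would need more capacity.
print(required_wcu(20, 800))
```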

The same code can be used for multiple buckets by simply adding additional bucket event configurations via the CLI put-bucket-notification-configuration as above or using the AWS Console.


There are three component costs involved in this solution: the two Lambda functions, DynamoDB, and SQS. The Lambda and DynamoDB costs will scale fairly linearly with usage, with both the S3EventAggregator and DynamoDB being charged for each S3 event that is triggered. To get an idea of the number of events to expect, you can enable S3 metrics on the bucket and check the PUT and DELETE counts. The S3StorageGatewayRefresh function executions and SQS messages will be a fraction of the total S3 event count and depend on the REFRESH_DELAY_SECONDS configuration. A longer refresh delay will result in fewer SQS messages and S3StorageGatewayRefresh function executions.

As an example, let's assume 1000 objects are uploaded a day, with these being aggregated into 50 refresh events. For simplicity we will also assume that the free tier has been exhausted and that there are 30 days in the month. The total Lambda request count will then be:
(30 days x 1000 S3 events) + (30 days x 50 refresh events) = 31 500 requests
As Lambda requests are charged in 1 million increments, this will result in a charge of $0.20 for the requests.
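
The request arithmetic can be sketched in a few lines of Python so that you can plug in your own event counts. This follows the post's assumption that requests are billed in whole 1 million increments at $0.20 each; the function and constant names are illustrative, and current pricing should be checked before relying on the result.

```python
import math

DAYS_IN_MONTH = 30
PRICE_PER_MILLION_REQUESTS = 0.20  # USD, the figure used in the post


def lambda_request_charge(s3_events_per_day, refresh_events_per_day, days=DAYS_IN_MONTH):
    """Return (monthly request count, request charge in USD)."""
    requests = days * (s3_events_per_day + refresh_events_per_day)
    # Assumption taken from the post: billing rounds up to whole millions.
    billed_millions = math.ceil(requests / 1_000_000)
    return requests, billed_millions * PRICE_PER_MILLION_REQUESTS


requests, charge = lambda_request_charge(1000, 50)
print(f"{requests} requests, ${charge:.2f}")
```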

The compute charges are based on duration, with the S3EventAggregator executing in less than 100ms for all aggregate events and around 300 - 600ms for the refresh events. The S3StorageGatewayRefresh function takes between 400ms and 800ms. This gives us:
(30 days x 950 S3 requests x 0.1s) + (30 days x 50 S3 requests x 0.5s) + (30 days x 50 refresh events x 0.7s) = 4650 seconds
Lambda compute is charged in GB-s, so:
4650 seconds x 128MB/1024 = 581.25 GB-s at $0.00001667 = $0.0096894
Bringing the total Lambda charges for the month to $0.21
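
The same compute calculation as a short Python sketch, using the durations, 128MB memory size and GB-s price quoted above. The names are illustrative and the durations are the rough averages from this example rather than measured values for your workload.

```python
DAYS_IN_MONTH = 30
MEMORY_MB = 128
PRICE_PER_GB_SECOND = 0.00001667  # USD, the figure used in the post

# (invocations per day, assumed average duration in seconds)
WORKLOAD = [
    (950, 0.1),  # S3EventAggregator, aggregate events
    (50, 0.5),   # S3EventAggregator, refresh events
    (50, 0.7),   # S3StorageGatewayRefresh
]


def lambda_compute_charge(workload, memory_mb=MEMORY_MB, days=DAYS_IN_MONTH):
    """Return (total seconds, GB-seconds, compute charge in USD) for a month."""
    total_seconds = sum(days * count * duration for count, duration in workload)
    gb_seconds = total_seconds * memory_mb / 1024
    return total_seconds, gb_seconds, gb_seconds * PRICE_PER_GB_SECOND


seconds, gb_seconds, charge = lambda_compute_charge(WORKLOAD)
print(f"{seconds:.0f} s, {gb_seconds:.2f} GB-s, ${charge:.4f}")
```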

For DynamoDB we have provisioned 5 WCU at $0.000735/WCU/hour and 1 RCU at $0.000147/RCU/hour working out as:
(5 WCU x 0.000735 per hour x 24 hours a day x 30 days) + (1 RCU x 0.000147 per hour x 24 hours x 30 days) = $2.75 a month
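
If you change the provisioned capacity, the monthly figure can be recalculated with a sketch like the one below. The hourly prices are the ones quoted above and may differ by region or over time; the function name is illustrative.

```python
HOURS_PER_MONTH = 24 * 30
WCU_PRICE_PER_HOUR = 0.000735  # USD, the figure used in the post
RCU_PRICE_PER_HOUR = 0.000147  # USD, the figure used in the post


def dynamodb_provisioned_charge(wcu, rcu, hours=HOURS_PER_MONTH):
    """Return the monthly charge in USD for provisioned read and write capacity."""
    return (wcu * WCU_PRICE_PER_HOUR + rcu * RCU_PRICE_PER_HOUR) * hours


print(f"${dynamodb_provisioned_charge(5, 1):.2f} a month")
```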

SQS charges per million requests, with the 1500 send message requests and a further 1500 receive and delete requests all falling under this limit (and thus only costing $0.40 for the month). It is worth noting that Lambda polls the SQS queue roughly 4 times a minute and this will contribute to your total SQS request costs, adding around 172,800 SQS requests a month.
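
The polling figure is easy to check, and the sketch below also confirms that the combined request count stays under one million. The polling rate of 4 requests a minute is the rough estimate from this post, not a guaranteed value, and the function name is illustrative.

```python
def monthly_sqs_requests(send, receive_and_delete, polls_per_minute=4, days=30):
    """Return (polling requests, total requests) for a month."""
    # Lambda's polling of the queue counts towards SQS requests.
    polling = polls_per_minute * 60 * 24 * days
    return polling, polling + send + receive_and_delete


polling, total = monthly_sqs_requests(send=1500, receive_and_delete=1500)
print(f"polling: {polling}, total: {total}, under 1 million: {total < 1_000_000}")
```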

There are some other costs associated with CloudWatch Logs and DynamoDB storage but these should be fairly small compared to the request costs and I would not expect the total cost of the stack to be more than $10 - $15 a month.


And so ends this post; well done for reading to the end. I quite enjoyed building this solution and will look at converting it to a CloudFormation template at a later stage. Feel free to log issues or pull requests against the GitHub repo.