Sunday, 20 August 2017

Understanding EC2 "Up to 10 Gigabit" network performance for R4 instances

This post investigates the network performance of AWS R4 instances, with a focus on the "Up to 10 Gigabit" networking expected from the smaller (r4.large - r4.4xlarge) instance types. Before starting, it should be noted that this post is based on observation and as such is prone to imprecision and variance; it is intended as a guide to what can be expected, not a comprehensive or scientific review.


The R4 instance documentation states "The smaller R4 instance sizes offer peak throughput of 10 Gbps. These instances use a network I/O credit mechanism to allocate network bandwidth to instances based on average bandwidth utilization. These instances accrue credits when their network throughput is below their baseline limits, and can use these credits when they perform network data transfers." This is not particularly helpful in understanding the lower bounds on network performance, and it gives no indication of the baseline limits; AWS instead recommends that customers benchmark the network performance of various instances to evaluate whether the instance type and size will meet their application's network performance requirements.
Logically, we would expect the r4.large to have a fraction of the total 20 Gbps available on an r4.16xlarge. From the instance size normalisation table under the reserved instance modification documentation, a *.large instance (factor of 4) should expect 1/32 of the resources available on a *.16xlarge instance (factor of 128), which works out at 0.625 Gbps (20 Gbps / 32), or 625 Mbps.
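The normalisation arithmetic is simple enough to sketch; the factors (4 for *.large, 128 for *.16xlarge) come from the reserved instance modification documentation:

```python
# A *.large (factor 4) should expect 4/128 = 1/32 of the bandwidth
# available to a *.16xlarge (factor 128).
def expected_share_gbps(total_gbps, size_factor, largest_factor=128):
    """Scale total bandwidth by the instance's normalisation factor."""
    return total_gbps * size_factor / largest_factor

print(expected_share_gbps(20, 4))  # 0.625 Gbps, i.e. 625 Mbps
```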


Testing r4.large baseline network performance

Using iperf3 between two newly launched Amazon Linux r4.large instances in the same availability zone in eu-west-1, we run into the first interesting anomaly: the network stream maxes out at 5 Gbps rather than the expected 10 Gbps:


$ iperf3 -p 5201 -c 172.31.7.67 -i 1 -t 3600 -f m -V 
iperf 3-CURRENT
Linux ip-172-31-10-235 4.9.32-15.41.amzn1.x86_64 #1 SMP Thu Jun 22 06:20:54 UTC 2017 x86_64
Control connection MSS 8949
Time: Sun, 20 Aug 2017 07:35:48 GMT
Connecting to host 172.31.7.67, port 5201
      Cookie: p2v6ry2kzjo2udittrzgmxotz7we3in5etmv
      TCP MSS: 8949 (default)
[  5] local 172.31.10.235 port 41270 connected to 172.31.7.67 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 3600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   598 MBytes  5015 Mbits/sec    9    664 KBytes       
[  5]   1.00-2.00   sec   596 MBytes  4999 Mbits/sec    3    559 KBytes       
[  5]   2.00-3.00   sec   595 MBytes  4992 Mbits/sec    9    586 KBytes       
[  5]   3.00-4.00   sec   595 MBytes  4989 Mbits/sec    0    638 KBytes       
[  5]   4.00-5.00   sec   596 MBytes  5000 Mbits/sec    0    638 KBytes       
[  5]   5.00-6.00   sec   595 MBytes  4989 Mbits/sec    0    638 KBytes       
[  5]   6.00-7.00   sec   595 MBytes  4990 Mbits/sec    6    638 KBytes       
[  5]   7.00-8.00   sec   595 MBytes  4990 Mbits/sec    3    524 KBytes       
[  5]   8.00-9.00   sec   596 MBytes  4997 Mbits/sec    0    586 KBytes       
[  5]   9.00-10.00  sec   596 MBytes  4997 Mbits/sec    0    603 KBytes       
[  5]  10.00-11.00  sec   595 MBytes  4990 Mbits/sec    0    638 KBytes       

Interestingly, using 2 parallel streams results in us (mostly) reaching the advertised 10 Gbps:

$ iperf3 -p 5201 -c 172.31.7.67 -i 1 -t 3600 -f m -V -P 2
iperf 3-CURRENT
Linux ip-172-31-10-235 4.9.32-15.41.amzn1.x86_64 #1 SMP Thu Jun 22 06:20:54 UTC 2017 x86_64
Control connection MSS 8949
Time: Sun, 20 Aug 2017 07:37:38 GMT
Connecting to host 172.31.7.67, port 5201
      Cookie: q343avscwpva5uyg2ayeinboxi5pllvw5l7r
      TCP MSS: 8949 (default)
[  5] local 172.31.10.235 port 41274 connected to 172.31.7.67 port 5201
[  7] local 172.31.10.235 port 41276 connected to 172.31.7.67 port 5201
Starting Test: protocol: TCP, 2 streams, 131072 byte blocks, omitting 0 seconds, 3600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   597 MBytes  5010 Mbits/sec    0    690 KBytes       
[  7]   0.00-1.00   sec   592 MBytes  4968 Mbits/sec    0    717 KBytes       
[SUM]   0.00-1.00   sec  1.16 GBytes  9979 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec   595 MBytes  4994 Mbits/sec    0    690 KBytes       
[  7]   1.00-2.00   sec   592 MBytes  4962 Mbits/sec   18    638 KBytes       
[SUM]   1.00-2.00   sec  1.16 GBytes  9956 Mbits/sec   18             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec   591 MBytes  4957 Mbits/sec  137    463 KBytes       
[  7]   2.00-3.00   sec   587 MBytes  4924 Mbits/sec   41    725 KBytes       
[SUM]   2.00-3.00   sec  1.15 GBytes  9881 Mbits/sec  178             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec   593 MBytes  4973 Mbits/sec   46    367 KBytes       
[  7]   3.00-4.00   sec   591 MBytes  4956 Mbits/sec   40    419 KBytes       
[SUM]   3.00-4.00   sec  1.16 GBytes  9929 Mbits/sec   86             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-5.00   sec   592 MBytes  4968 Mbits/sec  141    542 KBytes       
[  7]   4.00-5.00   sec   591 MBytes  4960 Mbits/sec   36    559 KBytes       
[SUM]   4.00-5.00   sec  1.16 GBytes  9928 Mbits/sec  177             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   5.00-6.00   sec   595 MBytes  4995 Mbits/sec   30    664 KBytes       
[  7]   5.00-6.00   sec   588 MBytes  4934 Mbits/sec    8    568 KBytes       
[SUM]   5.00-6.00   sec  1.16 GBytes  9929 Mbits/sec   38             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   6.00-7.00   sec   596 MBytes  5000 Mbits/sec    0    664 KBytes       
[  7]   6.00-7.00   sec   589 MBytes  4945 Mbits/sec    0    629 KBytes       
[SUM]   6.00-7.00   sec  1.16 GBytes  9945 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   7.00-8.00   sec   588 MBytes  4935 Mbits/sec    7    655 KBytes       
[  7]   7.00-8.00   sec   594 MBytes  4982 Mbits/sec    0    682 KBytes       
[SUM]   7.00-8.00   sec  1.15 GBytes  9917 Mbits/sec    7             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   8.00-9.00   sec   593 MBytes  4974 Mbits/sec    8    620 KBytes       
[  7]   8.00-9.00   sec   593 MBytes  4978 Mbits/sec   12    717 KBytes       
[SUM]   8.00-9.00   sec  1.16 GBytes  9952 Mbits/sec   20             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec   596 MBytes  4999 Mbits/sec    0    638 KBytes       
[  7]   9.00-10.00  sec   590 MBytes  4951 Mbits/sec    0    717 KBytes       
[SUM]   9.00-10.00  sec  1.16 GBytes  9950 Mbits/sec    0             


This behaviour is not consistent; stopping and restarting the instances often resulted in the full 10 Gbps on a single stream, suggesting the issue relates to instance placement. This appears to be supported by the placement group documentation, which states: "Network traffic to and from resources outside the placement group is limited to 5 Gbps." It is also possible that the streams are incorrectly being treated as placement group or public internet flows, which have different limits. For consistency, I have used two parallel streams to avoid this issue in the rest of the article.

The CloudWatch graph shows us reaching a steady baseline around eight minutes after starting iperf3:
[CloudWatch graph: NetworkOut, bytes per minute]




A quick word on the graph above: firstly, it is in bytes and, with detailed monitoring enabled, at one-minute granularity. For conversion purposes this means we need to divide the value of the metric by 60 to get bytes per second, and then multiply by 8 to get bits per second. Looking at the actual data from the graph above:

$ aws cloudwatch get-metric-statistics --metric-name NetworkOut --start-time 2017-08-20T08:23:00 --end-time 2017-08-20T08:35:00 --period 60 --namespace AWS/EC2 --statistics Average --dimensions Name=InstanceId,Value=i-0a7e009e7c0bf8fa8  --query 'Datapoints[*].[Timestamp,Average]' --output=text | sort
2017-08-20T08:23:00Z 486.0
2017-08-20T08:24:00Z 5726.0
2017-08-20T08:25:00Z 22711496136.0
2017-08-20T08:26:00Z 76376122845.0
2017-08-20T08:27:00Z 76403033046.0
2017-08-20T08:28:00Z 76357957564.0
2017-08-20T08:29:00Z 76304994405.0
2017-08-20T08:30:00Z 48667898310.0
2017-08-20T08:31:00Z 5776989873.0
2017-08-20T08:32:00Z 5816890095.0
2017-08-20T08:33:00Z 5692555065.0
2017-08-20T08:34:00Z 5692014471.0
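Applying the conversion described above (divide by 60, multiply by 8) to the peak and baseline datapoints:

```python
def cloudwatch_to_mbps(bytes_per_minute):
    """Convert a one-minute NetworkOut datapoint (bytes) to megabits per second."""
    return bytes_per_minute / 60 * 8 / 1e6

print(round(cloudwatch_to_mbps(76376122845.0)))  # 10183 Mbit/s at peak
print(round(cloudwatch_to_mbps(5692555065.0)))   # 759 Mbit/s at baseline
```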


The maximum average throughput (between 08:26 and 08:29) is around 76 GByte/minute, which works out at around 1.27 GByte/second, or approximately 10.1 Gbit/second. Similarly, the baseline (from 08:31 onwards) is in the region of 5.7 GByte/minute, which translates to around 94 MByte/second, or around 750 Mbit/second. These numbers are naturally averages, but they are fairly close to the actual iperf3 results, with a peak throughput of just over 10 Gbit/second:


[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  6]   0.00-1.00   sec   604 MBytes  5065 Mbits/sec    0    551 KBytes       
[  8]   0.00-1.00   sec   604 MBytes  5062 Mbits/sec    0    524 KBytes       
[SUM]   0.00-1.00   sec  1.18 GBytes  10127 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   1.00-2.00   sec   601 MBytes  5046 Mbits/sec    0    551 KBytes       
[  8]   1.00-2.00   sec   602 MBytes  5048 Mbits/sec    0    551 KBytes       
[SUM]   1.00-2.00   sec  1.18 GBytes  10094 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   2.00-3.00   sec   602 MBytes  5046 Mbits/sec    0    577 KBytes       
[  8]   2.00-3.00   sec   602 MBytes  5047 Mbits/sec    0    577 KBytes       
[SUM]   2.00-3.00   sec  1.17 GBytes  10093 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   3.00-4.00   sec   601 MBytes  5045 Mbits/sec    0    577 KBytes       
[  8]   3.00-4.00   sec   602 MBytes  5049 Mbits/sec    0    577 KBytes       
[SUM]   3.00-4.00   sec  1.18 GBytes  10095 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   4.00-5.00   sec   602 MBytes  5049 Mbits/sec    0    577 KBytes       
[  8]   4.00-5.00   sec   601 MBytes  5045 Mbits/sec    0    577 KBytes       
[SUM]   4.00-5.00   sec  1.18 GBytes  10094 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   5.00-6.00   sec   602 MBytes  5049 Mbits/sec    0    577 KBytes       
[  8]   5.00-6.00   sec   601 MBytes  5042 Mbits/sec    0    577 KBytes       
[SUM]   5.00-6.00   sec  1.17 GBytes  10092 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   6.00-7.00   sec   602 MBytes  5046 Mbits/sec    0    577 KBytes       
[  8]   6.00-7.00   sec   602 MBytes  5049 Mbits/sec    0    577 KBytes       
[SUM]   6.00-7.00   sec  1.18 GBytes  10095 Mbits/sec   0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6]   7.00-8.00   sec   603 MBytes  5063 Mbits/sec    0   1.44 MBytes       
[  8]   7.00-8.00   sec   601 MBytes  5041 Mbits/sec   66    524 KBytes       
[SUM]   7.00-8.00   sec  1.18 GBytes  10104 Mbits/sec  66             
- - - - - - - - - - - - - - - - - - - - - - - - -

And the baseline of around 750 Mbit/second:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  6] 376.00-377.00 sec  43.8 MBytes   367 Mbits/sec  157    114 KBytes       
[  8] 376.00-377.00 sec  43.8 MBytes   367 Mbits/sec  157   78.7 KBytes       
[SUM] 376.00-377.00 sec  87.5 MBytes   734 Mbits/sec  314             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 377.00-378.00 sec  45.0 MBytes   377 Mbits/sec  161   69.9 KBytes       
[  8] 377.00-378.00 sec  45.0 MBytes   377 Mbits/sec  167   78.7 KBytes       
[SUM] 377.00-378.00 sec  90.0 MBytes   755 Mbits/sec  328             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 378.00-379.00 sec  43.8 MBytes   367 Mbits/sec  182   69.9 KBytes       
[  8] 378.00-379.00 sec  45.0 MBytes   377 Mbits/sec  168    105 KBytes       
[SUM] 378.00-379.00 sec  88.8 MBytes   744 Mbits/sec  350             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 379.00-380.00 sec  42.5 MBytes   357 Mbits/sec  150   61.2 KBytes       
[  8] 379.00-380.00 sec  46.2 MBytes   388 Mbits/sec  165   96.1 KBytes       
[SUM] 379.00-380.00 sec  88.8 MBytes   744 Mbits/sec  315             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 380.00-381.00 sec  36.2 MBytes   304 Mbits/sec  129   78.7 KBytes       
[  8] 380.00-381.00 sec  52.5 MBytes   440 Mbits/sec  203    105 KBytes       
[SUM] 380.00-381.00 sec  88.8 MBytes   744 Mbits/sec  332             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 381.00-382.00 sec  36.2 MBytes   304 Mbits/sec  147   96.1 KBytes       
[  8] 381.00-382.00 sec  52.5 MBytes   440 Mbits/sec  220   87.4 KBytes       
[SUM] 381.00-382.00 sec  88.8 MBytes   744 Mbits/sec  367             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 382.00-383.00 sec  46.2 MBytes   388 Mbits/sec  175   52.4 KBytes       
[  8] 382.00-383.00 sec  42.5 MBytes   357 Mbits/sec  167    114 KBytes       
[SUM] 382.00-383.00 sec  88.8 MBytes   744 Mbits/sec  342             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 383.00-384.00 sec  41.2 MBytes   346 Mbits/sec  165   61.2 KBytes       
[  8] 383.00-384.00 sec  47.5 MBytes   398 Mbits/sec  170   96.1 KBytes       
[SUM] 383.00-384.00 sec  88.8 MBytes   744 Mbits/sec  335             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  6] 384.00-385.00 sec  50.0 MBytes   419 Mbits/sec  195   87.4 KBytes       
[  8] 384.00-385.00 sec  38.8 MBytes   325 Mbits/sec  157   52.4 KBytes       
[SUM] 384.00-385.00 sec  88.8 MBytes   744 Mbits/sec  352             
- - - - - - - - - - - - - - - - - - - - - - - - -

Calculating network credit rates

Using the baseline network performance, we can draw some inferences about the rate at which network credits are accrued. For simplicity I am going to define a network credit as being worth 1 Gbps for 1 second, so an instance with 10 network credits could transmit at 10 Gbps for 1 second. Naturally, if the instance network limit is 10 Gbps, the maximum rate can't be exceeded even if the instance credit balance is sufficient (20 credits allows 2 seconds at 10 Gbps rather than 1 second at 20 Gbps). Given the baseline network performance in the previous section, we can assume that an r4.large has a network credit rate of around 0.75 credits per second. We can also assume a starting balance of around 2700, as we were able to maintain 10 Gbps for around 295 seconds ((10 - 0.75) * 295) at the start of the iperf3 run. Finally, it appears the maximum credit balance on the r4.large is the same as the initial balance: leaving the instances idle for 3 hours should have resulted in a credit balance of around 8100 (0.75 rate * 3600 seconds in an hour * 3 hours), which should theoretically have allowed around 810 seconds at 10 Gbps, but instead provided only around 295 seconds.
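The balance inference above can be written out as a quick back-of-the-envelope calculation; all the figures are this post's estimates, not published numbers:

```python
# Sustaining 10 Gbps for 295 seconds while simultaneously accruing
# 0.75 credits/s implies the starting credit balance.
burst_gbps, accrual_rate, observed_seconds = 10, 0.75, 295
initial_balance = (burst_gbps - accrual_rate) * observed_seconds
print(initial_balance)  # 2728.75, roughly the 2700 estimate
```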

R4 network performance table

Below is a table of the expected performance for R4 instance sizes.

Instance     Baseline Gbps    Initial/max credit    Maximum time at 10 Gbps
             (approximate)    (approximate)         (approximate seconds)
r4.large     0.75             2700                  295
r4.xlarge    1.25             5145                  589
r4.2xlarge   2.5              8925                  1191
r4.4xlarge   5                11950                 2390
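As a sanity check, the "maximum time" column follows from the credit estimates: with a 10 Gbps cap, burst time is roughly credit / (10 - baseline). A quick sketch, using this post's estimated figures, lands within a few seconds of the observed values:

```python
# Estimated (baseline Gbps, initial credit) per instance size.
estimates = {
    "r4.large": (0.75, 2700),
    "r4.xlarge": (1.25, 5145),
    "r4.2xlarge": (2.5, 8925),
    "r4.4xlarge": (5, 11950),
}
for name, (baseline, credit) in estimates.items():
    # Credits drain at (10 - baseline) per second while bursting at 10 Gbps.
    print(name, round(credit / (10 - baseline)))
```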

To calculate whether an instance will match your network throughput requirements, take the difference between the instance's baseline rate and your application's base network utilisation; this is the rate at which you will accrue credits. Then divide the required burst rate by this accrual rate to get the time needed to fund each second of burst.

For example, if your application requires a baseline of 0.6 Gbps, you would accrue credits at around 0.15 per second, allowing you to burst at 10 Gbps for approximately one second every 66 seconds (10 / 0.15), or for 10 seconds every 660 seconds.
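The worked example can be sketched as follows; the 0.6 Gbps application baseline is the hypothetical figure from above:

```python
# 0.6 Gbps of steady traffic on an r4.large (0.75 credits/s) leaves
# 0.15 credits/s spare, funding one second at 10 Gbps roughly every
# 10 / 0.15 seconds (the post's simplified rule of thumb).
instance_rate, app_baseline, burst_gbps = 0.75, 0.6, 10
spare = instance_rate - app_baseline
print(round(burst_gbps / spare, 1))  # 66.7 seconds per second of burst
```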

Friday, 21 October 2016

AWS CLI - Switching to and from regional EC2 reserved instances

AWS recently announced the availability of regional reserved instances; this post explains how to switch a reservation from AZ-specific to regional (and back) using the AWS CLI.

Step 1, find the reservation to modify

$ aws ec2 describe-reserved-instances --filters Name=state,Values=active
{
    "ReservedInstances": [
        {
            "ReservedInstancesId": "c416aeaf-fb64-4218-970f-7426f6f32377", 
            "OfferingType": "No Upfront", 
            "AvailabilityZone": "eu-west-1c", 
            "End": "2017-10-21T08:45:55.000Z", 
            "ProductDescription": "Linux/UNIX", 
            "Scope": "Availability Zone", 
            "UsagePrice": 0.0, 
            "RecurringCharges": [
                {
                    "Amount": 0.01, 
                    "Frequency": "Hourly"
                }
            ], 
            "OfferingClass": "standard", 
            "Start": "2016-10-21T08:45:56.708Z", 
            "State": "active", 
            "FixedPrice": 0.0, 
            "CurrencyCode": "USD", 
            "Duration": 31536000, 
            "InstanceTenancy": "default", 
            "InstanceType": "t2.micro", 
            "InstanceCount": 1
        }
    ]
}

The "Scope" field in the response shows that this reservation is currently specific to an Availability Zone, eu-west-1c in this case.

Step 2, request the modification

$ aws ec2 modify-reserved-instances --reserved-instances-ids c416aeaf-fb64-4218-970f-7426f6f32377 --target-configurations Scope=Region,InstanceCount=1
{
    "ReservedInstancesModificationId": "rimod-aaada6ed-fec9-47c7-92e2-6edf7e61f2ce"
}

Scope=Region indicates that this reservation should be converted to a regional reservation; InstanceCount is a required parameter indicating the number of instances in the reservation that the modification should be applied to.

Step 3, monitor progress

$ aws ec2 describe-reserved-instances-modifications
{
    "ReservedInstancesModifications": [
        {
            "Status": "processing", 
            "ModificationResults": [
                {
                    "ReservedInstancesId": "35f9b908-ae36-41ca-ac0b-4c67c887135b", 
                    "TargetConfiguration": {
                        "InstanceCount": 1
                    }
                }
            ], 
            "EffectiveDate": "2016-10-21T08:45:57.000Z", 
            "CreateDate": "2016-10-21T08:50:28.585Z", 
            "UpdateDate": "2016-10-21T08:50:31.098Z", 
            "ReservedInstancesModificationId": "rimod-aaada6ed-fec9-47c7-92e2-6edf7e61f2ce", 
            "ReservedInstancesIds": [
                {
                    "ReservedInstancesId": "c416aeaf-fb64-4218-970f-7426f6f32377"
                }
            ]
        }
    ]
}

The "Status" in the response will show "processing" until the modification has completed successfully, at which time it will change to "fulfilled":

$ aws ec2 describe-reserved-instances-modifications
{
    "ReservedInstancesModifications": [
        {
            "Status": "fulfilled", 
            "ModificationResults": [
                {
                    "ReservedInstancesId": "35f9b908-ae36-41ca-ac0b-4c67c887135b", 
                    "TargetConfiguration": {
                        "InstanceCount": 1
                    }
                }
            ], 
            "EffectiveDate": "2016-10-21T08:45:57.000Z", 
            "CreateDate": "2016-10-21T08:50:28.585Z", 
            "UpdateDate": "2016-10-21T09:11:33.454Z", 
            "ReservedInstancesModificationId": "rimod-aaada6ed-fec9-47c7-92e2-6edf7e61f2ce", 
            "ReservedInstancesIds": [
                {
                    "ReservedInstancesId": "c416aeaf-fb64-4218-970f-7426f6f32377"
                }
            ]
        }
    ]
}

Step 4, success!

The new reservation is now regional (Scope=Region):

$ aws ec2 describe-reserved-instances --filters Name=state,Values=active
{
    "ReservedInstances": [
        {
            "ReservedInstancesId": "35f9b908-ae36-41ca-ac0b-4c67c887135b", 
            "OfferingType": "No Upfront", 
            "FixedPrice": 0.0, 
            "End": "2017-10-21T08:45:55.000Z", 
            "ProductDescription": "Linux/UNIX", 
            "Scope": "Region", 
            "UsagePrice": 0.0, 
            "RecurringCharges": [
                {
                    "Amount": 0.01, 
                    "Frequency": "Hourly"
                }
            ], 
            "OfferingClass": "standard", 
            "Start": "2016-10-21T08:45:57.000Z", 
            "State": "active", 
            "InstanceCount": 1, 
            "CurrencyCode": "USD", 
            "Duration": 31536000, 
            "InstanceTenancy": "default", 
            "InstanceType": "t2.micro"
        }
    ]
}

Switching back

Switching back follows the same process, with the added requirement of specifying which AZ the reservation should be linked to:

$ aws ec2 modify-reserved-instances --reserved-instances-ids 35f9b908-ae36-41ca-ac0b-4c67c887135b --target-configurations Scope="Availability Zone",InstanceCount=1,AvailabilityZone=eu-west-1b
{
    "ReservedInstancesModificationId": "rimod-9e490be9-55a3-48cf-81e9-2662b13db2f8"
}

$ aws ec2 describe-reserved-instances --filters Name=state,Values=active
{
    "ReservedInstances": [
        {
            "ReservedInstancesId": "df70d097-2f33-4962-bca6-37af15ca819e", 
            "OfferingType": "No Upfront", 
            "AvailabilityZone": "eu-west-1b", 
            "End": "2017-10-21T08:45:55.000Z", 
            "ProductDescription": "Linux/UNIX", 
            "Scope": "Availability Zone", 
            "UsagePrice": 0.0, 
            "RecurringCharges": [
                {
                    "Amount": 0.01, 
                    "Frequency": "Hourly"
                }
            ], 
            "OfferingClass": "standard", 
            "Start": "2016-10-21T08:45:58.000Z", 
            "State": "active", 
            "FixedPrice": 0.0, 
            "CurrencyCode": "USD", 
            "Duration": 31536000, 
            "InstanceTenancy": "default", 
            "InstanceType": "t2.micro", 
            "InstanceCount": 1
        }
    ]
}

Wednesday, 31 August 2016

AWS troubleshooting - Lambda deployment package file permissions

When creating your own Lambda deployment packages, be aware of the permissions on the files before zipping them. Lambda requires the files to be readable by all users, particularly "other"; if this is missing, you will receive a non-obvious error when trying to call the function. The fix is simple enough: perform a 'chmod a+r *' before creating your zip file. If the code is visible in the inline editor, adding an empty line and saving will also fix the problem, presumably by overwriting the file with the correct permissions.
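Since zip files store Unix permission bits, a package can also be checked before uploading. This is an illustrative helper, not part of any AWS tooling; the function name is my own:

```python
import stat
import zipfile

def unreadable_entries(zip_path):
    """Return names of zip entries missing the world-readable bit."""
    bad = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            mode = info.external_attr >> 16  # Unix mode lives in the high bits
            if mode and not mode & stat.S_IROTH:
                bad.append(info.filename)
    return bad
```

Any filename it returns would need a chmod before re-zipping.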

Below are some examples of errors you will see in the various languages if read permissions are missing. Hopefully this post will have saved you some time debugging.

Java CloudWatch logs:
--
Class not found: example.Hello: class java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: example.Hello
at java.net.URLClassLoader$1.run(URLClassLoader.java:370)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
Caused by: java.io.FileNotFoundException: /var/task/example/Hello.class (Permission denied)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at sun.misc.URLClassPath$FileLoader$1.getInputStream(URLClassPath.java:1251)
at sun.misc.Resource.cachedInputStream(Resource.java:77)
at sun.misc.Resource.getByteBuffer(Resource.java:160)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:454)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
... 7 more
--

Java execution result (testing from console):
{
  "errorMessage": "Class not found: example.Hello",
  "errorType": "class java.lang.ClassNotFoundException"
}

Python CloudWatch logs:
--
Unable to import module 'python-hi': No module named python-hi
--

Python execution result (testing from console):
{
  "errorMessage": "Unable to import module 'python-hi'"
}

Node CloudWatch logs:
--
module initialization error: Error
    at Error (native)
    at Object.fs.openSync (fs.js:549:18)
    at Object.fs.readFileSync (fs.js:397:15)
    at Object.Module._extensions..js (module.js:415:20)
    at Module.load (module.js:343:32)
    at Function.Module._load (module.js:300:12)
    at Module.require (module.js:353:17)
    at require (internal/module.js:12:17)
--

Node execution result (testing from console):
{
  "errorMessage": "EACCES: permission denied, open '/var/task/node-hi.js'",
  "errorType": "Error",
  "stackTrace": [
    "Object.fs.openSync (fs.js:549:18)",
    "Object.fs.readFileSync (fs.js:397:15)",
    "Object.Module._extensions..js (module.js:415:20)",
    "Module.load (module.js:343:32)",
    "Function.Module._load (module.js:300:12)",
    "Module.require (module.js:353:17)",
    "require (internal/module.js:12:17)"
  ]
}

Friday, 12 August 2016

AWS Tip of the day: Tagging EC2 reserved instances

A quick post pointing out that EC2 reserved instances actually support tagging. This functionality is only available via the command line or the API, not via the console, but it still allows you to tag your reservations, making it easier to keep track of why a reserved instance was purchased and which component it was intended for. Of course, the reservation itself is not tied to a running instance in any way; it is merely a billing construct applied to any matching instances running in your account. But if you are making architectural changes or considering different instance types for specific workloads or components, the tags allow you (and your team) to see why the reservation was originally purchased. So, for example, if you are scaling up the instance sizes of a specific component, let's say from m4.large to m4.xlarge, you can check your reserved instance tags and modify the reservations associated with the component to ensure you continue to benefit from the purchase.

Tagging reserved instances works the same as tagging other EC2 resources: use the AWS CLI's ec2 create-tags command, specifying the reserved instance ID as the resource ID. You can find the reserved instance ID using the CLI's ec2 describe-reserved-instances command. Using an actual example, let's start by finding a reservation:

$ aws ec2 describe-reserved-instances
{
    "ReservedInstances": [
        {
            "ReservedInstancesId": "3d092b71-5243-4e5e-b409-86df342282ab", 
            "OfferingType": "No Upfront", 
            "AvailabilityZone": "eu-west-1c", 
            "End": "2017-08-12T04:48:58.000Z", 
            "ProductDescription": "Linux/UNIX", 
            "UsagePrice": 0.0, 
            "RecurringCharges": [
                {
                    "Amount": 0.01, 
                    "Frequency": "Hourly"
                }
            ], 
            "Start": "2016-08-12T04:48:59.763Z", 
            "State": "active", 
            "FixedPrice": 0.0, 
            "CurrencyCode": "USD", 
            "Duration": 31536000, 
            "InstanceTenancy": "default", 
            "InstanceType": "t2.micro", 
            "InstanceCount": 1
        }
    ]
}

Next, let's add a tag indicating that this reservation is intended for the "production" stack:
$ aws ec2 create-tags --resources 3d092b71-5243-4e5e-b409-86df342282ab --tags Key=Stack,Value=production


Checking the result:
$ aws ec2 describe-reserved-instances
{
    "ReservedInstances": [
        {
            "ReservedInstancesId": "3d092b71-5243-4e5e-b409-86df342282ab", 
            "OfferingType": "No Upfront", 
            "AvailabilityZone": "eu-west-1c", 
            "End": "2017-08-12T04:48:58.000Z", 
            "ProductDescription": "Linux/UNIX", 
            "Tags": [
                {
                    "Value": "production", 
                    "Key": "Stack"
                }
            ], 
            "UsagePrice": 0.0, 
            "RecurringCharges": [
                {
                    "Amount": 0.01, 
                    "Frequency": "Hourly"
                }
            ], 
            "Start": "2016-08-12T04:48:59.763Z", 
            "State": "active", 
            "FixedPrice": 0.0, 
            "CurrencyCode": "USD", 
            "Duration": 31536000, 
            "InstanceTenancy": "default", 
            "InstanceType": "t2.micro", 
            "InstanceCount": 1
        }
    ]
}

Great, we have a tag, but what if we have hundreds of reservations? A long list of reservations is not particularly useful for quickly identifying those related to a component or stack. The CLI's query and output functionality can help here:

$ aws ec2 describe-reserved-instances --query 'ReservedInstances[*].{AZ:AvailabilityZone,Type:InstanceType,Expiry:End,stack:Tags[?Key==`Stack`][?Value==`production`]}' --output=table
--------------------------------------------------------
|               DescribeReservedInstances              |
+-------------+----------------------------+-----------+
|     AZ      |          Expiry            |   Type    |
+-------------+----------------------------+-----------+
|  eu-west-1c |  2017-08-12T04:48:58.000Z  |  t2.micro |
+-------------+----------------------------+-----------+

Not quite the console view but easy enough to see that we have one reservation for the "production" Stack.

Tuesday, 7 June 2016

AWS Tip: Save S3 costs with abort multipart lifecycle policy

Introduction

S3 multipart uploads provide a number of benefits -- better throughput, recovery from network errors -- and a number of tools will automatically use multipart uploads for larger uploads. The AWS CLI cp, mv, and sync commands all make use of multipart uploads and make a note that "If the process is interrupted by a kill command or system failure, the in-progress multipart upload remains in Amazon S3 and must be cleaned up manually..."

The reason you would want to clean up these failed multipart uploads is that you will be charged for the storage they use while waiting for the upload to be completed (or aborted). This post provides some detail on how to find the incomplete uploads and options for removing them to save storage costs.

Finding incomplete multipart uploads

If you have relatively few buckets or only want to check your biggest buckets (CloudWatch S3 metrics are useful for finding these) the AWS CLI s3api list-multipart-uploads command is a simple check:
aws s3api list-multipart-uploads --bucket [bucket-name]
No output indicates that the bucket does not contain any incomplete uploads; see the list-multipart-uploads documentation for an example of the output on a bucket that does contain an incomplete upload. A simple bash script to check all your buckets:

This script will list your buckets and the first page (of possibly many more) of incomplete multipart upload keys, along with the date each upload was initiated. The region lookup is required to handle bucket names containing dots and buckets in eu-central-1 (SigV4).
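The original script is not reproduced here, but based on the description above (bucket listing, region lookup, first page of incomplete uploads), an untested boto3 equivalent might look roughly like this; it requires AWS credentials to run and is a sketch, not the author's script:

```python
import boto3

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    # Region lookup: needed for dotted bucket names and SigV4-only regions
    # such as eu-central-1. LocationConstraint is None for us-east-1.
    region = s3.get_bucket_location(Bucket=name)["LocationConstraint"] or "us-east-1"
    regional = boto3.client("s3", region_name=region)
    # Only the first page of results; a bucket may have many more pages.
    for upload in regional.list_multipart_uploads(Bucket=name).get("Uploads", []):
        print(name, upload["Key"], upload["Initiated"])
```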

Cleaning up

Once you have identified the buckets containing incomplete uploads, it is worth investigating some of the recent failed uploads to see whether there is an underlying issue that needs to be addressed, particularly if the uploads relate to backups or important log files. The most typical cause is instances being terminated before completing uploads (look at Lifecycle Hooks to fix this if you are using Auto Scaling), but they may also be the result of applications not performing cleanup on failures or not handling errors correctly.

A multipart upload can be aborted using the abort-multipart-upload s3api command in the AWS CLI, using the object key and upload ID returned by the list-multipart-uploads command. This can be scripted, but it will take time to complete for buckets containing large numbers of incomplete uploads; fortunately, there is an easier way. S3 now supports a bucket lifecycle policy to automatically delete incomplete uploads after a specified period of time. Enabling the policy in the AWS console is fairly quick and easy; see Jeff's blog post for details. A rather messy boto3 example script for enabling the policy on all buckets can be found here; it should work with most bucket configurations, but it comes with no guarantees and you use it at your own risk.
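For reference, the lifecycle configuration behind this feature takes roughly the following shape when applied via the s3api put-bucket-lifecycle-configuration command; the rule ID and the seven-day window are illustrative choices, not values from this post:

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Prefix": "",
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }
  ]
}
```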

Conclusion (why you should do this)

If you are using S3 to store large (> 5 MB) objects and you are spending more than a few dollars a month on S3 storage, then there is a fairly good chance that you are paying unnecessarily for failed/incomplete multipart uploads. It should only take a few minutes to review your buckets, and doing so could yield significant monthly savings.