Monday 13 July 2015

C-states and P-states with Ubuntu 14.04 on Amazon EC2

Introduction

The largest new-generation EC2 instance types now expose C-state and P-state control to the operating system. Unfortunately, this added control introduces additional complexity, and this post discusses some of the related issues on Ubuntu 14.04.

For comparison I am using a relatively vanilla Ubuntu 14.04.2 (3.13.0-57-generic) AMI running memcached on both a c3.8xl and a c4.8xl. As a memcached workload is generally dependent on memory and network, I have enabled Receive Packet Steering (RPS) as described in this Pinterest article, ensuring it persists across reboots by adding it to /etc/rc.local.
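
For reference, RPS is enabled by writing a CPU mask to each receive queue's rps_cpus file in sysfs; a minimal sketch (the interface name and mask here are assumptions to adjust for your instance), with the same line added to /etc/rc.local for persistence:

$ for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo ffffffff | sudo tee $q; done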

More haste, less speed (TL;DR)

For those without the time or patience to read the full post: intel_pstate is disabled by default in Ubuntu 14.04, and the ondemand rc.d script will prevent optimal performance of a latency-sensitive application on C-state/P-state enabled instances. Make these changes to improve latency and reduce variability:

1. Disable ondemand (dynamic) CPU frequency scaling
$ sudo update-rc.d ondemand disable

2. Edit /etc/default/grub.d/50-cloudimg-settings.cfg and update the Grub command line to be:
# Set the default commandline
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 intel_idle.max_cstate=1 intel_pstate=enable"

3. Regenerate the Grub configuration
$ sudo update-grub

4. Reboot
$ sudo reboot

Not so idle (for those with time)

The first notable thing is that the idle state on the c4.8xl differs from the c3.8xl according to CloudWatch:

Idle instances c3.8xl (i-2cb4bd86) vs. c4.8xl (i-0e737ba4)

According to sar the c3 is marginally more idle:

$ sar 1 10
Linux 3.13.0-57-generic (ip-172-31-40-39) 07/12/2015 _x86_64_ (32 CPU)

05:11:51 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
05:11:52 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:53 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:54 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:55 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:56 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:57 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:58 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:59 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:12:00 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:12:01 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
Average:        all      0.00      0.00      0.00      0.00      0.00    100.00


While the c4 is doing some system-related work:

$ sar 1 10
Linux 3.13.0-57-generic (ip-172-31-34-50) 07/12/2015 _x86_64_ (36 CPU)

05:11:46 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
05:11:47 PM     all      0.00      0.00      0.03      0.00      0.00     99.97
05:11:48 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:49 PM     all      0.00      0.00      0.03      0.00      0.00     99.97
05:11:50 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:51 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:52 PM     all      0.03      0.00      0.03      0.00      0.00     99.94
05:11:53 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:54 PM     all      0.00      0.00      0.03      0.00      0.00     99.97
05:11:55 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
05:11:56 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
Average:        all      0.00      0.00      0.01      0.00      0.00     99.99

This does not, however, account for the ~0.5% CPU being reported by CloudWatch, which is presumably due to some overhead for C-state management. On the c3 we don't seem to have a cpuidle driver:

$ cat /sys/devices/system/cpu/cpuidle/current_driver
none

Whereas on the c4 we have:

$ cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

Disabling C-states (intel_idle driver)

Out of interest, let's see what the impact of disabling the driver is. To do this we need to set some kernel parameters in Grub, which live in /etc/default/grub.d/50-cloudimg-settings.cfg on Ubuntu instances running on EC2. The updated file should look like this:

$ cat /etc/default/grub.d/50-cloudimg-settings.cfg
# Cloud Image specific Grub settings for Generic Cloud Images
# CLOUD_IMG: This file was created/modified by the Cloud Image build process

# Set the recordfail timeout
GRUB_RECORDFAIL_TIMEOUT=0

# Do not wait on grub prompt
GRUB_TIMEOUT=0

# Set the default commandline
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 intel_idle.max_cstate=0 processor.max_cstate=0"

# Set the grub console type
GRUB_TERMINAL=console

The new configuration can then be installed with:
$ sudo update-grub
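
After rebooting, the applied parameters can be verified from the running kernel (a quick sanity check); the intel_idle.max_cstate and processor.max_cstate options should appear in the output:

$ cat /proc/cmdline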

A relatively good explanation of the kernel options can be found on Stack Overflow; note that the idle=poll option will cause CloudWatch to report 100% CPU and is not advised. After a reboot for the changes to take effect, the following CloudWatch graph (including a default c4.8xl for comparison) shows the difference in idle CPU.

Idle instances c3.8xl (i-2cb4bd86) vs. c4.8xl with C-state disabled (i-0e737ba4)

A notable improvement, but still not identical. Let's take a look at performance next.

The need for speed

As the c3.8xl does not support P-states, we would not expect to see a frequency scaling policy or any difference between the core CPU frequencies. This can be confirmed with the cpupower utility:

$ sudo cpupower frequency-info
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  boost state support:
    Supported: no
    Active: no
    25500 MHz max turbo 4 active cores
    25500 MHz max turbo 3 active cores
    25500 MHz max turbo 2 active cores
    25500 MHz max turbo 1 active cores


$ cat /proc/cpuinfo | grep MHz
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
[...]
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044
cpu MHz : 2800.044


Nothing terribly surprising here: no cpufreq driver and all the cores running at the default frequency. Looking at the c4.8xl, however, we see a few noteworthy items:

$ sudo cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 1.20 GHz - 2.90 GHz
  available frequency steps: 2.90 GHz, 2.90 GHz, 2.80 GHz, 2.70 GHz, 2.50 GHz, 2.40 GHz, 2.30 GHz, 2.20 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 1.20 GHz and 2.90 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 1.20 GHz (asserted by call to hardware).
  cpufreq stats: 2.90 GHz:1.17%, 2.90 GHz:0.00%, 2.80 GHz:0.00%, 2.70 GHz:0.01%, 2.50 GHz:0.00%, 2.40 GHz:0.00%, 2.30 GHz:0.00%, 2.20 GHz:0.00%, 2.00 GHz:0.00%, 1.90 GHz:0.00%, 1.80 GHz:0.00%, 1.70 GHz:0.00%, 1.60 GHz:0.00%, 1.40 GHz:0.00%, 1.30 GHz:0.00%, 1.20 GHz:98.82%  (15)
  boost state support:
    Supported: yes
    Active: yes

$ cat /proc/cpuinfo | grep MHz
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
[...]
cpu MHz : 1200.000
cpu MHz : 1600.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000

Firstly, it seems Ubuntu is using the acpi-cpufreq driver instead of the new and improved intel_pstate driver (more on this later). Secondly, we are using the 'ondemand' governor, which is great for power saving but not ideal for a high-performance or low-latency server. Finally, thanks to the ondemand governor, all of the cores are running significantly slower than their advertised (3.2 GHz) maximum. Some digging finds that it is related to this and is relatively easy to fix by disabling /etc/init.d/ondemand:

$ sudo update-rc.d ondemand disable

This will prevent the 'performance' governor from being replaced with 'ondemand' at the next boot. To correct the frequency without a reboot, we can update the governor with cpupower:

$ sudo cpupower frequency-set -g performance
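
To confirm the change applied to every core, the governor can be read back from sysfs (each core should now report 'performance'):

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

The core frequencies should then reflect the new policy: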

$ cat /proc/cpuinfo | grep MHz
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
[...]
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000
cpu MHz : 2901.000

Still not hitting the maximum frequency, but this shaves a few more fractions off the idle CPU usage reported by CloudWatch:

Idle instances c3.8xl (i-2cb4bd86) vs. c4.8xl with C-state disabled and performance frequency policy (i-0e737ba4)

Adding some stress

Thus far we have been looking at the idle CPU utilisation and have got relatively close to the same baseline performance with our customised c4.8xl. Time now to look at how adding some load changes things, starting off with a relatively simple test using 'stress' to generate artificial load on the CPU and memory.
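
If stress is not already installed it is available from the Ubuntu repositories; the workload used below runs 4 CPU workers and 2 memory (malloc/free) workers in the background:

$ sudo apt-get install stress
$ stress -c 4 -m 2 &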

Using 'stress -c 4 -m 2' the c3.8xl shows the following sar average after running for a few minutes:
06:58:11 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:58:40 AM     all     12.65      0.00      6.12      0.00      0.00     81.23
Average:        all     12.63      0.06      6.14      0.03      0.00     81.14

The default c4.8xl shows:
06:58:11 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:58:41 AM     all     11.22      0.00      5.50      0.00      0.00     83.28
Average:        all     11.24      0.15      5.47      0.03      0.00     83.10

And the tweaked c4.8xl (with C-states disabled and performance policy) shows:
06:58:11 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:58:43 AM     all     11.27      0.00      5.44      0.00      0.00     83.29
Average:        all     11.23      0.00      5.45      0.00      0.00     83.32

The CloudWatch graph shows the c3 (i-2cb4bd86) averaging around 18.8%, the default c4 (i-c2b4bd68) averaging around 17.1%, and the tweaked c4 (i-0e737ba4) averaging around 16.8%, all of which match the sar %idle figures relatively closely:

Instances under stress c3 (i-2cb4bd86), default c4 (i-c2b4bd68), tweaked c4 (i-0e737ba4)

Testing memcached using a simple approach of an infinite loop calling memcslap shows slightly greater differences. Two loops are started in the background to simulate a reasonable load of around 200k packets per second (admittedly on the loopback interface but good enough for now):

$ while true; do memcslap --servers=<server_ip_here>:11211 --concurrency=10 --execute-number=10000 --binary > /dev/null; done &
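
One way to sanity-check the packet rate is to watch the interface counters with sar; the rxpck/s and txpck/s columns for lo should sit around the expected rate:

$ sar -n DEV 1 5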

On the c3.8xl sar shows (approximately):
08:35:51 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:38:34 AM     all      5.48      0.00      8.61      0.00      0.60     85.31
Average:        all      5.50      0.00      4.62      0.00      0.33     89.54

The default c4 shows:
08:35:51 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:36:20 AM     all      5.01      0.00      0.84      0.00      0.00     94.15
Average:        all      5.14      0.00      2.99      0.00      0.00     91.87

While the tweaked c4 shows:
08:35:51 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:36:26 AM     all      5.53      0.00      2.12      0.00      0.00     92.35
Average:        all      5.26      0.00      2.71      0.00      0.00     92.03

Again there is not much between them according to sar, but CloudWatch shows some larger discrepancies in this case: CPU is above 20% for all instances, with the default c4 showing between 25% and 30% vs. the tweaked c4 (and c3) staying mostly between 20% and 25%.

Instances running memcslap c3 (i-2cb4bd86), default c4 (i-c2b4bd68), tweaked c4 (i-0e737ba4)

We are clearly missing something important here; time to look at that intel_pstate driver.

All about the driver (enabling intel_pstate)

As noted earlier, the default cpufreq driver on Ubuntu 14.04 appears to be the legacy acpi-cpufreq driver. There is some history around this which is not really relevant here, but suffice it to say that you need to opt in to enable the intel_pstate driver. Again it is a relatively simple matter of adding a kernel option to Grub, and given that the evidence above shows our default c4 installation is a loser, let's make some changes to make it faster. Firstly, disable the ondemand scaling policy as we did for the tweaked c4 earlier. Next, enable the intel_pstate driver by adding 'intel_pstate=enable' to the Grub command line, and at the same time disable the deeper C-states to limit the latency of waking cores (with 'intel_idle.max_cstate=1'). Finally, install the new Grub config and reboot:

$ sudo update-rc.d ondemand disable
...
$ cat /etc/default/grub.d/50-cloudimg-settings.cfg 
# Cloud Image specific Grub settings for Generic Cloud Images
# CLOUD_IMG: This file was created/modified by the Cloud Image build process

# Set the recordfail timeout
GRUB_RECORDFAIL_TIMEOUT=0

# Do not wait on grub prompt
GRUB_TIMEOUT=0

# Set the default commandline
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 intel_idle.max_cstate=1 intel_pstate=enable"

# Set the grub console type
GRUB_TERMINAL=console


$ sudo update-grub
...
$ sudo reboot

After the reboot, cpupower confirms that we are using the intel_pstate driver and that our server is now super-charged at 3.2 GHz:

$ sudo cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 3.50 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 3.50 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 3.20 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes

$ cat /proc/cpuinfo | grep MHz
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
[...]
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875
cpu MHz : 3200.875

Rerunning the earlier memcached test shows a significant improvement for the P-state instance. While the sar results are quite similar, the CloudWatch graph shows the P-state instance using around 16% CPU, beating both the c3 and the tweaked c4. The approximate sar average for the P-state c4:

09:09:52 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:36 AM     all      7.17      0.00      3.75      0.00      0.00     89.08
Average:        all      6.15      0.00      2.43      0.00      0.00     91.42

Instances running memcslap c3 (i-2cb4bd86), intel_pstate c4 (i-c2b4bd68), tweaked c4 (i-0e737ba4)

For comparison, rerunning the stress test from earlier (stress -c 4 -m 2):

Instances under stress c3 (i-2cb4bd86), intel_pstate c4 (i-c2b4bd68), tweaked c4 (i-0e737ba4)

Summary

After a relatively quick and dirty investigation it is clear that the intel_pstate driver, while not enabled by default on Ubuntu 14.04, really should be the driver of choice.

Monday 22 June 2015

AWS Tip: Terminating instances in an Auto Scaling group

EC2 instances launched by Auto Scaling really should not be terminated outside of Auto Scaling, but if for some reason you (really, really) need to terminate a number of instances, you may see some unexpected behaviour if you terminate them using the EC2 console. Instead of all the instances being replaced by Auto Scaling simultaneously, they are replaced individually over a period of time. This means it may take significantly longer than you would expect to get the ASG back to its original size. As an example, an instance in my test ASG was replaced every 2 minutes, taking around 6 minutes to terminate and replace 3 instances.

There are two ways of doing this more efficiently (using the supremely useful AWS CLI). The first is to temporarily reduce the group capacity and then reset it, as sketched below:
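
A minimal sketch of the idea (the group name and sizes are placeholders for your own values; the second call restores the original capacity once the instances have terminated):

$ aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg --min-size 0 --desired-capacity 0
$ aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg --min-size 3 --desired-capacity 3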


Note that update-auto-scaling-group is used because a minimum size is set; if no minimum size is specified, the same result can be achieved with set-desired-capacity.

The second option is to use terminate-instance-in-auto-scaling-group:
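
A sketch of that call (the instance ID is a placeholder); the --no-should-decrement-desired-capacity flag tells Auto Scaling to launch a replacement instance:

$ aws autoscaling terminate-instance-in-auto-scaling-group --instance-id i-12345678 --no-should-decrement-desired-capacity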


Monday 19 January 2015

AWS Tip of the day: Working around the S3 "Failed to parse XML document" exception

If you have S3 objects with Unicode characters that aren't supported by XML 1.0, you are likely to see an exception when calling listObjects in the AWS Java SDK:

Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler

The quick and simple fix is to set the encoding type to "url". Example code stolen from here:
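
A minimal sketch along those lines with the v1 Java SDK (the bucket name is a placeholder); note that the keys come back URL-encoded and need decoding before use:

import java.net.URLDecoder;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ListUnicodeKeys {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = new AmazonS3Client();

        // Ask S3 to URL-encode keys so the response XML is always valid
        ListObjectsRequest request = new ListObjectsRequest()
                .withBucketName("my-bucket") // placeholder bucket name
                .withEncodingType("url");

        ObjectListing listing = s3.listObjects(request);
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            // Keys are returned URL-encoded and must be decoded before use
            System.out.println(URLDecoder.decode(summary.getKey(), "UTF-8"));
        }
    }
}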