How switching to AWS Graviton slashed our infrastructure bill by 35%

By Lewis Monteith. Last updated: November 11, 2022

Engineering, Tips, Infrastructure

When we started our analytics company, we knew that closely monitoring and managing our infrastructure spending was going to be really important. The numbers started out small, but we’re now capturing, processing, and consuming a lot of data.

On a recent search for new cost-saving opportunities, we came across a straightforward but substantial win, so I thought I’d share what we did and how we did it.

Before I get into exactly what we did, here’s a quick overview of the relevant infrastructure:

Infrastructure overview

Squeaky runs entirely inside of AWS and we use as many hosted options as possible to make our infrastructure manageable for our small team. For this article, it’s worth noting:

  • All of our apps run in ECS on Fargate
  • We use ElastiCache for Redis
  • We use RDS for Postgres
  • We use an EC2 instance for our self-managed ClickHouse database

These four things made up the majority of our infrastructure costs, with S3 and networking taking up the rest.

For the past year, Squeaky has been developed locally on M1-equipped MacBooks, with all runtimes and dependencies compatible with both arm64 and x86_64. We've never had any difficulties running the entire stack on ARM, so we decided to see if we could switch over to AWS Graviton to take advantage of its lower-cost ARM processors.

Updating the AWS managed services

The first thing we decided to update was the managed services, ElastiCache and RDS, as they were the least risky. The process was very straightforward: a single-line Terraform change for each, followed by a short wait for both services to reach their maintenance window.

Whilst we made sure to take snapshots beforehand, both services changed their underlying instances with no data loss and with very little downtime.
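As a rough sketch, the single-line change for each managed service might look like the following. The resource names and instance sizes here are hypothetical, not taken from our configuration; the point is only that the instance family moves to a Graviton one (the "g" families, such as m6g). Setting apply_immediately to false is an assumption that matches waiting for the maintenance window.

```hcl
# Sketch only: resource names and sizes are hypothetical.
# Other existing arguments (engine, storage, credentials, and so on) are omitted.

resource "aws_db_instance" "postgres" {
  instance_class    = "db.m6g.large" # Graviton family; previously an x86 class such as db.m5.large
  apply_immediately = false          # defer the change to the next maintenance window
}

resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "example-redis"         # hypothetical identifier
  description          = "Example Redis cluster" # hypothetical description
  node_type            = "cache.m6g.large"       # Graviton family; previously e.g. cache.m5.large
  apply_immediately    = false                   # defer the change to the next maintenance window
}
```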

Updating our applications

We have been using Fargate to run our Dockerised apps in production for around a year now, as it allows us to quickly scale up and down depending on load. We’ve had a good experience with ECS and it’s been easier to maintain than alternatives such as Kubernetes.

We took the following steps to get our applications running on Graviton Fargate instances:

1. We wanted to change our CI/CD pipeline over to Graviton so that we could build for arm64 in a native environment, meaning we would not need to mess around with cross-architecture builds. As we use AWS CodeBuild, it was a simple case of changing the environment type and image over.

- type  = "LINUX_CONTAINER"
+ type  = "ARM_CONTAINER"
- image = "aws/codebuild/amazonlinux2-x86_64-standard:4.0"
+ image = "aws/codebuild/amazonlinux2-aarch64-standard:2.0"

This was an in-place change, and all our build history and logs were preserved.

2. Next up we changed the Dockerfile for each app so that they used an arm64 base image. We built the Docker images locally before continuing to check there were no issues.

- FROM node:18.12-alpine
+ FROM arm64v8/node:18.12-alpine

3. Thirdly, we disabled the auto deploy in our pipeline, and pushed up our changes so that we could build our new arm64 artefacts and push them to ECR.

4. Next, we made some changes in Terraform to tell our Fargate apps to use arm64 instead of x86_64. This was a simple case of telling Fargate which architecture to use within the Task Definition.

+ runtime_platform {
+    cpu_architecture = "ARM64"
+ }

We applied the change app-by-app and let them gradually blue/green deploy the new Graviton containers. For around 3 minutes, traffic was served by both arm64 and x86_64 apps while the old containers drained and the new ones deployed.

5. Lastly, we monitored the apps and waited for them to reach their steady states before re-enabling the auto deployment.
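For context, here is a minimal sketch of where the runtime_platform block from step 4 sits within a Fargate task definition. The family name, CPU and memory sizes, and both variables are hypothetical placeholders rather than our real values. The image is assumed to be the arm64 artefact pushed to ECR in step 3.

```hcl
# Sketch only: names, sizes, and variables are hypothetical.

variable "execution_role_arn" {
  type        = string
  description = "ARN of the ECS task execution role (assumed to exist already)"
}

variable "arm64_image_uri" {
  type        = string
  description = "ECR URI of the image built for arm64"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "example-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = var.execution_role_arn

  # Tells Fargate to schedule the task on Graviton (arm64) capacity.
  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.arm64_image_uri
      essential = true
    }
  ])
}
```

The image architecture and cpu_architecture have to agree: an x86_64 image scheduled onto an ARM64 task will fail to start, which is why the arm64 images were built and pushed before this change was applied.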

For the most part, our apps required no code changes. We have several Node.js-based containers that run Next.js applications, and these required zero changes. Likewise, our data ingest API is written in Go, which also didn’t need any changes.

However, we did have some initial difficulties with our Ruby on Rails API. The image built fine, but it would crash on startup as aws-sdk-core was unable to find an XML parser:

Unable to find a compatible xml library. Ensure that you have installed or added to your Gemfile one of ox, oga, libxml, nokogiri or rexml (RuntimeError)

After some investigation, it turned out that, by default, Alpine Linux (the base image for our Docker apps) reports its architecture as aarch64-linux-musl, whereas our Nokogiri gem ships an ARM binary for aarch64-linux, causing it to silently fail. We verified this by switching over to a Debian-based image, where the reported architecture is aarch64-linux and the app started without crashing.

The solution was to add RUN apk add gcompat to our Dockerfile. You can read more about this here. I suspect this will only affect a small number of people, but it's interesting nonetheless.

Updating our ClickHouse database

This was by far the most involved process, and the only part that required any real downtime for the app. All in all, the process took about 30 minutes, during which time the Squeaky app was reporting 500 errors and our API was periodically restarting due to health check failures. To prevent data loss for our customers, we continued to collect data and kept it in our write buffer until the update was complete.

The process involved a mixture of Terraform changes, as well as some manual changes inside of the console. The steps were as follows:

1. We spun down all the workers that save session data. This way we could continue to ingest data, and save it when things were operational again

2. Next up was to take a snapshot of the EBS volume in case anything went wrong during the update

3. We stopped the EC2 instance, and detached our EBS volume. This was done by commenting out the volume attachment in Terraform and applying

# resource "aws_volume_attachment" "clickhouse-attachment" {
#   device_name = "/dev/sdf"
#   volume_id   = "${aws_ebs_volume.clickhouse.id}"
#   instance_id = "${aws_instance.clickhouse.id}"
# }

4. We then destroyed the old instance, including the root volume. Everything on the root volume had been set up by the user_data script and would be re-created with the new instance

5. After that, we updated the Terraform to switch the instance over to Graviton. We had to change two things: the AMI and the instance type. The volume attachment was left commented out so that the user_data script would not try to reformat the volume. The Terraform apply destroyed everything that was left and recreated the instance. The user_data script ran on start and installed the latest version of ClickHouse, as well as the CloudWatch agent.

 filter {
   name   = "architecture"
-  values = ["x86_64"]
+  values = ["arm64"]
 }

6. The volume was then reattached and mounted, and the ClickHouse process was restarted to pick up the configuration and data stored on the mounted volume

7. All of the alarms and health checks started to turn green, and service was resumed

8. The workers were spun back up and the last 30 minutes or so of session data was processed. The following graph shows the brief pause in processing, followed by a huge spike as it works through the queue

Image shows the abnormal processing behaviour due to the stopped workers.
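To tie steps 3 to 6 together, the following is a hedged sketch of how the AMI lookup, the instance, and the re-enabled volume attachment might fit together after the switch. The AMI owner and name pattern (Amazon Linux 2), the instance type, and the user_data file path are assumptions made for illustration; aws_ebs_volume.clickhouse refers to the existing data volume and is not defined in this fragment.

```hcl
# Sketch only: AMI owner/name pattern, instance type, and file path are assumptions.

data "aws_ami" "clickhouse" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "architecture"
    values = ["arm64"] # was ["x86_64"]
  }

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-arm64-gp2"] # assumed Amazon Linux 2 naming pattern
  }
}

resource "aws_instance" "clickhouse" {
  ami           = data.aws_ami.clickhouse.id
  instance_type = "m6g.large" # hypothetical Graviton size; must match the arm64 AMI
  user_data     = file("${path.module}/user_data.sh") # hypothetical path to the bootstrap script
}

# Uncommented again only after the new instance has finished bootstrapping,
# so that the bootstrap script never sees (or reformats) the data volume.
resource "aws_volume_attachment" "clickhouse-attachment" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.clickhouse.id # existing volume, defined elsewhere
  instance_id = aws_instance.clickhouse.id
}
```

The AMI architecture and the instance type need to change in the same apply, since an arm64 AMI cannot boot on an x86_64 instance type and vice versa.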

Conclusion

We’re strong believers in continuously improving tools and process, and that’s really paid off this time. By having all our apps running the latest versions of languages, frameworks and dependencies, we’ve been able to switch over to brand new infrastructure with almost zero code changes.

Switching our entire operation over to Graviton only took one day and we’ve saved approximately 35% on our infrastructure costs. When comparing our CPU and memory usage, along with latency metrics, we’ve seen no performance degradation. In fact, our overall memory footprint has dropped slightly, and we expect to see further improvements as the month rolls on.

It’s fair to say we're all-in on ARM, and any future pieces of infrastructure will now be powered by Graviton.