Slack and AWS



This is the Slack workspace for the Open Guide to Amazon Web Services. Anyone with AWS expertise or questions is encouraged to join: fill in your e-mail, and you’ll get an invitation from Slack for this group; then follow Slack’s instructions.

Slack itself runs largely on AWS – its architecture is an AWS case study, though now out of date – and it was AWS engineers who uncovered the problem with the transit gateways (TGWs). Slack had followed the practice AWS recommends, which is to organise workloads in separate AWS accounts. There are multiple virtual private clouds (VPCs), and the TGWs are hubs that link the VPCs.

You can also run commands using AWS CLI syntax directly in Slack channels. AWS Chatbot enables you to retrieve diagnostic information, invoke Lambda functions, and create support cases for your AWS resources. When you interact with AWS Chatbot in Slack, it parses your input and prompts you for any missing parameters before it runs the command.


Slack is partnering with Amazon Web Services to integrate AWS management functions into Slack channels, among other integrations between the two.


The integrations primarily involve Slack incorporating AWS functionality and products into its system in several ways that will benefit organizations that use both Slack and AWS, the companies said.


Slack also said it 'will use a range of AWS services, including storage, compute, database, security, analytics, and machine learning, to develop new collaboration features,' while AWS' part in the deal seems limited to adopting Slack as its internal collaboration platform.

Behind-the-scenes integrations are just one element of the Slack/AWS partnership, and many elements of the new deal between the two companies will affect Slack and AWS users as well.


What Slack and AWS users can expect

There are four major things that Slack announced as part of its new integration partnership with AWS, and each will affect enterprise Slack customers who use AWS.

The most invisible change coming as part of the partnership is Slack's migration of the infrastructure for its Slack Calls video calling and conferencing service to Amazon Chime. Slack said this migration will allow it to eliminate the overhead of maintaining its own video conferencing infrastructure and 'leverage AWS's proven infrastructure to deliver high-quality and reliable user experiences.'

Slack said the Amazon Chime migration won't have a direct effect on users, at least in the short term. Over time 'the move will allow us to add new features, such as mobile video, so users can continue to rely on Slack for secure enterprise communication.'

One thing that Slack users will definitely see is the addition of AWS Chatbot to Slack. While AWS Chatbot already exists and is actively being used, Slack said that the plan is to continue integrating the AWS Chatbot service to incorporate all of AWS' services 'to give developers the ability to collaborate with their teams to manage all of their cloud-based services without leaving Slack.'

AWS Chatbot in Slack

Inside Slack, AWS Chatbot can send alerts to channels when errors occur, let developers run AWS command line interface commands, and provide reports and other diagnostic data.

Amazon AppFlow is also getting a Slack integration as part of the partnership. AppFlow in Slack will allow users to transfer data between Slack and AWS services like S3 and Redshift without having to manually download and upload data, and in coming months will support bidirectional data transfer inside of Slack channels.


Finally, Slack's Enterprise Key Management (EKM), which has operated on AWS Key Management Service since 2019, is getting additional features that will allow Slack users to automate routine key security processes through integration with Slack's workflow builder.

Slack said its goal in the partnership is to create an alliance that allows enterprises to unlock their capabilities to innovate. 'By integrating AWS services with Slack's channel-based messaging platform, we're helping teams easily and seamlessly manage their cloud infrastructure projects and launch cloud-based services without ever leaving Slack,' said Slack CEO Stewart Butterfield.

'Together, AWS and Slack are giving developer teams the ability to collaborate and innovate faster on the front end with applications, while giving them the ability to efficiently manage their backend cloud infrastructure,' said AWS CEO Andy Jassy.

© Provided by TechRepublic Image: Slack

At Slack, we’ve gone through an evolution of our AWS infrastructure: from the early days of running a few hand-built EC2 instances, all the way to provisioning thousands of EC2 instances across multiple AWS regions, using the latest AWS services to build reliable and scalable infrastructure. One of the pain points inherited from the early days led us to spin up a brand-new network architecture redesign project, called Whitecastle. In this post, we’ll discuss the design decisions and technology choices we made along the way.

Down Memory Lane

In recent times, Slack has become somewhat of a household name, though its origins are more humble: the messaging service began as the internal communication tool at Tiny Speck, the game company building Glitch that would later become Slack. Once the team realised the tool’s potential, it was officially launched as Slack in 2014. It was simpler back then, and there were only a few customers using it. As time went by, Slack started to evolve, and so did its infrastructure, allowing us to scale from tens of users to millions.

If you ever wondered what the infrastructure behind the original Slack looked like, it was all in a single AWS account. There were very few AWS instances in it and they were all built manually. Eventually the manual processes were replaced by scripting, and then by a Chef configuration management system. Terraform was also introduced to manage our AWS infrastructure.

As our customer base grew and the tool evolved, we developed more services and built more infrastructure as needed. However, everything we built still lived in one big AWS account. This is when our troubles started: having all our infrastructure in a single AWS account led to AWS rate-limiting issues, cost-separation issues, and general confusion for our internal engineering service teams. To overcome this hurdle, we introduced the concept of child accounts. Service teams could now request their own AWS accounts and could even peer their VPCs with each other when services needed to talk to services living in a different AWS account. This was great for a while, but as the company continued to grow, managing hundreds of AWS accounts became a nightmare when it came to CIDR ranges and IP space: CIDR ranges had been allocated without central oversight, and VPCs with overlapping CIDR ranges cannot be peered. This created a lot of administrative overhead.

Looking for a Way Out of the Pit

The Slack Cloud Engineering team had to devise ways to simplify this problem and make our lives easier. We looked at AWS shared VPCs, a feature AWS introduced at the end of 2018 that allows one AWS account to create VPCs and subnets, and then share those VPCs and subnets with other AWS accounts.

With this concept we could create VPCs in one AWS account and share them across multiple accounts. Other accounts are able to build resources in these shared VPCs, but those resources are not visible outside the specific child account that created them, and API rate limits apply per child account rather than to the single account that owns the VPC. This solved our earlier issue of constantly hitting AWS rate limits because all our resources lived in one AWS account. The approach was really attractive to our Cloud Engineering team: we could manage the IP space, build VPCs, and share them with our child account owners. Service teams could then utilize these VPCs and build their resources on top of them, without having to worry about the overhead of setting up VPCs, route tables, or network access lists.

Great New Build

With all these benefits in mind, we kicked off a project to build this out. We decided to call this project Whitecastle, because the lead engineer of the project conceived the idea while visiting the White Castle burger chain.

So what does this all look like? We created three separate accounts, one for each development environment here at Slack (sandbox, dev and prod). Service teams use the sandbox environment mostly for experimentation purposes while they use the dev environment for all pre-production testing. The prod environment is used for all customer-facing workloads. The Cloud Engineering team must therefore treat all three environments as production, since we are bound by SLOs to our internal customers. We also built a separate sandbox network called Bellwether just for those of us in the Cloud Engineering team to use for experimentation. This environment is not accessible to Slack’s service teams, so we can test our changes, experiment, or break things without affecting anyone outside the Cloud Engineering team.

In each of these networking accounts in the us-east-1 region, we created a VPC with two /16 CIDR ranges (the biggest CIDR range AWS allows) and shared these VPCs with all our child accounts. Two /16 CIDR ranges provide over 130,000 IP addresses per region – more than enough to serve our current capacity and projected capacity for the foreseeable future. As we approach capacity, we are easily able to add more /16 CIDRs to these VPCs.

Once the build-out was completed for the us-east-1 region, we repeated the build-out for all other regions that our current AWS infrastructure operates in.

Our next challenge was to get all these remote regions to communicate with each other successfully. We run customer-facing workloads, such as image caching, in several regions for fast response times, so those regions need connectivity to each other and to our main region, us-east-1, for operations such as service discovery. For this we looked into another AWS feature, introduced in late 2019, called Transit Gateway Inter-Region Peering. As a start, we created a Transit Gateway in each region and attached it to the region’s local VPC. Then the Transit Gateways were peered with one another to build inter-region connectivity.

With AWS Transit Gateway Inter-Region Peering, one Transit Gateway must request the peering connection and the other Transit Gateway must accept it. Then routes must be added to each Transit Gateway’s route tables to be able to route traffic between each other.

The tricky part was figuring out which Transit Gateway should request the peering connection and which should accept it, because we had to automate all of this using Terraform. We decided to use the md5sum of each Transit Gateway ID, since the IDs are unique. During the Terraform run, we calculate the md5sum of every Transit Gateway ID, take the last 16 characters of the md5sum, and convert them into an integer. We call this number the Transit Gateway priority. Of any two Transit Gateways, the one with the greater priority number becomes the requester and the other the accepter.
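
The priority scheme described above can be sketched in a few lines of Go. This is an illustration rather than Slack’s actual Terraform logic, and the Transit Gateway IDs are made up:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strconv"
)

// tgwPriority derives a comparable priority number from a Transit
// Gateway ID: md5-hash the ID, take the last 16 hex characters of the
// digest, and parse them as an unsigned 64-bit integer.
func tgwPriority(tgwID string) uint64 {
	sum := md5.Sum([]byte(tgwID))
	hexSum := hex.EncodeToString(sum[:]) // 32 hex characters
	// The last 16 hex characters are exactly 64 bits, so they always
	// fit in a uint64 and ParseUint cannot fail here.
	n, _ := strconv.ParseUint(hexSum[len(hexSum)-16:], 16, 64)
	return n
}

func main() {
	// Hypothetical Transit Gateway IDs, for illustration only.
	a, b := "tgw-0123456789abcdef0", "tgw-0fedcba9876543210"
	// The gateway with the greater priority requests the peering;
	// the other accepts it.
	if tgwPriority(a) > tgwPriority(b) {
		fmt.Printf("%s requests, %s accepts\n", a, b)
	} else {
		fmt.Printf("%s requests, %s accepts\n", b, a)
	}
}
```

Because md5 is deterministic, every Terraform run assigns the same requester/accepter roles to a given pair of gateways, which keeps the plan stable across applies.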

At the same time, Terraform updates the route tables of each Transit Gateway to be able to route traffic between each other. When we allocated CIDR ranges to our regions, we ensured all CIDRs fall inside the summary route of 10.0.0.0/8. To make sure each VPC sends any non-local traffic to the Transit Gateway that matches this summary route, we created a default route 10.0.0.0/8 in each VPC’s route table pointing at its local Transit Gateway so as to keep the VPC route tables relatively simple.
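
The invariant behind that single 10.0.0.0/8 default route — every regional CIDR must fall inside the summary route — is easy to check mechanically. A minimal Go sketch, using made-up regional CIDRs rather than Slack’s real allocations:

```go
package main

import (
	"fmt"
	"net"
)

// withinSummary reports whether every CIDR in cidrs is fully contained
// in the summary route. A range is contained when its network address
// lies inside the summary and its prefix is at least as long.
func withinSummary(summary string, cidrs []string) bool {
	_, sumNet, err := net.ParseCIDR(summary)
	if err != nil {
		panic(err)
	}
	sumOnes, _ := sumNet.Mask.Size()
	for _, c := range cidrs {
		ip, ipNet, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		ones, _ := ipNet.Mask.Size()
		if !sumNet.Contains(ip) || ones < sumOnes {
			return false
		}
	}
	return true
}

func main() {
	// Example regional VPC CIDRs (hypothetical).
	regions := []string{"10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16"}
	fmt.Println(withinSummary("10.0.0.0/8", regions)) // prints true
}
```

A check like this could run in CI before a Terraform apply, catching an out-of-summary allocation before it silently bypasses the Transit Gateway default route.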

Not Getting Lost in the Mesh

There are many routes and many VPCs that need to talk to each other. A single bad route will render a service unable to talk to another service, and things can go bad very, very quickly. To remedy this, we decided to include real-time network testing. We built a very simple Go application called Whitecastle Network Tester, and then built an autoscaling group with a single instance running this application in each of our VPCs. This application registers its VPC ID, environment, and the IP of the instance in a DynamoDB table. We call this our Whitecastle Network Tester service discovery table.

The Whitecastle Network Tester application does two things:

  1. It starts a simple web server on a new thread and responds to HTTP requests on the /health endpoint with an HTTP 200.
  2. It also performs the following tasks automatically:
    • Reads a configuration file to find out which environments it needs to connect to
    • Looks up the IPs for those environments in the service discovery DynamoDB table
    • Tests connectivity to each of them
    • Writes the results to CloudWatch in a designated account by assuming a role

We display this CloudWatch data source on a Grafana dashboard to monitor and alert based on any failing network test results.

Our Whitecastle Network Tester also validates the build-out of new regions: we add test cases before the build-out, then create connectivity between the new and existing regions and the legacy VPCs. With this approach, we can watch tests that initially fail move to a passing state as the build-out completes.

The current process of applying any changes to the Whitecastle network is to initially make changes and test them in the sandbox environment, then deploy the same changes to the dev environment. Once we are confident with our changes, we follow through to the prod environment. Having real-time network testing allows us to catch any issues with our configuration before these changes make it to the production environments. The Whitecastle Network Tester helped tremendously in developing a network mesh Terraform module, as we could validate the changes deployed by each Terraform apply command to network routes and resources.

How Do We Migrate to the Great New World?

Once we had this all built, the next step was to get our internal customer teams to migrate their services onto the Whitecastle network and into child AWS accounts. However, big-bang migrations are cumbersome and risky and it was difficult to convince Cloud Engineering’s internal customer service teams to take the leap initially.

We knew there would be much more buy-in from service teams if we offered incremental migration to the new network architecture, so we engineered a simple solution: sharing the new Transit Gateways with child accounts and attaching them to each region’s legacy VPC. With this approach, if a portion of the services moved onto the new VPCs, they’d still be able to communicate with parts of the system located in the legacy VPCs.

Although the sandbox environment is not shown in the above diagram, it works in a similar way to the dev environment.

All in all, we found a lot of success in building this new network architecture. There were many challenges along the way, but we managed to work through them and learn a great deal. Once we started running production workloads in Whitecastle, we stopped making ad hoc changes to all three environments that could impact our internal customers, and we adopted a reliable and safe process for rolling out changes. Having a sandbox environment dedicated to the Cloud Engineering team to test concepts and experiment helped catch issues before pushing changes out to customer-facing environments. Tools like the Whitecastle Network Tester give us confidence in rolling out our changes and alert us promptly if any problems slip through our testing. All these safety measures allow us to offer strong SLAs for reliability and availability to the Cloud Engineering team’s internal customers. Having said all that, we are still looking for ways to improve our platform by adopting the latest technologies available to us. As an example, we will look at replacing all our inter-region route propagation logic once AWS releases BGP route propagation support for Transit Gateway peerings.

We are certain there will be many more challenges to overcome and lessons to learn as we move more services onto the new network architecture. If this work seems interesting to you, Slack is always on the lookout for new engineers to help us build great new things, and you can join our ever growing team.




