How we built our CI/CD pipeline for velocity and quality

Working in an early-stage startup poses quite a few challenges compared to working in an established company.

The engineering team currently consists of five developers, an engineering manager, and our CTO. We are building pricing & packaging APIs for developers. Our application is mission-critical software for our customers, which requires us to be very deliberate about how, what, and when we deploy to Production. Undoubtedly, this is quite a small team that needs to ship a lot of code, fast. On top of writing the code, and because we have no dedicated DevOps/SRE, our entire deployment pipeline and cloud architecture are designed and managed by us, the developers.

Being a small team, we constantly need to decide and refine what to focus on in our product. Our ability to choose what to focus on is crucial for the success of the business.

This article will shed light on our decisions around the topic of the CI/CD pipeline.

We believe that having the right pipelines and safety nets, built to the highest standard, around deploying code to production is crucial to our team's velocity. In fact, I would say that a healthy, reliable process from the moment the code is pushed to GitHub until a customer sees it is the single most important thing for our velocity at this stage (and also later).

Moreover, and no less importantly, laying down great CI/CD foundations will positively impact the way we write, test, and ship our code in the future.

While constructing the pipeline, I chose the term Velocity to help us make better decisions about what kind of pipeline to build right now. What is crucial for us from day one? What doesn’t support Velocity in a meaningful way at this moment and can be done later?

Velocity in mind: Our five CI/CD principles

1. Isolate Staging and Production environments.

2. Implement all needed safety nets in order to move fast.

3. A simple way to roll back any bad deployment (not that we are going to have any, but just in case).

4. Have the ability to easily track which code changes are deployed at any moment in any env and service.

5. 100% of cloud infrastructure managed by IaC (CloudFormation in our case), saved inside our code repository alongside the code.

In a bit more detail:

1. Isolate Staging and Production environments.

When we started the company, it was obvious that we were going to need a stable Production environment in order to show demos to customers while we worked on new features. As we started to design our cloud environment, we quickly understood that we also needed a Staging environment. We wanted a setup in the cloud identical to production, differing solely in scale, that would serve as another gate before production. The other purpose of the staging environment is to let us run complex E2E test cases and sanity checks. The staging environment allows us to run all our services together in a similar way to how they run in Production. This is very powerful for us: because we don’t want any downtime from bad deployments in Production, we are more than happy to break Staging once in a while to test things out.

One of our core decisions here was that a release is built once, from master, and then shipped and deployed automatically to staging. When we are happy with the result, we manually (by pressing a button) deploy to production exactly the same code that was deployed to staging, but with a different configuration. This is a very powerful approach because it gives us very high certainty that code that is happily running in staging will also perform fine in production (as long as the configuration itself is correct).
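To make the "build once, promote the same artifact" idea concrete, here is a minimal Python sketch. It is an illustration only, not our actual pipeline code: the `Release` type, the `ENV_CONFIG` values, the example URLs, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """An immutable artifact that is built exactly once, from master."""
    version: str       # e.g. a SemVer tag
    image_digest: str  # identifies the exact artifact that was built


# Hypothetical per-environment configuration. This is the ONLY thing
# that is allowed to differ between staging and production.
ENV_CONFIG = {
    "staging": {"replicas": 1, "api_url": "https://staging.example.com"},
    "production": {"replicas": 4, "api_url": "https://api.example.com"},
}


def deployment_plan(release: Release, env: str) -> dict:
    """Combine the unchanged artifact with environment-specific config."""
    return {
        "env": env,
        "version": release.version,
        "image": release.image_digest,
        "config": ENV_CONFIG[env],
    }


def promote(staging_plan: dict) -> dict:
    """The manual 'button press': reuse the staging artifact, swap config."""
    release = Release(staging_plan["version"], staging_plan["image"])
    return deployment_plan(release, "production")


staging = deployment_plan(Release("1.4.2", "sha256:abc123"), "staging")
production = promote(staging)
```

The point of the sketch is that `promote` never rebuilds anything: the version and image digest are copied from the staging plan, so whatever was verified in staging is byte-for-byte what reaches production.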


2. Implement all needed safety nets in order to move fast

As we move fast, we can break things. These days, we can deploy to production more than 20 times in one day. To enable that, we need really good safety nets that let us be certain we don’t have major issues and, when we do, let us discover and remediate them really fast. I'll share more in a subsequent post, but our safety nets consist of quite a few tools: monitoring of our infra and critical flows, and e2e tests of the most critical flows that run on every deployment and periodically, on both staging and production.

We strive to know that something went wrong before our customers do, and therefore we have put all those gates in place.
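A gate like this can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not our real tooling: the check names and the `failed_checks` / `deployment_is_healthy` helpers are hypothetical, and a real setup would call actual endpoints and page someone instead of printing.

```python
from typing import Callable, Dict, List

# A check is any zero-argument callable that returns True when the flow works.
Check = Callable[[], bool]


def failed_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every critical-flow check; return the names of those that failed.

    A check that raises is treated as a failure, so one broken check
    cannot hide the results of the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def deployment_is_healthy(checks: Dict[str, Check]) -> bool:
    """Gate to run after each deployment and on a schedule."""
    failed = failed_checks(checks)
    for name in failed:
        print(f"ALERT: critical flow failed: {name}")
    return not failed
```

The same function can serve both triggers mentioned above: call it right after a deployment to decide whether to keep the release, and call it periodically to catch regressions that appear later.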


3. Simple way to roll back any bad deployment.

In order to support our core value of velocity, we have to be able to roll back easily when we ship malfunctioning code. Having that confidence allows us to ship code safely, without worrying that we can’t go back or that doing so will take a lot of time and sweat. We always try to ship the smallest possible pieces of software and ask ourselves “what can go wrong?”. When we ship small pieces of code with a minor blast radius, we feel confident and rely on our rollback mechanism in case something goes wrong. If we didn’t have this mechanism, we would have to let the new code run on staging for quite some time, perform rigorous tests there, and be sure that what we deployed is never going to hurt production (aha...). And even after all this, when we did break production, it would take us at least half an hour to revert the code and release a new version to staging and then to prod. Dealing with “quick” fixes like this does not support velocity, and therefore we decided to invest in our rollback mechanism.
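The core of a rollback mechanism is remembering what was deployed before, so that going back means redeploying a known-good release rather than reverting code and rebuilding. A minimal Python sketch of that idea follows; the `DeploymentHistory` class is hypothetical and only keeps versions in memory, whereas a real pipeline would persist this history and trigger an actual redeploy.

```python
from typing import List, Optional


class DeploymentHistory:
    """Tracks the releases deployed to one environment, oldest first."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def deploy(self, version: str) -> None:
        """Record a newly deployed release."""
        self._versions.append(version)

    @property
    def current(self) -> Optional[str]:
        """The release that is live right now, if any."""
        return self._versions[-1] if self._versions else None

    def rollback(self) -> str:
        """Drop the bad release and return the previous one to redeploy."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous release to roll back to")
        self._versions.pop()       # discard the malfunctioning release
        return self._versions[-1]  # previous known-good release


history = DeploymentHistory()
history.deploy("1.4.2")
history.deploy("1.4.3")           # turns out to be a bad release
restored = history.rollback()     # "1.4.2" becomes current again
```

Because the previous artifact already exists, the rollback path skips the build and the staging soak entirely, which is where the half hour mentioned above would otherwise go.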


4. Have the ability to easily track which code changes are deployed at any moment in any env and service.

We want to know this immediately, without needing to call Sherlock Holmes to help us correlate code changes with builds, releases, logs, traces, and alerts. This allows us to easily understand what we have deployed in which region, container, and lambda. Not having this in place would force us to spend time figuring out exactly which docker image is deployed in which lambda and would make our debugging more complex. Today, we create a SemVer tag for each master build and bubble it through the whole pipeline. It tells us which version, and precisely which new lines of code, are deployed in staging and production, and which version of the code produced any given log.
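As an illustration of bubbling a version through to the logs, here is a small Python sketch. It is not our actual implementation: `next_version` and `versioned_logger` are hypothetical helpers, and the bump rule (patch by default) is an assumption. The logging part uses the standard library's `logging.LoggerAdapter` to stamp every record with the release version.

```python
import io
import logging


def next_version(latest_tag: str, bump: str = "patch") -> str:
    """Compute the next SemVer tag from the latest one (e.g. 'v1.4.2')."""
    major, minor, patch = (int(p) for p in latest_tag.lstrip("v").split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


def versioned_logger(name: str, version: str, stream) -> logging.LoggerAdapter:
    """Return a logger whose every line carries the deployed version."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s version=%(version)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the example output in one place
    # The adapter injects {"version": ...} into each log record.
    return logging.LoggerAdapter(logger, {"version": version})


version = next_version("v1.4.2")
out = io.StringIO()
log = versioned_logger("billing-service", version, out)
log.info("usage event processed")
```

With the version on every log line, a suspicious log entry points straight at a tag, and the tag points at the exact diff that was deployed.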


5. 100% of cloud infrastructure managed by IaC (CloudFormation in our case)

Quite often, in my personal experience in startups, not having a dedicated DevOps/SRE at the early stage means there is a lot of manual work around deploying any resources to the cloud (and code, for that matter).

In all cases, this does not work well. Deploying anything manually to the cloud directly contradicts our previous four principles. Working with manual deployments is possible, and might be faster when you start. It can be completely OK if you are a team of two (or fewer) developers with one production service and no customers. But if you are a bigger team, actually have customers, and have more complex infra requirements, you will be stuck debugging manual deployments instead of developing features. Velocity is created when we can focus on creating features and releasing them safely to production. Whenever you need to deploy services manually and hope for the best, that is contradictory to velocity.
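To show what "infrastructure saved alongside the code" can look like, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and serializes it to JSON. It is an example under assumptions, not one of our real templates: the `EventsQueue` resource, the queue name, and the `queue_template` function are hypothetical, and the single SQS queue stands in for whatever resources a service actually needs.

```python
import json


def queue_template() -> dict:
    """A tiny CloudFormation template with one environment parameter."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example stack: one queue per environment",
        "Parameters": {
            # The same template is deployed to both environments;
            # only this parameter value changes.
            "Environment": {
                "Type": "String",
                "AllowedValues": ["staging", "production"],
            }
        },
        "Resources": {
            "EventsQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {
                    # Fn::Sub substitutes the parameter at deploy time.
                    "QueueName": {"Fn::Sub": "events-${Environment}"}
                },
            }
        },
    }


# The serialized template is what would be committed to the repository.
template_json = json.dumps(queue_template(), indent=2)
```

Because the template lives in the repository, every infrastructure change goes through the same review, versioning, and rollback path as application code.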

Following this principle might be a bit time-consuming to set up in the beginning (it took us a couple of weeks), but it is a lifesaver as we grow and scale. Today, more than a year after shipping the first line of code, I feel that we have a really strong foundation, and the couple of weeks we invested in the beginning have paid off. Our deployment pipeline supports dozens of releases to production a day and allows us to focus on creating new features for our customers at the speed of light.💫
