As the first VP Engineering at Snyk, I built and scaled the engineering team from one developer to 150. This five-year growth spurt was not just about hiring more developers. We also needed to scale our architecture. It had to enable communication between the teams, while ensuring high quality and reliable delivery. This article is the story of how we were able to test and deliver with confidence as we scaled (and how sometimes we weren’t).

How it Started

We started out as a small team of five developers. I was so excited back then. We were the first people creating the code of what would become a leading developer security platform. Being a startup, communication was agile and fairly simple. We had just a few code repositories, and our functional testing was able to catch most of our bugs and regressions before delivery.

Growing Pains

Soon, our team doubled in size. We were ten developers in two teams, managing ~10 microservices. We divided our code across repositories in a multi-repo setup. This division, coupled with team ownership of repositories, gave each team more independence to deploy and release its changes. It also made the code simpler to manage.

The multi-repo approach was not just aligned with our values. It was also aligned with Snyk’s business model: we began mapping different users and features to different code areas. Separating the repositories made it easier to build and fix features according to product needs.

At this point we encountered one of our first major challenges: functional testing of each service in isolation was no longer enough. A service might pass its tests, but if it depended on another service that changed after the tests ran, we would push a buggy change to production. This started happening more frequently because we practiced continuous deployment (CD): services were deployed immediately after they passed the functional tests.

This gap between testing and production increased the number of rollbacks and slowed down velocity. I even remember a celebratory dinner we were having with the entire team in London, when a Twitter message alerted us that our website was broken. We opened up our laptops in the restaurant to quickly fix the CSS.

I look back fondly at that time now, but such events made it clear to us that we needed a new solution. One that could elastically tie our teams together as they worked on the same growing codebase, while ensuring the autonomy of each and avoiding deadlocks and dependencies.

The Solution: Microservices Testing in Context

We had been big believers in testing ever since we started Snyk, so when we were faced with this new challenge, we knew that testing would be the solution. We needed to find a new way to test our changes in a manner that would increase our confidence to deploy, and be worthy of the added overhead of writing and running the tests. We wanted to empower our engineers and make them accountable for their own code as a whole, not just for each change to the codebase.

Since functional testing left a gap in testing coverage, we incorporated automated system testing to bridge most of that gap. Our functional tests focused on each service and its direct dependencies (e.g., a database for a stateful service), and system tests focused on the integration between services. To put it in rough numbers, functional tests “insured” 30% of the risk, system tests covered 50% of the risk, and the remaining 20% was a risk we were willing to take.

Functional Testing with Microservices

Functional tests of service A would only check out and run the code of service A, and could rely on mocking external dependencies. These would be faster to run locally, and more straightforward to write.
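A minimal sketch of such a functional test, assuming a hypothetical service A that reads saved state and asks service B for extra details. The class, method, and field names are illustrative assumptions, not Snyk's actual code:

```python
from unittest.mock import Mock


class ServiceA:
    """Hypothetical service A: combines saved state with data from service B."""

    def __init__(self, store, b_client):
        self.store = store        # saved state, e.g. a database wrapper
        self.b_client = b_client  # client used to call service B

    def handle(self, project_id):
        saved = self.store[project_id]
        computed = saved["issues"] * 2                 # stand-in for a computation
        extra = self.b_client.get_details(project_id)  # call out to service B
        return {"score": computed, "owner": extra["owner"]}


def test_handle_with_mocked_b():
    # Service B is replaced by a mock, so only A's code runs.
    b_client = Mock()
    b_client.get_details.return_value = {"owner": "team-a"}
    service = ServiceA(store={"p1": {"issues": 3}}, b_client=b_client)

    result = service.handle("p1")

    assert result == {"score": 6, "owner": "team-a"}
    b_client.get_details.assert_called_once_with("p1")
    return result
```

The test needs nothing but A's own code, which is why it is quick to run locally. The response shape of B lives in the test itself, though, not in B's codebase.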

System Testing with Microservices

System tests, on the other hand, allowed the logic of service A to be continuously tested with the most up-to-date logic of the services it depends on. Consider a scenario where, in order to fulfill a certain request, service A needs to pull up some saved state, perform a computation, call out to service B for some more information, and then respond with the combined result to the caller. In a functional test setup, the response from service B needs to be mocked. This allows for easier testing, but splits the expectation of the contract between A and B across both A’s and B’s codebases.

System tests for service A bridge this gap by providing a live, most-up-to-date instance of service B to the test suite. This requires more attention to state management to keep the setup aligned for tests, but in return it creates an ‘elastic’ dependency between A and B and the contract they share. A breaking change in B’s codebase, if not picked up by service B’s tests, will fail service A’s tests, specifically A’s system test suite.
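The effect can be sketched with a local HTTP server standing in for the deployed instance of B. In a real setup the base URL would point at the shared instance instead. The endpoint path, the field names, and the renamed field are hypothetical:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Ignore proxy settings so local requests always reach the stand-in server.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def make_service_b(payload):
    """Start a local HTTP server standing in for a deployed instance of B."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(payload).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep test output quiet
            pass

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def service_a_handle(b_base_url, project_id):
    """Service A's logic: call B over HTTP and combine the result."""
    with OPENER.open(f"{b_base_url}/projects/{project_id}", timeout=5) as resp:
        details = json.load(resp)
    return {"project": project_id, "owner": details["owner"]}


def system_test_passes(b_base_url):
    """A's system test: exercises A against whatever B is currently running."""
    try:
        return service_a_handle(b_base_url, "p1") == {"project": "p1", "owner": "team-a"}
    except KeyError:
        return False  # B's response no longer matches what A expects


def run_demo():
    current_b = make_service_b({"owner": "team-a"})
    changed_b = make_service_b({"owner_name": "team-a"})  # breaking rename in B
    try:
        return tuple(
            system_test_passes(f"http://127.0.0.1:{s.server_address[1]}")
            for s in (current_b, changed_b)
        )
    finally:
        for s in (current_b, changed_b):
            s.shutdown()
            s.server_close()
```

Calling `run_demo()` runs A's system test once against the current B and once against a B with the renamed field. The second run fails even though nothing in A's codebase changed.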

In the scenario described above, service B can help service A reach maturity by providing an up-to-date, constantly available instance of B to be used for A’s tests. We used our staging environment for this exact purpose, exposing our internal services to be used for testing. From then on, we would merge to the main branch and deploy to production only after deploying to staging and testing there.

The staging environment became more than a playground for manual checking of the entire system before deploying to production. It was a critical part of our automated testing of each and every PR. Deploying to staging was a mandatory step before deploying to production, since it kept a prod-like instance of each service available for tests.
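The ordering of these gates can be sketched as a small pipeline runner. The step names, and the placement of functional tests before the staging deploy, are illustrative assumptions, not a description of Snyk's actual CI configuration:

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that succeeded.
    """
    completed = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


def build_pr_pipeline(system_tests_pass: bool) -> List[Step]:
    # Each action is a stub returning True or False; a real pipeline
    # would run the tests or the deployment here.
    return [
        ("functional tests", lambda: True),
        ("deploy PR to staging", lambda: True),
        ("system tests against staging", lambda: system_tests_pass),
        ("merge to main", lambda: True),
        ("deploy to production", lambda: True),
    ]
```

With `system_tests_pass=False`, `run_pipeline(build_pr_pipeline(False))` stops after the staging deploy, so the change never reaches the main branch or production.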

While this approach has its downsides, it proved to be very cost-effective in keeping the different teams in check with their changes across the codebase, nudging us to constantly reduce the size of our PRs and the time changes took from inception to deployment. All this, without introducing any excess gates such as ‘let’s test all the services on the staging environment together, monitor for issues, and only then decide to push to production.’

How It’s Going

By using our staging environment to evolve our testing paradigm and introduce system testing, we were able to scale our architecture and teams. This model accompanied us as we grew from 20 developers to over a hundred, and to about two hundred repos and microservices. From a socio-technical perspective, each developer out of the hundreds working at Snyk is still responsible for their own code and can test it while being empowered to feel confident when pushing to production.

Why Not Mocking?

We chose to add a staging environment, but a common solution to the microservices testing challenge is mocking. By creating mocks of all the other services the tested service depends on, a test can supposedly run and return accurate results.

The challenge with mocking is that the mocked services are a ‘git checkout’ of the past, and things might have changed since then. This solution is great in some scenarios, but we found it to be too rigid across a constantly changing codebase. Using live instances of your dependent services aligns the teams better, helping them move in unison. It is akin to changing lanes on a highway: you want to roughly match the speed of all cars, not to stop everyone to make room for your change, and then resume going forward.
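A small sketch of how a mock goes stale, assuming a hypothetical service B that renamed a response field after the mock was written. All function and field names are made up for illustration:

```python
from unittest.mock import Mock


def service_b_v1(project_id):
    """B's behavior at the time the mock was written."""
    return {"owner": "team-a"}


def service_b_v2(project_id):
    """B's current behavior after a hypothetical breaking rename."""
    return {"owner_name": "team-a"}


def service_a_owner(get_details, project_id):
    """Service A's logic, written against B's original response shape."""
    return get_details(project_id)["owner"]


def check(get_details):
    """Return True if A's expectation holds with the given dependency."""
    try:
        return service_a_owner(get_details, "p1") == "team-a"
    except KeyError:
        return False


# The mock was recorded when B was at v1 and has not been updated since.
mocked_b = Mock(return_value={"owner": "team-a"})

mock_result = check(mocked_b)      # the mock still answers in the old shape
live_result = check(service_b_v2)  # the current B no longer matches
```

Here `mock_result` stays `True` while `live_result` is `False`: the mocked test keeps passing against a version of B that no longer exists.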

Tips for Building Your Microservices Architecture and Team

Building your microservices framework is about finding the right solutions so your teams are confident to make changes and push code in an agile manner. Here are three tips for building and scaling your team, architecture and business:

1. Communication

Solve communication challenges between teams, services, repositories, etc. By streamlining the processes of syncing, feedback and testing, you can ensure minimal rollbacks when code is pushed to production.

2. Simplicity

Make work simple. Always think about the next hire on your team, and how their onboarding process can be smoother and simpler than the previous one. Your codebase and practices might be working against this goal: there is always a drive to add new technologies, handle more complex use cases and add more processes to avoid pitfalls. Resist the urge, and always ask whether a change is making the right things simpler or not.

3. Incremental Steps

Break down complex goals. Everyone wants to deploy with zero bugs, but it just isn’t simple to do. Inspect your quality-related process, and check that each step (from code review to microservices testing) has a good ratio between its overhead (how much time and effort it takes) and the confidence it provides to the developer pushing the change to production.

By making good work easy to do, it becomes much simpler to create high quality code. Trying to achieve quality by stopping bad things is a much steeper climb than making the right things achievable!

Anton Drukh is the former VP Engineering at Snyk, where he scaled the team from 1 to 150 people over 5 years. Today he is mentoring Engineering Leaders and working on a side project to eliminate SSL certificate expirations. When not busy with work, he is being raised by his wife and 3 children in sunny Israel. He is writing in short form at twitter.com/adrukh, and a bit longer at adrukh.medium.com.