Introduction
Microservices is a software architecture style where you compose complex applications through small, independent processes. Almost all major conferences around the world have a separate dedicated track for microservices, even at language-specific conferences. They have been gaining further momentum in recent years.
Companies using microservices include Pinterest, Twitter, Halo, and Uber. These companies have been around for a decade at most and are staffed with young engineers. They don’t have much legacy to carry around.
The magic of microservices is in quick deployments; you can release something to production and adapt to feedback as quickly as possible. You can scale services up and down however you like, based on your user consumption. This makes it very easy to enhance or add features.
Age-old systems
I worked as a consultant at a place whose architecture ran on Microsoft servers. What we knew was that our client was seeing a pattern: new startups were coming into their space, their own publication domain, and those startups were able to add features quickly. These were features that our client had been trying to add for years. They had been in a position to be market pioneers, but they saw their user base being eroded by tons of new startups in their space. They wanted to change the way they worked. Ultimately, the interface to their end users is software delivery.
My team was dealing with a 20-year-old system written in C, and the team did not have a business analyst at the beginning. We were only given a part of the codebase. We were told, “This is the thing. You do not have any business unless you re-engineer code.” They didn’t know how things worked, but they wanted the new system to behave exactly the same, so as not to lose any user base because of changing features.
With any business-critical application, you always have tight deadlines. We wanted to release quickly and gauge customer feedback. When we saw how people were tackling problems like these, microservices looked like the future: this great world that answered every problem we were having.
However, it also seemed like a utopian mystery, like an imagined place where everyone exactly knows what they are supposed to do. When I was given this project and I was looking at microservices, that is exactly how I felt: all of that would be great, but it’s normally used by companies that are only 10 years old and know their software well.
Still, we decided that we needed to start delivering, so we tried microservices in our application. Before microservices, we imagined our monolithic application as a beautiful, multi-tiered cake, which we all had to delicately balance. When we changed to microservices, we had the same tiers of architecture, but as smaller, bite-size cupcakes.
Lesson 1: Keep It Small
This is the “micro” in microservices. It is important to have small services so that you can rewrite the entire service if you want. You can measure the size of your service by answering this question: how long does it take to rewrite your entire service? The ideal answer should be on the order of two weeks.
You want the service to have the ability to be rewritten. You could work on a story, and you should be able to deploy it quickly to production. When you pull a story from analysis and get it ready for dev, you want a smooth ride. If you are working with a service that is massive, it’s not going to be a pleasant experience to take a story, do the development, push it into test, and then wait because some other story has a dependency on the one that you are working on. You will run into problems.
If you have a service that is small enough, with only one responsibility, then it is very simple for you to be able to make changes to that service and deploy it quickly. Remember, one of the important things we had in our application was that it took months to see any changes deployed to production. This was what we wanted to avoid by keeping our services small.
You have smaller codebases, which leads to small context to change, which means that it helps you do autonomous delivery. Small is not beautiful; it is practical.
Lesson 2: Focus on Autonomy from Design to Deployment
Autonomy is the right of self-governance. In a microservices environment, you want to be able to push a button and deploy automatically. You shouldn’t have to do any choreography at all, or talk to other services.
Even if your service is small enough, it will have dependencies that it needs to do its work. We accepted that there will always be dependencies. That was an important turning point for us. You are not supposed to do choreography at all, so how do you deal with the dependencies?
We made a conscious decision to think about what our deployment strategy would be at the beginning of the story development. We identified which services our story would touch, and when the development was done, we added those services to the deployment.
Lesson 3: Plan for Contingency Measures, a.k.a. Breaking Changes
How do we ensure that the end user is not going to be affected when these breaking changes are deployed, and how do you avoid breaking changes?
We used semantic versioning. We ensured that we had a tolerant reader in our APIs. Sometimes we would do lock-step deployments. We used feature toggles extensively, based on the environment that a given story was in.
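To make the tolerant reader idea concrete, here is a minimal sketch in Python. It is an illustration only, not the code from this project: the `UserSummary` type, the `read_user` function, and the field names are all assumptions. The consumer reads only the fields it needs and ignores anything it does not recognize, so the provider can add fields without breaking it.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserSummary:
    """The only fields this (hypothetical) consumer cares about."""
    username: str
    age: Optional[int] = None  # optional: tolerate its absence


def read_user(payload: str) -> UserSummary:
    """Parse only what this consumer needs and ignore everything else."""
    data = json.loads(payload)
    return UserSummary(
        username=data["username"],  # required field
        age=data.get("age"),        # optional field, None when missing
    )


# The provider adds new fields in a later version; the reader is unaffected.
old = read_user('{"username": "ada", "age": 36}')
new = read_user('{"username": "ada", "age": 36, "locale": "en-GB", "roles": ["admin"]}')
assert old == new
```

A strict reader that rejected unknown fields would have turned the provider's harmless addition into a breaking change.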
While a story is in development, its feature is toggled on in the dev environment but toggled off in the QA, staging, and production environments. Even when the RPMs reach those environments, the feature is not available to the end user until we have tested it and all the necessary checks are done. With feature toggles, semantic versioning, tolerant readers, and the like, you can try to avoid breaking changes, and you can at least plan contingency measures.
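An environment-based feature toggle can be sketched in a few lines of Python. The toggle table, the feature names, and the `render_download_page` function are hypothetical, chosen only to illustrate the idea; a real system would more likely load toggles from configuration.

```python
# Hypothetical toggle table: feature name -> environments where it is on.
TOGGLES = {
    "new-download-page": {"dev"},                            # still in development
    "faster-search": {"dev", "qa", "staging", "production"},  # fully released
}


def is_enabled(feature: str, environment: str) -> bool:
    """Unknown features are treated as off, which is the safe default."""
    return environment in TOGGLES.get(feature, set())


def render_download_page(environment: str) -> str:
    # The new code ships to every environment, but only runs where toggled on.
    if is_enabled("new-download-page", environment):
        return "new download page"
    return "existing download page"
```

With this table, the same build shows the new page in dev and the existing page everywhere else, so deploying the code and releasing the feature become two separate decisions.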
We did have breaking changes, but we would divide a story into sensible pods, and then say, “it is going to be blocked, so do not deploy this until the other story is ready.” The developers can continue with their development, and the testers can continue. When these two things are ready, then you can turn the feature on. When we were able to plan for it at the start of development, it made sense to us.
It’s important to be aware of having a heterogeneous architecture. When you work with old systems, you have to be really careful. As a consultant, I have the responsibility to ensure that I am not choosing tools that my customers are not able to maintain. In our case, we chose things that we knew would add long-term benefits to customers. They were done in simple languages rather than ML or Clojure.
Lesson 4: Pay Attention to Bounded Context
I call out bounded context because it is such an important thing to consider, and a very easy thing to miss.
In any application, you have multiple contexts, and these multiple contexts can have models which are named exactly the same. For example, in a support context, you have a “customer”, and you also have a “customer” in a sales context, but the operations that that person has are completely different. People sometimes misinterpret this and they try to share operations and models between these two contexts. Please do not do that.
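As a rough illustration in Python, each context can own its own model even though both are called "customer". The class names, fields, and operations below are assumptions made up for this sketch; in a real system the two classes would live in separate services or modules rather than in one file.

```python
from dataclasses import dataclass, field
from typing import List


# Support context: a customer is someone who raises tickets.
@dataclass
class SupportCustomer:
    customer_id: str
    open_tickets: List[str] = field(default_factory=list)

    def open_ticket(self, subject: str) -> None:
        self.open_tickets.append(subject)


# Sales context: a customer is someone who buys things.
@dataclass
class SalesCustomer:
    customer_id: str
    lifetime_spend: float = 0.0

    def record_purchase(self, amount: float) -> None:
        self.lifetime_spend += amount


# The two models share an identifier and nothing else.
support_view = SupportCustomer("c-42")
sales_view = SalesCustomer("c-42")
support_view.open_ticket("cannot log in")
sales_view.record_purchase(19.99)
```

Because neither class knows about the other, adding a field or an operation to one context cannot break the other.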
As an example, imagine we have two services, an authentication service and the web app. When the web app authenticates, the authentication service returns a JSON response with the username and some user details (e.g., age). The web app needed its own response, which did not need the auth token. If I were to build that inside the authentication service, the authentication service would have to return the web app’s response. Anytime the web app had to change, we had to change what the authentication service was returning, which was a big mistake on our part.
What we should have done instead was make the transformation inside the web app and leave the authentication service alone. The authentication service should always return one response, and consumers then interpret it in different ways based on what they want.
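A consumer-side transformation can be as small as the following Python sketch. The response shape, the `auth_token` field name, and the `to_web_view` function are assumptions for illustration, not the actual payload from this project.

```python
# The one response the (hypothetical) authentication service always returns.
AUTH_RESPONSE = {"username": "ada", "age": 36, "auth_token": "opaque-token"}


def to_web_view(auth_response: dict) -> dict:
    """The web app's own interpretation: keep user details, drop the token."""
    return {
        "username": auth_response["username"],
        "age": auth_response.get("age"),
    }


web_view = to_web_view(AUTH_RESPONSE)
```

When the web app needs a different shape, only `to_web_view` changes. The authentication service and its other consumers are untouched.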
Lesson 5: Choose What Works for You and Document Your Reasons
I am a software developer, and the last thing that I want to do is write pages and pages of documentation. But it was very important for us to do this step.
Developers like to improve the system they are working on, which meant that we were having the same conversations over and over again about why things should be a certain way. People were getting tired of explaining again and again and trying to see what was happening.
When you make a decision, understand the context in which it was taken and put it up somewhere, so that it is communicated to the entire team. Then, re-evaluate it anytime one of those contexts changes. That way, instead of having the same discussion over and over again, you can have the discussion where it matters.
An example was understanding where our constraints and principles came from. We decided that we would do validations at every service level, which meant the validations were duplicated, but we were fine with that. We had good reasons why the validations should be duplicated. The important thing was around constraints. There are certain constraints in your project that you cannot change. For instance, I was working on a Java application, and we had Python scripts. When we came to write a user journey test, we started with Ruby. At that point, our clients went, “No. We cannot do one more language into the port. Can you please pick up something else, either Python or Java to do this?” After doing the user journey test in Ruby, we decided to rewrite it in Java, because Cucumber had a Java API as well. It made sense at that point why we had to do it: it came from a constraint which we had no control over, and there is no point in discussing over and over again whether Ruby or Java is better when your client says to choose Java.
Duplicate validations
For principles, we decided that it is okay to duplicate validations. We are always taught to not repeat ourselves, but then there are certain duplications which you should be doing.
An example of this is validations where you have your client-side validations in JavaScript when a user is entering something, and you match an email and you say, “this is not a valid email address.” That is something you would do in JavaScript. Whereas, when you are trying to save something in your database, that is a completely different validation you should be doing, which is, “does it have any SQL injection in it? Does it have Bobby Tables?” These validations serve two different purposes: client-side vs. server-side. It is a good thing to ensure that you do validations whenever necessary. In some cases, we needed to check this over and over again by duplicating validations.
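The two kinds of validation can be sketched side by side. This is a minimal Python illustration with assumed names (`looks_like_email`, `save_subscriber`, the `subscribers` table); the format check stands in for what the browser would do in JavaScript, and the parameterized query is one common way to keep user input from being executed as SQL.

```python
import re
import sqlite3

# Deliberately loose format check, similar in spirit to a client-side check.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value: str) -> bool:
    return EMAIL_PATTERN.match(value) is not None


def save_subscriber(conn: sqlite3.Connection, name: str, email: str) -> None:
    # The service repeats the format check: it cannot trust its callers.
    if not looks_like_email(email):
        raise ValueError("this is not a valid email address")
    # Parameterized query: values are passed as data, never spliced into SQL.
    conn.execute(
        "INSERT INTO subscribers (name, email) VALUES (?, ?)", (name, email)
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (name TEXT, email TEXT)")

# Bobby Tables is stored as plain text instead of being run as SQL.
hostile_name = "Robert'); DROP TABLE subscribers;--"
save_subscriber(conn, hostile_name, "robert@example.com")
stored = conn.execute("SELECT name FROM subscribers").fetchall()
```

The format check improves the user's experience; the parameterized query protects the database. Neither replaces the other, which is why the duplication is worth keeping.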
Shared libraries or shared clients?
In our case, we decided to do both. One of the services that I was taking care of published a client for any of its consumers to connect to the service. The client had all the validations that the service would do. If you use the client, then you can be sure that all the validations are done. There are times when you do not want a shared client and want a shared library instead.
An example where we used a shared library, which is a very good layer of abstraction, was managing negative TTL caching. We used ELBs. When an ELB switch is made, meaning your service endpoint changes, there are times when negative TTL caching affects service discovery. It was very important for all the services to handle that the same way, so we abstracted it. There was no reason for it to be done in two different ways. That was a very good example of something to put in a shared library.
An example of abstracting something that we should not have is abstracting models into shared libraries. We saw that the support context and sales context had customer and product shared. Someone thought: I know what to do here; we are doing the validations again and again. Let’s do the smart thing: let’s extract that into a separate library and then ensure that those domain models are shared between those things.
It removed this boundary completely. It was a mess. We could not add any features to one context without affecting the other, which is why we wanted the microservices in the first place. What is the point of keeping it small when you share libraries in between? You might as well have both of them together, which was a costly lesson for us.
Lesson 6: Embrace Conway’s Law
Conway’s Law says that your solution is going to mirror your organization’s structure. Instead of fighting Conway’s Law, it is useful to embrace it. Make it work for you.
Initially, we had a UI team, which is why we have a web app service. Then we had a platform team. We had one AWS deployment architecture. Then we had feature teams, which meant we have features in between. Any time you had to add a feature, it would touch both the UI team and the feature team and the platform team. You had to somehow coordinate giving or splitting work and deployment across these three different teams; it was a mess.
As a developer, the last thing you want is to sit in meetings about how you should develop code. We decided to make our feature teams have UI developers, have a platform person, and importantly, have a product owner within. When we had a separate product owner team, decision-making was difficult: we had to wait to get business owner approval.
You want to have vertical slicing of teams instead of horizontal slicing. We used Conway’s Law to our benefit to have self-contained systems. The project owner is the product owner, and they give us the requirements.
Lesson 7: Take Monitoring Seriously
I cannot stress enough how important it is to ensure that you monitor your systems. Things can go wrong easily. When you have one thing to take care of, you can sit and stare at that one thing. Whereas, if you have hundreds of things to take care of, you would go mental. It is important to take your monitoring seriously and ensure that you have alerts.
A region going down is an important thing, but an instance going down may be not that important. You need to understand at which levels you have alerts. We had our dashboards configured for three important metrics: business, application, and system.
Business metrics were a huge win for us in gaining our product owner’s confidence. He could see, for anything that he was adding, how quickly it reached production and how it affected the users’ interaction with the system. We showed whether downloads increased or decreased after a certain feature was added: those are the metrics product owners like to see. Those are the metrics you should track as well, as a developer on your team, to understand how the software that you put out there is interacting with the users.
We used whichever tools suited our needs, and we worked on our dashboards continuously. Hystrix is a great library that has circuit breakers. If any of the services we depended on were not working, we decided that we would show the content to the user anyway, because the user should not be penalized for our problem of not being able to put software out there properly. That was an example of a very good fallback we had in our system.
Hystrix also gives out metrics for you. We tracked fallbacks in our application, and we saw alerts for certain services, which meant that something went wrong at that point, but the user had the content anyway.
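The circuit breaker and fallback idea can be illustrated without Hystrix. The following Python sketch is a simplified stand-in, not the Hystrix API: the `CircuitBreaker` class, its parameters, and the `recommendations` functions are all assumptions for illustration. After a number of consecutive failures the breaker opens and serves the fallback without calling the failing service, and it counts every fallback so the number can be charted on a dashboard.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None
        self.fallback_count = 0           # metric: how often users got the fallback

    def _is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: let one trial call through (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, action, fallback):
        if self._is_open():
            self.fallback_count += 1
            return fallback()
        try:
            result = action()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            self.fallback_count += 1
            return fallback()
        self.failures = 0
        return result


def recommendations():
    raise ConnectionError("recommendation service is down")


def cached_recommendations():
    return ["editor's picks"]


breaker = CircuitBreaker(max_failures=2)
for _ in range(3):
    content = breaker.call(recommendations, cached_recommendations)
```

In this run the first two calls fail and open the breaker, and the third call is answered by the fallback without touching the failing service. The user sees content all three times, while `fallback_count` records that something went wrong.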
Lesson 8: Testing: Do It
This should not be a point of contention in this century. It is important that we do testing at different levels.
I was working in a system where our product owners were not used to change at a rapid pace. It scared them that they could potentially release software and lose customers. We decided we would have a QA environment which mirrors exactly what production does, and run a soak test so that there is always user interaction with the QA environment. Anytime we were finished with a story, we would check whether the QA environment was free, and we would release it in QA. Because the soak test was running continuously, it gave us feedback if anything went wrong.
That covered 80% of our cases when things could go wrong. The product owners seemed to come around after that. They saw that when things went wrong in QA, developers were working to fix them. Once QA testing was done, we would talk about it in the stand-up. It was a great feeling for us to be able to do that.
It’s like Schrodinger’s cat. They say that you never know whether the cat is dead or alive when it is in a box. The moment you open it is when you see it is dead or alive. But actually, you can test it: if you shake the box and the cat shouts, then it is probably alive. That is how we tested in our QA environment.
We had other measures that we added apart from testing:
- Chaos Monkey is important if you are working in a microservice environment. Even if Chaos Monkey is not automated in your test set, try doing it as an exercise for the week and see what happens. Chaos Monkey will take care of bringing down systems, and you can track whether your infrastructure is resilient enough to handle chaos.
- One no-brainer was to add health checks. The health check would check not only whether the web app is up and running, but also whether it is able to accept a request. It would also check whether it can talk to the authentication service that it depends on, or the five other services.
- We tied the health checks of a service to the health checks of the services it depends on. That was a very easy win for us when we broke any contracts between two services. If we deploy something in QA and the health checks of the dependent services go down, then we know there is something wrong. That gave us feedback on what we were doing wrong, and all we had to do was quickly fix the problems one by one.
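A health check that is tied to its dependencies might look like the following Python sketch. The `health` function, the shape of the report, and the check functions are hypothetical; in practice each check would be an HTTP call to the dependency's own health endpoint.

```python
from typing import Callable, Dict


def _safe(check: Callable[[], bool]) -> bool:
    """A check that raises counts as unhealthy rather than crashing the report."""
    try:
        return bool(check())
    except Exception:
        return False


def health(self_check: Callable[[], bool],
           dependencies: Dict[str, Callable[[], bool]]) -> dict:
    """Healthy only if this service and every service it depends on are healthy."""
    checks = {"self": _safe(self_check)}
    for name, check in dependencies.items():
        checks[name] = _safe(check)
    return {"healthy": all(checks.values()), "checks": checks}


def auth_unreachable() -> bool:
    raise ConnectionError("authentication service unreachable")


# The web app itself is fine, but its authentication dependency is not.
status = health(lambda: True, {"authentication": auth_unreachable})
```

Here `status["healthy"]` is false even though the web app's own check passes, and the per-dependency breakdown shows where to look first.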
It is important for you to ensure that your service by itself is working, and then your service with its ecosystem is working, and then services for your user are working. It is very important to think about all three levels in your testing strategy.
Note that tests are there to validate constraints. They should not be constraints themselves.
Lesson 9: Invest in Infrastructure
When we started our first service, the split between infrastructure and feature work was 100% infrastructure stories, and then that share started going down. For the second service, we started getting feature stories at the beginning, because some infrastructure work had already been done. Then for the third service, we saw that the amount of infrastructure code that needed to be written kept shrinking as we went. It can be daunting when you start with infrastructure, but perseverance is key.
Lesson 10: Embrace New Technology
It is an evolving ecosystem out there, and you are losing out if you are not tapping into that potential. Digital disruption has already happened. Blockbuster used to be a big thing, and now my MacBook doesn’t have a DVD slot. If you are not embracing new technology, you are going to be phased out sooner rather than later.
It is important for old companies to understand this and not fight it. It won’t work out for long. Every environment, including banking, technology, and other services, is working on this space, and you need to tap into that potential to make things happen for you.
The three factors that people say contribute to your personal growth are autonomy, mastery, and purpose. I think your microservices also need to have those three things in order to work properly. Ensure that each service has proper autonomy and knows what it is doing. It must have the right tools to do its job, and it must have a purpose to exist.
Microservices help you to improve in iterations. It is important in agile software development to improve iteratively and adapt to feedback. Having autonomy and taking control of business as well as technology in smaller teams helps you to deliver software quickly. You have to ensure that there is high cohesion within your services, and that they are coupled loosely enough that you can change any one of them whenever you need. You need to be able to embrace Conway’s Law. If your team is not structured to do that, then that is definitely going to get in the way of you adopting microservices properly.
In the end, do microservices because they’re fun! It gives me great purpose as a developer when I work on something and don’t have to wait six months for it to be released to the user. I deserve bragging rights on the things that I worked on!