The need
With the rise of microservices or large scale distributed systems and communicating through HTTP and NoSql databases, we also see the rise of eventual consistency. The focus on reliability or the perfect operation at all times had to shift to resilience; the ability of an application to recover from certain types of failure and yet remain functional. All-or-nothing and reliable transactions were paramount, data had to be safely stored above all, sacrificing the user experience and cost. The objects on those transactions could also be very complex, frequently using multiple tables and even different databases. Typically, if a transaction failed, the user would receive an error requiring his action; usually to resubmit the request or contact support. To keep response times low, vertical scaling with costly “big iron” was common.
Once the mindset shifts to resilience, requests are made smaller in scope, RESTful oriented, and eventually consistent. This allows requests to return faster, but the data might take some time to propagate throughout a cluster. This cluster is now made of smaller, cheaper machines and, if errors happen, the current systems tries to find a solution before quitting and throwing the problem into the user’s lap. This is one side of resilience. The other one is the need to perform large scale automated error handling and recovery. The problem of the app using dozens of interdependent microservices where one of them goes belly up, causes a dramatic increase in latency (all calls to it are hitting a timeout) that cascades and takes the whole thing down. That’s a serious problem. In order to make systems more resilient, a few design patterns were devised and are now gathered under the Fault Tolerance (FT) umbrella:
- Bulkhead – isolate failures in part of the system.
- Circuit breaker – offer a way of fail fast.Once the mindset shift
- Retry – define a criteria on when to retry.
- Fallback – provide an alternative solution for a failed execution.
On a broader resilience scale we can also find, among others, the following patterns:
- Health endpoint monitoring – implement functional checks in an application that external tools can access.
- Leader election – elect a coordinating leader for other instances.
- Compensating transaction – undo the work performed by a series of steps.
The APIs
Some early APIs addressing these issues were Netflix’s Hystrix and the Failsafe library.
When the microservice oriented MicroProfile started, soon the need for a FT specification was made clear.
These projects are following the new FT specification:
- Geronimo Safegard library, the library that will be soon used on TomEE;
- Wildfly Swarm Fault Tolerance library;
- Payara Micro;
- Open Liberty (IBM)
The standard
The FT specification is part of the Eclipse MicroProfile, the open source community specification for Enterprise Java Microservices.
The MicroProfile community specification is hosted by the Eclipse Enterprise for Java (EE4J) open source initiative. EE4J is based on the Java Platform, Enterprise Edition (Java EE) standards, and uses Java EE 8 as the baseline.
In the beginning of 2018 this is how Microprofile looked:
On its 1st version, the FT specification includes the following aspects:
- Timeout
- Bulkhead
- Circuit breaker
- Retry
- Fallback
- Asynchronous
This last one looks like the old EJB @Asynchronous annotation, but it isn’t. Yes, it allows fast return with the execution on a different thread, but doesn’t need EJBs.
The next v1.1, bound to be released on May 1, will be included in version 1.4 of the MicroProfile spec. These are a few improvements we can expect:
- Add MicroProfile Metrics support!
- Support for exponential backoff retry. Each subsequent retry might take longer to execute, hence relieving pressure from systems already in trouble.
- Asynchronous interceptor improvements and CompletionStage return type for reactive applications.
Some other improvements being discussed for future releases:
- The ability to disable at runtime, annotations at the class or method level, e.g. com.acme.test.MyClient/serviceA/CircuitBreaker/enabled=false
- Ability to open a circuit breaker during runtime.
On our TomEE side, we will be working on integrating Geronimo Safeguard…..Stay tunned!