Category Archives: Software Best Practices

Layers of Failure, With a Side of Bacon

By Julie Pitt

Remember when we talked about action codes as a means for unambiguous, actionable protocol errors? You may have wondered why you need all this fancy stuff like action codes and error metadata, when there’s already a well-defined standard in HTTP status codes. It’s working for you, so why mess with a good thing?

Today we will dive into an example REST over HTTP application. I’ll show you why it’s important to separate failures in the transport layer from those in the application layer. Then I’ll give an example of how to do it.

Bacon is the New Coffee

I briefly considered making an example API that is all about coffee, until I realized that there are not nearly enough bacon applications out there. Let’s start with a little story.

Greybird Labs is a fast-growing company with headquarters in a 5-story building. Many Greybird employees are bacon fanatics, so naturally, the company recently installed a bacon cooker on each floor. Since a bacon cooker is a thing I just made up, I can also say that it has a convenient REST API, which employees can call from their desks to start cooking bacon.

Bringing Home the Bacon

As a recent hire at Greybird, you’re in the midst of a massive refactor. All of a sudden, you can’t decide whether to use recursion or a for loop. “Aha!”, you exclaim (as I often do), “I need bacon!” Conveniently, you have created an alias called ‘bacon’ for the following command:

$ wget --post-file "order-bacon.json" \
> --header "X-Greybird-Auth: SSBMb3ZlIEJhY29u" `# authentication token` \
> --header "Content-Type: application/json" \
> "http://bacon.greybird.example/v1/0/orderBacon" `# not a real domain, so don't try it`

The order-bacon.json file (i.e., the request body) contains:

{
  "baconType": "BlackForest",
  "numberOfPieces": 3
}

The response indicates that all is well:

{
  "machineId": 0,
  "jobId": 243,
  "baconType": "BlackForest",
  "numberOfPieces": 3
}

You’ve submitted your bacon job. Is it done yet? To find out, you type:

$ wget --header "X-Greybird-Auth: SSBMb3ZlIEJhY29u" \
> "http://bacon.greybird.example/v1/0/status/243"

In return, you see:

{
  "machineId": 0,
  "jobId": 243,
  "status": "cooking",
  "baconType": "BlackForest",
  "numberOfPieces": 3
}

Your bacon order is still cooking. While you wait, read on.

Bacon as a Service (BaaS)

By now you’ve figured out that the bacon cooker API is pretty simple. It has two resources:

POST /v1/[machineId]/orderBacon
GET /v1/[machineId]/status/[jobId]

Each bacon cooker is assigned a machineId, which is the floor number the machine is on. Greybird’s engineering team thought it would be funny to use a 0-based index for floors, so machineId 0 is actually on the first floor. Each active order is assigned a jobId, which can be used to track the status of your order.

Bacon Foul

Now let’s think about what can go wrong when it comes to ordering bacon.

  • Credentials (i.e., X-Greybird-Auth header) missing or not authentic
  • Invalid URL path (i.e., no matching handler could be found)
  • Unanticipated error in the endpoint (probably a bug)
  • Server is busy and can’t take more requests
  • Order queue is full
  • No such machine ID
  • No such job ID
  • Invalid input data (numberOfPieces, baconType)
  • Not authorized (e.g., employee doesn’t have permissions to access a job)

A common model for failures in an API like this is to map each one onto an HTTP status code. You may have noticed that some of these failures are quite specific to ordering bacon, but others are generic enough that they would apply to other applications. For example, if Greybird wants to start offering eggs, they’d like to leverage much of the protocol and service stack already developed for bacon.

Bacon, Eggs and Reuse

A major drawback of using HTTP status codes for all failures is that a single piece of code needs to understand failures in both message transport and the application. This makes it nearly impossible to reuse error handling code for eggs. Testability suffers as well since testing the application now has a dependency on HTTP. Not to mention the ambiguity and brittleness that may be caused by overloading status codes.

A better model is to separate the protocol and corresponding failures into at least two layers: transport layer and application layer. The transport layer is responsible for sending and receiving messages via HTTP, but doesn’t care what the application does. The application layer knows the semantics of bacon ordering but doesn’t care how the orders came in.

Here’s how we might categorize the failures into layers.

Transport layer

  • Credentials missing or not authentic
  • Unanticipated error in the endpoint
  • Server is busy and can’t take more requests

Application layer

  • Invalid URL path
  • Order queue is full
  • No such machine ID
  • No such job ID
  • Invalid input data
  • Not authorized

Now that things are separated, we can map HTTP status codes onto transport layer failures.

401 -> Credentials missing or not authentic
500 -> Unanticipated error in the endpoint
503 -> Server is busy and can’t take more requests

Better yet, as we learned in my last post, let’s not define each and every possible failure in our protocol spec. Instead, enumerate the subset of HTTP status codes returned by the API.

400 Bad Request
401 Unauthorized
500 Internal Server Error
503 Service Unavailable

Then, we enumerate the actions the client can take as a result of a failure.

enum ActionCode {
  DoNothing,
  ObtainCredentials,
  Retry
}

In the protocol spec, we define which action code is implied by each HTTP status code. The client then acts according to the implied action code.

400 -> DoNothing
401 -> ObtainCredentials
500 -> DoNothing
503 -> Retry
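The status-to-action mapping can be sketched as a small lookup inside the client’s transport layer. This is an illustration only; the class and method names are mine, and treating any status code that is not in the spec as DoNothing is my own (conservative) assumption.

```java
import java.util.Map;

public class TransportErrors {
    public enum ActionCode { DoNothing, ObtainCredentials, Retry }

    // The mapping from the protocol spec above.
    private static final Map<Integer, ActionCode> BY_STATUS = Map.of(
        400, ActionCode.DoNothing,
        401, ActionCode.ObtainCredentials,
        500, ActionCode.DoNothing,
        503, ActionCode.Retry
    );

    // Unlisted status codes fall back to DoNothing (an assumption, not spec).
    public static ActionCode actionFor(int httpStatus) {
        return BY_STATUS.getOrDefault(httpStatus, ActionCode.DoNothing);
    }
}
```

Because this table lives entirely in the transport layer, the bacon (or egg) application code never sees an HTTP status code at all.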

Bacon != Eggs

What about application layer failures? A convenient delivery mechanism for application layer failures is the body of a 200 status code response. An alternative model is to reserve a specific status code (other than 200) to indicate an application-level failure, and place failure details in the body. I will describe the former method.

In the case of using a 200 for application failure, the body is now a wrapper that tells whether the request was successful. If successful, the response data is found inside the wrapper.

{
  "status": "Success",
  "response": {
    "machineId": 0,
    "jobId": 243,
    "baconType": "BlackForest",
    "numberOfPieces": 3
  }
}

Otherwise, the wrapper contains the failure.

{
  "status": "Failure",
  "error": {
    "actionCode": "Retry",
    "details": {
      "name": "OrderQueueFull",
      "errorCode": "1234",
      "description": "Order queue is full. Wait and retry."
    }
  }
}

The client only needs to check the status field. If the status is Failure, the client can unwrap the error field and act according to the actionCode. Conversely, if the status is Success, the client can forward the response to the appropriate handler.
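That status check can be sketched in a few lines. The class name is hypothetical, and a real client would parse the JSON body with a library; a plain Map stands in here so the example is self-contained.

```java
import java.util.Map;

public class Unwrapper {
    // Returns the actionCode the client should act on, or null when the
    // request succeeded and the wrapped response should be forwarded
    // to its handler.
    @SuppressWarnings("unchecked")
    public static String actionFor(Map<String, Object> wrapper) {
        if ("Success".equals(wrapper.get("status"))) {
            return null;
        }
        Map<String, Object> error = (Map<String, Object>) wrapper.get("error");
        return (String) error.get("actionCode");
    }
}
```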


With this design, it is possible to keep application logic completely separate from the business of transporting messages. The benefits include independent reuse of transport and application logic, testability of both client and server applications, and resilience against failure scenarios. This approach has served me well at several companies, both in initial design phases and at scale.

Meanwhile, the employees of Greybird Labs are clinking their bacon strips as they toast to one more successful refactor. By the way, your bacon is done.

Making Failure Matter

By Julie Pitt

If you have ever seen or written vague error handling code; if you’ve ever been frustrated by an unhelpful error message like “something went wrong”; if you’ve ever designed an API, this article is for you. I’ll begin with a short story that describes the problems caused by ambiguous failures in client/server protocols and then explore ways to address them.

Enter Application Developer

Say you’re an application developer. You’re writing this awesome app and everything’s going great. It looks very pretty, the UI is responsive and best of all, it’s easy to use. Now all you need is data. Chances are, you’re going to get it from someone else’s API, which invariably requires access to a network and data store of some kind. You’re not too familiar with this API, so you start with something like this (you know, just to try it out):

try {
  // call the API
} catch (Exception e) {
  // error gobbling sasquatch
  print("me want error. nom nom.")
}

That is utterly…un…awesome. You wonder, how can I give this error-gobbling sasquatch the precision of Wolverine, with his nifty retractable claws and whatnot? How can I make my application responsive and resilient so that my users like it? You are determined to do better, so you try again:

try {
  // call the API
} catch (SQLException s) {
  // hmm wait...what does SQLException mean?
} catch (IOException i) {
  // should I try again, or give up? Probably try again?
} catch (TimeoutException t) {
  // Retry. Definitely.
} catch (Exception e) {
  // uh….
}

I guess that was a little better. At least now you have discrete code blocks that allow you to recover in different ways. It’s kinda like you taped some claws onto Frankenstein’s fists and told him to have at it.

Now say that the API has been updated with a new error condition called ServerBusyException. You probably want to retry like you would with a timeout, but without changing your code, the ServerBusyException falls into the sasquatch bucket. Nom nom. Worse yet, when you do change your code, you have to map both TimeoutException and ServerBusyException to the retry logic.

Can you do better? Not really. But not to worry; I am here to tell you that it is not your fault. In fact, I would point the finger at the API designer. Whoever designed this API did not properly separate two very different concerns:

  1. Alleviate the pain
  2. Gain insight

As the application designer, you should only have to care about the first one. The API designer needs to worry about both.

Alleviate the pain

Alleviating pain means taking action. When you frame it this way, understanding exactly what went wrong is not a prerequisite to handling failures. Another way to look at it is that there is really only a discrete set of possible actions that an application will take to recover from failure. The goal of the API designer is to explicitly define those actions and enumerate them in the contract.

Let’s go back and look at the errors you had to catch in the last section:

  • SQLException
  • IOException
  • TimeoutException
  • ServerBusyException

How can we make these actionable? The first step is to map them onto specific actions the client application should take:

SQLException -> DoNothing
IOException -> Retry
TimeoutException -> Retry
ServerBusyException -> Retry

We call these action codes, which we can now enumerate:

enum ActionCode {
  DoNothing,
  Retry
}

Generally, any error that is due to some transient failure in the service should be acted upon by retrying the same request using a well-defined retry policy. On the other hand, if there is a bug in the client (e.g., corrupt data or a malformed request), the action taken should be to never try that request again. It is a good idea to limit the number of action codes to the smallest set of recovery scenarios that will lead to a resilient and responsive application.
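A well-defined retry policy might look like the following sketch. The attempt cap and the exponential backoff values are illustrative assumptions of mine, not prescriptions from the article; the important property is that retry behavior is driven only by the action code.

```java
public class RetryPolicy {
    // Illustrative bound; tune per application.
    public static final int MAX_ATTEMPTS = 3;

    // Exponential backoff: 100ms, 200ms, 400ms for attempts 0, 1, 2.
    public static long backoffMillis(int attempt) {
        return 100L << attempt;
    }

    // Retry only when the action code says so, and only a bounded
    // number of times, so a persistent outage cannot spin forever.
    public static boolean shouldRetry(String actionCode, int attemptsSoFar) {
        return "Retry".equals(actionCode) && attemptsSoFar < MAX_ATTEMPTS;
    }
}
```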

Once the action codes are defined, we wrap the errors into a generic exception that conveys both which action to take, and detailed information about the failure:

class MyAppException extends Exception {
  ActionCode actionCode
  // We’ll get to this one a little later:
  MyAppError error
}

The client code then becomes:

try {
  // call the API
} catch (MyAppException e) {
  if (e.actionCode == Retry) {
    // do a retry!
  } else if (e.actionCode == DoNothing) {
    // do nothing!
  }
  // Here you would want to log what the action and error are
}

Notice how this code completely ignores WHAT went wrong, aside from recording the particulars of the failure in logs and/or metrics. What it does care about is the actionCode field, which it uses to determine the course of action to take. I wrote this example using pseudocode that looks like Java, but there is no reason why you could not model MyAppException in JSON as part of a REST API.

This model has several properties that are worth noting:

  1. The API designer is free to add as many MyAppError types as he wants, without breaking client applications. To maintain this property, the client code must never act upon or interpret any information in MyAppError.
  2. The client application only needs to handle each type of action code once. There is no longer a need to figure out which exceptions can be thrown and handle them in multiple places.
  3. Multiple client applications may implement the same API with consistent and unambiguous failure handling logic. This reduces maintenance costs for service maintainers.
  4. Action codes are extensible, provided the API is properly versioned. For example, you could introduce one called RenewAuthentication to indicate that a user must be prompted for her username and password. Each new action code is a change to the API contract and requires changes in the client code. Luckily in practice, such changes are infrequent once the initial API stabilizes.
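One way to soften point 4 during a rollout (this is my assumption, not a claim from the article): a client can treat any action code it does not yet recognize as DoNothing and log it, so that a client on an older protocol version degrades safely instead of crashing while it waits to be updated.

```java
public class ActionDispatch {
    // Dispatch purely on the action code; error metadata is never inspected.
    // The fallback for unknown codes is a defensive choice of mine.
    public static String handle(String actionCode) {
        switch (actionCode) {
            case "Retry":             return "retrying";
            case "ObtainCredentials": return "prompting for credentials";
            case "DoNothing":         return "doing nothing";
            default:                  return "doing nothing (unknown action code)";
        }
    }
}
```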

Now that we have a model for conveying actions in our API, why not dispense with error types altogether? Unfortunately, your code will have bugs and you’ll need enough insight to detect and fix them.

Gain insight

Detailed error types are the mechanism for understanding what is happening in the application and debugging when something unexpected happens. Remember that action code called DoNothing? Unless we have fine-grained error types, there is no way we can chip away at precisely what the underlying causes are. Thus, the API designer should add as many error types as necessary to understand the failures in the application.

Let’s take a look at what you might want to put into MyAppError, in order to understand failures:

class MyAppError {
  // An easily distinguishable, unique name for the error that is also human-readable.
  // This is what you would use in the name of a counter, for example.
  String name
  // Helpful to put into a diagnostic screen, in case customers need to tell customer service
  Integer errorCode
  // Human-readable description of the problem, for developers (not user visible)
  String description
  // The exception that caused this error
  Exception cause
}

This is only one possible representation. The point is that as the API designer you can create a rich model of errors with enough metadata that you can tell what is going on in your application and debug if there is a problem. The consuming code should log and collect metrics on these errors to expose both specific failures and aggregates.
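The counting half of that can be sketched as follows. A real service would feed a metrics library; a ConcurrentHashMap stands in here, keyed on the error’s unique name field as suggested above. The class name is mine.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ErrorMetrics {
    // One counter per error name, so aggregates show up per failure type.
    private static final ConcurrentMap<String, Long> COUNTS = new ConcurrentHashMap<>();

    // Call this wherever MyAppError instances are logged.
    public static void record(String errorName) {
        COUNTS.merge(errorName, 1L, Long::sum);
    }

    public static long count(String errorName) {
        return COUNTS.getOrDefault(errorName, 0L);
    }
}
```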


To keep your error handling code pumped full of adamantium, consider the following:

  1. Design your application to be resilient and responsive by enumerating specific actions it will take in response to failure.
  2. Include each such action in your API specification.
  3. Separate actions from error metadata. Do not act upon error metadata.
  4. Log and collect metrics on the error metadata so that diagnosis is possible after the fact.

Stay tuned for the sequel, which will discuss protocol layering and failure.