Fast Collaborative Decision Making

By Christian Kaiser

In previous posts, I wrote about being crisp about who the decision makers are, and a simple process to break ties. Both help create nimble, well-informed organizations. Today, I will outline how to build on this foundation to implement technical decision making that scales well to organizations with dozens of senior technical contributors, but still allows making decisions quickly and effectively.

Let’s take a look at what we want to achieve.

Ideally,

  • decisions are made quickly (expediency)
  • the collective time spent on this process is minimal (efficiency)
  • all stakeholders plus a broad set of other contributors are able to bring in their knowledge and experience so that the best possible decision can be made (inclusiveness, collaborativeness and cooperativeness)
  • everybody who needs and/or wants to know is fully informed about
    • what is being decided,
    • when the decision making will take place,
    • whether a decision has been made, and
    • what the final decision was (transparency)
  • decisions stick until there is significant new information (effectiveness)

In my experience, there is little refined culture around technical decision making in many organizations, and the resulting ad-hoc process regularly misses some of the goals above. I’m sure the esteemed reader can recite several examples off the top of their head.

Here is one that illustrates several misses:

A small group of engineers (let’s call them Team A) makes a high-impact technical decision after careful deliberation in several meetings and rolls it out to the rest of the company.

Then others (who had not been involved in the decision process and therefore hear of the decision for the first time) quickly point out a number of flaws, some of them seemingly quite serious.

It turns out that Team A had actually thought of some, but not all of the flaws because they had not been aware of some technical detail outside of their area of expertise.

The flaws that were considered were traded off against other advantages. However, Team A did not communicate that part.

There are only ugly choices at this point:

  • Team A admits to the issues they had missed, then goes back and addresses them. This requires honesty, is probably somewhat embarrassing and will take additional time.
  • the company can move on based on the flawed decision. Less painful and more expeditious, but obviously problematic.
  • some variation or combination of the above

Let’s assume Team A chose to fix the decision, and after several months, the company has made great progress in implementing it. Some of the original decision makers have left the company, switched projects or got promoted into a less hands-on role.

The project is now in the hands of Team B, comprised of smart and experienced engineers. However, none of the members of Team B were in the room when Team A made the original decision, and since it was not documented, they have no way of knowing why and how it was made. One of the team members points out a major drawback of the plan of record that does not seem to have been considered by Team A, so they get together to discuss a remedy.

There are four possible outcomes in this situation:

  • Team A had not considered the flaw. Team B recognizes that correctly and modifies the plan to avoid the flaw. This is a positive outcome because it decreases overall project risk. Perhaps late, but better late than never.
  • Team B cannot determine with certainty whether Team A had considered the flaw or not, assumes that it did and dismisses the concern. In this case, a project risk that could have been fixed remains, plus the team walks away with the unease of uncertainty. Overall, a big negative.
  • Team A had considered the flaw, and Team B reconfirms their original decision. This is a mildly negative outcome: the team loses some time in the discussion, but at least there are no new project risks.
  • Team A had considered the flaw, but Team B assumes they did not, and thus does not investigate why they had traded it off. Team B then decides to change the plan, leading to additional work and, later, the materialization of another (worse) flaw that is unknown to Team B at the time of the decision. This is the most negative outcome as it adds risk to the project, some of it yet unknown.

The astute reader has probably noticed that only one of these outcomes is positive, and that better communication about the original decision could have avoided most of the downside.

Other examples of suboptimal decision-making processes that may be familiar:

  • consensus-based decision making – highly inefficient and often fails to create good decisions (or any decision at all)
  • failure to have the right people in the room – too many or too few

A Better Way 

Let me describe a decision-making process that I’ve found to allow a great balance between the tradeoffs while still achieving the goals. It consists of three phases: Drafting, Review and the Decision Meeting.


The Drafting Phase

The process begins when a team identifies the need to address an issue that requires non-trivial exploration and investigation – more than can be done in a meeting or in the hallway. For example, a client-server protocol is needed, or the scope of a new middle-tier service needs to be defined. The team identifies and assigns one or more people to investigate the issue and draft a proposed solution. The drafting phase typically has a deadline no more than two to three weeks out. If that seems too short for the issue at hand, it may be useful to break the task into parts that can be addressed individually.

The goal and desired output of the drafting phase is a document that describes a proposed solution at some level of detail that’s deep enough for thorough understanding and review by all the stakeholders, but that is still coarse enough so the document can be reviewed in less than an hour of time (e.g. three to five pages of technically dense material).

The draft must also include:

  • a description of the known requirements at the time of the writing,
  • any assumptions about unknowns that were made, and
  • a high-level description of design alternatives and why they were discarded.

The Review Phase

At the end of the drafting phase, the document is opened for review by the team, by virtue of one of the drafting team members sending out an announcement like this:

From: tigerteam@bacon.com
To: eng-all@bacon.com, product-all@bacon.com
Subject: ACTION REQ’d: Design Proposal for Client/Server Protocol Ready for Review

All,

The first draft of a design for the new client/server protocol is ready for review! (link)

Please review it and, if you have concerns, ideas or any other relevant information, please leave them as comments in the document. We would like the stakeholders (Dilip, Frank and Kathy) as well as Jeff to review and comment. All others are very welcome, but optional.

We will work to address and resolve your comments quickly, but please don’t delay your review so we have time to react. We will meet 8 days from now on Thursday, April 7th at 10am in the Sequoia conference room to discuss and decide.

Comments are due 24 hours earlier, on Wednesday, April 6th at 10am at the latest, to give us a chance to process them before the meeting.

Now is the time to get involved! We appreciate your input.

The “Tiger Team” (Lily, Joe and Wu)

Please note the broad distribution and the clearly set expectations.

The broad distribution makes sure that everybody who would like to be involved will know about the matter to be decided on, and the fact that it is up for discussion at this time.

In this phase, it is mandatory for stakeholders and a set of additional contributors chosen in advance to read and understand the document, and optional for all other team members. The voluntary nature of the latter allows non-stakeholder team members to choose freely between contributing their cognitive cycles and experience, or to opt out of the time commitment if they have more pressing stuff to do or if they trust others in the team to provide adequate input and feedback.

This is a nice compromise solution for the conundrum between being inclusive by allowing everybody to provide input and the cost of the time commitment involved. It is my experience that people who care about the mission and get a chance to provide critical input are strongly motivated to do so, and will make time.

The review phase should be no shorter than two or three working days to give everybody the time to respond at a time that’s convenient for them, and no longer than about a week in order to maintain the proposal’s focus and energy.

Review Mechanics

Google Docs provides a very powerful mechanism for review – it allows everybody to review the same document at the same time even as it changes, and comment on clearly delineated parts of the document. Others can respond to a comment or “resolve” it. It’s like a simple issue-tracking system, but rendered inline in the document, and updated in real time for rapid collaboration. It’s amazing to watch sometimes how comment threads pop up and grow within minutes.


Best practices for working with this feature:

  • only the document owners are allowed to edit the text of the document; all others can comment. This can be enforced by setting comment/edit permissions as required.
  • everybody (document owners and reviewers) should engage early to allow the maximum amount of evolution of the document. Avoid waiting until close to the deadline to make changes.
  • document owners can and should resolve comment threads as soon as possible after making changes that address the comments. Example:
    • A reviewer discovers a typo or other oversight and creates a comment to that effect. Later, one of the document owners simply fixes the text and resolves the comment.
    • This could range all the way to adding whole sections to the document based on the feedback.
  • After the review phase has ended, the document owners should distill remaining (and sometimes lengthy) comment threads into a “Remaining Open Issues” section in the document to prepare for the decision meeting.

The Decision Meeting

After the review phase, the team comes together in an in-person meeting to make decisions. Specifically, the decision to be made is to either “ratify” the proposal and make it the new plan of record, or to do another iteration. It is important to insist on these being the only possible outcomes.

The invitation for such a meeting could look like this:

From: tigerteam@bacon.com
To: dilip@bacon.com, franks@bacon.com, kbeyer@bacon.com
Cc: eng-all@bacon.com, product-all@bacon.com
Subject: ACTION REQ’d: Design Proposal for Client/Server Protocol Ready for Review

Thanks for your participation in the review of our draft design for the new client/server protocol! With your help, we made great progress and caught several issues.

We will meet tomorrow, Thursday, April 7th at 10am in the Sequoia conference room to discuss all unresolved feedback (see document (link)), and, if things go well, make a decision on the spot. Everybody who has participated in the discussion should have received a meeting invite. If you’re marked as optional in the invite, you may attend, but don’t have to.

Again, we will *not* review the draft in the meeting, but just go through the unresolved issues. You’re expected to be familiar with the document.

If the issues are sufficiently resolved, the stakeholders (for this matter: Dilip, Frank and Kathy) will make a decision in the meeting!

See you tomorrow!

The “Tiger Team” (Lily, Joe and Wu)

Again, please note the clearly stated expectations and deadlines, and an explicit list of the stakeholders (and therefore decision makers).

The required attendees for the meeting are:

  • The document authors
  • All stakeholders for the matter at hand
  • team members who participated in the review and whose comments are still unresolved at the time of the meeting

Other participants in the review are invited as optional.

This one’s important: If a stakeholder cannot attend, they must send a delegate with decision authority. If a stakeholder is not present (either in person or with a delegate), the meeting must be postponed until all decision makers are in the room.

At the time of the meeting, everybody is expected to have read and understood the document. The meeting host allows discussion only of comment threads that could not be resolved, not of the proposal per se.

If there are no unresolved matters, there is no discussion, and the stakeholders immediately move to make the decision to ratify the draft. Thus, the meeting may end long before its scheduled end and everybody in the room gets some time back. Or, if this is the expected outcome, the meeting host may schedule multiple decisions for the same time slot.

If unresolved issues do remain at the end of the allotted discussion time, the stakeholders can either make a decision anyway, or ask for an amended proposal. In this case, the process is reset to the drafting phase (perhaps with a shortened review period if the changes are not major).
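For readers who think in code, the flow described so far can be sketched as a tiny state machine. This is only an illustration of the rules in this post. The names `Phase` and `next_phase` and the two keyword flags are made up for the example and do not come from any tool.

```python
from enum import Enum, auto


class Phase(Enum):
    DRAFTING = auto()
    REVIEW = auto()
    DECISION_MEETING = auto()
    RATIFIED = auto()


def next_phase(phase, *, stakeholders_present=True, ratified=False):
    """Return the phase that follows `phase` (hypothetical helper)."""
    if phase is Phase.DRAFTING:
        return Phase.REVIEW
    if phase is Phase.REVIEW:
        return Phase.DECISION_MEETING
    if phase is Phase.DECISION_MEETING:
        if not stakeholders_present:
            # Postponed until every stakeholder (or a delegate) is in the room.
            return Phase.DECISION_MEETING
        # Only two outcomes are allowed: ratify, or do another iteration.
        return Phase.RATIFIED if ratified else Phase.DRAFTING
    # RATIFIED stays put until substantial new information starts a new draft.
    return phase
```

Note that the decision meeting has no "discuss it some more next week" exit: it either ratifies the proposal or sends it back to drafting.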


Executed well, this process is very efficient. I’ve seen groups make 2-3 high-confidence decisions about meaty technical matters within a 60 minute meeting (after some practice, of course!).

After the Decision

If the proposal is ratified, the document is updated with any new material that came out of the review or decision meeting, and becomes plan of record. It will be archived in a place where it can be easily discovered. This allows everybody in the company to learn the details and also the history of the plan of record in the future.

With the decision, the team implicitly agrees that the plan of record will not be re-discussed unless substantial new information has come up that may invalidate the decision.

In that case, the team would likely follow the same process again, starting with a new draft (that refers to the old ratified proposal, of course, and explains what has changed).

Summary

The process described above meets many desirable criteria for decision making processes.

It is transparent, collaborative, cooperative, inclusive and participatory, and therefore well-received in organizations that value these principles.

In large teams, the process will typically take between two and four weeks from inception to decision. This is a good compromise between speed, cooperativeness and thorough communication. In smaller teams, it could be a lot faster (e.g. a couple of days for teams of five or fewer).

In any case, decisions will still be made quickly once the context is available and reviewed because there is no consensus required – it is up to the stakeholders.

This concludes my three-part series on organizational best practices (part 1 and part 2). In upcoming posts, I will write about other aspects of organizational culture (e.g. team building, compensation).

AWS Elastic Load Balancer Black Hole

by Greg Orzell

It turns out that there is a rather nasty bug/behavior in Amazon’s TCP/IP load balancing that can lead to traffic being black holed.  This is of course the nightmare scenario for pretty much anything TCP/IP.  Let’s take a detailed look at what can happen and what I think is going on.

TCP/IP Load Balancing

Before talking about the nature of the problem and how it was diagnosed, it is probably worth a quick review of how TCP/IP load balancing works.  Clients connecting to the load balancer (LB) establish a session with it.  All of the TCP/IP protocol requirements are handled by the LB, so as far as the client is concerned it IS the server.  This is really important because it means that things like keepalives and other protocol-related packets are handled independently of the connection to the actual server.

When a new connection arrives at the LB, it creates an additional connection to one of the servers that are registered with it, using a round-robin algorithm.  The bodies of the packets from the client are unwrapped at the LB, then re-wrapped for the LB-to-server connection and sent on.  Response packets do the same thing in reverse.  In essence, the LB is acting as a proxy for packets going between the client and the server.
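To make connection-level balancing concrete, here is a minimal Python sketch. It models only the bookkeeping, not real sockets, and the names (`ConnectionBalancer`, `accept`, `forward`) are hypothetical. It says nothing about how the ELB is actually built; it is just one instance of the round-robin, per-connection behavior described above.

```python
from itertools import cycle


class ConnectionBalancer:
    """Toy model of a connection-level (TCP) load balancer."""

    def __init__(self, servers):
        self._next_server = cycle(list(servers))  # round-robin iterator
        self.table = {}  # client connection id -> chosen server

    def accept(self, client_id):
        # The server is picked once, when the client connection is established.
        server = next(self._next_server)
        self.table[client_id] = server
        return server

    def forward(self, client_id, payload):
        # Every later payload on that connection goes to the same server.
        return (self.table[client_id], payload)
```

The key point is that the choice is made per connection, not per packet or per request, so a long-lived connection stays pinned to one server for its entire life.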

Finding The Hole

A little while back I was helping to debug why connections were behaving in odd ways when going through an Elastic Load Balancer (ELB).  The application was using TCP/IP load balancing for long-lived TCP/IP connections (specifically XMPP).  It’s also important to note that the XMPP protocol does not follow a request/response pattern for all actions.  This means that the client can send requests out for which it doesn’t expect a response from the server.  Using long-lived connections and lengthy timeouts was particularly important because most of our clients were mobile devices.  It was important that we not power up the radio to send heartbeats any more frequently than absolutely necessary.

For the most part this worked as expected, however there were times when clients were mysteriously unable to continue talking to the servers they were connected to.  Through a series of tests we were able to determine that these times tended to correlate with when we were deploying new code.  Whenever new code was pushed and the client sent further information, it would never reach any of the servers.  This would continue until a heartbeat request (which expects a response) failed.  Because we wanted to limit radio use, this type of request was only sent every few minutes, so numerous updates from the client could be lost.

At this point we could see that packets were being successfully delivered to the ELB from the client, but we couldn’t find the same packets being forwarded on to any of the servers.  This led us to take a closer look at the ELB itself and how it was interacting with the servers behind it as they came and went.  It’s important to note that we were using a Red/Black push methodology, where a new set of servers is registered with the load balancer and the old servers are deregistered.  The diagrams below show a simplified view of what was happening during the push process.

Initial State:

ELB2

After de-registration:

ELB1

As you would expect, because the client’s connection is terminated at the ELB and then proxied to the server, it is not directly affected by the de-registration event.  As mentioned before, TCP/IP load balancing is done by connection, not request (as with HTTP) or packet.  When a server is deregistered it leaves things in an awkward state on the client side.  The client’s packets no longer have a destination, and the client has no way of knowing this, because they are still successfully sent to and ACKed by the ELB.  What exactly the ELB does with them after that is a bit of a mystery, but they are never to be seen again.

A Better Way

When this happens, I think that the load balancer should help you out, but it turns out it doesn’t.  Instead it just sends all the client packets into nothingness.  Bummer.  What I think would make more sense would be for the ELB to send RST packets to all of the clients for which it was proxying connections to that host, when de-registration occurs.  The ELB should have a state table mapping ELB/client connections to ELB/server connections.  So I would think that this kind of solution would be fairly trivial to implement.  If it behaved in this way, the clients would establish new connections to the ELB which could then be proxied to hosts that are still registered with the ELB and everything would be more or less happy.
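A rough Python sketch of the state table idea follows. This models the behavior proposed above, not anything the ELB is known to do, and every name in it (`ProxyStateTable`, `connect`, `deregister`) is invented for illustration. Sending the RST is reduced to returning the list of affected client connections.

```python
from collections import defaultdict


class ProxyStateTable:
    """Maps client-side connections to the server they are proxied to."""

    def __init__(self):
        self._server_of = {}                 # client connection -> server
        self._clients_of = defaultdict(set)  # server -> client connections

    def connect(self, client, server):
        self._server_of[client] = server
        self._clients_of[server].add(client)

    def deregister(self, server):
        """Forget the server and return the clients that should get a RST."""
        orphaned = self._clients_of.pop(server, set())
        for client in orphaned:
            del self._server_of[client]
        return sorted(orphaned)
```

With the reverse index in place, finding every client that needs a reset is a single lookup at de-registration time.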

While we wait for something like this to be implemented, there are a couple of ways to work around the problem.  One is to send requests that have an expected response more frequently so that clients can more quickly identify a dead connection.  The other is to do a rolling push where you shut down rather than de-register the servers.  When this happens the server connections to the ELB are closed and the upstream clients are notified appropriately that what they were talking to is no longer available.
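The first workaround can be sketched as a small client-side monitor. This is an assumption-laden illustration, not the code we ran: `HeartbeatMonitor`, its methods and the `interval`/`timeout` parameters are hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
class HeartbeatMonitor:
    """Decides when to send a heartbeat and when to give up on a connection."""

    def __init__(self, interval, timeout):
        self.interval = interval  # seconds between heartbeats
        self.timeout = timeout    # seconds to wait for a heartbeat response
        self._sent_at = None      # time of the outstanding heartbeat, if any
        self._last_ok = 0.0       # time of the last confirmed response

    def tick(self, now):
        """Return 'send', 'dead' or 'idle' for the current time."""
        if self._sent_at is not None:
            # A heartbeat is in flight: either keep waiting or declare failure.
            return "dead" if now - self._sent_at >= self.timeout else "idle"
        if now - self._last_ok >= self.interval:
            self._sent_at = now
            return "send"
        return "idle"

    def on_response(self, now):
        self._sent_at = None
        self._last_ok = now
```

In the worst case a black-holed connection goes unnoticed for roughly `interval + timeout` seconds, which is why shortening the interval shrinks the window of lost updates at the cost of more radio use.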

Luckily this isn’t a very common use case so it probably doesn’t affect a large portion of the ELB user base.  But it does give me pause as more people start to use websockets and other long lived connections for their services.  Will they do the right thing?  It’s probably worth taking a closer look.

 

Escalation as Tie-Breaker

By Christian Kaiser

In my previous post, I described how teams can remove obstacles by clearly defining who makes decisions for every area of responsibility in the team. The method includes assigning stakeholders who need to agree on every decision in the area.

An interesting aspect of it is that all stakeholders have the same level of authority over an area, i.e. all need to agree (or at least not disagree) in order to make a decision.

So what happens when the stakeholders do not agree?

I’ve observed teams become stuck when the decision is not being made and the dispute is just left standing. That is never a good option.

On other occasions, people have been more inventive and gotten things moving again, but not without breaking some rules. For example, I’ve seen one side just silently get going on the controversial project, thus creating irreversible facts. On the upside, the team can move on. On the downside, this behavior will undermine trust in the organization.

Why might intelligent people who belong to the same organization disagree?

Most likely, both sides have excellent reasons for their stance based on the context they have. For example, there may be insufficient or outdated context that needs to be communicated to one or both sides. Or perhaps the disagreement has exposed a question that no one had thought about thus far.

These situations clearly require communication to, and action from, people with the necessary context and authority. Simply put, the boss decides. After all, it is part of their job description!

Here’s how it can work: In case of a stakeholder stalemate, the next higher layer of the organization is informed and asked to make the decision (the “escalation”).


Ideally, that is just one person, but if there is no immediate common boss, it may be multiple people (e.g. the respective department heads). If these bosses do not agree either, the matter is escalated further. If needed, all the way to the top.


Doing so exposes the disagreement to a larger portion of the organization. That may come with discomfort. However, if all the obvious routes have been explored and the conflict is in fact due to a disconnect at a larger scale, it is the right thing to do and should be encouraged in every way possible.

Best Practices

So how does a boss help their team to make a decision? Three steps:

#1: The conflicting parties must give the boss the full pros and cons of the decision at hand. It is important that the boss fully understands them and demonstrates that understanding. If she fails to do so, it may create frustration and loss of trust (“I’m not surprised she decided the other way, she did not even listen to me / understand my proposal”).

#2: She then needs to make the decision.

#3: Last but not least, she needs to communicate why she made the decision this way. The reasons can actually range from “coin flip” to “judgment call” to “logical conclusion”. It’s important that there is an explanation so that people don’t have to make up one. The made-up explanation is typically a lot more dramatic than the real one!

At the end of the process, everybody can move on, knowing that a needed decision has been made. The party whose proposal was not chosen can rest assured that they have been heard and their reasoning was fully taken into account.

Potential Problems

What if people don’t know they’re expected to escalate, or are afraid to?

This is a common problem. The boss must reiterate their readiness for escalation often, and avoid emotional behavior when it happens. A simple eye roll will discourage people strongly! And since escalation doesn’t come naturally in many organizations, the boss should actively explore her blind spots, find out when escalations that should be happening are not, and teach the team when to seek her out for decision making.

What if the boss is not making the decisions she needs to make?

In that case, the boss is failing her team. Perhaps it’s time for the team to escalate the inaction to the next level up.

What if the boss deals with escalations a lot, and it’s taking too much time?

First and foremost, making decisions is a crucial part of the duties of a manager, and should be highest priority, especially if progress is blocked. However, one clearly would want to minimize the occurrence of escalations.

Some common reasons for escalations and how to address high frequency occurrences:

  • Lack of context in the teams – probably the most common reason. The remedy is to diagnose the disconnect and communicate the resolution.
  • Unclear responsibilities – another form of lack of context. See above and my last post.
  • Conflicting personal motivations – this is the hardest to diagnose and fix. For example, Bob is pushing for a decision because it will help get him promoted, even though it may not be the best for the company as a whole. It takes grit and determination to address these problems, especially in bigger companies.

What if decisions are regularly escalated so highly up the hierarchy that it’s difficult for the boss’s boss’s boss to know enough details to make a good decision?

It may be time to think about changing the structure of the organization, e.g. by creating a “common boss” lower in the hierarchy.

Conclusion

As part of a series of articles on organizational health and best practices, we’ve looked at escalation, an often underused method to reach quick decisions and resolution of larger disconnects.

In my next article, I’ll use these principles as a foundation for best practices for rapid technical innovation in teams of any size.

My Master Plan: Make it Up as You Go Along

By Julie Pitt

“Where do you want to be in five years?” We’ve all been asked this question in a job interview. If you’re like me, you probably made something up on the spot about ascending into leadership, or expanding your skillset. I have a confession to make. I loathe this question. Honestly, I don’t know what I’m doing. I have never had a plan.

When I really examine this question and how it relates to my success, I can confidently say that lack of a plan has never been a problem for me. I distinctly remember when I left my last government job to take a position at Netflix. My boss at the time told me: “you will go to industry, get burned out, and be ready to return and become a Civil Servant.” The opposite happened. Within a few months of joining Netflix, I overcame the feeling that I was out of my league. I became energized, realizing I was capable of things I never thought possible.

Still, if someone told me at the time that I would someday start a company, the last thing I’d say would be “yep, that’s my master plan.” I would have told them that such a move “just isn’t in my nature.” And yet with every career move since, I have ventured into the unknown and come out the other end resembling more of an explorer and less of a comfort seeker.

Maybe that is why I feel like now is the right time to start a company.

The Next Phase

It is no secret that most companies fail. Overconfidence and optimism are well documented in entrepreneurs who delude themselves into thinking that unlike most people, they will be successful. I claim that I am only slightly delusional. I accept that I am entering into this venture against extreme odds. However, if the only thing I walk away with is an exciting adventure and the wisdom of hindsight, I will feel as if my time was well spent.

At a personal level, I want to know, is this “it” for me? Is there an “it”? My gut says no, but only time will tell. Since humans like me easily forget moments of sentiment and determination, I shall lay out a series of testable hypotheses that I can assess as we form and evolve our company. I admit that most of the “testing” will be subjective. My hope is to counteract the human tendency to fit an explanation to past events, by making predictions before those events unfold. Before diving into those, let me set the stage.

The Partners

My two co-founders and I have worked together closely at two companies in the past, over a span of 6-8 years. We have observed each other’s successes, but have also worked through problems together under extreme stress and pressure. We have a good idea of our individual strengths and quirks, and have given one another very candid and sometimes painful feedback. Fundamentally, we are starting from a mutual basis of trust founded on years of experience working together.

All three of us have some runway in the form of personal funds, where we can each go for some time without a paycheck. This allows us to start without external funding. We all have very marketable strengths from our past industry experience, such that we can advise other companies on building successful products, avoiding pitfalls along the way. Given the immediate demand for our expertise, consulting is our primary business.

At the same time, we are all passionate about developing an innovative product that people will love. We don’t yet know what that will look like. Our current strategy for exploring this space is to set aside time for R&D projects that will yield a clear signal on which direction to go. Admittedly, it can be extremely difficult to balance these efforts against a growing consulting business, so one open question is whether we can pull this off.

Business Hypotheses

Within the above context, I make the following predictions.

Hypothesis: If each partner has motivations that are aligned, all will act in the best interest of the company. Nobody will try to “cheat” and nobody will feel shortchanged.

This one especially applies to small startups. I want to make the distinction that when forming a partnership, it is unreasonable to expect that everyone will contribute equally in all areas. This means that assuming you have the right mix of people and strengths, each person will contribute those strengths and that as a whole, nobody feels like they are pulling extra weight. If it feels like someone is slacking off, the question should be why. I would posit that the reason most assuredly comes down to that person’s motivations no longer aligning well with those of the other partners.

Hypothesis: Having a sane and disciplined process for evolving company vision and strategy is more important than having “the right” vision and strategy.

In essence, you don’t know what you don’t know, and that’s OK. I believe it is best to acknowledge what you know and what you don’t, form a strategy based on available information, but quite quickly adjust as more information is available. What I’m stressing here is that the governance process for evolving a vision and strategy is more important than having the right ones from the beginning.

Hypothesis: Exploring open-ended problems in time-boxed increments with measurable achievements can lead to innovative solutions that ultimately provide business and/or consumer value.

When exploring an unknown space in a business setting, some questions are bound to come up. Is this a fruitful path to explore, or is it a dead end? If it is a fruitful path, how long will it take? Usually these questions don’t have answers, especially at the beginning.

I claim that the best approach when embarking on an expedition like this is to put checkpoints on the calendar, in advance, to assess whether to kill the project, keep going, or make a course correction. A goal or testable hypothesis should be set for each time increment and assessed at each checkpoint. Even if it’s as simple as “I will have a working vocabulary of technology XYZ in 1 week”, your ability to decide what to do next after 1 week will be dramatically improved.

Hypothesis: The business will unwind gracefully and amicably if the exit is discussed and agreed to in writing at the outset.

This prediction is based on advice I have heard from a number of entrepreneurs and professionals. Even when you are working on the basis of mutual trust, talk through all the ways your business can spontaneously combust and decide how to handle them before emotions and tensions of the situation are involved. In other words, let your thinking brain decide, rather than your emotional brain.

Hypothesis: Giving something away spreads the seeds of serendipity far and wide. Most of those seeds will wither and die, but some will bloom in surprising ways.

To prove this one, all I need to do is provide an example of what I judge to be a fortuitous encounter, relationship or other windfall that results from this blog. Another question that has come up between the three of us is, how much free advice should we give before engaging a client? My stance is that we should not worry about this. If a 20 minute conversation with a prospective client solves all their problems, we should feel good that we helped them. It would also be a signal that either their problems aren’t that challenging or our advice isn’t as valuable as we think it is.

Hypothesis: Growing a company very slowly means you can work sane hours and have a life.

This is one of those questions I have wrestled with for years. I have a natural tendency to immerse myself in whatever I am doing. I know that to be a healthy person for the long term, I need to balance that with family, fitness, hobbies and other pursuits. On top of that, commuting to Silicon Valley is horrendous for me and I’m not interested in moving. Can I work a normal workday and still create a successful business?

On the growth side, I often wonder how much of the urgency of delivering something quickly is manufactured by CEOs and leadership. It seems that at any particular moment, “the time is right and we must act now.” I believe that nimbleness and maneuverability are keen advantages in just about any business, but moving quickly in an unfocused way will lead to burnout with no real gain. Does a company really need to grow quickly to be successful?

Hypothesis: Enlisting the help of a good lawyer and accountant at the outset will save a ton of headache.

To prove this hypothesis, I need to produce an anecdote in which the advice of one of these professionals averted a major snafu. I am almost certain this will happen.

Hypothesis: It is possible to build a successful consulting practice while doing R&D for product development. The business can successfully transition from one to the other, or sustain both.

I am the least certain about this one. I have heard from other friends who have done consulting that sales and marketing eventually soak up much of one’s time when building up a client base. From my experiences so far, I can easily believe this. The delusion creeps in here because I must believe that somehow “it will be different this time.”

Summary

Some of these hypotheses will be confirmed or refuted quite quickly, and some will take months, if not years, to be more or less conclusive. I will endeavor to give updates on these as events unfold. It is also more than likely that I will realize that the formulations of some of these hypotheses missed the mark, or that there are other ones that are truly relevant. At least now I have a starting point from which to compare my experience.

Who’s Responsible For What?

By Christian Kaiser

Imagine everyone in your company or department knowing exactly whom to invite or not to invite to any given meeting. Imagine decisions being made without delay because everybody who needs to be in the room consistently is, and the total number of people in the room is always as small as it can be!

In my experience, there’s no better way to accelerate a project than by being crisp about who’s responsible for what, who needs to be involved when and how, and who doesn’t.

I found that a RACI-like method (applied quite informally) works well. It’s usually a quick and simple exercise, with a very high return on the time spent.

I’ll walk through the steps in the following sections:

Determining Areas of Responsibility

First, you will need to define the “areas of responsibility” you want clarity on.

The areas you use for your exercise can be as coarse or fine-grained as needed, as long as everybody knows where the areas begin and end. To keep it simple, I like to start with a few coarse areas and only split and refine when needed (i.e. if a sub-area needs a different assignment of roles).

Let’s use an example: Baconwrappr.com is a fictional small startup of a handful of people. Everybody’s wearing a number of big hats, and their areas of responsibility are rather coarse: “Sales”, “Finance”, “Engineering”, “Marketing”.

As Baconwrappr.com’s business grows, they hire additional people. They now have James, a sales executive. Mike, one of the founders, still works directly with a few key clients with whom he has a personal relationship. To reflect that, the group splits the “Sales” area into “Sales (client X and Y)” and “Sales (everyone else)”.

The company also hired a great product manager, Sean. Now “Product” becomes a new area of responsibility that is separate from Engineering.

To record things for discussion purposes, I like to put them into rows in a spreadsheet, like so:

[Figure: spreadsheet with one row per area of responsibility]

NB: Google Apps seems particularly well-suited for this purpose because it allows collaboration between multiple contributors, and no documents are passed around in email, which avoids confusion about which copy is the right one and how to merge changes.

Assigning Roles

Then you need to assign roles for the appropriate people in the team, in each area. Start with putting the names of people on top of columns to the right, like so:

[Figure: the same spreadsheet with people’s names as column headers to the right of the areas]

Assigning the roles is quite simple, since there are only a small number of roles to choose from. Here are the ones I use:

Owner

The person in this role drives the whole area of responsibility forward. It could simply be someone who does the actual hands-on work, or a delegate like a program manager, team manager etc., if a whole team covers this area. This is the person who drives the decision process and calls the meetings. Please note that ownership means just that, and does not include final decision authority.

There can only be one owner for each area of responsibility. If you find yourself thinking about creating multiple owners for some area, think about whether you should split the area instead.

Stakeholder

A person in this role needs to agree when decisions in this area are made. Owners are also stakeholders by definition. Stakeholders are mandatory attendees to decision making in this area. If they cannot make it, they must send a delegate with decision authority. Or in other words, no decision that affects the area can be made without them.

There should be as few stakeholders as possible to be able to make decisions quickly. Ideally, there’s just one: the owner. But for a variety of reasons, many organizations choose to have more than one pair of eyes on the area.

Before assigning someone stakeholdership or asking for stakeholdership yourself, ask yourself whether the team could instead find ways to trust the owner or other existing stakeholders to make decisions. After all, stakeholdership comes with a commitment of time and focus. For example, the team might decide instead to increase the owner’s level of context by training them, or by bringing them into conversations they’re not part of yet.

Informs decision

People in this role are invited to meetings or are otherwise consulted when decisions need to be made. They are not mandatory attendees at decision meetings and are also not required to agree with a proposal.

Conversely, if you find that you would like someone to be a mandatory attendee at decision-making meetings, they should probably be a stakeholder.

None

This is actually a very important role. Someone with this role in a particular area is not involved, e.g. not invited to the decision-making meetings, and therefore does not need to spend time there and can focus fully on other areas. You’ll want to maximize the number of these.

In case you’re missing a role along the lines of “informed”: I like to simplify by not calling such a role out separately. In most small to medium size organizations, the default for this role should be “everyone”, at least for opt-in information channels.

Some people may argue that everyone in their organization already knows what each other’s roles are, and that might certainly be true in their case. However, I have found it to be false in most cases. Specifically, it is often very difficult to organically disperse this knowledge crisply when new people come on board.

I was often surprised by how many disconnects were exposed after writing things down and discussing them with the people involved. For example, multiple people might think they’re the owner, or, even worse, it turns out that no one thinks they own a particular area.

Responsibilities are expressed as field colors at the intersection of people and areas of responsibilities. Pick your own color scheme! I’m using RED for owner, GREEN for stakeholder and YELLOW for informer.

Here’s baconwrappr.com’s matrix:

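[Figure: baconwrappr.com’s responsibility matrix, with colored cells marking each person’s role in each area]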

You’ll notice that it is not only clear how the roles are distributed in a particular area (horizontal), but also which hats and how many of them everyone wears (vertical).

I’ve seen very productive reality checks come out of this exercise, e.g. when someone realizes they cannot really own two major areas and be stakeholder in three more, or that a certain area has too many stakeholders. It’s often satisfying for the team to mutually agree on “downgrading” someone’s role, e.g. from “stakeholder” to “informs”, or even to “none”, because it likely translates into increased focus for the other (probably no less important) roles, and into faster decision making overall.

In our example, Mike, as a founder, started out wanting to be a stakeholder in all the areas, but then realized that he would be content (and already very busy) with informing Finance and Marketing decisions, and not being involved in Engineering decisions at all.
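
For readers who prefer code to spreadsheets, the same matrix can be sketched as a small data structure. The following Python is only a minimal sketch, not the spreadsheet from this post: the role assignments are hypothetical (only Mike’s absence from Engineering decisions is taken from the example above), and the helper names `areas_without_single_owner` and `hats_per_person` are made up for illustration. It checks the one-owner-per-area rule (the horizontal view) and counts hats per person (the vertical view).

```python
from collections import Counter

# Hypothetical role assignments, one dict per area of responsibility.
# Allowed roles: "owner", "stakeholder", "informs", "none".
matrix = {
    "Sales (client X and Y)": {"Mike": "owner", "James": "informs", "Sean": "none"},
    "Sales (everyone else)": {"Mike": "stakeholder", "James": "owner", "Sean": "none"},
    "Product": {"Mike": "stakeholder", "James": "informs", "Sean": "owner"},
    "Engineering": {"Mike": "none", "James": "none", "Sean": "stakeholder"},
}


def areas_without_single_owner(matrix):
    """Return the areas whose number of owners is not exactly one."""
    problems = {}
    for area, roles in matrix.items():
        owners = [person for person, role in roles.items() if role == "owner"]
        if len(owners) != 1:
            problems[area] = owners
    return problems


def hats_per_person(matrix):
    """Count, per person, how many areas they hold each active role in.

    Owners are counted as owners only, although they are also
    stakeholders by definition.
    """
    hats = {}
    for roles in matrix.values():
        for person, role in roles.items():
            if role != "none":
                hats.setdefault(person, Counter())[role] += 1
    return hats


print(areas_without_single_owner(matrix))
print(hats_per_person(matrix))
```

In this hypothetical matrix, Engineering comes back without an owner, which is exactly the kind of disconnect the exercise is meant to expose.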

Communicating It

When the group is satisfied with the result, it’s time to communicate it to whoever needs to know (probably everybody). How you do that is up to you.

Since the semantics do need some explaining to folks who have not been part of the exercise, I’ve found it useful to go over the details with the team and have extensive Q&A rather than just attaching it to an email to all@.

Summary

  • There’s often confusion about responsibilities, especially over time and through team changes
  • A rather simple and short exercise can provide clarity and focus
  • It allows the team to define the nature of involvement crisply
  • The team can agree upon an efficient distribution of responsibilities after careful deliberation

Next Time

You may have noticed that I did not touch on what happens when stakeholders disagree. How can a decision be reached in that case? I will cover best practices around this in my next post.

Layers of Failure, With a Side of Bacon

By Julie Pitt

Remember when we talked about action codes as a means for unambiguous, actionable protocol errors? You may have wondered why you need all this fancy stuff like action codes and error metadata, when there’s already a well-defined standard in HTTP status codes. It’s working for you, so why mess with a good thing?

Today we will dive into an example REST over HTTP application. I’ll show you why it’s important to separate failures in the transport layer from those in the application layer. Then I’ll give an example of how to do it.

Bacon is the New Coffee

I briefly considered making an example API that is all about coffee when I realized that there are not nearly enough bacon applications out there. Let’s start with a little story.

Greybird Labs is a fast-growing company with headquarters in a 5-story building. Many Greybird employees are bacon fanatics, so naturally, the company recently installed a bacon cooker on each floor. Since a bacon cooker is a thing I just made up, I can also say that it has a convenient REST API, which employees can call from their desks to start cooking bacon.

Bringing Home the Bacon

As a recent hire at Greybird, you’re in the midst of a massive refactor. All of a sudden, you can’t decide whether to use recursion or a for loop. “Aha!”, you exclaim (as I often do), “I need bacon!” Conveniently, you have created an alias called ‘bacon’ for the following command:

$ wget --post-file "order-bacon.json" \
> --header "X-Greybird-Auth: SSBMb3ZlIEJhY29u" `# authentication token` \
> --header "Content-Type: application/json" \
> http://bacon.greybirdlabs.com/v1/0/orderBacon `# not a real domain, so don’t try it`

The order-bacon.json file (i.e., the request body) contains:

{
  "baconType": "BlackForest",
  "numberOfPieces": 3
}

The response indicates that all is well:

{
  "machineId": 0,
  "jobId": 243,
  "baconType": "BlackForest",
  "numberOfPieces": 3
}

You’ve submitted your bacon job. Is it done yet? To find out, you type:

$ wget --header "X-Greybird-Auth: SSBMb3ZlIEJhY29u" \
> http://bacon.greybirdlabs.com/v1/0/status/243

In return, you see:

{
  "machineId": 0,
  "jobId": 243,
  "status": "cooking",
  "baconType": "BlackForest",
  "numberOfPieces": 3
}

Your bacon order is still cooking. While you wait, read on.

Bacon as a Service (BaaS)

By now you’ve figured out that the bacon cooker API is pretty simple. It has two resources:

POST /v1/[machineId]/orderBacon
GET /v1/[machineId]/status/[jobId]

Each bacon cooker is assigned a machineId, which is the floor number the machine is on. Greybird’s engineering team thought it would be funny to use a 0-based index for floors, so machineId 0 is actually on the first floor. Each active order is assigned a jobId, which can be used to track the status of your order.
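
To make the URL scheme concrete, here is a minimal Python sketch that builds the two resource URLs and the order body. The helper names (`order_url`, `status_url`, `machine_id_for_floor`, `order_body`) are made up for illustration, and the host is the fictional one from the example, so nothing here actually sends a request.

```python
import json

# Fictional host from the example above; not a real domain.
BASE_URL = "http://bacon.greybirdlabs.com/v1"


def machine_id_for_floor(floor: int) -> int:
    """Convert a 1-based floor number (1 to 5) to the 0-based machineId."""
    if not 1 <= floor <= 5:
        raise ValueError(f"no bacon cooker on floor {floor}")
    return floor - 1


def order_url(machine_id: int) -> str:
    """URL for POST /v1/[machineId]/orderBacon."""
    return f"{BASE_URL}/{machine_id}/orderBacon"


def status_url(machine_id: int, job_id: int) -> str:
    """URL for GET /v1/[machineId]/status/[jobId]."""
    return f"{BASE_URL}/{machine_id}/status/{job_id}"


def order_body(bacon_type: str, number_of_pieces: int) -> str:
    """JSON request body for an order."""
    return json.dumps({"baconType": bacon_type, "numberOfPieces": number_of_pieces})


print(order_url(machine_id_for_floor(1)))
print(status_url(0, 243))
print(order_body("BlackForest", 3))
```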

Bacon Foul

Now let’s think about what can go wrong when it comes to ordering bacon.

  • Credentials (i.e., X-Greybird-Auth header) missing or not authentic
  • Invalid URL path (i.e., no matching handler could be found)
  • Unanticipated error in the endpoint (probably a bug)
  • Server is busy and can’t take more requests
  • Order queue is full
  • No such machine ID
  • No such job ID
  • Invalid input data (numberOfPieces, baconType)
  • Not authorized (e.g., employee doesn’t have permissions to access a job)

A common model for failures in an API like this is to map each one onto an HTTP status code. You may have noticed that some of these failures are quite specific to ordering bacon, but others are generic enough that they would apply to other applications. For example, if Greybird wants to start offering eggs, they’d like to leverage much of the protocol and service stack already developed for bacon.

Bacon, Eggs and Reuse

A major drawback of using HTTP status codes for all failures is that a single piece of code needs to understand failures in both message transport and the application. This makes it nearly impossible to reuse error handling code for eggs. Testability suffers as well since testing the application now has a dependency on HTTP. Not to mention the ambiguity and brittleness that may be caused by overloading status codes.

A better model is to separate the protocol and corresponding failures into at least two layers: transport layer and application layer. The transport layer is responsible for sending and receiving messages via HTTP, but doesn’t care what the application does. The application layer knows the semantics of bacon ordering but doesn’t care how the orders came in.

Here’s how we might categorize the failures into layers.

Transport layer

  • Credentials missing or not authentic
  • Unanticipated error in the endpoint
  • Server is busy and can’t take more requests

Application layer

  • Invalid URL path
  • Order queue is full
  • No such machine ID
  • No such job ID
  • Invalid input data
  • Not authorized

Now that things are separated, we can map HTTP status codes onto transport layer failures.

401 -> Credentials missing or not authentic
500 -> Unanticipated error in the endpoint
503 -> Server is busy and can’t take more requests

Better yet, as we learned in my last post, let’s not define each and every possible failure in our protocol spec. Instead, enumerate the subset of HTTP status codes returned by the API.

400 Bad Request
401 Unauthorized
500 Internal Server Error
503 Service Unavailable
...etc...

Then, we enumerate the actions the client can take as a result of a failure.

enum ActionCode {
  Retry,
  DoNothing,
  ObtainCredentials
}

In the protocol spec, we define which action code is implied by each HTTP status code. The client then acts according to the implied action code.

400 -> DoNothing
401 -> ObtainCredentials
500 -> DoNothing
503 -> Retry
...etc...
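
On the client side, this mapping can be sketched as a simple lookup table. The Python below is a minimal sketch under a few assumptions: the names `STATUS_TO_ACTION` and `action_for_status` are hypothetical, only the four status codes listed above are included, and rejecting any other code with a `ValueError` is my choice for the sketch rather than something the protocol spec prescribes.

```python
from enum import Enum


class ActionCode(Enum):
    RETRY = "Retry"
    DO_NOTHING = "DoNothing"
    OBTAIN_CREDENTIALS = "ObtainCredentials"


# Only the status codes spelled out in the text; the spec's remaining
# entries are not reproduced here.
STATUS_TO_ACTION = {
    400: ActionCode.DO_NOTHING,
    401: ActionCode.OBTAIN_CREDENTIALS,
    500: ActionCode.DO_NOTHING,
    503: ActionCode.RETRY,
}


def action_for_status(status: int) -> ActionCode:
    """Return the action code implied by a transport-layer status code."""
    try:
        return STATUS_TO_ACTION[status]
    except KeyError:
        # Assumption: codes outside the documented subset are rejected.
        raise ValueError(f"status {status} is not in the documented subset") from None


print(action_for_status(503).value)
```

The transport layer needs nothing more than this table to decide what to do; it never has to know that the request was about bacon.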

Bacon != Eggs

What about application layer failures? A convenient delivery mechanism for application layer failures is the body of a 200 status code response. An alternative model is to reserve a specific status code (other than 200) to indicate an application-level failure, and place failure details in the body. I will describe the former method.

In the case of using a 200 for application failure, the body is now a wrapper that tells whether the request was successful. If successful, the response data is found inside the wrapper.

{
  "status": "Success",
  "responseData": {
    "machineId": 0,
    "jobId": 243,
    "baconType": "BlackForest",
    "numberOfPieces": 3
  }
}

Otherwise, the wrapper contains the failure.

{
  "status": "Failure",
  "error": {
    "actionCode": "Retry",
    "details": {
      "name": "OrderQueueFull",
      "errorCode": "1234",
      "description": "Order queue is full. Wait and retry."
    }
  }
}

The client only needs to check the status field. If the status is Failure, the client can then unwrap the error field and act according to the actionCode. Conversely, if the status is Success, the client can then forward the response data to the appropriate handler.
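
The unwrapping step might look like the following Python sketch. The function name `unwrap` and the tuple shapes it returns are assumptions made for illustration; the field names come from the two wrapper examples above, with the success payload shortened.

```python
import json


def unwrap(body: str):
    """Split a wrapper body into a success or failure outcome.

    Returns ("success", response_data) or ("failure", action_code, details).
    """
    wrapper = json.loads(body)
    if wrapper["status"] == "Success":
        return ("success", wrapper["responseData"])
    error = wrapper["error"]
    # The client acts only on actionCode; details are for logs and metrics.
    return ("failure", error["actionCode"], error["details"])


success_body = '{"status": "Success", "responseData": {"machineId": 0, "jobId": 243}}'
failure_body = (
    '{"status": "Failure", "error": {"actionCode": "Retry", '
    '"details": {"name": "OrderQueueFull", "errorCode": "1234", '
    '"description": "Order queue is full. Wait and retry."}}}'
)

print(unwrap(success_body))
print(unwrap(failure_body))
```

Note that nothing in `unwrap` depends on HTTP, so the same function could be tested, or reused for eggs, without a server.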

Summary

With this design, it is possible to keep application logic completely separate from the business of transporting messages. The benefits of such a design include independent reuse of transport and application logic, testability of both client and server applications and resilience against failure scenarios. It has worked well at several companies I’ve worked for, both in initial design phases and at scale.

Meanwhile, the employees of Greybird Labs are clinking their bacon strips as they toast to one more successful refactor. By the way, your bacon is done.

Making Failure Matter

By Julie Pitt

If you have ever seen or written vague error handling code; if you’ve ever been frustrated by an unhelpful error message like “something went wrong”; if you’ve ever designed an API, this article is for you. I’ll begin with a short story that describes the problems caused by ambiguous failures in client/server protocols and then explore ways to address them.

Enter Application Developer

Say you’re an application developer. You’re writing this awesome app and everything’s going great. It looks very pretty, the UI is responsive and best of all, it’s easy to use. Now all you need is data. Chances are, you’re going to get it from someone else’s API, which invariably requires access to a network and data store of some kind. You’re not too familiar with this API, so you start with something like this (you know, just to try it out):

try {
  // call the API
} catch (Exception e) {
  // error gobbling sasquatch
  print("me want error. nom nom.");
}

That is utterly…un…awesome. You wonder, how can I give this error-gobbling sasquatch the precision of Wolverine, with his nifty retractable claws and whatnot? How can I make my application responsive and resilient so that my users like it? You are determined to do better, so you try again:

try {
  // call the API
} catch (SQLException s) {
  // hmm wait...what does SQL Exception mean?
} catch (IOException i) {
  // should I try again, or give up? Probably try again?
} catch (TimeoutException t) {
  // Retry. Definitely.
} catch (Exception e) {
  // uh...
  print("fail");
}

I guess that was a little better. At least now you have discrete code blocks that allow you to recover in different ways. It’s kinda like you taped some claws onto Frankenstein’s fists and told him to have at it.

Now say that the API has been updated with a new error condition called ServerBusyException. You probably want to retry like you would with a timeout, but without changing your code, the ServerBusyException falls into the sasquatch bucket. Nom nom. Worse yet, when you do change your code, you have to map both TimeoutException and ServerBusyException to the retry logic.

Can you do better? Not really. But not to worry; I am here to tell you that it is not your fault. In fact, I would point the finger at the API designer. Whoever designed this API did not properly separate two very different concerns:

  1. Alleviate the pain
  2. Gain insight

As the application developer, you should only have to care about the first one. The API designer needs to worry about both.

Alleviate the pain

Alleviating pain means taking action. When you frame it this way, understanding exactly what went wrong is not a prerequisite to handling failures. Another way to look at it is that there is really only a discrete set of possible actions that an application will take to recover from failure. The goal of the API designer is to explicitly define those actions and enumerate them in the contract.

Let’s go back and look at the errors you had to catch in the last section:

SQLException
IOException
TimeoutException
ServerBusyException

How can we make these actionable? The first step is to map them onto specific actions the client application should take:

SQLException -> DoNothing
IOException -> Retry
TimeoutException -> Retry
ServerBusyException -> Retry

We call these action codes, which we can now enumerate:

enum ActionCode {
  Retry,
  DoNothing
}

Generally, any error that is due to some transient failure in the service should be acted upon by retrying the same request using a well-defined retry policy. On the other hand, if there is a bug in the client (e.g., corrupt data or a malformed request), the action taken should be to never try that request again. It is a good idea to limit the number of action codes to the smallest set of recovery scenarios that will lead to a resilient and responsive application.

Once the action codes are defined, we wrap the errors into a generic exception that conveys both which action to take, and detailed information about the failure:

class MyAppException extends Exception {
  // Tells the client which recovery action to take
  ActionCode actionCode;
  // We’ll get to this one a little later:
  MyAppError error;
}
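
As a rough illustration of the wrapping step, here is a minimal Python sketch. It rests on several assumptions: `sqlite3.Error` and `OSError` (which also covers `TimeoutError` in Python 3) stand in for the Java-style SQLException, IOException and TimeoutException, `ServerBusyError` is a hypothetical class, the exception carries only its cause rather than a full MyAppError, and mapping unrecognized exceptions to DoNothing is my choice for the sketch.

```python
import sqlite3
from enum import Enum


class ActionCode(Enum):
    RETRY = "Retry"
    DO_NOTHING = "DoNothing"


class ServerBusyError(Exception):
    """Hypothetical stand-in for ServerBusyException."""


class MyAppException(Exception):
    """Generic exception carrying the action to take and the original cause."""

    def __init__(self, action_code, cause):
        super().__init__(f"{type(cause).__name__}: {cause}")
        self.action_code = action_code
        self.cause = cause


# Mirrors the mapping above, using Python stand-ins for the exception types.
ACTION_BY_EXCEPTION = (
    (sqlite3.Error, ActionCode.DO_NOTHING),
    (OSError, ActionCode.RETRY),  # includes TimeoutError
    (ServerBusyError, ActionCode.RETRY),
)


def wrap(exc):
    """Wrap a low-level exception into a MyAppException with an action code."""
    for exc_type, action in ACTION_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return MyAppException(action, exc)
    # Assumption: anything unrecognized is treated as DoNothing.
    return MyAppException(ActionCode.DO_NOTHING, exc)


print(wrap(TimeoutError("slow backend")).action_code.value)
```

The mapping lives on the API side, so adding a new low-level exception type means adding one row to the table instead of touching every client.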

The client code then becomes:

try {
  // call the API
} catch (MyAppException e) {
  if (e.actionCode == ActionCode.Retry) {
    // do a retry!
  } else if (e.actionCode == ActionCode.DoNothing) {
    // do nothing!
  }
  // Here you would want to log what the action and error are
  logger.error(e);
}

Notice how this code completely ignores WHAT went wrong, aside from recording the particulars of the failure in logs and/or metrics. What it does care about is the actionCode field, which it uses to determine the course of action to take. I wrote this example using pseudocode that looks like Java, but there is no reason why you could not model MyAppException in JSON as part of a REST API.
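
To show what the retry branch could contain, here is a minimal Python sketch of a retry policy driven purely by the action code. The function name `call_with_retry`, the attempt limit and the exponential backoff delays are all assumptions picked for illustration; a real policy would be tuned to the service.

```python
import time
from enum import Enum


class ActionCode(Enum):
    RETRY = "Retry"
    DO_NOTHING = "DoNothing"


class MyAppException(Exception):
    def __init__(self, action_code, message=""):
        super().__init__(message)
        self.action_code = action_code


def call_with_retry(request, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call request(); on a Retry action, back off and try again.

    Any other action code, or a Retry on the last attempt, re-raises the
    exception so the caller can log it and give up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except MyAppException as e:
            if e.action_code is not ActionCode.RETRY or attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the policy can be tested without actually waiting.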

This model has several properties that are worth noting:

  1. The API designer is free to add as many MyAppError types as he wants, without breaking client applications. To maintain this property, the client code must never act upon or interpret any information in MyAppError.
  2. The client application only needs to handle each type of action code once. There is no longer a need to figure out which exceptions can be thrown and handle them in multiple places.
  3. Multiple client applications may integrate with the same API with consistent and unambiguous failure handling logic. This reduces maintenance costs for service maintainers.
  4. Action codes are extensible, provided the API is properly versioned. For example, you could introduce one called RenewAuthentication to indicate that a user must be prompted for her username and password. Each new action code is a change to the API contract and requires changes in the client code. Luckily in practice, such changes are infrequent once the initial API stabilizes.

Now that we have a model for conveying actions in our API, why not dispense with error types altogether? Unfortunately, your code will have bugs and you’ll need enough insight to detect and fix them.

Gain insight

Detailed error types are the mechanism for understanding what is happening in the application and debugging when something unexpected happens. Remember that action code called DoNothing? Unless we have fine-grained error types, there is no way we can chip away at precisely what the underlying causes are. Thus, the API designer should add as many error types as necessary to understand the failures in the application.

Let’s take a look at what you might want to put into MyAppError, in order to understand failures:

class MyAppError {
  // An easily distinguishable, unique name for the error that is also human-readable.
  // This is what you would use in the name of a counter, for example.
  String name;
  // Helpful to put into a diagnostic screen, in case customers need to tell customer service
  Integer errorCode;
  // Human-readable description of the problem, for developers (not user visible)
  String description;
  // The exception that caused this error
  Exception cause;
}

This is only one possible representation. The point is that as the API designer you can create a rich model of errors with enough metadata that you can tell what is going on in your application and debug if there is a problem. The consuming code should log and collect metrics on these errors to expose both specific failures and aggregates.
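
One way the consuming code might collect those metrics is sketched below in Python. The `ErrorMetrics` class and its counters are hypothetical, and the error name `MalformedOrder` with code 4321 is invented for the example; the point is only that the error name feeds counters and logs while the action code alone drives behavior.

```python
import logging
from collections import Counter
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("myapp.errors")


@dataclass
class MyAppError:
    name: str
    error_code: int
    description: str
    cause: Optional[BaseException] = None


class ErrorMetrics:
    """Counts failures per error name (specific) and per action code (aggregate)."""

    def __init__(self):
        self.by_name = Counter()
        self.by_action = Counter()

    def record(self, action_code: str, error: MyAppError) -> None:
        self.by_name[error.name] += 1
        self.by_action[action_code] += 1
        logger.error("%s (%d): %s [action=%s]",
                     error.name, error.error_code, error.description, action_code)


metrics = ErrorMetrics()
queue_full = MyAppError("OrderQueueFull", 1234, "Order queue is full. Wait and retry.")
malformed = MyAppError("MalformedOrder", 4321, "Request body could not be parsed.")
metrics.record("Retry", queue_full)
metrics.record("Retry", queue_full)
metrics.record("DoNothing", malformed)
print(metrics.by_name)
print(metrics.by_action)
```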

Summary

To keep your error handling code pumped full of adamantium, consider the following:

  1. Design your application to be resilient and responsive by enumerating specific actions it will take in response to failure.
  2. Include each such action in your API specification.
  3. Separate actions from error metadata. Do not act upon error metadata.
  4. Log and collect metrics on the error metadata so that diagnosis is possible after the fact.

Stay tuned for the sequel, which will discuss protocol layering and failure.