As teams, it’s essential that we not only design and develop our software, but also operate our software. In my earlier posts, I discussed some ideas around building teams so that they’re cross-functional and all-inclusive to help facilitate this notion. Something that I didn’t cover in those earlier posts was the “why” behind having teams not only write but also operate their software stack.
Perhaps it seems obvious as to why. After all, those who wrote the software have the most in-depth knowledge of the software they wrote. However, do these folks understand how their software operates? How all those design assumptions and interactions with the other parts of the system impact the overall system? For example, if service A calls service B and two different teams developed the two services, assumptions made by one group may not fit the call patterns expected by the other organization. An example of this that I’ve seen is service B, making an insert into a database using a transaction. If service A is calling this API hundreds of times per minute, an API that service B could expose instead is one that allows batching. This particular use case does not invalidate the use of the single call; however, understanding the overall system and its use cases prevent unpleasant surprises. Systems understanding also leads to designing systems that more usable.
When it comes to developing and operating your software, understanding your stack is as critical as understanding the code that you’ve written. At no time is this more evident than during a significant incident (think a pageable event with customer impact in the middle of the night). When an event like this wakes multiple teams up in the middle of the night, engineers should not be figuring out what dependencies exist and where the failure points may be. This type of event requires teams to be familiar with all aspects of their service, such as deployment activity, database, lambdas, interdependent services, frontend applications, etc. Taking it a step further, the on-call engineers should also understand particularly problematic dependencies, for instance, calls to a legacy system.
A multi-disciplinary and inclusive team helps address such areas. They have varied perspectives and think about different parts of the system. They round out a group’s knowledge and understanding. As explained in earlier posts, DevOps culture is not about creating a team. It’s about building an overall understanding of how your software works and operates.