Seeing Like an SRE: Site Reliability Engineering as High Modernism

5 years ago by theptip

It's an interesting comparison. Looking back in the history of software, "A Pattern Language" was an architectural treatise that inspired the concept of "software design patterns".

Similarly, I can see that considering the known issues with top-down vs. bottom-up city planning/evolution could be beneficial for software-centric organizations too; the issues with badly-fit top-down city plans seem to match very well with the pains of an ill-fit software architecture that's mandated from an ivory tower, complete with users using the planned cities "wrong".

I'm sure there are differences though. You have a lot more observability into your software systems, and at the end of the day, they are orders of magnitude less complicated than cities, so you can comprehend more of the system at once, and truly find common use cases to standardize around. This is in contrast to cities where it's impossible to really know every citizen's unique needs, temperament, and usage patterns.

Worth thinking about more; given the relatively low cross-pollination rates between the fields, I suspect there are more lessons that software engineers could glean from architecture and city planning.

5 years ago by ssivark

A key underlying assumption in Scott’s perspective in “Seeing like a State” is that diversity is critically important to healthy functioning of biological/human/cultural ecosystems. In large computing system fleets we’re often okay with the opposite — simplifying by fiat because the understanding/control of the architect is more important than the diversity of individual machine configurations. Yes, the monoculture could lead to correlated failures (Eg: all machines are vulnerable to the same exploit), but the common perspective is that the simplicity/controllability and efficiency gains are worth it.

I think we might be able to get by with this perspective so long as we’re seeing computers/systems only as inert tools. It’s interesting to consider whether there’s any motivation for that to change, as we move towards more ubiquitous & intelligent computing. (E.g., should IoT devices be thought of as akin to insects?)

5 years ago by WJW

One of the key differences is that (the various components of) nature has no common goal except that each individual component wants to reproduce, while large computing systems are almost always constructed to achieve some particular objective. Thus, nature is OK with it if predators randomly kill some percent of the population, while most factories would frown very much if a random employee started sabotaging lathes or something.

You could argue that Netflix-style chaos engineering is an attempt to introduce more resilience into the system precisely by mimicking nature's "anything can die at any moment" principle, but even then it typically only applies to computers. Netflix is known for firing fast but I don't think even they would consider randomly firing employees to make sure there are no single points of failure in the employee makeup. Would be interesting though: tax filings need to be submitted next Tuesday but the CFO was just fired, what is your recovery plan?

5 years ago by Kalium

I've encountered the idea of a Chaos HR Simian. People get random, unplanned, multi-week vacations.

Mentioned here: https://www.cognitect.com/blog/2016/3/24/the-new-normal-embr...

I know I've been on teams that were significantly disrupted by jury duty, medical incidents, traffic accidents, etc. So it seems like a reasonable way to simulate this.

5 years ago by benlivengood

Something the author didn't touch on specifically is the limit on languages at Google. When I left the officially supported languages were Java, C++, Python, and Go. That limited the scope of CI/CD, tracing and monitoring, and debugging to something tractable for the developer tools teams. It also made it tractable for SRE teams to be able to engage with new product teams without having to learn a whole new language.

A really useful thing my team did (and I think it was a moderately successful trend on other SRE teams) was to role play recent outages. The oncall who had seen a particularly interesting outage would DM using the graphs, error messages, and logs they encountered when debugging an alert for a chosen victim (ahem, role-player) who would have to choose which graphs, dashboards, and logs to look at and which remediation actions to take to track down and fix the actual problem. It was perfect for building metis since it was done in a team setting so everyone benefited from the insights into the system architecture and behavior and the role-player learned practical oncall skills. Things like escalating to other teams and running incident management were built into the RP.

5 years ago by klodolph

Python is “supported” but if you want to write a new program in Python you need approval from the area tech lead ¯\_(ツ)_/¯

5 years ago by pm90

This is such a great idea. I struggle to see it being adopted at my current, OKR-driven organization where literally any work is debated to death lol.

5 years ago by Jiocus

The author touches on knowledge management, which is one of the most interesting subjects I was able to study at uni (part of CS/SS). A kind of analogy to the techne/metis is the concept of explicit and implicit knowledge.

We codify knowledge or information into explicit knowledge such as documentation, expert systems or design. Not all knowledge lends itself to this.

Implicit knowledge is that which often requires experience and learning by doing. It is hard to capture explicitly: on the one hand, the skilled individual might be unaware of the skill in action; on the other, they may be unable to express it.

Various hacks are then tried to pry this valuable asset out into the open, so it can be recorded on a corporate wiki.

5 years ago by ravi-delia

My (admittedly limited) experience is that systems aren't maintainable except by people that are very familiar with them. The basic principles of SRE don't ignore that, they embrace it. Rather than trying to manage a system from the top, they encourage the admin to delve in and craft it themselves. By bringing infrastructure close to the users of that infrastructure, everyone gets a chance to gain hands-on knowledge. Is that how it actually turns out? Maybe, maybe not.

5 years ago by SideburnsOfDoom

> systems aren't maintainable except by people that are very familiar with them.

I think that a consequence of "two sorts of knowledge: techne and metis" is that standardisation is good, but it only gets you so far. Past that point, you need to be familiar with the system.

This should not devalue our efforts to standardise, e.g. get systems to all log to the same aggregator, and emit the same basic stats, agree on naming and forwarding of correlation ids that will allow us to cross-reference related log entries.

But we should also recognise that those efforts will never cover everything.

e.g. If I changed over to working on an unfamiliar system in the same organisation, I would know where it should be logging to, what the field naming and general structure of those log entries should be, but I would not know what healthy operation should look like in those logs.
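
To make the standardisation side of this concrete, here is a minimal sketch in Python of shared log field names plus a forwarded correlation id. Everything specific in it is an assumption for illustration, not something from the article or any particular organisation: the header name `X-Correlation-ID`, the field names, and the helper functions are all hypothetical.

```python
import json
import logging
import uuid

# Hypothetical header name; a real organisation would agree on its own.
CORRELATION_HEADER = "X-Correlation-ID"


class JsonFormatter(logging.Formatter):
    """Emit every log entry with the same agreed-upon field names."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # These two attributes are attached by the LoggerAdapter below.
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def get_correlation_id(incoming_headers):
    """Reuse the caller's correlation id, or start a new one at the edge."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())


def outgoing_headers(correlation_id):
    """Headers to forward on downstream calls so log entries can be joined."""
    return {CORRELATION_HEADER: correlation_id}


def make_logger(service, correlation_id):
    """Return a logger that stamps the service name and correlation id on each entry."""
    logger = logging.getLogger(service)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    extra = {"service": service, "correlation_id": correlation_id}
    return logging.LoggerAdapter(logger, extra)


if __name__ == "__main__":
    cid = get_correlation_id({})  # no incoming id, so one is generated
    log = make_logger("billing", cid)
    log.info("payment accepted")
    print(outgoing_headers(cid))
```

This only covers the techne half: it tells a newcomer where to look and what the fields mean, but not what a healthy stream of these entries looks like for a given system.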

5 years ago by jart

The author makes Kubernetes sound like it's a technocratic regime controlled by a political class of anyone who's ever held the title SRE at Google. They do control the means of production. Me however, I'm just a member of the typing pool.

5 years ago by lacker

Perhaps everyone who was ever an SRE at Google added one new configuration option to Kubernetes, and that's how it ended up this way.

5 years ago by logicslave

You joke but that's what happened with TensorFlow at Google. Everyone wanted a "contributed to TensorFlow" on their resume

5 years ago by jart

Well I think what they wanted was for their work to be used. It was a great big bag of things.

5 years ago by eternalban

Poor Corbusier, getting blamed for the architectural errors of Mies van der Rohe's sadly untalented copycats, pseudo-intellectual ideologues, and greedy developers.

For the record, Corbusier's Ville Radieuse (Radiant City) predates the Cold War by a rather hot World War II (1930). Interestingly enough, it was a very Googly impulse -- "organize all the world's" bipeds -- that motivated the relatively young control freak aka architect. After WWII, Corbu mellowed. And his collective residential structures, Unité d'habitation, were the result of his synthesis of a generative measuring system and modularity. OP and fellow SREs have quite a lot to learn from the mature thoughts of Le Corbusier.

Over here in America, we had our own native genius, Frank Lloyd Wright, who devised his vision of an urbanism for a democracy - The Broadacre City:

https://franklloydwright.org/revisiting-frank-lloyd-wrights-...

But of course, the "high modernism" clique (run by the moneyed set of the East Coast (think MoMA), and the "ex-Nazi", Philip Johnson) did everything to marginalize Wright. And it was this clique, having imported wholesale (ironically) the leftist architects of Europe escaping Fascism, that gifted us with "high modernism" dystopia.

If you want to learn about modern architecture, I recommend Ken Frampton's Modern Architecture: A Critical History. He was one of the very few actual teachers I had in architectural school worthy of the designation.

https://en.wikipedia.org/wiki/Kenneth_Frampton

https://www.goodreads.com/book/show/70140.Modern_Architectur...

https://en.wikipedia.org/wiki/Philip_Johnson#Controversy_ove...

https://en.wikipedia.org/wiki/Ludwig_Mies_van_der_Rohe (His own works were exquisite gems.)

5 years ago by pm90

This is an extremely well written article. I hope the concepts of techne and metis become part of the tech vocabulary and allow us to talk about differences in perspectives on infrastructure, and especially infrastructure migrations, more effectively without hating each other.

5 years ago by anotha1

> Techne is universal knowledge: things like the boiling point of water, Pythagoras’ theorem, the rule that all RPCs should have deadlines, or that we should probably alert if no instances of our jobs are running.

> metis, is local, specific, and practical. It’s won from experience. It can’t be codified in the same way that techne can. The comparison that Scott gives is between navigation and piloting. Deepwater navigation is a general skill, but a pilot knows a specific port — a ‘local and situated knowledge,’ as Scott puts it, including tides, currents, seasonal changes, shifting sandbars, and wind patterns. A pilot cannot move to another port and expect to have the same level of skill and local knowledge.

5 years ago by rwtwe

It might be worth noting that we don't need to rely on this particular book as a source for this distinction. It is essentially congruent with the necessary/contingent distinction in philosophy.

Other expressions of it include the strategy/tactics distinction and the nomothetic/idiographic distinction. The idea is based on the very ancient observation that phenomena involve both general laws and specific circumstances.

5 years ago by Simon321

Great article. Nice insights on techne & metis.
