Friday, April 5, 2024

The problem of "thrashing"

"Thrashing" is a general term in computer science for the situation in which multiple "work items" compete for the same set of resources, each work item is processed in chunks (i.e., either in parallel or interleaved), and as a result access to those resources ping-pongs between the work items. This can be very inefficient for the system if switching access to the resources carries overhead. Here's the wiki page on the topic: https://en.wikipedia.org/wiki/Thrashing_(computer_science)

There are numerous examples of thrashing issues in software development, such as virtual memory page faults, cache access, etc. There is also thread context thrashing: when too many threads compete for CPU time, the overhead of thread context switching alone (which is small for any single switch, typically on the order of microseconds) can still overwhelm the system. When thrashing occurs, it is generally observed as a non-linear increase in latency/processing time relative to the work input (i.e., the latency graph "hockey sticks"). At that point, the system is in a particularly bad state. Ironically, a very common critical problem in orgs is that additional diagnostic processes get triggered by performance metrics while the system is in that state, which can then cause it to fail entirely.
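As a toy illustration of that non-linear jump, here is a minimal sketch of a least-recently-used (LRU) cache being accessed round-robin by a set of work items. The function name `count_misses` and the numbers are made up for illustration; real caches and page replacement policies are more complicated than this.

```python
from collections import OrderedDict


def count_misses(num_items: int, capacity: int, rounds: int) -> int:
    """Cycle through `num_items` keys `rounds` times against an LRU cache
    that holds at most `capacity` keys, and count the cache misses."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for _ in range(rounds):
        for key in range(num_items):
            if key in cache:
                cache.move_to_end(key)  # hit: mark as most recently used
            else:
                misses += 1
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, "items:", count_misses(n, capacity=4, rounds=10), "misses")
```

With a capacity of 4 and 10 rounds, 4 items cost only the 4 initial misses, while 5 items miss on every single access (50 misses): one extra work item is enough to push this model off the cliff.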

To reduce thrashing, you generally want to try to do a few things:

  • Reduce the number of pending parallel/interleaved work items on the system
  • Allocate work items with more locality if possible (for example, keeping related work on one processing unit, so that it does not thrash against work on the others)
  • Try to allow more discrete work items to complete (e.g., running them longer without switching), to reduce the context switching overhead
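To make the first and last points concrete, here is a rough sketch of how much time goes to switching under simple round-robin scheduling. The function `switching_overhead` and its parameters are hypothetical, and the model only counts the direct cost of each switch (it ignores secondary effects such as cold caches), so treat it as a back-of-the-envelope aid rather than a real scheduler model.

```python
def switching_overhead(num_items: int, work_per_item: int,
                       quantum: int, switch_cost: int) -> int:
    """Total time spent on context switches under round-robin scheduling.

    Each item needs `work_per_item` units of work and runs for at most
    `quantum` units before the processor moves on to another item; every
    slice is charged one switch costing `switch_cost` units.
    """
    slices_per_item = -(-work_per_item // quantum)  # ceiling division
    return num_items * slices_per_item * switch_cost


if __name__ == "__main__":
    # 10 items, 100 units of work each, 2 units per switch.
    print(switching_overhead(10, 100, quantum=10, switch_cost=2))
    # Letting each item run longer before switching.
    print(switching_overhead(10, 100, quantum=50, switch_cost=2))
    # Halving the number of pending items instead.
    print(switching_overhead(5, 100, quantum=10, switch_cost=2))
```

In this model, both running items longer between switches and keeping fewer items pending cut the switching overhead directly.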

Now, while all of the above is well-known in the industry, I'd like to suggest something related, but which is perhaps not as well appreciated: the same problems can and do occur with respect to people within an organization and process.

People, as it turns out, are also susceptible to some amount of overhead when working on multiple things and task switching between them. Moreover, unlike computers, people also pay some overhead for work items which are merely "in flight" (they need to consider and/or refresh those items just to maintain the status quo). The more tasks someone is working on, and the more long-lived work items are in flight at any given time, the more overhead exists for that person to manage those items.

In "simple" jobs, this is kept minimal on purpose: a rote worker might have a single assigned task, or a checklist, so they can focus on making optimal progress on the singular task, with the minimal amount of overhead. In more complex organizations, there are usually efforts to compartmentalize and specialize work, such that individual people do not need to balance more than an "acceptable" number of tasks and responsibilities, and to minimize thrashing. However, notably, there are some anti-patterns, specific to development, which can exacerbate this issue.

Some notable examples of things which can contribute to "thrashing", from a dev perspective:

  • Initiatives which take a long time to complete, especially where other things are happening in parallel
  • Excessive process around code changes, where the code change can "linger" in the process for a while
  • Long-lived branches, where code changes need to be updated and refreshed over time
  • Slow pull-request approval times (since each outstanding pull-request is another in-progress work item, which requires overhead for context switching)
  • Excessive "background" organizational tasks (e.g., email management, corporate overhead, Slack threads, managing-up tasks, reporting overhead, side-initiatives, etc.)

Note that there is a human cost to thrashing as well: people want both to be productive and to see their work have a positive impact, and thrashing hurts both. As a manager, you should be tracking the amount of overhead and "thrashing" that your reports are experiencing, and doing what you can to minimize it. As a developer, you should be wary of processes (and potentially organizations) with systems in place, or proposed, that add to the thrashing you are likely to experience while working on tasks. This has a non-trivial cost, and the potential to "hockey stick" the graph of time wasted dealing with overhead.

In short: thrashing is bad, and it's not just an issue which affects computer systems. Not paying attention to this within an org can have very bad consequences.

