Tuesday, April 30, 2024

Development process issues in different types of orgs

There's an interesting dichotomy I've observed between companies where development is run by developers, and those where it's run by managers.

In the former case, problems tend to get fixed, because the people making decisions are also the people impacted by the problems.

In the latter case, problems tend to compound and go unresolved, because the people making decisions are not directly affected, and/or cannot understand why the situation is problematic, and/or have priorities that compete with fixing them. This creates an environment where you really don't want to be doing development, and the best developers tend to move into management and/or onto different projects.

The latter is also the case where projects tend to die over time, and sometimes companies with them, as they eventually become unmaintainable.

All of the above might be obvious to others, but I don't have a lot of previous experience working on teams where the development methodologies are dictated entirely by management (most of my previous roles have been at companies using Agile, or something close to it, and/or at smaller companies). In Agile-ish organizations, developers are motivated and empowered to fix problems which impede shipping code, and as such, blockers tend to be short-lived (sometimes the resolutions are not optimal, but significant issues rarely linger unresolved). This is not the case in large orgs where the methodologies are dictated by managers, though, particularly when those managers are not actively working on code.

This needs to be added to my list of questions and concerns when considering future roles, I think.


Monday, April 29, 2024

Thoughts on recent Google layoffs, Python support team

Recently, Google laid off an internal group responsible for maintaining Python within the company (story: https://www.hindustantimes.com/business/google-layoffs-sundar-pichai-led-company-fires-entire-python-team-for-cheaper-labour-101714379453603.html). Nominally, this was done to reduce costs; purportedly they will look to hire replacements for the group in Germany instead, which will be cheaper. Here's what the team was responsible for: https://news.ycombinator.com/item?id=40176338

Whether or not this saves Google money in the longer term, on balance, is an open question. This kind of move doesn't normally work out well (getting rid of tribal knowledge, experience, etc.), but it's something large companies do regularly, so not a huge surprise in general. That isn't what I found most interesting here, though.

Rather, I'd like to focus on some tidbits in the reporting and personal accounts which reveal some interesting (if true) things about the inner workings at Google at this moment.

Python is deprecated within Google

According to one of the affected developers, Python is considered tech debt within Google, with existing code to be replaced with code in other languages, and new code in Python frowned upon. There are various reasons given for this, but the fact that Google is moving away from Python is interesting, given that this doesn't seem to be the case in the industry generally (and Google is often ahead of the curve with technology migrations).

The group was responsible for fixing all impacted code, even other groups'

When the group upgraded Python across Google, they needed to fix all impacted code, even if they didn't own it. This is a pretty huge headwind to updating versions and taking patches, and it is reflected in their update planning and schedules. It points to the rather large systemic problem of trying to take regular updates to libraries across a large code base, and to either a lack of planning or a lack of budgeting at the project level.

Related to this, the group noted that unit tests were often flaky, because they were written in a fragile way and broke with updates to the language version. This is the big systemic problem with having a large code base of unit tests, of course: you need a plan and a resource allocation for maintaining them over time, for them to have net positive value. It seems Google is perhaps lacking in this area as well.
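
To make the fragility concrete, here's a hypothetical illustration (mine, not from the reporting, and assuming pytest): a test that asserts on the exact wording of an interpreter error message. CPython doesn't treat error message text as a stable interface, and messages like this have been reworded between releases, so the first test can break on a language upgrade even though nothing it actually cares about has changed.

    import pytest

    def test_fragile():
        # Brittle: couples the test to the exact interpreter error wording,
        # which is not a stable interface across Python versions.
        with pytest.raises(TypeError) as exc:
            "count: " + 3
        assert str(exc.value) == 'can only concatenate str (not "int") to str'

    def test_robust():
        # Sturdier: assert only on the exception type (the actual contract).
        with pytest.raises(TypeError):
            "count: " + 3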

Companies often don't value infrastructure teams

This is a general issue, but something to watch out for when charting a career course: lots of companies undervalue and under-appreciate infrastructure teams, which (in the best cases) act as force multipliers for product teams. The larger the org, the less visibility the people working on foundational systems have, and the more common it is for upper management to look at those teams for cost cutting. Getting into foundational work within a larger org is riskier, career-wise: it might be the most effective value-add for the company, but it's also likely to be the least appreciated.

If you want to maximize your career potential, flashy prototype projects supported by mountains of tech debt, which you can hand off before moving on to the next flashy project, will get you the most positive recognition at almost all large companies. A close second place is the person who fixes high-visibility issues with kludgy "fixes" which seem to work but ignore the underlying or systemic problems (which can be blamed on someone else). It's extremely likely that both of those types of developers will be valued more highly than someone who builds and maintains non-flashy but critical support infrastructure.

Managers are generally dumb, don't understand actual impacts

This is somewhat of a corollary to the above, but the person deciding which group to downsize, to ensure profits beat expectations and the executives get their multi-million dollar performance bonuses, isn't going to have any idea what value people in lower-level groups actually bring, and/or what might be at risk by letting critical tribal knowledge walk out the door. When there need to be cuts (usually for profit margins), the projects most visible to upper management will be the ones where people are safest. Don't get complacent and think that just because your project is critical to the company, and/or your value contribution is high, your job is safer. Pay attention to what executives talk about at all-hands meetings: the people building prototype features in those groups are the people most valued by the ones making the layoff decisions.

Take home

While Google has declined substantially since its heyday (in terms of prestige and capability), in some ways it is still a bellwether for the industry, so it's good to pay attention to what goes on there. In this case, I think there's good information to be gleaned beyond just the headline. It sounds like Google is now more similar than dissimilar to a typical large tech company in decline, though.


Friday, April 5, 2024

The problem of "thrashing"

"Thrashing" is a general term/issue in computer science, which refers to the situation (in the abstract) in which multiple "work items" are competing for the same set of resources, and each work item is being processed in chunks (ie: either in parallel, or interleaved), and as a result the resource access ping-pongs between the different work items. This can be very inefficient for the system if switching access to the resources causes overhead. Here's the wiki page on the topic: https://en.wikipedia.org/wiki/Thrashing_(computer_science)

There are numerous examples of thrashing issues in software development, such as virtual memory page faults, cache access, etc. There is also thread context thrashing: when you have too many threads competing for CPU time, the overhead of just doing thread context switching (which is generally only a few hundred CPU cycles per switch) can still overwhelm the system. When thrashing occurs, it is generally observed as a non-linear increase in latency/processing time relative to the work input (ie: the latency graph "hockey sticks"). At that point, the system is in a particularly bad state (and, ironically, a very common critical problem in orgs is that additional diagnostic processes get triggered to run in that state, based on performance metrics, which can then cause systems to fail entirely).
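
As a rough illustration of that hockey stick (my own sketch, with made-up iteration counts; actual numbers will vary by machine, and in CPython the GIL adds its own serialization on top of OS-level context switching), here's a small script that runs the same fixed amount of CPU-bound work split across an increasing number of threads. The total work never changes, but the wall-clock time tends to climb as the thread count grows, because more of the time goes to switching and contention rather than the work itself.

    import threading
    import time

    TOTAL_ITERATIONS = 4_000_000  # same total work in every run

    def spin(iterations):
        # Pure CPU-bound busy work.
        x = 0
        for _ in range(iterations):
            x += 1

    def run_with_threads(n_threads):
        # Split the fixed total work across n_threads threads and time it.
        per_thread = TOTAL_ITERATIONS // n_threads
        threads = [threading.Thread(target=spin, args=(per_thread,))
                   for _ in range(n_threads)]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.perf_counter() - start

    if __name__ == "__main__":
        for n in (1, 4, 16, 64, 256):
            elapsed = run_with_threads(n)
            print(f"{n:4d} threads: {elapsed:.3f}s for the same total work")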

To reduce thrashing, you generally want to do a few things (a rough sketch of a couple of these follows the list):

  • Reduce the amount of pending parallel/interleaved work items on the system
  • Allocate work items with more locality if possible (to prevent thrashing relative to one processing unit, for example)
  • Try to allow more discrete work items to complete (eg: running them longer without switching), to reduce the context switching overhead
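
Here's a minimal sketch of the first and third points (the function names, pool size, and batch size are my own placeholders, not a recommendation): cap the number of work items in flight with a small worker pool, and batch tiny tasks into larger chunks so each discrete unit runs to completion before the worker moves on.

    from concurrent.futures import ThreadPoolExecutor

    def handle(item):
        # Placeholder for the real per-item work.
        return item * item

    def handle_batch(batch):
        # A larger discrete unit: finish a whole chunk before yielding the worker.
        return [handle(item) for item in batch]

    def process(items, max_workers=4, batch_size=100):
        batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
        results = []
        # Bounded pool: at most max_workers batches are in flight at any time.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for batch_result in pool.map(handle_batch, batches):
                results.extend(batch_result)
        return results

    if __name__ == "__main__":
        print(len(process(list(range(10_000)))))

The same idea carries over to the organizational point below: fewer things in flight per person, and letting a task run to completion before picking up the next one.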

Now, while all of the above is well-known in the industry, I'd like to suggest something related, but which is perhaps not as well appreciated: the same problems can and do occur with respect to people within an organization and process.

People, as it turns out, are also susceptible to some amount of overhead when working on multiple things and task switching between them. Moreover, unlike computers, there is also some overhead for work items which are "in flight" for people (where they need to consider and/or refresh those items just to maintain the status quo). The more tasks someone is working on, and the more long-lived work items are in flight at any given time, the more overhead exists for that person to manage those items.

In "simple" jobs, this is kept minimal on purpose: a rote worker might have a single assigned task, or a checklist, so they can focus on making optimal progress on the singular task, with the minimal amount of overhead. In more complex organizations, there are usually efforts to compartmentalize and specialize work, such that individual people do not need to balance more than an "acceptable" number of tasks and responsibilities, and to minimize thrashing. However, notably, there are some anti-patterns, specific to development, which can exacerbate this issue.

Some notable examples of things which can contribute to "thrashing", from a dev perspective:

  • Initiatives which take a long time to complete, especially where other things are happening in parallel
  • Excessive process around code changes, where the code change can "linger" in the process for a while
  • Long-lived branches, where code changes need to be updated and refreshed over time
  • Slow pull-request approval times (since each outstanding pull-request is another in-progress work item, which requires overhead for context switching)
  • Excessive "background" organizational tasks (eg: email management, corporate overhead, Slack threads, managing-up tasks, reporting overhead, side-initiatives, etc.)

Note that there is a human cost to thrashing as well: people want both to be productive and to see their work have positive impacts, and thrashing hurts both of these. As a manager, you should be tracking the amount of overhead and "thrashing" your reports are experiencing, and doing what you can to minimize it. As a developer, you should be wary of processes (and potentially organizations) where there are systems in place (or proposed) which add to the amount of thrashing likely to happen while working on tasks, because this has a non-trivial cost, and the potential to "hockey stick" the graph of time wasted dealing with overhead.

In short: thrashing is bad, and it's not just an issue which affects computer systems. Not paying attention to this within an org can have very bad consequences.