Friday, November 1, 2024

Documenting a software implementation

Allow me to posit a hypothetical. Imagine you are at a company, and someone (perhaps you) has just completed a project to implement some complex feature, and you have been tasked with ensuring the implementation is documented, such that other/future developers can understand how the implementation works. For the sake of argument, we'll assume the intent and high level design have already been documented, and your task is to capture the specifics of the actual implementation (for example, ways in which it diverged from the original design, idiosyncrasies of the implementation, corner cases, etc.). We'll also assume you have free rein to select the best tooling, format, storage, etc. for the documentation, with the expectation that all of these are considered in your work product.

Note: This is, in my experience, a not uncommon request from management, especially in larger companies, so it seems like a reasonable topic for general consideration.

Let's see how it plays out, looking at the various design aspects under consideration, and what the best selections for each might be.

Considerations:

Documentation locality

One aspect which is certainly worth considering is locality: that is, where the documentation will live. Documentation in an external location can be both hard to locate (particularly for developers who are less familiar with the org's practices), and hard to keep in sync with the code (because it's easy for the code to be updated, and the documentation to be neglected). In concept, the documentation should be as "close" to the code as possible. An oft-quoted downside of putting documentation within a source code repository is that it cannot be as easily edited by non-developers, but in this case that will not be a concern, since presumably only developers will have direct knowledge of the implementation anyway. So in concept, the best place for this documentation is in the code repository, and as close to the implementation code as possible, to minimize the chances of one being updated and not the other.

Language and dialect

This might seem like a trivial consideration, particularly if you've only worked within smaller orgs with relatively homogeneous cultures and dev backgrounds, but I would suggest that it is not. Consider:

  • Not all the developers may speak the same native language(s), and nuance may be lost when reading non-primary languages
  • Some developers (or managers) may object to casual nomenclature for business products, but conversely not all developers may want to read, or be capable of writing, business professional text
  • There's also the question of style; for example, writing in textual paragraphs, vs writing in terse bullet points and such

In the abstract, the choice of language and dialect should be such that:

  • All developers can read and understand the nuances expressed in the documentation
  • The language used does not create undue friction by being either too "business professional" or not "business professional" enough
  • The writing style should be able to express the flow and semantics of the code in a comprehensible manner, while allowing for the various special cases
    • For example, there should be a manner in which to express special-case notes on specific areas of the implementation, like footnotes or annotations
    • There should also be a way to capture corner cases, and perhaps which cases are expected to work, and which are not

Sync with implementation

This was alluded to in the locality point, but it's important that the documentation stay in sync with the implementation, to the maximum extent possible. If the documentation is out of sync, then it is not only worthless for understanding that piece of the code, but perhaps even a net negative, as a developer trying to understand the implementation from the documentation might be misled, and waste time due to bad assumptions based on the documentation. So in addition to locality (ie: docs near the code), we want to ensure that it is as easy as possible for developers to update the documentation at the same time they make any code changes, so that they will be able and inclined to do so.

Expedient vs comprehensive

It would be a bit remiss to not also mention the trade-off in the initial production of the documentation, between being expedient and being comprehensive, and additionally how much the above trade-offs might impact the speed at which the documentation could be produced. Every real-world org is constrained by available resources and time, and presumably you will have some time limit for this project as well. So the quicker you can produce documentation, and the more comprehensive it is, the better your performance on this task will be.

So, what to do?

Admittedly, those readers who have thought about or performed this task already probably have some good ideas at this point, and perhaps the more intelligent readers have already figured out where this is going, just based on the objective analysis above. To recap the considerations, though:

  • We need something in a form which we can produce quickly, but which is also as comprehensive a description of the implementation as possible
  • The language used for the documentation must be readable by all the developers who are familiar enough with the code to work on it, regardless of their native language(s)
    • The documentation must be unquestionably work appropriate (no swear words, slang, obscure references, etc.), but also terse enough to provide value without being excessively verbose
    • There must be some mechanism in the structure to provide footnotes for implementation choices, corner cases, tested inputs, etc.
  • The documentation should be as close to the code as possible, such that it's easy to find, and there is a minimal risk of it getting out of sync with the actual implementation over time
  • It must impose as little overhead as is reasonable to update the documentation along with changes to the implementation over time
    • Note: This is often the hardest thing to get "right" with docs in general, since the value add for future readers must be greater than both the initial production time, and the maintenance time, for documentation to be a net positive value at all

Now, the above might seem like a tall order with lots of hard-to-answer questions, but let me point out something which might make these decisions a bit easier. A programming language, such as it is, is fundamentally just a way to describe what you want the computer to do in a human readable form. If we hypothetically selected the same language as the implementation for the documentation, it would be:

  • Able to be produced reasonably quickly
  • Readable to all developers who would be familiar with the implementation code
    • Unquestionably work appropriate
    • Able to provide footnotes (via comments, or ancillary code such as unit tests)
  • Very close to the code (could be in the same files, in fact, right next to the implementation)
  • ... but it would still have some non-trivial overhead to keep in sync with the actual production code

But wait... we can solve that last problem fairly trivially, by simply skipping the copying or translation of the code into nominal documentation form, and just relying on the code itself! Now we have gained:

  • Instant production (once the implementation is done, the documentation is also implicitly done)
  • Zero overhead to keep it in sync with the actual implementation (since they are the same)

"But hold on", you might object, "what if the code is incomprehensible?" That is a valid question in the abstract, but I would counter with two observations:

  • If the code is incomprehensible, and you could write more comprehensible documentation (ie: the complexity is in the overhead of the implementation, not inherent to the problem space), then you can instead fix the code to make it more comprehensible (see the sketch below)
  • If the problem space is inherently complex, then side-by-side documentation will not be any less complex, and the code itself is often just as easy for a developer to read and understand as any other form of documentation
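
As a trivial illustration of the first point, compare a terse implementation with the same logic restructured to be self-documenting (the names and logic here are invented purely for the sake of example):

// Before: correct, but the reader has to reverse-engineer the intent.
bool chk(int a, int b) { return a >= 18 && b > 0; }

// After: the same logic, restructured so the code reads as its own documentation.
constexpr int kMinimumEligibleAge = 18;

bool IsEligibleForPurchase(int customerAgeYears, int accountBalanceCents)
{
    const bool isOldEnough = (customerAgeYears >= kMinimumEligibleAge);
    const bool hasFunds = (accountBalanceCents > 0);
    return isOldEnough && hasFunds;
}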

Wait, what did we just conclude?

We just concluded, based on an objective analysis of all the various design considerations, that the best way to document a software implementation is to not do any documentation at all, because every single thing you could do is worse than just allowing the code to be self-documenting. You should improve the structure of the code as applicable and possible, and then tell your management that the task is done, and the complete functional documentation is in the repository, ready to be consumed by any and all future developers. Then maybe find some productive work to do.

Note: Selling this to managers, particularly bad ones, might be the hardest part here, so I'm knowingly being slightly flippant. However, I do think the conclusion above is correct in general: wherever possible within an org, code should be self-documenting, and any other form of documentation for an implementation is strictly worse than this approach.

PS: I'm aware that some people who read this post probably have already internalized this, as this is fairly common knowledge in the industry, but hopefully it was at least a somewhat entertaining post if you made it this far and already were well-aware of what the "right" answer here was. For everyone else, hopefully this was informative. :)

Wednesday, October 16, 2024

Just say "no" to code freezes

One of the more insightful conclusions I've reached in my career, if perhaps also one of the more controversial opinions, is that you should always say "no" to code freezes (at least in an optimal development methodology). This isn't always possible, of course; depending on where you are in influence and decision making, you may have only some, or effectively no, input into this decision. However, to the extent that you are prompted with this question and have some input, my advice is to always push back, and I'll elaborate on this below.

The case for code freezes

I've heard a number of different justifications for why people want code freezes, and these desires come in a few forms. Some examples:

  • We need a code freeze
  • We need to defer some code changes for a bit, just until...
  • We need to hold off on this going in because...
  • We have a release schedule where no changes go in during this period
  • etc.

Typically, the justification for the ask is achieving some desired amount of code stability at the expense of velocity, while some "critical" process goes on. The most common case for this is QA for a release, but there are also cases where critical people might be out, before holidays where support might be lacking, etc. In my experience, these asks also almost always come via management, not development, under the pretense that the operational change is necessary to coordinate with other teams and such.

Note that this is, de facto, antithetical to Agile; if you're practicing Agile software development, you're not doing the above, and conversely if you're doing the above, you're not doing Agile. I mention this as an aside, because this is one area where teams and orgs fail at Agile quite regularly.

The reasons this is bad

Any time you're implementing a code freeze, you are impacting velocity. You are also increasing the risk of code conflicts, discouraging continuous improvement of code, and likely increasing overhead (eg: resolving merge conflicts in ongoing work). Furthermore, this can create a strong incentive to circumvent normal workflows and methodologies, by introducing side-band processes for changes during a "code freeze", which can be even worse (eg: "we need to push this change now, we can't follow the normal QA methodology, because we're in a code freeze").

Side anecdote: in a previous company, the manager insisted on a three month code freeze before each release. During this time, QA was "testing", but since QA overlapped with sales support, this was also where all the sales support enhancement requests were injected into the dev queues, as "critical fixes for this release". In essence, the code freeze would have allowed this part of the business to entirely hijack and bypass the normal PM-driven prioritization mechanism for enhancements, and divert the entire dev efforts for their own whims, if not for some push back from dev on the freeze itself (see suggestions below).

Note that this ask is very common; in particular, short-sighted managers (and/or people with higher priority localized goals than overall business success) ask for these fairly frequently, in my experience. It's often the knee-jerk reaction to wanting more code stability, from those who have a myopic view of the overall cost/benefit analysis for process methodologies.

Alternatives and suggestions

To the extent that you have influence on your process, I'd suggest one of the following alternatives when a code freeze is suggested.

Just say "no"

The most preferable outcome is to be able to convince management that this is not necessary, and continue development as normal. This is unlikely in most cases, but I'll list it anyway, because it is still the best option generally, where possible. In this case, I'd suggest emphasizing that better code stability is best achieved by incremental improvements and quick turnaround fixes, as well as better continuous integration testing, and not code divergence. This argument is often not convincing, though, particularly to people with less overall development experience, and/or higher priority myopic goals. It may also not be feasible, given overall org requirements.

Create a branch for the "freeze"

The easiest and cleanest workaround, generally, is to appease the ask by creating a one-off branch for the freeze, and allowing testing (or whatever else) to be done on the branch, while normal development continues on the mainline. This is the closest to the Agile methodology, and can allow the branch to become a release branch as necessary. Note that this approach can often require ancillary process updates; for example, pipelines which are implicitly associated with the mainline may need to be adjusted to the branch. But generally, this approach is the most preferable when a freeze is deemed necessary.

Note that the typical drawback/complication with this approach is that developers will frequently be asked to make some changes in parallel in this scenario (ie: on the mainline and the freeze branch). In this case, I suggest mandating that changes happen on the mainline first, then are ported on-demand to the branch. Ideally, this porting would be done by a group with limited resources (to discourage demands for numerous changes to be ported to a "frozen" branch). For extended QA testing, if many changes are being asked for during the "freeze", re-branching from the mainline is generally preferable to porting changes extensively.

Create a parallel mainline

This is functionally identical to creating a branch for the "frozen" code, but can be more palatable for management, and/or more compatible with ancillary processes. In essence, in this scenario, dev would create a "mainline_vNext" (or equivalent) when a code freeze for the mainline is mandated, and shift active development to this branch. When the code freeze is lifted, this would then become the mainline again (via branch rename or large merge, whichever is easier).

This approach, as with the above, also induces overhead of parallel development and merging changes across branches. But it satisfies the typical ask of "no active development on the mainline".

Exceptions, or when a real freeze might be necessary

I haven't seen many examples of this, but the one example I have seen is where a company has a truly real-time CI/CD pipeline, where any change flows directly to production if all tests pass, there is no mechanism to freeze just the production output, and disruptions to production operations would be catastrophic. In this specific scenario, it might be net positive to have a short code freeze during this period, if the risks cannot be mitigated any other way. In that case, the cost to the org (in productivity) might be justified by the risk analysis, and as long as the time period and process are carefully controlled, this seems like a reasonable trade-off.

The ideal

Included just for reference: what I would do if I were dictating this process within a larger org.

  • Allow branches for QA testing and/or stability wants
  • Limit resources for merging into branches (QA or historical)
    • Ideally, have a separate support dev team for all historical branches
    • Encourage re-branching from the mainline if the merge asks exceed the allocated resources
  • Configure pipelines to be flexible (ie: allow release candidate testing on branches, production deployment from release branch, etc.)
  • Mandate no code freeze ever on the mainline (always incremental changes)
    • Solve any asks for this via alternative methods
  • Encourage regular and ancillary integration testing on the mainline (ie: dogfooding)

Anyway, those are my [current] opinions on the matter, fwiw.


Wednesday, October 9, 2024

Real talk about career trajectories

Almost every time I scroll through LinkedIn, I run into one or more posts with some variation of the following:

  • You're wasting your life working for someone else, start your own business...
  • Invest in yourself, not traditional investments, so you can be super successful...
  • [literally today] Don't save for retirement, spend that money on household help so you can spend your time learning new skills...
  • etc.

These all follow the same general themes: investing in yourself allows you to start your own business, which allows you to get wealthy, and if you're working a 9-5 and doing the normal, recommended, financially responsible things, you're doing it wrong and will always be poor. I'm going to lay out my opinion on this advice, relate it to the career/life path I'm on, and share what I would personally recommend.

Let's talk risk

The main factor that most people gloss over, when recommending this approach to a career, is the element of risk. When last I read up on this, at least in the tech sector, roughly 1/20 startups get to an initial (non-seed) funding round, and roughly 1/20 of those get to the point where founders can "cash out" (ie: sold, public, profitable, etc.). That's a huge number of dead bodies along the way, and the success stories in the world are the outliers. When you hear about anyone's success with investing in themselves and becoming wealthy, there's a very heavy selection bias there.

This is where personal wealth comes into play (eg: family wealth, personal resources, etc.), and why people who succeed at starting businesses usually come from money. As someone without financial resources, not working a steady job is a large financial risk: you usually won't make much money, and if your endeavor doesn't pan out, you might be left homeless (for example). People who already have resources can take that risk, repeatedly, and still be fine if it doesn't work out; normal people cannot. Starting a business is expensive, often in outlay costs in addition to opportunity costs. Personal resources mitigate that risk.

There's also an element of luck in starting a business: even if you have a good idea, good skills, good execution, etc., some of the chance of overall success will still come down to luck. This is something where you can do everything "right" and still fail. This risk is random, and cannot really be mitigated.

The other risk factor in terms of owning and running a business comes later, if/when the business becomes viable. As a business owner, your fortunes go up or down based on the value of the business, and often you'll be personally liable for business debts while the company is smaller. In contrast, an employee is hired for a specific wage and/or compensation agreement, and that is broadly dependent on just performing job functions, not how successful that makes the business overall. Moreover, employees can generally ply their skills at any business, so if the business becomes a bad place to work, they have mobility, whereas the business owner does not. Again, this is a risk factor.

But you still want to be wealthy...

Okay, so here's my advice to put yourself in the best position to be wealthy:

  1. Get really good at networking
  2. Buy rental property

#1 is the most important factor in overall wealth potential, in my opinion. In addition to being the primary factor in many types of jobs (eg: sales, RE agent), networking will get you opportunities. Also, being good at networking will make you good at talking to people, which is the primary job skill for managers. Since managers are typically compensated better than skilled workers, and are almost always the people who get elevated to C-level roles, this is the optimal career path in general for becoming wealthy, even if you don't start your own business.

#2 is the best possible investment type, at least in the US, because the government has made hoarding property the single most tax-advantaged investment possible. It is absolutely awful social policy in general, as it concentrates wealth among the very rich, creates durable wealth inequality, and makes housing unaffordable for much of the middle class. The tax policy is the absolute pinnacle of corruption and/or stupidity in US government policy... but rental property is unequivocally the best investment class as a result of that policy, and unless you're in a position to compromise your own wealth for idealism, this is what you should invest in.

Parting thoughts

Most people will be employees (vs self-employed or business owners). If you are in a position and of the mindset to start a business, and are comfortable with the risk, that is a path you can choose. If that's not for you (as it's not for most people), but you still want the best chance of being reasonably wealthy, get really good at talking to people and maintaining social connections, and go into management: it's significantly easier than skilled or manual labor, and happens to pay significantly better also. But at the end of the day, keep in mind that you don't have to luck into the 0.001% who are in a position to start a successful business (and actually pull it off) in order to make enough money to have a comfortable life, and there is more to life than just making lots of money.


Saturday, October 5, 2024

The value of knowing the value

Something I've been musing about recently: there's a pretty high value, for managers, in knowing the value of the people within a business. I'm kinda curious how many companies (and managers within said companies) actually know the value of their reports, beyond just the superficial stuff (like job level).

Motivating experience: people leave companies. Depending on how open the company is with departures, motivations, internal communications, etc., this may be somewhat sudden and/or unnoticed for some time. Sometimes, the people that leave have historical specific knowledge which is not replicated anywhere else in the company, or general area expertise which cannot easily be replaced, or are fulfilling roles which would otherwise be more costly to the organization if that work was not done, etc. Sometimes these departures can have significant costs for a company, beyond just the lost nominal productivity.

Note that departures are entirely normal: people move on, seek other opportunities, get attractive offers, etc. As a company, you cannot generally prevent people from leaving, and indeed many companies implicitly force groups out the door sometimes (see, for example, Amazon's RTO mandates and managing people out via underhanded internal processes). However, in concept, a company would usually want to offer to pay what a person is worth to the company to retain them, and would not want someone to walk away whom they would have been willing to pay enough to stay. That's where knowing value comes into play: in order to do that calculation, you need the data, inclusive of all the little things which do not necessarily make it into a high level role description.

Not having been a manager (in any sort of serious sense), I don't really have a perspective on how well this is generally understood within companies, but my anecdotal experience would suggest that it is generally not tracked very well. Granted, everyone is replaceable in some sense, and companies do not want to be in a position where they feel extorted, but the converse is that effective and optimal people management means "staying ahead" of people's value in terms of proactive rewards and incentives. I'd imagine that even a company which treats its employees as cattle would be displeased if someone they would have been happy to retain at a higher rate walked, just because their management didn't accurately perceive their aggregate value to the organization.

All of this is to say: there's a high value in an organization having an accurate sense of the value that its people provide, so that it can optimally manage its business relationship with those people. If people leave your company, and you figure out later that they had more value than you were accounting for, that's a failure of management within the organization, and it might be a costly one.

Addendum: Lest anyone think this is in relation to myself and my current position, it is not. However, I have been valued both more and less than I perceived my own value to be at various points in my career, so I know there can be a disconnect there, and I've seen organizations lose valuable expertise and not realize it until the people were gone. I would surmise that this might be more of a general blind spot than many companies realize.

Sunday, September 8, 2024

Zero cost abstractions are cool

There is a language design principle in C++ that you should only pay for what you use; that is, that the language should not be doing "extra work" which is not needed for what you are trying to do. Often this is extrapolated by developers to imply building "simple" code, and only using relatively primitive data structures, to avoid possible runtime penalties from using more advanced structures and methodologies. However, in many cases, you can get a lot of benefits by using so-called "zero cost abstractions", which are designed to be "free" at runtime (they are not entirely zero cost; I'll cover the nuance in an addendum). These are cool, imho, and I'm going to give an example of why I think so.

Consider a very simple function result paradigm, somewhat ubiquitous in code from less experienced developers: returning a boolean to indicate success or failure:

bool DoSomething(); 

This works, obviously, and fairly unambiguously represents the logical result of the operation (by convention, technically). However, it also has some limitations: what if I want to represent more nuanced results, for example, or pass back some indication of why the function failed? These are often trade-offs made for the ubiquity of a standard result type.

Result values are passed back by way of a register (per the standard calling conventions), and the function/result paradigm is ubiquitous enough that this is well supported by all modern processors. Register sizes depend on the architecture, but are at least 32 bits for any modern processor (and 64 bits or more for almost all new processors). So when passing back a result code, passing 64 bits of data costs the same at runtime as passing one bit, as in the above example. We can therefore rewrite our result paradigm as follows, without inducing any additional runtime overhead:

uint64_t DoSomething();

Now we have an issue, though: we have broken the ubiquity of the original example. Does zero now mean success (as would be more common in C++ for integer result types)? What do other values mean? Does each function define this differently (eg: a custom enum of potential result values per function)? While we have added flexibility, we have impacted ubiquity, and potentially introduced complexity which might negate any other value gained. This is clearly not an unambiguous win.

However, we can do better. We can, for example, not use a numeric type, but instead define a class type which encapsulates the numeric type (and is the same size, so it can still be passed via a single register). Eg:

#include <cstdint>

class Result
{
    int64_t m_nValue;  // negative values indicate failure

public:
    explicit Result(int64_t nValue) : m_nValue(nValue) {}
    bool isSuccess() const { return m_nValue >= 0; }
    bool isFailure() const { return m_nValue < 0; }
};

Now we can restore ubiquity in usage: callers can use ".isSuccess()" and/or ".isFailure()" to determine whether the result was success or failure, without needing to know the implementation details. Even better, this also removes the lingering ambiguity of the first example, as we now have methods which clearly spell out intent in readable language. Also, importantly, this has zero runtime overhead: an optimizing compiler will inline these methods so they are assembly-equivalent to manual checks.

Result DoSomething();

//...
auto result = DoSomething();
if (result.isFailure())
{
    return result;
}

This paradigm can be extended as well, of course. Now that we have a well-defined type for result values, we could (for example) define some of the bits as holding an indicative value for why an operation failed, and then add inline methods to extract and return those codes. For example, one common paradigm from Microsoft uses the lower 16 bits to encapsulate the Win32 error code, while the high bits carry information about the error disposition and the component area which generated the error. This can be used to express nuance in success values as well; for example, an operation which "succeeded", but which had no effect, because preconditions were not satisfied.
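
As an illustrative sketch (the bit layout here is made up, loosely inspired by, but not identical to, Microsoft's HRESULT), the Result class could be extended with inline accessors that unpack those fields, still with zero runtime overhead:

#include <cstdint>

// Hypothetical layout: sign bit = failure, bits 16-27 = component area,
// low 16 bits = specific error code.
class Result
{
    int64_t m_nValue;

public:
    explicit Result(int64_t nValue) : m_nValue(nValue) {}
    bool isSuccess() const { return m_nValue >= 0; }
    bool isFailure() const { return m_nValue < 0; }
    uint16_t errorCode() const { return static_cast<uint16_t>(m_nValue & 0xFFFF); }
    uint16_t componentArea() const { return static_cast<uint16_t>((m_nValue >> 16) & 0x0FFF); }
};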

Moreover, if used fairly ubiquitously, this can be used to easily propagate unexpected error results up a call stack as well, as suggested above. One could, if inclined, add macro-based handling to establish a standard paradigm of checking for and propagating unknown errors, and with the addition of logging in the macro, the code could also generate a call stack in those cases. That compares fairly favorably to typical exception usage, for example, both in utility and runtime overhead.
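
A minimal sketch of what that might look like, assuming the Result class and DoSomething() declaration from above (the macro name and logging are hypothetical, not from any particular codebase):

#include <cstdio>

// Evaluate an expression which returns a Result; if it failed, log the source
// location (building a poor-man's call stack as the failure propagates) and
// return the failure to our caller.
#define RETURN_IF_FAILED(expr)                                              \
    do                                                                      \
    {                                                                       \
        Result _result = (expr);                                            \
        if (_result.isFailure())                                            \
        {                                                                   \
            std::fprintf(stderr, "Failure at %s:%d\n", __FILE__, __LINE__); \
            return _result;                                                 \
        }                                                                   \
    } while (false)

Result DoSomethingElse()
{
    RETURN_IF_FAILED(DoSomething());  // propagate unexpected failures upward
    return Result(0);                 // success
}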

So, in summary, zero cost abstractions are cool, and hopefully the above has convinced you to consider that standard result paradigms are cool too. I am a fan of both, personally.

Addendum: zero cost at runtime

There is an important qualification to add here: "zero cost" applies to runtime, but not necessarily to compile time. Adding structures implies some amount of compilation overhead, and with some paradigms this can be non-trivial (eg: heavy template usage). While the standard result paradigm above is basically free and highly recommended, it's always important to also consider the compile-time overhead, particularly when using templates which may have a large number of instantiations, since that cost is small but non-zero. The more you know.



Wednesday, July 17, 2024

Why innovation is rare in big companies

I have worked for small and large tech companies, and as I was sitting through the latest training course this week on the importance of obtaining patents for business purposes (a distasteful but necessary thing in the modern litigious world), I was reflecting on how much more innovative smaller companies tend to be. This is the general perception as well, of course, but it's the realities of how companies operate which motivate this outcome, and this is fairly easy to see when you've worked in both environments. So, I'm going to write about this a bit.

As a bit of background, I have two patents to my name, both from when I worked in a small company. Both are marginally interesting (more so than the standard big company patents, anyway), and both are based on work I did in the course of my job. I didn't strive to get either; the company just obtained them after the fact for business value. I have no idea if either has ever been tested or leveraged, aside from asset value on a spreadsheet.

Let's reflect a bit on what it takes to do projects/code at a typical big company. First, you need the project to be on the roadmap, which usually requires some specification, some amount of meetings, convincing one or more managers that the project has tangible business value, constraining the scope to only provable value, and getting it into the process system. Then you have the actual process, which may consist of ticket tracking, documents, narrowing down scope to reduce delivery risk, getting various approvals, doing dev work (which may be delegated depending on who has time), getting a PR done, getting various PR approvals (which usually strip out anything which isn't essential to the narrow approved scope), and getting code merged. Then there is usually some amount of post-coding work, like customer docs, managing merging into releases, support work, etc. That's the typical process in large companies, and is very reflective of my current work environment, as an example.

In this environment, it's very uncommon for developers to "paint outside the lines", so to speak. To survive, mentally and practically, you need to adapt to being a cog who is given menial work to accomplish within a very heavyweight, process-oriented system, and who is strongly discouraged from trying to rock the boat. Work gets handed down, usually defined by PMs and approved by managers in planning meetings, and anything you do outside of that scope is basically treated as waste by the organization, to be trimmed by all the various processes and barriers along the way. This is the way big companies operate, and given enough resources and inherent inefficiencies, it works reasonably well to maintain and gradually evolve products in entirely safe ways.

It is not surprising that this environment produces little to no real innovation. How could it? It's actively discouraged by every process step and impediment which is there by design.

Now let's consider a small company, where there are typically limited resources, and a strong driving force to build something which has differentiated value. In this environment, developers (and particularly more senior developers) are trusted to build whatever they think has the most value, often by necessity. Some people will struggle a lot in this environment (particularly those who are very uncomfortable being self-directed); others will waste some effort, but good and proactive developers will produce a lot of stuff, in a plethora of directions. They will also explore, optimize, experiment with different approaches which may not pan out, etc., all virtually unconstrained by process and overhead. In this environment, it's not uncommon to see 10x productivity from each person, with only half of that work actually ending up being used in production (compared to the carefully controlled work product in larger companies).

But, small companies get some side-benefits from that environment also, in addition to the overall increases in per-person effective productivity. Because developers are experimenting and less constrained by processes and other people's priorities, they will often build things which would never have been conceived of within meetings among managers and PM's. Often these are "hidden" things (code optimizations, refactoring to reduce maintenance costs, process optimizations for developer workflows, "fun" feature adds, etc.), but sometimes they are "interesting" things, of the sort which could be construed as innovations. It is these developments which will typically give rise to actual advances in product areas, and ultimately lead to business value through patents which have meaning in the markets.

Now, I'd be remiss not to mention that a number of companies are aware of this fact, and have done things to try to mitigate these effects. Google's famous "20% time", for example, was almost certainly an attempt to address this head-on, by creating an internal environment where innovation was still possible even as the company grew (note: they eventually got too large to sustain this in the face of profit desires from the market). Some companies use hackathons for this, some have specific groups or positions which are explicitly given this trust and freedom, etc. But by and large, they are all just trying to replicate what tends to happen organically at smaller companies, which do not have the pressure or resources to build all the systems and overhead that get in the way of their own would-be success.

Anyway, hopefully that's somewhat insightful as to why most real innovation happens in smaller companies, at least in the software industry.


Friday, July 5, 2024

How not to do Agile

Note: This post is intended to be a tongue-in-cheek take, based on an amalgam of experiences and anecdotes, and is not necessarily representative of any specific organization. That said, if your org does one or more of these things, it might be beneficial to examine if those practices are really beneficial to the org or not.

The following is a collection of things you should not do when trying to do Agile, imho; these practices either run counter to the spirit of the methodology, will likely impede realizing its value, and/or demonstrate a fundamental misunderstanding of the concept(s).

Mandating practices top-down

One of the core precepts of Agile is that it is a "bottom-up" organization system, which is intended to allow developers to tweak the process over time to optimize their own results. Moreover, it is very important in terms of buy-in for developers to feel like the process is serving the needs of development first and foremost. When mandated from outside of development, even an otherwise optimal process might not get enough support over time to be optimally adopted and followed.

It is very often a sign of "Agile in name only" within organizations when this is mandated by management, rather than adopted organically (and/or with buy-in across the development teams). This is one of the clearest signals that an organization either has not bought into Agile, and/or the management has a fundamental misunderstanding of what Agile is.

Making your process the basis of work

One of the tenets of Agile is that process is there in service of the work product, and should never be the focus of efforts. As such, if tasks tend to revolve around process, and/or are dependent on specific process actions, this should be considered a red flag. Moreover, if developers are spending a non-trivial amount of time on process-related tasks, this is another red flag: in Agile, developers should be spending almost all their time doing productive work, not dealing with process overhead.

One sign that this might be the case is if/when workflows are heavily dependent on the specifics of the process and/or tooling, as opposed to the logical steps involved in getting a change done. For example, if a workflow starts with "first, create a ticket...", this is not Agile (at least in spirit, and probably in fact). If the workflow is not expressed in terminology which is process and tooling independent, the org probably isn't doing Agile.

Tediously planning future release schedules

Many organizations with a Waterfall mindset always plan out future releases, inclusive of which changes will be included, what is approved, what is deferred, etc. This (of course) misses the point of Agile entirely, since (as encapsulated in the concept of Agile) you cannot predict the timeline for changes of substance, and this mentality makes you unable to adapt to changing circumstances and/or opportunities (ie: be "agile"). If your organization is planning releases with any more specificity than a general idea of what's planned for the next release, and/or it would be a non-trivial effort to include changes of opportunity in a release, then the org isn't doing Agile.

Gating every software change

The Agile methodology is inherently associated with the concept of Continuous Improvement, and although the two can be separated conceptually, it's hard to imagine an Agile environment which did not also emphasize CI. Consequently, in an Agile environment, small incremental improvements are virtually always encouraged, both explicitly via process ideals, and implicitly via low barriers. Low barriers are, in fact, a hallmark of organizations with high code velocity, and of effectively all highly productive Agile dev teams.

Conversely, if an organization has high barriers in practice to code changes (process wise or otherwise), and/or requires tedious approvals for any changes, this is a fairly obvious failure in terms of being Agile. Moreover, it's probably a sign that the organization is on the decline in general, as projects and teams where this is the prevailing mentality and/or process tend to be fairly slow and stagnant, and usually in "maintenance mode". If this doesn't align with management's expectations for the development for a project, then the management might be poor.

Creating large deferred aggregate changes

One of the precepts of Agile is biasing toward small, incremental changes, which can be integrated and tested early as self-contained units. Obviously, large deferred aggregate changes are the antithesis of this. If your organization has a process which encourages or forces changes to be deferred and/or grow in isolation, you're certainly not doing Agile, and might be creating an excessive amount of wasteful overhead also.

Adding overhead in the name of "perfection"

No software is perfect, but that doesn't stop bad/ignorant managers from deluding themselves with the belief that by adding enough process overhead, they can impose perfection upon a team. Well functioning Agile teams buck this trend through self-organization and control of their own process, where more intelligent developers can veto these counter-productive initiatives from corporate management. If you find that a team is regularly adding more process to try to "eliminate problems", that's not only not Agile, but you're probably dealing with some bad management as well.

Having bad management

As alluded to in a number of the points above, the management for a project/team has a huge impact on the overall effectiveness of the strategies and processes. Often managers are the only individuals who are effectively empowered to change a process, unless that ability is clearly delegated to the team itself (as it would be in a real Agile environment). In addition to this, though, managers shape the overall processes and mindsets, in terms of how they manage teams, what behaviors are rewarded or punished, how proactively and clearly they articulate plans and areas of responsibility, etc. Managers cannot unilaterally make a team and/or process function well, but they can absolutely make a team and/or process function poorly.

Additionally, in most organizations, managers end up being ultimately responsible for the overall success of a project and/or team, particularly when in a decision making role, because they are the only individuals empowered to make (or override) critical decisions. A good manager will understand this responsibility, and work diligently to delegate decisions to the people most capable of making them well, while being proactively vigilant for intrusive productivity killers (such as heavy process and additional overhead). Conversely, a bad manager either makes bad decisions themselves, or effectively abdicates this responsibility through inaction and/or ignorance, and allows bad things to happen within the project/team without acknowledging responsibility for those events. If the project manager doesn't feel that they are personally responsible for the success of the product (perhaps in addition to others who also feel that way), then that manager is probably incompetent in their role, and that project is likely doomed to failure in the long run unless they are replaced.

Take home

Agile is just one methodology for software development; there are others, and anything can work to various degrees. However, if you find yourself in a position where the organization claims to be "agile", but exhibits one or more of the above tendencies, know that you're not really in an organization which is practicing Agile, and their self-delusion might be a point of concern. On the other hand, if you're in a position to influence and/or dictate the development methodology, and you want to do Agile, make sure you're not adding or preserving the above at the same time, lest you be the one propagating the self-delusion. Pick something that works best for you and your team, but make an informed choice, and be aware of the trade-offs.


Monday, June 17, 2024

Status reporting within orgs

Story time:

Back in my first job out of school, when I didn't really have much broad context on software development, I was working for Lockheed Martin as an entry-level developer. One of my weekly tasks was to write a status report email to my manager, in paragraph form, describing all the things I had been doing that week, and the value that they provided for various projects and initiatives. This was something which all the developers in the team needed to do, and my understanding was that it was a fairly common thing in the company (and by assumption at the time, within the broader industry).

At some point, I was asked to also write these in the third person, which was a bit odd, until I realized what was going on with them. My manager was aggregating them into a larger email communication to his manager, in which he was informing his management as to all the value that his team was providing under his "leadership". I don't know to what extent he was taking credit for those accomplishments (explicitly or implicitly), but I do know that he didn't do anything directly productive per se: his entire job was attending meetings and writing reports, as far as I could tell (Lockheed had separate project management and people management, and he was my people manager, so not involved in any projects directly). I rarely spoke to him, aside from sometimes providing on-demand status updates, or getting information on how to navigate corporate processes as necessary.

Furthermore, my understanding was that there was an entire hierarchy of "people managers", who all did only that: composing and forwarding emails with status information, and helping navigate corporate processes (which in many cases, they also created). Their time was also billed to projects which their reports were attached to, as "management overhead".

I raise this point because, as my career progressed, I realized this was not uniform practice in the industry. In fact, practices like Agile explicitly eschew this methodology; Agile strongly promotes developer agency and trust, minimizing process overhead, developer team self-organization, and only reporting blockers as necessary. In large part, Agile is built upon the premise that organizations can eliminate status reporting and still get good outcomes in terms of product development; or, implicitly, the presumption that middle managers receiving and forwarding status reports provide little to no value to the overall organization.

I've always found that presumption interesting, and I think it's very much an open question (ie: value vs cost for management). I've experienced what I would consider to be good managers; these are people who do not add much overhead, but efficiently unblock development efforts, usually through some combination of asymmetric communications, resource acquisition/unblocking, or process elimination (ie: dealing with process overhead, or eliminating it entirely). I've also experienced managers who, in my perception, provided very little value; these are people who frequently ask for status reporting information, mainly just "pass the buck" when blockers arise, and seem unable to reduce or offload any overhead (and/or add more). Those "low value" managers can also add substantial overhead, particularly when they do not understand technical details very well but nevertheless involve themselves in technical discussions, and thus require additional hand-holding and "managing up" efforts to try to prevent them from making bad decisions (which might contravene or undermine better decisions by people under them in the company hierarchy).

So, then, the next open question is: how does an organization differentiate between "good" and "bad" management, as articulated here (if they were so inclined)?

In my mind, I'd start with looking at the amount of status reporting being done (in light of the Agile take on the overall value of such), as a proxy for how much value the management is providing, vs overhead they are creating. Obviously reports could also be surveyed as another indirect measurement for this, although there is inherently more bias there, of course. But generally speaking, ideally, status information should be communicated and available implicitly, via the other channels of value-add which a good manager is providing (eg: by providing general career advice to reports, and mitigating overhead for people, a manager should then be implicitly aware of the work efforts for their reports). If explicit status reporting is necessary and/or solicited, that could be a sign that the value-add being provided by the other activities is not extensive, and be a point of concern for the org.

Anyway, those are some of my thoughts on the topic, and as noted, different orgs have different approaches here. I don't think there's a "right" or "wrong" here, but I do think there are tangible trade-offs with different approaches, which means it is a facet of business where there is potentially room for optimization, depending on the circumstances. That, by itself, makes it an interesting point of consideration for any organization.


Sunday, June 9, 2024

Some thoughts on code reviews

Code reviews (https://about.gitlab.com/topics/version-control/what-is-code-review/) are a fairly standard practice in the industry, especially within larger companies. The process of having multiple developers look at code changes critically is found in several development methodologies (eg: extreme programming, pair programming, etc.), and they are often perceived as essential for maintaining a level of overall code quality. I'd imagine that no respectable engineering leader in any larger organization would accept a process which did not incorporate mandatory code reviews in some form.

So with that intro, here's a bit of a hot/controversial take: I think code reviews are overrated.

Before I dive into why I think this, a quick tangent for an admission: the majority of code I've actually written in my career, personally and professionally, has been done without a formal code review process in place (and thus, not code reviewed). I've also, personally, experienced considerably more contention and strife generated by code reviews than value-add. So I certainly have a perspective here which is biased by my own experience... but I also don't think my experience is that unique, and/or would be that dissimilar to the experience of many other people, based on conversations I've had.

So that being said, why do I not subscribe to the common wisdom here?

I know a manager, who shall remain unnamed here, for whom the solution for every problem (bug, defect, customer issue, design oversight, etc.) is always one of two things: we should have caught that with more code review, or we should have caught that with more unit testing (or both). His take represents the sort of borderline brainless naivety which is all too common among nominally technical managers who have never accomplished anything of significance in their careers, and have managed to leverage their incompetence into high-paying positions where they parrot conventional wisdom and congratulate themselves, while contributing no positive value whatsoever to their employing organizations.

The common perception of this type of manager (and I have known several broadly with this mindset) is that any potential product failure can be solved by more process, and/or more strict adherence to a process. There is not a problem they have ever encountered, in my estimation, for which their would-be answer is not either adding more process, or blaming the issue on the failure of subordinates to follow the process well enough. To these people, if developers just stare at code long enough, it will have no bugs whatsoever, and if code which passes a code review has a bug, it was because the reviewers didn't do a good enough job, and should not have approved it.

Aside: The above might sound absurd to anyone who has spent any time working in the real world, but I assure you that it is not. I know more than one manager whose position is that no code should ever be approved in a code review, and/or be merged into the company's repository, if it has any bugs whatsoever, and that if there are any bugs in the code, the code review approver(s) should be held accountable. I think this is borderline malicious incompetence, but some of these people have failed upward into positions where they have significant power within organizations, and this is absolutely a very real thing.

Back to code reviews, though: I think they are overrated based on a couple perceptions:

  • The most important factor in producing high-value code over time is velocity (of development), hands down
  • Code reviews rarely catch structural design issues (and even when they do, by that time, it's often effectively too late to fix them)
  • Code reviews encourage internal strife from opinionated feedback, which often has little value on overall code quality
  • Code reviews often implicitly bias heavily against people without as many social connections, and/or those who do not "play politics" within the org/teams (and conversely, favor those who do, encouraging that behavior)
  • As per above, code reviews are very often abused by bad managers, which can easily lead to worse overall outcomes for orgs

To be clear and fair, code reviews have some tangible benefits, and I wouldn't necessarily dispose of them altogether, were I running a dev org. In particular, sharing domain knowledge, increasing collaboration, and propagating best practices (particularly when mentoring more junior developers) are tangible benefits of code reviews or equivalent processes. There is a reasonably compelling argument that, with good management in place, and when not used for gating and/or misused for blame attribution, code reviews have sufficient positive value to be a good practice.

However, the risks here are real and substantial, and this is not something which is a clear win in all, or perhaps even most, cases. Code reviews impact velocity, and the positive value proposition must be reasonably high for them to have a net positive value, given that. You're not likely to catch many actual bugs in code reviews, and if your developers use them as a crutch for this (psychologically or otherwise), that's another risk. If you have management which thinks thorough enough code reviews will give you "pristine" code, you're almost certainly better off eliminating them entirely (in concept), in my estimation (assuming you cannot replace those terrible managers). Code reviews are something which can have net positive value when used appropriately... but using them appropriately is something I've personally seen far less often than not.

That's my 2c on the topic, anyway.

Tuesday, April 30, 2024

Development process issues in different types of orgs

There's an interesting dichotomy I've observed between companies where the development is run by developers, and those where it's run by managers.

In the former case, problems tend to get fixed, because the people making decisions are also the people impacted by the problems.

In the latter case, problems tend to compound and go unresolved, because the people making decisions are not directly affected, and/or cannot understand why the situation is problematic, and/or have competing priorities (to fixing them). This creates a situation where you really don't want to be doing development, and the best developers tend to move into management, and/or onto different projects.

The latter is also the case where projects tend to die over time, and sometimes companies with them, as they eventually become unmaintainable.

All of the above might be obvious to others, but I don't have a lot of previous experience working on teams where the development methodologies are all dictated by management (most of my previous roles have been at companies using Agile, or something close to it, and/or at smaller companies). In Agile-ish organizations, developers were motivated and empowered to fix problems which impeded shipping code, and as such, when there were blockers, they tended to be short-lived (sometimes the resolutions were not optimal, but significant issues rarely lingered unresolved). This is not the case in large orgs where the methodologies are dictated by managers, though, particularly when those managers are not actively working on code.

This needs to be added to my list of questions and concerns when considering future roles, I think.


Monday, April 29, 2024

Thoughts on recent Google layoffs, Python support team

Recently, Google laid off an internal group responsible for maintaining Python within the company (story: https://www.hindustantimes.com/business/google-layoffs-sundar-pichai-led-company-fires-entire-python-team-for-cheaper-labour-101714379453603.html). Nominally, this was done to reduce costs; purportedly they will look to hire replacement people for the group in Germany instead, which will be cheaper. This is what the previous team was nominally responsible for: https://news.ycombinator.com/item?id=40176338

Whether or not this saves Google money in the longer term, on balance, is an open question. This doesn't normally work out well (getting rid of tribal knowledge, experience, etc.), but this is something large companies do regularly, so not a huge surprise in general. But this isn't what was most interesting about this to me.

Rather, I'd like to focus on some tidbits in the reporting and personal accounts which reveal some interesting (if true) things about the inner workings at Google at this moment.

Python is deprecated within Google

According to one of the affected developers, Python is considered tech debt within Google, with existing code to be replaced with code in other languages, and new code in Python frowned upon. There are various reasons given for this, but the fact that Google is moving away from Python is interesting, when this doesn't seem to be the case in general in the industry (and Google often is ahead of the curve with technology migrations).

The group was responsible for fixing all impacted code, even other groups'

When the group upgraded Python across Google, they needed to fix all impacted code, even if they didn't own it. This is a pretty huge headwind to updating versions and taking patches, and is reflected in their update planning and schedules. This points to the rather large systemic problem of trying to take regular updates to libraries across a large code base, and either a lack of planning, or a lack of budgeting at the project level(s).

Related to this, the group noted that unit tests would often be flaky, because they were written in a fragile way, and broke with updates to the language version. This is the big systemic problem with having a large code base of unit tests, of course: you need a plan and resource allocation for maintaining them over time, for them to have net positive value. It seems like Google is perhaps lacking in this area as well.
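
To illustrate what "fragile" can mean in this context, here's a hypothetical example (mine, not Google's actual code) of a unit test that breaks on interpreter upgrades: it pins the exact wording of an error message, which is incidental behavior and not part of any stability guarantee, instead of asserting on the contract callers actually rely on.

    # Hypothetical illustration of a version-fragile unit test (not actual Google code).
    import unittest

    def parse_port(value: str) -> int:
        # Trivial function under test.
        return int(value)

    class PortTests(unittest.TestCase):
        def test_bad_port_fragile(self):
            # Fragile: pins the exact interpreter error message, which can change
            # across language versions even though the raised exception type does not.
            with self.assertRaises(ValueError) as ctx:
                parse_port("abc")
            self.assertEqual(str(ctx.exception),
                             "invalid literal for int() with base 10: 'abc'")

        def test_bad_port_robust(self):
            # Robust: asserts only on the behavior callers actually depend on.
            with self.assertRaises(ValueError):
                parse_port("abc")

    if __name__ == "__main__":
        unittest.main()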

Companies often don't value infrastructure teams

This is a general issue, but something to watch out for when charting a career course: lots of companies undervalue and under-appreciate infrastructure teams, who effectively act as force multipliers (in the best cases) for product teams. The larger the org, the less visibility the people working on foundational structures have, and the more common it is for upper management to look at those teams for cost cutting. Getting into foundational work within a larger org is more risky, career-wise: it might be the most effective value-add for the company, but it's also likely to be the least appreciated.

If you want to maximize your career potential, flashy prototype projects supported by mountains of tech debt which you can hand off and move on to the next flashy project will get you the most positive recognition at almost all large companies. A close second-place is the person who fixes high-visibility issues with kludgy "fixes" which seem to work, but ignore underlying or systemic problems (which can be blamed on someone else). It's extremely probable both of those types of developers will be more highly valued than someone who builds and maintains non-flashy but critical support infrastructure.

Managers are generally dumb, don't understand actual impacts

This is somewhat of a corollary to the above, but the person deciding which group to downsize to ensure profits beat expectations and the executives get their multi-million dollar performance bonuses isn't going to have any idea what value people in lower-level groups actually bring, and/or what might be at risk by letting critical tribal knowledge walk out the door. When there need to be cuts (usually for profit margins), the most visible projects (to upper management) will be the ones where people are safest. Don't get complacent and think that just because your project is critical to the company, and/or your value contribution is high, your job is safer. Pay attention to what executives talk about at all-hands meetings: the people building prototype features in those groups are the people most valued by the people making the layoff decisions.

Take home

While Google has declined substantially since its heyday (in terms of prestige and capability), in some ways it is still a bellwether for the industry, so it's good to pay attention to what goes on there. In this case, I think there's good information to be gleaned, beyond just the headline info. It sounds like Google is now more similar than dissimilar to a large tech company on the decline, though.


Friday, April 5, 2024

The problem of "thrashing"

"Thrashing" is a general term/issue in computer science, which refers to the situation (in the abstract) in which multiple "work items" are competing for the same set of resources, and each work item is being processed in chunks (ie: either in parallel, or interleaved), and as a result the resource access ping-pongs between the different work items. This can be very inefficient for the system if switching access to the resources causes overhead. Here's the wiki page on the topic: https://en.wikipedia.org/wiki/Thrashing_(computer_science)

There are numerous examples of thrashing issues in software development, such as virtual memory page faults, cache access, etc. There is also thread context thrashing, where, when you have too many threads competing for CPU time, the overhead of thread context switching alone (each individual switch being cheap, on the order of a microsecond) can still overwhelm the system. When thrashing occurs, it is generally observed as a non-linear increase in latency/processing time relative to the work input (ie: the latency graph "hockey sticks"). At that point, the system is in a particularly bad state (and, ironically, a very common critical problem in orgs is that additional diagnostic processes get triggered to run in that state, based on performance metrics, which can then cause systems to fail entirely).

To reduce thrashing, you generally want to try to do a few things:

  • Reduce the amount of pending parallel/interleaved work items on the system
  • Allocate work items with more locality if possible (to prevent thrashing relative to one processing unit, for example)
  • Try to allow more discrete work items to complete (eg: running them longer without switching), to reduce the context switching overhead
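
To make the "hockey stick" concrete, here's a toy simulation (my own illustration, not a benchmark; every constant is an arbitrary assumption). It round-robins a fixed amount of total work across N in-flight items, but only a limited number of items can keep their working state "warm" at once; once N exceeds that capacity, every switch pays a cold reload cost, and total time jumps non-linearly. It also shows why the first mitigation above (reducing the number of in-flight items) is so effective.

    # Toy model of the thrashing "hockey stick" (illustration only, not a benchmark).
    # A fixed amount of total work is interleaved across N in-flight items, but only
    # CACHE_SLOTS items can keep their working state "warm"; beyond that, each switch
    # pays a cold reload cost. All constants are arbitrary assumptions.

    CACHE_SLOTS = 8      # how many items can stay "warm" at once
    WARM_SWITCH = 0.1    # cost to resume a warm item
    COLD_SWITCH = 5.0    # cost to reload a cold item's working state
    TOTAL_WORK  = 400    # total units of work, split evenly across the items
    TIME_SLICE  = 1      # units of work done per visit to an item

    def total_time(num_items):
        remaining = [TOTAL_WORK / num_items] * num_items
        warm = []        # least-recently-used list of recently visited items
        elapsed = 0.0
        while any(r > 0 for r in remaining):
            for i, left in enumerate(remaining):
                if left <= 0:
                    continue
                # Pay the switch cost: cheap if the item is still warm, expensive if not.
                elapsed += WARM_SWITCH if i in warm else COLD_SWITCH
                if i in warm:
                    warm.remove(i)
                warm.append(i)
                warm[:] = warm[-CACHE_SLOTS:]   # evict least-recently-used items
                step = min(TIME_SLICE, left)
                remaining[i] -= step
                elapsed += step
        return elapsed

    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:>2} in-flight items -> total time {total_time(n):7.1f} (pure work = {TOTAL_WORK})")

With these made-up numbers, total time stays in the mid-400s up through 8 in-flight items, then jumps past 2000 once the "warm" capacity is exceeded; the cliff, not the exact values, is the point.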

Now, while all of the above is well-known in the industry, I'd like to suggest something related, but which is perhaps not as well appreciated: the same problems can and do occur with respect to people within an organization and process.

People, as it turns out, are also susceptible to some amount of overhead when working on multiple things, and task switching between them. Moreover, unlike computers, there is also some overhead for work items which are "in flight" for people (where they need to consider and/or refresh those items just to maintain the status quo). The more tasks someone is working on, and the more long-lived work items are in flight at any given time, the more overhead exists for that person to manage those items.

In "simple" jobs, this is kept minimal on purpose: a rote worker might have a single assigned task, or a checklist, so they can focus on making optimal progress on the singular task, with the minimal amount of overhead. In more complex organizations, there are usually efforts to compartmentalize and specialize work, such that individual people do not need to balance more than an "acceptable" number of tasks and responsibilities, and to minimize thrashing. However, notably, there are some anti-patterns, specific to development, which can exacerbate this issue.

Some notable examples of things which can contribute to "thrashing", from a dev perspective:

  • Initiatives which take a long time to complete, especially where other things are happening in parallel
  • Excessive process around code changes, where the code change can "linger" in the process for a while
  • Long-lived branches, where code changes need to be updated and refreshed over time
  • Slow pull-request approval times (since each outstanding pull-request is another in-progress work item, which requires overhead for context switching)
  • Excessive "background" organizational tasks (eg: email management, corporate overhead, Slack threads, managing-up tasks, reporting overhead, side-initiatives, etc.)

Note, also, that there is a human cost to thrashing as well, as people want to both be productive and see their work have positive impacts, and thrashing hurts both of these. As a manager, you should be tracking the amount of overhead and "thrashing" that your reports are experiencing, and doing what you can to minimize this. As a developer, you should be wary of processes (and potentially organizations) where there are systems in place (or proposed) which contribute to the amount of thrashing which is likely to happen while working on tasks, because this has a non-trivial cost, and the potential to "hockey stick" the graph of time wasted dealing with overhead.

In short: thrashing is bad, and it's not just an issue which affects computer systems. Not paying attention to this within an org can have very bad consequences.


Monday, March 25, 2024

The problem with process

Note: This post might more accurately be titled "one problem with process", but I thought the singular had more impact, so there's a little literary license taken. Also, while this post is somewhat inspired by some of my work experiences, it does not reflect any particular person or company, but rather a hypothetical generalized amalgamation.

There's an adage within management, which states that process is a tool which makes results repeatable. The unpacking of that sentiment is that if you achieve success once, it might be a fluke, dependent on timing or the environment, dependent on specific people, etc., but if you have a process which works, you can repeat it mechanically, and achieve success repeatedly and predictably. This is the mental framework within which managers add process to every facet of business over time, hoping to "automate success".

Sometimes it works, sometimes it doesn't. Often, process is also used to automate against failure, by adding processes which avoid perceived and/or historical breakdown points. This, more often than not, is where there be landmines.

Imagine a hypothetical: you're a manager, grappling with a typical problem of quality and execution efficiency. You want to increase the former, without sacrificing the latter (and ideally, with increasing the latter as well). Quality problems, as you know, come from rushing things into production without enough checks and sign-offs; process can fill that gap easily. But you also know that with enough well-defined process, people become more interchangeable in their work product, and can seamlessly transition between projects, allowing you to optimally allocate resources (in man-months), and increase overall execution efficiency.

So you add process: standard workflows for processing bugs, fields in the tracking system for all the metrics you want to measure, a detailed workflow that captures every state of every work item that is being worked, a formalized review process for every change, sign-offs at multiple levels, etc. You ensure that there is enough information populated in your systems such that any person can take over any issue at any time, and you'll have full visibility into the state of your org's work at all times. Then you measure your metrics, but something is wrong: efficiency hasn't increased (which was expected; it will take time for people to adjust to the new workflows and input all the required data into the systems), but quality hasn't increased either. Clearly something is still amiss.

So you add more process: more stringent and comprehensive testing requirements, automated and manual, at least two developers and one manager reviewing every change which goes into the code repository, formalized test plans which must be submitted and attested to along with change requests, more fields to indicate responsible parties at each stage, more automated static analysis tools, etc. To ensure that the processes are followed, you demand accountability, tying sign-off for various stages to performance metrics for responsible employees. Then you sit back and watch, sure that this new process is sufficient to guarantee positive results.

And yet... still no measurable improvement in overall perceived product quality. Worse, morale is declining: many employees feel stifled by the new requirements (as they should; those employees were probably writing the bugs before), they are spending large amounts of time populating the process data, and it's taking longer to get fixes out. This, in turn, is affecting customer satisfaction; you try to assure them that the increased quality will compensate for the longer lead times, but privately your metrics do not actually support this either. The increased execution efficiency is still fleeting as well: all the data is there to move people between projects seamlessly, but for some reason people still suffer a productivity hit when transitioned.

Clearly what you need is more training and expertise, so you hire a Scrum master, and contract for some Scrum training classes. Unsure where everyone's time is actually going, you insist that people document their work time down to 10 minute intervals, associating each block of time with the applicable ticket, so that time can be tracked and optimized in the metrics. You create tickets for everything: breaks, docs, context switches, the works. You tell your underling managers to scrutinize the time records, and find out where you are losing efficiency, and where you need more process. You scour the metrics, hoping that the next required field will be the one which identifies the elusive missing link between the process and the still lacking quality improvements.

This cycle continues, until something breaks: the people, the company, or the process. Usually it's one of the first two.

In the aftermath, someone asks what happened. Process, metrics, KPI's: these were the panaceas which were supposed to lead to the nirvana of efficient execution and high quality, but paradoxically, the more that were added, the more those goals seemed to suffer. Why?

Aside: If you know the answer, you're probably smarter than almost all managers in most large companies, as the above pattern is what I've seen (to some degree) everywhere. Below I'll give my take, but it is by no means "the answer", just an opinion.

The core problem with the above, imho, is that there is a misunderstanding of what leads to quality and efficiency. Quality, as it turns out, comes from good patterns and practices, not gating and process. Good patterns and practices can come from socializing that information (from people who have the knowledge), but more often than not come from practice, and learned lessons. The quantity of practice and learned lessons come from velocity, which is the missing link above.

Process is overhead: it slows velocity, and decreases your ability to improve. Some process can be good, but only when the value to the implementers exceeds the cost. This is the second major problem in the above hypothetical: adding process for value of the overseers is rarely if ever beneficial. If the people doing the work don't think the process has value to them, then it almost certainly has net negative value to the organization. Overseers are overhead; their value is only realized if they can increase the velocity of the people doing the work, and adding process rarely does this.

Velocity has another benefit too: it also increases perceived quality and efficiency. The former happens because all software has bugs, but what customers perceive is how many bugs escape to production, and how quickly they are fixed. By increasing velocity, you can achieve pattern improvement (aka: continuous improvement) in the code quality itself. This decreases the number of overall issues as a side-effect of the continuous improvement process (both in code, and in culture), with a net benefit which generally exceeds any level of gating, without any related overhead. If you have enough velocity, you can even increase automated test coverage, for "free".

You're also creating an environment of learning and improvement, lower overhead, fewer restrictions, and more drive to build good products among your employees who build things. That tends to increase morale and retention, so when you have an issue, you are more likely to still have the requisite tribal knowledge to quickly address it. This is, of course, a facet of the well-documented problem with treating skill/knowledge workers as interchangeable resource units.

Velocity is the missing link: being quick, with low overhead, and easily pivoting to what is important, without trying to formalize and/or add process to everything. There was even a movement a while ago which captured at least some of these ideals fairly well, I thought: it was called Agile Development. It seems like a forgotten ideal in the environments of KPI's, metrics, and top-heavy process, but it's still around, at least in some corners of the professional world. If only it didn't virtually always get lost with "scale", formalization, and adding "required" process on top of it.

Anyway, all that is a bit of rambling, with which I hope to leave the reader with this: if you find yourself in a position where you have an issue with quality and/or efficiency, and you feel inclined to add more process to improve those outcomes, consider carefully if that will be the likely actual outcome (and as necessary, phone a friend). Your org might thank you eventually.

 

Sunday, March 17, 2024

Some thoughts on budget product development, outsourcing

I've been thinking a bit about the pros and cons of budget/outsourcing product development in general. By this, I mean two things, broadly: either literally outsourcing to another org/group, or conducting development in regions where labor is cheaper than where your main development would be conducted (the latter being, presumably, where your main talent and expertise resides). These are largely equivalent in my mind and experience, so I'm lumping them together for purposes of this topic.

The discussion has been top-of-mind recently, for a few reasons. One of the main "headline" reasons is all the issues that Boeing is having with their airplanes; Last Week Tonight had a good episode about how aggressive cost-cutting efforts have led to the current situation there, where inevitable quality control issues are hurting the company now (see: https://www.youtube.com/watch?v=Q8oCilY4szc). The other side of this same coin, which is perhaps more pertinent to me professionally, is the proliferation of LLM's to generate code (aka: "AI agents"), which many people think will displace traditional more highly-compensated human software developers. I don't know how much of a disruption to the industry this will eventually be, but I do have some thoughts on the trade-offs of employing cheaper labor to an organization's product development.

Generally, companies can "outsource" any aspect of product development, and this has been an accessible practice for some time. This is very common in various industries, especially for so-called "commoditized" components; for example, the automobile industry has an entire sub-industry for producing all the various components which are assembled into automobiles, and usually acquired from the cheapest vendors. This is generally possible for any components which are not bespoke, across any industry with components which are standardized, and can be assembled into larger products.

Note that this is broadly true in the software context as well: vendors sell libraries with functionality, open source libraries are commonly aggregated into products, and component re-use is fairly common in many aspects of development. This can even be a best-practice in many cases, if the component library is considered near the highest quality and most robust implementation of functionality (see: the standard library in C++, for example). Using a robust library which is well-tested across various usage instances can be a very good strategy.

Unfortunately, this is less true in the hardware component industries, since high-quality hardware typically costs more (in materials and production costs), so it's generally less feasible to use the highest quality components from a cost perspective. There is a parallel in first-party product development, where your expected highest quality components will usually cost more (due to the higher costs for the people who produce the highest quality components). Thus, most businesses make trade-offs between quality and costs, and where quality is not a priority, tend to outsource.

The danger arises when companies start to lose track of this trade-off, and/or misunderstand the trade-offs they are making, and/or sacrifice longer-term product viability for short-term gains. Each of these can be problematic for a company, and each are inherent dangers in outsourcing parts of development. I'll expand on each.

Losing track of the trade-offs is when management is aware of the trade-offs when starting to outsource, but over time these become lost in the details and constant pressure to improve profit margins, etc. For example, a company might outsource a quick prototype, then be under market pressure to keep iterating on it, while losing track of (and not accounting for) the inherent tech debt associated with the lower quality component. This can also happen when the people tracking products and components leave, and new people are hired without knowledge of the previous trade-offs. This is dangerous, but generally manageable.

Worse than the above is when management doesn't understand the trade-offs they are making. This is obviously indicative of poor and incompetent management, yet time and time again companies outsource components without properly accounting for the higher long-term costs of maintaining and enhancing those components, and suffer as a result. Boeing falls into this category: by all accounts, their management thought they could save costs and increase profits by outsourcing component production, without accounting for the increased costs of integration and QA (which would normally imply higher overall costs for any shipping and/or supported product). That's almost always just egregious incompetence on the part of the company's management, of course.

The last point is also on display at Boeing: sacrificing long-term viability for short-term gains. While it's unlikely this was the motivation in Boeing's case, it's certainly a common MO with private equity company ownership (for example) to squeeze out as much money as possible in the short term, while leaving the next owners "holding the bag" for tech debt and such from those actions. Again, this is not inherently bad, not every company does this, etc.; this is just one way companies can get into trouble, by using cheaper labor for their product development.

This brings me, in a roundabout way, to the topic of using LLM's to generate code, and "outsourcing" software product development to these agents. I think, in the short term, this will pose a substantial risk to the industry in general: just as executives in large companies fell in love with offshoring software development in the early 2000's, I think many of the same executives will look to reduce costs by outsourcing their expensive software development to LLM's as well. This will inevitably have the same outcomes over the long run: companies which do this, and do not properly account for the costs and trade-offs (as per above), will suffer, and some may fail as a result (it's unlikely blame will be properly assigned in these cases, but when companies fail, it's almost always due to bad executive management decisions).

That said, there's certainly also a place for LLM code generation in a workflow. Generally, any task which you would trust to an intern, for example, could probably be completed by an LLM, with the same quality of results. There are some advantages to using interns (eg: training someone who might get better, lateral thinking, the ability to ask clarifying questions, etc.), but LLM's may be more cost effective. However, if companies largely stop doing on-the-job training at scale, this could pose some challenges for the industry longer-term, and ultimately drive costs higher. Keep in mind: generally, LLM's are only as "good" as the sum total of average information online (aka: the training data), and this will decline over time as LLM output pollutes the training data set.

One could argue that outsourcing is almost always bad (in the above context), but I don't think that's accurate. In particular, outsourcing, and the pursuit of short-term profits over quality, does serve at least two valuable purposes in the broader industry: it helps new companies get to market with prototypes quickly (even if these ultimately need to be replaced with quality alternatives), and it helps older top-heavy companies die out, so they can be replaced by newer companies with better products, as their fundamentally stupid executives make dumb decisions in the name of chasing profit margins (falling into one or more of the traps detailed above). These are both necessary market factors, which help industries evolve and improve over time.

So the next time some executive talks about outsourcing some aspect of product development, either to somewhere with cheaper labor or to an LLM (for example), you can take some solace in the fact that they are probably helping contribute to the corporate circle of life (through self-inflicted harm), and that for each stupid executive making stupid decisions, there's probably another entrepreneur at a smaller company who better understands the trade-offs of cheaper labor, is looking to make the larger company obsolete, and will be looking for quality product development. I don't think that overall need is going to vanish any time soon, even if the various players shuffle around.

My 2c, anyway.

Monday, February 19, 2024

Mobile devices and security

Generally, passwords are a better form of security than biometrics. There are a few well-known reasons for this: passwords can be changed, are harder to clandestinely observe or fake, and cannot be taken from someone unwillingly (eg: via government force, although one could quibble about extortion as a viable mechanism for such). A good password, used for access to a well-designed secure system, is probably the best known single factor for secure access in the world at present (with multi-factor including a password as the "gold standard").

Unfortunately, entering complex passwords is generally arduous and tedious, and doubly so on mobile devices. And yet, I tend to prefer using a mobile device for accessing most secure sites and systems, with that preference generally only increasing as the nominal security requirements increase. That seems counter-intuitive at first glance, but in this case the devil is in the details.

I value "smart security"; that is, security which is deployed in such a way as to increase protection, while minimizing the negative impact on the user experience, and where the additional friction from the security is proportional to the value of the data being protected. For example, I use complex and unique passwords for sites which store data I consider valuable (financial institutions, sensitive PII aggregation sites, etc.), and I tend to re-use passwords on sites which either don't have valuable information, or where I believe the security practices to be suspect (eg: if they do something to demonstrate a fundamental ignorance and/or stupidity with respect to security, such as requiring secondary passwords based on easily knowable data, aka "security questions"). I don't mind entering my complex passwords when the entry is used judiciously, to guard against sensitive actions, and the app/site is otherwise respectful of the potential annoyance factor.

Conversely, I get aggravated with apps and sites which do stupid things which do nothing to raise the bar for security, but constantly annoy users with security checks and policies. Things like time-based password expiration, time-based authentication expiration (especially with short timeouts), repeated password entry (which trains users to type in passwords without thinking about the context), authentication workflows where the data flow is not easily discernible (looking at most OAuth implementations here), etc. demonstrate either an ignorance of what constitutes "net good" security, or a contempt for the user experience, or both. These types of apps and sites are degrading the security experience, and ultimately negatively impacting security for everyone.

Mobile OS's help mitigate this, somewhat, by providing built-in mechanisms to downgrade the authentication systems from password to biometrics in many cases, and thus help compensate for the often otherwise miserable user experience being propagated by the "security stupid" apps and sites. By caching passwords on the devices, and allowing biometric authentication to populate them into forms, the mobile devices are "downgrading" the app/site security to single factor (ie: the device), but generally upgrading the user experience (because although biometrics are not as secure, they are generally "easy"). Thus, by using a mobile device to access an app/site with poor fundamental security design, the downsides can largely be mitigated, at the expense of nominal security in general. This is a trade-off I'm generally willing to make, and I suspect I'm not alone in this regard.

The ideal, of course, would be to raise the bar for security design for apps and sites in general, such that security was based on risk criteria and heuristics, and not (for example) based on arbitrary time-based re-auth checks. Unfortunately, though, there are many dumb organizations in the world, and lots of these types of decisions are ultimately motivated or made by people who are unable or unwilling to consider the net security impact of their bad policies, and/or blocked from making better systems. Most organizations today are "dumb" in this respect, and this is compounded by standards which mandate a level of nominal security (eg: time-based authentication expiration) which make "good" security effectively impossible, even for otherwise knowledgeable organizations. Thus, people will continue to downgrade the nominal security in the world, to mitigate these bad policy decisions, with the tacit acceptance from the industry that this is the best we can do, within the limitations imposed by the business reality in decision making.
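
For illustration, here's a rough sketch of what a risk-based re-authentication decision could look like, versus a fixed timeout. This doesn't reflect any particular product's implementation; the signals, weights, and threshold are all made-up assumptions.

    # Rough sketch of risk-based re-authentication (hypothetical; not any real API).
    from dataclasses import dataclass

    @dataclass
    class SessionContext:
        new_device: bool           # device not previously seen for this account
        new_location: bool         # coarse geolocation differs from recent history
        sensitive_action: bool     # e.g. changing payout details vs. viewing a balance
        days_since_full_auth: int  # time since the user last entered their password

    def requires_full_auth(ctx: SessionContext) -> bool:
        """Ask for the password only when accumulated risk justifies the friction."""
        score = 0
        score += 3 if ctx.new_device else 0
        score += 2 if ctx.new_location else 0
        score += 3 if ctx.sensitive_action else 0
        score += 1 if ctx.days_since_full_auth > 30 else 0
        return score >= 4    # made-up threshold, for illustration

    # Routine read on a known device: no password prompt.
    print(requires_full_auth(SessionContext(False, False, False, 10)))   # False
    # Sensitive action from a new device: prompt for the full password.
    print(requires_full_auth(SessionContext(True, False, True, 10)))     # True

The point isn't the specific weights, but that the friction scales with the risk of the action, rather than being triggered by the clock.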

It's a messy world; we just do the best we can within it.


Sunday, February 18, 2024

The Genius of FB's Motto

Why "Move Fast and Break Things" is insightful, and how many companies still don't get it

Note: I have never worked for FB/Meta (got an offer once, but ended up going to Amazon instead), so I don't have any specific insight. I'm sure there are books, interviews, etc., but the following is my take. I like to think I might have some indirect insight, since the mantra was purportedly based on observing what made startups successful, and I've had some experience with that. See: https://en.wikipedia.org/wiki/Meta_Platforms#History

If you look inside a lot of larger companies, you'll find a lot of process, a lot of meetings, substantial overhead with getting anything off the ground, and a general top-down organizational directive to "not break anything", and "do everything possible to make sure nothing has bugs". I think this stems from how typical management addresses problems in general: if something breaks, it's seen as a failure or deficiency in the process [of producing products and services], and it can and should be addressed by improving the "process". This philosophy leads to the above, but that's not the only factor. For example, over time critical people move on, and that can lead to systems which everyone is afraid to touch, for fear of "breaking something" (which, per the organizational directives, is the worst thing you can do). These factors create an environment of fear, where your protection is carefully following "the process", which is an individual's shield against blame when something goes wrong. After all, deficiencies in the process are not anyone's fault, and as long as the process is continually improved, the products will continue to get better and have fewer deficiencies over time. That aggregate way of thinking is really what leads to the state described.

I describe that not to be overly critical: for many people in those organizations, this is an unequivocal good thing. Managers love process: it's measurable, it has metrics and dashboards, you can do schedule-based product planning with regular releases, you can objectively measure success against KPI's, etc. It can also be good for IC's, especially those who aspire to have a steady and predictable job, where they follow and optimize their work product for the process (which is usually much easier than optimizing for actual product success in a market, for example). Executives love metrics and predictable schedules, managers love process, and it's far easier to hire and retain "line workers" than creatives, especially passionate ones. As long as the theory holds (ie: that optimal process leads to optimal business results), this strategy is perceived as optimal for many larger organizations.

It's also, incidentally, why smaller companies can crush larger established companies in markets. The tech boom proved this out, and some people noticed. Hence, Facebook's so-called hacker mentality was enshrined.

"Move fast" is generally more straightforward for people to grasp: the idea is to bias to action, rather than talking about something, particularly when the cost of trying and failing is low (this is related to the "fail fast" mantra). For software development, this tends to mean there's significantly less value in doing a complex design than a prototype: the former takes a lot of work and can diverge significantly from the finished product, while the latter provides real knowledge and lessons, with less overall inefficiency. "Move fast" also encapsulates the idea that you want engineers to be empowered to fix things directly, rather than going through layers of approvals and process (eg: Jira), so as to get to a better incremental product state sooner. Most companies have some corporate value which aligns with this concept.

"Break things" is more controversial; here's my take. This is a direct rebuke of the "put process and gating in place to prevent bugs" philosophy, which otherwise negates the ability to "move fast". Moreover, though, this is also an open invitation to risk product instability in the name of general improvement. It is an acknowledgement that development velocity is fundamentally more valuable to an organization than the pursuit of "perfection". It is also an acknowledgement of the fundamental business risk of having product infrastructure which nobody is willing to touch (for fear of breaking it), and it provides "cover" to try to make that infrastructure better, even at the expense of stability. It is the knowing acceptance that to create something better, it can be necessary to rebuild that thing, and in the process new bugs might be introduced, and that's okay.

It's genius to put that in writing, even though it might be obvious in terms of the end goal: it's basically an insight and acknowledgement that developer velocity wins, and then a codification of the principles which are fundamentally necessary to optimize for developer velocity. It's hard to overstate how valuable that insight was, and continues to be, in the industry.

Why the mantra evolved to add "with stable infrastructure"

I think this evolution makes sense, as an acknowledgement of a few additional things in particular, which are both very relevant to a larger company (ie: one which has grown past the "build to survive" phase, and into the "also maintain your products" phase):

  • You need your products to continue to function in the market, at least in terms of "core" functionality
  • You need your internal platforms to function, otherwise you cannot maintain internal velocity
  • You want stable foundations upon which to build on, to sustain (or continue to increase) velocity as you expand product scope

I think the first two are obvious, so let me just focus on the third point, as it pertains to development. Scaling development resources linearly with code size doesn't work well, because there is overhead in product maintenance and inter-person communication. Generally, you want to raise the level of abstraction involved in producing and maintaining functionality, such that you can "do more with less". However, this is not generally possible unless you have reliable "infrastructure" (at the code level) which you can build on top of, with high confidence that the resulting product code will be robust (at least insofar as it relies on that infrastructure). This, fundamentally, allows scaling development resources linearly with product functionality (not code size), which is a much more attainable goal.

Most successful companies get to this point in their evolution (ie: where they would otherwise get diminishing returns from internal resource scaling based on overhead). The smart ones recognize the problem, and shift to building stable infrastructure as a priority (while still moving fast and breaking things, generally), so as to be able to continue to scale product value efficiently. The ones with less insightful leadership end up churning with rewrites and/or lack of code reusability, scramble to fix compounding bugs, struggle with code duplication and legacy tech debt, etc. This is something which continues to be a challenge to even many otherwise good companies, and the genius of FB/Meta (imho) is recognizing this and trying to enshrine the right approach into their culture.

That's my take, anyway, fwiw.