The Story of PayPal V2

I was reading The Founders recently (https://www.amazon.com/Founders-Paypal-Entrepreneurs-Shaped-Silicon/dp/1501197266); I was interviewed for this book, fwiw. There's a section on "PayPal V2", the internal effort to rewrite the combined website, which was ultimately scrapped when Elon was ousted, but the book glosses over a lot of the technical details (for good reason, as these would not be very interesting to most mainstream readers). However, if you're reading this, I'm going to assume you might be more interested in the nitty-gritty of this project, and since I had a first-hand perspective, I'll write more about that aspect here (and if you're not interested in this, then perhaps skip this post).

To understand the build-up to this, though, I'll need to cover a bit of internal history, from my perspective.

The Preconditions

Technology base

Confinity and x.com had very different technology bases, with differences that went well beyond the choice of platform and tools. Confinity was founded and led by people who despised Microsoft conceptually, and had built their company's software very much "from scratch" on Linux and open source. It had a custom front-end scripting engine, custom binaries for processing web commands, custom networking, etc. It was built largely without concern for robustness (in terms of, for example, error checking and handling) or robust code design paradigms (everything was fairly ad hoc), and internal consistency was fracturing as the codebase grew (and more people worked on adding more features). It was very much "startup" code.

X.com, on the other hand, was built almost entirely on Microsoft's platform, embracing their approach to tiered web application design, and "trying" to create robust code. I say trying because, in reality, Microsoft's platform was limited in terms of its robustness, and x.com developers were also a scrappy bunch of young engineers trying to build things as fast as possible. As an example, I was in charge of building the front end for the site: not in charge of a team, or an engineering group... I wrote all the front-end code, myself, primarily in ASP (pre-ASP.NET). It was as robust as the platform and time would allow, but no more than that, and functionally no better than PayPal's more haphazard, hand-grown framework.

When Bill Harris forced the merger of the companies, he ignored the looming issues of the technical divide (not being a technology person himself). But these were two systems that were never going to be mergeable; they were fundamentally incompatible. One needed to be the path forward, and one would ultimately be discarded, and that had strong implications for the principals involved in each system. This was known, I think, from the time of the merger, but at the time both "sides" thought their solution was superior and would be the path forward.

Approach to correctness of data

One of the other most significant differences between the companies, which often gets minimized when looking from an external perspective, was the very different approach to the correctness of the data within each company's systems (and in this case, I mean primarily user account and financial data). This was very evident to me internally, from high-level discussions post-merger with the technical principals (Max and otherwise), but perhaps was not as evident beyond that scope.

X.com, owing to its roots as a would-be financial institution, was always focused on the data in our database being correct. That is, we prioritized transactional correctness for data, particularly during monetary transfers. As a result, it could be asserted that at any specific time (and barring bugs in the system), the data within the database(s) was accurate with respect to the state of accounts, balances, etc.  This extended to the mindset of how transactions were processed as well; for example, x.com did not show money in an account until the company had verified and completed the transaction. This had a cost, of course, and that cost was typically assumed at the start of the customer's interactions with the system. That is, it might take longer to make a transaction (particularly when funding an account), but once the money was "there", using it (sending, taking out, etc.) was never blocked, because it was verified to actually exist.

Confinity took an entirely different approach: they focused on validating the data only when money was removed from the system, and did not prioritize correctness before that point. That meant, for example, that transactions were not necessarily "transactional" in the database, money shown in accounts wasn't necessarily "real", and errors in the data were expected. This allowed much quicker "onboarding" for new accounts and money, as PayPal could show a balance before validating anything, and you could send that money internally before it was even "there" from a backend perspective. The thinking was that as long as there was robust validation on money egress, they could catch all the potential problems at that point, as well as running robust fraud checks then (for example).

This alternative approach came with a cost, though: withdrawal times were extended, usually by a few days, but in a number of cases effectively indefinitely, if/when PayPal could not verify the source of the money. This was a source of frustration for a lot of users (leading to, among other things, people showing up in person demanding their money, lots of customer support issues, and bad feelings among some users, particularly some Power Sellers on eBay). This was later compounded when (for example) the US government started requesting holds on certain accounts, which would impact other accounts (through the transitive nature of the fraud checks on monetary egress).

I will note: neither of these approaches is "correct", per se. As was usual for many design decisions at the time, there were tradeoffs, both in terms of robustness of the business and time to market. Truth be told, x.com may have adopted the PayPal model if it had started as a payments company and not an aspiring financial institution; it offered distinct benefits in terms of time to market. But, this was the state at the time of the merger.
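
To make the contrast concrete, here's a minimal sketch of the two models as described above. This is purely illustrative (written in Python for readability; neither codebase looked anything like this, and all the names are mine):

    # Illustrative sketch only: the names and structure here are hypothetical,
    # not from either real codebase.

    def verify_funding_source(source: str, amount: int) -> bool:
        """Stand-in for clearing an external transfer (this could take days)."""
        return True  # placeholder

    def egress_checks_pass(account_id: str) -> bool:
        """Stand-in for the fraud/verification checks run at withdrawal time."""
        return True  # placeholder

    class XcomLedger:
        """x.com model: money appears in a balance only after it is verified."""
        def __init__(self) -> None:
            self.balances: dict[str, int] = {}

        def fund(self, account_id: str, amount: int, source: str) -> None:
            # The slow part happens up front; unverified money is never shown.
            if not verify_funding_source(source, amount):
                raise ValueError("funding source rejected")
            self.balances[account_id] = self.balances.get(account_id, 0) + amount

        def withdraw(self, account_id: str, amount: int) -> None:
            # No holds needed here: anything in the balance is known to be real.
            self.balances[account_id] -= amount

    class ConfinityLedger:
        """Confinity model: credit immediately, validate only on egress."""
        def __init__(self) -> None:
            self.balances: dict[str, int] = {}
            self.unverified: list[tuple[str, str, int]] = []

        def fund(self, account_id: str, amount: int, source: str) -> None:
            # Fast path: show the balance immediately, remember the source.
            self.balances[account_id] = self.balances.get(account_id, 0) + amount
            self.unverified.append((account_id, source, amount))

        def withdraw(self, account_id: str, amount: int) -> None:
            # All verification and fraud checking is deferred to this point,
            # which is why withdrawals could take days, or stall indefinitely.
            if not egress_checks_pass(account_id):
                raise RuntimeError("withdrawal held pending verification")
            self.balances[account_id] -= amount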

The problem of scalability

The fundamental issue that both products had, at the time of the merger, was scalability. X.com was built on Microsoft, and could scale the front-end and middle-tier servers effectively without bound (multi-tier architecture), but was limited by the database. While we had a large box, and everything possible was offloaded from it, we could estimate the limit of transactional throughput simply from the database's TPS ceiling (iirc, it was estimated to be roughly 10x the transactional load at the time of the merger). While this could be mitigated by getting bigger hardware, that would only work up to a point, and then the headroom would run out. We had done everything we could to extend this (note: this effort could be a whole other topic), but there were limits, and we could see them.

PayPal's system was fundamentally no better. While their Oracle DB was running on Sun hardware which had more headroom (eg: eBay's architecture was the same, and they famously spent >$1M on a single Sun system for their DB), they also had more processing in the DB (hand-written stored procedures, for speed), and the net result was that their system was not any more scalable. We also estimated that building transactionality into that system would cripple the database as architected, and it would be effectively impossible to ever support x.com's financial management ambitions in that framework.
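
The back-of-envelope math here was simple. With purely hypothetical numbers (I no longer remember the real figures), the shape of it looked like this:

    # Purely hypothetical numbers, just to illustrate the shape of the problem.
    db_tps_ceiling = 2_000         # peak transactions/sec one database can sustain
    db_ops_per_payment = 4         # each payment touches the DB multiple times
    current_payments_per_sec = 50  # load at the time

    max_payments_per_sec = db_tps_ceiling / db_ops_per_payment   # 500
    headroom = max_payments_per_sec / current_payments_per_sec   # ~10x

    # Adding web or middle-tier servers changes none of the above: the single
    # database is the serialization point. Bigger hardware raises the ceiling
    # by a constant factor, but transaction growth eventually wins.
    print(f"estimated headroom: ~{headroom:.0f}x")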

As a result of these realities, a new system was needed, which would both scale and support robust transactions for financial systems. This was the genesis of the V2 plan.

The V2 Design

The main reasons that Tod, Jeff, and I were tasked with designing V2 were:
  • x.com was relatively stable at that point, and didn't require constant babysitting from a code perspective
  • The feeling was that building a transactional system on the hand-grown Linux framework would be insurmountably hard
  • Microsoft already had a transactional framework in place for their web apps (MTS and DTS), which we were already using for x.com, and which made sense to try to leverage for V2

That left the scalability issue, which we thought we knew how to solve.

Sharding and transactions

The main solution we landed on (proposed by Tod, iirc) was to "shard" (modern terminology; I don't think we called it this) the user accounts into multiple databases, and run distributed transactions (using DTS) across the instances for transfers. We would have what we called an "index server" for the lookups and some synchronization, but everything else would be offloaded to the shards (and in particular, all the high-volume transactions were offloaded there). In this design, the index server was still a bottleneck, but we estimated a 100x increase in headroom over the then-current solution, which we joked would get us to the next milestone (which was the "trucks problem").
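
In outline, the design looked something like the sketch below (again Python, purely illustrative; the real implementation was MTS/DTS against SQL Server, and every name here is hypothetical):

    # Conceptual sketch only; every name is hypothetical. The real system ran
    # on SQL Server, with MTS/DTS coordinating the cross-database work.

    NUM_SHARDS = 8  # illustrative shard count

    class IndexServer:
        """The small central lookup: which shard owns which account."""
        def __init__(self) -> None:
            self.account_to_shard: dict[str, int] = {}

        def register(self, account_id: str) -> int:
            shard = hash(account_id) % NUM_SHARDS
            self.account_to_shard[account_id] = shard
            return shard

        def locate(self, account_id: str) -> int:
            return self.account_to_shard[account_id]

    class Shard:
        """One of the account databases; absorbs the high-volume load."""
        def __init__(self) -> None:
            self.balances: dict[str, int] = {}

    def transfer(index: IndexServer, shards: list[Shard],
                 src: str, dst: str, amount: int) -> None:
        """Move money between accounts that may live on different shards.

        In the real design this was a distributed transaction coordinated
        by DTS, so the debit and credit committed atomically or not at all;
        here the two steps are merely sequential, for illustration.
        """
        src_shard = shards[index.locate(src)]
        dst_shard = shards[index.locate(dst)]
        if src_shard.balances.get(src, 0) < amount:
            raise RuntimeError("insufficient funds")
        src_shard.balances[src] -= amount
        dst_shard.balances[dst] = dst_shard.balances.get(dst, 0) + amount

    # Usage sketch:
    index = IndexServer()
    shards = [Shard() for _ in range(NUM_SHARDS)]
    for acct in ("alice", "bob"):
        index.register(acct)
    shards[index.locate("alice")].balances["alice"] = 10_000
    transfer(index, shards, "alice", "bob", 2_500)

The index server stays small because it only handles lookups and coordination; all of the balance reads and writes land on the shards, which is where the 100x headroom estimate came from.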

This design worked conceptually, and to some extent in practice, but suffered from what was an inevitable and insurmountable cultural problem inherent in the decision to redesign the codebase. While we made all the x.com functionality work, we didn't have any buy-in from people on the PayPal side, and while V2 was in progress they had continued to work on their site and functionality. Thus, by the time V2 was "functional", it was missing some core components, and didn't have the robustness of either of the other two codebases. We estimated that these gaps could be closed, but as "heads down" engineers solving technical problems, we didn't really put any effort into getting anyone else on board with the planned overhaul. As it turned out, nobody else was working on that aspect either; Elon assumed everyone else would get on board when it was "good", but without many others on board, it would never get to the point where it would be perceived as "good".

As an aside, we were also somewhat on the bleeding edge of scalable architectures, writing code against platforms which were never robustly tested for these use cases, and with a tiny set of developers. Nowadays we take horizontal scalability somewhat for granted, but this was during the early days of Google, AWS wasn't a thing yet, and scalable cloud databases were not even a concept. You can see some of the concepts from that time in things like cloud databases now, but it was all "invent as you go" back then, on napkins and over coffee milkshakes in the afternoons. Engineers like me might call that "the best of times".

Programming people

I've always liked computers: they are rational, objective, and behave as expected, as long as you understand enough about how they work. The same could be said of people in concept, but people are much more complex, have lots of hidden variables, and I've never had as much success understanding them. The real problem that caused the V2 overhaul to ultimately fail was that we didn't account for people; specifically, we didn't get the people from PayPal on board with the design direction (and perhaps that would have been impossible), and we didn't have buy-in from the aggregate company that the direction was correct. We proceeded as directed by Elon, with the consensus that getting people on board was his job, but ultimately that did not work out (as seen by the revolt that got him ousted as CEO, with the subsequent departure of most of the principals from the original x.com).

Post effects

Ultimately, the combined company decided to move forward with PayPal's existing system design after Elon was forced out. I stayed on for a few months after, making an attempt to integrate into the new culture, but ultimately I perceived a certain amount of animosity toward the "Microsoft people" among the PayPal engineers, and they were also strongly resistant to my efforts to make their code more robust (eg: I distinctly recall one of their high-level developers telling me that they categorically rejected adding any data validation to any of the business logic calls, because "that would be slower, and is unnecessary, because the data coming from the web application is always correct"). At the end of the day, the strong cultural divides were never effectively bridged, and I departed a few months prior to the IPO.
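
For what it's worth, the kind of validation I was pushing for was nothing exotic; think of something like the following (hypothetical and simplified, sketched in Python rather than their actual stack) at the top of each business-logic entry point:

    # Hypothetical, simplified illustration of the point: the business layer
    # checks its own inputs instead of trusting that the web tier is "always
    # correct".
    def transfer_funds(from_account: str, to_account: str, amount_cents: int) -> None:
        # Cheap sanity checks; the cost is trivial next to the DB round trips.
        if not from_account or not to_account:
            raise ValueError("missing account id")
        if from_account == to_account:
            raise ValueError("cannot transfer to the same account")
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        ...  # proceed with the actual transfer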

I have a lot more stories and recollections from that time and those interactions, of course, but that was the story of what was to be PayPal V2, for whatever it's worth.
