04 - The Embarrassment Moment - Go Live or Go Home!

July 26, 2024

The trade-off with the “Now” vs. “Next” problem is that sometimes we choose a quick fix that then becomes a permanent solution. These fixes accumulate over time and eventually become “Technical Debt.”

Some advisors said, “We can minimize technical debt if we build a system with the bare minimum that is ready to scale.” Agreed, but sometimes we don't know what we don't know.

…What “We don’t know”

It can take years to achieve product-market fit.
Our morale will decline after several months.
We might pivot multiple times until we find the right momentum.
Attracting good engineers is challenging in the early stages due to the company's insufficient brand recognition.
…etc

For instance, at that time, one of my team members had experience building systems that supported over 10,000 TPS. Despite this expertise, our team opted for the Firebase Realtime Database because of its simplicity. Using Firebase meant we didn't need to worry about setting up an initial server, hosting, storage, or authentication. Everything was available in a single SDK.

Then COVID-19 hit.

After using Firebase for 3 years, our transaction volume increased 10x. Our Firebase database hit 100% usage, causing our application to freeze for several minutes. It turned out that a single Firebase Realtime Database instance does not scale horizontally (see Usage Limit). Typically, when you reach 100% usage, it means you're either writing more than the limits documented there allow or, less likely, your clients are reading more data than can be pushed out. It was devastating. The business was ready to scale, but our system couldn't handle the load.

I wish I had written less code. Every block of code needs to be maintained and can potentially break the application. Beginners often write a lot of code, but experienced developers realize that less is more. Simple, well-maintained code is more valuable than complex code.

The migration project started under the name "One Million Dollar Project - Go Live or Go Home".

Here is what we did:

Set Up a Mitigation Plan: We first developed a comprehensive mitigation plan, aiming to automate or simplify processes through scripting. For example, we found data discrepancies and then fixed them by script.
Identify and Verify Bottlenecks: We identified the critical bottlenecks in our sub-systems, where improvements could significantly reduce load. We focused on the areas that would yield the highest dividends, capturing metrics such as QPS (Queries Per Second), TPS (Transactions Per Second), RPS (Requests Per Second), and process times.
Rewrite Services: We began rewriting services to enhance performance and scalability. Our team used the strangler pattern as the migration strategy. Detailed technical aspects of this migration will be shared in another blog post.
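
To make the "find data discrepancy, then fix by script" idea concrete, here is a minimal sketch in Python. It is not the script we actually ran. It assumes both the legacy store and the migrated store can be loaded as plain dictionaries keyed by record ID, and that the legacy store is the source of truth during migration. The names find_discrepancies and repair are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List, Tuple

# Assumption: each store can be exported as {record_id: record_fields}.
Record = Dict[str, Any]
Store = Dict[str, Record]


def find_discrepancies(legacy: Store, migrated: Store) -> List[Tuple[str, str]]:
    """Return (record_id, reason) pairs where the migrated store disagrees with legacy."""
    issues: List[Tuple[str, str]] = []
    for key, record in legacy.items():
        if key not in migrated:
            issues.append((key, "missing"))      # never copied over
        elif migrated[key] != record:
            issues.append((key, "mismatch"))     # copied, but fields differ
    for key in migrated:
        if key not in legacy:
            issues.append((key, "unexpected"))   # exists only in the new store
    return issues


def repair(legacy: Store, migrated: Store, issues: List[Tuple[str, str]]) -> int:
    """Copy legacy records over missing or mismatched ones; return how many were fixed.

    Assumption: legacy is the source of truth. "unexpected" records are left
    untouched so a human can review them.
    """
    fixed = 0
    for key, reason in issues:
        if reason in ("missing", "mismatch"):
            migrated[key] = dict(legacy[key])
            fixed += 1
    return fixed


if __name__ == "__main__":
    legacy = {"a": {"amount": 1}, "b": {"amount": 2}}
    migrated = {"a": {"amount": 1}, "b": {"amount": 3}, "z": {"amount": 9}}
    issues = find_discrepancies(legacy, migrated)
    print(issues)
    print(repair(legacy, migrated, issues))
```

Running the check again after the repair is a cheap way to confirm the fix worked: only the records flagged for manual review should remain.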

“It's not about making the right choice. It's about making a choice and making it right.” ― J.R. Rim

As a result, our system was finally capable of supporting 10x business growth, and several months later we got funding.

From this use case I learned that an early-stage company needs engineers who are comfortable with migration projects, whether they are driven by load issues, architectural issues, organizational growth (breaking a monolith into micro/domain services), or modernizing the system.

Additionally, there is another roller coaster driven not only by engineering problems but by a broader context known as "Wartime" vs "Peacetime."