Guy Louzon - a Podcast about business

Databricks: The Story of the Data Intelligence Platform



Part I: The Founders—Refugees, Programmers, and Dreamers

Ali Ghodsi: From Tehran to UC Berkeley

The story of Databricks begins, improbably, with a child’s flight from revolution. In 1984, when Ali Ghodsi was just five years old, his family had approximately 24 hours to escape Iran. The country was in upheaval, torn by the revolution and the early years of the Iran-Iraq War. His parents, both physicians, recognized the existential threat. With nothing but what they could carry, the Ghodsi family fled to Sweden—a country they knew only from maps and the kindness of strangers.

Growing up in the suburbs of Stockholm, the family struggled financially. Ghodsi’s parents were doctors rebuilding their careers in a new country, which meant opportunities were limited but expectations were clear: education was the path forward. When Ali was around seven or eight years old, his family acquired something that would change the trajectory of his life: a used, semi-broken Commodore 64. Most children would have been frustrated by a machine that didn’t run games. Ali Ghodsi did something different. He read the manuals. By the time he was eight, he was programming.

“From age eight until he transitioned into becoming Databricks’ CEO, Ali hadn’t spent a day without programming,” one biographer would later write. This wasn’t mere childhood dabbling. It was an obsession that shaped everything that followed.

Ghodsi excelled in his Swedish education, completing degrees in computer engineering and an MBA in logistics and strategic marketing from Mid Sweden University. But Sweden, with its emphasis on incremental research and seniority-based advancement in academia, felt confining to a young man who had spent his entire childhood as an outsider fighting for recognition. After earning his PhD in distributed computing from KTH Royal Institute of Technology in 2006, he served as an assistant professor at KTH from 2008 to 2009. The work was respectable. It was not enough.

In 2009, opportunity knocked in the form of UC Berkeley’s AMPLab—a $40 million DARPA-funded research initiative focused on big data analytics and machine learning systems. Ghodsi was invited to Berkeley as a visiting scholar for what was supposed to be one year. He arrived intending to observe the heyday of big data innovation, learn what the Americans were building, and return to Sweden. He never went back.

Ion Stoica: The Romanian Distributed Systems Master

At UC Berkeley, Ghodsi found his intellectual soulmate in Ion Stoica, a Romanian-American computer scientist who had established himself as one of the premier minds in distributed systems. Stoica’s trajectory had been its own remarkable journey. Born in Romania, he had earned an MS in electrical engineering and computer science from Polytechnic University of Bucharest in 1989. In 1995, as a doctoral student at Old Dominion University, he and his advisor Hussein Abdel-Wahab published an algorithm for “earliest eligible virtual deadline first scheduling”—a breakthrough that, nearly three decades later, would become the default process scheduler in the Linux kernel.

Stoica had been at UC Berkeley since 2000 as a professor of computer science. He was not just an academic; he was an entrepreneur who understood systems at the deepest level. In 2006, he had co-founded Conviva with other computer scientists, a company that emerged from CMU research on multicast systems and became a pioneer in video streaming technology. When Ghodsi arrived at Berkeley in 2009, Stoica was the intellectual center of the AMPLab, a man who combined rigorous research with a pragmatic understanding of what industry needed.

The two connected immediately. Ghodsi would stay that one year. Then another. Then another.

Matei Zaharia: The Spark Visionary

In the AMPLab at Berkeley, there was also a brilliant Romanian-Canadian graduate student named Matei Zaharia. Zaharia had come through an extraordinary academic pedigree—he’d been a gold medalist at the International Collegiate Programming Contest (ICPC) in 2005 with the University of Waterloo, and had even contributed to the acclaimed open-source game 0 A.D. But his true genius would emerge in his PhD research.

In 2009, Zaharia observed a fundamental problem that nobody in the big data world was adequately addressing. The dominant framework for distributed computing was Apache Hadoop, built on Google’s MapReduce model. MapReduce was powerful—it could distribute massive computations across clusters of commodity hardware. But it had a critical architectural flaw: it was designed for batch processing. Every time a MapReduce job completed, the results were written to disk. If the next job needed that data, it had to read it back from disk. For iterative algorithms—the kind used in machine learning—or for interactive data exploration, this disk I/O bottleneck made Hadoop painfully slow.

“Machine learning researchers in our lab at UC Berkeley were trying to use MapReduce for their algorithms and finding it very inefficient,” Zaharia would later explain. The problem was clear. The solution was not obvious.

Zaharia began designing an alternative. His key insight was radical: what if, instead of writing intermediate results to disk, we kept them in memory between operations? This would eliminate the disk I/O bottleneck and make iterative algorithms and interactive queries blazingly fast. In August 2009, he began building what he called Spark—a distributed computing engine that would keep data cached in RAM, enabling multiple operations to reuse the same dataset without the expensive disk reads.
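Zaharia’s insight can be sketched in a few lines of plain Python. This is a toy stand-in, not Spark itself: a simulated I/O cost is paid on every iteration in the MapReduce-style loop, while the Spark-style loop materializes the dataset once and reuses it from memory, analogous to calling `cache()` on an RDD.

```python
import time

# Toy stand-in for an expensive "load + transform" step that an
# iterative algorithm needs on every pass.
def load_and_transform():
    time.sleep(0.01)  # simulate disk I/O / recomputation cost
    return [x * x for x in range(10_000)]

def gradient_pass(data):
    return sum(data) % 97  # stand-in for one iteration of an ML algorithm

# MapReduce-style: pay the I/O cost on every iteration.
start = time.perf_counter()
for _ in range(20):
    gradient_pass(load_and_transform())
uncached = time.perf_counter() - start

# Spark-style: materialize the dataset once, keep it in RAM, reuse it.
start = time.perf_counter()
cached = load_and_transform()  # analogous to rdd.cache() in Spark
for _ in range(20):
    gradient_pass(cached)
cached_time = time.perf_counter() - start

print(f"recompute each pass: {uncached:.3f}s, cached in memory: {cached_time:.3f}s")
```

The simulated numbers are invented, but the shape of the result mirrors what the AMPLab measured: when the per-pass I/O cost dominates, keeping the working set in memory collapses the runtime of iterative workloads.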

The results were staggering. When Spark cached a dataset in memory and ran machine learning algorithms on it, it executed 10 to 100 times faster than Hadoop MapReduce. It made possible entire categories of applications—interactive data science, complex machine learning workflows, real-time analytics—that were prohibitively slow on Hadoop. In 2012, Zaharia and his co-authors published the seminal paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” at the top-tier NSDI conference. It was a best paper award winner.

The AMPLab Dream Team

Ghodsi, Stoica, and Zaharia were surrounded at the AMPLab by other exceptional computer scientists: Michael Franklin, a database systems expert; Scott Shenker, a networking genius; and others. This was the dream team of distributed systems research. But between 2009 and 2013, a philosophical question began to gnaw at them: What good is the best research in the world if nobody uses it?

The Spark project was open-sourced in 2010 and donated to the Apache Software Foundation in 2013, becoming Apache Spark. By 2012-2013, Spark was gaining adoption among leading technology companies. But adoption was still limited. The problem was distribution: companies had to download Spark, learn to manage it, integrate it with their data infrastructure, and hire engineers who understood distributed systems to make it work.

Ghodsi, who by now had committed to Berkeley for the long haul, began to see the opportunity clearly. Spark was powerful. But the world would never fully benefit from Spark as long as it remained difficult to deploy. What if someone built a company to commercialize Spark? Not to create a proprietary competitor to Spark, but to build a managed platform that made Spark accessible to enterprises that didn’t have a team of Ph.D. computer scientists on staff?

In 2013, the decision was made. The AMPLab researchers would start a company.

Part II: Birth in Crisis—The Early Years (2013-2015)

The Founding

Databricks was founded in 2013 by what would later be called “the Apache Spark Seven”: Ali Ghodsi, Ion Stoica, Matei Zaharia, Patrick Wendell, Reynold Xin, Andy Konwinski, and Arsalan Tavakoli-Shiraji. All seven were UC Berkeley researchers. All seven had contributed meaningfully to Apache Spark. This was not a startup founded by business school dropouts with an idea and ambition. This was founded by some of the world’s leading distributed systems researchers who wanted to build a company.

But founding was harder than it seemed. Ghodsi, in particular, was reluctant. He had found genuine happiness in academic research. He was respected, secure, and engaged in meaningful work. The idea of leaving that to start a company felt risky and uncertain. When the co-founders began discussing what kind of funding they would need and what valuation made sense, opinions varied wildly. Some thought $20 million in valuation was appropriate. Others thought $35 million. The founders debated extensively, unsure of their own market value.

Then Ben Horowitz, the legendary venture capitalist and co-founder of Andreessen Horowitz, arrived in the picture. Horowitz had heard about Spark through Scott Shenker, the AMPLab professor. He became convinced that Spark represented something profound—that a hundred-billion-dollar company could be built on top of this technology. He came to the team and asked what they needed. The founders, expecting to ask for something modest, were shocked by his response: “This company is worth $50 million,” Horowitz said. “And I’m willing to invest $14 million.” (In some accounts, Horowitz came with an $11 million check in hand.)

For Ghodsi, making $59,000 as a UC Berkeley professor, the decision suddenly became clearer. In September 2013, Databricks was officially founded with Andreessen Horowitz as the lead investor. Ion Stoica, the senior academic and the co-founder most prepared for executive leadership, became CEO. Matei Zaharia became CTO. Ali Ghodsi, still somewhat reluctant, became VP of Engineering and Product Management—the operational leader responsible for actually building the product.

The Struggle: Free Software, Hard Sales

The early years of Databricks were brutal in a way that many pre-product-market-fit startups are, but particularly so because of the founders’ academic backgrounds. They had built the world’s best distributed computing engine. They had revolutionized the category. They understood the technology so deeply that they could optimize Spark to do things that seemed like magic to competitors.

What they did not understand was enterprise sales.

The business model, in the beginning, was straightforward: provide a managed Spark service so enterprises could run Spark workloads without building and maintaining their own clusters. But there was an immediate, crushing problem. Spark was free. Open source. Anyone could download it and deploy it themselves. Why would they pay Databricks?

For several years, Databricks struggled to find customers willing to pay meaningful amounts of money. In customer meetings, enterprise stakeholders would literally ask: “Why would we ever pay $10,000? We’re just going to get it for free.” The open source advantage that was supposed to be a moat around the business—the credibility that came from the founders building the best technology in the world—became a liability. How do you sell a service around free software?

This was the great dilemma that Silicon Valley encounters again and again: the open source trap. The founders had given the world an extraordinary gift. Now they had to figure out how to build a profitable business on top of it.

Databricks raised more funding—$33 million in Series B in 2014, led by New Enterprise Associates (NEA) with follow-on from Andreessen Horowitz. But revenue remained tiny. The founders were solving a technical problem, not an economic one.

The Great Recalibration: 2015-2016

By 2015, something remarkable began to happen. Spark achieved mainstream recognition. Every data engineering team in the world seemed to be talking about it. Google, Facebook, Amazon, Alibaba—the world’s largest technology companies adopted Spark. It became the de facto standard for distributed data processing, topping the Gartner Hype Cycle.

But Databricks’ revenue was still just $1 million annually. The company had raised roughly $174 million and was valued at close to $1 billion, yet it was generating almost no revenue. The board was getting anxious. As one account memorably put it: “Even a local restaurant had higher revenues.”

In 2014, Reynold Xin, one of the co-founders and chief architect, had an idea. Why not participate in the Sort Benchmark—a well-known third-party competition for processing large datasets? The idea was to prove Spark’s superiority not through marketing or sales pitches, but through undeniable technical achievement.

The Sort Benchmark had long been the gold standard for measuring data processing efficiency. Companies and research teams would compete to sort massive amounts of data as quickly and cost-effectively as possible. In October 2014, Databricks entered the Daytona GraySort competition. Using 207 EC2 machines, Databricks’ team sorted 100 terabytes of data (1 trillion records) in just 23 minutes. The previous record, held by Yahoo using Hadoop MapReduce, had required 2,100 machines and taken 72 minutes. Spark had accomplished the same feat with 10x fewer machines and 3x faster execution.

It was a tie with a UCSD research team for first place, but it was a world record.

But more was to come. In 2016, Databricks partnered with Nanjing University and Alibaba Group in the CloudSort competition—a variant focused not on speed, but on cost-efficiency. The team sorted 100 terabytes of data using only $144.22 worth of cloud computing resources. That worked out to $1.44 per terabyte. The previous record had been $4.51 per terabyte, held by UC San Diego.

Databricks had achieved a 68 percent reduction in cost.

“Databricks reduced the per terabyte cost from 4.51 dollars, the previous world record held by University of California, San Diego in 2014, to 1.44 dollars, meaning our optimizations and advances in cloud computing have tripled the efficiency of data processing in the cloud,” Reynold Xin announced in November 2016. The achievement was recognized by Guinness World Records.
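The arithmetic behind the record is easy to verify from the figures quoted above:

```python
# Sanity-checking the CloudSort numbers quoted in the text.
total_cost, terabytes = 144.22, 100
per_tb = total_cost / terabytes      # 1.4422 -> "$1.44 per terabyte"

prev_record = 4.51                   # prior record, dollars per terabyte
reduction = 1 - per_tb / prev_record # fraction of cost eliminated
efficiency = prev_record / per_tb    # how many times cheaper per terabyte

print(round(per_tb, 2), round(reduction * 100), round(efficiency, 1))
# 1.44 per TB, a 68% reduction, i.e. roughly 3.1x ("tripled") efficiency
```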

This was the moment. Not the moment Databricks became profitable—that would take years. But the moment when the world could no longer ignore it. Suddenly, everyone was talking about Databricks and Spark. The technical proof points were undeniable. The marketing benefit was immense.

But there was still the matter of making money.

Part III: The Ali Ghodsi Era—Operator’s Turnaround (2016-2018)

The Leadership Transition

In January 2016, the board made a decision. Ion Stoica, a brilliant researcher and beloved professor, had excelled at articulating the vision and guiding the technical direction of Databricks. But running a hypergrowth technology company requires a different skill set—the operational intensity, the go-to-market execution, the ability to hire executives, the willingness to make unpopular decisions. Stoica wanted to return to his professorship at UC Berkeley. He became executive chairman.

Ali Ghodsi, the reluctant entrepreneur who had come to Berkeley with plans to return to Sweden, became CEO. By his own account, the decision to make him CEO was based largely on the fact that he was the eldest co-founder remaining in an operational role. But he would prove to be far more than that.

Ghodsi approached the role with the hard pragmatism of an immigrant who had seen family members lose everything and rebuild. In a 2018 interview, he would reflect on his childhood inspiration: “I loved the fact that you could think about large corporations as patients, and you could perform surgery on them to make them super healthy and successful.” This was how he approached Databricks—as a series of problems that needed diagnosis and correction.

The first diagnosis: the company was not charging appropriately for its software.

The Great Pivot: Charging for Software

In early 2016, immediately after becoming CEO, Ghodsi told his executive team something radical for a company built by academics around open-source software: “We need to charge for software, not just services.”

This single insight began to reshape Databricks fundamentally. The company had been operating on a services mentality—we’ll host your Spark infrastructure, manage your clusters, and you pay us for the infrastructure costs. But Ghodsi saw the problem clearly: the vast majority of value was not in hosting. It was in the software, the features, the intellectual property that Databricks’ engineering team was building on top of Spark.

He began to implement a strategy that would have seemed naive to many entrepreneurs but was brilliant in its simplicity: identify the enterprise customers who would benefit the most from Databricks, determine what 1% improvement in their metrics would be worth, and price accordingly.

“Databricks provides machine learning for massive data sets, allowing customers to potentially improve metrics by about 1%,” as one account describes it. “The only customer base that made sense for their business model are large-scale enterprises. But enterprises don’t just swipe a credit card to pay for your service.”

This realization led to the next insight: the company needed a real enterprise sales organization.

Building the Go-to-Market Machine

In 2016, with a valuation near $1 billion but revenue that was minimal, Ghodsi undertook a hiring spree that seemed reckless to many observers. He hired 12 new executives, experienced operators who had built sales organizations at other enterprise software companies. He hired a head of enterprise sales. He hired a Chief Financial Officer. He hired a head of marketing. He hired a Chief People Officer (HR).

“Many founders don’t make this a priority and end up spending a lot of their time on HR needs,” one account notes, quoting Ghodsi on the importance of hiring someone professional to build out processes for onboarding, compensation, training, and recruiting.

This was the discipline of an operator. Ghodsi had studied how successful companies scaled, and he was following the playbook, but with the advantage of Databricks’ incredible product and team.

The results were dramatic. In 2017, Databricks closed its first million-dollar deal. By the end of 2017, the company’s annual recurring revenue had reached $40 million. In 2018, it hit $100 million. By Q3 2019, Databricks was running at a $200 million annual revenue rate.

In just three years, Ali Ghodsi had transformed Databricks from a brilliant technology company that couldn’t sell to a hypergrowth enterprise software company with an unstoppable trajectory.

The Microsoft Partnership

A critical turning point came in 2017 with a landmark partnership with Microsoft. Microsoft, one of the world’s largest software companies and an aggressive entrant into cloud computing with Azure, could have built a Spark competitor. Instead, it recognized the talent and technology in Databricks and made a strategic decision to invest in and integrate Databricks into Azure.

The partnership generated hundreds of millions in annual revenue—some accounts suggest the initial deal alone was worth $100 million in sales. More importantly, it provided enterprise legitimacy. If Microsoft was betting on Databricks, then Databricks was the future of big data on the cloud.

By 2017-2018, Databricks’ story was no longer one of struggle. It was a story of momentum. The company went from sub-$1 million in annual revenue in 2015 to $100 million in 2018. The valuation reflected this: the company was now valued at several billion dollars, reflecting the expectations of continued hypergrowth.

Part IV: The Lakehouse Revolution (2019-2021)

The Architecture Problem

By 2018-2019, Databricks had achieved strong product-market fit with enterprises that needed advanced data engineering and machine learning workflows. But the founders were thinking bigger. They were beginning to see a fundamental architectural opportunity that would reshape the entire data industry.

For years, the data industry had been bifurcated. On one side were data warehouses—high-performance SQL databases optimized for business intelligence and analytics on structured data. Snowflake, which had gone public in 2020 with spectacular success, dominated this space. On the other side were data lakes—large-scale storage systems that could hold any kind of data—structured, semi-structured, or unstructured—but required complex, difficult-to-manage processing pipelines.

Every large enterprise needed both. This meant managing two different systems, two different teams, two different sets of tools, and the data synchronization complexity that came with maintaining them both.

What if you could combine them?

Enter the Lakehouse

In 2020, Databricks introduced the concept of the “lakehouse”—a unified architecture that combined the performance and management features of a data warehouse with the flexibility and cost-efficiency of a data lake. This wasn’t just a marketing term; it represented a genuine architectural innovation built on several technical foundations that Databricks had developed:

Delta Lake: Databricks open-sourced Delta Lake, a project that added ACID transaction support to data lakes. This meant that data lakes could now offer the reliability and transactional guarantees that had previously been the exclusive domain of warehouses. You could perform consistent operations, enforce constraints, and guarantee data integrity—all while maintaining the flexibility to store any kind of data.
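The mechanism that makes ACID possible on top of plain cloud storage can be illustrated with a toy transaction log in plain Python. This sketch is loosely inspired by the idea behind Delta Lake’s `_delta_log` directory, not its actual format or API: data files become visible to readers only after a commit entry is atomically published to the log, so a reader never observes a half-finished write.

```python
import json
import os
import tempfile

# Toy table with a transaction log (illustrative only; not Delta's format).
class ToyLog:
    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "_log"), exist_ok=True)
        self.version = 0

    def commit(self, filename, rows):
        # 1. Write the data file. It is invisible to readers until committed.
        with open(os.path.join(self.root, filename), "w") as f:
            json.dump(rows, f)
        # 2. Atomically publish a commit entry: write to a temp file,
        #    then rename it into the log (rename is atomic on POSIX).
        fd, tmp = tempfile.mkstemp(dir=os.path.join(self.root, "_log"))
        with os.fdopen(fd, "w") as f:
            json.dump({"add": filename}, f)
        os.rename(tmp, os.path.join(self.root, "_log", f"{self.version:06d}.json"))
        self.version += 1

    def read(self):
        # Readers reconstruct the table from committed entries only.
        rows = []
        log_dir = os.path.join(self.root, "_log")
        for name in sorted(os.listdir(log_dir)):
            with open(os.path.join(log_dir, name)) as f:
                entry = json.load(f)
            with open(os.path.join(self.root, entry["add"])) as f:
                rows += json.load(f)
        return rows

table = ToyLog(tempfile.mkdtemp())
table.commit("part-0.json", [{"id": 1}])
table.commit("part-1.json", [{"id": 2}])
print(table.read())  # [{'id': 1}, {'id': 2}]
```

The design choice to keep the source of truth in an ordered log of small metadata files is what lets a plain object store behave like a transactional table.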

Databricks SQL: A SQL engine optimized for running analytical queries on large-scale data in cloud storage. Rather than forcing data into a proprietary data warehouse format, Databricks SQL could query open data formats directly in cloud storage, delivering warehouse-like performance without warehouse lock-in.

The Open Data Format: By building the lakehouse on top of open formats and open standards (rather than proprietary storage formats), Databricks ensured that customers would never be trapped. Their data would always be portable, queryable by any tool, not locked into Databricks’ ecosystem.

This was a fundamental intellectual shift. Snowflake had won by modernizing the traditional data warehouse for the cloud—taking the proprietary, closed-off architecture of systems like Teradata and Oracle, and making them cloud-native, elastic, and accessible to smaller companies. Databricks was now attacking from the opposite direction: starting with the open flexibility of the data lake, but adding the performance and governance of a warehouse.

It was a more ambitious vision. If successful, it would reshape the entire data industry.

Building an AI/ML Powerhouse

Simultaneously, Databricks was expanding its capabilities for machine learning and AI. The company had long recognized that data engineering and machine learning were deeply intertwined. You couldn’t do good machine learning without good data engineering. And increasingly, the tools for managing data and training models were converging.

Databricks invested heavily in several key technologies:

MLflow: A project for managing the machine learning lifecycle—from experimentation to production deployment. MLflow solved a critical pain point: data scientists were experimenting with hundreds of model variations, but there was no standard way to track experiments, manage parameters, and deploy the best models to production. MLflow provided a unified platform for this entire workflow.
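The tracking problem MLflow addresses can be sketched with a minimal run logger in plain Python. This is not MLflow’s API; the parameter names and the metric formula below are invented purely for illustration of the record-then-query workflow.

```python
import random

# Minimal stand-in for experiment tracking: log each run's parameters
# and metric, then query the log for the best configuration.
runs = []

def log_run(params, metric):
    runs.append({"params": params, "metric": metric})

random.seed(0)
for lr in [0.001, 0.01, 0.1]:          # hypothetical learning rates
    for depth in [3, 5]:               # hypothetical model depths
        # Invented stand-in for "train a model and measure accuracy".
        accuracy = 0.7 + 0.05 * depth / 5 - abs(lr - 0.01) + random.uniform(0, 0.01)
        log_run({"lr": lr, "depth": depth}, accuracy)

# With every run recorded, picking the production candidate is one query.
best = max(runs, key=lambda r: r["metric"])
print("best params:", best["params"])
```

Without this kind of log, the "hundreds of model variations" live in scattered notebooks and nobody can say afterward which configuration produced the model that shipped.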

MLlib (Spark ML): Spark’s native machine learning libraries, enabling distributed training of complex models on massive datasets.

Collaborative Notebooks: Databricks notebooks provided a collaborative environment where data scientists, engineers, and analysts could work together on the same codebase, share results, and iterate rapidly.

By 2021, Databricks was not just a data infrastructure company. It was evolving into a comprehensive platform for data engineering, analytics, and machine learning—everything that enterprises needed to extract value from their data and build AI applications.

The Snowflake War Begins

By 2019-2021, the data industry had split into distinct camps. Snowflake, which had gone public in September 2020 at an IPO price of $120, had emerged as the clear winner of the data warehouse revolution. The stock surged, hitting $401 in November 2021 as growth-focused investors bid up software companies without regard to profitability.

But even as Snowflake celebrated its public success, Databricks was quietly building something that would compete not just in data warehousing, but across the entire data and AI infrastructure stack.

The competitive positioning was interesting and subtle. Snowflake had come from the warehouse and was moving toward AI. Databricks had come from AI and machine learning (via Spark) and was moving toward the warehouse. They were approaching from opposite directions, but converging toward the same market.

Snowflake’s strength was simplicity. A Snowflake user could load data and run SQL queries without deep technical expertise. The platform was “zero-admin”—it just worked. This made Snowflake the natural choice for data analysts and BI teams.

Databricks’ strength was flexibility and power. For organizations with complex data pipelines, advanced machine learning requirements, or diverse data types (not just structured SQL data), Databricks offered an open, extensible platform that could handle anything you threw at it. But it required more technical sophistication to operate effectively.

The battle lines were clear by 2021: Snowflake was the SQL-first warehouse. Databricks was the Spark-first lakehouse.

Part V: The AI Era and the $10 Billion Milestone (2022-2024)

The MosaicML Acquisition

In 2023, Databricks made a strategic acquisition that signaled its commitment to the new AI era: it acquired MosaicML, a company focused on generative AI and large language models, for $1.3 billion.

This was bold. Databricks was not just building infrastructure for traditional data engineering and analytics. It was positioning itself as a comprehensive platform for the AI era, where enterprises needed to train, fine-tune, and deploy LLMs, manage vector embeddings, and integrate generative AI into their applications.

The integration of MosaicML’s capabilities into the Databricks platform marked a fundamental shift in the company’s strategic positioning. Databricks was no longer competing primarily on data management. It was competing on data and AI intelligence.

DBRX: The Open-Source LLM

In 2024, Databricks released DBRX, an open-source foundation model built on the MegaBlocks project. This was a striking move. Rather than building proprietary LLMs locked behind API gates (as OpenAI and other startups were doing), Databricks released a powerful, efficient open-source model that enterprises could deploy, fine-tune, and customize within their own infrastructure.

This was consistent with Databricks’ entire philosophy: open formats, open standards, open source, and avoiding vendor lock-in. You should be able to use cutting-edge AI without being trapped in a proprietary ecosystem.

The $10 Billion Funding Round

In December 2024, Databricks announced a Series J funding round led by Thrive Capital, with Andreessen Horowitz (an investor since the company’s founding in 2013) among the participants. The round, at roughly $10 billion one of the largest private financings in venture capital history, valued Databricks at an astounding $62 billion.

To put this in perspective: Databricks had grown from a $1 billion valuation in 2016 (with nearly zero revenue) to a $62 billion valuation in 2024 (with a $4.8 billion annual run rate, growing at 55% year-over-year).

Ali Ghodsi, the reluctant entrepreneur who had come to UC Berkeley intending to return to Sweden, had built something extraordinary.

Part VI: The Competitive Landscape

Snowflake’s Dominance and Databricks’ Ascent

By 2024, a fascinating dynamic had emerged in the data and AI infrastructure market. Snowflake and Databricks were both at approximately $5 billion in annual recurring revenue (ARR), yet valued very differently. Snowflake’s market cap had compressed as investors rotated away from “growth at all costs” toward profitability and sustainable business models. Databricks, by contrast, had reached a $62 billion valuation as a private company, reflecting market confidence in its trajectory and the AI opportunity.

The key metric that told the story was Net Revenue Retention (NRR)—the measure of how much existing customers increase their spending with the company year over year. Databricks reported an NRR above 140%, meaning that existing customers were spending 40% more annually, even before new customer acquisition. Snowflake’s NRR had declined from 158% at IPO to 125% by 2024, a troubling trend that signaled a product velocity problem or increased competition.
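The metric itself is simple arithmetic. With illustrative, invented cohort numbers:

```python
# Net Revenue Retention: revenue this year from last year's customer
# cohort, divided by that cohort's revenue last year. New customers
# are excluded; expansions, churn, and contraction are all included.
cohort_last_year = 100.0   # hypothetical cohort revenue, prior year
cohort_this_year = 140.0   # same cohort a year later, after expansion/churn

nrr = cohort_this_year / cohort_last_year
print(f"NRR = {nrr:.0%}")  # an NRR of 140% means 40% net expansion
```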

At $5 billion ARR scale with this level of net retention, Databricks was demonstrating that customers found extraordinary value in the platform. They were expanding usage, adding new workloads, and going deeper into Databricks for data engineering, analytics, ML, and AI applications.

The Three-Way War: Cloud Providers, Warehouse vs. Lakehouse

But both Databricks and Snowflake faced a more fundamental competitive threat: the cloud providers themselves. Amazon Web Services, Google Cloud, and Microsoft Azure all had their own data infrastructure offerings. AWS had Redshift, Athena, and Glue. Google Cloud had BigQuery, which was becoming increasingly dominant in certain markets. Microsoft had invested in Databricks but also had Azure Synapse as a competing option.

The competitive dynamic was unusual: Databricks and Snowflake both benefited from running on cloud providers’ infrastructure, but competed with those same providers. Snowflake had maintained independence from any single cloud provider, running on AWS, Azure, and Google Cloud alike.

Databricks, for its part, had strategic partnerships with all three cloud providers but was not beholden to any of them.

The Warehouse vs. Lakehouse Debate

The fundamental architectural question was whether enterprises should consolidate on a single unified data platform (the lakehouse vision, favoring Databricks) or maintain separate, specialized systems for different workloads (the traditional warehouse + lake approach, favoring Snowflake and BigQuery for the warehouse piece, with separate lake infrastructure for more complex workloads).

By 2024, the evidence was increasingly in favor of the unified lakehouse approach. Enterprises were tired of managing multiple systems, multiple teams, and the complexity that came with data synchronization between systems. The lakehouse offered the promise of a single source of truth, unified governance (Databricks pitched Unity Catalog as the industry’s first governance layer spanning both data and AI), and the flexibility to handle any type of workload.

Snowflake was responding by expanding its own capabilities—adding Snowpark (a developer environment), Cortex (generative AI capabilities), and Polaris (an open-source catalog built on the Iceberg format). But these felt reactive, playing defense in Databricks’ court rather than playing offense on Snowflake’s home turf.

Epilogue: From Academic Refuge to AI Infrastructure Leader

The story of Databricks is the story of what happens when world-class researchers decide to build something meant to be used by millions of people rather than remaining trapped in academic papers and research projects.

Ali Ghodsi arrived in the United States in 2009 as a Swedish computer scientist with a Ph.D. and a temporary position at UC Berkeley. He was planning to return to Sweden. Instead, he found himself part of a team building Apache Spark, one of the most important open-source projects in computing history.

When the moment came to start a company, he was reluctant. But he recognized, with the clarity of someone who had fled one country and built a life in another, that the world would not spontaneously adopt Spark unless someone created an organization to make it accessible. He built that organization.

By 2024, Databricks had become a $62 billion company with a $4.8 billion run rate, growing at 55% annually, serving some of the world’s largest enterprises, and positioning itself as the infrastructure platform for the AI era.

The sorting benchmark victories were never really about sorting. They were about proving, unambiguously, that Spark was the most efficient engine for processing data in the cloud. But more importantly, they were about shifting the conversation from technical merit (which Databricks had all along) to practical demonstration of value.

The architectural innovations—Delta Lake, Databricks SQL, Unity Catalog—were never mere feature launches. They were fundamental shifts in how the industry thought about data infrastructure. Databricks was arguing, successfully, that you didn’t need to choose between the reliability of a warehouse and the flexibility of a lake. You could have both. You didn’t need to accept vendor lock-in. You could build on open formats and open standards.

And most recently, with MosaicML and DBRX, Databricks was arguing that you didn’t need to choose between proprietary LLM providers and open-source models that you had to fine-tune yourself. You could have enterprise-grade AI within your own data platform.

These were not trivial innovations. These were architectural shifts that were reshaping an entire industry.

By 2026, Databricks’ IPO was widely seen as an inevitability—a question not of if but of when. The company had demonstrated hypergrowth, a credible path to profitability, and clear market leadership in the most important infrastructure category of the AI era: unified data and AI platforms.

The refugee who learned to code on a broken Commodore 64 in Stockholm, who arrived at UC Berkeley planning a one-year visit, had built something that would serve millions of users and touch the infrastructure of global business for generations.


