The $62 Billion Spreadsheet
In December 2024, Databricks closed a $10 billion Series J round at a $62 billion valuation — the largest private funding round in venture capital history at the time, eclipsing the $6.6 billion OpenAI had raised just two months earlier. The valuation was up nearly 50% from the $43 billion mark Databricks had set fourteen months before. The numbers were staggering not because of what Databricks was — a data analytics platform, the kind of enterprise software that makes most people's eyes glaze over — but because of the velocity at which it was becoming something else entirely. Revenue had crossed $2.4 billion in annualized run rate. The customer base included more than 10,000 organizations, over 60% of the Fortune 500. And the company was growing at north of 50% year-over-year, a rate that would be impressive for a $200 million startup and was almost unprecedented for a business approaching $3 billion.
But the number that mattered most was buried in the architecture. By late 2024, Databricks was processing more than 12 exabytes of data daily across its Lakehouse Platform — a volume so vast that the entire Library of Congress, digitized in full, would represent a rounding error. This was the real product: not a tool for querying tables, but the gravitational center of a new data operating system that companies were building their AI strategies around. The $62 billion valuation wasn't paying for what Databricks had built. It was paying for what the market believed it would become — the default substrate on which enterprise AI runs.
That belief, and the architecture behind it, traces back to a seven-year research project at UC Berkeley that most enterprise software companies would never have had the patience to fund.
By the Numbers
Databricks at a Glance (Late 2024)
- $2.4B+ annualized revenue run rate
- $62B post-money valuation (Series J)
- 10,000+ customers worldwide
- 60%+ of the Fortune 500 as customers
- $10B Series J, the largest private VC round ever raised
- 7,000+ employees globally
- 50%+ year-over-year revenue growth
- 12 EB of data processed on the platform daily
Seven Researchers and a Cluster
The founding mythology of Databricks is unusually academic, even by Silicon Valley standards. The company emerged not from a garage or a dorm room but from the AMPLab at the University of California, Berkeley — a research lab jointly funded by DARPA, the NSF, Google, and a handful of other institutional patrons of basic computer science research. The lab's mandate was broad: build the next generation of data analytics tools for problems too large and complex for existing systems.
Ali Ghodsi, an Iranian-born computer scientist who had grown up in Sweden and earned his PhD at KTH Royal Institute of Technology in Stockholm, arrived at Berkeley as a postdoctoral researcher in 2009. He was, by training, a distributed systems theorist — the kind of person who thinks about how to make thousands of machines behave as one. Ghodsi joined a group that already included Matei Zaharia, a Romanian-Canadian PhD student whose dissertation project would become one of the most consequential pieces of open-source software in the history of enterprise computing.
Zaharia's project was Apache Spark.
The problem Spark addressed was deceptively simple: Hadoop, the dominant framework for large-scale data processing, was painfully slow. Hadoop's MapReduce paradigm wrote intermediate results to disk between every computation step — a design choice that was robust but glacial for iterative workloads like machine learning, where the same data needed to be processed dozens or hundreds of times. Zaharia's insight was to keep data in memory across computation steps, a technique called Resilient Distributed Datasets (RDDs). The speedup was not incremental. Spark could run certain workloads 100 times faster than Hadoop MapReduce.
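The speedup mechanism can be sketched without Spark at all. The toy Python below is illustrative only (the `Dataset`, `cache`, and `train` names are invented for this sketch, not Spark's API); it counts how many expensive "disk loads" an iterative job triggers with and without in-memory caching:

```python
# Toy illustration of why in-memory caching speeds up iterative workloads.
# A "dataset" is produced by an expensive load step (disk I/O, in
# MapReduce's case); an iterative algorithm then scans it many times.

class Dataset:
    def __init__(self, loader):
        self._loader = loader     # recomputes from "disk" on every access
        self._cached = None
        self.loads = 0            # count of expensive loads performed

    def cache(self):
        # Materialize once in memory, the way Spark pins RDD partitions.
        self._cached = list(self._loader())
        return self

    def records(self):
        if self._cached is not None:
            return self._cached   # served from memory, no load
        self.loads += 1           # MapReduce-style: re-read from disk
        return list(self._loader())

def train(ds, iterations=100):
    total = 0
    for _ in range(iterations):   # same data scanned every iteration
        total += sum(ds.records())
    return total

uncached = Dataset(lambda: range(10))
train(uncached)                   # triggers 100 expensive loads

cached = Dataset(lambda: range(10)).cache()
train(cached)                     # zero loads after the initial cache
print(uncached.loads, cached.loads)  # → 100 0
```

In Spark terms, calling `.cache()` (or `.persist()`) on an RDD plays the role of `cache()` here: subsequent actions reuse the in-memory partitions instead of re-reading and recomputing from disk each pass.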
The paper landed in 2010. By 2012, Spark had become the most active open-source project in big data. And seven researchers from the AMPLab — Ghodsi, Zaharia, Ion Stoica, Scott Shenker, Patrick Wendell, Reynold Xin, and Andy Konwinski — faced the decision that defines the trajectory of every successful academic project: commercialize or watch someone else do it.
They incorporated Databricks in 2013, with Ghodsi as CEO and Zaharia as CTO. The choice of Ghodsi to lead was revealing. Stoica and Shenker were the senior professors, established figures in systems research. But Ghodsi had the immigrant's hunger and the operator's instinct — he understood that the technical achievement of Spark was necessary but not sufficient, that the real challenge was wrapping a research project in a product surface that Fortune 500 companies would pay seven figures a year to use.
We realized that just open-sourcing software wasn't enough. Companies needed a managed service, they needed support, they needed someone to call when things broke at scale.
— Ali Ghodsi, CEO of Databricks
The initial product was modest: a managed cloud service for running Apache Spark. Customers could spin up Spark clusters without managing infrastructure, run their ETL pipelines and machine learning experiments, and pay by consumption. It was, in essence, Spark-as-a-service. Andreessen Horowitz led the Series A with a $14 million check in September 2013, recognizing the pattern — open-source project with massive adoption, thin commercial wrapper, land-and-expand into the enterprise.
But the decision to build a company around Spark contained within it a tension that would take nearly a decade to resolve. Spark was a processing engine — it computed things. It did not store things. And in the data world, the money has always been in storage.
The Warehouse Wars
To understand what Databricks became, you have to understand the architecture it was born into — and the company it spent a decade trying to defeat.
The modern data stack, as it existed in 2013, was organized around a simple dichotomy. Data warehouses — Teradata, Oracle, and increasingly cloud-native systems — stored structured data in optimized columnar formats and let analysts query it with SQL. Data lakes — built on Hadoop's HDFS or, increasingly, cloud object stores like Amazon S3 — stored everything else: logs, images, sensor data, unstructured text, the messy exhaust of digital operations. Warehouses were expensive, fast, and governed. Lakes were cheap, slow, and chaotic.
Databricks lived on the lake side. Its customers were data engineers and data scientists — technical users who wrote Python and Scala, who built machine learning models, who needed to process massive volumes of raw data. The warehouse side was dominated by a company that had been founded in 2012 by three database engineers, two of them longtime veterans of Oracle.
Snowflake.
The rivalry between Databricks and Snowflake became the defining competitive axis of enterprise data infrastructure in the 2010s and 2020s, and understanding it requires appreciating that the two companies started from opposite ends of the same spectrum and spent a decade converging. Snowflake built a cloud-native data warehouse — blazingly fast SQL analytics on structured data, with a consumption-based pricing model that made CFOs weep with joy and CIOs weep with anxiety. Its co-founder Benoit Dageville, a French database theorist who had spent two decades at Oracle, understood something fundamental: most data analytics, for most companies, is SQL. Not Python, not Scala, not TensorFlow. SQL. The lingua franca of business intelligence.
Snowflake's product was elegant, opinionated, and closed. Data went into Snowflake's proprietary storage format. Queries ran on Snowflake's proprietary compute engine. You paid for both. The lock-in was the point — or at least the consequence. By the time Snowflake IPO'd in September 2020 at the largest software IPO in history (raising $3.4 billion, with shares doubling on the first day to a $70 billion market cap), it had become the canonical example of cloud data infrastructure done right.
Databricks, by contrast, was open. Its philosophical DNA — inherited from Berkeley, from the open-source ethos of Spark, from the academic conviction that open standards win — led it toward a fundamentally different architectural bet. Data stayed in the customer's cloud storage. Compute ran on Databricks' managed clusters. The decoupling was both a principle and a wedge: customers who feared Snowflake's lock-in could use Databricks and keep their data on S3 or Azure Blob Storage or Google Cloud Storage, under their own control.
But openness created its own problems. The data lake was a mess. Without the rigid schema enforcement of a warehouse, lakes became swamps — petabytes of data with no governance, no quality guarantees, no ability to run the fast SQL queries that business analysts demanded. Databricks had the data scientists. Snowflake had the business analysts. And business analysts, historically, control more budget.
The strategic problem was clear by 2017: Databricks needed to make the lake work like a warehouse without becoming a warehouse. The solution would redefine the company.
The Lakehouse Thesis
The term "lakehouse" was coined in a 2020 research paper co-authored by Zaharia and other Databricks researchers, but the architectural work began years earlier. The core insight was that the dichotomy between lakes and warehouses was not a law of nature but an artifact of technology limitations — limitations that cloud storage and modern metadata layers could overcome.
The key technology was Delta Lake, which Databricks open-sourced in 2019. Delta Lake added ACID transactions — the guarantees of atomicity, consistency, isolation, and durability that are the foundation of any reliable database — to data stored in open formats (Apache Parquet files) on cloud object storage. This meant that a data lake could now support the kind of reliable, consistent reads and writes that had previously required a warehouse. You could run SQL analytics and machine learning workloads on the same data, in the same system, without copying it between a lake and a warehouse.
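The transaction-log idea can be sketched in a few lines of toy Python. This is not Delta Lake's implementation (the real system stores Parquet data files plus a `_delta_log` directory of JSON actions on object storage); it only shows why an append-only log over immutable files yields atomic commits and versioned snapshots:

```python
# Toy sketch of Delta Lake's core trick: immutable data files plus an
# append-only transaction log. Readers see only files referenced by a
# committed log entry, so a half-finished write is invisible (atomicity)
# and every read is a consistent snapshot (isolation).

class ToyDeltaTable:
    def __init__(self):
        self.files = {}   # file name -> rows (stand-in for Parquet on S3)
        self.log = []     # committed versions: each entry lists live files

    def commit(self, add_files):
        # Stage data files first; they are invisible until the log entry lands.
        for name, rows in add_files.items():
            self.files[name] = rows
        live = (self.log[-1]["live"] if self.log else []) + list(add_files)
        # The single log append is the atomic commit point.
        self.log.append({"version": len(self.log), "live": live})

    def snapshot(self, version=None):
        if not self.log:
            return []
        entry = self.log[-1 if version is None else version]
        return [row for f in entry["live"] for row in self.files[f]]

t = ToyDeltaTable()
t.commit({"part-0": [1, 2]})
t.commit({"part-1": [3]})
print(t.snapshot())            # → [1, 2, 3]
print(t.snapshot(version=0))   # → [1, 2]  (time travel to an old version)
```

Delta Lake's actual log also records removed files, schema, and protocol versions, which is what enables time travel, schema enforcement, and conflict detection between concurrent writers on top of plain cloud storage.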
The lakehouse combines the best of data lakes and data warehouses. You get the openness and flexibility of a lake with the performance and governance of a warehouse.
— Matei Zaharia, CTO and Co-founder of Databricks
The Lakehouse Platform, as Databricks branded it, was not merely a product update. It was a competitive repositioning — a declaration that the warehouse-versus-lake dichotomy was a false choice, and that the correct architecture was a unified platform that could serve data engineers, data scientists, business analysts, and (eventually) AI applications from a single data layer.
The move was audacious because it required Databricks to become excellent at SQL — the thing Snowflake did best. And for years, Databricks' SQL capabilities were, to put it charitably, a work in progress. The company's traditional customers were Python-writing data scientists; its SQL engine was an afterthought. Building a SQL engine competitive with Snowflake's finely tuned query optimizer required years of engineering investment and a willingness to compete on terrain the rival had been cultivating since its founding.
By 2023, independent benchmarks showed Databricks SQL performing competitively with Snowflake on standard TPC-DS workloads — not definitively faster, but no longer embarrassingly slower. The gap had closed enough that CIOs could credibly evaluate Databricks as a warehouse replacement, not just a warehouse complement.
The lakehouse thesis carried a deeper strategic implication. If data lakes and warehouses converge, then the winner is whoever owns the broadest surface area of the data workflow — from ingestion to transformation to analytics to machine learning to AI model serving. Databricks was betting that the future of data infrastructure was not a best-of-breed stack of specialized tools but an integrated platform. The bet was not original — every enterprise software company makes this bet eventually — but the timing, aligned with the explosion of enterprise AI, would prove extraordinarily fortunate.
The Acquisition Engine
Ghodsi, who by the mid-2020s had established himself as one of the most strategically ambitious CEOs in enterprise software, built the Lakehouse Platform through a combination of organic R&D and aggressive acquisitions. The M&A strategy followed a clear pattern: identify the missing capability in the platform vision, acquire the best team building it (usually a small, technically excellent startup), and integrate the technology into the Databricks runtime.
Building the Lakehouse through M&A
2020: Acquired Redash, an open-source SQL query and visualization tool, adding business intelligence capabilities to the platform.
2023: Acquired MosaicML for $1.3 billion — a startup building tools for training large language models efficiently — signaling Databricks' push into generative AI infrastructure.
2024: Acquired Tabular for approximately $1.8 billion. Tabular was founded by the original creators of Apache Iceberg, the open table format that competed with Databricks' own Delta Lake.
2024: Acquired Lilac AI, a startup focused on data curation and enrichment for AI model training.
The MosaicML deal, in particular, was a defining move. Naveen Rao, MosaicML's co-founder and CEO — a former Intel executive who had sold his previous AI chip company, Nervana Systems, to Intel for $350 million in 2016 — had built a platform for training large language models at a fraction of the cost of doing it from scratch. The technology let enterprises fine-tune foundation models on their own data, using their own infrastructure, without sending sensitive information to OpenAI or Anthropic. Databricks paid $1.3 billion for a company with minimal revenue, signaling that the platform's future was not just data analytics but AI model development.
The Tabular acquisition was perhaps even more strategically significant, though less headline-grabbing. Apache Iceberg, the open table format created by Tabular's founders (Ryan Blue, Daniel Weeks, and Jason Reid, all former Netflix engineers), had emerged as a serious competitor to Delta Lake. By acquiring the Iceberg creators, Databricks neutralized a competitive threat and simultaneously positioned itself as the Switzerland of open table formats — supporting Delta Lake, Iceberg, and Apache Hudi, letting customers choose without penalty.
The acquisitions revealed Ghodsi's operating philosophy: better to own the center of the open ecosystem than to let a competitor establish a beachhead on any layer of the stack. Every acquisition expanded the surface area of the platform while keeping the core open enough to prevent the vendor lock-in narrative that Databricks had used against Snowflake for years.
The AI Inflection
The release of ChatGPT in November 2022 changed everything — not because it introduced new technology to Databricks' customer base, but because it created the executive-level urgency to act on technology that had been theoretically available for years. Suddenly, every Fortune 500 CEO wanted an AI strategy. And every AI strategy required data infrastructure.
The logic was simple and powerful: large language models are only as useful as the data they can access. A generic model trained on internet text can write marketing copy and summarize documents. But a model that can answer questions about your company's data — your customer records, your supply chain, your financial history, your proprietary research — requires a data platform that can serve that data to the model in real time, with proper governance, access controls, and quality guarantees. The enterprise AI stack, in other words, looked a lot like the Lakehouse Platform with a model-serving layer on top.
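The pattern described in that paragraph, governed retrieval feeding a model, can be sketched in a few lines. Everything here is hypothetical (the `RECORDS`, `GRANTS`, `retrieve`, and `build_prompt` names are invented for illustration and correspond to no real Databricks API): an access check runs before retrieval, so the context handed to the LLM never contains data the caller is not entitled to read.

```python
# Toy sketch of governed retrieval for enterprise AI. An access-control
# check and a retrieval step run against the data layer before any text
# reaches a language model. All names are illustrative, not a real API.

RECORDS = [
    {"table": "customers", "text": "Acme Corp renewed at $1.2M in Q3."},
    {"table": "payroll",   "text": "Executive compensation details."},
]
GRANTS = {"analyst": {"customers"}, "hr": {"customers", "payroll"}}

def retrieve(question, role):
    allowed = GRANTS.get(role, set())
    # Governance first: filter to tables this role may read, then run a
    # (deliberately trivial) keyword relevance match.
    return [r["text"] for r in RECORDS
            if r["table"] in allowed
            and any(w in r["text"].lower() for w in question.lower().split())]

def build_prompt(question, role):
    context = "\n".join(retrieve(question, role)) or "(no accessible data)"
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Did Acme renew?", "analyst"))
```

In a production stack the grant check would come from a catalog (Unity Catalog plays this role on Databricks) and the keyword match would be a vector or SQL query; the ordering, governance before generation, is the point.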
Databricks moved fast. In April 2023, the company released Dolly 2.0, an open-source large language model — not because Dolly was competitive with GPT-4, but because it demonstrated that enterprises could train their own models on their own data using Databricks' infrastructure. In March 2024, it launched DBRX, a mixture-of-experts model that achieved state-of-the-art performance among open models on several benchmarks. Neither model was the point. The point was the platform for building, fine-tuning, and serving models — what Databricks branded as "Mosaic AI," integrating the MosaicML technology into the broader Lakehouse ecosystem.
Every company is going to become a data and AI company. That's not a slogan — it's an architectural reality. Your AI is only as good as your data platform.
— Ali Ghodsi, Databricks Data + AI Summit 2024
The AI wave had a direct impact on Databricks' financials. Revenue growth, which had been decelerating toward the high 30s in percentage terms, re-accelerated past 50% in 2024. New customer wins increasingly cited AI workloads as the primary driver. And the average contract value expanded as existing customers layered AI model training and serving on top of their existing data engineering and analytics workloads.
The timing was either strategic genius or extraordinary luck — probably both. Databricks had spent a decade building a platform that unified data storage, processing, and governance. When the AI moment arrived, it had the infrastructure that every enterprise needed but almost no one had built. The company didn't have to pivot. It just had to extend.
The Open-Source Paradox
Databricks' relationship with open source is the central paradox of the business — the source of its competitive moat and the constraint on its pricing power, the thing that made the company possible and the thing that keeps its executives up at night.
The pattern is consistent across enterprise open-source companies: release a powerful technology as open source to drive adoption, build a managed service on top, and capture value from the subset of users who need the managed version. Red Hat did it with Linux. MongoDB did it with its database. Elastic did it with search. The model works until it doesn't — until a cloud provider takes the open-source technology and offers it as a service, capturing the value without contributing to the project.
Databricks experienced this threat directly. Amazon EMR (Elastic MapReduce) offered managed Spark clusters on AWS, effectively commoditizing the technology that Databricks had built its initial product around. Google Dataproc did the same. The cloud providers' message was blunt: why pay Databricks a premium for managed Spark when you can run Spark on our infrastructure at lower cost?
The lakehouse strategy was, in part, a response to this existential threat. By moving the value proposition from Spark-as-a-service to a unified platform with proprietary optimizations (Photon, Databricks' C++ query engine that replaced the JVM-based Spark SQL execution layer; Unity Catalog, its governance and metadata management system; and the integrated machine learning and AI serving capabilities), Databricks created layers of proprietary value on top of the open-source foundation. You could run open-source Spark anywhere. You could only run the full Lakehouse Platform on Databricks.
The tension persists. Every open-source release is simultaneously a community investment and a competitive risk. Delta Lake's open-sourcing created a massive ecosystem of compatible tools — and also meant that competitors like Microsoft (through its Fabric platform) could adopt Delta Lake without using Databricks. Unity Catalog's open-sourcing in 2024 was hailed by the community and questioned by investors who wondered whether Databricks was giving away too much.
Ghodsi's answer has been consistent: the open-source layer drives adoption, and the proprietary platform layer captures value. The bet is that the integration, the performance optimizations, the managed experience, and the AI capabilities create enough differentiated value to justify premium pricing even when the foundational components are free. It's a bet that has worked so far — $2.4 billion in revenue is substantial proof. But it requires constant innovation, an ever-expanding surface area of proprietary value, and the ability to stay ahead of both cloud providers and open-source competitors who are always one architectural layer behind.
The Private Company Gambit
Databricks is, as of early 2025, among the most valuable private technology companies in the United States. At $62 billion, its valuation rivals the public market capitalization of Workday and more than doubles the $28 billion Cisco paid to acquire Splunk. The decision to remain private, with $10 billion in fresh capital and no evident need for public market liquidity, is itself a strategic choice with consequences.
The advantages are significant. As a private company, Databricks can invest aggressively in R&D and market expansion without quarterly earnings pressure. It can absorb the losses inherent in its land-and-expand strategy — winning customers with small initial deployments and growing them over years into seven- and eight-figure annual contracts — without explaining negative free cash flow to public market analysts who fetishize short-term margins. It can make $1.3 billion acquisitions without shareholder votes. And it can recruit with equity grants whose value is tied to a valuation narrative (ever upward, if history is a guide) rather than a public stock price subject to macro sentiment.
The risks are equally significant. Employees holding illiquid stock options face personal financial constraints. The $10 billion raise, even at a premium valuation, diluted existing shareholders. And the longer Databricks stays private, the more compressed its eventual public market return may be — investors who buy at a $62 billion valuation need the company to be worth considerably more to generate venture-scale returns.
The Series J investor list — Thrive Capital (which led the round), Andreessen Horowitz, DST Global, GIC, Insight Partners, and WCM Investment Management — reads like a who's who of growth-stage tech investing, with a notable emphasis on crossover funds that invest in both private and public companies. Several of these investors reportedly purchased secondary shares from early employees, providing liquidity without an IPO and alleviating one of the major pain points of extended private company life.
Ghodsi has been deliberately ambiguous about IPO timing, saying only that Databricks will go public "when it makes sense." Translation: not until the AI narrative is fully priced into the revenue trajectory, and ideally not until margins demonstrate the leverage that public market investors demand. The whisper consensus among bankers and investors, as of early 2025, places an IPO in 2025 or 2026 — likely through a traditional offering rather than a direct listing, given the scale of capital that institutional investors will want to deploy.
The Ghodsi Method
Taking a company from academic research project to a $62 billion valuation in eleven years requires a particular kind of leadership. Ghodsi's is unusual in enterprise software — less the polished corporate operator and more the relentless systems thinker who happens to have discovered that building companies is the hardest distributed systems problem of all.
His management philosophy centers on what insiders describe as "aggressive transparency" — a willingness to share internal metrics, strategic debates, and even competitive anxieties with the entire company in ways that most CEOs would find reckless. All-hands meetings reportedly include detailed financial breakdowns that would be considered material nonpublic information at a public company. The logic: if you want engineers to make good decisions about what to build, they need to understand the business context with the same precision they bring to systems architecture.
The cultural DNA is distinctly academic. Databricks' engineering organization operates more like a research lab than a traditional software company — publication is encouraged, open-source contributions are celebrated, and the boundary between product development and research is deliberately blurred. The MosaicML acquisition brought in a team that had published extensively on efficient model training; those researchers continue to publish while simultaneously building commercial products. It's a model borrowed from Google's early years, when the lines between Google Brain, DeepMind, and the production engineering teams were productively porous.
The risk of this culture is diffusion — too many interesting problems, not enough focus on the products that generate revenue. Ghodsi has managed this tension by organizing the company into platform teams (focused on the core runtime, SQL engine, and infrastructure) and solution teams (focused on specific workloads like machine learning, streaming, and data governance), with clear revenue accountability at the solution level and technical autonomy at the platform level. It's a matrix structure that works when the CEO is technical enough to arbitrate disputes and commercially ruthless enough to kill projects that don't map to customer demand.
I tell the team: we're not a research lab that happens to have a product. We're a product company that happens to invest heavily in research. The distinction matters.
— Ali Ghodsi, interview with The Information, 2023
One signal of the culture's effectiveness: Databricks' employee retention rate has remained well above enterprise software industry averages even during the 2021–2022 period when talent competition reached absurd levels. Engineers who join for the research stay for the scale. Data scientists who join for the open-source community stay for the commercial impact. And executives who join for the growth trajectory stay because Ghodsi gives them enough autonomy to feel like founders within the larger machine.
The Cloud Provider Tightrope
Databricks' most important strategic relationships are also its most dangerous. The company runs on all three major cloud platforms — AWS, Microsoft Azure, and Google Cloud Platform — and each of these partners is simultaneously a distribution channel, an infrastructure provider, and a potential competitor.
The Microsoft relationship is the most complex and the most lucrative. In 2017, Databricks partnered with Microsoft to create Azure Databricks, a first-party service on the Azure platform. The deal gave Databricks access to Microsoft's enterprise sales force — the largest and most effective in the technology industry — and gave Microsoft a best-in-class data analytics offering to compete with AWS. Azure Databricks became Databricks' fastest-growing deployment and, by many estimates, accounts for a substantial share of the company's revenue.
But Microsoft launched Fabric in 2023 — a unified analytics platform that integrates Power BI, Azure Synapse Analytics, and a lakehouse layer built on Delta Lake (the open-source technology Databricks created). Fabric is, in architecture if not yet in capability, a direct competitor to the Databricks Lakehouse Platform. Microsoft can bundle Fabric with its E5 enterprise licenses, offer it at marginal cost, and distribute it through the same sales force that currently sells Azure Databricks.
The AWS relationship is simpler but no less fraught. Databricks runs natively on AWS and competes with Amazon Redshift (the warehouse), Amazon EMR (managed Spark), and Amazon SageMaker (machine learning). AWS has no obvious incentive to build a Databricks clone — the company generates significant AWS consumption — but the history of AWS building managed versions of open-source projects (ElastiCache, Amazon Elasticsearch, Amazon DocumentDB) suggests that no partner is safe forever.
Google Cloud, the smallest of the three relationships, represents both the least competitive tension and the least revenue. Google's BigQuery is a formidable warehouse competitor, but Google's enterprise sales motion is weaker than Microsoft's or AWS's, giving Databricks more independent operating room.
Ghodsi's approach to this tripartite tightrope has been to make Databricks indispensable enough on each platform that the cloud providers' incentive to compete is outweighed by their incentive to cooperate. The strategy works as long as Databricks continues to grow fast enough that the cloud consumption it generates exceeds what the providers could capture by building a competitor. The moment that calculus flips — because growth slows, or because a cloud provider's competitive offering reaches parity — the tightrope becomes a high wire.
The Race to Own Enterprise AI's Plumbing
By 2025, the competitive landscape around Databricks had expanded far beyond the Snowflake rivalry. The question was no longer "lake versus warehouse" but "who owns the data layer for enterprise AI" — a market that every major technology company wanted to dominate.
The competitors fell into three tiers. First, Snowflake, which had responded to the AI wave by acquiring Neeva (an AI-powered search startup founded by former Google SVP Sridhar Ramaswamy) for $160 million and launching Cortex AI, its own suite of LLM-powered analytics features. Snowflake was doing what Databricks had done in reverse — starting from the warehouse and expanding toward AI workloads, adding Python support and model-serving capabilities to a platform built for SQL.
Second, the cloud providers themselves — Microsoft Fabric, Google BigQuery with its Gemini integrations, AWS with its increasingly integrated analytics stack. These platforms had the advantage of bundling, distribution, and pricing leverage, but the disadvantage of fragmentation (each works only on its own cloud) and the innovator's dilemma (they couldn't cannibalize their existing warehouse revenue without internal political warfare).
Third, a new wave of startups attacking specific layers of the AI data stack — companies like Motherduck (serverless analytics), dbt Labs (data transformation), Pinecone (vector databases for AI retrieval), and dozens of others building point solutions for the AI era. Databricks' response was to absorb their functionality into the platform, either through acquisition or organic development — a strategy that worked when the startups were small and works less well as they grow.
The competitive position was, paradoxically, both stronger and more precarious than at any point in the company's history. Stronger because the AI wave had dramatically expanded the total addressable market and because Databricks' unified platform was, architecturally, better positioned for AI workloads than any competitor's. More precarious because the stakes had attracted the attention of the largest technology companies on Earth, each with orders of magnitude more resources than Databricks could deploy.
Competitive Landscape, 2025
Key competitors across Databricks' platform surface area
| Competitor | Primary Strength | AI Strategy | Threat Level |
|---|---|---|---|
| Snowflake | SQL analytics, warehouse dominance | Cortex AI, Neeva acquisition | High |
| Microsoft Fabric | Bundling, enterprise distribution | Copilot integration, Azure OpenAI | High |
| Google BigQuery | Serverless scale, Gemini integration | Vertex AI, native ML | Medium |
The Geometry of Data Gravity
The deepest strategic insight in Databricks' positioning is one that the company rarely articulates explicitly but that undergirds every product decision: data gravity is the most powerful force in enterprise technology.
The concept is borrowed from physics by way of Dave McCrory, who coined the term in 2010. Data gravity holds that as data accumulates in a location, it attracts applications, services, and more data — the digital equivalent of mass warping spacetime. Move a terabyte and it's trivial. Move a petabyte and it requires planning. Move an exabyte and you're effectively stuck.
Databricks' entire strategy is an exercise in data gravity engineering. The Lakehouse Platform is designed to be the location where data accumulates — not because Databricks locks it in (the data stays in open formats on the customer's cloud storage), but because the platform provides enough value that customers keep adding more data, more workloads, more users. Every new data source connected, every new ML model trained, every new dashboard built increases the gravitational pull. And once an organization's data and analytics are deeply embedded in the platform, switching costs become enormous — not because of lock-in but because of inertia, institutional knowledge, and the sheer complexity of migrating thousands of pipelines and models.
This is the distinction between technical lock-in (proprietary formats, contractual restrictions) and practical lock-in (the accumulated operational context that makes migration prohibitively expensive). Databricks has carefully avoided the former while assiduously cultivating the latter. It's the more defensible position — customers don't resent practical lock-in the way they resent technical lock-in, because they chose it through years of investment rather than having it imposed through contractual terms.
The geometry of the strategy becomes clear when you map it against the AI opportunity. Enterprise AI requires not just data but curated, governed, high-quality data — the kind that accumulates through years of data engineering on a platform like Databricks. A company that has spent three years building its data lakehouse on Databricks, with Unity Catalog governing access and lineage, with thousands of Delta Lake tables optimized and maintained, is not going to rip it out and rebuild on Snowflake or Fabric to run AI workloads. It's going to add the AI layer on top of what it already has.
Data gravity, in other words, turns a data platform into an AI platform — not through any product innovation but through the accumulated weight of the data itself.
The $10 Billion Question
The $10 billion in fresh capital raised in December 2024 was not a sign of financial need. Databricks was reportedly approaching cash-flow breakeven and could have reached profitability by constraining growth spending. The raise was a sign of strategic intent — a war chest for the AI era, positioned to fund acquisitions, geographic expansion, and the kind of aggressive R&D investment that a public company with quarterly earnings pressure would struggle to justify.
Thrive Capital's Josh Kushner led the round, reportedly after a competitive process involving multiple top-tier growth funds. The deal included unusual provisions — a 20% annualized IPO ratchet, meaning that if Databricks goes public at a valuation below the Series J price within a certain window, investors in the round receive additional shares to protect their returns. The ratchet signaled both confidence (Databricks believed it would IPO above $62 billion) and pragmatism (the investors were sophisticated enough to demand downside protection in a volatile market).
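The mechanics of an annualized ratchet can be sketched with stylized numbers. A minimal model, assuming the provision guarantees Series J investors a 20% compounded annual return and makes them whole in extra shares if the IPO prices below that threshold — the actual deal terms were not publicly disclosed in full, so every figure here is hypothetical:

```python
def ratchet_shares(investment: float, series_price: float,
                   ipo_price: float, years: float,
                   hurdle: float = 0.20) -> float:
    """Extra shares owed under a stylized annualized IPO ratchet.

    If the IPO price falls short of the Series J price compounded at
    the hurdle rate, investors receive enough additional shares to
    bring their stake up to the guaranteed value.
    """
    initial_shares = investment / series_price
    guaranteed_value = investment * (1 + hurdle) ** years
    ipo_value = initial_shares * ipo_price
    if ipo_value >= guaranteed_value:
        return 0.0  # IPO cleared the hurdle; no ratchet triggered
    return guaranteed_value / ipo_price - initial_shares

# Hypothetical: $100M bought at $100/share, IPO two years later at $110.
# The 20% hurdle requires $144M of value, so the ratchet issues
# roughly 309,000 extra shares to close the gap.
extra = ratchet_shares(100e6, 100.0, 110.0, 2.0)
```

The asymmetry is the point: the company gives up nothing if the IPO beats the hurdle, while investors are insulated from a down-round exit.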
The use of proceeds, to the extent that Databricks has discussed it, falls into three categories. First, continued platform development — specifically, the AI and machine learning capabilities that are driving the current growth re-acceleration. Second, international expansion — Databricks generates an estimated 30–35% of revenue outside North America and sees significant whitespace in Europe, Japan, and emerging markets. Third, and most consequentially, acquisitions — the Tabular and MosaicML deals demonstrated that Ghodsi is willing to pay premium prices for strategic assets, and the $10 billion gives him extraordinary firepower to continue.
The implicit message to Snowflake, to Microsoft, to AWS: we can outspend you on innovation for the next three to five years without needing to generate a dollar of profit. Whether that's courage or hubris depends entirely on whether the AI workload growth materializes at the pace Databricks is betting on.
The answer, as of early 2025, was that it was materializing faster than almost anyone had predicted. Databricks' consumption from AI-related workloads was reportedly growing at over 100% year-over-year, driven by customers using the platform for model training, fine-tuning, retrieval-augmented generation (RAG) applications, and feature engineering for ML systems. The AI wave was not theoretical for Databricks. It was hitting the revenue line.
On a desk in Databricks' San Francisco headquarters — a glass-walled floor in the Mission Bay district, overlooking the bay, a short walk from where the warehouses that stored physical goods once lined the waterfront — there sits, according to multiple employees, a simple framed printout. It reads: "12 exabytes per day." No context. No chart. Just the number, updated quarterly, a reminder that in the data business, mass is destiny.