The emergence and evolution of metadata management
Data growth can't come without complexity and risk, but a new wave of tools is transforming the way enterprises improve data quality, reliability, and security.
As cloud adoption increases, storage costs decline, and businesses increasingly recognize the mission-critical value data can deliver, a dramatic shift is happening: The volume of data is exploding across businesses small and large. As detailed in our Data Infrastructure Roadmap, an evolving set of vendors is providing the modern data stack to support this growth. These tools have made it easier than ever to move, store, transform, query, analyze and visualize data and, as a result, businesses now have more data sources, pipelines, workflows, data stores, models, and dashboards than ever before.
As businesses generate more data, and more varied types of data, the risks grow as well. Given that data sits at the center of so many processes and decisions, errors in data (or in how data is used) often have a cascading effect. With this risk in mind, companies are turning to a range of metadata management tools to ensure their data infrastructure can reliably operate at scale, and that individuals and systems can find and understand data, and trust its quality, reliability, and security.
Metadata management tools facilitate the governance and monitoring of data, ensuring data producers and consumers can access, analyze and operationalize data, while minimizing the time associated with doing so, including time spent cleaning, organizing, or debugging data. These technologies also help reduce the costs—both financial and reputational—associated with the poor decisions or experiences that commonly result from data errors, leaks, and misuse.
Within core data infrastructure, the jobs to be done across the data pipeline are relatively clear—moving, storing, transforming, querying, analyzing, and visualizing data. However, for metadata management (sometimes referred to as “DataOps” due to its similarities to DevOps) the goals are broader and less clear cut. These goals may encompass anything that allows a company to effectively and reliably make use of its data.
And as with DevOps, DataOps does not simply involve a specific tool serving the well-defined needs of one type of user within a company (e.g., Looker for helping data analysts analyze data). Instead, it includes a system of people with defined roles and responsibilities, governed by standards and policies, following specific processes, and leveraging a number of tools to make use of data.
Given the complexity and relative nascency of this category, there are several companies with overlapping functionality that are each vying to name and define the category. While the terms and their definitions will continue to evolve, we’ve arrived at two (admittedly broad, interconnected, and imperfect) buckets that seem to matter most in metadata management:
- Data governance, a term that’s been around for several years, includes tools that help users find, understand, access, and secure data.
- Data monitoring, a newer term (and largely interchangeable with “observability” and “reliability”), includes tools that help users trust the data, including its quality and reliability.
While these products haven’t yet reached the level of sophistication of their software engineering counterparts, they are improving rapidly.
Below we’ll cover some of the main companies and products in each category.
New regulation supporting data and consumer privacy, such as the EU's GDPR and California's CCPA, has underscored the importance of data governance, the goals of which include improving the discoverability, usability, accessibility, and security of data. Because governance sits at the intersection of physical systems, data models, and processes, effective governance is extremely difficult both to define and to implement. It’s even more challenging because data governance products need to adapt to the unique environments, cultures, and demands of their customers. However, with tricky problems comes ample opportunity, and many startups have recognized this opportunity over the years and built products to address one or several of these goals.
The central component of most governance products is the data catalog (sometimes called a “data dictionary”). A data catalog is a centralized list of a company’s data assets that exist across various sources, business applications, and BI tools. Data catalogs, at a minimum, will include descriptions of the data assets, their sources, and key owners and users. Ultimately, data catalogs help analysts easily find, understand, and access the data they need to do their jobs.
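To make the minimum contents of a catalog concrete, here is a small Python sketch of a catalog entry and a keyword search over it. The class, field names, and sample assets are all hypothetical and chosen for illustration; they do not reflect the schema of any product mentioned in this article.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset with the minimum fields described above (hypothetical schema)."""
    name: str
    description: str
    source: str
    owners: list[str] = field(default_factory=list)
    users: list[str] = field(default_factory=list)


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose name or description mentions the term (case-insensitive)."""
    needle = term.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


# Made-up assets, for illustration only.
catalog = [
    CatalogEntry(
        name="orders_daily",
        description="Daily order totals by region",
        source="warehouse.analytics",
        owners=["data-eng"],
        users=["finance", "growth"],
    ),
    CatalogEntry(
        name="web_sessions",
        description="Raw clickstream sessions",
        source="lake.events",
        owners=["platform"],
        users=["growth"],
    ),
]

for entry in search(catalog, "order"):
    print(entry.name, entry.source, entry.owners)
```

Real catalogs populate these fields automatically from warehouses and BI tools rather than by hand, which is where much of the product differentiation lies.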
Alation and Collibra are two mature growth-stage companies that have been providing data catalog products for about a decade, and were originally built for large enterprises with their complex monolith architectures and Hadoop implementations. However, each company has meaningfully innovated on its base offering since. For example, Alation’s product uses machine learning to surface insights on how data is used, and leverages that usage data to identify subject matter experts, discern business terms from technical terms, and automatically create a business glossary that promotes semantic consistency and a common understanding of data. Collibra, aiming to serve the related governance needs of its customers, offers a data catalog with embedded privacy, accessibility, and quality controls.
More recently, a new cohort of data catalog providers has emerged, including Stemma (from the creators of Amundsen at Lyft), Acryl Data (from the creators of DataHub at LinkedIn), Metaphor (from contributors to DataHub at LinkedIn), and Atlan. Other notable examples of in-house solutions from large tech companies include Airbnb’s Dataportal, Facebook’s Nemo, Netflix’s Metacat, and Uber’s Databook. Each of these is built for a cloud-first world and aims to work seamlessly with the products that make up the modern data stack, such as Snowflake, Looker, dbt, and Airflow. While each solution differs in notable ways (e.g., level of native collaboration enabled, analytics provided, integrations, etc.), they all share a focus on easing deployment and shortening time-to-value through cloud deployments and automation of traditionally manual documentation processes.
Another key component of data governance is privacy, an area that has become critical because of recent regulations like GDPR and CCPA. Given the breadth of this subcategory alone, Bessemer has an entire roadmap devoted specifically to data privacy. For example, BigID* helps businesses comply with data privacy regulation through its intelligence platform, which finds, analyzes, and de-risks personally identifiable information (PII), allowing enterprises to understand where their sensitive data lives, at scale.
A set of related providers, Okera*, Immuta, and Privacera, exist to secure access to sensitive data once it is identified. And we’re seeing alliances emerge among these data privacy leaders: Given their complementary capabilities, last year BigID and Okera formed a partnership to help their joint customers automate the discovery of sensitive data and the enforcement of data access policies.
One of the more recent subcategories of data governance is data lineage. Data lineage providers help customers trace and follow the path data takes through pipelines, databases, models, and analytical tools to better identify where issues might arise and to minimize response and remediation time. For example, to aid in incident investigation and response, Manta* provides complete visibility into the flow of data within an organization (and the complex web of interdependent relationships among datasets).
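The idea of tracing a path through interdependent datasets can be sketched as a walk over a dependency graph. The Python example below is a minimal illustration under stated assumptions: the asset names and edges are invented, and the breadth-first traversal is a generic technique, not a description of how Manta or any other vendor implements lineage.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it reads from.
UPSTREAM = {
    "revenue_dashboard": ["orders_daily"],
    "orders_daily": ["orders_clean"],
    "orders_clean": ["raw_orders", "raw_customers"],
    "raw_orders": [],
    "raw_customers": [],
}


def trace_upstream(asset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset the given one depends on."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(upstream.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(upstream.get(current, []))
    return order


def downstream_of(asset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Every asset that directly or indirectly reads from the given one."""
    return sorted(a for a in upstream if asset in trace_upstream(a, upstream))


# Where could a bad number on the dashboard have come from?
print(trace_upstream("revenue_dashboard", UPSTREAM))
# What is affected if raw_orders breaks?
print(downstream_of("raw_orders", UPSTREAM))
```

The upstream walk supports root-cause investigation, while the downstream view supports impact analysis, the two uses of lineage described above.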
To expand on the DevOps analogy, companies in the data monitoring category are akin to companies like Datadog, New Relic, and Splunk in application performance management (APM). Data monitoring aims to do for data what those companies do for applications and infrastructure. Players in data monitoring (most notably, Monte Carlo) have coined terms that borrow from APM, such as “data downtime” (a play on “application downtime”) and “data reliability engineering” (a play on “site reliability engineering”).
There are myriad ways to describe the important but somewhat amorphous set of processes that comprise data monitoring. However, the main focus for companies in this category is to improve the trust users have in their company’s data and the decisions and experiences that it drives, by reliably identifying, remediating, and ultimately preventing data quality issues.
Data monitoring tools differ in a number of notable ways, including whether or not they are open source, what is being monitored, how the monitoring occurs (rules-based vs. automated/ML), and how alerting happens (native vs. via integrations). However, it’s best to understand these offerings and their differences by the type of customer they each serve.
For individual data scientists and analysts who might just want basic unit tests to profile their data, there are a few open-source options, including Great Expectations, dbt, and Amazon's Deequ. These tools operate at the code level, lending themselves to an open-source model. In the case of Great Expectations, the tests are written in a declarative fashion: for example, expressing in code the statement that “there are no values missing from a row” and determining whether this is true or false.
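A declarative check of this kind can be sketched in a few lines of plain Python. This is an illustration of the idea only: the function name, result shape, and sample rows are assumptions made for this example, and it deliberately does not use the Great Expectations, dbt, or Deequ APIs.

```python
# Made-up sample rows; the second one is missing its region.
rows = [
    {"order_id": 1, "region": "EU", "amount": 40.0},
    {"order_id": 2, "region": None, "amount": 12.5},
    {"order_id": 3, "region": "US", "amount": 99.0},
]


def expect_no_missing_values(rows, columns):
    """Evaluate the statement 'no values are missing' for the given columns.

    Returns whether the statement holds, plus the (row index, column)
    pairs that violate it.
    """
    failures = [
        (index, column)
        for index, row in enumerate(rows)
        for column in columns
        if row.get(column) is None
    ]
    return {"success": not failures, "failures": failures}


result = expect_no_missing_values(rows, ["order_id", "region", "amount"])
print(result)
```

The caller states what should be true of the data and gets back a true-or-false verdict with the offending rows, rather than writing the inspection logic each time.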
Acceldata, on the other hand, targets site reliability engineers (SREs) and data engineers in large legacy enterprises that have petabytes of data and might be running both Hadoop and Redshift. Its platform monitors data as it moves from source to output and tracks compute/resource outages, growth in data volumes, changes in data type or expired data, and missing values.
Monte Carlo, Anomalo, Sifflet and Lightup are specifically focused on companies with a cloud-native architecture and modern data stack (e.g., Fivetran, dbt, Snowflake, Looker). They track data across pipelines, warehouses, data lakes and BI tools, while aiming to be fast and seamless in deployment. They also rely more heavily on machine learning, rather than a rules-based engine, to automatically identify and prevent data quality issues.
Furthermore, there are companies like Databand* that monitor data pipelines from source to data lake, and specifically target mature data-intensive companies that are not just interacting with Fivetran data sources, but dozens of sources from different providers that are all pulled into a data lake in an unstructured form, where data is extensively processed and staged before moving downstream. Their customers’ more sophisticated data stack might include S3, Airflow, Spark and Snowflake.
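The difference between a rules-based check and one learned from the data's own history can be illustrated with a toy example. In the Python sketch below, a simple z-score over past row counts stands in for the machine learning these vendors use; the function names, threshold values, and row counts are all assumptions for illustration, not any product's actual method.

```python
from statistics import mean, stdev


def rule_check(row_count: int, minimum: int) -> bool:
    """Rules-based: a person picks the threshold ahead of time."""
    return row_count >= minimum


def zscore_check(history: list[int], row_count: int, limit: float = 3.0) -> bool:
    """History-based: pass if the new count is within `limit` standard
    deviations of the mean of past counts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return row_count == mu
    return abs(row_count - mu) / sigma <= limit


# Made-up daily row counts for one table.
history = [1000, 1020, 990, 1010, 1005, 995, 1015]
today = 600

print("rule passes:", rule_check(today, minimum=500))
print("z-score passes:", zscore_check(history, today))
```

Here the hand-set rule (at least 500 rows) lets a 40% drop through, while the history-based check flags it without anyone having chosen a threshold for this table. Production systems account for seasonality, trends, and many more signals than row counts.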
Given the particular needs of different customers, we believe there’s potentially room for multiple large companies to be built in data monitoring, each serving a distinct customer profile. However, the common goal among these different solutions is to increasingly automate identification of data issues, share the relevant contextual information, and chart a path to efficient remediation and ideally future prevention.
There are several companies offering standalone point solutions in the categories of data catalog, data privacy, data lineage, and data monitoring. And as we’ve witnessed with the emergence of the canonical data stack, the unbundling of monolithic platforms into best-of-breed products has allowed companies to benefit from the focused innovation and flexibility this trend delivers. Companies of different sizes, with unique tech stacks, and varying levels of data team maturity and sophistication are able to leverage the products that best suit their needs, technology environments, and cultures.
However, given the interrelated needs served by these products (i.e., improving ease of access and use of data, and trust in the system’s ability to reliably perform at scale), the value of an “all-in-one” platform that provides a single source of truth to meet metadata management needs is compelling to many prospective customers. Companies are already capitalizing on this opportunity.
Though Collibra was founded in 2008 and built for a legacy, on-premises environment, it has evolved since and today offers the Collibra Data Intelligence Cloud, which is meant to serve as the “system of engagement for data,” combining the catalog, lineage, privacy, quality and other governance capabilities into a single integrated platform. Atlan has a similar vision and platform, built from the beginning for a cloud-first world and for teams fully utilizing the modern data stack.
What’s key for these players and others in building an integrated solution is that they first capture the full range of data assets within a company, including SQL queries, models, notebooks, tables and dashboards. Second, they need to be designed with the diversity of users in mind, integrating into their various workflows and associated tools, and promoting cross-functional collaboration. While the players here offer different levels of extensibility, depending on whether they see an integrated suite or best-of-breed tooling as optimal, each of them needs seamless interoperability with adjacent products.
Given the scale and complexity of metadata management, we expect to see new solutions join this crowded though still nascent market, alongside the likes of Collibra and BigID. With so much opportunity available, we’re excited to meet with founders who are championing this current wave of metadata management and supporting their customers in building a data system on which they can rely.
If you’re building in this area, we want to hear from you. Reach out to us by email at firstname.lastname@example.org.
*Indicates Bessemer portfolio company