Looking for a data lineage tool? These are the key “gotchas” and features you should be asking about.
Data lineage can be a mess.
Think of it like knitting a blanket. There are threads coming and going from every direction, far too many to count. All of these have to come together perfectly in an intricate pattern. If you get it right, it’s art. If even one element gets out of line, it’s chaos.
Lineage is hard to get right because of the sheer number of variables at play — data flowing from a variety of sources (both ever-changing old ones and the latest new ones), transformations at every stage, complex language involved in naming and describing data assets, different styles for writing data logic and code, and much more.
As difficult as this is, we can’t give up. Lineage is indispensable in the data team toolbox, revealing data flows and powering important use cases like impact analysis, root cause analysis, governance, and compliance.
Here are 14 questions to ask during your search for the right data lineage tool to fully assess its depth (number of unique sources supported), breadth (number of fields or objects supported for each source), and utility (ability to power insights and actions across diverse data personas).
1. Does it automatically parse SQL queries?
Many data platforms provide a lineage API, so it’s easy for any lineage system to ingest and use lineage from those sources. However, not every platform does this. Automatic SQL parsing is critical to plug these gaps and ensure that your lineage is complete, covering all data sources, processes, and assets.
If you only parse SQL at the warehouse layer, SQL queries from sources without native query history (e.g. relational databases like PostgreSQL and MySQL) will slip through the cracks.
Look for the ability to read a dump of SQL queries from source systems that don’t possess a “query history” feature.
2. Which types of SQL statements are supported?
To avoid gaps in your lineage, it’s important to parse and register lineage from various types of SQL statements:
CREATE TABLE
CREATE TABLE AS SELECT
CREATE VIEW
MERGE
INSERT INTO
UPDATE
Most SQL parsers support SQL CREATE
and, in some cases, MERGE
statements. However, many don’t support INSERT INTO
and UPDATE
statements. These account for most transformations in data warehouses, so they are important for full lineage coverage.
Look for lineage that can also parse MERGE
, INSERT INTO
, and UPDATE
statements.
3. Does the lineage API support programmatic lineage creation and retrieval?
The data ecosystem is constantly evolving, and new data sources are emerging all the time. Programmatically processing lineage from unsupported sources (via an open API) is key to scaling lineage without worrying about which new platforms you can and can’t adopt.
Look for two key features:
- Ability to retrieve and create lineage programmatically via an API.
- Ability to publish and retrieve table and column-level lineage across any object type.
4. Is lineage a native capability, or is it provided by an external partnership?
In lineage, dealing with edge cases is the norm, and new edge cases often require custom features or support from your lineage vendor.
Lineage is often packaged as part of a larger catalog or anomaly detection product. Sometimes this lineage is natively available and supported by the product’s team. However, sometimes it comes via an external partnership, which can lead to slower help and fixes.
Look for three key features:
- Whether the lineage capability is natively supported or externally provided.
- Whether the product’s team has direct control over the lineage development.
- Clear SLAs for support and an engineering dependency matrix (if there is an external dependency).
5. How is it future-proofed against changes in the modern data stack?
Data transformation tools and processes are always evolving. Even as customers switch from legacy stacks to the latest data tools, lineage should always stay reliable.
Pulling lineage farther away from a data source — e.g. from within the transformation process — can lead to problems if the source system changes. Pulling lineage from as close to the source as possible is often safer and more future-proof.
Look for lineage that pulls from a source system’s query history (e.g. natively from Snowflake) rather than integrating with a downstream transformation tool or process.
6. Does it have cloud-native flexibility to scale up SQL parsing demands?
In lineage, it’s easy to end up with large-scale SQL parsing demands. (We’ve personally seen customers with over one million queries per day.) Parsing these queries takes significant computational resources, so it’s important that your lineage can keep up.
Cloud-native products use the latest design patterns and microservices invented by companies like Netflix for limitless scalability. Beware of platforms that weren’t built for the cloud or have legacy tech debt — they will be hard to maintain, leading to performance problems as your lineage scales.
Look for modern, cloud-native architecture that supports SQL parsing at scale.
7. Does it offer lineage down to the column level?
Table-level lineage is considered “table stakes”, but column-level lineage should be too. It’s crucial for a range of use cases:
- Tracing sensitive data classifications for transformed PII data
- Impact analysis from things like schema changes
- Root cause analysis — e.g. investigating why a dashboard looks off by tracing a BI field to upstream columns in the data warehouse
Without the ability to dive into granular columns or field lineage, data engineers and analysts may miss key depth during their investigations.
Look for two key features:
- Native column-level experience in the UI, including viewing graph linkages at the column level.
- Support for
MERGE
,INSERT INTO
, andUPDATE
SQL statements, which are key for column-level transformations.
8. Does it automatically connect upstream SQL sources with downstream BI assets?
Often, the goal of lineage is to identify why something at the last mile doesn’t look right. As the managers for company data, data engineering teams are responsible for making sure the data that feeds end-user assets is trustworthy and reliable. When this fails, lineage is a crucial diagnostic tool.
Not all lineage will natively connect to your chosen BI tool (e.g. Looker, Tableau, Power BI, etc). Some rely on time-consuming manual scripts and asset pushing, even for major BI tools.
Look for either native connectors or automated scripts that automatically connect to your BI tool of choice.
9. Does it support field-level lineage for BI dashboards?
Anyone doing root cause analysis needs to dive into an incorrect field (i.e. dimension, measure, calculated field, etc.) in the dashboard, and work backwards to zero in on the upstream fields or columns that are broken. This is only possible with field-level lineage for the BI tool.
Field-level lineage is also important for impact analysis. If a data engineer is trying to make a schema change, they need to understand the specific downstream columns and fields that will be affected — not just which dashboards will be affected in some unspecified way.
Some platforms support lineage for a few fields, but don’t go deep with BI fields that are crucial for these types of analysis.
Look for two key features:
- Coverage of both column-level lineage for SQL sources and BI field-level lineage.
- Whether your BI tool’s objects are supported and exposed in lineage. (E.g. in Looker, will lineage cover all the fields/objects you care about, such as Dashboards, Looks, Explores, Tiles, Fields, and Views?)
10. Can it create upstream lineage with data in Salesforce (at both the object and field levels)?
We often hear that Salesforce is the “Wild West” and no one knows what is happening with that data in the ETL pipeline. However, opening up Salesforce (and other critical SaaS source systems) can be a game-changer for helping data and business teams to collaborate. Impact analysis is a major use case here, since Salesforce fields get changed all the time and wreak havoc downstream.
If it’s available, make sure to investigate the depth of Salesforce lineage. Some lineage starts at the storage layer (i.e. data warehouse, lake, etc). Can lineage be generated upstream of the storage layer for SaaS tools like Salesforce?
If so, how deep does it go? Some systems can’t go down to Salesforce object and field levels, which are crucial for making downstream lineage useful and understanding context for downstream data assets.
Look for object and field-level lineage from Salesforce down to the data warehouse layer.
11. Does it natively integrate with modern data integration tools?
Building lineage upstream of a data warehouse is hard. Doing so at scale, especially if you draw data from multiple source systems, is even harder.
If you follow an ELT approach, it’s important that your lineage can connect with modern data integration tools like Fivetran. This lets you build upstream lineage, creating true end-to-end lineage and showing what happens with data before it enters the storage layer.
Look for whether it natively connects to Fivetran or other modern data integration tools.
12. Can it integrate with Databricks and generate lineage for Spark jobs?
Spark lineage is difficult to generate. But if you use Databricks, this is key to unlocking visibility into your transformations and creating usable lineage to help data scientists, engineers, and analysts with ML and analytics workloads in Databricks.
Look for two key features:
- Whether it ingests lineage from Databricks’ Unity Catalog API (which includes Spark, Scala, and SQL)
- Whether it supports field-level lineage in BI tools downstream of Databricks.
13. Does it incorporate other types of metadata to give additional context for assets in the lineage graph?
In isolation, lineage only tells part of the story and, therefore, only provides part of the value. Lineage becomes actionable when it’s combined with key metadata and context:
- Operational metadata: How and when were assets orchestrated?
- Quality and anomaly metadata: What state are the assets in? Are they reliable?
- Business/semantic metadata: How do the assets link to key business terms or KPIs?
- Owner and expert metadata: Who should you contact or collaborate with during troubleshooting?
- Social metadata: What is the human context for this asset — e.g. relevant Slack discussions or Jira tickets about the asset? This is what machines alone will miss.
Sometimes lineage graphs appear as yet another a siloed view. Without the other metadata for these assets, it can be hard to put lineage in context.
Look for three key features:
- Openness: An “open by design”, extensible platform where you can harvest data and metadata from any source via APIs (including custom-built connectors).
- Flexibility: Support for a wide range of technical, operational, anomaly/quality, and business/semantic metadata from these sources.
- Personalization: A personalized data experience, where each persona sees the metadata that is right for them, rather than drowning in all the metadata.
14. Can it be used not just to investigate issues, but also to drive action programmatically?
In addition to enabling data people’s work, lineage can also enable automated system actions and workflows.
For example, if an upstream table has data quality issues, it’s important to automatically add announcements to downstream BI dashboards. This keeps business users from creating “Garbage In, Garbage Out” analysis, and saves data analysts and engineers from manually sending alerts or warnings.
Some platforms don’t have the underlying architecture and scalability to perform automated actions based on lineage.
Look for open APIs, the ability to build or customize automated workflows, and the ability to read metadata-change events and trigger changes in linked assets across the lineage graph.
This blog was co-written with Mark Pavletich (Director of Sales Engineering) and Swaminathan Kumar (Strategy & Intelligence).
Header photo: Crawford Jolly on Unsplash.
This blog was originally published on Towards Data Science.