What is Data Lineage and How Can It Ensure Data Quality?

By providing a clear representation of the flow of data, a data lineage system essentially allows you to have your cake and eat it too.

By Michael Bogan · Nov. 10, 21 · Analysis

Introduction

Are you spending too much time tracking down bugs for your C-level dashboards? Are different teams struggling to align on what data is needed throughout the organization? Or are you struggling with getting a handle on what the impact of a potential migration could be?

Data lineage could be the answer you need for data quality issues. By improving data traceability and visibility, a data lineage system can improve data quality across your whole data stack and simplify the task of communicating about the data that your organization depends on.

Hold on, though: what exactly is data lineage?

What is Data Lineage?

Data lineage is a representation of the flow of data through different systems and transformations. In a modern data stack, data is not stored solely in application databases; this data flows from one application to another and from application databases to data warehouses, where it gets transformed and eventually consumed by any number of reporting tools and other downstream applications.
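One way to picture this representation is as a directed graph whose nodes are datasets and whose edges mean "feeds into". The sketch below is a minimal illustration under that assumption; the dataset names and the `upstream` helper are hypothetical and not taken from any particular tool.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
LINEAGE = {
    "app_db.orders": ["warehouse.stg_orders"],
    "app_db.customers": ["warehouse.stg_customers"],
    "warehouse.stg_orders": ["warehouse.fct_revenue"],
    "warehouse.stg_customers": ["warehouse.fct_revenue"],
    "warehouse.fct_revenue": ["dashboard.monthly_revenue"],
}


def upstream(node, lineage):
    """Return every dataset that directly or indirectly feeds `node`."""
    # Invert the "feeds into" edges so we can walk from consumer to source.
    parents = {}
    for source, targets in lineage.items():
        for target in targets:
            parents.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    # Lists the warehouse tables and application tables behind the dashboard.
    print(sorted(upstream("dashboard.monthly_revenue", LINEAGE)))
```

Walking the graph backwards from a dashboard answers the reporting team's question of where a number comes from without anyone digging through version control.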

This flow of data allows each system to access data in a format that makes sense for it. The source applications can optimize for the performance of read-write transactions. At the same time, reporting clients can access denormalized data, which is convenient for querying.

Unfortunately, this convenience comes at the cost of traceability and visibility. Once the data leaves the source database and undergoes any number of transformations, an additional layer has been added which can obscure the underlying data. Reporting teams often struggle to understand where their data is coming from or to determine the right data to use for a given report. When they ask the application team, the team might tell them that the data doesn’t exist, because the terms used to refer to a piece of data have changed during the transformation process.

Additionally, solving any bugs or problems takes longer and requires the involvement of three teams: the reporting team, the data warehouse team, and the application team. Typically, the burden of solving the issue falls onto the data team, who then need to dig through version control to try and understand why the problem arose in the first place. This slows down the development of new reports as well.

Data lineage solves these issues. Let’s discuss how.

Why use Data Lineage?

By providing a clear representation of the flow of data, a data lineage system essentially allows you to have your cake and eat it too. You can have both the separation of roles and the performance of a data warehouse while still having clear data understanding and traceability across all your systems and teams.

Clear data understanding and traceability allow you to trace important data across the system. For example, this can allow you to verify that no personally identifiable information (PII) is leaving the application systems and being consumed where it should not be. It also allows you to see what data is frequently consumed downstream, giving visibility into the impact of any potential changes or migrations. Similarly, you can identify any unused information, allowing simple cleanup of unused tables or columns.
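As a rough sketch of the PII and cleanup checks described above, assume lineage is available as column-level edges. The column names, the `report.` prefix convention, and all helper functions below are hypothetical.

```python
# Hypothetical column-level lineage: each column maps to the columns built from it.
COLUMN_EDGES = {
    "app.users.email": ["dwh.dim_users.email"],
    "dwh.dim_users.email": ["report.campaigns.recipient"],
    "app.users.country": ["dwh.dim_users.country"],
    "dwh.dim_users.country": ["report.signups.country"],
    "app.users.legacy_flag": [],
}

# Columns that someone has marked as personally identifiable information.
PII_COLUMNS = {"app.users.email"}


def downstream(column, edges):
    """Return every column derived, directly or indirectly, from `column`."""
    reached, stack = set(), [column]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached


def pii_exposures(edges, pii_columns, restricted_prefix):
    """Return columns under `restricted_prefix` that are derived from PII."""
    return {
        column
        for pii in pii_columns
        for column in downstream(pii, edges)
        if column.startswith(restricted_prefix)
    }


def unused_columns(edges, consumer_prefix):
    """Return columns nothing reads from, ignoring final consumers."""
    nodes = set(edges) | {c for targets in edges.values() for c in targets}
    return sorted(
        node
        for node in nodes
        if not edges.get(node) and not node.startswith(consumer_prefix)
    )


if __name__ == "__main__":
    print(pii_exposures(COLUMN_EDGES, PII_COLUMNS, "report."))
    print(unused_columns(COLUMN_EDGES, "report."))
```

In this toy data, the email address reaches a report column, and `app.users.legacy_flag` has no consumers, so it is a candidate for cleanup.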

By increasing data understanding, data lineage systems reduce incident response times and improve team communication. Instead of confused discussions about where a piece of data in a report comes from, the data lineage system makes it clear to all parties where the data comes from and how it is consumed. This speeds up both resolution of any errors and new development.

Now that we know why data lineage is critical for the modern data stack, let’s look at the various types of data lineage systems.

Types of Data Lineage

There are two main categories of data lineage systems: active and passive.

An active data lineage system is “active” because you must create it yourself. This is done by programming the relevant source and transformation information into the system or tagging your data with the appropriate metadata. One example of an active system is Apache Atlas. A properly configured active data lineage system can provide traceability for your data to a very fine degree of detail. However, in order to gain those benefits, constant updating and maintenance are required. This adds complexity to your overall data infrastructure and can be time-consuming.
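To make the maintenance cost concrete, here is a minimal sketch of the "program it yourself" approach: each transformation declares its inputs and outputs by hand. The decorator, registry, and table names are hypothetical, and this is not the Apache Atlas API.

```python
# Hand-maintained lineage registry (hypothetical; not tied to any real tool).
LINEAGE_REGISTRY = []


def declares_lineage(inputs, outputs):
    """Record which datasets a transformation reads and writes."""
    def decorator(func):
        LINEAGE_REGISTRY.append(
            {"job": func.__name__, "inputs": list(inputs), "outputs": list(outputs)}
        )
        return func
    return decorator


@declares_lineage(inputs=["app_db.orders"], outputs=["warehouse.daily_revenue"])
def build_daily_revenue(orders):
    """Sum order amounts per day."""
    totals = {}
    for order in orders:
        totals[order["day"]] = totals.get(order["day"], 0) + order["amount"]
    return totals


if __name__ == "__main__":
    print(LINEAGE_REGISTRY)
```

The registry is only as accurate as the declarations. If someone later changes `build_daily_revenue` to read from another table and forgets to update the decorator, the recorded lineage silently goes stale, which is the upkeep burden described above.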

In contrast, a passive data lineage system attempts to understand your data on its own. Some passive systems look at the data coming from the data warehouse. Through pattern recognition, a passive system attempts to recognize where that data is coming from and how it is being transformed. While this can work well for simpler data sets and transformations, it is inexact and can generate inaccurate results.
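The following toy heuristic illustrates why pattern recognition is inexact. It guesses each target column's source by the overlap of their values (Jaccard similarity). The heuristic, threshold, and sample data are assumptions made for illustration, not a description of how any specific product works.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def guess_sources(source_columns, target_columns, threshold=0.5):
    """Guess, for each target column, the source column with the best overlap."""
    guesses = {}
    for target_name, target_values in target_columns.items():
        best_name, best_score = None, 0.0
        for source_name, source_values in source_columns.items():
            score = jaccard(source_values, target_values)
            if score > best_score:  # ties keep the first candidate seen
                best_name, best_score = source_name, score
        if best_score >= threshold:
            guesses[target_name] = best_name
    return guesses


# Hypothetical samples of column values.
SOURCES = {
    "orders.customer_id": [1, 2, 3, 4],
    "orders.quantity": [1, 2, 3, 5],
    "customers.id": [1, 2, 3, 4, 5],
}
TARGETS = {
    "report.customer_id": [1, 2, 3, 4],
    "report.units_sold": [1, 2, 3],  # actually derived from orders.quantity
}

if __name__ == "__main__":
    print(guess_sources(SOURCES, TARGETS))
```

Here `report.customer_id` is matched correctly, but `report.units_sold` overlaps equally with `orders.customer_id` and `orders.quantity`, so the guess lands on the wrong source. Small integers look alike regardless of what they mean.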

A parsing-based system is another kind of passive data lineage system, one which generates lineage data by reverse-engineering your data warehouse. Rather than entering the lineage data manually (active systems) or guessing based on data patterns (pattern recognition), a parsing-based data lineage system can see exactly where the data came from and how it is being consumed. Datafold is an example of this type of system. Datafold analyzes all the SQL code in your data warehouse and generates column-level lineage graphs. This is significantly more detailed than table-level lineage and allows you to see exactly which column a given piece of data is sourced from and where it is consumed. This level of detail leads to improved outage response time, allows for faster troubleshooting, and decreases the frequency of breaking changes making it to production.

With numerous data warehouse integrations, Datafold is plug-and-play for many, and the generated lineage data is also accessible through the Datafold API. As long as it has support for your data warehouse and related systems, a parsing-based data lineage system is the easy choice from an implementation and maintenance standpoint.
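To show the parsing-based idea in miniature, the sketch below extracts column-level lineage from one very restricted form of SQL: a `CREATE TABLE ... AS SELECT` over a single table with plain columns and optional `AS` aliases. It is a toy regular-expression parser written for this illustration. It is not how Datafold works, and a real system would need a full SQL parser to handle joins, expressions, and subqueries.

```python
import re

# Matches only: CREATE TABLE <target> AS SELECT <columns> FROM <source>
CTAS = re.compile(
    r"CREATE\s+TABLE\s+(?P<target>[\w.]+)\s+AS\s+"
    r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def column_lineage(sql):
    """Map each target column to the source column it is selected from."""
    match = CTAS.search(sql)
    if match is None:
        raise ValueError("unsupported statement")
    target, source = match["target"], match["source"]

    lineage = {}
    for item in match["cols"].split(","):
        # "a AS b" yields ["a", "b"]; a bare "a" yields ["a"].
        parts = re.split(r"\s+AS\s+", item.strip(), flags=re.IGNORECASE)
        source_col, target_col = parts[0].strip(), parts[-1].strip()
        lineage[f"{target}.{target_col}"] = f"{source}.{source_col}"
    return lineage


if __name__ == "__main__":
    statement = """
        CREATE TABLE dwh.dim_users AS
        SELECT id AS user_id, email, country_code AS country
        FROM app.users
    """
    for target_col, source_col in column_lineage(statement).items():
        print(f"{target_col} <- {source_col}")
```

Because the lineage is read from the transformation code itself, a renamed column such as `country_code AS country` is traced exactly rather than guessed.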

This is all great, but what difference does data lineage make to my day-to-day? Let’s look at that.

How can Data Lineage Ensure Day-to-day Data Quality?

The improved visibility and traceability from a data lineage system have (at least!) three clear effects on your operational day-to-day.

First, it improves your team response time. Investigating the cause of an error in a report no longer requires hours and the coordination of several separate teams. With full visibility into the flow of data across your entire data stack, errors can be investigated and resolved in record time.

Second, it allows for the creation and maintenance of a common data vocabulary. When the report team talks about views, it is clear to the application team what that means and where that data comes from. Similarly, the application team can now see exactly what data is being aggregated for the dashboard that informs the company's decisions and outlook. Over time, discrepancies in terminology can be mitigated or removed, allowing for smoother communication across the organization.

Finally, the data lineage system allows teams to easily and effectively anticipate the effects of any potential changes or migrations. Data schema changes and migrations can be planned out with certainty. Full traceability makes it easy to understand the downstream impact of any changes and to notify the appropriate parties.
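A rough sketch of such an impact check: given column-level lineage and a table-to-owner mapping, list what a change would touch and who should hear about it. The column names, team names, and both helper functions are hypothetical.

```python
# Hypothetical column-level lineage: each column maps to the columns built from it.
EDGES = {
    "app.orders.amount": ["dwh.fct_orders.amount_usd"],
    "dwh.fct_orders.amount_usd": [
        "report.revenue.total",
        "report.finance.avg_order_value",
    ],
    "app.orders.status": ["dwh.fct_orders.status"],
}

# Hypothetical ownership, keyed by table (the column name minus its last part).
OWNERS = {
    "dwh.fct_orders": "data-warehouse-team",
    "report.revenue": "reporting-team",
    "report.finance": "finance-analytics",
}


def impacted_columns(changed, edges):
    """List every column downstream of `changed`, in discovery order."""
    impacted, seen, stack = [], set(), [changed]
    while stack:
        current = stack.pop()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                stack.append(child)
    return impacted


def notification_plan(changed, edges, owners):
    """Group the impacted columns by the team that owns their table."""
    plan = {}
    for column in impacted_columns(changed, edges):
        table = column.rsplit(".", 1)[0]
        owner = owners.get(table, "unknown-owner")
        plan.setdefault(owner, []).append(column)
    return plan


if __name__ == "__main__":
    for team, columns in notification_plan("app.orders.amount", EDGES, OWNERS).items():
        print(team, columns)
```

Running a check like this before a schema change turns "who might this break?" into a concrete list of columns and teams.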

Wrap-up

In this article, we covered the basics of what data lineage is, why you might want to use a data lineage system, the different types of data lineage, and how data lineage can improve your data quality each day. The addition of a data lineage system to your data stack can increase transparency and prevent headaches for your entire organization.
