Is Data Lineage a Pain Killer or Vitamin?

Discover how data lineage is used by organizations, its benefits, and the critical questions to ask before implementation. Learn from real customer insights.

By Yuliia Tkachova · May. 21, 24 · Opinion
TL;DR: I might be biased on this, but I’m also equipped with analytics on column-level lineage usage from a number of customers and users.

Image courtesy of the Masthead Data team: Data Lineage

Is Data Lineage a Pain Killer or Vitamin?

First, it very much depends on the user organization’s current use cases and their level of maturity.

In my humble opinion, data engineers love looking at data flows and having that visual understanding of dependencies, but do they really use data lineage at the end of the day? What is the usage frequency? What are the specific use cases?

From what we observed, data lineage certainly drives interest. However, when it comes to actual usage, it is not the central feature. This could be because our implementation is limited to some data sources. Then again, lineage limited to only some pipelines (e.g., lineage in dbt or Dataform) also seems less meaningful to me, as ingestion and other processes are left in the shadows. A typical use case might involve someone in the organization searching for a specific pipeline or model about twice a week for a few minutes.

Common Uses for Data Lineage

The most common use cases for lineage we saw were:

  • The company is migrating or rebuilding its data platform.
  • The organization is onboarding new teammates, often for new data initiatives.

These are the times when lineage becomes very handy. Basically, it’s when the company starts not just maintaining what is in their data warehouse or data lake, but actually building and modernizing the data ecosystem.

Does this mean that having lineage is a must in this case? Absolutely not. But if you are interested in moving faster and smarter, then the answer is absolutely yes.

Questions To Consider

So, it very much depends on what the organization is currently doing. I am not trying to be assertive here, but rather intellectually honest by asking whether you really need data lineage. You might want to start with questions like:

  1. What is it for?
  2. What level of coverage do you need?
  3. Does it need to visualize production sources, or is a data warehouse enough?
  4. Do you need a BI solution connected? If yes, to what extent?

Then you speak to the universe and decide: buy or build. There’s a lot to consider here. My take is as follows:

  • Will it be used by the data team only, or will business users also be involved? (Consider the level of UX/UI required.)
  • How much are you ready to invest in it? (Calculate the cost of building it internally at the expense of your team’s hours and compare it to purchasing from a vendor.) Please do not forget to double the hours your team initially promised you. Hear me out; I'm speaking as a product manager here.
  • Consider what you already have in your data platform: a data lake, third-party data sources, and the stack already in use by the data team. It sounds easy and fun until you start dealing with complex cases like cross-project dependencies, views of temporary tables, or, heaven forbid, sharded tables, and the list goes on.
  • What is your team’s strategic focus and their skill set? Is it a strategic investment for you, and do you have the capacity to maintain and evolve it? Because your data platform, whether you believe it or not, will evolve.
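The build-or-buy arithmetic from the list above can be sketched in a few lines of Python. This is only an illustration: the function names, the hourly rate, the hour estimate, and the vendor price are all hypothetical placeholders, and the sketch ignores ongoing maintenance, which in practice adds to the build side.

```python
def build_cost(estimated_hours: float, hourly_rate: float, padding: float = 2.0) -> float:
    """In-house cost, with the team's estimate multiplied by a padding factor.

    The default padding of 2.0 mirrors the "double the hours" rule of thumb.
    """
    return estimated_hours * padding * hourly_rate


def compare_build_vs_buy(
    estimated_hours: float,
    hourly_rate: float,
    vendor_annual_cost: float,
    years: int = 1,
) -> dict:
    """Compare a padded build estimate against a vendor subscription."""
    build = build_cost(estimated_hours, hourly_rate)
    buy = vendor_annual_cost * years
    return {
        "build": build,
        "buy": buy,
        "cheaper": "build" if build < buy else "buy",
    }


if __name__ == "__main__":
    # Hypothetical inputs: 400 estimated hours at 80 per hour vs. a 50,000/year vendor.
    print(compare_build_vs_buy(400, 80, 50_000))
```

Swap in your own numbers and time horizon; the point is only that the padded estimate, not the initial one, is what should be compared with the vendor quote.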

Conclusion

Ultimately, my personal belief is that data lineage as a standalone visualization is not effective. Our use case for data lineage is to help troubleshoot broken pipelines or model errors: when an organization has an active warehouse with hundreds of pipelines and thousands of tables, it is impossible to keep track of whether everything is working as expected. Data quality is mostly a matter of SQL rules that check for issues already anticipated and known, but pipelines and models are a different beast. With them, a lot comes down to the connectivity, compatibility, and effectiveness of the data platforms. Pairing data pipeline/model error detection with data lineage is the area where we see a lot of response and value for users. Additionally, it helps our clients save money, as it is also connected to cost insights.
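To make the pairing of error detection and lineage concrete, here is a minimal sketch, assuming lineage is available as a simple mapping from each table to the tables built directly from it. The table names and the `downstream` helper are hypothetical and do not describe any particular product's implementation; the idea is only that once a failure is detected, lineage tells you what else is affected.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables that read from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue", "marts.customer_ltv"],
    "staging.customers": ["marts.customer_ltv"],
    "marts.revenue": ["dashboard.finance"],
}


def downstream(lineage: dict[str, list[str]], failed: str) -> list[str]:
    """Return every table reachable from `failed`, in breadth-first order."""
    seen = {failed}
    affected = []
    queue = deque([failed])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:  # skip tables already reached via another path
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    # Suppose error detection flagged the load of raw.orders.
    for table in downstream(LINEAGE, "raw.orders"):
        print("potentially affected:", table)
```

On its own, the graph is just a picture; fed with a detected failure, it becomes a list of tables and dashboards to check or pause.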

Having lineage alone does not solve the problem; it creates a new one: no one understands how the solution is being used, because lineage by itself does not move the needle. Rather, it helps move the needle faster in combination with anomaly detection and pipeline error detection.

While data lineage alone may be seen as just another shiny tool, its true value emerges when paired with comprehensive monitoring mechanisms and a commitment from the organization and the data team to build a robust and reliable data platform.

Opinions expressed by DZone contributors are their own.
