What Is Data Validation?

A data expert answers this question, looking into the ways data validation is performed, why it's useful, and the challenges inherent in the process.

By Garrett Alley · Jan. 25, 2019 · Opinion

Data validation is a method for checking the accuracy and quality of your data, typically performed before importing and processing it. It can also be considered a form of data cleansing. Data validation ensures that your data is complete (no blank or null values), unique (no duplicated values), and consistent (the range of values matches what you expect). Often, data validation is used as part of processes such as ETL (Extract, Transform, and Load), where you move data from a source database to a target data warehouse so that you can join it with other data for analysis. Data validation helps ensure that when you perform analysis, your results are accurate.
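
As a rough illustration (not from the original article), here is what those three checks might look like in Python with pandas; the file and column names are hypothetical:

import pandas as pd

# Hypothetical orders table loaded from a CSV export of the source system.
df = pd.read_csv("orders.csv")

# Completeness: no blank or null values in required columns.
missing = df[["order_id", "customer_id", "amount"]].isnull().sum()
print("Nulls per required column:\n", missing)

# Uniqueness: the primary key must contain distinct values.
duplicates = df["order_id"].duplicated().sum()
print(f"Duplicate order_id values: {duplicates}")

# Consistency: values must fall in the range you expect
# (here, a hypothetical rule that amounts are positive and bounded).
out_of_range = df[(df["amount"] <= 0) | (df["amount"] > 100_000)]
print(f"Rows with out-of-range amounts: {len(out_of_range)}")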

Steps to Data Validation

Step 1: Determine Data Sample

Determine the data to sample. If you have a large volume of data, you will probably want to validate a sample of your data rather than the entire set. You’ll need to decide what volume of data to sample, and what error rate is acceptable to ensure the success of your project.
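
A minimal sketch of this sampling step, assuming a pandas DataFrame and a hypothetical 2% acceptable error rate (both the sample fraction and the threshold are project-specific decisions, not fixed rules):

import pandas as pd

df = pd.read_csv("orders.csv")

# Draw a reproducible 1% random sample rather than validating the full set.
sample = df.sample(frac=0.01, random_state=42)

# Validate the sample, then compare the observed error rate against the
# threshold agreed on for the project.
errors = sample["amount"].isnull().sum()
error_rate = errors / len(sample)
assert error_rate <= 0.02, f"Sample error rate {error_rate:.2%} exceeds threshold"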

Step 2: Validate the Database

Before you move your data, you need to ensure that all the required data is present in your existing database. Determine the number of records and unique IDs, and compare the source and target data fields.
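
A sketch of that comparison, using SQLite connections as stand-ins for the real source and target databases (the table and key names are hypothetical):

import sqlite3

# Placeholder connections; in practice these would point at your actual
# source database and target warehouse, using their respective drivers.
source = sqlite3.connect("source.db")
target = sqlite3.connect("target.db")

def count_and_ids(conn, table, key):
    # Returns (total record count, distinct key count) for one table.
    return conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()

src_total, src_ids = count_and_ids(source, "orders", "order_id")
tgt_total, tgt_ids = count_and_ids(target, "orders", "order_id")

print(f"source: {src_total} rows, {src_ids} unique IDs")
print(f"target: {tgt_total} rows, {tgt_ids} unique IDs")
if (src_total, src_ids) != (tgt_total, tgt_ids):
    print("Mismatch: investigate before moving the data.")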

Step 3: Validate the Data Format

Determine the overall health of the data and the changes that will be required of the source data to match the schema in the target. Then search for incongruent or incomplete counts, duplicate data, incorrect formats, and null field values.
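
For instance, a sketch of these format checks in pandas, assuming a hypothetical customers table whose target schema requires ISO-8601 dates:

import pandas as pd

df = pd.read_csv("customers.csv")

# Incorrect formats: dates must match the target schema's expected pattern.
bad_dates = df[~df["created_at"].astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$")]

# Null field values and duplicate data.
nulls = df.isnull().sum()
dupes = df.duplicated().sum()

print(f"Rows with malformed dates: {len(bad_dates)}")
print(f"Fully duplicated rows: {dupes}")
print("Null counts by column:\n", nulls)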

Methods for Data Validation

You can perform data validation in one of the following ways:

  • Scripting: Data validation is commonly performed with a scripting language such as Python. For example, you can create an XML file with source and target database names, table names, and columns to compare; a Python script then takes the XML as input and processes the results (see the sketch after this list). However, this can be very time-intensive, as you must write the scripts and verify the results by hand.

  • Enterprise tools: Enterprise tools are available to perform data validation. For example, FME's data validation tools can validate and repair data. Enterprise tools have the benefit of being more stable and secure, but they can require infrastructure and are costlier than open source options.

  • Open source tools: Open source options are cost-effective, and if they are cloud-based, can also save you money on infrastructure costs. But they still require a level of knowledge and hand-coding to use effectively. Examples include OpenRefine and the many validation utilities hosted on SourceForge.
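
Here is a minimal sketch of the XML-driven approach described in the scripting bullet above. The config layout and element names are invented for illustration, and SQLite stands in for the real databases:

import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical config file (validation.xml); this layout is invented
# for illustration, not a standard schema:
# <validation>
#   <source db="source.db"/>
#   <target db="target.db"/>
#   <table name="orders" columns="order_id,amount"/>
# </validation>
root = ET.parse("validation.xml").getroot()
source = sqlite3.connect(root.find("source").get("db"))
target = sqlite3.connect(root.find("target").get("db"))

for table in root.iter("table"):
    name = table.get("name")
    cols = table.get("columns")
    query = f"SELECT {cols} FROM {name} ORDER BY 1"
    src_rows = source.execute(query).fetchall()
    tgt_rows = target.execute(query).fetchall()
    status = "OK" if src_rows == tgt_rows else "MISMATCH"
    print(f"{name}: {status} ({len(src_rows)} vs {len(tgt_rows)} rows)")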

Challenges in Data Validation

Data validation can be challenging for a couple of reasons:

  • Validating the database can be challenging because data may be distributed in multiple databases across your organization. The data may be siloed, or it may be outdated.
  • Validating the data format can be an extremely time-consuming process, especially if you have large databases and you intend to perform the validation manually. However, sampling the data for validation can help to reduce the time needed.

Data Validation and ETL

Whether you validate data manually or via scripting, it can be very time-consuming. However, after you have validated your data, a modern ETL tool can help you to expedite the process. As a part of your assessment of your data, you can determine which errors can be fixed at the source, and which errors an ETL tool can repair while the data is in the pipeline. You can then automatically integrate, clean, and transform data as it is moved to your data warehouse.
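
As a sketch of in-pipeline repair (the details vary widely by ETL tool), a transform step might look like the following; the column names and repair rules are hypothetical:

import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical in-pipeline repair: fix what can't be fixed at the source."""
    df = df.drop_duplicates(subset="order_id")                    # remove duplicate records
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # coerce bad numeric formats
    df = df.dropna(subset=["order_id", "amount"])                 # drop incomplete rows
    df["created_at"] = pd.to_datetime(df["created_at"],
                                      errors="coerce").dt.date    # normalize dates
    return df

# Extract, transform, load: a CSV file stands in for the target warehouse here.
clean = transform(pd.read_csv("orders.csv"))
clean.to_csv("orders_clean.csv", index=False)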


Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
