DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Low-Code Development: Leverage low and no code to streamline your workflow so that you can focus on higher priorities.

DZone Security Research: Tell us your top security strategies in 2024, influence our research, and enter for a chance to win $!

Launch your software development career: Dive head first into the SDLC and learn how to build high-quality software and teams.

Open Source Migration Practices and Patterns: Explore key traits of migrating open-source software and its impact on software development.

Related

  • Performance of ULID and UUID in Postgres Database
  • Migrating Data From Amazon Neptune to PostgreSQL Using AWS Services
  • 10 Svelte Data Grids: Choose the Right One for Your Project
  • How to Iterate Over Multiple Lists Sequentially in Python

Trending

  • Addressing Memory Issues and Optimizing Code for Efficiency: Glide Case
  • Benchmarking Java Streams
  • GBase 8a Implementation Guide: Performance Optimization
  • Leveraging Test Containers With Docker for Efficient Unit Testing
  1. DZone
  2. Data Engineering
  3. Databases
  4. Calculating Fingerprints for SQL Queries

Calculating Fingerprints for SQL Queries

Learn how to calculate a unique identifier for each SQL query in the database. See how we solved this challenge and what options we considered.

By 
Adam Furmanek user avatar
Adam Furmanek
·
Jun. 20, 24 · Tutorial
Like (1)
Save
Tweet
Share
2.9K Views

Join the DZone community and get the full member experience.

Join For Free

One of the things that Metis does when analyzing query performance is recognizing duplicates. We need to be able to recognize that two queries are the same query but with different parameters. For instance:

SELECT * FROM table WHERE id > 50


and

SELECT * FROM table WHERE id > 100


are the same even though they differ in the query text. Identifying these queries is simple with prepared statements as they are translated into queries with parameters like:

SELECT * FROM table WHERE id > $1


However, not all queries are prepared statements, so we needed a solution for identifying duplicates.

Our goal is to be able to define a single value identifying the query uniquely. We call this value a fingerprint. We don’t need to be 100% correct and we allow for some misses.

We started with a Query ID calculated by Postgres. Let’s see how it works.

Calculating Query ID With Postgres

Starting with version 14, Postgres can calculate the Query ID automatically with the compute_query_id parameter. The identifier is calculated by the engine automatically when processing the query. The implementation is based on the pg_stat_statements extension and follows some principles that we should be aware of:

  • The identifier is stable and comparable as long as the underlying server version and catalog metadata stay the same. Therefore, two servers with WAL replication are expected to have the same identifier for a query, however, different replication schemes may result in different identifiers
  • The identifier is calculated after parsing the query tree. This means that if you drop and recreate the table, the identifier may change

There are other situations when the identifier may change, so consult the documentation to understand it better.

Unfortunately, this mechanism has been available since PostgreSQL 14. We wanted to use it in older versions and we needed to implement our solution. Read on to see what other solutions we explored.

Calculating the Identifier Manually

The first idea was to implement the identifier in our backend. This would involve parsing the query and hashing it to get the identifier.

We decided to reject this solution because it is generally hard. There are many reasons why it’s not straightforward.

First, the identifier should change when statistics change. Therefore, we need to track the statistics as well. However, we can’t do that as we get queries retroactively. We don’t know what statistics were used by the engine when executing the query.

Parsing the query text is not straightforward. Consider the two queries below - are they equal?

SELECT column FROM table;


and

EXEC SQL BEGIN DECLARE SECTION;
const char *stmt = "SELECT column FROM table";
int v1;
EXEC SQL END DECLARE SECTION;
EXEC SQL PREPARE mystmt FROM :stmt;
EXEC SQL EXECUTE mystmt INTO :v1;


Queries depend on the parameter values and indexes. Our ultimate goal is to extract distinct execution plans. Replacing parameter values with NULLs may result in a much different result and we can’t tell that without the help of the engine.

There are so many ways to write a query. You can replace joins with subqueries, use CTEs, use aliases, change the order of fields, and so on. Parsing would be challenging and not reliable.

How to deal with stored procedures? Are we interested in the actual call of the procedure or its body?

What if Postgres changes the way it handles the queries? Should we adapt accordingly?

This list could be much longer. While we can accept some unreliability, we need to understand that this mechanism is hard even in relatively simple situations.

Using Postgres Code

The other idea was to extract the code from Postgres and use it to parse queries. We wanted to use the pg_query library. It’s a Rust package that runs the same code as Postgres. 

We managed to build the sample and wanted to run it in production. Our backend is implemented in JavaScript, so we had to find a way to run the Rust code somehow. We decided to use Neon to wrap the Rust package with JavaScript and run it this way.

The solution worked. Unfortunately, the resulting image was way too big. It consumed nearly 1.5GB of space which was far too much for our backend. We tried to trim it but we didn’t succeed. Therefore, we decided to drop this solution.

Using Postgres Code Differently

Finally, we decided to use a different library - libpg_query. This is a C library for parsing the query using the Postgres code. The library has a JavaScript binding in the form of the libpg-query library which was very convenient to use.

Since this is the same code that the PostgreSQL server uses, we have consistent behavior with the stock engine.

Summary

Calculating the fingerprint of the query turned out to be non-trivial. Fortunately, there are libraries out there that can do the job for us. We can use it to identify queries that should be considered the same despite having different query text. Thanks to that, we can aggregate data better and provide better insights.

Identifier Library PostgreSQL

Published at DZone with permission of Adam Furmanek. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Performance of ULID and UUID in Postgres Database
  • Migrating Data From Amazon Neptune to PostgreSQL Using AWS Services
  • 10 Svelte Data Grids: Choose the Right One for Your Project
  • How to Iterate Over Multiple Lists Sequentially in Python

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: