Calculating Fingerprints for SQL Queries

Learn how to calculate a unique identifier for each SQL query in the database. See how we solved this challenge and what options we considered.

Adam Furmanek

Jun. 20, 24 · Tutorial

Like (1)

Save

2.9K Views

One of the things that Metis does when analyzing query performance is recognizing duplicates. We need to be able to recognize that two queries are the same query but with different parameters. For instance:

SELECT * FROM table WHERE id > 50

and

SELECT * FROM table WHERE id > 100

are the same even though they differ in the query text. Identifying these queries is simple with prepared statements as they are translated into queries with parameters like:

SELECT * FROM table WHERE id > $1

However, not all queries are prepared statements, so we needed a solution for identifying duplicates.

Our goal is to be able to define a single value identifying the query uniquely. We call this value a fingerprint. We don’t need to be 100% correct and we allow for some misses.

We started with a Query ID calculated by Postgres. Let’s see how it works.

Calculating Query ID With Postgres

Starting with version 14, Postgres can calculate the Query ID automatically with the compute_query_id parameter. The identifier is calculated by the engine automatically when processing the query. The implementation is based on the pg_stat_statements extension and follows some principles that we should be aware of:

The identifier is stable and comparable as long as the underlying server version and catalog metadata stay the same. Therefore, two servers with WAL replication are expected to have the same identifier for a query, however, different replication schemes may result in different identifiers
The identifier is calculated after parsing the query tree. This means that if you drop and recreate the table, the identifier may change

There are other situations when the identifier may change, so consult the documentation to understand it better.

Unfortunately, this mechanism has been available since PostgreSQL 14. We wanted to use it in older versions and we needed to implement our solution. Read on to see what other solutions we explored.

Calculating the Identifier Manually

The first idea was to implement the identifier in our backend. This would involve parsing the query and hashing it to get the identifier.

We decided to reject this solution because it is generally hard. There are many reasons why it’s not straightforward.

First, the identifier should change when statistics change. Therefore, we need to track the statistics as well. However, we can’t do that as we get queries retroactively. We don’t know what statistics were used by the engine when executing the query.

Parsing the query text is not straightforward. Consider the two queries below - are they equal?

SELECT column FROM table;

and

EXEC SQL BEGIN DECLARE SECTION;
const char *stmt = "SELECT column FROM table";
int v1;
EXEC SQL END DECLARE SECTION;
EXEC SQL PREPARE mystmt FROM :stmt;
EXEC SQL EXECUTE mystmt INTO :v1;

Queries depend on the parameter values and indexes. Our ultimate goal is to extract distinct execution plans. Replacing parameter values with NULLs may result in a much different result and we can’t tell that without the help of the engine.

There are so many ways to write a query. You can replace joins with subqueries, use CTEs, use aliases, change the order of fields, and so on. Parsing would be challenging and not reliable.

How to deal with stored procedures? Are we interested in the actual call of the procedure or its body?

What if Postgres changes the way it handles the queries? Should we adapt accordingly?

This list could be much longer. While we can accept some unreliability, we need to understand that this mechanism is hard even in relatively simple situations.

Using Postgres Code

The other idea was to extract the code from Postgres and use it to parse queries. We wanted to use the pg_query library. It’s a Rust package that runs the same code as Postgres.

We managed to build the sample and wanted to run it in production. Our backend is implemented in JavaScript, so we had to find a way to run the Rust code somehow. We decided to use Neon to wrap the Rust package with JavaScript and run it this way.

The solution worked. Unfortunately, the resulting image was way too big. It consumed nearly 1.5GB of space which was far too much for our backend. We tried to trim it but we didn’t succeed. Therefore, we decided to drop this solution.

Using Postgres Code Differently

Finally, we decided to use a different library - libpg_query. This is a C library for parsing the query using the Postgres code. The library has a JavaScript binding in the form of the libpg-query library which was very convenient to use.

Since this is the same code that the PostgreSQL server uses, we have consistent behavior with the stock engine.

Summary

Calculating the fingerprint of the query turned out to be non-trivial. Fortunately, there are libraries out there that can do the job for us. We can use it to identify queries that should be considered the same despite having different query text. Thanks to that, we can aggregate data better and provide better insights.

Identifier Library PostgreSQL

Published at DZone with permission of Adam Furmanek. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending