Detecting changes in SQL

Functionality described on this page requires installation via pip install pycodehash[sql]

The same principle holds for SQL as for Python functions: the hash of an SQL query should reflect changes of the implementation and be invariant to non-functional changes.

Hashing SQL queries and files

PyCodeHash supports hashing SQL queries and files.

Currently, our default implementation is invariant to omitting database names, comments, and whitespace, e.g.

/* block comment: set `my_database` as the default database */
USE my_database;

SELECT 
    -- inline comment: select all columns
    * 
FROM 
    hello_world

is equivalent to

USE my_database;
SELECT * FROM my_database.hello_world

This behavior can be extended with user-provided Abstract Syntax Tree (AST) transformers.

The AST parsing from the excellent SQLFluff library is used in our implementation.

This results in many dialects of SQL being supported out of the box, e.g.:

  • ANSI
  • BigQuery
  • PostgreSQL
  • SparkSQL
  • SQLite
  • T-SQL (MSSQL)
  • Trino

The SQLHasher allows for passing on configuration to SQLFluff via configuration files.

Usage examples

Hashing SQL files and queries

The SQL Usage Example demonstrates how to hash SQL queries and files:

"""Hash SQL queries and files (requires `sqlfluff` to be installed)"""

from pathlib import Path

from pycodehash.sql.sql_hasher import SQLHasher

# First query
query_1 = "SELECT * FROM db.templates"

# The second query is equivalent, but has additional whitespace
query_2 = "SELECT\n    * \nFROM \n    db.templates"

# Write the second query to a file
query_2_file = Path("/tmp/query.sql")
query_2_file.write_text(query_2)

# Create the SQLHasher object for SparkSQL
hasher = SQLHasher(dialect="sparksql")

# We can hash a string
result_1 = hasher.hash_query(query_1)
print(result_1)

# Or pass a path
result_2 = hasher.hash_file(query_2_file)
print(result_2)

# Both hashes are identical
assert result_1 == result_2

Hashing files in git history

Below is a more advanced example for checking if there was a behavioural change in a SQL file in a specific git commit:

import subprocess
from typing import Literal

import sqlfluff
from pycodehash.sql.sql_hasher import SQLHasher


def git_show_file(commit_id: str, file_name: str) -> str:
    """Get the contents of a file at a specific commit via `git show`

    Args:
        commit_id: the commit SHA ID
        file_name: the name of the file

    Returns:
        string containing the contents of the file in git

    Raises:
        CalledProcessError: when the git command fails, e.g. when the file does not exist
    """
    result = subprocess.run(
        ["git", "--no-pager", "show", f"{commit_id}:{file_name}"], stdout=subprocess.PIPE, encoding="utf-8", check=False
    )
    result.check_returncode()
    return result.stdout


def detect_file_change(
    hasher: SQLHasher, commit_id: str, file_name: str
) -> Literal["FUNCTIONAL", "NON-FUNCTIONAL", "INVALID"]:
    """Compares the hashes of versions of the same file in git

    Args:
        hasher: the SQL hasher object
        commit_id: the reference to the commit
        file_name: the name of the SQL file

    Returns:
        FUNCTIONAL if the code changed functionally
        NON-FUNCTIONAL, if the code did not change functionally, e.g. whitespace
        INVALID, if the file did not exist or the SQL could not be parsed
    """
    try:
        after = git_show_file(commit_id, file_name)
    except subprocess.CalledProcessError:
        # deleted file
        return "INVALID"

    try:
        before = git_show_file(f"{commit_id}^", file_name)
    except subprocess.CalledProcessError:
        # newly created file
        return "FUNCTIONAL"

    try:
        before_hash = hasher.hash_query(before)
        after_hash = hasher.hash_query(after)
    except sqlfluff.api.simple.APIParsingError:
        return "INVALID"
    return "FUNCTIONAL" if before_hash != after_hash else "NON-FUNCTIONAL"


hasher = SQLHasher(dialect="sparksql")
change_type = detect_file_change(hasher, "<COMMIT_ID>", "<FILE_NAME>")
if change_type == "FUNCTIONAL":
    print("The SQL query should be executed after this commit")
elif change_type == "NON-FUNCTIONAL":
    print("The SQL query does not have to be executed after this commit")
else:
    print("The query file is invalid before or after the commit (e.g. syntax error or deleted)")

SQL query dependencies

In real-world applications, engineers and analysts typically structure multiple SQL queries in separate files that are executed sequentially or according to a topological order. This is often achieved using data transformation frameworks like dbt, SQLMesh , or similar commercial products.

In such scenarios, simply relying on the hash of individual SQL files is insufficient. When a referenced table in a query is updated, the query must be re-executed, regardless of whether the query's contents have changed. To address this, we can automatically extract table references by parsing the SQL Abstract Syntax Tree (AST).

This approach is straightforward, as table references are limited to specific contexts like CREATE, FROM, INTO and Common Table Expressions (CTEs). We've chosen to integrate with existing efforts, leveraging the SQLLineage implementation built on top of SQLFluff, to prevent duplication of effort.

Usage:

from pycodehash.sql import extract_table_references

query = "SELECT * INTO output_table FROM my_database.input_table"

input_tables, output_tables, dropped_tables = extract_table_references(
    query, 
    default_db="my_database", 
    dialect="t-sql"
)
print(input_tables)
# {'my_database.input_table'}
print(output_tables)
# {'my_database.output_table'}
print(dropped_tables)
# {}