# Detecting changes in SQL
Functionality described on this page requires installation via `pip install pycodehash[sql]`.
The same principle holds for SQL as for Python functions: the hash of a SQL query should reflect changes to the implementation and be invariant to non-functional changes.
## Hashing SQL queries and files
PyCodeHash supports hashing SQL queries and files.
Currently, our default implementation is invariant to omitted database names, comments, and whitespace. For example, the query:
```sql
/* block comment: set `my_database` as the default database */
USE my_database;

SELECT
    -- inline comment: select all columns
    *
FROM
    hello_world
```
is equivalent to:

```sql
USE my_database;

SELECT * FROM my_database.hello_world
```
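To make the principle concrete, here is a deliberately simplified, regex-based sketch of comment and whitespace normalization. This is illustrative only: PyCodeHash normalizes via the SQLFluff AST rather than regexes, which also handles database-name qualification correctly.

```python
# Illustrative sketch only: PyCodeHash normalizes via the SQLFluff AST,
# not via regexes as done here. This merely demonstrates the principle
# that hashing *normalized* SQL ignores comments and whitespace.
import hashlib
import re


def toy_normalize(sql: str) -> str:
    """Strip comments and collapse whitespace (simplified stand-in)."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)  # block comments
    sql = re.sub(r"--[^\n]*", " ", sql)  # inline comments
    return " ".join(sql.split())  # collapse all whitespace runs

def toy_hash(sql: str) -> str:
    """Hash the normalized query text."""
    return hashlib.sha256(toy_normalize(sql).encode()).hexdigest()

q1 = "SELECT\n  -- all columns\n  *\nFROM\n  hello_world"
q2 = "SELECT * FROM hello_world"
assert toy_hash(q1) == toy_hash(q2)
```

A real implementation works on the parsed AST instead, so that, e.g., whitespace inside string literals is preserved while layout changes are not.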
This behavior can be extended with user-provided Abstract Syntax Tree (AST) transformers.
Our implementation uses the AST parsing from the excellent SQLFluff library.
As a result, many SQL dialects are supported out of the box, e.g.:
- ANSI
- BigQuery
- PostgreSQL
- SparkSQL
- SQLite
- T-SQL (MSSQL)
- Trino
The `SQLHasher` allows passing configuration on to SQLFluff via configuration files.
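For reference, SQLFluff reads its settings from standard configuration files such as a `.sqlfluff` file in INI format. A minimal example might look like the following (see the SQLFluff configuration documentation for the full set of options):

```ini
[sqlfluff]
dialect = sparksql
templater = jinja
```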
## Usage examples
### Hashing SQL files and queries
The SQL Usage Example demonstrates how to hash SQL queries and files:
"""Hash SQL queries and files (requires `sqlfluff` to be installed)"""
from pathlib import Path
from pycodehash.sql.sql_hasher import SQLHasher
# First query
query_1 = "SELECT * FROM db.templates"
# The second query is equivalent, but has additional whitespace
query_2 = "SELECT\n * \nFROM \n db.templates"
# Write the second query to a file
query_2_file = Path("/tmp/query.sql")
query_2_file.write_text(query_2)
# Create the SQLHasher object for SparkSQL
hasher = SQLHasher(dialect="sparksql")
# We can hash a string
result_1 = hasher.hash_query(query_1)
print(result_1)
# Or pass a path
result_2 = hasher.hash_file(query_2_file)
print(result_2)
# Both hashes are identical
assert result_1 == result_2
### Hashing files in git history
Below is a more advanced example that checks whether a specific git commit introduced a behavioural change in a SQL file:
```python
import subprocess
from typing import Literal

import sqlfluff

from pycodehash.sql.sql_hasher import SQLHasher


def git_show_file(commit_id: str, file_name: str) -> str:
    """Get the contents of a file at a specific commit via `git show`.

    Args:
        commit_id: the commit SHA ID
        file_name: the name of the file

    Returns:
        string containing the contents of the file in git

    Raises:
        CalledProcessError: when the git command fails, e.g. when the file does not exist
    """
    result = subprocess.run(
        ["git", "--no-pager", "show", f"{commit_id}:{file_name}"],
        stdout=subprocess.PIPE,
        encoding="utf-8",
        check=False,
    )
    result.check_returncode()
    return result.stdout


def detect_file_change(
    hasher: SQLHasher, commit_id: str, file_name: str
) -> Literal["FUNCTIONAL", "NON-FUNCTIONAL", "INVALID"]:
    """Compare the hashes of two versions of the same file in git.

    Args:
        hasher: the SQL hasher object
        commit_id: the reference to the commit
        file_name: the name of the SQL file

    Returns:
        FUNCTIONAL if the code changed functionally,
        NON-FUNCTIONAL if the code did not change functionally (e.g. whitespace),
        INVALID if the file did not exist or the SQL could not be parsed
    """
    try:
        after = git_show_file(commit_id, file_name)
    except subprocess.CalledProcessError:
        # deleted file
        return "INVALID"

    try:
        before = git_show_file(f"{commit_id}^", file_name)
    except subprocess.CalledProcessError:
        # newly created file
        return "FUNCTIONAL"

    try:
        before_hash = hasher.hash_query(before)
        after_hash = hasher.hash_query(after)
    except sqlfluff.api.simple.APIParsingError:
        return "INVALID"

    return "FUNCTIONAL" if before_hash != after_hash else "NON-FUNCTIONAL"


hasher = SQLHasher(dialect="sparksql")
change_type = detect_file_change(hasher, "<COMMIT_ID>", "<FILE_NAME>")

if change_type == "FUNCTIONAL":
    print("The SQL query should be executed after this commit")
elif change_type == "NON-FUNCTIONAL":
    print("The SQL query does not have to be executed after this commit")
else:
    print("The query file is invalid before or after the commit (e.g. syntax error or deleted)")
```
## SQL query dependencies
In real-world applications, engineers and analysts typically structure multiple SQL queries in separate files that are executed sequentially or according to a topological order. This is often achieved using data transformation frameworks like dbt, SQLMesh, or similar commercial products.
In such scenarios, simply relying on the hash of individual SQL files is insufficient. When a referenced table in a query is updated, the query must be re-executed, regardless of whether the query's contents have changed. To address this, we can automatically extract table references by parsing the SQL Abstract Syntax Tree (AST).
This approach is straightforward, as table references are limited to specific contexts such as `CREATE`, `FROM`, `INTO`, and Common Table Expressions (CTEs). We've chosen to integrate with existing efforts, leveraging the SQLLineage implementation built on top of SQLFluff, to prevent duplication of effort.
Usage:
```python
from pycodehash.sql import extract_table_references

query = "SELECT * INTO output_table FROM my_database.input_table"

input_tables, output_tables, dropped_tables = extract_table_references(
    query, default_db="my_database", dialect="tsql"
)

print(input_tables)
# {'my_database.input_table'}
print(output_tables)
# {'my_database.output_table'}
print(dropped_tables)
# set()
```
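Once table references are extracted, they can drive re-execution decisions across a pipeline. The following stdlib-only sketch propagates invalidation through a (hypothetical, hard-coded) dependency graph; in practice the references would come from `extract_table_references` and the change flags from comparing `SQLHasher` hashes.

```python
# Sketch: decide which queries must re-run, given (hypothetical) table
# references and per-query hash changes. In practice the references would
# come from `extract_table_references` and the change flags from `SQLHasher`.

# query name -> (input tables, output tables); illustrative data only
queries = {
    "stage": ({"db.raw"}, {"db.staged"}),
    "enrich": ({"db.staged"}, {"db.enriched"}),
    "report": ({"db.enriched"}, {"db.report"}),
}

# queries whose normalized hash changed since the last run
hash_changed = {"stage"}


def queries_to_rerun(queries, hash_changed):
    """A query re-runs if its hash changed or any of its input tables
    is (re)written by a query that re-runs."""
    rerun = set(hash_changed)
    changed = True
    while changed:  # fixed-point iteration over the dependency graph
        changed = False
        # tables (re)written by queries already marked for re-run
        dirty_tables = {t for name in rerun for t in queries[name][1]}
        for name, (inputs, _) in queries.items():
            if name not in rerun and inputs & dirty_tables:
                rerun.add(name)
                changed = True
    return rerun


print(sorted(queries_to_rerun(queries, hash_changed)))
# ['enrich', 'report', 'stage']
```

A change in `stage` dirties `db.staged`, which invalidates `enrich`, which in turn invalidates `report`; frameworks like dbt perform the equivalent propagation over their full DAG.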