Detecting code changes in Python function

How `pycodehash` detects code changes

pycodehash attempts to accurately detect code modifications by focusing on the following aspects that reflect true changes to your implementation

Functional Changes:

Implementation: Changes to the underlying logic of your code, including new functionality or updates to existing behavior.
Dependencies: Modifications to the external libraries, frameworks, or modules upon which your code relies.
Transitive Dependencies: Updates to the dependencies of your dependencies, ensuring that pycodehash captures changes that may not be immediately apparent.

Ignored Non-Functional Changes:

On the other hand, pycodehash intentionally ignores non-functional modifications that do not affect the actual behavior of your code. These include:

Function Name: Renaming functions or variables does not change their functionality.
Formatting: Code formatting changes, such as indentation or line wrapping, trailing commas, are disregarded.
Comments and Docstrings: Comments and documentation strings may be updated without affecting the code's execution.
Type Hints: Changes to type hints do not impact the code's behavior.
Dead code elimination: Unused imports, unused variables, unused arguments in formatting, duplicate dictionary keys
Newer Python Syntax: super calls, set literals, etc. See pyupgrade for details.
Miscellaneous code style idiom and conventions: simplified comprehensions, dictionary key membership checks, import ordering, unnecessary return syntax

By focusing on functional changes while ignoring non-functional modifications, pycodehash provides a reliable way to detect true changes in your code.

See the two equivalent code snippets in the Equivalence example for examples of the non-functional changes listed below.

Code change detection guarantees

For code change detection: False positive matches are possible, but false negatives are not*. In other words, if the hash of two functions is equal, it's guaranteed there is no functional change. If the hashes are not equal, then it's likely, but not guaranteed that there is no functional change. These guarantees are analogous to that of the bloom filter probabilistic data structure.

Except for hash collisions that are caused by the choice of hashing algorithm, by default SHA512.

Simplifying Hash Computation with Pure Functions

In the context of Python, implementing hash computations for arbitrary functions can be challenging due to the language's dynamic nature. To simplify this process and avoid having to implement a full Python execution module, we restrict hashing to pure functions.

In simple terms, a pure function is a function that:

Always returns the same output given the same inputs
Has no side effects

Requiring Python functions to be pure is not a significant restriction, but rather a natural design principle. Pure functions are easier to test and have predictable behavior.

Algorithm

Given a Python function as source code we:

Initialize or load the function's package source
Create an Abstract Syntax Tree (AST) for the provided function
Replace the name of called functions with the hash of the function definition
Remove invariant changes
- AST: Strip docstrings and type hints, and remove the function name
- Unparse the AST to a string presentation
- Lines: Normalize whitespace
Unparse the AST and normalise
Hash

1. Package initialization

PyCodeHash initializes a package when it's first encountered, e.g., in from my_package.functions import func, my_package would be the package. The initialization phase consists of:

Copying the source to a temporary directory (enabled by default)
Indexing the source via static object analysis on all python files (via rope)
Brining the source in a canonical form where possible via auto-fixing lint violations (via ruff)

2. Abstract Syntax Tree

The AST is parsed using Pythons standard library. This step removed comments and normalises formatting.

3. Find call definitions

The inspiration for solving the dependency invariance comes from compilers/interpreters:

Source code -> ... -> Interpreted/compiled -> ... -> Machine instructions

Compilers often inline code to enable additional transformations. PyCodeHash applies the same concept.

For example, in the following code:

def multiply(y, z):
    return y * z

def shift_left(x):
    return multiply(x, 2)

If we hash the source code of shift_left, then the hash is invariant to changes in multiply. This does not meet our desiderata. By inlining multiply, this is no longer true:

def shift_left(x):
    def multiply(y, z):
        return y * z
    return multiply(x, 2)

In our implementation, the function call is replaced with the hash of the source definition, rather than inlined:

def shift_left(x):
    return c_9e8c617fe2e0d524469d75f43edb1ff91f9a5387af6c444017ddcd194c983aed(x, 2)

Note that the hash is prefixed with an c_ (for call) to ensure syntactical validity. Hashes can start with digits, which makes the code snippet is no longer valid Python syntax due to the function name constraint: identifiers (such as function names) cannot start with digits. According to the Python reference documentation, this is a fundamental rule for Python's lexical analysis. Syntactical validity is required to be able to use tools that transform the code afterward.

The implementation builds on rope, an advanced open-source Python refactoring library. This package performs a lot of heavy lifting. See the section "The Challenge of Finding Call Definitions in Python" for details.

4. Strip invariant AST changes

In this step, PyCodeHash transforms the AST to remove invariant syntax. For this we implemented multiple NodeTransformers that can be found in src/pycodehash/preprocessing/.

5. Unparse AST and normalise

Then we unparse the AST representation to obtain the Python source code (without comments and formatting).

On this string, we apply whitespace normalisation to ensure platform-independent hashes.

6. Hashing

Finally, the resulting source code is hashed using hash_string. The function uses the SHA256 algorithm provided by the standard library.

The Challenge of Finding Call Definitions in Python

Python's dynamic nature makes it difficult to find call definitions due to the various ways functions can be defined. This section highlights some examples that illustrate this challenge.

Naive approach to hashing

We can use the inspect module to get the source code of the function as a string, and then compute its hash.

Here's an example:

import inspect

from pycodehash.hashing import hash_string

def my_function(x):
    return x * 2

# Get the source code of the function as a string
func_str = inspect.getsource(my_function)

# Compute the hash value. Uses the stdlib `hashlib.sha256` underneath
hash_value = hash_string(func_str)

print(hash_value)

Please note that this approach has a serious limitation:

It only works for functions with simple source code. If the function has a very long or complex implementation (e.g., due to many nested loops), computing its hash might be problematic.

Function Definition Variations

The following code snippets demonstrate different styles of function definition:

data = [1, 2, 3, 4, 5]

# Built-in function call (no explicit definition)
sum(data)

# Explicit function definition
def sum(x):
    return x + [6, 7, 8, 9]

sum(data)

from stats import sum

# Importing a function from another module
sum(data)

Function Definition in Nested Scopes

Functions can be defined within other functions or modules, leading to nested scopes. For example:

def processing(y):
    # Import alias within local scope
    from another.lib import multiply as sum

    sum(y)

    # Enclosing scope function definition
    def sum(z):
        return z * 2

    sum(y)

Dynamic Function Creation

Functions can also be created dynamically using closures. For instance:

def create_func(y):
    def func(x):
        return x + y

    return func

sum = create_func(3)
sum(data)

Understanding these variations is crucial for effectively resolving call definitions in Python.

Read more about the LEGB Rule for Python Scope to better grasp how scope and naming work in Python.

Cant we use Python's built-in hashing functions?

In Python, there is the built-in hash() function or the __hash__ method in classes to compute a hash value for an object, which is an integer that represents the "identity" of the object being hashed. When you call hash() on an object, it computes the hash value on-the-fly using the object's attributes and methods. The built-in hash() function in Python cannot be used to reliably compare if the content of a function has changed because even if two functions have the same source code, there's no guarantee that their hash values will remain stable across different runs of your program or in different environments.

Python also has a built-in library to "hash" your functions, namely hashlib, which provides secure and efficient hashing algorithms. The cryptography package is an example of a third-party packages that also offers such cryptographic algorithms. However, these libraries are designed for general-purpose hashing, not specifically for computing hashes of Python functions.

Tips for writing Pure Functions

How to make functions with randomness pure?

To make functions with randomness pure, pass the seed value or random number generator object as an argument.

For instance:

import numpy as np

def my_random_function(seed: int) -> int:
    """A function that generates a random number between 1 and 10.

    Args:
        seed: The seed value for the random number generator.

    Returns:
        A random number between 1 and 10.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(1, 10)

# Usage example:
print(my_random_function(42))  # Always generates the same output (if seeded correctly)

This way, you can control the sequence of random numbers generated by the RNG and make your function's output deterministic.

Tools for enforcing pureness

In modern software engineering practices, lints like mr-proper can aid in identifying and enforcing the purity of functions, making it a more maintainable and efficient development process.

Implementation Considerations

To maximize development efficiency and minimize long-term maintenance burdens, this library uses established third-party libraries rather than implementing similar functionality from scratch. Hence, PyCodeHash is built on top of rope (for call tracing) and ruff (for applying a multitude of invariant changes). The current implementation in addition uses the asttokens package to map AST nodes to their offset in the Python source code, which is required to interact with rope. This results in an effective library that can be used today.

This approach does come with some trade-offs, however. By relying on external libraries, we sacrifice access to internal state from these programs, such as the control-flow graph that may be useful for more highly optimized or specialized functionality. Users who already have access to their own control-flow graphs may find it more beneficial to implement equivalent functionality directly within their own systems.