How to test Azure Synapse notebooks

Interactive notebooks are an incredibly powerful tool for data exploration and experimentation. We're huge advocates of notebook technologies and notebook-based development, and love how they can significantly reduce time to (business) value and eliminate waste by improving collaboration and productivity. This is especially true inside Azure Synapse, where a notebook can be "productionised" into an automated, repeatable, cloud-hosted data pipeline with a few clicks of a button. But with this agility comes a risk - if business insights are being surfaced, there's a need for quality assurance, the absence of which can mean that inaccuracies in the data end up going unnoticed. As with any application, validating business rules and security boundaries within a data solution is important, as is ensuring that quality doesn't regress over time as the solution evolves.

But how do you tackle testing inside notebooks? This post argues that with a simple shift in mindset around how notebooks are structured, tests can be added relatively easily to any notebook logic, just like any other form of application logic. An end-to-end notebook sample is also included, which can be used as a reference for your own projects.

Structuring the notebook for testing

Whilst notebooks might often start off as a tool for data exploration or experimentation, when they get added into a repeatable process they're typically doing some form of ETL or ELT - applying a series of transformation steps over a data input, resulting in a different data output.

If each logical step (cleansing, joining, filtering, modelling etc) is treated as a separate action, and the notebook is structured accordingly, then each step can be tested in isolation just like any software sub-routine or module.

Notebooks have a natural way to modularise the steps through cells. In order to test a cell in isolation, the code it contains simply needs to be wrapped in a function that can be called from elsewhere in the notebook.

So, the key to testing notebooks is to treat each cell as a logical step in the end-to-end process, wrapping the code in each cell in a function so that it can be tested.

For example, the simple function in the PySpark sample below removes duplicates from a dataframe. Even though it's only one line of code, it still contains a rule about how duplicates are identified - in this case, just using the OrderId column. This logic could have been implemented differently (e.g. finding duplicates across all the columns), so arguably it should have tests around it to validate that the outputs are as expected. Wrapping it in a function also has the added advantage of making it explicit, through the function's name, what this step is doing.

def remove_duplicate_orders(df):
    return df.dropDuplicates(["OrderId"])

It's worth noting that this restructuring, whilst allowing code to be tested, does come with some loss of flexibility in running individual cells in isolation. Now that it's wrapped in a function, running a cell will just define the function rather than execute it. So there also needs to be code in another cell that calls the function.

There are a couple of approaches you could then follow to orchestrate the end-to-end process once the notebook is made up of a series of modular, cell-based functions. The first is to add the "calling cells" alongside the "function cells" as logical pairs, so that they sit together in the context of the overall notebook. Alternatively, add a final cell that acts as the orchestration/workflow cell, calling each step in the right order. The effect is the same as running the whole notebook from top to bottom, but this explicit structuring then makes it possible to add tests around the logic.
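For example, taking the second approach, a final orchestration cell might look something like the sketch below, where load_orders() and save_orders() are hypothetical placeholders for the ingestion and output steps defined in earlier cells of your own notebook:

# Orchestration cell - calls each step function in order.
# load_orders() and save_orders() are hypothetical placeholders for the
# ingestion and output steps defined elsewhere in the notebook.
df_orders = load_orders()
df_orders = remove_duplicate_orders(df_orders)
save_orders(df_orders)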

Adding test cases

Tests can be added to the notebook in exactly the same way, as a series of cells that test specific scenarios over your data transformation functions. Wrapping each test in a separate function also allows you to describe what's being tested through well-named test functions. But again, this requires a separate test orchestration/workflow cell in the notebook to run all the tests.

Using the simple example above of removing duplicates, here's an example test case to validate the logic:

def orders_with_duplicated_order_id_are_removed():

    # Arrange
    df = spark.createDataFrame(
            [
                (10,"01/01/2020","North","Chicago","Bars","Carrot",33,1.77,58.41),
                (10,"11/03/2020","North","Chicago","Bars","Carrot",33,1.77,58.41),
            ],
            orders_schema
        )

    # Act
    df_result = remove_duplicate_orders(df)

    # Assert
    assert df_result, "No data frame returned from remove_duplicate_orders()"

    expected_orders = 1
    number_of_orders = df_result.count()
    assert number_of_orders == expected_orders, f'Expected {expected_orders} order(s) after remove_duplicate_orders() but {number_of_orders} returned.'

Being explicit about what the test is doing in the name of the test function then naturally forces us to think about other scenarios that we might want to test.

For example, if we've decided that our de-duplication logic is based around the OrderId column, we might want to validate that rows that have identical values across all the other columns aren't treated as duplicates:

def similar_orders_with_different_order_id_are_not_removed():

    # Arrange
    df = spark.createDataFrame(
            [
                (10,"01/01/2020","North","Chicago","Bars","Carrot",33,1.77,58.41),
                (11,"01/01/2020","North","Chicago","Bars","Carrot",33,1.77,58.41),
                (12,"01/01/2020","North","Chicago","Bars","Carrot",33,1.77,58.41),
            ],
            orders_schema
        )

    # Act
    df_result = remove_duplicate_orders(df)

    # Assert
    expected_orders = 3
    number_of_orders = df_result.count()
    assert number_of_orders == expected_orders, f'Expected {expected_orders} order(s) after remove_duplicate_orders() but {number_of_orders} returned.'
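With the test cases defined, a separate test orchestration cell can then call each one in turn. A minimal sketch of such a cell might look like this:

# Test orchestration cell - runs each test case in turn
orders_with_duplicated_order_id_are_removed()
similar_orders_with_different_order_id_are_not_removed()
print("All tests passed")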

Splitting out tests from workflow

Depending on the complexity of the notebook, it might be perfectly acceptable to leave everything in one place - the functions, the tests, and the orchestration workflow all in the same notebook. However, as your notebook grows in complexity, this could quickly become unmanageable.

If your notebook is simple and you do want to keep everything together in one place, Azure Synapse has a feature that allows one notebook cell to be designated as the "parameters cell", meaning any variables defined in that cell can be overridden by pipeline parameters when the notebook is run from a pipeline. Using this approach, a simple test_mode flag could be added, allowing the same notebook to be run inside a pipeline either to execute the tests or to run the actual workflow.
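As a minimal sketch of this approach (assuming the test functions shown above and a hypothetical run_workflow() function wrapping the actual transformation steps), the parameters cell and a final orchestration cell might look something like this:

# Parameters cell - test_mode can be overridden by a pipeline parameter
test_mode = False

# Final orchestration cell - run either the tests or the actual workflow
if test_mode:
    orders_with_duplicated_order_id_are_removed()
    similar_orders_with_different_order_id_are_not_removed()
else:
    run_workflow()  # hypothetical function wrapping the actual transformation steps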

Alternatively, if you want to split things out (either because your notebook is becoming unmanageable, or for pure "separation of concerns" reasons), the %run magic command can help, allowing you to run one notebook from another. With this approach, you can separate your business logic notebook from your test notebook. With the test notebook using %run to execute the main notebook, your logical functions can be tested easily in isolation from the test notebook.
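A hypothetical test notebook using this approach might start with a cell like the one below (the notebook path is purely illustrative and would be whatever your business logic notebook is called). Once it has run, the functions the main notebook defines, such as remove_duplicate_orders, are available to the test cells that follow.

%run /notebooks/transform-orders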

In either case, the tests and the notebook logic can then be added to a pipeline, with the tests acting as a quality gate in the end-to-end process.

Limitations in Azure Synapse notebooks

At the time of writing, there are two limitations around the use of the %run magic command within Azure Synapse:

  1. It doesn't work if you have a Git-enabled workspace and are working on a branch (rather than in Live mode)

  2. It doesn't work when using a Notebook Activity inside a Synapse Pipeline

With this in mind, the end-to-end notebook example below combines the tests and workflow in the same notebook, using the test_mode approach described above to switch between running the tests or not.

When these limitations are removed, the example could be easily refactored to be split out into two separate notebooks.

Using test frameworks in the notebook

How you actually make assertions in the test functions will entirely depend on which language you're using in your notebook, and which is your preferred testing framework.

In the example PySpark notebook below, PyTest is used, which comes pre-installed on the Azure Synapse Spark pool. Assertions are made simply using the assert statement in the test code.

Capturing test failures

If you're executing your tests manually by running the notebook, then the output will appear in the test orchestration cell as your tests are called.

If you're running your notebooks from a Synapse Pipeline, then the notebook output can be captured using the following expression:

@activity('Notebook Name').output.status.Output.result.exitValue

By orchestrating all the tests from a single cell, capturing any assertion failures into the notebook output becomes fairly trivial, so that the calling pipeline can respond accordingly.
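As a sketch of how this could be done (assuming the two test cases shown earlier), the test orchestration cell could catch any assertion failure and pass its message out via mssparkutils.notebook.exit, which is what populates the exitValue read by the pipeline expression above:

from notebookutils import mssparkutils

try:
    # Run each test case in turn; any failed assert raises an AssertionError
    orders_with_duplicated_order_id_are_removed()
    similar_orders_with_different_order_id_are_not_removed()
    mssparkutils.notebook.exit("All tests passed")
except AssertionError as error:
    # Surface the failure message as the notebook's exit value
    mssparkutils.notebook.exit(f"Test failure: {error}")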

Putting it all together

The example notebook below ties all of this together, demonstrating how PyTest can be used inside a PySpark notebook running inside Azure Synapse, to test data transformation logic, controlled via a test_mode variable that can be set within a Synapse Pipeline.

There's always a way to test something

Notebooks provide an agile, collaborative and productive way to develop data transformation logic and ETL processes, meaning the time from exploration and experimentation to productionised processes is significantly reduced.

But whilst there's no "obvious" way to add quality assurance to notebook-based development, this post has shown that it's actually relatively simple.


James Broome

Director of Engineering


James has spent nearly 20 years delivering high quality software solutions addressing global business problems, with teams and clients across 3 continents. As Director of Engineering at endjin, he leads the team in providing technology strategy, data insights and engineering support to organisations of all sizes - from disruptive B2C start-ups, to global financial institutions. He's responsible for the success of our customer-facing project delivery, as well as the capability and growth of our delivery team.