How to test Azure Synapse notebooks
Interactive notebooks are an incredibly powerful tool for data exploration and experimentation. We're huge advocates of notebook technologies and notebook-based development, and love how they can significantly reduce time to (business) value and eliminate waste by improving collaboration and productivity. This is especially true inside Azure Synapse, where a notebook can be "productionised" into an automated, repeatable, cloud-hosted data pipeline with a few clicks. But with this agility comes a risk - if business insights are being surfaced then there's a need for quality assurance, the absence of which can mean that inaccuracies in the data go unnoticed. As with any application, it's important to validate business rules and security boundaries within a data solution, and to ensure that quality doesn't regress over time as the solution evolves.
But how do you tackle testing inside notebooks? This post argues that with a simple shift in mindset around how notebooks are structured, tests can be added relatively easily to any notebook logic, just like any other form of application logic. An end-to-end notebook sample is also included, which can be used as a reference for your own projects.
Structuring the notebook for testing
Whilst notebooks might often start off as a tool for data exploration or experimentation, when they get added into a repeatable process they're typically doing some form of ETL or ELT - applying a series of transformation steps over a data input, resulting in a different data output.
If each logical step (cleansing, joining, filtering, modelling etc.) is treated as a separate action, and the notebook is structured accordingly, then each step can be tested in isolation just like any software sub-routine or module.
Notebooks have a natural way to modularise the steps through cells. In order to test a cell in isolation, the code it contains simply needs to be wrapped in a function that can be called from elsewhere in the notebook.
So, the key to testing notebooks is to treat each cell as a logical step in the end-to-end process, wrapping the code in each cell in a function so that it can be tested.
For example, the simple function in the PySpark sample below removes duplicates in a dataframe. Even though it's only one line of code, it still contains a rule about how duplicates are identified - in this case using just the OrderId column. This logic could have been implemented differently (e.g. finding duplicates across all the columns), so arguably it should have tests around it to validate that the outputs are as expected. Wrapping it in a function also has the added advantage of making it explicit, through the function's name, what this step is doing.
def remove_duplicate_orders(df):
    return df.dropDuplicates(["OrderId"])
It's worth noting that this restructuring, whilst allowing code to be tested, does come with some loss of flexibility in running individual cells in isolation. Now that it's wrapped in a function, running a cell will just define the function rather than execute it. So there also needs to be code in another cell that calls the function.
There are a couple of approaches you could then follow to orchestrate the end-to-end process once the notebook is made up of a series of modular, cell-based functions. Firstly, add the "calling cells" alongside the "function cells" as logical pairs, so that they're together in the context of the overall notebook. Alternatively, add a final cell to the notebook that acts as the orchestration/workflow cell, calling each step in the right order. The effect is the same as running the whole notebook from top to bottom, but this explicit structuring makes it possible to add tests around the logic.
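For illustration, an orchestration cell for this kind of notebook might look something like the sketch below - load_orders() and save_orders() are hypothetical placeholders for the other cell-based functions in the process:

# Orchestration/workflow cell - calls each step of the process in order.
# load_orders() and save_orders() are hypothetical placeholders for the
# other functions defined in the notebook's cells.
df_orders = load_orders()
df_orders = remove_duplicate_orders(df_orders)
save_orders(df_orders)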
Adding test cases
Tests can be added to the notebook in exactly the same way, as a series of cells that test specific scenarios over your data transformation functions. Wrapping each test in a separate, well-named function also lets you describe exactly what's being tested. But again, this requires a separate test orchestration/workflow cell in the notebook to run all the tests.
Using the simple example above of removing duplicates, here's an example test case to validate the logic:
def orders_with_duplicated_order_id_are_removed():
    # Arrange
    df = spark.createDataFrame(
        [
            (10, "01/01/2020", "North", "Chicago", "Bars", "Carrot", 33, 1.77, 58.41),
            (10, "11/03/2020", "North", "Chicago", "Bars", "Carrot", 33, 1.77, 58.41),
        ],
        orders_schema
    )
    # Act
    df_result = remove_duplicate_orders(df)
    # Assert
    assert df_result, "No data frame returned from remove_duplicate_orders()"
    expected_orders = 1
    number_of_orders = df_result.count()
    assert number_of_orders == expected_orders, f'Expected {expected_orders} order after remove_duplicate_orders() but {number_of_orders} returned.'
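Note that these tests reference an orders_schema variable that's defined elsewhere in the notebook. For these snippets, assume it's something along the lines of the sketch below - the column names and types are assumptions inferred from the sample rows:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Assumed schema for the sample order rows used in the tests -
# the real notebook may well define this differently.
orders_schema = StructType([
    StructField("OrderId", IntegerType()),
    StructField("OrderDate", StringType()),
    StructField("Region", StringType()),
    StructField("City", StringType()),
    StructField("Category", StringType()),
    StructField("Product", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", DoubleType()),
    StructField("Total", DoubleType()),
])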
Being explicit about what the test is doing in the name of the test function then naturally forces us to think about other scenarios that we might want to test.
For example, if we've decided that our de-duplication logic is based around the OrderId column, we might want to validate that rows that have identical values across all the other columns aren't treated as duplicates:
def similar_orders_with_different_order_id_are_not_removed():
    # Arrange
    df = spark.createDataFrame(
        [
            (10, "01/01/2020", "North", "Chicago", "Bars", "Carrot", 33, 1.77, 58.41),
            (11, "01/01/2020", "North", "Chicago", "Bars", "Carrot", 33, 1.77, 58.41),
            (12, "01/01/2020", "North", "Chicago", "Bars", "Carrot", 33, 1.77, 58.41),
        ],
        orders_schema
    )
    # Act
    df_result = remove_duplicate_orders(df)
    # Assert
    expected_orders = 3
    number_of_orders = df_result.count()
    assert number_of_orders == expected_orders, f'Expected {expected_orders} orders after remove_duplicate_orders() but {number_of_orders} returned.'
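With the tests defined as functions, a separate cell is needed to actually run them. A minimal test orchestration cell might be as simple as the following sketch:

# Test orchestration cell - calls each test function in turn.
# Any failing assertion will surface as an AssertionError in the cell output.
orders_with_duplicated_order_id_are_removed()
similar_orders_with_different_order_id_are_not_removed()
print("All tests passed")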
Splitting out tests from workflow
Depending on the complexity of the notebook, it might be perfectly acceptable to leave everything in one place - the functions, the tests, and the orchestration workflow all in the same notebook. However, as your notebook's complexity grows, this can become unmanageable quite quickly.
If your notebook is simple and you do want to keep everything together in one place, Azure Synapse has a feature that allows one notebook cell to be treated as the "parameters cell", meaning any variables defined in that cell can be overridden by parameters passed in from a pipeline. Using this approach, a simple test_mode flag could be added, allowing the same notebook to be run inside a pipeline either to execute the tests or to run the actual workflow.
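A minimal sketch of this approach, assuming a test_mode parameter supplied by the pipeline, is a parameters cell that defines a default value:

# Parameters cell - values defined here can be overridden by pipeline parameters.
test_mode = False

followed by an orchestration cell that switches on the flag:

# Orchestration cell - run either the tests or the actual workflow.
if test_mode:
    orders_with_duplicated_order_id_are_removed()
    similar_orders_with_different_order_id_are_not_removed()
else:
    # ... call the workflow functions here, e.g. remove_duplicate_orders(...)
    pass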
Alternatively, if you want to split things out (either because your notebook is becoming unmanageable, or for pure "separation of concerns" reasons), the magic %run command can help, allowing you to run one notebook from another. With this approach, you can separate your business logic notebook from your test notebook: the test notebook uses %run to execute the main notebook, and the logical functions it defines can then be tested easily in isolation.
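For example, the first cell of the test notebook might pull in the functions defined in the main notebook like this (the notebook name here is purely illustrative):

%run transform_orders_notebook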
In either case, the tests and the notebook logic can then be added to a pipeline, with the tests now acting as a quality gate in the end-to-end process.
Limitations in Azure Synapse notebooks
At the time of writing, there are two limitations around the use of the magic %run command within Azure Synapse:
It doesn't work if you have a Git-enabled workspace and are working on a branch (rather than in Live mode)
It doesn't work when using a Notebook Activity inside a Synapse Pipeline
With this in mind, the end-to-end notebook example below combines the tests and workflow in the same notebook, using the test_mode approach described above to switch between running the tests and running the workflow.
When these limitations are removed, the example could be easily refactored to be split out into two separate notebooks.
Using test frameworks in the notebook
How you actually make assertions in the test functions will depend entirely on which language you're using in your notebook and your preferred testing framework.
In the example PySpark notebook below, PyTest is used, which is pre-installed on the Azure Synapse Spark pool. Assertions are made simply using the assert statement in the test code.
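If you want to use pytest for more than the plain assert statement, its helpers work in the same way - for example, pytest.approx is handy when comparing floating-point values. A small illustrative test (not part of the original example, and using the assumed Total column from the schema sketch above) might look like this:

import pytest

# Illustrative test using a pytest helper to compare a floating-point value.
def order_total_is_unchanged_after_deduplication():
    df_result = remove_duplicate_orders(spark.createDataFrame(
        [(10, "01/01/2020", "North", "Chicago", "Bars", "Carrot", 33, 1.77, 58.41)],
        orders_schema
    ))
    assert df_result.first()["Total"] == pytest.approx(58.41)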
Capturing test failures
If you're executing your tests manually by running the notebook, then the output will appear in the test orchestration cell as your tests are called.
If you're running your notebooks from a Synapse Pipeline, then the notebook output can be captured using the following expression:
@activity('Notebook Name').output.status.Output.result.exitValue
Because all of the tests are orchestrated from a single cell, capturing any assertion failures into the notebook output is a fairly trivial process, allowing the calling pipeline to respond accordingly.
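One way to do this - a sketch, assuming the mssparkutils utilities available in Synapse notebooks - is to catch assertion failures in the test orchestration cell and surface them through the notebook's exit value:

from notebookutils import mssparkutils

# Test orchestration cell - run every test, collect any failures, and
# surface the result as the notebook's exit value for the calling pipeline.
failures = []
for test in [orders_with_duplicated_order_id_are_removed,
             similar_orders_with_different_order_id_are_not_removed]:
    try:
        test()
    except AssertionError as error:
        failures.append(f"{test.__name__}: {error}")

mssparkutils.notebook.exit("Success" if not failures else "; ".join(failures))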
Putting it all together
The example notebook below ties all of this together, demonstrating how PyTest can be used inside a PySpark notebook running in Azure Synapse to test data transformation logic, controlled via a test_mode variable that can be set from a Synapse Pipeline.
There's always a way to test something
Notebooks provide an agile, collaborative and productive way to develop data transformation logic and ETL processes, meaning the time from exploration and experimentation to productionised processes is significantly reduced.
But whilst there's no "obvious" way to add quality assurance to notebook-based development, this post has shown that it's actually relatively simple.