Transcript
Hello. My name is James Broome, I'm the Director of Engineering at a UK-based technology consultancy called endjin. We're Microsoft Gold Partners for Cloud, for Data Platform, for Data & Analytics, for DevOps and for Power BI. We do lots of work for our customers around data and analytics in Azure, and over the past couple of years that's meant doing lots of work in Azure Synapse.
Today I'm going to talk for the next 10 minutes or so about two things very close to my heart:
One is Synapse notebooks - but everything I'm going to talk about is applicable to other notebook technologies, for example Databricks or Jupyter notebooks, or whatever other notebook technology you might have used.
And the second is testing - and I will explain why testing is important in relation to notebooks as we go through the talk. As you can see from the big clock behind me, this is a pre-recorded talk. The reason I wanted to pre-record it is that I'm going to be doing things with Spark, and Spark isn't instant when it comes to running queries - I don't want to waste 9 of those 10 minutes waiting for Spark sessions to load and queries to run. So there might be points where I skip or fast-forward the video a little bit to save some time, but I'm going to try to keep it as close to real time as I can.
I'm going to switch straight over to Synapse Studio and get stuck straight in.
An example use case
So let's imagine that we're working for an organization that sells things. In the Data Lake we've got an extract from our source system called Orders.csv. If we look at that, it's a fairly simple schema: we've got an order ID, a date, region, city, category, product, quantity, price and total price. So basically "what did we sell?", "how much of it?" and "where did we sell it?". And let's imagine that the business has said: okay, our source system doesn't give us any reporting capability, and we really want to start calculating things like "how much do we sell per region?" Let's take that as our initial use case.
So we think: okay, a good place to start would be to use notebooks, to start exploring the data and look at the shape of it, maybe do some data prep and cleansing, and then start to calculate the reporting metrics that the business wants. So our notebook looks something like this.
Using notebooks for ETL
We're going to load the data in from the lake, pointing at the CSV and reading it into a data frame. We've discovered that there's some duplicated data coming from the source system, so we're going to drop the duplicates, at which point we can calculate our sales by region by selecting the region and total price columns, grouping by region, summing the total price and doing a quick column rename, then writing the results back out to the data lake.
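To make that concrete, here's a rough sketch of what those cells might look like in a Synapse PySpark notebook - the storage paths and column names are illustrative, and spark is the session the notebook provides:

```python
# Cell 1 - load the orders extract from the lake (path is illustrative)
orders_df = spark.read.csv(
    "abfss://data@mydatalake.dfs.core.windows.net/Orders.csv",
    header=True,
    inferSchema=True,
)

# Cell 2 - the source extract contains duplicates, so drop them by order ID
deduped_df = orders_df.dropDuplicates(["order_id"])

# Cell 3 - aggregate total sales per region and rename the summed column
sales_by_region_df = (
    deduped_df
    .select("region", "total_price")
    .groupBy("region")
    .sum("total_price")
    .withColumnRenamed("sum(total_price)", "total_sales")
)

# Cell 4 - write the results back out to the data lake
sales_by_region_df.write.mode("overwrite").csv(
    "abfss://data@mydatalake.dfs.core.windows.net/output/",
    header=True,
)
```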
And we run this notebook. You'll see it executes fairly quickly because a session has already started up. When it's finished running, we see the output in the final cell here - there we go, East and West regions and their total sales - and if we go back to the data lake, we'll see the results written out there as well.
If we go back in here, into the output folder, you'll see we have a CSV file that looks a bit like this. There we go. So, a very simple use case - and obviously in real life this would probably be a lot more complicated - but the point is we've defined an ETL process in a notebook.
Agility comes with a risk
One of the things I think is most powerful about Synapse notebooks is that it's very easy to go from this ad-hoc, experimental, exploratory kind of process, which notebooks are really good for, to "Add to pipeline" > "New pipeline". And suddenly our notebook can be a repeatable, automated process that the business can rely on. This could run overnight if we add a trigger, or it could run whenever new data lands in the lake. We've gone from exploratory and ad hoc to line of business.
And the business can now depend on these insights because they're running all the time. That's amazing, but it does present a challenge: we've very quickly gone from exploring an idea to producing something that the business is now going to use to make decisions. So we want to bake in quality. How do we know that the results we've just produced are correct?
For example, how do we know that we've dropped the duplicates correctly - that we haven't removed too much data, or that we've only removed the rows we actually consider to be duplicates? How do we know that we've aggregated the data correctly to create our sales by region?
So really what we're saying is that the flexibility and power of notebooks is brilliant, but the downside is that it often builds the habit of feeling justified in ignoring a lot of very sensible engineering practices. What I'm going to show you is how, with just a few mindset changes, a bit of restructuring and a bit of refactoring, you can start to introduce testing into the notebook process and get the best of both worlds: you get a notebook to get going with really quickly, and you can add tests to build in regression quality gates, if you like, so that we don't lose quality over time and so that the logic we add to that notebook is correct.
Restructuring notebooks for testing
So let's move away from this notebook and into our second one. The first thing we're going to do is restructure the logic we had in the notebook into a series of functions. Typically a notebook is going to be used for some kind of data prep - probably some kind of ETL process - and that's quite nice because you've generally got a series of steps anyway. So if we break that ETL down: we did something that loaded data, we removed our duplicates, we calculated our sales by region and we saved the output. I've put each of those steps into a separate cell, making it explicit that this is a piece of functionality - a unit of code - that we want to test.
I'm being explicit about what each thing is going to do by defining it as a function and calling it something that describes the intent of that piece of logic: load_data loads the data from the lake, remove_duplicate_orders takes in a data frame, drops the duplicates and sends it back out again, and calculate_sales_by_region takes a data frame, does what it needs to do and sends it back out again.
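As a sketch, the restructured cells might look something like this (the column names and the built-in spark session are assumptions carried over from the earlier example):

```python
def load_data(path):
    # Read the orders extract from the lake into a data frame
    return spark.read.csv(path, header=True, inferSchema=True)


def remove_duplicate_orders(df):
    # Two rows with the same order ID are considered duplicates
    return df.dropDuplicates(["order_id"])


def calculate_sales_by_region(df):
    # Sum the total price per region and give the column a friendlier name
    return (
        df.select("region", "total_price")
        .groupBy("region")
        .sum("total_price")
        .withColumnRenamed("sum(total_price)", "total_sales")
    )


def save_output(df, path):
    # Write the aggregated results back out to the lake
    df.write.mode("overwrite").csv(path, header=True)
```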
Now, if we were to run this notebook once we've defined all the cells as functions, something interesting would happen - we'd actually not get any results. The reason for that is that when the notebook runs, all it's doing is defining the functions. In order to actually execute them the same way as we did previously, we need to add another cell at the bottom that acts as a kind of orchestration cell - something that controls the flow of execution and calls all those functions we've just defined to produce the same results.
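Something along these lines (paths illustrative):

```python
# Orchestration cell - controls the flow by calling the functions defined above
orders = load_data("abfss://data@mydatalake.dfs.core.windows.net/Orders.csv")
deduped = remove_duplicate_orders(orders)
sales_by_region = calculate_sales_by_region(deduped)
save_output(sales_by_region, "abfss://data@mydatalake.dfs.core.windows.net/output/")
```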
If we were to run this, it would load the data, remove the duplicate orders, calculate the sales by region, save the output, and we'd end up with the same results. And that's brilliant, because with one simple change we've got the same behaviour, but we've structured the notebook in a way that we can now test it.
Adding tests to the notebook
Okay, so in this version of the notebook we've started to add some tests, and if you've done any work with Python before - this is a PySpark-backed notebook - then it's just like writing Python unit tests. pytest, for example, is one of the packages already available in a Synapse workspace, so you can work with it straight away by importing it into your notebook.
So what actually are we testing? We've got a function called remove_duplicate_orders, so that'd be a good place to start - let's make sure it behaves as we expect. We can add another cell, again defining a function, which says we want to test that orders with a duplicated order ID are removed.
We're being explicit about what we're testing. We can manually create Spark data frames inline to simulate our test data scenarios - in this case, two rows of data where the order IDs are the same and the other fields are slightly different. When we run our remove_duplicate_orders function over that data frame, we want to make sure that one of the rows is removed, because there are two with the same order ID, and we expect one to be returned. If it doesn't work like that, then we're going to get a custom message explaining what's gone wrong. So it's a simple Arrange - Act - Assert unit test pattern, using the Python testing frameworks built into the Synapse workspace and the ability to create Spark data frames in memory in order to test it.
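A minimal sketch of what that test might look like, reusing the illustrative column names from earlier:

```python
def test_orders_that_have_duplicate_order_ids_are_removed():
    # Arrange - two rows with the same order ID but slightly different fields
    test_data = spark.createDataFrame(
        [
            (1, "2021-01-01", "East", "Widget", 2, 10.0, 20.0),
            (1, "2021-01-01", "East", "Widget", 3, 10.0, 30.0),
        ],
        ["order_id", "date", "region", "product", "quantity", "price", "total_price"],
    )

    # Act - run the function under test over the in-memory data frame
    result = remove_duplicate_orders(test_data)

    # Assert - only one of the two rows should survive
    count = result.count()
    assert count == 1, f"Expected 1 order after removing duplicates, but got {count}"
```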
So that's a simple unit test, but you tend to find in testing that as you test one scenario, you immediately start to think of something else. If we want to make sure that duplicated order IDs are removed, what about orders that look exactly the same but have different order IDs? So here we go - that leads me to build a different scenario: similar orders with a different order ID are not removed. We've got three rows of data here; all the fields are the same, but the order IDs are different. Based on our current logic we expect all of these rows to be returned, so we run this test expecting three orders to come back, and if they don't, something's gone wrong and we want to throw an error accordingly.
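Sketched out, that second scenario might look like this:

```python
def test_similar_orders_with_different_order_ids_are_not_removed():
    # Arrange - three rows that are identical apart from the order ID
    test_data = spark.createDataFrame(
        [
            (1, "2021-01-01", "East", "Widget", 2, 10.0, 20.0),
            (2, "2021-01-01", "East", "Widget", 2, 10.0, 20.0),
            (3, "2021-01-01", "East", "Widget", 2, 10.0, 20.0),
        ],
        ["order_id", "date", "region", "product", "quantity", "price", "total_price"],
    )

    # Act
    result = remove_duplicate_orders(test_data)

    # Assert - different order IDs, so nothing should be removed
    count = result.count()
    assert count == 3, f"Expected 3 orders to be returned, but got {count}"
```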
The final unit test I've got in this notebook is around the regional sales calculation. Again, we can simulate some more test data - we've got three sales in the East region and one in the West. When we call our calculate_sales_by_region function, we know what the figures should be from a manual calculation over this subset of data, so we can check that the results are what we expect. And if not, we're going to throw an error.
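Again as a sketch, with an assumed test name and simplified columns:

```python
def test_sales_are_totalled_by_region():
    # Arrange - three sales in the East region and one in the West
    test_data = spark.createDataFrame(
        [
            (1, "East", 20.0),
            (2, "East", 30.0),
            (3, "East", 10.0),
            (4, "West", 15.0),
        ],
        ["order_id", "region", "total_price"],
    )

    # Act
    result = calculate_sales_by_region(test_data)

    # Assert - the figures should match a manual calculation over this subset
    east_total = result.filter(result.region == "East").first()["total_sales"]
    west_total = result.filter(result.region == "West").first()["total_sales"]
    assert east_total == 60.0, f"Expected East total of 60.0, but got {east_total}"
    assert west_total == 15.0, f"Expected West total of 15.0, but got {west_total}"
```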
Running the tests in the ETL process
So we've defined all these tests as functions, like in the previous notebook, which means we need to orchestrate running them. Down at the bottom here we've got our ETL process, but before we do that, we're going to call all our tests. If we run all of that, you'll see that all the tests currently pass because our logic is correct - but equally, I can show you what happens when a test fails.
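As a sketch, that final orchestration cell might now look roughly like this - tests first, then the ETL:

```python
# Run the tests first...
test_orders_that_have_duplicate_order_ids_are_removed()
test_similar_orders_with_different_order_ids_are_not_removed()
test_sales_are_totalled_by_region()

# ...then run the ETL process itself (paths illustrative)
orders = load_data("abfss://data@mydatalake.dfs.core.windows.net/Orders.csv")
deduped = remove_duplicate_orders(orders)
sales_by_region = calculate_sales_by_region(deduped)
save_output(sales_by_region, "abfss://data@mydatalake.dfs.core.windows.net/output/")
```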
Okay, so the notebook has run, we've got our output as usual and there are no errors. Now let's go up to our remove_duplicate_orders function and change the logic. Let's comment this out and say that a duplicate is no longer based on just the order ID, but on distinct() - so every field has to be the same for us to consider it a duplicate.
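In other words, the function might now look something like this:

```python
def remove_duplicate_orders(df):
    # Original rule - a duplicate is any repeated order ID
    # return df.dropDuplicates(["order_id"])

    # Changed rule - every field has to match for a row to count as a duplicate
    return df.distinct()
```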
If we run that again now, our logic has changed, which means our unit test is going to fail because the behaviour behind that piece of business logic is different. So when we run it and get down to the final orchestration step, we've got an error: in our test_orders_that_have_duplicate_order_ids_are_removed step, we expected one order but two were returned - because not all the fields are the same, they weren't considered duplicates - and the test that we expected to pass has now failed.
So we can easily test our notebooks by restructuring them accordingly: wrapping business logic up in functions and adding unit test code into the notebook to execute that logic and prove it behaves as we expect.
Refactoring for maintainability
But as you can probably imagine, this might get quite unwieldy quite quickly. This is a very simple use case and a very simple ETL process, but already our notebook's getting quite long. We've got test logic interspersed with our actual ETL logic, and a big orchestration cell down here that could get longer and longer. So how do we manage this going forward? How do we restructure things a bit more to keep this simple to work with as things get more complicated?
I'm going to show you two ways that we can do this. Both involve splitting the test logic out from the core business process logic. In the first example, we're going to rely on something that's built into notebooks - what they call a "magic command".
The %run command allows us to reference one notebook from another. By doing that, it's basically going to run the referenced notebook and pull it into memory, so all the variables defined in the core notebook are available in the calling notebook. With that in mind, we can split our notebooks into a test notebook and a main processing notebook: in the test notebook, we call into our processing notebook. And another thing we can do is take advantage of the parameters that are available in notebooks to pass in the fact that we want to run in what I've called test_mode.
If we look at the processing notebook, we've set a variable called test_mode and set it to False. We've marked this cell as a parameters cell, which means test_mode is one of the parameters available to set from outside. If we were to run the processing notebook on its own, test_mode is False and it's going to run all the way through as before - all I've done is say, if not test_mode, then run the workflow. So if we ran this notebook, it would run the end-to-end process: because we're not running the tests, it's going to run the orchestration cell, everything's going to happen and our output is going to get written to the data lake.
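As a sketch, the relevant cells of the processing notebook might look like this (the first cell is marked as a parameters cell in Synapse Studio; paths are illustrative):

```python
# Parameters cell - marked as a parameters cell in Synapse Studio,
# so test_mode can be overridden from outside (e.g. by %run or a pipeline)
test_mode = False

# ...cells defining load_data, remove_duplicate_orders, etc. as before...

# Orchestration cell - only run the end-to-end process when not in test mode
if not test_mode:
    orders = load_data("abfss://data@mydatalake.dfs.core.windows.net/Orders.csv")
    deduped = remove_duplicate_orders(orders)
    sales_by_region = calculate_sales_by_region(deduped)
    save_output(sales_by_region, "abfss://data@mydatalake.dfs.core.windows.net/output/")
```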
If we run the test notebook, it's slightly more interesting: when we run this notebook, all the function definitions from the processing notebook are now available in the test notebook, so all our tests can run as normal. But because we've passed in test_mode, the processing notebook isn't actually going to run its orchestration, so no data is going to get written to the data lake. We're not actually going to do the processing; we're just going to define the functions, which means at the end our tests have run successfully and nothing has been written.
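The test notebook might then start with something like this - the notebook name is illustrative, and the parameter is passed to %run as a JSON-style value:

```python
# Cell 1 - run the processing notebook in test mode: its functions get defined,
# but its orchestration cell is skipped
%run ProcessSalesData { "test_mode": true }

# ...cells defining the test functions, as before...

# Final cell - the shared functions now exist in this session, so run the tests
test_orders_that_have_duplicate_order_ids_are_removed()
test_similar_orders_with_different_order_ids_are_not_removed()
test_sales_are_totalled_by_region()
```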
So that's a really easy way to split out test logic from your processing and use the parameter cells and the magic command %run
to link the two together.
Refactoring further - modular notebooks
And the second and final approach I'm going to show you - which is probably a better long-term solution as you get more and more notebooks, and more logic in those notebooks - is to split things out even further.
In this case, I've now got three notebooks. I've defined a Sales Data Functions notebook, which literally just contains the definitions of the functions that we're going to use in both our processing and our tests. There's no orchestration cell - running this notebook simply defines the functions and makes them available for use. As you can probably guess, the notebook that actually runs the orchestration now becomes a lot simpler. It also uses the magic command to run the functions notebook and define the functions.
And here's our orchestration cell, which ties the whole thing together: we load the data, remove the duplicate orders, calculate the sales by region and save the output back to the data lake. Equally, our test notebook does a similar thing, and it no longer needs to pass in the test_mode parameter at all. We just define the functions by calling the magic command %run, and then we run our unit tests.
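As a rough sketch of how the three notebooks might divide up (the names are illustrative, and each %run sits in its own cell):

```python
# "Sales Data Functions" notebook - just the shared definitions, no orchestration:
#   load_data, remove_duplicate_orders, calculate_sales_by_region, save_output

# "Process Sales Data" notebook
# Cell 1 - pull in the shared function definitions
%run SalesDataFunctions
# Cell 2 - the orchestration, exactly as before
orders = load_data("abfss://data@mydatalake.dfs.core.windows.net/Orders.csv")
save_output(
    calculate_sales_by_region(remove_duplicate_orders(orders)),
    "abfss://data@mydatalake.dfs.core.windows.net/output/",
)

# "Test Sales Data" notebook
# Cell 1 - the same shared definitions, no test_mode parameter needed
%run SalesDataFunctions
# Cell 2 - the test definitions (as before), then a cell that runs them
test_orders_that_have_duplicate_order_ids_are_removed()
test_similar_orders_with_different_order_ids_are_not_removed()
test_sales_are_totalled_by_region()
```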
What's great is that we've now got two entry points: an entry point into the ETL process and an entry point into the tests. So, going back to thinking about adding this into a pipeline, we can create a pipeline that runs our tests and then runs our ETL process, using the same logic and the same functions defined in the central functions notebook.
Adding tests to the ETL pipeline
I'm going to show you what that looks like. Here we've got a pipeline with two steps, both of which are notebook steps. The first one calls into the test notebook, so the tests run first. If the tests succeed, it's going to run the second notebook, which calls into the Process Sales Data notebook - the main ETL orchestration - to actually run the process.
Now, what's interesting is that if the tests pass, the processing is going to happen, but if the tests fail, the second activity isn't going to run - and I want to show you that as well. So, to do that, let's go back to our functions and change that logic again: instead of dropping the duplicates by order ID, we're going to make the process fail by only returning distinct orders, which the test doesn't account for.
So when we run this pipeline, our test step is going to fail, which means the processing won't occur. And we should see that test error message in the pipeline error as well so we can see what's gone wrong.
So our pipeline has finished running and, as we expected, it's failed because our unit test has failed. If we dig into the error message - our unit test error messages are available in here - we can see why it's failed: someone's changed our business logic around how duplicate orders are removed, and we can go and dig into that and sort it out. But the important thing is that our test acted as a quality gate in our end-to-end pipeline, ensuring the quality of the results that get published to the business.
Wrapping up
So that's the end of the session. I know it's been fairly fast-paced, but hopefully it's made sense, and hopefully I've proved that it's actually easier than you think to structure your notebooks for testing.
The benefit of notebooks is the agility and the speed of getting things working very quickly, then turning them into pipelines and running them to get value out of that process as quickly as possible. But don't forget about good, solid engineering practices - with a slight shift in mindset and a bit of restructuring, it's quite straightforward to bake quality gates and unit testing into these processes, to make sure that the results you're providing to the business, and the decisions being made on them, can be trusted and are therefore useful.
Thanks very much for listening.