Import and export notebooks in Databricks
Sometimes we need to import and export notebooks to and from a Databricks workspace. This might be because you have a set of generic notebooks that are useful across numerous workspaces, or because you need to delete your current workspace for some reason and therefore have to transfer its content over to a new one.
Some of this can be done manually, relatively easily. You can export workspace directories from the drop-down menu in the workspace view of the UI. This works at any level - at the root or in child directories (provided you have access to the directory in question).
You can export files and directories as .dbc files (Databricks archives). If you swap the .dbc extension to .zip, inside the archive you'll find the same directory structure you see in the Databricks UI. Exporting the root of a Databricks workspace downloads a file called Databricks.dbc.
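If you want a quick look inside an exported archive from the command line, here's a minimal sketch, assuming a Unix-like shell with unzip available and that the export was saved as Databricks.dbc in the current directory:
# Copy the archive with a .zip extension, then list its contents.
cp Databricks.dbc Databricks.zip
unzip -l Databricks.zip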
You can also import .dbc files in the UI, in the same manner. This is fine for importing the odd file that doesn't already exist. However, the UI offers no way to overwrite files or directories; if you try to import a file or directory that already exists, a copy of that artifact will be created.
An alternative solution is to use the Databricks CLI. The CLI offers two subcommands to the databricks workspace utility, called export_dir and import_dir. These recursively export/import a directory and its files from/to a Databricks workspace and, importantly, include an option to overwrite artifacts that already exist. Individual files are exported in their source format.
How it works
First of all, if you don't have the Databricks CLI installed locally, run pip install databricks-cli.
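A quick way to check the installation worked is to ask the CLI for its version:
databricks --version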
Next, we need to authenticate to the Databricks CLI. The easiest way to do this is to set the session's environment variables DATABRICKS_HOST and DATABRICKS_TOKEN. Otherwise, you will need to run databricks configure --token and insert your values for the host and token when prompted. The value for the host is the Databricks URL of the region in which your workspace lives (for me, that's https://uksouth.azuredatabricks.net). If you don't know where to get an access token, see the Databricks documentation on personal access tokens.
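As a minimal sketch, assuming a bash-style shell (in Windows PowerShell you'd set $env:DATABRICKS_HOST and $env:DATABRICKS_TOKEN instead), setting the environment variables looks like this:
# Replace the values with your own workspace URL and personal access token.
export DATABRICKS_HOST="https://uksouth.azuredatabricks.net"
export DATABRICKS_TOKEN="<your-personal-access-token>"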
Now that authentication is out of the way, we can address the subject of this blog.
Export
The general template is:
databricks workspace export_dir "<databricks-source-path>" "<local-path-to-export-to>"
To export the workspace root to the temp folder on your C drive, this would be:
databricks workspace export_dir "/" "C:/Temp/"
If you try to export any files that already exist in your local directory, the CLI will skip those files. You can tell the command to overwrite the local files by passing -o to the command.
databricks workspace export_dir "/" "C:/Temp/" -o
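The source path doesn't have to be the root. As an illustration, assuming your workspace has content under the default /Shared folder, you could export just that directory:
databricks workspace export_dir "/Shared" "C:/Temp/Shared/" -o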
Import
The general template is:
databricks workspace import_dir "<local-path-where-exports-live>" "<databricks-target-path>"
For example, if my directories live in C:/Temp/DatabricksExport/ on my machine, and I want to import them into the root of a Databricks workspace, this is the command:
databricks workspace import_dir "C:/Temp/DatabricksExport" "/"
However, if you're importing any files that already exist, you'll get an error. Get around this error by, again, adding -o to the command.
databricks workspace import_dir "C:/Temp/DatabricksExport" "/" -o
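Coming back to the scenario of moving content to a new workspace: one way to sketch a migration, assuming you've configured two CLI connection profiles with databricks configure --token --profile <name> (the names old and new below are just placeholders), is to chain an export against the old workspace with an import against the new one using the --profile option:
# Export everything from the old workspace to a local folder...
databricks workspace export_dir "/" "C:/Temp/Migration/" -o --profile old
# ...then import that folder into the root of the new workspace.
databricks workspace import_dir "C:/Temp/Migration/" "/" -o --profile new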
In an ideal world
A Databricks notebook can be synced to an ADO/GitHub/Bitbucket repo. However, I don't believe there's currently a way to clone a repo containing a directory of notebooks into a Databricks workspace. It'd be great if Databricks supported this natively. That said, using the CLI commands I've shown above, there are certainly ways around this - but we'll leave that as content for another blog!