Background
Azure Batch is one of many Azure services that can be used to run analyses in the cloud. It uses a hierarchy of Pools, which contain Jobs, which contain Tasks. A Pool is a group of Nodes (similar to virtual machines), which are the computers that will run the analyses. When a pool is created you choose the number of nodes and configure the operating system, RAM and number of cores that each node will have. Jobs are used for scheduling related tasks; in simple cases a job is just a folder that contains your tasks. Tasks contain the code that you want a node to run and information on where to get data from and where to save it.
There are several options available for interacting with Azure Batch, including the Azure web portal and the Azure command line interface (CLI). This tutorial focuses on the Azure CLI because it is the simplest and fastest option. In both cases you need to be on the ECCC network to access the cloud, either in the office or on the VPN. The Azure CLI can be used from any terminal on Windows, but the code below is written assuming you are using a Bash terminal and will not run exactly as written in other shells.
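At a high level, a full run with the CLI follows the sequence sketched below. Every command is explained in the sections that follow, and the pool, job and task ids are just the example names used later in this tutorial.
# sign in to Azure and the batch account
az login
az batch account login -g EcDc-WLS-rg -n ecdcwlsbatch
# create a pool of nodes and a job on that pool
az batch pool create --json-file cloud_config/template_pool_cli.json
az batch job create --pool-id test_pool_json_cli --id "test_job"
# add a task to the job and check on its progress
az batch task create --json-file cloud_config/template_task_hello.json --job-id test_job
az batch task show --job-id test_job --task-id sampleTask
# clean up when finished so we are no longer charged
az batch job delete --job-id test_job
az batch pool delete --pool-id test_pool_json_cli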
Get Access
To be added to the LERS Azure Batch account, email Sarah Endicott; she will email the CCOE (Cloud Centre of Expertise) to ask them to give your account the required permissions. I recommend allowing a week for this process.
Install Azure CLI
Azure CLI
- Install using the install wizard: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows?tabs=azure-cli#install-or-update If you don’t have elevated privileges you will need to call the service desk to have them complete the installation for you.
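- Once the install finishes you can confirm the CLI is available by running
az --version
in a new terminal.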
Use Azure CLI to run an analysis
Sign in
- Run
az login
in the terminal. This will open a web browser for you to log in.
- Run
az batch account login -g EcDc-WLS-rg -n ecdcwlsbatch
to connect to the “EcDc-WLS-rg” resource group and the “ecdcwlsbatch” batch account.
- Run
az batch pool list
to show any existing pools. Note: if you get an error at this point saying “This request is not authorized to perform this operation.” you are likely not signed on to the VPN; sign on and try again.
Create a pool
You can create a pool from the CLI either by providing the configuration as individual arguments or by supplying a JSON file. I use the JSON file option because it is easy to modify previous versions and because not all functionality is available as arguments.
A template JSON file is provided in this repository under cloud_config/template_pool_cli.json. Some parts of the JSON file should stay the same in most cases while others will need to be adjusted based on the type of pool you want to build.
{
  "type": "Microsoft.Batch/batchAccounts/pools",
  "apiVersion": "2016-12-01",
  "id": "test_pool_json_cli",
  "vmSize": "standard_A1_v2",
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "microsoft-azure-batch",
      "offer": "ubuntu-server-container",
      "sku": "20-04-lts",
      "version": "latest"
    },
    "nodeAgentSKUId": "batch.node.ubuntu 20.04",
    "containerConfiguration": {
      "type": "dockerCompatible",
      "containerImageNames": [
        "rocker/r-bspm:jammy"
      ]
    },
    "nodePlacementConfiguration": {
      "policy": "regional"
    }
  },
  "targetDedicatedNodes": 1,
  "networkConfiguration": {
    "subnetId": "/subscriptions/8bc16bb2-5633-480b-8f4c-3ba6a6c769b4/resourceGroups/EcDc-WLS-rg/providers/Microsoft.Network/virtualNetworks/EcDc-WLS-vnet/subnets/EcDc-WLS_compute-cluster-snet"
  }
}
Fields to edit:
- id: Give the pool a name. It should start with the first letter of your first name followed by your last name so we can differentiate pools to track billing, e.g. “sendicott_test_pool”.
- vmSize: This is the family of VM and the number of CPUs; different families have different CPU-to-RAM ratios. We have access to a subset of VM families with a quota for each one. See this Microsoft website for details on the different families. The easiest way to filter the options and choose a size is to go to the Azure portal > Batch > Pools page, click “Add”, scroll to vmSize and follow the link for “View full pricing details”. In this example we use “Standard_A1_v2”. All the names start with “Standard_” and then you substitute the sku, which contains the family, number of CPUs and version. A CLI alternative for listing available sizes is sketched after this list.
- targetDedicatedNodes: The number of nodes to have in the pool. Typically this is 1 if you are going to run a single task that uses multiple cores on one machine for parallelization, or one node per task if you want each task to run on a separate machine.
- containerImageNames: The name of the docker image that you want to use. This will pre-fetch the docker image for use with your tasks. You can enter the name of any DockerHub image. For R users, see the rocker project for some good options. The one used in the example file is “rocker/r-bspm:jammy”, which is based on ubuntu jammy and has R installed and set up to use the bspm package so that R packages can be installed on linux from binary using install.packages as usual.
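If you prefer to stay in the CLI, you can also list the VM sizes available for Batch in a region. The region name below is an assumption for illustration; substitute the region our account uses.
az batch location list-skus --location canadacentral --output table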
Optional customization fields:
- imageReference: This is basically the operating system that you want to use. I have provided two pool template JSON files, one for windows and one for linux, so you can just use one of those unless you have another reason to customize. (TODO: windows version did not work with example) You can choose from windows or linux options, but if you want to use docker you need to choose one that is docker compatible. To see the options you can run
az batch pool supported-images list --filter "osType eq 'linux'" | tee test_linux.txt
This lists the supported images for the linux osType (change to ‘windows’ to see windows options) and outputs the result to a text file. I then searched the text file for “dockerCompatible”.
Once you are happy with the JSON file you can create your pool by running
az batch pool create --json-file cloud_config/template_pool_cli.json
Run
az batch pool list
to see the status of the pool.
Create a task
Similar to creating a pool, we create a task by referring to a JSON configuration file and the job id that the task should be assigned to.
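A task has to be added to an existing job, so if you have not already created one on your pool, create it first. This is the same command used in the full example later and assumes the pool id from the template above.
az batch job create --pool-id test_pool_json_cli --id "test_job"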
The JSON file will depend on the task to be performed. “template_task_hello.json” reflects the simplest case where there are no input or output files and it runs an R function directly from the command line.
{
  "id": "sampleTask",
  "commandLine": "R -e 'mean(mtcars$mpg)'",
  "containerSettings": {
    "imageName": "rocker/r-bspm:jammy",
    "containerRunOptions": "--rm"
  },
  "userIdentity": {
    "autoUser": {
      "scope": "task",
      "elevationLevel": "nonadmin"
    }
  }
}
JSON file fields:
- id: name of the task
- commandLine: command line code to run. In this case because R is already installed on the docker container this just works.
- imageName: The DockerHub image to use
- containerRunOptions: options that would normally be supplied with the docker run command
- userIdentity: The identity of the user running the task; scope can be pool or task and elevationLevel can be admin or nonadmin.
The task will start to execute as soon as it is created.
az batch task create --json-file cloud_config/template_task_hello.json --job-id test_job
We can check the status of the task and filter the output with a query to only get the information we need.
az batch task show --job-id test_job --task-id sampleTask
az batch task show --job-id test_job --task-id sampleTask --query "{state: state, executionInfo: executionInfo}" --output jsonc
The “state” should be active, running, or completed (active is the state when the task is starting up) and executionInfo will tell you the result (success or failure) and start and end time.
To find out more about what is going on in your task you can download the stdout or stderr files. stdout.txt will show the console output on the machine and stderr.txt will show any error messages.
az batch task file download --task-id sampleTask --job-id test_job --file-path "stdout.txt" --destination "./run_out.txt"
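The same command works for the error log; for example, to save stderr.txt alongside the output (the destination file name is just a suggestion):
az batch task file download --task-id sampleTask --job-id test_job --file-path "stderr.txt" --destination "./run_err.txt"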
Clean up
Always remember to delete your pools and jobs when they are complete. We will be charged for as long as a pool is up, regardless of whether anything is actually running on it. Tasks are deleted when the job they are part of is deleted, but pools and jobs each need to be deleted separately.
To delete the pool and job we created and confirm they were deleted run the following:
az batch job delete --job-id test_job
az batch pool delete --pool-id test_pool_json_cli
az batch job list --query "[].{id:id, state:state}"
az batch pool list --query "[].{id:id, state:state}"
You may need to wait a while for the pool to finish being deleted.
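If you want to watch the progress, a deleting pool shows a “deleting” state until it disappears from the list; for example (the command returns an error once the pool is fully gone):
az batch pool show --pool-id test_pool_json_cli --query "{id:id, state:state}"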
Access data in your task
One thing to keep in mind is that your task has access to the internet, so any method that you could normally use to access data and/or code over the internet should work in a script as long as you can set it up to run non-interactively. You might consider storing data on OSF, google drive or GitHub so that anyone can run your script and have the data load automatically. Another option, which I will demonstrate here, is to load your data onto an Azure Storage Container and connect it to your task. This uses the azcopy tool in the background (see the install instructions at the end of this document if it is not installed automatically).
To run this example clone this GitHub repo by running
usethis::create_from_github("LandSciTech/cloudDemo")
in R.
Then you will have the analyses folder, which contains two folders:
- scripts:
  - script_read_csv.R: a super simple R script that reads a csv, subsets the table and saves a new csv
  - run_script.sh: a bash script that runs the R script
- data:
  - raw-data:
    - fruits.csv: data used by the script
  - derived-data:
    - just has a README for now, but is where results are stored
Upload files to container
If you don’t have an existing container under your name in the Azure storage account, create one using your first initial and last name.
az storage container create --name jdoe --account-name ecdcwls --auth-mode login
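To confirm the container was created (or to check whether one already exists under your name), you can list the containers in the storage account:
az storage container list --account-name ecdcwls --auth-mode login --query "[].name" --output tsv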
Next, get a Shared Access Signature (SAS) token for the container you want to access by running the code below, replacing “sendicott” with the name of your own storage container. This will save the token as a variable sastoken. The token will expire after one day; in most real use cases I would change the expiry to as long as you expect the analysis to run. The sasurl is the full url for the container with the SAS token included and can be used as a single argument in some cases.
end=`date -u -d "1 day" '+%Y-%m-%dT%H:%MZ'`
sastoken=`az storage container generate-sas --account-name ecdcwls --expiry $end --name sendicott --permissions racwdli -o tsv --auth-mode login --as-user`
sasurl=https://ecdcwls.blob.core.windows.net/sendicott/?$sastoken
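For a longer-running analysis the same commands work with a later expiry. As a sketch, assuming five days is enough (note that a SAS created with --as-user cannot be valid for more than about seven days):
end=`date -u -d "5 days" '+%Y-%m-%dT%H:%MZ'`
sastoken=`az storage container generate-sas --account-name ecdcwls --expiry $end --name sendicott --permissions racwdli -o tsv --auth-mode login --as-user`
sasurl=https://ecdcwls.blob.core.windows.net/sendicott/?$sastoken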
Then copy all files from the local analyses directory to the container url created above, and list the file names in the container to confirm the copy worked.
az storage copy -d $sasurl -s analyses --recursive
az storage blob list -c sendicott --account-name ecdcwls --sas-token $sastoken --query "[].{name:name}"
Connect container to task
To access files in the container from the task we can modify the simple JSON above to add some additional options.
The “outputFiles” option lets you link the container as the place where output files will be stored. Use “path” to set the location within the container where you want the file to be saved. Under “filePattern”, select the file to save, either with a path to a specific file or a pattern that matches multiple files. Note that the file upload only occurs when the task completes, so if the task ends without completing no files will be saved.
Below I have defined two “outputFiles”: one to save the results created in derived-data and one to store the stdout.txt and stderr.txt files that show the console output and error output. The stdout files are located above the working directory of the task, which is why we need “../”.
Adding the input files is much simpler; in this case you just need to supply the name of the container to the autoStorageContainerName option. All the files in the container will be copied into the working directory of the task.
In addition, I have changed the commandLine to run the bash script stored in the scripts folder. The bash script is not necessary in this simple example but is useful if you want to run multiple scripts or do anything else from the commandLine before running a script.
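As a rough sketch, such a wrapper only needs to call the R script with Rscript; the actual run_script.sh in the repo may differ, and you can add more scripts or setup steps as needed. The full task JSON follows below.
#!/bin/sh
# run the analysis script; paths are relative to the task working directory
Rscript analyses/scripts/script_read_csv.R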
{
  "id": "test_outFile",
  "commandLine": "sh analyses/scripts/run_script.sh",
  "resourceFiles": [
    {
      "autoStorageContainerName": "sendicott"
    }
  ],
  "outputFiles": [
    {
      "destination": {
        "container": {
          "containerUrl": "https://ecdcwls.blob.core.windows.net/sendicott/?<sastoken>",
          "path": "logs"
        }
      },
      "filePattern": "../std*.txt",
      "uploadOptions": {
        "uploadCondition": "taskcompletion"
      }
    },
    {
      "destination": {
        "container": {
          "containerUrl": "https://ecdcwls.blob.core.windows.net/sendicott/?<sastoken>",
          "path": "analyses/data/derived-data/output_price2.csv"
        }
      },
      "filePattern": "analyses/data/derived-data/output_price2.csv",
      "uploadOptions": {
        "uploadCondition": "taskcompletion"
      }
    }
  ],
  "containerSettings": {
    "imageName": "rocker/r-ubuntu:jammy",
    "containerRunOptions": "--rm"
  },
  "userIdentity": {
    "autoUser": {
      "scope": "task",
      "elevationLevel": "admin"
    }
  }
}
Next we need to replace the <sastoken> placeholder in the template with the value of the sastoken variable we created above and save the resulting file. We do this to avoid saving the sastoken in the template itself so it won’t accidentally get shared with others. We can delete the generated file once we launch the task since it is easy to recreate when needed.
sed 's/<sastoken>/'${sastoken//&/\\&}'/g' cloud_config/template_task_with_file_in_out.json > cloud_config/task_to_use.json
Follow the same process as above to create a pool, job and task, this time with our new JSON file. Then delete the modified JSON file that contains the sastoken.
az batch pool create --json-file cloud_config/template_pool_cli.json
az batch job create --pool-id test_pool_json_cli --id "test_job"
az batch task create --json-file cloud_config/task_to_use.json --job-id test_job
rm cloud_config/task_to_use.json
Check the task status and confirm that files have been uploaded to the container.
az batch task show --job-id test_job --task-id test_outFile --query "{state: state, executionInfo: executionInfo}" --output jsonc
az storage blob list -c sendicott --account-name ecdcwls --sas-token $sastoken --query "[].{name:name}"
Download the files that you want to save locally.
az storage copy -s https://ecdcwls.blob.core.windows.net/sendicott/analyses/data/derived-data?$sastoken -d analyses/data --recursive
Delete the analyses folder from your container to remove all the files. This doesn’t have to happen right away, but I try not to leave things here since we have other long-term storage solutions and I don’t want to clutter the container.
az storage remove -c sendicott --account-name ecdcwls --sas-token $sastoken -n analyses --recursive
To delete all files in the container, don’t supply a name argument.
az storage remove -c sendicott --account-name ecdcwls --sas-token $sastoken --recursive
Delete the job and pool so that we are no longer charged. Note that after you do this, the local copy on your machine is the only copy of the files from the analysis.
az batch job delete --job-id test_job
az batch pool delete --pool-id test_pool_json_cli
az batch job list --query "[].{id:id, state:state}"
az batch pool list --query "[].{id:id, state:state}"
Install AzCopy if needed
AzCopy is another command line tool that is used by the az storage commands above. It should be installed automatically the first time you run one of them, but if not, here are the instructions for downloading it.
- Download azcopy: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
- Unzip into an easy-to-access directory, e.g. C:\ or C:\Users\username. You will need to navigate the command line to this directory to run commands unless you add azcopy to your PATH.
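- To check that the standalone azcopy install worked, you can print its version from the directory where you unzipped it (or from anywhere if it is on your PATH) by running
azcopy --version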