Setting up cloud processing with Azure Batch

Background

Azure Batch is one of many Azure services that can be used to run analyses in the cloud. It uses a hierarchy of Pools which contain Jobs which contain Tasks. A Pool is a group of Nodes (similar to virtual machines) which are the computers that will run the analyses. When a pool is created you choose the number of nodes and configure the operating system, RAM and number of cores that each node will have. Jobs are used for scheduling different related tasks, in simple cases they are just a folder that contains your tasks. Tasks contain the code that you want the node to run and information on where to get data from and save it to.

There are several options available for interacting with Azure Batch including the Azure web portal and the Azure command line interface (CLI). This tutorial will focus on the Azure CLI method since it is the simplest and fastest method. In both cases you need to be on the ECCC network to access the cloud, either in the office or on VPN. Azure CLI can be used on any terminal in Windows but the code below is written assuming you are using a Bash Terminal and will not run exactly as written in other terminals.

Get Access

To be added to the LERS Azure Batch account email Sarah Endicott, she will email the CCOE (Cloud Centre of Expertise) to ask them to give your account the required permissions. I recommend allowing a week for this process.

Install Azure CLI

Azure CLI

Install using install wizard https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows?tabs=azure-cli#install-or-update If you don’t have elevated privileges you will need to call service desk to have them complete the installation for you.

Use Azure CLI to run an analysis

Run az login in the terminal. This will open a web browser for you to login.
Run az batch account login -g EcDc-WLS-rg -n ecdcwlsbatch to connect to the “EcDc-WLS-rg” resource group and “ecdcwlsbatch” service name.
Run az batch pool list to show any existing pools. Note: if you get an error at this point saying “This request is not authorized to perform this operation.” you are likely not signed on to the VPN, sign on and try again.

Create a pool

You can create a pool from the CLI either by providing the configuration as individual arguments or by supplying a JSON file. I use the JSON file option because then it is easy to modify previous versions and not all functionality is available as an argument.

A template JSON file is provided in this repository under cloud_config/template_pool_cli.json. Some parts of the JSON file should stay the same in most cases while others will need to be adjusted based on the type of pool you want to build.

{
    "type": "Microsoft.Batch/batchAccounts/pools",
    "apiVersion": "2016-12-01",
    "id": "test_pool_json_cli",
    "vmSize": "standard_DS2_v2",
    "virtualMachineConfiguration": {
        "imageReference": {
            "publisher": null,
            "offer": null,
            "sku": null,
            "version": null,
            "virtualMachineImageId": "/subscriptions/b215566c-fe84-4c8f-a24c-99de0b444b13/resourceGroups/EcPc-SharedImageGallery-rg/providers/Microsoft.Compute/galleries/EcPcSharedImageGallery/images/Ubuntu2404WithDocker/versions/1.0.0",
            "exactVersion": null
        },
        "nodeAgentSKUId": "batch.node.ubuntu 24.04",
        "licenseType": null,
        "containerConfiguration": {
            "type": "dockerCompatible",
            "containerImageNames": [
                "rocker/r-bspm:jammy"
            ]
        },
        "nodePlacementConfiguration": {
            "policy": "regional"
        }
    },
        "targetDedicatedNodes": 1,
    "networkConfiguration": {
        "subnetId":"/subscriptions/8bc16bb2-5633-480b-8f4c-3ba6a6c769b4/resourceGroups/EcDc-WLS-rg/providers/Microsoft.Network/virtualNetworks/EcDc-WLS-vnet/subnets/EcDc-WLS_compute-cluster-snet"
        }
}

Fields to edit:

id: Give the pool a name. It should start with the first letter of your first name and your last name so we can differentiate them to track billing eg “sendicott_test_pool”
vmSize: This is the family of VM and the number of CPUs, different families will have different # CPU to RAM ratios. We have access to a subset of VM families with a quota for each one. See this Microsoft website for some details on the different families. The easiest way to filter the options and choose a size is to go to the Azure portal > Batch > Pools page and click “Add” and then scroll to vmSize and follow the link for “View full pricing details”. In this example we use “Standard_DS2_v2”. All the names start with “Standard_” and then you substitute the sku which contains the Family, # CPUs and version.
targetDedicatedNodes: The number of nodes to have in the pool. Typically 1 if you are going to run one task that might use multiple cores on one machine for parallelization or one for each task you want to run on a separate machine.
containerImageNames: The name of the docker image that you want to use. This will pre-fetch this docker image for use with your tasks. You can enter the name of any DockerHub image. For R users see the rocker project for some good options. The one used in the example file is “rocker/r-bspm:jammy” which is based on ubuntu jammy and has R installed and set up to use the bspm package so that R packages can be installed on linux from binary using install.packages as usual.

Optional customization fields:

imageReference: I have provided two pool template JSON files, one for windows and one for linux so you can just use one of those unless you have another reason to customize. This is basically the operating system that you want to use. You can choose from windows or linux options, but if you want to use docker you need to choose one that is docker compatible. To do this you can run az batch pool supported-images list --filter "osType eq 'linux'"|tee test_linux.txt this lists the supported images for the linux osType, change to ‘windows’ to see windows options, and outputs the result to a text file. I then searched the text file for “dockerCompatible”. Azure is no longer making dockerCompatible images so current image used in the template was created by CCOE and is provided in a Shared Image Gallery. It is just Ubuntu with Docker installed.

Once you are happy with the JSON file you can create your pool by running az batch pool create --json-file cloud_config/template_pool_cli.json

Run az batch pool list to see the status of the pool.

Create a job

az batch job create --pool-id test_pool_json_cli --id "test_job"

Create a task

Similar to creating a pool we create a task by referring to a JSON configuration file and the job id where the task should be assigned.

The JSON file will depend on the task to be performed. “template_task_hello.json” reflects the simplest case where there are no input or output files and it runs an R function directly from the command line.

{
  "id": "sampleTask",
  "commandLine": "R -e 'mean(mtcars$mpg)'",
    "containerSettings": {
    "imageName": "rocker/r-bspm:jammy",
    "containerRunOptions": "--rm"
    },
  "userIdentity": {
      "autoUser": {
          "scope": "task",
          "elevationLevel": "nonadmin"
      }
  }
}

JSON file fields:

id: name of the task
commandLine: command line code to run. In this case because R is already installed on the docker container this just works.
imageName: The DockerHub image to use
containerRunOptions: options that would normally be supplied with the docker run command
userIdentity: The identity of the user running the task, scope can be pool or task and elevationLevel can be admin or nonadmin.

The task will start to execute as soon as we create the task.

az batch task create --json-file cloud_config/template_task_hello.json --job-id test_job

We can check the status of the task and filter the output with a query to only get the information we need.

az batch task show --job-id test_job --task-id sampleTask

az batch task show --job-id test_job --task-id sampleTask --query "{state: state, executionInfo: executionInfo}" --output jsonc

The “state” should be active, running, or completed (active is the state when the task is starting up) and executionInfo will tell you the result (success or failure) and start and end time.

To find out more about what is going on in you task you can download the stdout or stderr files. stdout.txt will show the console output on the machine and stderr.txt will show any error messages.

az batch task file download --task-id sampleTask --job-id test_job --file-path "stdout.txt" --destination "./run_out.txt"

Clean up

Always remember to delete your pools and jobs when they are complete. We will be charged for as long as a pool is up regardless of whether anything is actually running on it. Tasks are deleted when the pool they are part of is deleted but jobs need to be deleted separately.

To delete the pool and job we created and confirm they were deleted run the following:

az batch job delete --job-id test_job
az batch pool delete --pool-id test_pool_json_cli

az batch job list --query "[].{id:id, state:state}"
az batch pool list --query "[].{id:id, state:state}"

You may need to wait awhile for the pool to finish being deleted.

Access data in your task

One thing to keep in mind is that your task has access to the internet so any method that you could normally use to access data and/or code over the internet should work in a script as long as you can set it up to work non-interactively. You might consider storing data on OSF or google drive or GitHub so that anyone can run your script and have the data load automatically. Another option that I will demonstrate here is to load your data onto an Azure Storage Container and connect it to your task. This uses the azcopy tool that we installed at the beginning in the background.

To run this example clone this GitHub repo by running usethis::create_from_github("LandSciTech/cloudDemo") in R. Then you will have the analyses folder which contains two folders:

scripts:
- script_read_csv.R A super simple R script that reads a csv subsets the table and saves a new csv
- run_script.sh A bash script that runs the R script
data:
- raw-data:
  - fruits.csv data used by the script
- derived-data:
  - just has a README for now but is where results are stored.

Upload files to container

If you don’t have an existing container under your name in the Azure storage account create one using your first initial and last name.

az storage container create --name jdoe --account-name ecdcwls --auth-mode login

Next, get a Shared Access Token for the container you want to access by running the below replacing “sendicott” with the name of your own storage container. This will save it as a variable sastoken, the token will expire after one day, in most real use cases I would change the expiry to as long as you expect the analysis to run. The sasurl is the full url for the container with the SAS token included and can be used as a single argument in some cases.

end=`date -u -d "1 day" '+%Y-%m-%dT%H:%MZ'`

sastoken=`az storage container generate-sas --account-name ecdcwls --expiry $end --name sendicott --permissions racwdli -o tsv --auth-mode login --as-user`

sasurl=https://ecdcwls.blob.core.windows.net/sendicott/?$sastoken

Then copy all files from a local directory to the container url created above and list the filenames in the container.

az storage copy -d $sasurl -s analyses --recursive

az storage blob list -c sendicott --account-name ecdcwls --sas-token $sastoken --query "[].{name:name}"

Connect container to task

To access files in the container from the task we can modify the simple JSON above to add some additional options.

The “outputFiles” option lets you link the container as the place where output files will be stored. Use “path” to set the location within the container where you want the file to be saved. Under filePattern select the file to save, either with a path to a specific file or a pattern to match multiple files. Note that the file upload only occurs when the task completes so if the task ends without completing no files will be saved.

Below I have defined two “outputFiles” one to save the results created in derived-data and one to store the stdout.txt and stderr.txt files that show the console output and error output. The stdout files are located above the working directory of the task which is why we need “../”

Adding the input files is much simpler, in this case you just need to supply the name of the container to the autoStorageContainerName option. All the files in the container will be copied into the working directory of the task.

In addition, I have changed the commandLine to run the bash script stored in the scripts folder. The bash script is not necessary in this simple example but is useful if you want to run multiple scripts or do anything else from the commandLine before running a script.

{
  "id": "test_outFile",
  "commandLine": "sh analyses/scripts/run_script.sh",
  "resourceFiles": [{
        "autoStorageContainerName": "sendicott"
    }
    ],
  "outputFiles": [
    {
    "destination": {
            "container": {
                "containerUrl": "https://ecdcwls.blob.core.windows.net/sendicott/?<sastoken>",
                "path": "logs"
            }
        },
        "filePattern": "../std*.txt",
        "uploadOptions": {
            "uploadCondition": "taskcompletion"
        }
    },
    {
        "destination": {
            "container": {
                "containerUrl": "https://ecdcwls.blob.core.windows.net/sendicott/?<sastoken>",
                "path": "analyses/data/derived-data/output_price2.csv"
            }
        },
        "filePattern": "analyses/data/derived-data/output_price2.csv",
        "uploadOptions": {
            "uploadCondition": "taskcompletion"
        }
    }
    ],
    "containerSettings": {
        "imageName": "rocker/r-ubuntu:jammy",
        "containerRunOptions": "--rm"
    },
    "userIdentity": {
        "autoUser": {
            "scope": "task",
            "elevationLevel": "admin"
        }
    }
}

Next we need to replace the <sastoken> placeholder in the template with the value of the sastoken variable we created above and save the resulting file. We do this to avoid saving the sastoken in the template script so it won’t accidentally get shared with others. We can delete the script with the file once we launch the task since it is easy to recreate it when needed.

sed 's/<sastoken>/'${sastoken//&/\\&}'/g' cloud_config/template_task_with_file_in_out.json > cloud_config/task_to_use.json

Follow the same process as above to create a pool, job and task, this time with our new JSON file. Then delete the modified JSON file that contains the sastoken

az batch pool create --json-file cloud_config/template_pool_cli.json
az batch job create --pool-id test_pool_json_cli --id "test_job"
az batch task create --json-file cloud_config/task_to_use.json --job-id test_job

rm cloud_config/task_to_use.json

Check the task status and confirm that files have been uploaded to the container

az batch task show --job-id test_job --task-id test_outFile --query "{state: state, executionInfo: executionInfo}" --output jsonc

az storage blob list -c sendicott --account-name ecdcwls --sas-token $sastoken --query "[].{name:name}"

Download the files that you want to save locally.

az storage copy -s https://ecdcwls.blob.core.windows.net/sendicott/analyses/data/derived-data?$sastoken -d analyses/data --recursive

Delete the analyses folder from your container to remove all the files. This doesn’t have to happen right away but I try not to leave things here since we have other long term storage solutions and I don’t want to clutter the container.

az storage remove -c sendicott --account-name ecdcwls --sas-token $sastoken -n analyses --recursive

To delete all files in the container don’t supply a name argument

az storage remove -c sendicott --account-name ecdcwls --sas-token $sastoken  --recursive

Delete the job and pool so that we are no longer charged. Note after you do this the local copy on your machine is the only copy of the files from the analysis.

az batch job delete --job-id test_job
az batch pool delete --pool-id test_pool_json_cli

az batch job list --query "[].{id:id, state:state}"
az batch pool list --query "[].{id:id, state:state}"

Run tasks in parallel

To truly take advantage of the cloud we may want to run multiple tasks at once across multiple cores on one node or multiple nodes in a pool. See vignette("Parallelization", package = "cloudDemo") for examples of how to do this and use autoscaling.

Install AzCopy if needed

AzCopy is another command line tool that is used by the az storage commands above. It should automatically be installed the first time you run one of them but if not here are the instructions for downloading it.

Download azcopy https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
Unzip into an easy to access directory ie C:\ or C:\Users\username you will need to navigate the command line to this directory to run commands unless you add azcopy to your PATH.