AML Command Transfer
AML Command Transfer (ACT) is a tool to easily execute any command on the Azure Machine Learning (AML) service. This blog post presents the underlying design principles of the tool.
How AML Works
First of all, let’s review the basic idea of how AML works.
Assume that the user would like to have 2 nodes in AML and execute `python train.py` on each node once. The file `train.py` is quite simple, as follows:

```python
# train.py
print('hello')
```
The user will send the following information to AML:
- The file `train.py`.
- The docker image, e.g. a docker image on Docker Hub.
- The requested resources, i.e. two nodes.
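For concreteness, here is a minimal sketch of how such a submission could look with the v1 `azureml-core` Python SDK; the workspace configuration, environment name, image name, and cluster name are placeholders, and the exact SDK calls may differ by version.

```python
# A hedged sketch of submitting train.py to 2 nodes, 1 process per node.
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()                       # reads the local workspace config

env = Environment('my-env')                        # hypothetical environment name
env.docker.base_image = 'myrepo/myimage:latest'    # the docker image, e.g. from Docker Hub
env.python.user_managed_dependencies = True        # use the interpreter baked into the image

mpi = MpiConfiguration(process_count_per_node=1, node_count=2)   # two nodes, one process each

src = ScriptRunConfig(
    source_directory='.',                          # folder that contains train.py
    script='train.py',
    compute_target='my-cluster',                   # hypothetical AML compute cluster name
    environment=env,
    distributed_job_config=mpi,
)
Experiment(ws, 'hello-aml').submit(src)
```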
After receiving the user request, AML will do the following:
- Find 2 available nodes.
- Pull the docker image to the nodes.
- Execute `mpirun` to start the job. The basic command line will be `mpirun --hostfile /job/hostfile -npernode 1 python train.py`.
  - The value of `--hostfile` is a text file that contains the IPs of the two nodes, one IP per line. In certain cases, the file path is `/job/hostfile` or `~/mpi-hosts`. The file is prepared by AML and cannot be changed.
  - The value of `-npernode` tells how many processes will be launched on each node. Two commonly used values are `1` and the number of GPUs per node. This value can be specified by the user.
  - Lastly, the command `python train.py` is fed to `mpirun`, so that it knows what command to run.
How to Access the Data in AML
Normally, we need data to execute the job. One way AML supports this is through blobfuse. The story is as follows.
- The user uploads the data to Azure Storage. Any uploaded file can be accessed through a URL with an appropriate authentication header. Let's say that the storage account is `account`, the storage container is `container`, and the file path is `data/a.txt`. The storage container is a concept of Azure Storage: all data within one container shares the same access level, e.g. public. Then, the URL will be `https://account.blob.core.windows.net/container/data/a.txt`.
- During job submission, an ID string can be assigned to the file path through AML's Python SDK, and it is used as a placeholder for the script argument (a sketch with the Python SDK is given after this list). For example, suppose the file path is `blob_data/a.txt` under some storage account and the Python SDK generates a special string to refer to this file, say `ID_of_a.txt`. Then, the submitted command line is `python train.py --data_path ID_of_a.txt`.
- AML uses blobfuse to mount the cloud storage as a local folder.
- AML replaces all ID parameters with the mounted file path. In this case, the command line becomes `python train.py --data_path /mount_path/blob_data/a.txt`.
- AML launches `mpirun` as usual.
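As a rough illustration, the ID placeholder described above could be produced with the v1 `azureml-core` SDK roughly as follows; the datastore name and paths are placeholders, and the exact API may differ by SDK version.

```python
# A hedged sketch: turn a blob path into a mountable placeholder argument.
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
ds = Datastore.get(ws, 'my_blob_datastore')        # hypothetical datastore backed by account/container

# The returned object plays the role of the "ID string"; when passed as a
# script argument, AML substitutes the blobfuse-mounted local path at run time.
data = Dataset.File.from_files(path=(ds, 'blob_data/a.txt')).as_mount()

arguments = ['--data_path', data]                  # ends up as a /mount_path-style local path
```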
Design Principle of ACT
Motivation
Now we have a basic idea of how AML works. With the Python SDK, we can submit any script job, but we need a dedicated submission script to specify the Azure Storage information, the script parameters, etc.

To reduce this effort, we'd like to have a tool that behaves like a bridge connecting the user script and AML. Let's say the tool is named `a`.
- If we'd like to run `train.py` without any parameter, the submission syntax is `a submit python train.py`. We expect the user to specify `python` explicitly rather than assuming the interpreter is always `python`, so that we can also support non-python scripts. In other words, we can always test the script locally by `python train.py`; if we'd like to execute it in AML, we just need to add the prefix `a submit`.
- If we'd like to run `train.py` with some parameter, e.g. `--data imagenet`, the submission syntax is `a submit python train.py --data imagenet`.
- If we'd like to run `nvidia-smi` in AML, the submission syntax is `a submit nvidia-smi`.

That is, we expect the tool `a` to handle everything related to AML such that the parameters after `submit` can be seamlessly transferred to AML, which is quite similar to a remote procedure call. This is what ACT does!
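As a hypothetical sketch (the sub-command and option names here are illustrative, not necessarily the actual interface), the client can forward everything after `submit` untouched:

```python
# A hedged sketch of the "bridge" behavior: capture the remainder of the
# command line verbatim and hand it over to AML.
import argparse

parser = argparse.ArgumentParser(prog='a')
parser.add_argument('action', choices=['init', 'submit'])
parser.add_argument('-c', '--cluster', help='cluster whose YAML config to use')
parser.add_argument('cmd', nargs=argparse.REMAINDER,
                    help='the exact command to run in AML')
args = parser.parse_args()

if args.action == 'submit':
    # args.cmd is exactly what the user would type locally, e.g.
    # ['python', 'train.py', '--data', 'imagenet'] or ['nvidia-smi'];
    # it is transferred to AML untouched, RPC-style.
    print('submitting to AML:', ' '.join(args.cmd))
```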
Design of the Workflow
The design is based on the client-server model. That is, we have a client script for job submission and a server script that executes any command the client requests.
The client

- `a init` uploads all the user source codes to the azure blob. The user source codes are, e.g., the training codes, which implement the model structure, the optimization logic, etc. `zip` is used to compress the current source folder into a temporary `.zip` file, which is then uploaded to the Azure blob through `azcopy`; a sketch of this step is given after this list. To customize the zipping process, we can populate the parameter `zip_options` in the configuration to, e.g., ignore some folders or files.
- `a submit command` submits the `command` to AML. In addition to the `command`, the client script also submits the following information:
  - The blob path of the zipped source code, so that the server knows where to access the source code.
  - The Azure Blob information, so that AML can mount the azure blob through `blobfuse`.
  - The data link information, including the local folder name and the corresponding blob path, so that the server side can create a symbolic link to access the data. As long as the user's source codes always use relative file paths, e.g. `data/a.txt`, and the client maps `data` to the appropriate azure blob path, no change to the data path in the code is required. That is, if the code is tested well on the local machine, it should also work in AML.
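The following is a simplified sketch of what `a init` could do, assuming `azcopy` is installed and that a SAS URL of the target container is available in an environment variable; the variable name and the ignore list are placeholders.

```python
# A hedged sketch: zip the current source folder and upload it with azcopy.
import os
import subprocess
import tempfile
import zipfile

def zip_and_upload(source_dir='.', ignore=('.git', '__pycache__')):
    # compress the current source folder into a temporary .zip file
    fd, zip_path = tempfile.mkstemp(suffix='.zip')
    os.close(fd)
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(source_dir):
            dirs[:] = [d for d in dirs if d not in ignore]   # zip_options-style excludes
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, source_dir))

    # upload the archive to the Azure blob through azcopy
    dest = os.environ['ACT_BLOB_SAS_URL']                    # placeholder for the real blob path
    subprocess.run(['azcopy', 'copy', zip_path, dest], check=True)
    return zip_path
```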
The server side
- Unzip the source code.
  - The destination folder is under `/tmp`, since this folder always exists and is writable. Another option is to use the home folder. However, sometimes the home folder is a network share, which could be slow if we need to compile the code; and if the home folder is shared among different jobs, synchronization could be very difficult.
  - As multiple processes could be launched in AML, we need to avoid a race condition so that only one process is unzipping the source code. Here, we use an exclusive file open as a lock to implement it (see the sketch after this list). Another way might be to depend on the `barrier` of `mpi` or `torch.distributed`, both of which would introduce more dependencies. In practice, the file-based exclusive lock works well.
- Run `compile.aml.sh` if the file exists after the source folder is unzipped. This allows the user to compile the source code with any kind of command.
- `pip install` the packages in `requirements.txt` if the file exists. This is specifically for python packages.
- Launch the command under the source folder. This also means that the client is suggested to stick to the root source folder as the current folder when testing the code locally.
The configuration file is a YAML file that contains all associated parameters, including the blob path for the source code, the data link information between the local folder and the blob folder, the environment we need to have in AML, etc. One YAML file corresponds to one cluster. We can use `-c cluster_name` to specify the cluster, whose YAML configuration file is located at `aux_data/aml/cluster_name.yaml`. At this moment, we hard-code the path; in the future, we may provide a way to customize the path.
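A sketch of how the per-cluster configuration could be resolved and loaded; the field names mentioned in the comment are illustrative, not the actual schema.

```python
# A hedged sketch: resolve and load the per-cluster YAML config (PyYAML assumed).
import os
import yaml

def load_cluster_config(cluster_name):
    # hard-coded path convention: aux_data/aml/<cluster_name>.yaml
    path = os.path.join('aux_data', 'aml', '{}.yaml'.format(cluster_name))
    with open(path) as f:
        # typical contents (illustrative): blob path of the source code,
        # data link mapping, docker image / environment, zip_options, ...
        return yaml.safe_load(f)

config = load_cluster_config('cluster_name')
```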
Data Management
During job submission, we specify the data link information. That is, the tool `a` knows which azure path is mapped to which local folder. Thus, we have the following utilities to manage the data in the azure blob:
- Remove the Blob data corresponding to `data/a.txt` by `a rm data/a.txt`.
- List all files in the Blob folder corresponding to `data/data_set` by `a ls data/data_set`.
- Upload the local data `data/c.txt` to the Blob by `a u data/c.txt`.
- Copy the data `data/b.txt` from the blob used by clusterA to the blob used by clusterB by `a -c clusterB -f clusterA u data/b.txt`.
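As an illustration, these commands can translate a local relative path into its blob counterpart using the data link information; the mapping shown here is hypothetical.

```python
# A hedged sketch: map a local relative path to the linked blob path.
def to_blob_path(local_path, data_links):
    # data_links maps a local folder name to its azure blob folder,
    # e.g. {'data': 'blob_data'} so that data/a.txt -> blob_data/a.txt
    for local_folder, blob_folder in data_links.items():
        if local_path == local_folder or local_path.startswith(local_folder + '/'):
            return blob_folder + local_path[len(local_folder):]
    raise ValueError('no data link covers {}'.format(local_path))

# e.g. `a rm data/a.txt` would remove blob_data/a.txt in the cluster's blob
print(to_blob_path('data/a.txt', {'data': 'blob_data'}))
```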
Conclusion
The tool is small, handy, and intuitive. You are welcome to give it a try and open an issue if you find any problem.