Cheatsheet for Setting up Deep Learning Cloud VMs

Published: July 16, 2020

VM Setup Notes

My general notes / cheatsheet on how to set up deep learning environments on cloud infrastructure.

AWS

Useful resource: Train Deep Learning Models on GPUs using Amazon EC2 Spot Instances

Set up - AWS

Create a volume.
Create an instance:
- Choose p3.2xlarge for a V100 GPU.
- NB: Request a Spot Instance.
- Set the Subnet to the same Availability Zone as the volume (a volume can only be attached to an instance in the same zone).
- In Configure Security Group, choose "Select an existing security group" and add the jupyter one (need to remember how I set this up initially).
Launch the instance, click the Connect button, then copy the command shown and run it.

Connect Volume to VM

Check which drives are available, then mount the device to a folder.

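# List block devices to find the attached volume (xvdf in this case)
lsblk
# Create the mount point and mount the volume on it
sudo mkdir -p /dltraining
sudo mount /dev/xvdf /dltraining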

Create Snapshots

Use the AWS console to create snapshots. A snapshot lets you move the volume between subnets in case availability is limited or the cost becomes too high.

GCP

Instructions to follow when I use it again. Currently, the availability of GCP preemptible VMs is unreliable: V100 VMs are usually kept alive for only a few hours at a time (1-6 hours max) before being shut down due to demand.

Set up - GCP

export IMAGE_FAMILY="pytorch-latest-gpu"
export INSTANCE_NAME="pytorch-instance"
export ZONE="europe-west4-b"
export INSTANCE_TYPE="n1-standard-8"

gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator="type=nvidia-tesla-t4,count=1" \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=500GB \
        --metadata="install-nvidia-driver=True" \
        --preemptible

Other

Clone repo:

git clone --single-branch --branch <branchname> https://<username>:<access-code>@github.com/<username>/<repo>.git

Start Jupyter

jupyter notebook --ip=0.0.0.0

Copy files

scp -i <pem-key> -r ubuntu@<vm-instance>:<filedir/filename> <destdir>

NVIDIA Apex

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

Tmux session start

tmux
source activate pytorch_p36
python -m pip install -r requirements.txt
...

Exporting Tokens

export NEPTUNE_API_TOKEN="<access_token>"
export SLACK_URL="<slack_url>"
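
Once exported, the tokens can be read from a script or notebook started in the same shell. A minimal sketch, assuming the variable names above; `read_token` is a hypothetical helper written for illustration, not part of any Neptune or Slack tooling:

```python
import os


def read_token(name, environ=None):
    """Return the value of an environment variable, or fail with a clear message.

    `environ` defaults to os.environ; passing a dict is only useful for testing.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Unset or empty: fail early rather than part-way through a training run.
        raise RuntimeError(f"{name} is not set; run `export {name}=...` first")
    return value


if __name__ == "__main__":
    for var in ("NEPTUNE_API_TOKEN", "SLACK_URL"):
        try:
            read_token(var)
            print(f"{var} is set")
        except RuntimeError as err:
            print(err)
```

Failing early on a missing token is a design choice: it is cheaper to find out before a long job starts on a spot instance than after.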

Autoreload notebook

%load_ext autoreload
%autoreload 2

Profile Python code

import cProfile
cProfile.run('ds.__getitem__(15)')
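
The one-liner above prints every recorded call in the default order. If the output gets long, the standard-library `pstats` module can sort it and keep only the top entries. A minimal sketch, where `ToyDataset` is a hypothetical stand-in for the real `ds` object and `profile_getitem` is a helper name made up for this example:

```python
import cProfile
import io
import pstats


class ToyDataset:
    """Hypothetical stand-in for the `ds` dataset object profiled above."""

    def __init__(self, size):
        self.items = list(range(size))

    def __getitem__(self, idx):
        # Simulate some per-item preprocessing work.
        return sum(i * i for i in range(self.items[idx] * 1000))


def profile_getitem(dataset, idx, top=5):
    """Profile a single dataset[idx] call; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = dataset[idx]
    profiler.disable()

    # Write the report to a string, sorted by cumulative time, top N rows only.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    ds = ToyDataset(100)
    _, report = profile_getitem(ds, 15)
    print(report)
```

Sorting by cumulative time tends to surface the slow step of a data-loading pipeline first, since it includes time spent in everything a function calls.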

Linux Commands

Count number of files in folder

ls -1 | wc -l
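
Note that `ls -1 | wc -l` counts subdirectories as well as files and skips hidden entries. From a notebook, roughly the same count can be obtained in Python. A minimal sketch; `count_entries` and its `files_only` flag are hypothetical names chosen for this example:

```python
from pathlib import Path


def count_entries(folder, files_only=False):
    """Count non-hidden entries in `folder`, similar to `ls -1 | wc -l`.

    With files_only=True, subdirectories are left out of the count.
    """
    entries = [p for p in Path(folder).iterdir() if not p.name.startswith(".")]
    if files_only:
        entries = [p for p in entries if p.is_file()]
    return len(entries)


if __name__ == "__main__":
    print(count_entries("."))
    print(count_entries(".", files_only=True))
```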

Sort output of files in folder by size

# Smallest first
du -sh -- * | sort -h
# Largest first
du -sh -- * | sort -rh
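
A rough Python equivalent, for use inside a notebook. This is a sketch under one stated assumption: it sums apparent file sizes in bytes, whereas `du` reports disk usage, so the numbers can differ slightly. `entry_size` and `sizes_sorted` are hypothetical helper names:

```python
import os
from pathlib import Path


def entry_size(path):
    """Apparent size in bytes: the file size, or the sum of file sizes under a directory."""
    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # Skip symlinks so linked files are not counted.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def sizes_sorted(folder, reverse=False):
    """Return (size, name) pairs for non-hidden entries in `folder`, sorted by size."""
    entries = [p for p in Path(folder).iterdir() if not p.name.startswith(".")]
    sized = [(entry_size(p), p.name) for p in entries]
    return sorted(sized, reverse=reverse)


if __name__ == "__main__":
    for size, name in sizes_sorted(".", reverse=True):
        print(f"{size:>12}  {name}")
```

Passing `reverse=True` mirrors `sort -rh` and lists the largest entries first.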

WIP…