Cheatsheet for Setting up Deep Learning Cloud VMs
General tips on setting up a deep learning VM
VM Setup Notes
My general notes / cheatsheet on how to set up deep learning environments on cloud infrastructure.
AWS
Useful Resource: train-deep-learning-models-on-gpus-using-amazon-ec2-spot-instance
Set up - AWS
- Create a volume
- Create an instance
- Choose p3.2xlarge for V100 GPU.
- NB: Request Spot Instance
- Set Subnet to the same location (availability zone) as the volume.
- In Configure Security Group, choose "Select an existing security group" and add the jupyter one (need to remember how I set this up initially).
- Launch the instance, click the Connect button, then copy the displayed command and run it.
Connect Volume to VM
List the available drives with lsblk, then mount the volume's device (here /dev/xvdf) to a folder.
lsblk                              # list block devices and their mount points
sudo mkdir /dltraining             # create the mount point
sudo mount /dev/xvdf /dltraining   # mount the attached volume
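To avoid guessing which device is the attached volume, the lsblk check can be scripted. The following is a minimal sketch, not a tested tool: it assumes `lsblk --json` is available on the VM, and the helper name `unmounted_devices` and the sample data are hypothetical, made up for illustration.

```python
import json
from typing import List


def unmounted_devices(lsblk_json: str) -> List[str]:
    """Return /dev paths of disks with no mount point on the disk or its partitions."""

    def mounts(node: dict) -> list:
        # Older util-linux emits "mountpoint"; newer versions emit a "mountpoints" list.
        points = node.get("mountpoints") or [node.get("mountpoint")]
        return [p for p in points if p]

    def is_free(node: dict) -> bool:
        if mounts(node):
            return False
        return all(is_free(child) for child in node.get("children", []))

    data = json.loads(lsblk_json)
    return [
        "/dev/" + dev["name"]
        for dev in data["blockdevices"]
        if dev.get("type") == "disk" and is_free(dev)
    ]


# Hypothetical sample shaped like `lsblk --json` output: a root disk plus an attached volume.
SAMPLE = json.dumps({
    "blockdevices": [
        {"name": "xvda", "type": "disk", "mountpoint": None,
         "children": [{"name": "xvda1", "type": "part", "mountpoint": "/"}]},
        {"name": "xvdf", "type": "disk", "mountpoint": None},
    ]
})

print(unmounted_devices(SAMPLE))  # ['/dev/xvdf']
```

On the VM itself, the input could come from `subprocess.run(["lsblk", "--json"], capture_output=True, text=True).stdout` instead of the sample.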
Create Snapshots
Use the console interface to create snapshots. A snapshot lets you move a volume between subnets in case availability is limited or the cost becomes too high.
GCP
Instructions to follow when I use it again. Currently, GCP preemptible VM availability is unreliable: V100 VMs stay alive for only a few hours at a time (1-6 hours max) and are then usually shut down due to demand.
Set up - GCP
export IMAGE_FAMILY=pytorch-latest-gpu
export INSTANCE_NAME="pytorch-instance"
export ZONE="europe-west4-b"
export INSTANCE_TYPE="n1-standard-8"
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-t4,count=1" \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=500GB \
--metadata="install-nvidia-driver=True" \
--preemptible
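Because preemptible capacity is unreliable, one option is to wrap the command above in a small script that tries several zones in turn. This is only a sketch under assumptions: the function names are hypothetical, the zone list passed in is up to you (check that the chosen accelerator is offered there), and the flags are copied from the command above, not verified against current gcloud behaviour.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def create_args(name: str, zone: str) -> List[str]:
    """Build the same gcloud command as above for a given instance name and zone."""
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        "--image-family=pytorch-latest-gpu",
        "--image-project=deeplearning-platform-release",
        "--maintenance-policy=TERMINATE",
        "--accelerator=type=nvidia-tesla-t4,count=1",
        "--machine-type=n1-standard-8",
        "--boot-disk-size=500GB",
        "--metadata=install-nvidia-driver=True",
        "--preemptible",
    ]


def run_gcloud(args: Sequence[str]) -> int:
    """Run the command and return its exit code (0 means success)."""
    return subprocess.run(list(args)).returncode


def create_in_first_available_zone(
    name: str,
    zones: Sequence[str],
    runner: Callable[[Sequence[str]], int] = run_gcloud,
) -> Optional[str]:
    """Try each zone in order; return the first zone where creation succeeded, else None."""
    for zone in zones:
        if runner(create_args(name, zone)) == 0:
            return zone
    return None


# Example call (zone list is an assumption, not from the original notes):
# create_in_first_available_zone("pytorch-instance", ["europe-west4-b", "europe-west4-c"])
```

The `runner` parameter exists only so the loop can be exercised without calling gcloud; in normal use the default is enough.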
Other
Clone repo:
git clone --single-branch --branch <branchname> https://<username>:<access-code>@github.com/<username>/<repo>.git
Start Jupyter
jupyter notebook --ip=0.0.0.0
Copy files
scp -i <pem-key> -r ubuntu@<vm-instance>:<filedir/filename> <destdir>
NVIDIA Apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..
Tmux session start
tmux
source activate pytorch_p36
python -m pip install -r requirements.txt
...
Exporting Tokens
export NEPTUNE_API_TOKEN="<access_token>"
export SLACK_URL="<slack_url>"
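On the Python side, it can help to fail early if one of these exports was forgotten, instead of discovering it partway through a training run. A minimal sketch, assuming the variable names above; `require_env` is a hypothetical helper, not part of Neptune or Slack tooling.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting training")
    return value


# Typical use at the top of a training script:
# neptune_token = require_env("NEPTUNE_API_TOKEN")
# slack_url = require_env("SLACK_URL")
```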
Autoreload notebook
%load_ext autoreload
%autoreload 2
Profile Python code
import cProfile
cProfile.run('ds.__getitem__(15)')
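The one-liner above prints every profiled function. To see only the slowest calls, the standard-library pstats module can sort and truncate the report. A small sketch: `profile_call` is a hypothetical helper and `ToyDataset` is a made-up stand-in for the real `ds`.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Profile one call; return its result and a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)  # keep only the `top` entries
    return result, stream.getvalue()


class ToyDataset:
    """Stand-in dataset with some artificial work per item."""

    def __getitem__(self, index):
        return sum(i * i for i in range(index * 1000))


ds = ToyDataset()
item, report = profile_call(ds.__getitem__, 15)
print(report)
```

Sorting by cumulative time tends to surface the slow step inside `__getitem__` (for example, image decoding or augmentation) near the top of the report.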
WIP…