Grid.ai can seamlessly train hundreds of machine learning models on the cloud from your laptop, with zero code changes. In this example, we will run a model on a laptop, then run the unmodified model on the cloud. On the cloud, we will run a hyperparameter sweep of 8 experiments in parallel. Run in parallel, the sweep can complete up to 8x faster than running the experiments one after another, and using spot instances reduces the cost of the run by roughly 70%.
We will use the familiar MNIST dataset. Grid.ai comes from the creators of PyTorch Lightning, but it is agnostic to machine learning frameworks and third-party tools, so its benefits are available with other frameworks and tools as well. To demonstrate this point, we will NOT use PyTorch Lightning’s early stopping. Instead, we will use Optuna for early stopping (pruning). We will track progress by viewing PyTorch Lightning’s TensorBoard logs in Grid.ai’s TensorBoard interface.
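To make the idea of pruning concrete, here is a simplified median rule in plain Python. It is an illustration only and not Optuna's implementation: the function name `should_prune`, the `min_trials` parameter, and the accuracy values are all hypothetical.

```python
from statistics import median


def should_prune(current_value, earlier_values_at_step, min_trials=1):
    """Return True if a trial looks unpromising at this step.

    Simplified median rule for a metric that is being maximized
    (e.g. validation accuracy): prune when the current value is below
    the median of what earlier trials reported at the same step.
    """
    if len(earlier_values_at_step) < min_trials:
        # Not enough history to judge; let the trial continue.
        return False
    return current_value < median(earlier_values_at_step)


# Hypothetical validation accuracies reported by earlier trials at one epoch.
history_at_epoch_3 = [0.71, 0.78, 0.80]

print(should_prune(0.65, history_at_epoch_3))  # below the median of 0.78
print(should_prune(0.82, history_at_epoch_3))  # above the median of 0.78
```

Stopping unpromising trials early is what lets a run with pruning enabled finish sooner than the same run without it.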
Grid.ai will launch experiments in parallel using the Grid Search strategy. The Grid.ai hyperparameter sweep controls batchsize, epochs, and pruning (whether Optuna pruning is active or not). Optuna will control the number of layers, the hidden units in each layer, and the dropout within each experiment. The following combinations will result in 8 parallel experiments: pruning in [0, 1], batchsize in [32, 128], and epochs in [5, 10] (2 x 2 x 2 = 8).
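The 8 experiments are the Cartesian product of the three swept values. As a rough sketch of what a grid search expands to (this is not Grid.ai's own code, and the experiment indices printed here are arbitrary rather than Grid's experiment numbering):

```python
from itertools import product

# Sweep values taken from the grid run command in this example.
sweep = {
    "pruning": [0, 1],
    "batchsize": [32, 128],
    "epochs": [5, 10],
}

# Cartesian product of all values: one dict of parameters per experiment.
experiments = [dict(zip(sweep, values)) for values in product(*sweep.values())]

for i, params in enumerate(experiments):
    flags = " ".join(f"--{name} {value}" for name, value in params.items())
    print(f"experiment {i}: {flags}")

print(len(experiments))  # 2 * 2 * 2 = 8
```

Adding one more value to any of the three lists would grow the sweep from 8 to 12 experiments, which is worth keeping in mind when estimating cost.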
A single Grid.ai CLI command initiates the experiment.
grid run --use_spot pytorch_lightning_simple.py --datadir grid:fashionmnist:1 --pruning="[0,1]" --batchsize="[32,128]" --epochs="[5,10]"
These instructions assume access to a laptop with bash and conda. Those with a restricted local environment can instead use Jupyter in a Grid.ai Session and click on Terminal.
# initialize conda for bash, then exit the shell and open a new one
conda init bash
# create conda env
conda create --name gridai python=3.8
conda activate gridai
# install packages
pip install lightning-grid
pip install optuna
pip install pytorch_lightning
pip install torchvision
# login to grid
grid login --username <username> --key <grid api key>
# retrieve the model
git clone https://github.com/robert-s-lee/grid-optuna
cd grid-optuna
mkdir data
# Run without Optuna pruning (takes a while)
python pytorch_lightning_simple.py --datadir ./data
# Run with Optuna pruning (takes a while)
python pytorch_lightning_simple.py --datadir ./data --pruning 1
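The same flags used locally are the ones the sweep varies on the cloud, so the script only needs an ordinary command-line interface. Below is a minimal sketch of such an interface using `argparse`. The flag names mirror the commands in this example, but the types, defaults, and the `build_parser` helper are assumptions for illustration; the actual script in the repository may differ.

```python
import argparse


def build_parser():
    # Flag names mirror the commands in this example; types and
    # defaults are assumptions, not copied from the repository.
    parser = argparse.ArgumentParser(description="Sketch of the training script's CLI")
    parser.add_argument("--datadir", type=str, default="./data")
    parser.add_argument("--pruning", type=int, default=0, choices=[0, 1])
    parser.add_argument("--batchsize", type=int, default=128)
    parser.add_argument("--epochs", type=int, default=10)
    return parser


# Parse the flags for one sweep combination, as a single experiment would receive them.
args = build_parser().parse_args(
    ["--datadir", "/datastores/fashionmnist", "--pruning", "1",
     "--batchsize", "32", "--epochs", "10"]
)
print(args)
```

Because each experiment receives plain scalar flags, the script does not need to know whether it was launched by hand or as part of a sweep.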
Set up a Grid.ai Datastore so that the MNIST data is not downloaded on each run. Note the version number that is created; typically this will be 1.
grid datastore create --source data --name fashionmnist
grid datastore list # wait until the Status comes back with `Succeeded`
watch -n 10 grid datastore list # refresh
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Credential Id ┃ Name              ┃ Version ┃ Size     ┃ Created          ┃ Status    ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ cc-qdfdk      │ fashionmnist      │ 1       │ 141.6 MB │ 2021-06-16 15:13 │ Succeeded │
└───────────────┴───────────────────┴─────────┴──────────┴──────────────────┴───────────┘
# with the Datastore created above
grid run --use_spot pytorch_lightning_simple.py --datadir grid:fashionmnist:1 --pruning="[0,1]" --batchsize="[32,128]" --epochs="[5,10]"
# without a Datastore
grid run --use_spot pytorch_lightning_simple.py --pruning="[0,1]" --batchsize="[32,128]" --epochs="[5,10]"
The commands above print output like the following (abbreviated):
Run submitted!
`grid status` to list all runs
`grid status smart-dragon-43` to see all experiments for this run
grid status smart-dragon-43 shows the experiments running in parallel:
% grid status smart-dragon-43
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃ Experiment           ┃ Command                     ┃ Status  ┃ Duration    ┃ datadir                  ┃ pruning ┃ batchsize ┃ epochs ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│ smart-dragon-43-exp7 │ pytorch_lightning_simple.py │ running │ 0d-00:07:24 │ /datastores/fashionmnist │ 1       │ 32        │ 10     │
│ smart-dragon-43-exp6 │ pytorch_lightning_simple.py │ running │ 0d-00:07:27 │ /datastores/fashionmnist │ 1       │ 32        │ 5      │
│ smart-dragon-43-exp5 │ pytorch_lightning_simple.py │ running │ 0d-00:07:14 │ /datastores/fashionmnist │ 1       │ 128       │ 5      │
│ smart-dragon-43-exp4 │ pytorch_lightning_simple.py │ pending │ 0d-00:12:52 │ /datastores/fashionmnist │ 0       │ 128       │ 5      │
│ smart-dragon-43-exp3 │ pytorch_lightning_simple.py │ running │ 0d-00:07:13 │ /datastores/fashionmnist │ 0       │ 32        │ 10     │
│ smart-dragon-43-exp2 │ pytorch_lightning_simple.py │ running │ 0d-00:07:03 │ /datastores/fashionmnist │ 0       │ 128       │ 10     │
│ smart-dragon-43-exp1 │ pytorch_lightning_simple.py │ running │ 0d-00:07:02 │ /datastores/fashionmnist │ 1       │ 128       │ 10     │
│ smart-dragon-43-exp0 │ pytorch_lightning_simple.py │ pending │ 0d-00:12:52 │ /datastores/fashionmnist │ 0       │ 32        │ 5      │
└──────────────────────┴─────────────────────────────┴─────────┴─────────────┴──────────────────────────┴─────────┴───────────┴────────┘
grid logs smart-dragon-43-exp0 shows the logs from that experiment:
grid logs smart-dragon-43
grid run --use_spot pytorch_lightning_simple.py
grid run --use_spot pytorch_lightning_simple.py --datadir grid:fashionmnist:1
Example of on-demand pricing (top at $0.09) and spot pricing (bottom at $0.03)
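As a quick check of the savings implied by these two example prices, the arithmetic can be worked out as follows. The variable names are illustrative, and the figures apply to this one example run only; actual spot pricing varies.

```python
on_demand_cost = 0.09  # USD, on-demand price from the example above
spot_cost = 0.03       # USD, spot price from the example above

# Fraction of the on-demand cost saved by using a spot instance.
savings = (on_demand_cost - spot_cost) / on_demand_cost
print(f"spot savings: {savings:.0%}")  # prints "spot savings: 67%"
```

For this run the saving is about two-thirds, which is in the same range as the roughly 70% reduction quoted in the introduction.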
Example Metric from Grid.ai WebUI
Example Metric from Tensorboard