The code here is forked from kuangliu and adapted for submitting the model to IBM Watson Machine Learning on Watson Studio for training. It is meant to get you started quickly. We hope you have some fun running your first models on IBM Cloud.
If executing locally, this Python script only needs pytorch and torchvision. Install them as described on the pytorch webpage. Then clone the repo and run the following to train a VGG19 locally:
python3 download_cifar.py
python3 main.py --lr 0.01 --epochs 100 --model vgg19
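To sanity-check the local installation first, you can run a minimal snippet like the following (output will vary with your installed versions and hardware):

```python
import torch
import torchvision

# Print installed versions and whether a GPU is visible.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```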
Below are the steps to set up your account to run your ML training runs in the cloud on different numbers and types of GPUs. After setup (steps 1-6), you will be able to train your machine learning models on Watson Studio in less than a minute! If you already have your environment set up, start at step 7.
If you are interested in Watson Studio in general, we recommend reading more about it here: https://www.ibm.com/cloud/watson-studio
Now to run the code on Watson, all you need is an IBM Cloud account and a cloud storage bucket to hold the input data (CIFAR10) and output data (models, performance stats).
Visit https://console.bluemix.net/ to sign up!
First, the bx CLI allows you to start and manage IBM Cloud instances from the command line.
OSX:
curl -fsSL https://clis.ng.bluemix.net/install/osx | sh
Linux:
curl -fsSL https://clis.ng.bluemix.net/install/linux | sh
Windows:
powershell.exe -command "iex (New-Object Net.WebClient).DownloadString('https://clis.ng.bluemix.net/install/powershell')"
bx login
or:
bx login -sso (if within IBM)
The awscli lets you set up your buckets and upload data to them.
pip install awscli
Lastly, the machine-learning plugin for bx lets you use the Watson ML part of the Bluemix CLI to start, view, and stop ML jobs on Watson.
bx plugin install machine-learning
A bucket is like a huge folder in the cloud into which you can put, and from which you can get, any file or folder (specifically your datasets) using an API-style interface.
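For example, once you have credentials (created below), the S3-compatible API can be driven from Python with boto3. This is just a sketch; the key and bucket values are placeholders you would fill in:

```python
import boto3

# Placeholders -- substitute the credentials and bucket name created in the steps below.
COS_ENDPOINT = "https://s3-api.us-geo.objectstorage.softlayer.net"
cos = boto3.client(
    "s3",
    endpoint_url=COS_ENDPOINT,
    aws_access_key_id="<access_key_id>",
    aws_secret_access_key="<secret_access_key>",
)

# Put a local file into the bucket, then get it back.
cos.upload_file("local_file.bin", "<bucket name>", "some/key/local_file.bin")
cos.download_file("<bucket name>", "some/key/local_file.bin", "restored.bin")
```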
First, we create your own personal cloud storage instance to hold your bucket and name the instance my_instance.
bx resource service-instance-create my_instance cloud-object-storage lite global
bx resource service-instance my_instance
We then create and retrieve your personal instance credentials, naming them my_cli_key, so that you can create and access your bucket.
Create key and print it:
bx resource service-key-create "my_cli_key" Writer --instance-name "my_instance" --parameters '{"HMAC":true}' > /dev/null 2>&1
access_key_id=`bx resource service-key my_cli_key | grep "access_key_id"| cut -d\: -f2`
secret_access_key=`bx resource service-key my_cli_key | grep "secret_access_key"| cut -d\: -f2`
echo ""; echo "Credentials:"; echo "access_key_id - $access_key_id"; echo "secret_access_key - $secret_access_key"; echo ""
Add access_key_id and secret_access_key to your awscli profile and name it my_profile (leave the other fields as None).
aws configure --profile my_profile
Copy these keys! You'll need them again a little later.
Now, we make a bucket and name it something unique! Buckets are named globally, which means that only one IBM Cloud account can have a bucket with a particular name. Please choose a name without any spaces or underscores.
bucket_name=<DEFINE_A_NAME>
and run:
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net s3api create-bucket --bucket $bucket_name --profile my_profile 2>&1
In order to run jobs on Watson, you will need to create an organization and space to hold your jobs. Organization names are also globally unique.
org_name=...
space_name=...
bx account org-create $org_name
bx target -o $org_name
bx account space-create $space_name
bx target -o $org_name -s $space_name
bx service create pm-20 lite CLI_WML_Instance
bx service key-create CLI_WML_Instance cli_key_CLI_WML_Instance
instance_id=`bx service key-show CLI_WML_Instance cli_key_CLI_WML_Instance | grep "instance_id"| awk -F": " '{print $2}'| cut -d'"' -f2`
username=`bx service key-show CLI_WML_Instance cli_key_CLI_WML_Instance | grep "username"| awk -F": " '{print $2}'| cut -d'"' -f2`
password=`bx service key-show CLI_WML_Instance cli_key_CLI_WML_Instance | grep "password"| awk -F": " '{print $2}'| cut -d'"' -f2`
url=`bx service key-show CLI_WML_Instance cli_key_CLI_WML_Instance | grep "url"| awk -F": " '{print $2}'| cut -d'"' -f2`
export ML_INSTANCE=$instance_id
export ML_USERNAME=$username
export ML_PASSWORD=$password
export ML_ENV=$url
This will upload our dataset to the bucket we created (run python3 download_cifar.py first, if you have not already, so that the local cifar10/ directory exists):
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp cifar10/ s3://$bucket_name/cifar10 --recursive
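If you want to double-check the upload from Python rather than the awscli, here is a short boto3 sketch using the my_profile credentials configured earlier (the bucket name is a placeholder):

```python
import boto3

# Reuse the awscli profile configured above.
session = boto3.session.Session(profile_name="my_profile")
cos = session.client(
    "s3", endpoint_url="https://s3-api.us-geo.objectstorage.softlayer.net"
)

# List what landed under cifar10/ (list_objects_v2 returns at most 1000 keys per call).
resp = cos.list_objects_v2(Bucket="<bucket name>", Prefix="cifar10/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```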
You have completed the one-time setup! Now, we can start sending jobs to Watson.
This YAML file holds all the information needed for our job, including which bucket, ML framework, and compute configuration to use. Start from the provided template:
cp pytorch-cifar-template.yml pytorch-cifar.yml
Add your author info and replace the values of aws_access_key_id, aws_secret_access_key, and bucket in pytorch-cifar.yml with your storage instance credentials and your chosen bucket name from before, for both the data and results references.
model_definition:
  framework:
    #framework name and version (supported list of frameworks available at 'bx ml list frameworks')
    name: pytorch
    version: 0.3
  #name of the training-run
  name: cifar10 in pytorch
  #Author name and email
  author:
    name: John Doe
    email: johndoe@ibm.com
  description: This is running cifar training on multiple models
  execution:
    #Command to execute -- see script parameters in later section !!
    command: python3 main.py --cifar_path ${DATA_DIR}/cifar10
      --checkpoint_path ${RESULT_DIR} --epochs 10
    compute_configuration:
      #Valid values for name - k80/k80x2/k80x4/p100/p100x2/v100/v100x2
      name: k80
training_data_reference:
  name: training_data_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  source:
    bucket: < bucket name >
  type: s3
training_results_reference:
  name: training_results_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  target:
    bucket: < bucket name >
  type: s3
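If you prefer to script the substitution instead of editing by hand, here is a sketch using PyYAML (assuming the key layout of the template matches the listing above; note that safe_dump does not preserve the template's comments):

```python
import yaml  # pip install pyyaml

with open("pytorch-cifar-template.yml") as f:
    manifest = yaml.safe_load(f)

# Fill in the same credentials for both the data and results references.
for ref in ("training_data_reference", "training_results_reference"):
    manifest[ref]["connection"]["aws_access_key_id"] = "<access_key_id>"
    manifest[ref]["connection"]["aws_secret_access_key"] = "<secret_access_key>"
manifest["training_data_reference"]["source"]["bucket"] = "<bucket name>"
manifest["training_results_reference"]["target"]["bucket"] = "<bucket name>"

with open("pytorch-cifar.yml", "w") as f:
    yaml.safe_dump(manifest, f, default_flow_style=False)
```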
Notice that under execution in the YAML file, we have already specified the command that will be executed when the job reaches the server:
python3 main.py --cifar_path ${DATA_DIR}/cifar10
--checkpoint_path ${RESULT_DIR} --epochs 10
This executes main.py, which starts a training run. Since no --model is given, it trains the default model, vgg16, for 10 epochs using the dataset in the bucket we created.
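On the Watson side, ${DATA_DIR} and ${RESULT_DIR} are environment variables the service sets to point at your data and results buckets, so anything the script writes under --checkpoint_path ends up back in your results bucket. A rough sketch of how such a checkpoint write might look inside main.py (the script's actual logic may differ):

```python
import os
import torch

def save_checkpoint(model, checkpoint_path, epoch):
    # Anything written under checkpoint_path (= ${RESULT_DIR} on Watson)
    # is copied back to the results bucket.
    os.makedirs(checkpoint_path, exist_ok=True)
    state = {"net": model.state_dict(), "epoch": epoch}
    torch.save(state, os.path.join(checkpoint_path, "ckpt.t7"))
```

Now zip up the training code (main.py plus the models directory) and submit the job: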
zip model.zip main.py models/*
bx ml train model.zip pytorch-cifar.yml
That's it! The command should generate a training ID for you, meaning our model has started training on Watson! You can list and monitor your runs with:
bx ml list training-runs
bx ml monitor training-runs < trainingID >
As training proceeds, you should also see results from the training process being copied to the results bucket specified in your training job YAML file (training_results_reference.target.bucket).
You can also inspect the status of training by downloading and viewing the training log file which has been copied to the result bucket. This is useful in debugging errors and failed jobs.
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp s3://$bucket_name/<trainingID>/learner-1/training-log.txt -
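The same log can also be fetched programmatically, e.g. with boto3 (bucket name and training ID are placeholders):

```python
import boto3

session = boto3.session.Session(profile_name="my_profile")
cos = session.client(
    "s3", endpoint_url="https://s3-api.us-geo.objectstorage.softlayer.net"
)

# <trainingID> is the ID reported by `bx ml train`.
obj = cos.get_object(
    Bucket="<bucket name>", Key="<trainingID>/learner-1/training-log.txt"
)
print(obj["Body"].read().decode())
```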
See https://dataplatform.ibm.com/docs/content/analyze-data/ml_dlaas_working_with_new_models.html for more details.
usage: main.py [-h] [-l] [--lr LR] [--resume] [--cifar_path CIFAR_PATH]
[--checkpoint_path CP] [--epochs EPOCHS]
[--model MODEL]
PyTorch CIFAR10 Training
optional arguments:
-h, --help show this help message and exit
-l lists all available models (default: False)
--lr LR learning rate (default: 0.1)
--resume, -r resume from checkpoint (default: False)
--cifar_path path to cifar tar.gz (default: ./cifar10)
--checkpoint_path CP checkpoint path (default: .)
--epochs EPOCHS no of epochs (default: 10)
--model MODEL Model type: vgg16,... (default: vgg16)
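For reference, an argparse setup roughly like the following would produce the help text above (a sketch; the actual parser lives in main.py):

```python
import argparse

parser = argparse.ArgumentParser(description="PyTorch CIFAR10 Training")
parser.add_argument("-l", action="store_true", help="lists all available models")
parser.add_argument("--lr", default=0.1, type=float, help="learning rate")
parser.add_argument("--resume", "-r", action="store_true", help="resume from checkpoint")
parser.add_argument("--cifar_path", default="./cifar10", help="path to cifar tar.gz")
parser.add_argument("--checkpoint_path", metavar="CP", default=".", help="checkpoint path")
parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
parser.add_argument("--model", default="vgg16", help="Model type: vgg16,...")
args = parser.parse_args()
```

For example: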
# on local machine:
python3 main.py --lr 0.01 --epochs 100 --model vgg19
# in Watson Machine Learning yaml file:
python3 main.py --cifar_path ${DATA_DIR}/cifar10
--checkpoint_path ${RESULT_DIR}
--epochs 100 --model vgg19 --lr 0.01
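Running python3 main.py -l lists all available models: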
DPN26 ( <function DPN26 at 0x1129dbc80> )
DPN92 ( <function DPN92 at 0x112a01048> )
DenseNet121 ( <function DenseNet121 at 0x1129e10d0> )
DenseNet161 ( <function DenseNet161 at 0x1129e16a8> )
DenseNet169 ( <function DenseNet169 at 0x1129e1598> )
DenseNet201 ( <function DenseNet201 at 0x1129e1620> )
GoogLeNet ( <class 'models.googlenet.GoogLeNet'> )
Inception ( <class 'models.googlenet.Inception'> )
LeNet ( <class 'models.lenet.LeNet'> )
MobileNet ( <class 'models.mobilenet.MobileNet'> )
MobileNetV2 ( <class 'models.mobilenetv2.MobileNetV2'> )
PNASNetA ( <function PNASNetA at 0x112a018c8> )
PNASNetB ( <function PNASNetB at 0x112a01ea0> )
PreActBottleneck ( <class 'models.preact_resnet.PreActBottleneck'> )
PreActResNet101 ( <function PreActResNet101 at 0x1129ed730> )
PreActResNet152 ( <function PreActResNet152 at 0x1129ed7b8> )
PreActResNet18 ( <function PreActResNet18 at 0x1129ed1e0> )
PreActResNet34 ( <function PreActResNet34 at 0x1129ed620> )
PreActResNet50 ( <function PreActResNet50 at 0x1129ed6a8> )
ResNeXt29_2x64d ( <function ResNeXt29_2x64d at 0x1129eea60> )
ResNeXt29_32x4d ( <function ResNeXt29_32x4d at 0x1129ed0d0> )
ResNeXt29_4x64d ( <function ResNeXt29_4x64d at 0x1129eef28> )
ResNeXt29_8x64d ( <function ResNeXt29_8x64d at 0x1129ed048> )
ResNet101 ( <function ResNet101 at 0x1129ee840> )
ResNet152 ( <function ResNet152 at 0x1129ee8c8> )
ResNet18 ( <function ResNet18 at 0x1129ee2f0> )
ResNet34 ( <function ResNet34 at 0x1129ee730> )
ResNet50 ( <function ResNet50 at 0x1129ee7b8> )
SENet18 ( <function SENet18 at 0x112a01378> )
SepConv ( <class 'models.pnasnet.SepConv'> )
ShuffleNet ( <class 'models.shufflenet.ShuffleNet'> )
ShuffleNetG2 ( <function ShuffleNetG2 at 0x1129e1c80> )
ShuffleNetG3 ( <function ShuffleNetG3 at 0x1129ee158> )
VGG11 ( config in <class 'models.vgg.VGG'> )
VGG13 ( config in <class 'models.vgg.VGG'> )
VGG16 ( config in <class 'models.vgg.VGG'> )
VGG19 ( config in <class 'models.vgg.VGG'> )
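For reference, test accuracies on CIFAR-10 reported in kuangliu's original repo for a selection of these models: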
Model | Acc. |
---|---|
VGG16 | 92.64% |
ResNet18 | 93.02% |
ResNet50 | 93.62% |
ResNet101 | 93.75% |
MobileNetV2 | 94.43% |
ResNeXt29(32x4d) | 94.73% |
ResNeXt29(2x64d) | 94.82% |
DenseNet121 | 95.04% |
PreActResNet18 | 95.11% |
DPN92 | 95.16% |
Credits: Hendrik Strobelt (IBM Research), Evan Phibbs (IBM Research), Victor C. Dibia (IBM Research)