[Qualcomm] How to know the compilation options used for RetinaNet? #13

Closed
arjunsuresh opened this issue Dec 19, 2023 · 8 comments

@arjunsuresh
Contributor

arjunsuresh commented Dec 19, 2023

When we try to reproduce the instructions given here, we get the error below:

krai@9eef7784f211:~$ export SUT=q2_pro_dc
krai@9eef7784f211:~$ axs byquery sut_name=${SUT},kilt_ready,device=qaic,model_name=retinanet,index_file=openimages_cal_images_list.txt,loadgen_scenario=Offline
WARNING:root:[base_qaic_config] parameters file /home/krai/work_collection/axs2qaic_mlperf_3.1/base_qaic_config/data_axs.json did not exist, initializing to empty parameters
WARNING:root:[work_collection] byquery(sut_name=q2_pro_dc,kilt_ready,device=qaic,model_name=retinanet,index_file=openimages_cal_images_list.txt,loadgen_scenario=Offline) did not find anything, but there are tags: {'kilt_ready'} , trying to find a producer...
WARNING:root:[work_collection] A total of 1 matched rules found.

WARNING:root:Matched Rule #1/1: ['kilt_ready', 'device=qaic', 'model_name=retinanet'] from Entry 'model_qaic_retinanet_recipe'...
WARNING:root:Pipeline: [['run']], Cumulative params: {'__query': 'sut_name=q2_pro_dc,kilt_ready,device=qaic,model_name=retinanet,index_file=openimages_cal_images_list.txt,loadgen_scenario=Offline', 'return_saved_record_entry': True, 'device': 'qaic', 'model_name': 'retinanet', 'sut_name': 'q2_pro_dc', 'index_file': 'openimages_cal_images_list.txt', 'loadgen_scenario': 'Offline', 'tags': ['kilt_ready']}
WARNING:root:[base_qaic_model] touch _BEFORE_CODE_LOADING=/home/krai/work_collection/axs2qaic_mlperf_3.1/qaic_tool_parser
WARNING:root:[work_collection] byquery(profile,sut_name=gen_qaic_profile,model_name=retinanet,device=qaic,index_file=openimages_cal_images_list.txt) did not find anything, but there are tags: {'profile'} , trying to find a producer...
WARNING:root:[work_collection] A total of 1 matched rules found.

WARNING:root:Matched Rule #1/1: ['profile', 'device=qaic', 'model_name=retinanet'] from Entry 'profile_qaic_retinanet_recipe'...
WARNING:root:Pipeline: [['run']], Cumulative params: {'__query': 'profile,sut_name=gen_qaic_profile,model_name=retinanet,device=qaic,index_file=openimages_cal_images_list.txt', 'return_saved_record_entry': True, 'device': 'qaic', 'model_name': 'retinanet', 'sut_name': 'gen_qaic_profile', 'index_file': 'openimages_cal_images_list.txt', 'tags': ['profile']}
WARNING:root:[base_qaic_profile] touch _BEFORE_CODE_LOADING=/home/krai/work_collection/axs2qaic_mlperf_3.1/qaic_tool_parser
WARNING:root:[work_collection] byquery(sut_config,sut=gen_qaic_profile,model=retinanet,loadgen_scenario=Offline,device_id=all) did not find anything, but there are tags: {'sut_config'} , trying to find a producer...
WARNING:root:[work_collection] A total of 0 matched rules found.

------------------------------------------------------------------------------------------------------------------------
While computing nested_calls in ['^^', 'execute', [[['get', 'sut_entry'], ['get', 'config_compiletime_profile']]]] the following exception was raised: In pipeline [
	['get', 'sut_entry']
	['get', 'config_compiletime_profile']
] step ['get', 'config_compiletime_profile'] cannot be executed on value (None) produced by ['get', 'sut_entry']

The config file referenced here doesn't look right (is it for BERT?).

We also tried the old compiler parameters given here. They give the output below, and no ELF binary is produced.

$ /opt/qti-aic/exec/qaic-exec -model=`pwd`/retinanet.onnx -load-profile=/home/arjun/profile.yaml -aic-binary-dir=`pwd`/elfs -enable-channelwise -profiling-threads=8 -onnx-define-symbol=batch_size,1 -node-precision-info=`pwd`/node-precision.yaml -aic-enable-depth-first -aic-num-cores=1 -mos=1 -ols=1 -batchsize=1 -quantization-schema=asymmetric -quantization-calibration=None  -execute-nodes-in-fp16=Sigmoid
-quantization-schema is going to be deprecated in a future release, use -quantization-schema-activations and -quantization-schema-constants instead.
Reading ONNX Model from /home/arjun/CM/repos/local/cache/25f4253bb0c7433a/retinanet.onnx
Compile started ............... 
Compiling model with Int8 precision using PGQ.


Dev00 NSP_00 t5: network doorbell wait timeout exceeded (1 time(s)): db@0x400 waiting for 108 last got 98
Dev00 NSP_00 t5: last completed op HVX db@0x400 98
Dev00 NSP_00 t5: last completed op HMX db@0x404 97
Dev00 NSP_00 t5: last completed op DMAIssue db@0x408 107
Dev00 NSP_00 t5: last completed op HVX0DMAComplete db@0x40c 0
Dev00 NSP_00 t5: last completed op HVX1DMAComplete db@0x410 0
Dev00 NSP_00 t5: last completed op HVX2DMAComplete db@0x414 0
Dev00 NSP_00 t5: last completed op HVX3DMAComplete db@0x418 0
Dev00 NSP_00 t5: last completed op HMXDMAComplete db@0x41c 95
Dev00 NSP_00 t5: last completed op DMAComplete db@0x420 107
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x0 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x4 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x8 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0xc 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x10 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x14 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x18 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x1c 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x20 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x24 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x28 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x2c 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x30 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x34 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x38 0
Dev00 NSP_00 t5: last completed op inputsReadyForReadDB db@0x3c 0
Dev00 NSP_00 t5: last completed op outputsReadyForWriteDB db@0x40 1
Dev00 NSP_00 t5: last completed op outputsReadyForWriteDB db@0x44 1
Dev00 NSP_00 t5: last completed op outputsReadyForWriteDB db@0x48 1
@psyhtest

@arjunsuresh Thank you for reporting the issues. I believe you are reproducing this on AWS DL2q instances with Qualcomm Cloud AI 100 (QAIC100) Standard accelerator cards?

To build the image krai/axs.qaic:deb_1.9.1.25, you'd need access to the QAIC100 Apps/Platform SDK v1.9.1.25. Hopefully, you can build krai/axs.qaic:deb_1.10.0.x.

The config file you referred to is for a machine with 2x Pro (16NSP core) cards, i.e. incompatible with the DL2q instance. Having said that, its contents are indeed incorrect. We'll look into updating the config file for HPE's DL385 Q8 Std machine, which is the closest to AWS DL2q instances, after the holidays.

@arjunsuresh
Contributor Author

Thank you @psyhtest for your reply. The docker error was on our side: we got past it and could build the docker image. But the model compilation failed, and the error has been added to the issue. We are trying to compile the model on an x86 machine, not on AWS, to get the ELF binary to run on a Thundercomm RB6. I think the config files are wrong even for ResNet50, but there we could manage with the old ones from ck-qaic.

@psyhtest

@arjunsuresh Which SDK do you have for RB6?

We'll take a look at the assembled SUT configs. We normally generate them on the fly when running experiments. For the krai/axs2config repository, however, we regenerated them after collecting all the results, expecting them to be exactly the same. It sounds like that's not the case, unfortunately.

@arjunsuresh
Contributor Author

Thank you @psyhtest. I tested with 1.10.0.193. Would you recommend trying 1.9.0? For compilation, I tried adjusting the -submit-timeout option, but that didn't help with the timeout error. Profile generation takes about a day, which is what makes changing the SDK a challenge.

@arjunsuresh
Contributor Author

arjunsuresh commented Dec 22, 2023

Tried with 1.9.1.25 SDK.

/opt/qti-aic/exec/qaic-exec -model=/home/arjun/CM/repos/local/cache/853549596cd84450/retinanet.onnx -load-profile=/home/arjun/CM/repos/local/cache/c3b78c3b70554bf3/profile.yaml -aic-binary-dir=/home/arjun/CM/repos/local/cache/b972b152d55a4739/elfs -enable-channelwise -onnx-define-symbol=batch_size,1 -node-precision-info=/home/arjun/CM/repos/local/cache/853549596cd84450/node-precision-info.yaml -quantization-schema=asymmetric -quantization-calibration=None  -execute-nodes-in-fp16=Sigmoid -aic-enable-depth-first -aic-num-cores=1 -mos=1 -ols=1
-quantization-schema is going to be deprecated in a future release, use -quantization-schema-activations and -quantization-schema-constants instead.
Reading ONNX Model from /home/arjun/CM/repos/local/cache/853549596cd84450/retinanet.onnx
Compile started ............... 
Compiling model with Int8 precision using PGQ.
Iter[0/0]: model execution took 2295014 ms

No timeout error now. But the ELF binary folder is missing: it exists during compilation but vanishes when compilation ends.
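
A quick way to confirm this symptom after each run is to test whether the output directory still exists and holds anything. The sketch below is a hypothetical helper (`elfs_present` is an illustrative name, not part of the QAIC SDK); it makes no assumption about which files the compiler writes and only checks that the directory is non-empty.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if DIR exists and contains at least one entry.
# It does not know or check which files qaic-exec is expected to produce.
elfs_present() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir"
        return 1
    fi
    # ls -A lists all entries except "." and "..", so empty output means an empty directory.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "directory is empty: $dir"
        return 1
    fi
    echo "found output in: $dir"
}
```

It could be called right after the compile command with whatever path was passed as -aic-binary-dir, for example `elfs_present "$(pwd)/elfs"`.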

@psyhtest

@arjunsuresh We only used SDK v1.9.1.25 in the previous round, so I'm not sure how v1.10.x.y would fare.

Here's the Offline compilation command currently used by the official axs automation workflow:

/opt/qti-aic/exec/qaic-exec -m=/home/krai/work_collection/downloaded_retinanet.onnx/retinanet.onnx \
-aic-binary-dir=/home/krai/work_collection/model_qaic_retinanet_Offline/./elfs \
-aic-num-cores=1 -ols=1 -mos=1 -batchsize=1 -aic-enable-depth-first \
-aic-hw -aic-hw-version=2.0 -compile-only -onnx-define-symbol=batch_size,1 \
-quantization-schema-constants=symmetric_with_uint8 \
-quantization-schema-activations=asymmetric \
-quantization-calibration=None -enable-channelwise \
-node-precision-info=/home/krai/work_collection/axs2qaic-dev_main/node_precision_info_retinanet/node-precision.yaml \
-load-profile=/home/krai/work_collection/profile_qaic_retinanet_openimages_cal_images_list.txt_bs.1/profile.yaml

It should be the same for v3.1.

@psyhtest

Also, please note that, starting from SDK v1.11.u.v (and probably v1.10.x.y), you need to define some constant indices differently:

#ifdef SDK_1_11_X
  const int CLASSES_INDEX = 5;
  const int BOXES_INDEX = 10;
  const int TOPK_INDEX = 0;
#else
  const int CLASSES_INDEX = 0;
  const int BOXES_INDEX = 5;
  const int TOPK_INDEX = 10;
#endif

@arjunsuresh
Contributor Author

Thanks a lot @psyhtest, especially for the NMS changes. My bad: -aic-hw is the flag I had missed, and instead of compiling to an executable I believe qaic-exec was doing a simulation run.

Feedback to the qaic team: qaic-exec could print a verbose message saying what it is doing (simulation or compilation). Currently the output is the same in both cases:

Reading ONNX Model from /home/arjun/CM/repos/local/cache/baeff5e7bd24491c/retinanet.onnx
Compile started ............... 
Compiling model with Int8 precision using PGQ.
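
Since the tool's output does not reveal the mode, the missing flag can be caught before launching a long run. The sketch below is a hypothetical helper (`lint_qaic_args` is an illustrative name, not part of the QAIC SDK) that only inspects an argument list and never invokes qaic-exec. It assumes, based on this thread alone, that -aic-hw and -compile-only are needed to obtain ELF binaries and that `-quantization-schema=` is the deprecated spelling reported in the warning.

```shell
#!/bin/sh
# Hypothetical helper: lint a qaic-exec argument list for the flags discussed
# in this thread. The flag names come from the commands quoted above; the
# rules themselves are assumptions, not documented SDK behaviour.
lint_qaic_args() {
    has_hw=no
    has_compile_only=no
    has_deprecated_schema=no
    for arg in "$@"; do
        case "$arg" in
            -aic-hw) has_hw=yes ;;                  # exact match, so -aic-hw-version=... does not count
            -compile-only) has_compile_only=yes ;;
            -quantization-schema=*) has_deprecated_schema=yes ;;  # not -quantization-schema-constants=...
        esac
    done
    status=0
    if [ "$has_hw" = no ]; then
        echo "missing -aic-hw"
        status=1
    fi
    if [ "$has_compile_only" = no ]; then
        echo "missing -compile-only"
        status=1
    fi
    if [ "$has_deprecated_schema" = yes ]; then
        echo "deprecated -quantization-schema"      # warning only, does not change the status
    fi
    return "$status"
}
```

For example, `lint_qaic_args -aic-num-cores=1 -mos=1 -ols=1 -quantization-schema=asymmetric` prints all three messages and returns 1, while the Offline command quoted earlier in the thread passes silently.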
