
[WIP][DRAFT][onert] Optimized BatchMatMul in CPU backend #13919

Draft · wants to merge 1 commit into base: master
Conversation


@jiwaszki jiwaszki commented Sep 3, 2024

This commit introduces an improved BMM kernel for CPU.

ONE-DCO-1.0-Signed-off-by: Jan Iwaszkiewicz <j.iwaszkiewi@samsung.com>
@jiwaszki added the labels PR/NO TEST (Tell CI to not run test), DRAFT (A draft issue or PR for sharing one's current working status and discussion.), and type/discussion (We need discussion. Discussion itself can help. Even without conclusions!) on Sep 3, 2024

jiwaszki commented Sep 3, 2024

For: #12140

This is a very rough draft that just injects into the current implementation.

@ragmani as the creator of the issue, I want to ask for directions.
The current proposal is to reuse the already existing GEMM implementation, adjusted to work with BMM (i.e. run it along the batch dimension). Is that sufficient for the first level of optimization, or should I switch to a more custom implementation?
I also need to take a closer look at transposeRowsCols in BMM and its effect on the data layout for the kernel.

I see that optimized::GEMM is used together with CKER_X86_PLATFORM. Is BMM also required to have two branches for X86+GEMM and others+GEMV (i.e. reuse something like MatrixBatchVectorMultiplyAccumulate)?

About shapes -- I have only tested on perfectly square matrices; I need to investigate more, including possible limitations of cker's GEMM.

As a note, here are simple average times (benchmark over 1000 runs):

| Current | Proposal |
| --- | --- |
| ~0.02820s | ~0.00769s |

PS @zetwhite I saw your comment #12140 (comment) but it's a few months old. I am sorry that I took this task over without notice. Are you okay with me continuing it?


zetwhite commented Sep 3, 2024

> PS @zetwhite I saw your comment #12140 (comment) but it's a few months old. I am sorry that I took this task over without notice. Are you okay with me continuing it?

No problem with me :) plz self assign for #12140


ragmani commented Sep 4, 2024

@jiwaszki

> The current proposal is to reuse the already existing GEMM implementation, adjusted to work with BMM (i.e. run it along the batch dimension). Is that sufficient for the first level of optimization, or should I switch to a more custom implementation?

I think it's sufficient for the first level of optimization. But it's more important to optimize for arm than for x64.

> I also need to take a closer look at transposeRowsCols in BMM and its effect on the data layout for the kernel.

AFAIK, circle and tflite models have the attributes adjoint_lhs and adjoint_rhs for the BMM op. BMM's layouts of lhs and rhs differ depending on those attributes. We may need to implement optimized kernels that run according to those attributes.

Here is an example case. [two images attached, not reproduced]

> I see that optimized::GEMM is used together with CKER_X86_PLATFORM. Is BMM also required to have two branches for X86+GEMM and others+GEMV (i.e. reuse something like MatrixBatchVectorMultiplyAccumulate)?

Yes, we also need optimized kernels running on the arm arch. I think GEMV is sufficient for the first level of optimization on arm.
