Performance of BandedBlockBandedMatrix Vector Multiplication #121

Open
Luapulu opened this issue Sep 21, 2021 · 2 comments
Luapulu commented Sep 21, 2021

I have a big BandedBlockBandedMatrix and need to multiply it with a vector. However, it seems to be rather slow and requires a lot of allocation. Here's an MWE:

using BlockArrays: BlockRange
using ApproxFunOrthogonalPolynomials
using LinearAlgebra
using SparseArrays

using BenchmarkTools

# Making one of my Matrices
CC = Chebyshev(-1..1)         ⊗ Chebyshev(-1..1)
UC = Ultraspherical(1, -1..1) ⊗ Chebyshev(-1..1)
CU = Chebyshev(-1..1)         ⊗ Ultraspherical(1, -1..1)
UU = Ultraspherical(1, -1..1) ⊗ Ultraspherical(1, -1..1)

degree = 500

# Finite sections of the operators come out as BandedBlockBandedMatrix
Du = Derivative(CC, [1,0])[BlockRange(1:degree), BlockRange(1:degree+1)]
UCtoC2 = Conversion(UC, UU)[BlockRange(1:degree), BlockRange(1:degree)]
DutoC2 = UCtoC2 * Du

# Benchmarks

@benchmark mul!(out, $DutoC2, v) setup=begin
    out = Vector{Float64}(undef, size(DutoC2, 1))
    v = rand(size(DutoC2, 2))
end
# 3.2 ms with 257.62 KiB allocated

@benchmark mul!(out, $(sparse(DutoC2)), v) setup=begin
    out = Vector{Float64}(undef, size(DutoC2, 1))
    v = rand(size(DutoC2, 2))
end
# For comparison: 1.5 ms with 0 bytes allocated

I wonder why the BandedBlockBandedMatrix multiplication allocates so much, and why it's slower than using a sparse matrix. Any ideas?

@dlfivefifty (Member)

Some things:

  1. OpenBLAS has very slow banded linear algebra; MKL is about 5x faster, I think. Though the cost here is not dominated by the actual banded multiply.
  2. The allocations and computational cost come from the view(A, Block(K,J)) calls; see the sketch after this list. These should in theory be non-allocating, but in practice can be too complicated for the compiler to realise.
  3. For very small bandwidths, sparse is extremely fast for matrix multiplication. I don't think we'd be able to beat it.
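
For point 2, here is a minimal sketch of what a block-banded matrix-vector multiply looks like, just to show where the view(A, Block(K,J)) calls sit. This is not the library's actual implementation, and blockmul! is a made-up name:

using BlockArrays, BlockBandedMatrices, LinearAlgebra

# Hypothetical sketch of a block-banded mat-vec. Each view(A, Block(K, J))
# below is where per-block allocations can creep in if the view's type is
# too complicated for the compiler to elide.
function blockmul!(out::AbstractVector, A::BandedBlockBandedMatrix, v::AbstractVector)
    fill!(out, 0)
    l, u = blockbandwidths(A)   # lower/upper block bandwidths
    N, M = blocksize(A)         # number of block rows/columns
    for J in 1:M
        vJ = view(v, axes(A, 2)[Block(J)])
        for K in max(1, J - u):min(N, J + l)   # only block rows inside the block band
            outK = view(out, axes(A, 1)[Block(K)])
            mul!(outK, view(A, Block(K, J)), vJ, true, true)  # outK += A[Block(K,J)] * vJ
        end
    end
    out
end

For point 1, loading MKL.jl (using MKL) before benchmarking swaps the BLAS backend, though as noted the banded kernel itself isn't the dominant cost here.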

@Luapulu (Author) commented Sep 21, 2021

Interesting, thanks! I'll stick with SparseMatrixCSC then. So, should ApproxFun have some heuristic to pick sparse matrices instead of BandedBlockBandedMatrix if the bands are thin enough? Something like the sketch below, perhaps.
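
A hypothetical sketch of such a heuristic; maybe_sparse and the maxband cutoff are made up for illustration, not an ApproxFun API:

using BlockBandedMatrices, SparseArrays

# Hypothetical heuristic: fall back to a sparse matrix when the bandwidths
# within each block are small. maybe_sparse and maxband are illustrative
# names, not part of any package.
function maybe_sparse(A::BandedBlockBandedMatrix; maxband = 4)
    λ, μ = subblockbandwidths(A)   # bandwidths within each sub-block
    λ + μ + 1 <= maxband ? sparse(A) : A
end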
