Conversation

@AZEY4 (Collaborator) commented Jun 21, 2024

This is a simple parallelization of the LU decomposition using the DataFlowTasks library. To see the DAG and the performance plot, run lu_test_3.
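For context, a minimal sketch of how DataFlowTasks infers dependencies (the toy matrix and labels here are illustrative; @dspawn, @R/@RW, and the label keyword are the same DataFlowTasks primitives used in this PR):

using DataFlowTasks, LinearAlgebra

n = 200
A = rand(n, n) + 2n * I   # diagonally dominant, so pivot-free LU is safe
b = rand(n)

# Each task annotates how it accesses its arguments; DataFlowTasks infers
# that t2 must run after t1 from the conflicting accesses to A.
t1 = @dspawn lu!(@RW(A), NoPivot()) label = "lu"
t2 = @dspawn ldiv!(UnitLowerTriangular(@R(A)), @RW(b)) label = "ldiv"
fetch(t2)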

@AZEY4 added the `run benchmark` label (triggers the `BenchmarkCI.jl` action) Jun 21, 2024
@AZEY4 requested a review from maltezfaria June 21, 2024 16:20
@AZEY4 self-assigned this Jun 21, 2024
codecov bot commented Jun 21, 2024

Codecov Report

❌ Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.47%. Comparing base (fbebdb2) to head (4bfb148).
⚠️ Report is 32 commits behind head on main.

Files with missing lines    Patch %    Lines
src/hmatrix.jl              96.42%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #65      +/-   ##
==========================================
+ Coverage   75.64%   76.47%   +0.83%     
==========================================
  Files          14       14              
  Lines        1466     1522      +56     
==========================================
+ Hits         1109     1164      +55     
- Misses        357      358       +1     


@AZEY4 marked this pull request as draft June 23, 2024 11:59
@maltezfaria (Member) left a comment

Looks good as a first implementation, but I think we need:

  1. A task graph for the LU
  2. Some performance comparisons

I suspect we need to insert tasks at a finer granularity to expose more parallelism in the task graph.

Comment on lines +502 to +514
function DataFlowTasks.memory_overlap(A::HMatrix, B::HMatrix)
    # TODO: compare leaves in a more efficient way.
    if A === B
        return true
    elseif issubmatrix(A, B) || issubmatrix(B, A)
        return true
    end
    # Fall back to a pairwise comparison of the leaves' data blocks.
    chdA = leaves(A)
    chdB = leaves(B)
    for i in eachindex(chdA)
        for j in eachindex(chdB)
            if data(chdA[i]) === data(chdB[j])
                return true
            end
        end
    end
    return false
end
@maltezfaria (Member)

I think for most use cases, we can simply:

  1. Check if A and B have a common root
  2. Check the intersection of the row and column ranges of A and B

When they don't have a common root, we can (probably?) assume they don't overlap?
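Something like this sketch of the simplified check (rowrange/colrange are the usual HMatrices accessors; rootnode is a hypothetical helper, and this is not the PR's code):

# Sketch of the simplified overlap test suggested above; `rootnode` is a
# hypothetical helper returning the root of the tree a block belongs to.
function overlap_sketch(A::HMatrix, B::HMatrix)
    # Distinct trees: assume their storage is disjoint (see the caveat below).
    rootnode(A) === rootnode(B) || return false
    # Same tree: two blocks can only share data if both their row and
    # column index ranges intersect.
    return !isempty(intersect(rowrange(A), rowrange(B))) &&
           !isempty(intersect(colrange(A), colrange(B)))
end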

@AZEY4 (Collaborator, Author)

If they don't have a common root, it is not necessarily true that they don't overlap. For example, we can have this situation:

M = Matrix(...)
H = HMatrix(M, ...)
H1 = HMatrix(M, ...)

Then even if we check the intersections of H and H1, we still need to compare their data. But we can assume that H and H1 are built correctly (without using one data matrix for more than one HMatrix) and then remove the loops altogether.

Comment on lines +96 to +140
function _lu_threads!(M::HMatrix, compressor, bufs = nothing, level = 0, parent = (0, 0))
    if isleaf(M)
        # Dense leaf: factor it in place as a single task.
        @dspawn begin
            @RW(M)
            d = data(M)
            @assert d isa Matrix
            lu!(d, NOPIVOT())
        end label = "lu($(parent[1]),$(parent[2]))\nlevel=$(level)"
    else
        @assert !hasdata(M)
        chdM = children(M)
        m, n = size(chdM)
        for i in 1:m
            # Factor the diagonal block, ...
            _lu_threads!(chdM[i, i], compressor, bufs, level + 1, (i, i))
            # ... solve the triangular systems on its row and column, ...
            for j in (i+1):n
                @dspawn ldiv!(
                    UnitLowerTriangular(@R(chdM[i, i])),
                    @RW(chdM[i, j]),
                    compressor,
                    bufs,
                ) label = "ldiv($i,$j)\nlevel=$(level+1)"
                @dspawn rdiv!(
                    @RW(chdM[j, i]),
                    UpperTriangular(@R(chdM[i, i])),
                    compressor,
                    bufs,
                ) label = "rdiv($j,$i)\nlevel=$(level+1)"
            end
            # ... and update the trailing submatrix.
            for j in (i+1):m
                for k in (i+1):n
                    @dspawn hmul!(
                        @RW(chdM[j, k]),
                        @R(chdM[j, i]),
                        @R(chdM[i, k]),
                        -1,
                        1,
                        compressor,
                        bufs,
                    ) label = "hmul($j,$k)\nlevel=$(level+1)"
                end
            end
        end
    end
    return M
end
@maltezfaria (Member)

Seems fine as a first implementation, but this level of granularity is not going to be sufficient for good parallel scaling (I think). You probably need to spawn tasks at a finer scale inside the hmul! and ldiv!/rdiv! functions...

We should probably start by looking at the task graph for the current implementation. Could you post one, e.g. on the PR?
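For reference, the task graph can be captured with DataFlowTasks' logging facilities, along these lines (the @log macro and GraphViz integration follow the DataFlowTasks documentation; run_hlu! is a placeholder for whatever driver calls _lu_threads!):

using DataFlowTasks, GraphViz

# Record scheduling information while the factorization runs;
# `run_hlu!(H)` is a placeholder for the entry point driving _lu_threads!.
log_info = DataFlowTasks.@log run_hlu!(H)

# Render the inferred DAG.
dag = GraphViz.Graph(log_info)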

@AZEY4 (Collaborator, Author) commented Jun 24, 2024
Yes, you are right. I am planning to use a finer level of granularity and to spawn tasks inside ldiv!, rdiv!, and hmul!.
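For illustration, a sketch of how finer-grained spawning could look for hmul! (it reuses the hmul! signature from this PR but is otherwise hypothetical): recurse one level through compatible block structures so each sub-product becomes its own task.

# Sketch only: each sub-block update C[i,j] ← b*C[i,j] + a*A[i,k]*B[k,j]
# becomes its own task, exposing more parallelism to the scheduler.
function hmul_finer!(C::HMatrix, A::HMatrix, B::HMatrix, a, b, compressor, bufs)
    if isleaf(C) || isleaf(A) || isleaf(B)
        @dspawn hmul!(@RW(C), @R(A), @R(B), a, b, compressor, bufs) label = "hmul leaf"
    else
        Cc, Ac, Bc = children(C), children(A), children(B)
        for i in axes(Cc, 1), j in axes(Cc, 2)
            for k in axes(Ac, 2)
                # Apply the scaling `b` only on the first update of C[i,j];
                # conflicting tasks on the same block run in spawn order.
                hmul_finer!(Cc[i, j], Ac[i, k], Bc[k, j], a, k == 1 ? b : 1, compressor, bufs)
            end
        end
    end
    return C
end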

@maltezfaria (Member) left a comment

Left some comments. Overall this looks good as a first implementation, but it is not ready to be merged. We need:

  • An HLU task graph to see what is being parallelized
  • Some performance comparisons

As mentioned in one of the comments, I think we will need to spawn tasks at finer granularity to get decent scaling.

@maltezfaria (Member) commented

I see that we are getting the same issue, but now with a missing key, when AirSpeedVelocity tries to generate the plots. Do you think it is a bug on their side?

@AZEY4 (Collaborator, Author) commented Jun 24, 2024

Here is a performance plot for this implementation. As can be seen, we should try to split the computation of hmul!, rdiv!, and ldiv!. I used only two threads because that is the maximum number of tasks that can run at the same time.

[Image: performance_table]


@AZEY4 (Collaborator, Author) commented Jun 24, 2024

> I see that we are getting the same issue, but now with a missing key, when AirSpeedVelocity tries to generate the plots. Do you think it is a bug on their side?

I think it is a bug on their side, because I have never seen this problem when running the benchmarks locally or running the workflows in a Docker container. I will check it carefully again and try to find a solution.
