
Conversation

@yuanhu2435

TensorpoolMKLAllocator combines tensorpool_allocator and mkl_allocator to improve allocator performance for both small-size and large-size memory allocations.
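The dispatch code itself is not shown in this comment, so the following is a minimal sketch of the idea as described: a wrapper that routes each request to one of the two underlying allocators based on a size threshold (the TENSORPOOL_MKL_LARGE_SIZE threshold described further down). All class and method names here are hypothetical, and the direction of the split (small requests to the pool, large requests to MKL) is inferred from the variable name rather than confirmed by the PR.

```c++
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for the two allocators being combined; in the
// repo these correspond to tensorpool_allocator and mkl_allocator.
struct TensorPoolBackend {
  void* Allocate(std::size_t n) { return std::malloc(n); }  // pooled in reality
  void Deallocate(void* p) { std::free(p); }
};
struct MklBackend {
  void* Allocate(std::size_t n) { return std::malloc(n); }  // MKL-backed in reality
  void Deallocate(void* p) { std::free(p); }
};

// Sketch of the combined allocator: requests at or above large_size_ take
// the MKL path, everything smaller stays in the tensor pool.
class CombinedAllocatorSketch {
 public:
  explicit CombinedAllocatorSketch(std::size_t large_size)
      : large_size_(large_size) {}

  void* Allocate(std::size_t num_bytes) {
    return num_bytes >= large_size_ ? mkl_.Allocate(num_bytes)
                                    : pool_.Allocate(num_bytes);
  }

  // The size is passed in again so the sketch can route the free to the
  // owning backend; a real implementation would track ownership instead.
  void Deallocate(void* ptr, std::size_t num_bytes) {
    if (num_bytes >= large_size_) {
      mkl_.Deallocate(ptr);
    } else {
      pool_.Deallocate(ptr);
    }
  }

 private:
  std::size_t large_size_;  // e.g. resolved from TENSORPOOL_MKL_LARGE_SIZE
  TensorPoolBackend pool_;
  MklBackend mkl_;
};
```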

To confirm that our allocator is effective, we tested it on the "shoucai" model.

Test env: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, 1 socket, 32 cores, frequency locked to 2.6GHz
Command line:

```
$ python3 graph_runner.py --input-graph=sub_graph_external.pbtxt --input-data=placeholder_dump.json
```

Result (shoucai model)

| Allocator | loop | process_num | latency(1) | latency(2) | latency(3) | latency(4) | latency(5) | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorPoolAllocator | 1000 | 1 | 94.84 | 86.32 | 90.45 | 86.47 | 91.89 | 90.00 |
| TensorPoolAllocator | 1000 | 5 | 342.78 | 324.96 | 344.92 | 337.50 | 339.84 | 338.00 |
| TensorPoolAllocator | 1000 | 10 | 716.42 | 722.26 | 716.44 | 709.10 | 720.14 | 716.87 |
| MKLAllocator | 1000 | 1 | 59.56 | 58.15 | 58.80 | 59.82 | 60.03 | 59.27 |
| MKLAllocator | 1000 | 5 | 324.29 | 322.76 | 321.69 | 326.25 | 324.64 | 323.93 |
| MKLAllocator | 1000 | 10 | 701.34 | 700.69 | 701.10 | 701.55 | 701.09 | 701.15 |
| TensorpoolMKLAllocator | 1000 | 1 | 61.27 | 58.49 | 57.48 | 58.51 | 61.36 | 59.42 |
| TensorpoolMKLAllocator | 1000 | 5 | 327.43 | 327.40 | 327.32 | 326.21 | 326.87 | 327.05 |
| TensorpoolMKLAllocator | 1000 | 10 | 707.21 | 707.69 | 706.83 | 708.57 | 708.25 | 707.71 |


Add an environment variable "TENSORPOOL_MKL_LARGE_SIZE" to set the large-size threshold. It defaults to 512K.
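As a rough illustration of how that threshold could be resolved, here is a minimal sketch of reading the variable with the 512K default. Treating the value as a plain byte count is an assumption, since the PR does not state the accepted format, and `LargeSizeThreshold` is a hypothetical helper name.

```c++
#include <cstddef>
#include <cstdlib>

// Sketch only: resolve the large-size threshold from the
// TENSORPOOL_MKL_LARGE_SIZE environment variable, falling back to the
// 512K default described in the PR. Treating the value as a plain byte
// count is an assumption.
std::size_t LargeSizeThreshold() {
  constexpr std::size_t kDefault = 512 * 1024;  // 512K
  const char* env = std::getenv("TENSORPOOL_MKL_LARGE_SIZE");
  if (env == nullptr || *env == '\0') return kDefault;
  char* end = nullptr;
  unsigned long long value = std::strtoull(env, &end, 10);
  // Fall back to the default on empty, non-numeric, or zero values.
  if (end == env || *end != '\0' || value == 0) return kDefault;
  return static_cast<std::size_t>(value);
}
```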

Signed-off-by: Lin Xie <lin.xie@intel.com>
Signed-off-by: Yuan Hu <yuan1.hu@intel.com>
@changqi1 (Owner) commented Apr 20, 2022

@yuanhu2435 Thanks. I got your perf data from the shoucai model, but I can't see the perf differences between small-size and large-size allocations. Did you test shoucai model perf with both small and large batch sizes?

From the table, MKLAllocator's latency is the lowest, not TensorpoolMKLAllocator's. So would you please show us both situations: one where TensorpoolMKLAllocator latency == MKLAllocator latency < TensorPoolAllocator latency, and one where TensorpoolMKLAllocator latency == TensorPoolAllocator latency < MKLAllocator latency?

@pujiang2018

I think we need to collect more perf data from more models, since this is a fundamental change.
