diff --git a/transposed/Makefile b/transposed/Makefile new file mode 100644 index 00000000..d96eedbe --- /dev/null +++ b/transposed/Makefile @@ -0,0 +1,3 @@ +include ../P0009/wg21/Makefile + +.DEFAULT_GOAL := $(HTML) diff --git a/transposed/P3222R0.html b/transposed/P3222R0.html new file mode 100644 index 00000000..da06abfa --- /dev/null +++ b/transposed/P3222R0.html @@ -0,0 +1,941 @@ + + +
+ + + +| Document #: | +P3222R0 | +
| Date: | +2024/04/08 | +
| Project: | +Programming Language C++ + LEWG + |
+
| Reply-to: | +
+ Mark Hoemmen <mhoemmen@nvidia.com> + |
+
Thanks to Nicolas Morales (Sandia National Laboratories) for review +feedback.
+We propose to change the C++ Working Draft for C++26 so that
+linalg::transposed includes special cases for
+layout_left_padded and layout_right_padded.
+These are the two mdspan layouts proposed by P2642R6, which was voted
+into the C++ Working Draft at the Tokyo 2024 WG21 meeting. This change
+will make it easier for linalg implementations to optimize
+for these two layouts by dispatching to an existing optimized C or
+Fortran BLAS (Basic Linear Algebra Subroutines). Delaying this until
+after C++26 would be a breaking change.
linalg::transposed?WG21 voted P1673R13 into the C++ Working Draft at the Kona 2023 WG21
+meeting. P1673 introduces the linalg::transposed function,
+which takes a rank-2 mdspan and returns a read-only
+mdspan representing the transpose of its input. The
+transpose of a rank-2 mdspan A is a rank-2 mdspan
+AT such that A[i, j] refers to the same
+element as AT[j, i] for all i, j in the domain
+of A. Transposing a matrix “flips” it over its diagonal.
+The diagonal of a rank-2 mdspan A is the set of
+all elements A[i, i] where i, i is in the
+domain of A. A key feature of transposed is
+that it represents a read-only “transpose view” of the data, without
+copying or moving elements of the matrix.
The transposed function currently has “special cases”
+for three layouts: layout_left, layout_right,
+and layout_stride. For these three layouts,
+linalg::transposed works by changing the return type’s
+layout and/or layout mapping in a way that reverses the extents and
+strides. For layout_left, “reversing the strides” means
+layout_right, and vice versa. Here are two examples.
The transpose layout mapping of
+layout_left::mapping<extents<int, 3, 4>>{} is
+layout_right::mapping<extents<int, 4, 3>>{}.
The transpose layout mapping of
+layout_right::mapping<extents<int, 3, 4>>{} is
+layout_left::mapping<extents<int, 4, 3>>{}.
For both layout_left and layout_right, the
+mapping does not store the strides; they are computed from the extents.
+For layout_stride, the mapping actually stores the strides
+and its constructor takes them as a std::array, so
+“reversing the strides” means passing in a reverse-order
+array of the input strides. For example, the transpose
+layout of
auto m = layout_stride::mapping<extents<int, 3, 4>>{
+ extents<int, 3, 4>{}, array{2, 6}};is
+auto mt = layout_stride::mapping<extents<int, 4, 3>>{
+ extents<int, 4, 3>{}, array{6, 2}};The transpose layouts described above reflect the current +behavior in the C++ Working Draft.
+The current behavior in the C++ Working Draft is that for
+any layout which is not one of the three cases listed above
+(layout_left, layout_right, or
+layout_stride), transposed resorts to a
+“fall-back.” That is, it wraps the original layout Layout’s
+mapping in a nested layout mapping
+layout_transpose<Layout>::mapping. That nested
+mapping’s operator() invokes the original layout with
+indices reversed.
transposed need special cases?The fall-back case correctly represents the transpose of an mdspan
+with any layout. If so, why do we need special cases? Why doesn’t
+transposed just always use
+layout_transpose?
The design intent of P1673 is that implementations can dispatch to an
+existing optimized C or Fortran BLAS if the caller’s mdspan
+arguments satisfy the right conditions. If the arguments do not
+satisfy these conditions, the implementation may dispatch to possibly
+unoptimized “generic” code. As P1673 explains, failure to dispatch to an
+optimized library may result in invocation of an asymptotically slower
+algorithm, and/or may fail to take advantage of any acceleration
+hardware. Some of these conditions depend only on the argument types and
+thus can always be checked at compile time, while other conditions
+depend on possibly run-time properties of the objects.
One of those conditions is that the layout mappings satisfy the +following three properties.
+They are all unique (is_unique() is
+true).
They are all strided (is_strided() is
+true).
At least one of their strides equals to one.
If all three are true, we say that the layout mapping is
+BLAS-compatible. For all layout_left and
+layout_right mappings, these are known at compile time
+always to be true. For layout_stride, testing the third
+property requires a possibly run-time check.
Knowing at compile time whether a layout mapping is BLAS-compatible
+makes it a zero-cost abstraction for the implementation to dispatch to
+the BLAS based on that mapping. As we will see below, both
+layout_left_padded and layout_right_padded are
+also known at compile time always to be BLAS-compatible. However,
+transposed does not have special cases for these two
+layouts.
transposed
+had no special cases at all?Suppose that transposed had no special cases for input
+layouts. Implementations of P1673’s algorithms could still optimize by
+adding their own special cases for specific input layouts, such as
+layout_transpose<layout_left> and
+layout_transpose<layout_right>. This would not
+prevent implementations from dispatching to the BLAS. However, it would
+complicate the implementation and possibly add compile-time cost by
+introducing more internal overloads and/or specializations. Every
+algorithm would need to check for twice as many layout special cases:
+the BLAS-compatible layout, and layout_transpose of that
+layout. The code that dispatches to the BLAS would need to do the same
+thing that transposed does for its special cases, namely
+reverse the extents and strides in the transposed case. Furthermore,
+users may want to use transposed with their own P1673-like
+linear algebra algorithms. Such users would generally expect
+transposed to optimize for the common case of known strided
+layouts. Without that optimization, they may end up implementing their
+own transposed functions. The proliferation of incompatible
+transposed functions would hinder interoperability of
+libraries.
layout_left_padded
+and layout_right_paddedWG21 voted P2642R6 into the C++ Working Draft at the Tokyo 2024 WG21
+meeting. P2642R6 adds two layouts, layout_left_padded and
+layout_right_padded. The data layouts described by these
+two class templates are exactly the two layouts understood by the C BLAS
+(Basic Linear Algebra Subroutines), as explained in P1673 and P1674.
+These layouts have one stride (the leftmost resp. rightmost) that is
+known at compile time to be one, and one stride (the next leftmost resp.
+rightmost) that the user provides either at compile time or run time.
+The remaining strides are computed from these and the extents, as if
+with layout_left resp. layout_right where the
+user-provided stride represents a possibly larger extent.
These two layouts are exactly the layouts supported by the BLAS. The
+BLAS calls the one user-provided stride the matrix’s “leading
+dimension.” (The name hints at the reason for these layouts, namely that
+they represent the layout of a submatrix of contiguous rows and columns
+of a possibly larger matrix, whose dimension is the user-provided
+stride.) BLAS implementations can optimize transpose of input matrices
+in these two layouts without copying data, just by reversing extents and
+retaining the one input stride. Furthermore, it is known at compile time
+that any mapping of these two layouts is BLAS-compatible. Therefore,
+it’s reasonable to expect P1673 implementations to optimize for
+layout_left_padded and layout_right_padded.
+The way to do that would be for transposed of a
+layout_left_padded<PaddingValue> mdspan
+to return a layout_right_padded<PaddingValue>
+mdspan with extents swapped and the one “padding stride”
+copied over, and vice versa for transposed of a
+layout_right_padded<PaddingValue>
+mdspan. However, the C++ Working Draft currently handles
+those two layouts with the “fall-back” layout_transpose
+case.
Earlier versions of P1673 defined two mdspan layouts,
+layout_blas_general<column_major_t> and
+layout_blas_general<row_major_t>. P1673’s
+transposed function originally included special cases for
+those two layouts, as one can see in P1673R9’s
+[linalg.transp.transposed]. Version R10 of P1673 moved those layouts to
+P2642 and renamed them layout_left_padded and
+layout_right_padded, respectively. P167310 removed these
+special cases from transposed so that P2642 and P1673 could
+make progress separately. However, P1673’s authors always intended to
+optimize transposed for those layouts. WG21 voted P2642R6
+into the C++ Working Draft at the Tokyo 2024 WG21 meeting, so now it’s
+possible to carry out that intent.
We propose to add those two special cases. The result of
+transposed on a
+layout_left_padded<PaddingValue> mdspan
+will be a layout_right_padded<PaddingValue>
+mdspan with extents swapped and the one “padding stride”
+copied over. Likewise, the result of transposed on a
+layout_right_padded<PaddingValue> mdspan
+will be a layout_left_padded<PaddingValue>
+mdspan with extents swapped and the one “padding stride”
+copied over.
The following example shows how this proposal would change the return
+type of transposed.
// optimized overload
+extern void some_algorithm(
+ mdspan<const float, dextents<size_t, 2>, layout_right_padded<dynamic_extent>> A_T,
+ mdspan<const float, dextents<size_t, 2>, layout_left_padded<dynamic_extent>> B,
+ mdspan<float, dextents<size_t, 2>, layout_left_padded<dynamic_extent>> C);
+
+template<class GenericFallBackLayout>
+void some_algorithm(
+ mdspan<const float, dextents<size_t, 2>, GenericFallBackLayout> A_T,
+ mdspan<const float, dextents<size_t, 2>, layout_left_padded<dynamic_extent>> B,
+ mdspan<float, dextents<size_t, 2>, layout_left_padded<dynamic_extent>> C)
+{
+ // ... slow generic code ...
+}
+
+void some_function(size_t N) {
+ vector<float> A_storage(4 * N * N);
+ vector<float> B_storage(N * N);
+ vector<float> C_storage(N * N);
+
+ mdspan A_parent{A_storage.data(), extents{2 * N, 2 * N}
+ mdspan B{B_storage.data(), extents{N, N}};
+ mdspan C{C_storage.data(), extents{N, N}};
+
+ // ... fill A_parent and B with useful values ...
+
+ // A views the upper left N x N submatrix of its "parent."
+ mdspan A = submdspan(A_parent, tuple{0, N}, tuple{0, N});
+
+ // Approval of P2642R6 added this to the C++ Working Draft.
+ static_assert(is_same_v<
+ decltype(A)::layout_type,
+ layout_left_padded<dynamic_extent>>);
+ static_assert(A.stride(0) == 1); // compile-time value
+ assert(A.stride(1) == A_parent.stride(1)); // possibly run-time value
+
+ mdspan A_T = linalg::transposed(A);
+ some_algorithm(A_T, B, C);
+}decltype(A_T)::layout_type is
+layout_transposed<layout_left_padded<dynamic_extent>>.
Generic overload of some_algorithm is
+called.
decltype(A_T)::layout_type is
+layout_right_padded<dynamic_extent>.
The statement static_assert(A_T.stride(1) == 1); is
+well formed.
Optimized overload of some_algorithm is
+called.
The type of the layout of the mdspan returned by
+transposed is observable and is specified in the current
+wording. Therefore, delaying this change until after C++26 would be a
+breaking change.
We have already discussed above how the lack of special cases for
+transposed would affect use of transposed with
+user’s custom P1673-like algorithms, and possibly hinder
+interoperability of different users’ libraries. That leaves the effects
+of not having this proposal on P1673 implementations themselves.
The first and worst possibility is that implementations might not not
+optimize for layout_left_padded or
+layout_right_padded at all. That would be unfortunate,
+because those are exactly the layouts that the BLAS supports and has
+supported since the 1980’s. This would hinder adoption of P1673
+algorithms.
The second possibility is that implementations may nevertheless still
+dispatch to the BLAS for any BLAS-compatible layout. This works for
+user-defined layouts as well as the Standard layouts.
+layout_transpose::mapping preserves the uniqueness and
+stridedness of its nested layout mapping, and even reverses the strides
+if the nested layout mapping is strided. However, without this proposal,
+implementations would still need to check the actual stride values at
+run time, even though for layout_left_padded and
+layout_right_padded, it’s known at compile time that at
+least one of the strides is one. The run-time check would add overhead
+and complicate the implementation.
The third possibility is that implementations might add special cases
+for
+layout_transpose<layout_left_padded<PaddingValue>>
+and
+layout_transpose<layout_right_padded<PaddingValue>>
+in the algorithms, rather than in transposed. However, as
+discussed above in the section “What if transposed had no
+special cases?”, this would complicate the implementation and possibly
+add compile-time cost.
The fourth possibility is that the implementation may simply not +optimize the transposed case for any layouts. This would be unfortunate, +as the BLAS itself favors transposes in some cases. For example, +implementations of matrix-matrix multiply for general dense matrices +(GEMM) have simpler code and may perform better if exactly one of the +two input matrices is transposed. Under these so-called “NT” and “TN” +cases, typical optimized implementations end up reading the matrices as +if they have the same memory layout.
+A valid P1673 implementation might not dispatch to the BLAS for all
+BLAS-compatible layouts. Instead, it might only do so for the four
+Standard layouts which are known to be BLAS compatible at compile time:
+layout_left, layout_right,
+layout_left_padded, and layout_right_padded.
+This approach has three advantages.
It minimizes run-time overhead by not calling
+is_unique, is_strided, or
+stride.
It can use function overloading on specific layout types to
+dispatch to the BLAS, instead of generic constraint checks (like a
+constraint that is_always_strided() is true)
+that may increase compilation cost.
It avoids the risk that user-defined layouts incorrectly define
+their stridedness or uniqueness. Users who copy-paste an existing layout
+to write their own might forget to change is_strided() or
+is_unique().
However, without this proposal, such an implementation would need to +resort to either the third or fourth possibility above.
+transposed_mapping customization pointIn the previous section, we mentioned that users may want to use
+transposed with their own P1673-like linear algebra
+algorithms. A reviewer suggested that we make it possible for users to
+optimize transposed for their user-defined layouts, by
+adding a transposed_mapping customization point. This would
+be analogous to the submdspan_mapping customization point
+that approval of P2630R4 (submdspan) added to the C++
+Working Draft. The new transposed_mapping customization
+point would take an input layout mapping and return the layout mapping
+that the transpose of an mdspan with the input mapping
+would have. If no customization exists for a given layout mapping,
+transposed would default to using
+layout_transpose, as before.
This design would have the following advantages.
+It would be easier to specify the wording of
+transposed. Instead of its current list of special cases,
+it would look more like the submdspan wording that puts all
+the layout-specific behavior in the customization point.
Implementations that provide implementation-specific layouts
+could optimize transposed for those layouts.
Users could use transposed with their custom
+P1673-like algorithms and implement optimizations for their user-defined
+layouts.
Here are some reasons why WG21 might not want to do +this.
+It would reserve yet another customization point name.
It would not change the set of BLAS-compatible layouts, and would +not change the ability of P1673 implementations to dispatch to the BLAS +for any BLAS-compatible layout.
submdspan_mapping enables functionality – the
+ability to slice mdspan with user-defined layouts. In
+contrast, transposed_mapping would only enable (or
+simplify) optimizations for one of many legal ways to implement the
+Standard.
It makes less sense for transposed_mapping to be a
+hidden friend than it does for submdspan_mapping to be a
+hidden friend. However, defining transposed_mapping as a
+nonmember function without using the hidden friends technique would make
+transposed_mapping vulnerable to implicit conversions. This
+could make transposed’s calls to the customization point
+ambiguous.
LEWG has already seen the proposed design over several reviews +(the last being the 2022/07/05 telecon review of P1673R9), but has not +yet had a chance to review this alternative customization point +design.
Regarding (4), the C++ Working Draft defines
+submdspan_mapping customizations for Standard layout
+mappings as “hidden friends.” The hidden friends technique protects use
+of the customization from possible ambiguities due to implicit
+conversions. This matters because layout mappings have many implicit
+conversions. These conversions help make mdspan-based
+interfaces more usable. Slicing is closely enough related to a layout
+mapping’s behavior that it makes sense to put the slicing customization
+in the layout mapping. However, it makes less sense to make
+transposed_mapping a hidden friend of the mapping, because
+transposition only works for rank-2 mappings, and transposition is
+specific to linear algebra and related computations.
We think the disadvantages of a customization point outweigh the
+advantages. For example, we do not recommend adding
+transposed_mapping as a hidden friend of all the Standard
+layout mappings. We also would not want to force users to remember to
+protect their customizations from the possibility of ambiguous
+overloads. As a result, we do not provide wording for this alternative
+design. Nevertheless, we would like LEWG to poll this option.
This proposal is implemented as
+PR 268 in the
+reference mdspan implementation.
++Text in blockquotes is not proposed wording, but rather instructions +for generating proposed wording.
+Make the following changes to the latest C++ Working Draft as of the +time of writing. All wording is relative to the latest C++ Working +Draft.
+In [version.syn], increase the value of the +
+__cpp_lib_linalgmacro by replacing YYYMML below with the +integer literal encoding the appropriate year (YYYY) and month (MM).
#define __cpp_lib_linalg YYYYMML // also in <linalg>++Change [linalg.transp.transposed] paragraph 3 (“Let +
+ReturnExtentsbe …”) by inserting the following +subparagraphs after subparagraph 3.2 (“otherwise, +layout_left…”) and before current subparagraph 3.3 +(“otherwise,layout_stride…”, to be renumbered to +paragraph 3.5), and renumbering subparagraphs and subsubparagraphs +within paragraph 3 thereafter.
(3.3)
+otherwise, layout_right_padded<PaddingValue> if
+Layout is
+layout_left_padded<PaddingValue> for some
+size_t value PaddingValue;
(3.4)
+otherwise, layout_left_padded<PaddingValue> if
+Layout is
+layout_right_padded<PaddingValue> for some
+size_t value PaddingValue;
++Change [linalg.transp.transposed] paragraph 4 (Returns +clause of
+transposed) by inserting the following +subparagraphs after subparagraph 4.1 (forLayoutbeing +layout_left,layout_right, or a specialization +oflayout_blas_packed) and before current subparagraph 4.2 +(forLayoutbeinglayout_stride, to be +renumbered to subparagraph 4.4), and renumbering subparagraphs within +paragraph 4 thereafter.
(4.2)
+otherwise,
+R(a.data_handle(), ReturnMapping(transpose-extents(a.mapping().extents()), a.mapping().stride(1)), a.accessor())
+if Layout is
+layout_left_padded<PaddingValue> for some
+size_t value PaddingValue;
(4.3)
+otherwise,
+R(a.data_handle(), ReturnMapping(transpose-extents(a.mapping().extents()), a.stride(0)), a.accessor())
+if Layout is
+layout_right_padded<PaddingValue> for some
+size_t value PaddingValue;