Skip to content

Switch from xPPTRF to xPOTRF to improve performance in TurbSim on macOS #3120

@IrisMeasure

Description

@IrisMeasure

I'm using TurbSim and OpenFAST 4.1.2 for wind turbine simulations on both Windows (Windows 11 24H2, AMD 9950X) and macOS (macOS 26.2, M4 Pro). Since there is no official TurbSim distribution for macOS, I compiled TurbSim myself using GCC 15.2.0 with the following build flags:

BUILD_UNIT_TESTING=OFF
DOUBLE_PRECISION=OFF
VARIABLE_TRACKING=OFF

Performance in OpenFAST is quite similar on both platforms. However, TurbSim takes approximately 3 to 5 times longer on macOS than on Windows for the same input file.

To investigate the issue, I used Instruments in Xcode to profile TurbSim on macOS with the Time Profiler. The hotspot subroutine is coh2h(), which primarily calls LAPACK_pptrf() from the nwtc_lapack.f90 module, and the subroutine LAPACK_pptrf() uses xPPTRF (SPPTRF or DPPTRF) to perform Cholesky decomposition.

I made the following modification in the nwtc_lapack.f90 file for the LAPACK_SPPTRF subroutine (removing the comments):

Original Version:

SUBROUTINE LAPACK_SPPTRF (UPLO, N, AP, ErrStat, ErrMsg)
    INTEGER,         intent(in   ) :: N                 
    REAL(SiKi)      ,intent(inout) :: AP( : )           
    INTEGER(IntKi),  intent(  out) :: ErrStat           
    CHARACTER(*),    intent(  out) :: ErrMsg            
    CHARACTER(1),    intent(in   ) :: UPLO              
    INTEGER                        :: INFO 
    ErrStat = ErrID_None
    ErrMsg  = ""
    CALL SPPTRF (UPLO, N, AP, INFO)
    IF (INFO /= 0) THEN
        ErrStat = ErrID_FATAL
        WRITE( ErrMsg, * ) INFO
        IF (INFO < 0) THEN
            ErrMsg  = "LAPACK_SPPTRF: illegal value in argument "//TRIM(ErrMsg)//"."
        ELSE
            ErrMsg = 'LAPACK_SPPTRF: Leading minor order '//TRIM(ErrMsg)//' of A is not positive definite, so Cholesky factorization could not be completed.'
        END IF
    END IF
RETURN
END SUBROUTINE LAPACK_SPPTRF

Modified Version:

SUBROUTINE LAPACK_SPPTRF (UPLO, N, AP, ErrStat, ErrMsg)
    INTEGER,         INTENT(IN   ) :: N
    REAL(SiKi),      INTENT(INOUT) :: AP(:)
    INTEGER(IntKi),  INTENT(  OUT) :: ErrStat
    CHARACTER(*),    INTENT(  OUT) :: ErrMsg
    CHARACTER(1),    INTENT(IN   ) :: UPLO

    REAL(SiKi), ALLOCATABLE :: A(:,:)
    INTEGER :: I, J, IDX, INFO

    ErrStat = ErrID_None
    ErrMsg  = ""

    IF (N <= 0) RETURN

    ALLOCATE(A(N,N))
    A = 0.0_SiKi

    IF (UPLO == 'U') THEN
        IDX = 0
        DO J = 1, N
            DO I = 1, J
                IDX = IDX + 1
                A(I,J) = AP(IDX)
            END DO
        END DO
    ELSE IF (UPLO == 'L') THEN
        IDX = 0
        DO J = 1, N
            DO I = J, N
                IDX = IDX + 1
                A(I,J) = AP(IDX)
            END DO
        END DO
    ELSE
        ErrStat = ErrID_FATAL
        ErrMsg  = "LAPACK_SPPTRF: invalid UPLO value."
        DEALLOCATE(A)
        RETURN
    END IF

    CALL SPOTRF (UPLO, N, A, N, INFO)

    IF (INFO /= 0) THEN
        ErrStat = ErrID_FATAL
        WRITE(ErrMsg,*) INFO
        IF (INFO < 0) THEN
            ErrMsg = "LAPACK_SPPTRF: illegal value in argument "//TRIM(ErrMsg)//"."
        ELSE
            ErrMsg = "LAPACK_SPPTRF: Leading minor of order "//TRIM(ErrMsg)// &
                     " of A is not positive definite, so Cholesky factorization could not be completed."
        END IF
        DEALLOCATE(A)
        RETURN
    END IF

    IF (UPLO == 'U') THEN
        IDX = 0
        DO J = 1, N
            DO I = 1, J
                IDX = IDX + 1
                AP(IDX) = A(I,J)
            END DO
        END DO
    ELSE
        IDX = 0
        DO J = 1, N
            DO I = J, N
                IDX = IDX + 1
                AP(IDX) = A(I,J)
            END DO
        END DO
    END IF

    DEALLOCATE(A)
    RETURN

END SUBROUTINE LAPACK_SPPTRF

The core modification is to replace the packed storage Cholesky decomposition (SPPTRF) with the full storage Cholesky decomposition (SPOTRF). Also, this modification maintains the subroutine signature by using a wrapper internally to convert between packed and full storage formats, ensuring that no changes are needed on the caller side. The modified version is significantly faster than the original one (see the comparsion below), with minimal additional memory overhead.

Comparsion:

I used both versions of TurbSim to generate (1) Grid = 43 x 43, 120-second .bts file; (2) Grid = 23 x 23, 600-second .bts file. The performance results are shown below:

(1)

Version Total Time coh2h()
Original (SPPTRF) 1.89 minutes 1.76 minutes
Modified (SPOTRF) 11 seconds 3.5 seconds

(2)

Version Total Time coh2h()
Original (SPPTRF) 15.5 seconds 12.4 seconds
Modified (SPOTRF) 4.3 seconds 1.1 seconds

Even though the modified version using SPOTRF introduces additional matrix copying, the speed improvement over SPPTRF is substantial. In fact, TurbSim’s Coh2h() subroutine can be rewritten to directly use xPOTRF, eliminating the additional copying overhead.

Recommendation:

Given the significant performance improvement on macOS, I recommend rewriting the Coh2h() subroutine to use xPOTRF instead of xPPTRF on macOS. This change provides a substantial speed boost on macOS.

Metadata

Metadata

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions