-
Notifications
You must be signed in to change notification settings - Fork 507
Description
I'm using TurbSim and OpenFAST 4.1.2 for wind turbine simulations on both Windows (Windows 11 24H2, AMD 9950X) and macOS (macOS 26.2, M4 Pro). Since there is no official TurbSim distribution for macOS, I compiled TurbSim myself using GCC 15.2.0 with the following build flags:
BUILD_UNIT_TESTING=OFF
DOUBLE_PRECISION=OFF
VARIABLE_TRACKING=OFFPerformance in OpenFAST is quite similar on both platforms. However, TurbSim takes approximately 3 to 5 times longer on macOS than on Windows for the same input file.
To investigate the issue, I used Instruments in Xcode to profile TurbSim on macOS with the Time Profiler. The hotspot subroutine is coh2h(), which primarily calls LAPACK_pptrf() from the nwtc_lapack.f90 module, and the subroutine LAPACK_pptrf() uses xPPTRF (SPPTRF or DPPTRF) to perform Cholesky decomposition.
I made the following modification in the nwtc_lapack.f90 file for the LAPACK_SPPTRF subroutine (removing the comments):
Original Version:
SUBROUTINE LAPACK_SPPTRF (UPLO, N, AP, ErrStat, ErrMsg)
INTEGER, intent(in ) :: N
REAL(SiKi) ,intent(inout) :: AP( : )
INTEGER(IntKi), intent( out) :: ErrStat
CHARACTER(*), intent( out) :: ErrMsg
CHARACTER(1), intent(in ) :: UPLO
INTEGER :: INFO
ErrStat = ErrID_None
ErrMsg = ""
CALL SPPTRF (UPLO, N, AP, INFO)
IF (INFO /= 0) THEN
ErrStat = ErrID_FATAL
WRITE( ErrMsg, * ) INFO
IF (INFO < 0) THEN
ErrMsg = "LAPACK_SPPTRF: illegal value in argument "//TRIM(ErrMsg)//"."
ELSE
ErrMsg = 'LAPACK_SPPTRF: Leading minor order '//TRIM(ErrMsg)//' of A is not positive definite, so Cholesky factorization could not be completed.'
END IF
END IF
RETURN
END SUBROUTINE LAPACK_SPPTRFModified Version:
SUBROUTINE LAPACK_SPPTRF (UPLO, N, AP, ErrStat, ErrMsg)
INTEGER, INTENT(IN ) :: N
REAL(SiKi), INTENT(INOUT) :: AP(:)
INTEGER(IntKi), INTENT( OUT) :: ErrStat
CHARACTER(*), INTENT( OUT) :: ErrMsg
CHARACTER(1), INTENT(IN ) :: UPLO
REAL(SiKi), ALLOCATABLE :: A(:,:)
INTEGER :: I, J, IDX, INFO
ErrStat = ErrID_None
ErrMsg = ""
IF (N <= 0) RETURN
ALLOCATE(A(N,N))
A = 0.0_SiKi
IF (UPLO == 'U') THEN
IDX = 0
DO J = 1, N
DO I = 1, J
IDX = IDX + 1
A(I,J) = AP(IDX)
END DO
END DO
ELSE IF (UPLO == 'L') THEN
IDX = 0
DO J = 1, N
DO I = J, N
IDX = IDX + 1
A(I,J) = AP(IDX)
END DO
END DO
ELSE
ErrStat = ErrID_FATAL
ErrMsg = "LAPACK_SPPTRF: invalid UPLO value."
DEALLOCATE(A)
RETURN
END IF
CALL SPOTRF (UPLO, N, A, N, INFO)
IF (INFO /= 0) THEN
ErrStat = ErrID_FATAL
WRITE(ErrMsg,*) INFO
IF (INFO < 0) THEN
ErrMsg = "LAPACK_SPPTRF: illegal value in argument "//TRIM(ErrMsg)//"."
ELSE
ErrMsg = "LAPACK_SPPTRF: Leading minor of order "//TRIM(ErrMsg)// &
" of A is not positive definite, so Cholesky factorization could not be completed."
END IF
DEALLOCATE(A)
RETURN
END IF
IF (UPLO == 'U') THEN
IDX = 0
DO J = 1, N
DO I = 1, J
IDX = IDX + 1
AP(IDX) = A(I,J)
END DO
END DO
ELSE
IDX = 0
DO J = 1, N
DO I = J, N
IDX = IDX + 1
AP(IDX) = A(I,J)
END DO
END DO
END IF
DEALLOCATE(A)
RETURN
END SUBROUTINE LAPACK_SPPTRFThe core modification is to replace the packed storage Cholesky decomposition (SPPTRF) with the full storage Cholesky decomposition (SPOTRF). Also, this modification maintains the subroutine signature by using a wrapper internally to convert between packed and full storage formats, ensuring that no changes are needed on the caller side. The modified version is significantly faster than the original one (see the comparsion below), with minimal additional memory overhead.
Comparsion:
I used both versions of TurbSim to generate (1) Grid = 43 x 43, 120-second .bts file; (2) Grid = 23 x 23, 600-second .bts file. The performance results are shown below:
(1)
| Version | Total Time | coh2h() |
|---|---|---|
| Original (SPPTRF) | 1.89 minutes | 1.76 minutes |
| Modified (SPOTRF) | 11 seconds | 3.5 seconds |
(2)
| Version | Total Time | coh2h() |
|---|---|---|
| Original (SPPTRF) | 15.5 seconds | 12.4 seconds |
| Modified (SPOTRF) | 4.3 seconds | 1.1 seconds |
Even though the modified version using SPOTRF introduces additional matrix copying, the speed improvement over SPPTRF is substantial. In fact, TurbSim’s Coh2h() subroutine can be rewritten to directly use xPOTRF, eliminating the additional copying overhead.
Recommendation:
Given the significant performance improvement on macOS, I recommend rewriting the Coh2h() subroutine to use xPOTRF instead of xPPTRF on macOS. This change provides a substantial speed boost on macOS.