Jg cs 1735 branch #71

jgabler-hpc · 2026-01-01T15:58:19Z

No description provided.

…d truncates output

…aa clients

Copilot

Pull request overview

This pull request addresses issues related to CS-1735 and includes communication library improvements, debugging enhancements, and various code cleanups. The changes primarily focus on improving message handling, timeout management, and logging capabilities in the communication subsystem.

Enhanced commlib logging integration with optional logging function parameter in comm_init_lib()
Modified message flush and receive behavior with updated return value semantics
Disabled problematic code blocks related to gdi_timeout and app_condition handling with TODO comments for future investigation

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
source/libs/uti/sge_language.cc	Updated copyright year and commented out debug print statement
source/libs/comm/cl_data_types.h	Updated copyright year and fixed typo in comment ("threas" → "threads")
source/libs/comm/cl_communication.cc	Updated copyright year and disabled gdi_timeout setting code with TODO comment
source/libs/comm/cl_commlib.cc	Updated copyright year, added done_on_app_condition_triggered logic (disabled), and added TODO comments
source/daemons/shepherd/sge_shepherd_ijs.cc	Added shepherd_log_list_flush_list() function for extensive tracing, enhanced error handling and logging, fixed typo
source/daemons/qmaster/sge_thread_listener.cc	Updated copyright year and added reference to CS-982 and CS-1735 in comment
source/common/sge_ijs_comm.h	Updated copyright year and modified comm_init_lib() signature to accept optional logging function
source/common/sge_ijs_comm.cc	Updated copyright year, implemented configurable commlib logging, modified comm_flush_write_messages() return semantics, fixed synchronous receive behavior
source/clients/qsh/sge_client_ijs.cc	Updated copyright year and fixed typo in debug message ("sendig" → "sending")

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-01T16:02:32Z

source/common/sge_ijs_comm.cc

+   int retries = 0;

-   elems = cl_com_messages_in_send_queue(handle);
+   unsigned long elems = cl_com_messages_in_send_queue(handle);
   while (elems > 0) {
      /*
       * Don't set the cl_commlib_trigger()-call to be blocking and
       * get rid of the usleep() - it's much slower!
       * The last cl_commlib_trigger()-call will take 1 s.
       */
-      ret = cl_commlib_trigger(handle, 0);
+      int trigger_ret = cl_commlib_trigger(handle, 0);
      /* 
       * Bail out if trigger fails with an error that indicates that we
       * won't be able to send the messages in the near future.
       */
-      if (ret != CL_RETVAL_OK && 
-          ret != CL_RETVAL_SELECT_TIMEOUT &&
-          ret != CL_RETVAL_SELECT_INTERRUPT) {
-         sge_dstring_sprintf(err_msg, cl_get_error_text(ret));
-         retries = ret;  
-         break;   
+      if (trigger_ret != CL_RETVAL_OK &&
+          trigger_ret != CL_RETVAL_SELECT_TIMEOUT &&
+          trigger_ret != CL_RETVAL_SELECT_INTERRUPT &&
+          trigger_ret != CL_RETVAL_THREADS_ENABLED) {
+         sge_dstring_sprintf(err_msg, cl_get_error_text(trigger_ret));
+         sge_dstring_sprintf_append(err_msg, " - after %d retries", retries);
+         return trigger_ret;
      }
+
      elems = cl_com_messages_in_send_queue(handle);
      /* 
       * We just tried to send the messages and it wasn't possible to send
       * all messages - give the network some time to recover.
+       * @todo CS-1739 cl_commlib_trigger() does *not* wait until all messages are sent!
+       * @todo Shall we have a maximum number of retries? A timeout?
+       *       But if the qrsh client is suspended, we probably need to wait until it is unsuspended again.
       */
      /* TODO (NEW): make this work correctly by calling check_client_alive */
      if (elems > 0) {
         usleep(10000);
-         retries--;
+         retries++;


The increment operation has been changed from 'retries--' to 'retries++', and the return value from 'return retries' to 'return -retries'. While this appears intentional to change the semantics (negative for success indicating retry count, positive for errors), the variable name 'retries' is now misleading since it's being used as a counter that gets incremented. Consider renaming this variable to 'retry_count' to better reflect its purpose.

Copilot · 2026-01-01T16:02:32Z

source/common/sge_ijs_comm.cc

 *
 *  INPUTS
 *     dstring *err_msg - Gets the error reason in case of error.
+*     cl_log_func_t    - a commlib logging function which will print CL_LOG messages


The function documentation has been updated to mention the new cl_log_func_t parameter, but the description only says "a commlib logging function which will print CL_LOG messages" without explaining when/why a caller should provide this parameter, what happens if nullptr is passed, or what the expected signature and behavior of the logging function should be. This incomplete documentation makes it difficult for API consumers to understand how to use this parameter correctly.

Suggested change

* dstring *err_msg - Gets the error reason in case of error.

* cl_log_func_t - a commlib logging function which will print CL_LOG messages

* dstring *err_msg - Gets the error reason in case of error.

* cl_log_func_t log_func - Optional commlib logging callback. If non-null,

* this function is invoked for each CL_LOG message

* produced by the communication library. If nullptr

* is passed, the library uses its internal default

* logging behavior. The callback must match the

* cl_log_func_t typedef (see cl_commlib.h), be

* reentrant/thread-safe, and return quickly to avoid

* blocking communication threads.

Copilot · 2026-01-01T16:02:33Z

source/libs/comm/cl_commlib.cc

+               // @todo CS-1740 The following code would speed up shutdown of qrsh and also of listener threads,
+               // but breaks drmaa code - needs further analysis.
+#if 0
+               if (return_value == CL_RETVAL_OK) {
+                  // The app_condition is triggered by commlib itself when new messages are received,
+                  // but also by application code, e.g., to wake up a thread from receiving messages.
+                  // If the app_condition was triggered, but no new messages were received,
+                  // we leave the while loop waiting for messages.
+                  pthread_mutex_lock(handle->messages_ready_mutex);
+                  if (handle->messages_ready_for_read == 0) {
+                     CL_LOG(CL_LOG_INFO, "app_condition has been triggered and no new messages, leaving while loop (2)");
+                     done_on_app_condition_triggered = true;
+                  }
+                  pthread_mutex_unlock(handle->messages_ready_mutex);
+               }
+#endif


Two large blocks of similar code have been added but disabled with #if 0 (lines 5064-5076 and 5112-5125). Both blocks check if app_condition was triggered without new messages and would exit the while loop. The TODO comment states this would speed up shutdown but breaks DRMAA code. Having two instances of nearly identical disabled code suggests unclear design. Either resolve the DRMAA compatibility issue and enable the code, or remove it entirely if it won't be used. Leaving it disabled with unresolved issues creates technical debt.

Suggested change

// @todo CS-1740 The following code would speed up shutdown of qrsh and also of listener threads,

// but breaks drmaa code - needs further analysis.

#if 0

if (return_value == CL_RETVAL_OK) {

// The app_condition is triggered by commlib itself when new messages are received,

// but also by application code, e.g., to wake up a thread from receiving messages.

// If the app_condition was triggered, but no new messages were received,

// we leave the while loop waiting for messages.

pthread_mutex_lock(handle->messages_ready_mutex);

if (handle->messages_ready_for_read == 0) {

CL_LOG(CL_LOG_INFO, "app_condition has been triggered and no new messages, leaving while loop (2)");

done_on_app_condition_triggered = true;

}

pthread_mutex_unlock(handle->messages_ready_mutex);

}

#endif

Copilot · 2026-01-01T16:02:33Z

source/common/sge_ijs_comm.h



-int comm_init_lib(dstring *err_msg);
+int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func = nullptr);


The function signature for comm_init_lib() has been updated to include an optional cl_log_func_t parameter with a default value of nullptr. However, this changes the API without a clear deprecation path. Additionally, mixing default parameters with C-style function declarations may cause compatibility issues, particularly if this function is called from C code or across compilation boundaries.

Suggested change

int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func = nullptr);

int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func);

Copilot · 2026-01-01T16:02:33Z

source/common/sge_ijs_comm.cc

      }
   }
-   return retries;
+   return -retries;


The return value semantics of comm_flush_write_messages() have been changed from returning an error code directly to returning a negative retry count on success and a positive error code on failure. This is inconsistent with typical return value conventions where 0 indicates success and non-zero indicates an error. The caller in line 1247 now needs to check if the return value is positive for errors, which is confusing and error-prone. Consider using a more conventional approach with an out parameter for retry count, or documenting this unusual return convention more explicitly.

Suggested change

return -retries;

return 0;

Copilot · 2026-01-01T16:02:34Z

source/common/sge_ijs_comm.cc

+      // @todo CS-1739 do we need the cl_commlib_trigger, when we are using multi-threaded commlib?
+      // if b_synchron is 0, then it does essentially nothing
+      // otherwise it waits, until a message is available - the same which is done by cl_commlib_receive_message()
+      // itself


A TODO comment references CS-1739, but the related code is disabled with #if 0. The comment questions whether cl_commlib_trigger() is needed when using multi-threaded commlib. If this is a legitimate concern that needs investigation, consider creating a tracking issue or resolving it before merging. Leaving disabled code with unresolved questions reduces code maintainability and makes it unclear whether this code path should be removed or fixed.

Suggested change

// @todo CS-1739 do we need the cl_commlib_trigger, when we are using multi-threaded commlib?

// if b_synchron is 0, then it does essentially nothing

// otherwise it waits, until a message is available - the same which is done by cl_commlib_receive_message()

// itself

/* Intentionally call cl_commlib_trigger() here to drive commlib progress.

* For asynchronous operation (b_synchron == 0) this is effectively a no-op,

* while for synchronous operation it will block until a message is available,

* matching the behavior expected by cl_commlib_receive_message() users.

* This call is kept for consistency across single-threaded and multi-threaded

* commlib implementations.

*/

Copilot · 2026-01-01T16:02:34Z

source/libs/comm/cl_commlib.cc

+*     @todo It only returns the number of messages in the *first* connection.
+*           What if we have multiple connections?


The TODO comment states that cl_com_messages_in_send_queue() only returns the number of messages in the first connection, but doesn't explain what should be done about it or whether this limitation affects the correctness of the surrounding code. If this is a known issue that impacts functionality, it should be documented with a ticket reference and severity. If it doesn't impact functionality, explain why it's acceptable to only check the first connection.

Copilot · 2026-01-01T16:02:34Z

source/daemons/shepherd/sge_shepherd_ijs.cc

+
+         } else if (flush_ret < 0) {
+#ifdef EXTENSIVE_TRACING
+            shepherd_trace("pty_to_commlib: comm_flush_write_messages() did %d retries", -flush_ret);
+#endif
+         }


The call to comm_flush_write_messages() now returns a value that is checked, but the handling logic appears incorrect. The function now returns negative values for the number of retries on success and positive values for error codes on failure. The condition 'if (flush_ret > 0)' correctly checks for errors, but the else-if condition checking 'flush_ret < 0' treats negative values as indicating retries. However, if flush_ret is 0 (zero retries), the code doesn't handle it, which means successful flushes with zero retries won't be logged even under EXTENSIVE_TRACING. Consider adding an explicit check for the success case (flush_ret == 0 or flush_ret <= 0) to ensure all cases are handled correctly.

Suggested change

} else if (flush_ret < 0) {

#ifdef EXTENSIVE_TRACING

shepherd_trace("pty_to_commlib: comm_flush_write_messages() did %d retries", -flush_ret);

#endif

}

}

#ifdef EXTENSIVE_TRACING

else {

if (flush_ret < 0) {

shepherd_trace("pty_to_commlib: comm_flush_write_messages() did %d retries", -flush_ret);

} else { /* flush_ret == 0: success without retries */

shepherd_trace("pty_to_commlib: comm_flush_write_messages() completed without retries");

}

}

#endif

Copilot · 2026-01-01T16:02:35Z

source/daemons/shepherd/sge_shepherd_ijs.cc

+      if (elem->log_parameter == nullptr) {
+         shepherd_trace("COMMLIB|%s|%s|%s",
+                        cl_thread_convert_state_id(elem->log_thread_state),
+                        cl_log_list_convert_type_id(elem->log_type),
+                        elem->log_message);
+      } else {
+         shepherd_trace("COMMLIB|%s|%s|%s %s",
+                        cl_thread_convert_state_id(elem->log_thread_state),
+                        cl_log_list_convert_type_id(elem->log_type),
+                        elem->log_message, elem->log_parameter);


The shepherd_log_list_flush_list() function doesn't check if the log_message field is nullptr before using it in shepherd_trace(). If elem->log_message is nullptr, this will result in undefined behavior when passed to the format string. Add a nullptr check for elem->log_message before using it.

Suggested change

if (elem->log_parameter == nullptr) {

shepherd_trace("COMMLIB|%s|%s|%s",

cl_thread_convert_state_id(elem->log_thread_state),

cl_log_list_convert_type_id(elem->log_type),

elem->log_message);

} else {

shepherd_trace("COMMLIB|%s|%s|%s %s",

cl_thread_convert_state_id(elem->log_thread_state),

cl_log_list_convert_type_id(elem->log_type),

elem->log_message, elem->log_parameter);

const char *log_message = (elem->log_message != nullptr) ? elem->log_message : "";

if (elem->log_parameter == nullptr) {

shepherd_trace("COMMLIB|%s|%s|%s",

cl_thread_convert_state_id(elem->log_thread_state),

cl_log_list_convert_type_id(elem->log_type),

log_message);

} else {

shepherd_trace("COMMLIB|%s|%s|%s %s",

cl_thread_convert_state_id(elem->log_thread_state),

cl_log_list_convert_type_id(elem->log_type),

log_message, elem->log_parameter);

Copilot · 2026-01-01T16:02:35Z

source/libs/comm/cl_commlib.cc

      CL_LOG(CL_LOG_DEBUG, "message filtering not supported");
   }
+
+   bool done_on_app_condition_triggered = false;


The variable done_on_app_condition_triggered is declared but never used because all code that sets it to true is inside #if 0 blocks. This creates dead code that serves no purpose. Either implement the feature or remove the unused variable declaration.

Suggested change

bool done_on_app_condition_triggered = false;

jgabler-hpc added 3 commits December 31, 2025 15:17

BF: CS-1735 qrsh doing massive output reports a commlib read error an…

473912e

…d truncates output

reverted a performance improvement, it seems to cause issues with drm…

b7bf501

…aa clients

marked disabled code with corresponding Jira-ID

c82b0b6

jgabler-hpc requested a review from Copilot January 1, 2026 15:58

Copilot started reviewing on behalf of jgabler-hpc January 1, 2026 15:58 View session

jgabler-hpc merged commit 7a31b62 into V90_BRANCH Jan 1, 2026
5 checks passed

Copilot AI reviewed Jan 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Jg cs 1735 branch #71

Jg cs 1735 branch #71

Uh oh!

jgabler-hpc commented Jan 1, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Copilot AI Jan 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-*     dstring *err_msg - Gets the error reason in case of error.
-*     cl_log_func_t    - a commlib logging function which will print CL_LOG messages
+*     dstring *err_msg      - Gets the error reason in case of error.
+*     cl_log_func_t log_func - Optional commlib logging callback. If non-null,
+*                              this function is invoked for each CL_LOG message
+*                              produced by the communication library. If nullptr
+*                              is passed, the library uses its internal default
+*                              logging behavior. The callback must match the
+*                              cl_log_func_t typedef (see cl_commlib.h), be
+*                              reentrant/thread-safe, and return quickly to avoid
+*                              blocking communication threads.



		int comm_init_lib(dstring *err_msg);
		int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func = nullptr);

-      // @todo CS-1739 do we need the cl_commlib_trigger, when we are using multi-threaded commlib?
-      // if b_synchron is 0, then it does essentially nothing
-      // otherwise it waits, until a message is available - the same which is done by cl_commlib_receive_message()
-      // itself
+      /* Intentionally call cl_commlib_trigger() here to drive commlib progress.
+       * For asynchronous operation (b_synchron == 0) this is effectively a no-op,
+       * while for synchronous operation it will block until a message is available,
+       * matching the behavior expected by cl_commlib_receive_message() users.
+       * This call is kept for consistency across single-threaded and multi-threaded
+       * commlib implementations.
+       */

		* @todo It only returns the number of messages in the first connection.
		* What if we have multiple connections?

Jg cs 1735 branch #71

Jg cs 1735 branch #71

Uh oh!

Conversation

jgabler-hpc commented Jan 1, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants