-
Notifications
You must be signed in to change notification settings - Fork 6
Jg cs 1735 branch #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request addresses issues related to CS-1735 and includes communication library improvements, debugging enhancements, and various code cleanups. The changes primarily focus on improving message handling, timeout management, and logging capabilities in the communication subsystem.
- Enhanced commlib logging integration with optional logging function parameter in comm_init_lib()
- Modified message flush and receive behavior with updated return value semantics
- Disabled problematic code blocks related to gdi_timeout and app_condition handling with TODO comments for future investigation
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| source/libs/uti/sge_language.cc | Updated copyright year and commented out debug print statement |
| source/libs/comm/cl_data_types.h | Updated copyright year and fixed typo in comment ("threas" → "threads") |
| source/libs/comm/cl_communication.cc | Updated copyright year and disabled gdi_timeout setting code with TODO comment |
| source/libs/comm/cl_commlib.cc | Updated copyright year, added done_on_app_condition_triggered logic (disabled), and added TODO comments |
| source/daemons/shepherd/sge_shepherd_ijs.cc | Added shepherd_log_list_flush_list() function for extensive tracing, enhanced error handling and logging, fixed typo |
| source/daemons/qmaster/sge_thread_listener.cc | Updated copyright year and added reference to CS-982 and CS-1735 in comment |
| source/common/sge_ijs_comm.h | Updated copyright year and modified comm_init_lib() signature to accept optional logging function |
| source/common/sge_ijs_comm.cc | Updated copyright year, implemented configurable commlib logging, modified comm_flush_write_messages() return semantics, fixed synchronous receive behavior |
| source/clients/qsh/sge_client_ijs.cc | Updated copyright year and fixed typo in debug message ("sendig" → "sending") |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int retries = 0; | ||
|
|
||
| elems = cl_com_messages_in_send_queue(handle); | ||
| unsigned long elems = cl_com_messages_in_send_queue(handle); | ||
| while (elems > 0) { | ||
| /* | ||
| * Don't set the cl_commlib_trigger()-call to be blocking and | ||
| * get rid of the usleep() - it's much slower! | ||
| * The last cl_commlib_trigger()-call will take 1 s. | ||
| */ | ||
| ret = cl_commlib_trigger(handle, 0); | ||
| int trigger_ret = cl_commlib_trigger(handle, 0); | ||
| /* | ||
| * Bail out if trigger fails with an error that indicates that we | ||
| * won't be able to send the messages in the near future. | ||
| */ | ||
| if (ret != CL_RETVAL_OK && | ||
| ret != CL_RETVAL_SELECT_TIMEOUT && | ||
| ret != CL_RETVAL_SELECT_INTERRUPT) { | ||
| sge_dstring_sprintf(err_msg, cl_get_error_text(ret)); | ||
| retries = ret; | ||
| break; | ||
| if (trigger_ret != CL_RETVAL_OK && | ||
| trigger_ret != CL_RETVAL_SELECT_TIMEOUT && | ||
| trigger_ret != CL_RETVAL_SELECT_INTERRUPT && | ||
| trigger_ret != CL_RETVAL_THREADS_ENABLED) { | ||
| sge_dstring_sprintf(err_msg, cl_get_error_text(trigger_ret)); | ||
| sge_dstring_sprintf_append(err_msg, " - after %d retries", retries); | ||
| return trigger_ret; | ||
| } | ||
|
|
||
| elems = cl_com_messages_in_send_queue(handle); | ||
| /* | ||
| * We just tried to send the messages and it wasn't possible to send | ||
| * all messages - give the network some time to recover. | ||
| * @todo CS-1739 cl_commlib_trigger() does *not* wait until all messages are sent! | ||
| * @todo Shall we have a maximum number of retries? A timeout? | ||
| * But if the qrsh client is suspended, we probably need to wait until it is unsuspended again. | ||
| */ | ||
| /* TODO (NEW): make this work correctly by calling check_client_alive */ | ||
| if (elems > 0) { | ||
| usleep(10000); | ||
| retries--; | ||
| retries++; |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The increment operation has been changed from 'retries--' to 'retries++', and the return value from 'return retries' to 'return -retries'. While this appears intentional to change the semantics (negative for success indicating retry count, positive for errors), the variable name 'retries' is now misleading since it's being used as a counter that gets incremented. Consider renaming this variable to 'retry_count' to better reflect its purpose.
| * | ||
| * INPUTS | ||
| * dstring *err_msg - Gets the error reason in case of error. | ||
| * cl_log_func_t - a commlib logging function which will print CL_LOG messages |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function documentation has been updated to mention the new cl_log_func_t parameter, but the description only says "a commlib logging function which will print CL_LOG messages" without explaining when/why a caller should provide this parameter, what happens if nullptr is passed, or what the expected signature and behavior of the logging function should be. This incomplete documentation makes it difficult for API consumers to understand how to use this parameter correctly.
| * dstring *err_msg - Gets the error reason in case of error. | |
| * cl_log_func_t - a commlib logging function which will print CL_LOG messages | |
| * dstring *err_msg - Gets the error reason in case of error. | |
| * cl_log_func_t log_func - Optional commlib logging callback. If non-null, | |
| * this function is invoked for each CL_LOG message | |
| * produced by the communication library. If nullptr | |
| * is passed, the library uses its internal default | |
| * logging behavior. The callback must match the | |
| * cl_log_func_t typedef (see cl_commlib.h), be | |
| * reentrant/thread-safe, and return quickly to avoid | |
| * blocking communication threads. |
| // @todo CS-1740 The following code would speed up shutdown of qrsh and also of listener threads, | ||
| // but breaks drmaa code - needs further analysis. | ||
| #if 0 | ||
| if (return_value == CL_RETVAL_OK) { | ||
| // The app_condition is triggered by commlib itself when new messages are received, | ||
| // but also by application code, e.g., to wake up a thread from receiving messages. | ||
| // If the app_condition was triggered, but no new messages were received, | ||
| // we leave the while loop waiting for messages. | ||
| pthread_mutex_lock(handle->messages_ready_mutex); | ||
| if (handle->messages_ready_for_read == 0) { | ||
| CL_LOG(CL_LOG_INFO, "app_condition has been triggered and no new messages, leaving while loop (2)"); | ||
| done_on_app_condition_triggered = true; | ||
| } | ||
| pthread_mutex_unlock(handle->messages_ready_mutex); | ||
| } | ||
| #endif |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two large blocks of similar code have been added but disabled with #if 0 (lines 5064-5076 and 5112-5125). Both blocks check if app_condition was triggered without new messages and would exit the while loop. The TODO comment states this would speed up shutdown but breaks DRMAA code. Having two instances of nearly identical disabled code suggests unclear design. Either resolve the DRMAA compatibility issue and enable the code, or remove it entirely if it won't be used. Leaving it disabled with unresolved issues creates technical debt.
| // @todo CS-1740 The following code would speed up shutdown of qrsh and also of listener threads, | |
| // but breaks drmaa code - needs further analysis. | |
| #if 0 | |
| if (return_value == CL_RETVAL_OK) { | |
| // The app_condition is triggered by commlib itself when new messages are received, | |
| // but also by application code, e.g., to wake up a thread from receiving messages. | |
| // If the app_condition was triggered, but no new messages were received, | |
| // we leave the while loop waiting for messages. | |
| pthread_mutex_lock(handle->messages_ready_mutex); | |
| if (handle->messages_ready_for_read == 0) { | |
| CL_LOG(CL_LOG_INFO, "app_condition has been triggered and no new messages, leaving while loop (2)"); | |
| done_on_app_condition_triggered = true; | |
| } | |
| pthread_mutex_unlock(handle->messages_ready_mutex); | |
| } | |
| #endif |
|
|
||
|
|
||
| int comm_init_lib(dstring *err_msg); | ||
| int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func = nullptr); |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature for comm_init_lib() has been updated to include an optional cl_log_func_t parameter with a default value of nullptr. However, this changes the API without a clear deprecation path. Additionally, mixing default parameters with C-style function declarations may cause compatibility issues, particularly if this function is called from C code or across compilation boundaries.
| int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func = nullptr); | |
| int comm_init_lib(dstring *err_msg, cl_log_func_t commlib_log_func); |
| } | ||
| } | ||
| return retries; | ||
| return -retries; |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value semantics of comm_flush_write_messages() have been changed from returning an error code directly to returning a negative retry count on success and a positive error code on failure. This is inconsistent with typical return value conventions where 0 indicates success and non-zero indicates an error. The caller in line 1247 now needs to check if the return value is positive for errors, which is confusing and error-prone. Consider using a more conventional approach with an out parameter for retry count, or documenting this unusual return convention more explicitly.
| return -retries; | |
| return 0; |
| // @todo CS-1739 do we need the cl_commlib_trigger, when we are using multi-threaded commlib? | ||
| // if b_synchron is 0, then it does essentially nothing | ||
| // otherwise it waits, until a message is available - the same which is done by cl_commlib_receive_message() | ||
| // itself |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A TODO comment references CS-1739, but the related code is disabled with #if 0. The comment questions whether cl_commlib_trigger() is needed when using multi-threaded commlib. If this is a legitimate concern that needs investigation, consider creating a tracking issue or resolving it before merging. Leaving disabled code with unresolved questions reduces code maintainability and makes it unclear whether this code path should be removed or fixed.
| // @todo CS-1739 do we need the cl_commlib_trigger, when we are using multi-threaded commlib? | |
| // if b_synchron is 0, then it does essentially nothing | |
| // otherwise it waits, until a message is available - the same which is done by cl_commlib_receive_message() | |
| // itself | |
| /* Intentionally call cl_commlib_trigger() here to drive commlib progress. | |
| * For asynchronous operation (b_synchron == 0) this is effectively a no-op, | |
| * while for synchronous operation it will block until a message is available, | |
| * matching the behavior expected by cl_commlib_receive_message() users. | |
| * This call is kept for consistency across single-threaded and multi-threaded | |
| * commlib implementations. | |
| */ |
| * @todo It only returns the number of messages in the *first* connection. | ||
| * What if we have multiple connections? |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment states that cl_com_messages_in_send_queue() only returns the number of messages in the first connection, but doesn't explain what should be done about it or whether this limitation affects the correctness of the surrounding code. If this is a known issue that impacts functionality, it should be documented with a ticket reference and severity. If it doesn't impact functionality, explain why it's acceptable to only check the first connection.
|
|
||
| } else if (flush_ret < 0) { | ||
| #ifdef EXTENSIVE_TRACING | ||
| shepherd_trace("pty_to_commlib: comm_flush_write_messages() did %d retries", -flush_ret); | ||
| #endif | ||
| } |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to comm_flush_write_messages() now returns a value that is checked, but the handling logic appears incorrect. The function now returns negative values for the number of retries on success and positive values for error codes on failure. The condition 'if (flush_ret > 0)' correctly checks for errors, but the else-if condition checking 'flush_ret < 0' treats negative values as indicating retries. However, if flush_ret is 0 (zero retries), the code doesn't handle it, which means successful flushes with zero retries won't be logged even under EXTENSIVE_TRACING. Consider adding an explicit check for the success case (flush_ret == 0 or flush_ret <= 0) to ensure all cases are handled correctly.
| } else if (flush_ret < 0) { | |
| #ifdef EXTENSIVE_TRACING | |
| shepherd_trace("pty_to_commlib: comm_flush_write_messages() did %d retries", -flush_ret); | |
| #endif | |
| } | |
| } | |
| #ifdef EXTENSIVE_TRACING | |
| else { | |
| if (flush_ret < 0) { | |
| shepherd_trace("pty_to_commlib: comm_flush_write_messages() did %d retries", -flush_ret); | |
| } else { /* flush_ret == 0: success without retries */ | |
| shepherd_trace("pty_to_commlib: comm_flush_write_messages() completed without retries"); | |
| } | |
| } | |
| #endif |
| if (elem->log_parameter == nullptr) { | ||
| shepherd_trace("COMMLIB|%s|%s|%s", | ||
| cl_thread_convert_state_id(elem->log_thread_state), | ||
| cl_log_list_convert_type_id(elem->log_type), | ||
| elem->log_message); | ||
| } else { | ||
| shepherd_trace("COMMLIB|%s|%s|%s %s", | ||
| cl_thread_convert_state_id(elem->log_thread_state), | ||
| cl_log_list_convert_type_id(elem->log_type), | ||
| elem->log_message, elem->log_parameter); |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shepherd_log_list_flush_list() function doesn't check if the log_message field is nullptr before using it in shepherd_trace(). If elem->log_message is nullptr, this will result in undefined behavior when passed to the format string. Add a nullptr check for elem->log_message before using it.
| if (elem->log_parameter == nullptr) { | |
| shepherd_trace("COMMLIB|%s|%s|%s", | |
| cl_thread_convert_state_id(elem->log_thread_state), | |
| cl_log_list_convert_type_id(elem->log_type), | |
| elem->log_message); | |
| } else { | |
| shepherd_trace("COMMLIB|%s|%s|%s %s", | |
| cl_thread_convert_state_id(elem->log_thread_state), | |
| cl_log_list_convert_type_id(elem->log_type), | |
| elem->log_message, elem->log_parameter); | |
| const char *log_message = (elem->log_message != nullptr) ? elem->log_message : ""; | |
| if (elem->log_parameter == nullptr) { | |
| shepherd_trace("COMMLIB|%s|%s|%s", | |
| cl_thread_convert_state_id(elem->log_thread_state), | |
| cl_log_list_convert_type_id(elem->log_type), | |
| log_message); | |
| } else { | |
| shepherd_trace("COMMLIB|%s|%s|%s %s", | |
| cl_thread_convert_state_id(elem->log_thread_state), | |
| cl_log_list_convert_type_id(elem->log_type), | |
| log_message, elem->log_parameter); |
| CL_LOG(CL_LOG_DEBUG, "message filtering not supported"); | ||
| } | ||
|
|
||
| bool done_on_app_condition_triggered = false; |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable done_on_app_condition_triggered is declared but never used because all code that sets it to true is inside #if 0 blocks. This creates dead code that serves no purpose. Either implement the feature or remove the unused variable declaration.
| bool done_on_app_condition_triggered = false; |
No description provided.