
Conversation

@carll99
Copy link
Member

@carll99 carll99 commented Dec 9, 2025

Here is the first pass at what we would like to see enabled for performance tracking.

Posting this as requested in the earlier meeting. We will discuss the PR in our next meeting.

@mkumatag mkumatag added this to the .Next milestone Dec 10, 2025
@iv1111 iv1111 marked this pull request as draft December 10, 2025 09:09
Member

@dharaneeshvrd dharaneeshvrd left a comment

It would be great if you could collect these timings along with the results and display them in a tabular format instead of printing them on separate lines.
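
(As an illustration of that suggestion, a minimal sketch of collecting the stage timings into one dict and logging them as a single table; the stage names and the log_perf_table helper are hypothetical, not part of this PR.)

import time
import logging

logger = logging.getLogger(__name__)

def log_perf_table(perf):
    # Emit all collected stage timings as one aligned table instead of separate log lines.
    width = max(len(name) for name in perf)
    rows = [f"{name.ljust(width)}  {seconds:8.3f} s" for name, seconds in perf.items()]
    logger.info("Perf data:\n%s", "\n".join(rows))

perf = {}
start = time.time()
time.sleep(0.01)                      # stand-in for document retrieval
perf["retrieval"] = time.time() - start
start = time.time()
time.sleep(0.02)                      # stand-in for LLM inference
perf["llm inference"] = time.time() - start
log_perf_table(perf)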

max_tokens = data.get("max_tokens", 512)
temperature = data.get("temperature", 0.0)
stop_words = data.get("stop")
stream = data.get("stream")
Member

Don't you want to start a timer here to measure the overall execution of the query?

start_time = time.time()
vllm_stream = query_vllm_stream(prompt, docs, llm_endpoint, llm_model, stop_words, max_tokens, temperature, stream, dynamic_chunk_truncation=TRUNCATION)
request_time = time.time() - start_time
logger.info(f"Perf data: rag answer time = {request_time}")
Member

What is the difference between the rag answer time and the llm inferencing time in llm_utils? It looks like they would produce the same timings.
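
(The two numbers would only diverge if the outer timer also covered retrieval and prompt assembly rather than just the model call. Purely as an illustration, a minimal sketch with stub functions standing in for the real calls; none of the names below come from the project.)

import time

def retrieve_documents(question):
    time.sleep(0.05)                  # simulated retrieval latency
    return ["doc1", "doc2"]

def build_prompt(question, docs):
    return question + "\n" + "\n".join(docs)

def query_llm(prompt):
    time.sleep(0.10)                  # simulated model latency
    return "answer"

def answer_query(question):
    request_start = time.time()
    docs = retrieve_documents(question)
    prompt = build_prompt(question, docs)

    llm_start = time.time()
    answer = query_llm(prompt)
    llm_inference_time = time.time() - llm_start    # the model call only

    rag_answer_time = time.time() - request_start   # retrieval + prompt build + model call
    return answer, rag_answer_time, llm_inference_time

_, rag_t, llm_t = answer_query("example question")
print(f"rag answer time = {rag_t:.3f}s, llm inference time = {llm_t:.3f}s")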

Collaborator

@iv1111 iv1111 left a comment

I think the amount of logging is acceptable.


try:
    start_time = time.time()
    vllm_stream = query_vllm_stream(prompt, docs, llm_endpoint, llm_model, stop_words, max_tokens, temperature, stream, dynamic_chunk_truncation=TRUNCATION)
Collaborator

@iv1111 iv1111 Dec 10, 2025

This call to query_vllm_stream in streaming mode returns a generator, not a final answer. The generator (like a C++ iterator) is then read chunk by chunk, with each piece being sent to the Web UI even after this Python backend function returns. So if you want to measure very precisely, you should probably test the non-streaming case only.
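
(If timing the streaming path is still wanted, one possible approach is a small generator wrapper that stops the clock only once the consumer finishes reading the stream; the wrapper name and log format below are made up for illustration.)

import time
import logging

logger = logging.getLogger(__name__)

def timed_stream(chunks, label="rag answer"):
    # Yield chunks unchanged while recording time to first chunk and total stream time.
    start = time.time()
    first = None
    for chunk in chunks:
        if first is None:
            first = time.time()
            logger.info("Perf data: %s time to first chunk = %.3f", label, first - start)
        yield chunk
    logger.info("Perf data: %s total stream time = %.3f", label, time.time() - start)

# Hypothetical usage with the call from this PR:
# vllm_stream = timed_stream(query_vllm_stream(prompt, docs, llm_endpoint, llm_model,
#                                              stop_words, max_tokens, temperature, stream,
#                                              dynamic_chunk_truncation=TRUNCATION))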

Collaborator

@Niharika0306, please chime in to confirm.

Contributor

@Niharika0306 Niharika0306 Dec 11, 2025

@iv1111, true.
I have addressed it in the updated PR.

@iv1111
Collaborator

iv1111 commented Dec 10, 2025

@carll99 You could try this out for request id retrieval:

from flask import request
.....
req_id = id(request)
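
(For illustration only, a minimal Flask sketch of attaching such a request id to the perf log lines so timings from concurrent requests can be told apart; the route name and log format are made up, and since flask.request is a proxy object, id(request._get_current_object()) may be needed to get a value that actually differs per request.)

import time
import logging
from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger(__name__)

@app.route("/query", methods=["POST"])
def query():
    req_id = id(request._get_current_object())   # id of the underlying request object
    start_time = time.time()
    # ... run retrieval and the LLM call here ...
    logger.info("Perf data: req_id=%s rag answer time = %.3f", req_id, time.time() - start_time)
    return {"status": "ok"}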

Signed-off-by: Carl Love <cel@linux.ibm.com>
@carll99
Member Author

carll99 commented Dec 10, 2025

The above comments are all really good. The concern was to keep this as simple as possible given the pending release. The goal is to implement thread tracking and additional timings in a future release.

The blocking issue with the current patch is that there is no sign-off. I will close this PR and submit a new one with the sign-off.
