This is a proposed addition for performance data logging #215
Conversation
dharaneeshvrd
left a comment
It would be great if you could collect these timings along with the results and display them in a tabular format instead of printing them on separate lines.
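As a rough sketch of what I mean (the perf_timings dict and the stage names here are only illustrative, not something in the current patch), the individual stages could feed one accumulator and be logged as a single table:

```python
import logging
import time

logger = logging.getLogger(__name__)

perf_timings = {}  # illustrative accumulator: stage name -> elapsed seconds

start = time.time()
# ... retrieval / embedding lookup would run here ...
perf_timings["retrieval"] = time.time() - start

start = time.time()
# ... LLM inference would run here ...
perf_timings["llm_inference"] = time.time() - start

# One tabular log entry instead of one line per stage
header = f"{'stage':<20}{'seconds':>10}"
rows = "\n".join(f"{name:<20}{elapsed:>10.3f}" for name, elapsed in perf_timings.items())
logger.info("Perf data:\n%s\n%s", header, rows)
```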
max_tokens = data.get("max_tokens", 512)
temperature = data.get("temperature", 0.0)
stop_words = data.get("stop")
stream = data.get("stream")
Don't you want to start a timer here to measure the overall execution of the query?
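Something along these lines is what I have in mind (just a sketch; handle_query and the surrounding handler shape are assumed, only the four data.get(...) lines come from the hunk above):

```python
import logging
import time

logger = logging.getLogger(__name__)

def handle_query(data):
    # Sketch only: start the clock before any work on the request is done
    overall_start = time.time()

    max_tokens = data.get("max_tokens", 512)
    temperature = data.get("temperature", 0.0)
    stop_words = data.get("stop")
    stream = data.get("stream")

    # ... retrieval, prompt construction and the vLLM call would go here ...

    # Log the end-to-end time for the whole query, not just the LLM call
    logger.info(f"Perf data: overall query time = {time.time() - overall_start}")
```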
start_time = time.time()
vllm_stream = query_vllm_stream(prompt, docs, llm_endpoint, llm_model, stop_words, max_tokens, temperature, stream, dynamic_chunk_truncation=TRUNCATION)
request_time = time.time() - start_time
logger.info(f"Perf data: rag answer time = {request_time}")
What is the difference between rag answer time and llm inferencing time in llm_utils? It looks like they would produce the same timings.
iv1111
left a comment
I think the amount of logging is acceptable.
try:
    start_time = time.time()
    vllm_stream = query_vllm_stream(prompt, docs, llm_endpoint, llm_model, stop_words, max_tokens, temperature, stream, dynamic_chunk_truncation=TRUNCATION)
This call to query_vllm_stream in streaming mode returns a generator, not a final answer. The generator (like a C++ iterator) is then read chunk by chunk, with each piece being sent to the Web UI even after this Python backend function returns. So if you want to measure precisely, you should probably test the non-streaming case only.
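If you do want numbers for the streaming path as well, one option (just a sketch; the wrapper name and log messages are made up) is to wrap the generator so the timer only stops once the stream has been fully consumed:

```python
import logging
import time

logger = logging.getLogger(__name__)

def timed_stream(gen, start_time):
    """Yield chunks from gen, logging time to first chunk and total stream time."""
    first_chunk_seen = False
    for chunk in gen:
        if not first_chunk_seen:
            logger.info(f"Perf data: time to first chunk = {time.time() - start_time}")
            first_chunk_seen = True
        yield chunk
    logger.info(f"Perf data: total stream time = {time.time() - start_time}")

# Hypothetical usage: wrap the generator returned by query_vllm_stream, e.g.
# vllm_stream = timed_stream(query_vllm_stream(...), start_time)
```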
@Niharika0306, please chime in to confirm.
@carll99 You could try this out for request ID retrieval:
Signed-off-by: Carl Love <cel@linux.ibm.com>
The above comments are all really good. The concern was to keep this as simple as possible given the pending release. The goal is to implement thread tracking and additional timings in a future release. The blocking issue with the current patch is that there is no sign-off. I will close this PR and submit a new one with the sign-off.
Here is the first pass at what we would like to see enabled for performance tracking.
Posting this as requested at the earlier meeting. We will discuss the PR in our next meeting.