Conversation

@manalilatkar (Member) commented Feb 5, 2026

This PR adds a separate microservice for summarization. The input to be summarized can be given as a file or as plain text, and a summary length must be provided as well.

Signed-off-by: Manali Latkar <manali.latkar@ibm.com>
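
For illustration, a minimal sketch of what calling the service might look like. The route paths and field names below are assumptions for readability, not the actual API; see the code in this PR for the real routes.

import requests

BASE = "http://localhost:8000"  # default port from the service's uvicorn.run call

# Hypothetical plain-text request
resp = requests.post(f"{BASE}/summarize",
                     data={"text": "Long article text ...", "summary_length": 100})

# Hypothetical file upload request
with open("report.txt", "rb") as f:
    resp = requests.post(f"{BASE}/summarize/file",
                         files={"file": f}, data={"summary_length": 100})
print(resp.json())
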
Comment on lines +39 to +40
MAX_WORDS_BEST_PERFORMANCE = 2000
MAX_WORDS_DEGRADED_PERFORMANCE = 21500

Member Author:

These values come from the following calculation in the design doc:

[Image: word-limit calculation from the design doc]

@mkumatag (Member) left a comment:

Let us push the design doc somewhere in this repo itself so that it is handy while reviewing the PR.

default_max_input_length = 6000
default_prompt_template_token_count = 250
default_max_summary_length = 1000
default_max_file_size_mb = 10

Member:

10 MB is a really huge file; have you tested with a 10 MB file?

Member Author:

Handling file size separately for PDF and text files; more details added in the proposal PR.

default_temperature = 0.0
default_max_input_length = 6000
default_prompt_template_token_count = 250
default_max_summary_length = 1000

Member:

Any logic behind this number?

Member Author:

Wanted a big enough number for our maximum input size, which is 21500 words. 21500 words -> a 1000-word summary is a good summary; anything larger than 1000 words won't really be a summary.

default_prompt_template_token_count = 250
default_max_summary_length = 1000
default_max_file_size_mb = 10
default_summarization_temperature = 0.3

Member:

This can be 0.2.

default_max_summary_length = 1000
default_max_file_size_mb = 10
default_summarization_temperature = 0.3
default_summarization_stop_words = "\n\n,Note,Word Count,Revised Summary"

Member:

What is the purpose of these stop words?

Member Author:

@mkumatag When I tested this, the LLM was responding with extra data that started with some of these stop words. Once I started sending them, the response was always clean.
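
For context, a minimal sketch of how stop words like these are passed to vLLM's OpenAI-compatible /v1/completions endpoint; the endpoint URL and model name below are assumptions:

import requests

# Assumed endpoint and model; the stop list mirrors
# default_summarization_stop_words split on commas.
payload = {
    "model": "some-model",
    "prompt": "Summarize the following text in approximately 100 words: ...",
    "max_tokens": 512,
    "temperature": 0.3,
    "stop": ["\n\n", "Note", "Word Count", "Revised Summary"],
}
resp = requests.post("http://localhost:8001/v1/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["text"])

Generation halts as soon as the model emits any of the stop strings, and the stop string itself is excluded from the returned text, which is why the responses came back clean.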

Member:

This could be because of the prompt: "Summarize the following text in approximately {summary_length} words".

Member Author:

We are supposed to take the summary length as input from the user, so we have to include it in the prompt for the output to stick to that limit.


if __name__ == "__main__":
initialize_models()
uvicorn.run(app, host="0.0.0.0", port=8000)

Member:

How do we override this port?
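
One minimal way to make this overridable, assuming a hypothetical SUMMARIZER_PORT environment variable (app and initialize_models are the names from this PR):

import os

import uvicorn

if __name__ == "__main__":
    initialize_models()
    # Hypothetical env vars; the defaults preserve the current behavior
    host = os.getenv("SUMMARIZER_HOST", "0.0.0.0")
    port = int(os.getenv("SUMMARIZER_PORT", "8000"))
    uvicorn.run(app, host=host, port=port)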

Member:

And also, how will the user run this service alone vs. with the RAG service?

Member:

Currently this needs to be run along with the RAG service, as per Sebastian's slide.
To run it alone, vLLM would be required as a prerequisite; once we figure out the custom use case/template problem, we can address it, I believe.

Member:

I'm talking about an individual summarisation service itself; maybe we will explore this after merging.

@@ -0,0 +1,200 @@
import os

Member:

Add a README in this directory.


# Log warning if text exceeds the best performance threshold
if word_count > MAX_WORDS_BEST_PERFORMANCE:
logger.info(f"Input text exceeds maximum word limit of {MAX_WORDS_BEST_PERFORMANCE} words. Performance may be degraded.")

Member:

I'm not sure how helpful just this logging message, without displaying anything else, is going to be for the admin who is running this solution.

Member Author:

@mkumatag should we send a "degraded_performance": True in the response?

Member:

I think we should log other details to identify the request, like the type of content the user has passed and the word count.
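
For illustration, a sketch of such a log line; content_type is a hypothetical variable recording whether the input came from a file or plain text, and the level is raised to warning since it signals a degraded path:

# Hypothetical richer log line; field names are illustrative
if word_count > MAX_WORDS_BEST_PERFORMANCE:
    logger.warning(
        f"Input exceeds best-performance threshold: content_type={content_type}, "
        f"word_count={word_count}, threshold={MAX_WORDS_BEST_PERFORMANCE}. "
        "Performance may be degraded."
    )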

Member:

Q: What will be the admin/user action on this? IMO, we should consider collecting this via some perf counter for monitoring purposes. Otherwise, exposing such information to end users won't be very helpful. If necessary, we should limit the input character count so that we can be efficient and return an error if it is exceeded.
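
A sketch of the perf-counter idea; the use of prometheus_client and the metric name are assumptions:

from prometheus_client import Counter

# Assumed metric, incremented instead of (or alongside) the log line
DEGRADED_INPUTS = Counter(
    "summarizer_degraded_inputs_total",
    "Requests whose input exceeded the best-performance word threshold",
)

if word_count > MAX_WORDS_BEST_PERFORMANCE:
    DEGRADED_INPUTS.inc()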

Contributor:

When we say degraded_performance, do we mean qualitatively? Shouldn't max_model_len = 32k take care of this?

detail="Server busy. Try again shortly."
)
try:
summary = query_vllm_completions(llm_endpoint, prompt, llm_model, settings.llm_max_tokens, settings.summarization_temperature)

Member:

llm_max_tokens is set to 512 tokens, which is too few if you are considering 1000 output words.

Member Author:

@mkumatag you are right; I had changed max_tokens while testing larger inputs but missed adding the change here. I will probably come up with a max_tokens value based on the summary length given by the user.
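
For illustration, one way to derive max_tokens from the requested summary length; the tokens-per-word ratio and headroom factor are assumptions, not values from the design doc:

TOKENS_PER_WORD = 1.3  # assumption: rough average for English text
HEADROOM = 1.2         # assumption: buffer so the model is not cut off mid-sentence

def max_tokens_for(summary_length_words: int) -> int:
    """Sketch: size the completion budget to the requested summary length."""
    return int(summary_length_words * TOKENS_PER_WORD * HEADROOM)

# e.g. a 1000-word summary would get a budget of 1560 tokens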

file_content = file.file.read()

# Validate file size
if len(file_content) > MAX_FILE_SIZE_BYTES:

Member:

I think the file size limit should vary based on file type; .txt would consume less space compared to PDF.
Please do some exercise to come up with valid file sizes.

Member Author:

Have addressed the separate max file sizes in my proposal document.
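
For illustration, a sketch of per-type limits; the specific values are placeholders, not the numbers from the proposal:

import os

from fastapi import HTTPException

# Placeholder limits; plain text packs far more words per byte than PDF
MAX_FILE_SIZE_BYTES_BY_TYPE = {
    ".txt": 1 * 1024 * 1024,
    ".pdf": 10 * 1024 * 1024,
}

suffix = os.path.splitext(file.filename)[1].lower()
limit = MAX_FILE_SIZE_BYTES_BY_TYPE.get(suffix, min(MAX_FILE_SIZE_BYTES_BY_TYPE.values()))
if len(file_content) > limit:
    raise HTTPException(status_code=413,
                        detail=f"File exceeds the {limit} byte limit for {suffix} files")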


Comment on lines +161 to +178
except requests.exceptions.RequestException as e:
concurrency_limiter.release()
error_details = str(e)
if e.response is not None:
error_details += f", Response Text: {e.response.text}"
logger.error(f"Error calling vLLM API: {error_details}")

raise HTTPException(
status_code=500,
detail=error_details
)
except Exception as e:
concurrency_limiter.release()
logger.error(f"Error calling vLLM API: {e}")
raise HTTPException(
status_code=500,
detail=str(e)
)

Member:

Can we move this block to llm_utils itself, like how it's handled in the other query methods?
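
For illustration, a sketch of that refactor; the signature mirrors the call site above, but the body shape is an assumption:

import logging

import requests
from fastapi import HTTPException

logger = logging.getLogger(__name__)

def query_vllm_completions(endpoint, prompt, model, max_tokens, temperature):
    """Sketch: centralize vLLM error handling here instead of at each call site."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    try:
        # Assumes endpoint is the full completions URL
        resp = requests.post(endpoint, json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
    except requests.exceptions.RequestException as e:
        details = str(e)
        if e.response is not None:
            details += f", Response Text: {e.response.text}"
        logger.error(f"Error calling vLLM API: {details}")
        raise HTTPException(status_code=500, detail=details)

The call site could then wrap the call in try/finally so that concurrency_limiter.release() runs on every path.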
