I wrote a test script based on the KVzap example in the README, but it fails at runtime and I have not been able to pinpoint the cause.
My modified test code is as follows:
import os

from transformers import pipeline

from kvpress import KVzapPress, DMSPress

os.environ["TRANSFORMERS_OFFLINE"] = "1"

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto", local_files_only=True)

# KVzapPress wrapped in DMSPress, as in the README example
kvzap_press = KVzapPress(model_type="mlp")
press = DMSPress(kvzap_press, threshold=-4)
print(f"load successfullly")

# First call: compression during prefilling only
press.decoding = False
context = """
This is an example article about machine learning. Machine learning is a subset of artificial intelligence
that focuses on building systems that learn from data. Recent advances in deep learning have revolutionized
many fields including computer vision, natural language processing, and speech recognition.
Transformer models like BERT and GPT have shown remarkable performance on various NLP tasks.
The field continues to evolve with new architectures and training techniques being developed regularly.
In this paper, we introduce a novel approach to attention mechanisms that improves efficiency
while maintaining performance. Our method reduces computational complexity from O(n^2) to O(n log n)
for sequence length n. Experiments on benchmark datasets show competitive results with
state-of-the-art models while using significantly less memory and computation time.
"""
question = "\n What is this article about in 2 sentences ?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Answer:{answer}")

# Second call: compression during decoding as well (this is the call that crashes)
press.decoding = True
prompt = "What is the best hardware to run LLMs and why ?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Answer:{answer}")
The runtime output and error message are as follows:
Loading checkpoint shards: 100%|████████████████████████████████████| 5/5 [00:03<00:00, 1.55it/s]
Device set to use cuda:0
load successfullly
Answer:¾-var!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Traceback (most recent call last):
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/./test_kvpress.py", line 40, in <module>
    answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1467, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1474, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1374, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/kvpress/pipeline.py", line 217, in _forward
    self.model.model(
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/utils/generic.py", line 1072, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 410, in forward
    hidden_states = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 260, in forward
    hidden_states, _ = self.self_attn(
                       ^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1840, in inner
    hook_result = hook(self, args, kwargs, result)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myserver/workplace1/zsy/zsy_2/kvpress/kvpress/presses/dms_press.py", line 93, in forward_hook
    self.scores_buffer[layer_idx] = torch.cat([self.scores_buffer[layer_idx], scores], dim=-1)
I have not modified any other files. Where might the problem be?
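In case it helps with triage, here is a minimal isolation sketch I could run next, reusing pipe / context / question / prompt / kvzap_press / press from the script above. It assumes the kv-press-text-generation pipeline can be called without a press and with the bare KVzapPress instead of the DMSPress wrapper; it is only meant to show whether the garbage first answer and the crash come from the wrapper or from KVzapPress itself, not to fix anything:

# Hypothetical isolation runs, reusing the objects defined in the script above.

# 1) No press at all: if this answer is sane, the model and pipeline setup are fine.
baseline = pipe(context, question=question)["answer"]
print(f"no press: {baseline}")

# 2) Bare KVzapPress, without the DMSPress wrapper: checks whether prefilling
#    compression alone already produces a garbage answer.
kvzap_only = pipe(context, question=question, press=kvzap_press)["answer"]
print(f"KVzapPress only: {kvzap_only}")

# 3) DMSPress wrapper with decoding left off: if this generation call also fails,
#    the problem is not specific to decoding-time compression and the failing
#    torch.cat on scores_buffer.
press.decoding = False
wrapped = pipe(prompt, press=press, max_new_tokens=200)["answer"]
print(f"DMSPress, decoding off: {wrapped}")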