Add support for LlamaCPP #116
Integrate LlamaCPP Python wrapper (`llama-cpp-python`)
Clean up comments
|
Hi @urroxyz, hope you're doing well! Just wanted to let you know that I've moved the tokenizer files to my own hosting: Pull Request #117. With this change, you shouldn't have problems with automated testing. Could you try to make another commit? The tests should run and pass without any issues!

About the LlamaCPP implementation: I was thinking it might be cleaner to create a separate `LlamaCPPLogitsProcessor` class, like the one you shared in your first email, in a new file of its own. If you could also add a simple example of how to use it under the examples folder, that would be great. We should definitely update the README too, so people know that LlamaCPP is supported now!

```python
import numpy as np
import torch
from llama_cpp import Llama, LogitsProcessor
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer

# Define EBNF grammar
ebnf_grammar = """
root ::= (expr "=" ws term "\n")+
expr ::= term ([-+*/] term)*
term ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num ::= [0-9]+ ws
ws ::= [ \t\n]*
"""

# Define logits processor for LlamaCPP
class LlamaCPPLogitsProcessor(LogitsProcessor):
    def __init__(self, grammar_str, start_symbol, tokenizer):
        self.grammar_str = grammar_str
        self.start_symbol = start_symbol
        self.tokenizer = tokenizer
        self.grammar = IncrementalGrammarConstraint(grammar_str, start_symbol, tokenizer)
        self.grammar_processor = GrammarConstrainedLogitsProcessor(self.grammar)
        self.finished = False
        self.started = False

    def __call__(self, input_ids, scores):
        # Convert numpy types to Python types
        if np.isscalar(input_ids):
            input_ids = [int(input_ids)]
        elif isinstance(input_ids, np.ndarray):
            input_ids = input_ids.tolist()
        elif isinstance(input_ids, list):
            input_ids = [int(i) if isinstance(i, np.generic) else i for i in input_ids]
        elif isinstance(input_ids, np.generic):
            input_ids = [int(input_ids)]

        # Wrap a single sequence into a batch of token sequences
        if input_ids and isinstance(input_ids[0], int):
            input_ids = [input_ids]

        # Convert scores to a PyTorch tensor
        if isinstance(scores, np.ndarray):
            scores = torch.from_numpy(scores)
        elif not isinstance(scores, torch.Tensor):
            scores = torch.tensor(scores)

        # The scores tensor needs proper dimensionality:
        # if it's 1D (just the vocabulary dimension), add a batch dimension
        if scores.dim() == 1:
            scores = scores.unsqueeze(0)  # [vocab_size] -> [1, vocab_size]

        # If finished, force the EOS token, as the model likely won't generate it on its own
        if self.finished:
            return self._force_eos(scores).squeeze(0).numpy()  # Remove batch dim for output

        # Reset the grammar if the token sequence length doesn't match expectation
        current_length = len(input_ids[0])
        if hasattr(self.grammar, "last_size") and self.grammar.last_size is not None:
            expected_length = self.grammar.last_size + 1
            if current_length != expected_length:
                self.grammar = IncrementalGrammarConstraint(
                    self.grammar_str, self.start_symbol, self.tokenizer
                )
                self.grammar_processor = GrammarConstrainedLogitsProcessor(self.grammar)
                self.started = False

        try:
            processed_scores = self.grammar_processor(input_ids, scores)
            self.started = True
        except ValueError as e:
            if "All stacks are empty" in str(e):
                self.finished = True
                processed_scores = self._force_eos(scores)
            else:
                raise e

        # Remove the batch dimension for output
        if processed_scores.dim() > 1:
            processed_scores = processed_scores.squeeze(0)
        return processed_scores.detach().cpu().numpy()

    def _force_eos(self, scores_tensor):
        """Force the scores such that only the EOS token is allowed."""
        eos_token = self.tokenizer.eos_token_id
        mask = torch.full_like(scores_tensor, fill_value=-float("inf"))
        mask[..., eos_token] = 0
        return mask


# Load the tokenizer with Transformers
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load the model with LlamaCPP
model = Llama(model_path="qwen2.5-0.5b-instruct-q8_0.gguf")

# Create the logits processor with grammar constraints
grammar_processor = LlamaCPPLogitsProcessor(ebnf_grammar, "root", tokenizer)

# Define a more explicit instruction
input_text = "Give me some math."
messages = [{"role": "user", "content": input_text}]

# Generate with constraints
response = model.create_chat_completion(
    stream=True,
    messages=messages,
    logits_processor=[grammar_processor],
    max_tokens=30,
    temperature=0.7,
    top_p=0.95,
    # Stop at the end of the sentence, because generation currently runs
    # indefinitely if the grammar isn't specific enough.
    stop=[".", "\n"],
    # A force-EOS functionality must be implemented
    # in a potential future official release.
)

# Print the streamed response
for chunk in response:
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```
|
I originally created a separate class like that. Here's my compromise: let's create a folder titled `adapters`. In this instance, the example code would look something like this:

```python
import io
import logging
from contextlib import redirect_stderr
from llama_cpp import Llama
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer
from adapters.llama_cpp_python import llama_cpp_python

logging.basicConfig(level=logging.INFO)

# Define the EBNF grammar.
ebnf_grammar = """
root ::= "The animal is a " animal "."
animal ::= "cat" | "fish"
"""

# Load the tokenizer matching your model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python.
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the corresponding logits processor.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint)

# Adapt the processor for llama-cpp-python.
adapter_processor = llama_cpp_python(grammar_processor)

# Define the prompt.
prompt = 'The text says, "The animal is a dog." The answer is obvious. '

# Use the text completion API with the adapted logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[adapter_processor],
    max_tokens=100,
)

for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)
```

In the current instance, which I believe to be cleaner, the example code looks like this:

```python
import io
import torch
import logging
from contextlib import redirect_stderr
from llama_cpp import Llama
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

# Define your EBNF grammar (you can replace this with your own)
ebnf_grammar = """
root ::= "The animal is a " animal "."
animal ::= "cat" | "fish"
"""

# Load the tokenizer matching your model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the logits processor with the new parameter.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint, library="llama-cpp-python")

# Define a prompt.
prompt = """The text says, "The animal is a dog." The answer is obvious. """

# Use the text completion API with the logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[grammar_processor],
    max_tokens=100,
)

for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)
```

What are your thoughts?
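For anyone following along, here is a rough sketch of what the `llama_cpp_python` adapter function could look like. It is only a minimal illustration, not necessarily the implementation in this PR, and it assumes llama-cpp-python hands the processor 1-D numpy arrays of token ids and logits while the transformers-cfg processor expects batched torch tensors:

```python
import numpy as np
import torch


def llama_cpp_python(hf_processor):
    """Wrap a transformers-style logits processor for llama-cpp-python.

    llama-cpp-python calls a logits processor with 1-D numpy arrays
    (token ids, logits); a transformers-style processor expects batched
    torch tensors and returns one, so we convert at the boundary.
    """

    def adapted(input_ids, scores):
        # Add a batch dimension and convert to torch tensors.
        ids = torch.as_tensor(np.asarray(input_ids), dtype=torch.long).unsqueeze(0)
        logits = torch.as_tensor(np.asarray(scores), dtype=torch.float32).unsqueeze(0)
        # Apply the grammar-constrained processor.
        out = hf_processor(ids, logits)
        # Return a 1-D float32 numpy array, as llama-cpp-python expects.
        return out.squeeze(0).detach().cpu().numpy().astype(np.float32)

    return adapted
```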
|
On second thought, maybe the module and its parameter should be named `adapter` instead of `library`:

```python
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint, adapter="llama-cpp-python")
```

I think this works wonderfully. I mean, it is an adapter, after all.
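To make the idea concrete, here is a hypothetical sketch of how an `adapter` argument could be wired internally; this is illustrative only, not the actual transformers-cfg code, and the class name is made up. It simply branches on the adapter name and converts at the boundary, much like the standalone adapter function sketched earlier:

```python
import numpy as np
import torch


class GrammarProcessorWithAdapter:
    """Illustrative stand-in for a processor that accepts an adapter name."""

    def __init__(self, hf_processor, adapter=None):
        self.hf_processor = hf_processor
        self.adapter = adapter

    def __call__(self, input_ids, scores):
        if self.adapter == "llama-cpp-python":
            # Convert 1-D numpy inputs to batched torch tensors and back.
            ids = torch.as_tensor(np.asarray(input_ids), dtype=torch.long).unsqueeze(0)
            logits = torch.as_tensor(np.asarray(scores), dtype=torch.float32).unsqueeze(0)
            out = self.hf_processor(ids, logits)
            return out.squeeze(0).detach().cpu().numpy().astype(np.float32)
        # Default: transformers-style batched tensors pass straight through.
        return self.hf_processor(input_ids, scores)
```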
|
New example run:

```python
import io
import torch
import logging
from contextlib import redirect_stderr
from llama_cpp import Llama
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

# Define your EBNF grammar (you can replace this with your own)
ebnf_grammar = """
root ::= "The animal is a " animal "."
animal ::= "cat" | "fish"
"""

# Load the tokenizer matching your model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the logits processor with the new parameter.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint, adapter="llama-cpp-python")

# Define a prompt.
prompt = """The text says, "The animal is a dog." The answer is obvious. """

# Use the text completion API with the logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[grammar_processor],
    max_tokens=100,
)

for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)
```
|
We should run some benchmarks to see how the speed compares to the native grammar support in `llama-cpp-python`.
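A rough sketch of how that comparison could be set up, reusing `model`, `prompt`, `ebnf_grammar`, and `grammar_processor` from the example above, and assuming the grammar is also valid GBNF; the `tokens_per_second` helper is just for illustration:

```python
import time

from llama_cpp import LlamaGrammar


def tokens_per_second(llm, text, **kwargs):
    """Rough timing helper: completion tokens generated per wall-clock second."""
    start = time.perf_counter()
    out = llm.create_completion(prompt=text, max_tokens=100, **kwargs)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed


# Constrained via transformers-cfg (ideally recreate grammar_constraint and
# grammar_processor first so no incremental parser state carries over).
tps_cfg = tokens_per_second(model, prompt, logits_processor=[grammar_processor])

# Constrained via llama.cpp's built-in GBNF grammar support, for comparison.
native_grammar = LlamaGrammar.from_string(ebnf_grammar)
tps_native = tokens_per_second(model, prompt, grammar=native_grammar)

print(f"transformers-cfg: {tps_cfg:.1f} tok/s, native GBNF: {tps_native:.1f} tok/s")
```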
|
Hey @urroxyz,

The example looks awesome! 😍 Just about the structure: I still think having a separate class like `LlamaCPPLogitsProcessor` would be cleaner. I totally get your point about keeping things simple for users by letting them pass an argument like `adapter="llama-cpp-python"`. However, since each adapter would effectively act as a proxy for a specific library, we would end up with a similar number of adapters or classes either way. So it doesn't fundamentally reduce complexity or improve reusability; it's more about making the structure natural and readable for both users and developers.

Also, considering that users who want to use LlamaCPP would need to import the LlamaCPP model anyway, it seems reasonable to ask them to import a separate class as well. While adapters do offer a layer of flexibility, they could also add a bit of complexity that could be avoided here.

What do you think?
|
I'm a bit confused, as I actually think an adapter would not only be cleaner for the user but also for the developer, and easier to maintain. Are you saying that an entirely new class would have to be written for every library we support? With an adapter, it only needs to be updated if the underlying library changes. Also, the adapter can still be loaded without the `adapter` parameter, by importing it and wrapping the processor manually.

Example run (without the `adapter` parameter):

```python
import io
import logging
from contextlib import redirect_stderr
import torch
from llama_cpp import Llama
from transformers import AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers_cfg.adapters.llama_cpp_python import llama_cpp_python

# Configure logging
logging.basicConfig(level=logging.INFO)

# Define the EBNF grammar
ebnf_grammar = """
root ::= "The animal is a " animal "."
animal ::= "cat" | "fish"
"""

# Load the tokenizer matching your model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the logits processor without specifying an adapter.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint)

# Now, manually adapt the processor for llama-cpp-python
adapter_processor = llama_cpp_python(grammar_processor)

# Define the prompt.
prompt = 'The text says, "The animal is a dog." The answer is obvious. '

# Use the text completion API with the adapted logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[adapter_processor],
    max_tokens=100,
)

# Stream and print the output.
for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)
```

I can't verify if the code above still works right now, but you can let me know if anything goes awry. If this doesn't change your mind, please continue to explain your thought process to me, as I need more context to understand.
Oh, I was trying to say that the class in your first draft feels more intuitive to me compared to the adapter. What do you think?
|
Sorry for misunderstanding, but, hmmm, I've actually come to like the adapter better for its organization: it's cleaner for both the dev and the user, and easier to maintain. Plus, it'll be simple to change in the future if need be. At that point, we could integrate the original version.
|
Ok, let's give it a try :) Is this ready to be merged?
|
It should be! Thank you for the discussion.
Allow user to choose between `transformers` and `llama-cpp-python` with the `library` parameter