
Add support for LlamaCPP #116

Merged

Saibo-creator merged 9 commits into epfl-dlab:main from urroxyz:patch-1 on Mar 9, 2025

Conversation

@urroxyz (Contributor) commented Mar 7, 2025

  • Integrate the LlamaCPP Python wrapper (experimental)
    Allow users to choose between transformers and llama-cpp-python via the library parameter

urroxyz and others added 2 commits March 7, 2025 16:28
Integrate LlamaCPP Python wrapper (`llama-cpp-python`)
Clean up comments
@Saibo-creator (Collaborator) commented Mar 8, 2025

Hi @urroxyz,

Hope you’re doing well!

Just wanted to let you know that I’ve moved the tokenizer files to my own hosting: Pull Request #117. With this change, you should no longer have problems with the automated testing. Could you make another commit? The tests should then run and pass without any issues!

About the LlamaCPP implementation—I was thinking it might be cleaner to create a separate LlamaCPPLogitsProcessor class, like the one you shared in your first email. Maybe we could add a new file called llamacpp_logits_process.py under the generation module and include your first version there.

Also, if you could add a simple example of how to use it under the examples directory, that would be awesome! The example below in your draft worked great for me.

We should definitely update the README too, so people know that LlamaCPP is supported now!

import numpy as np
import torch
from llama_cpp import Llama, LogitsProcessor
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer

# Define EBNF grammar
ebnf_grammar = """
root  ::= (expr "=" ws term "\n")+
expr  ::= term ([-+*/] term)*
term  ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num   ::= [0-9]+ ws
ws    ::= [ \t\n]*
"""

# Define logits processor for LlamaCPP
class LlamaCPPLogitsProcessor(LogitsProcessor):
    def __init__(self, grammar_str, start_symbol, tokenizer):
        self.grammar_str = grammar_str
        self.start_symbol = start_symbol
        self.tokenizer = tokenizer
        self.grammar = IncrementalGrammarConstraint(grammar_str, start_symbol, tokenizer)
        self.grammar_processor = GrammarConstrainedLogitsProcessor(self.grammar)
        self.finished = False
        self.started = False

    def __call__(self, input_ids, scores):
        # Convert numpy types to Python types
        if np.isscalar(input_ids):
            input_ids = [int(input_ids)]
        elif isinstance(input_ids, np.ndarray):
            input_ids = input_ids.tolist()
        elif isinstance(input_ids, list):
            input_ids = [int(i) if isinstance(i, np.generic) else i for i in input_ids]
        elif isinstance(input_ids, np.generic):
            input_ids = [int(input_ids)]

        # Wrap a single token sequence into a batch
        if input_ids and isinstance(input_ids[0], int):
            input_ids = [input_ids]

        # Convert scores to a PyTorch tensor for batch dimension
        if isinstance(scores, np.ndarray):
            scores = torch.from_numpy(scores)
        elif not isinstance(scores, torch.Tensor):
            scores = torch.tensor(scores)
            
        # The scores tensor needs proper dimensionality:
        # if it is 1D (just the vocabulary dimension), add a batch dimension
        if scores.dim() == 1:
            scores = scores.unsqueeze(0)  # Add batch dimension [vocab_size] -> [1, vocab_size]

        # If finished, force EOS token as model likely won't generate it on its own
        if self.finished:
            return self._force_eos(scores).squeeze(0).numpy()  # Remove batch dim for output

        # Reset grammar if token sequence length doesn't match expectation
        current_length = len(input_ids[0])
        if hasattr(self.grammar, "last_size") and self.grammar.last_size is not None:
            expected_length = self.grammar.last_size + 1
            if current_length != expected_length:
                self.grammar = IncrementalGrammarConstraint(
                    self.grammar_str, self.start_symbol, self.tokenizer
                )
                self.grammar_processor = GrammarConstrainedLogitsProcessor(self.grammar)
                self.started = False

        try:
            processed_scores = self.grammar_processor(input_ids, scores)
            self.started = True
        except ValueError as e:
            if "All stacks are empty" in str(e):
                self.finished = True
                processed_scores = self._force_eos(scores)
            else:
                raise e

        # Remove batch dimension for output
        if processed_scores.dim() > 1:
            processed_scores = processed_scores.squeeze(0)
            
        return processed_scores.detach().cpu().numpy()

    def _force_eos(self, scores_tensor):
        """Force the scores such that only the EOS token is allowed."""
        eos_token = self.tokenizer.eos_token_id
        mask = torch.full_like(scores_tensor, fill_value=-float("inf"))
        mask[..., eos_token] = 0
        return mask


# Load the tokenizer with Transformers
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load the model with LlamaCPP
model = Llama(model_path="qwen2.5-0.5b-instruct-q8_0.gguf")

# Create the logits processor with grammar constraints
grammar_processor = LlamaCPPLogitsProcessor(ebnf_grammar, "root", tokenizer)

# Define a more explicit instruction
input_text = "Give me some math."

messages = [{"role": "user", "content": input_text}]

# Generate with constraints
response = model.create_chat_completion(
    stream=True,
    messages=messages,
    logits_processor=[grammar_processor],
    max_tokens=30,
    temperature=0.7,
    top_p=0.95,
    # Stop at the end of the sentence
    # because it currently runs infinitely
    # if the grammar isn't specific enough
    stop=[".", "\n"]
    # A force-EOS functionality must be implemented
    # in a potential future official release
)

# Print the streamed response
for chunk in response:
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)

@urroxyz (Contributor, Author) commented Mar 8, 2025

I originally created a llamacpp_logits_process.py but wanted something simpler for the user. Once I realized that the LlamaCPP integration is really just converting the logits processor to a new format, I figured it would make the most sense to include it in the main file, but I completely understand your view. I don't want to make the current file too lengthy or difficult to understand, either.

Here's my compromise: let's create a folder titled adapters for adapter modules that take the standard transformers logits processors produced by logits_process.py and convert them to a format the target library can use, for example adapters/llama_cpp_python.py. This way, we can expand to more libraries in the future!
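For concreteness, here is a rough, untested sketch of what adapters/llama_cpp_python.py could contain (the function name and the exact conversions are just my current draft, mirroring the ones in the class from your comment above):

import numpy as np
import torch


def llama_cpp_python(grammar_processor):
    """Wrap a transformers-cfg logits processor for llama-cpp-python.

    llama-cpp-python calls the processor with a 1D numpy array of token ids and
    a 1D numpy array of scores and expects a 1D numpy array of scores back,
    while the transformers-style processor works with a batch dimension.
    """

    def processor(input_ids, scores):
        # Add the batch dimension the transformers-style processor expects
        batched_ids = [np.atleast_1d(np.asarray(input_ids)).astype(int).tolist()]
        batched_scores = torch.from_numpy(np.asarray(scores)).unsqueeze(0)

        # Run the unmodified GrammarConstrainedLogitsProcessor
        out = grammar_processor(batched_ids, batched_scores)

        # Drop the batch dimension and hand numpy scores back to llama-cpp-python
        return out.squeeze(0).detach().cpu().numpy()

    return processor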

In this instance, the example code would look something like this:

import io
import logging
from contextlib import redirect_stderr
from llama_cpp import Llama
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer
from adapters.llama_cpp_python import llama_cpp_python

logging.basicConfig(level=logging.INFO)

# Define the EBNF grammar.
ebnf_grammar = """
    root   ::= "The animal is a " animal "."
    animal ::= "cat" | "fish"
"""

# Load the tokenizer matching your model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python.
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the corresponding logits processor.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint)

# Adapt the processor for llama-cpp-python.
adapter_processor = llama_cpp_python(grammar_processor)

# Define the prompt.
prompt = 'The text says, "The animal is a dog." The answer is obvious. '

# Use the text completion API with the adapted logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[adapter_processor],
    max_tokens=100,
)

for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)

With the current implementation, which I believe to be cleaner, the example code looks like this:

import io
import torch
import logging
from contextlib import redirect_stderr
from llama_cpp import Llama
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

# Define your EBNF grammar (you can replace this with your own)
ebnf_grammar = """

    root   ::= "The animal is a " animal "."

    animal ::= "cat" | "fish"

    """

# Load the tokenizer matching your model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the logits processor with the new parameter.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint, library="llama-cpp-python")

# Define a prompt.
prompt = """The text says, "The animal is a dog." The answer is obvious. """

# Use the text completion API with the logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[grammar_processor],
    max_tokens=100,
)

for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)

What are your thoughts?

@urroxyz (Contributor, Author) commented Mar 8, 2025

On second thought, maybe the module (adapters/llama_cpp_python.py) should be called from logits_process.py itself, according to a library or adapter parameter, so that there is little to no change on the user end while the code stays more organized on our end?

grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint, adapter="llama-cpp-python")

I think this works wonderfully. I mean, it is a GrammarConstrainedLogitsProcessor, not specifically a TransformersGrammarConstrainedLogitsProcessor.
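To make the idea concrete, here is an untested sketch of how the wiring inside logits_process.py could look; only the adapter plumbing is shown, the existing grammar logic stays untouched, and the imported adapter module is the one proposed above, so none of this is final:

from transformers import LogitsProcessor

from transformers_cfg.adapters.llama_cpp_python import llama_cpp_python  # proposed module


class GrammarConstrainedLogitsProcessor(LogitsProcessor):
    def __init__(self, grammar_constraint, adapter=None):
        self.grammar_constraint = grammar_constraint
        self.adapter = adapter
        self._adapted = None
        if adapter == "llama-cpp-python":
            # The adapter only converts inputs and outputs; the grammar logic is reused as-is
            self._adapted = llama_cpp_python(self._process)

    def __call__(self, input_ids, scores):
        if self._adapted is not None:
            return self._adapted(input_ids, scores)
        return self._process(input_ids, scores)

    def _process(self, input_ids, scores):
        # existing transformers-cfg masking logic, unchanged
        ...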

@urroxyz (Contributor, Author) commented Mar 8, 2025

New example run:

import io
import torch
import logging
from contextlib import redirect_stderr
from llama_cpp import Llama
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

# Define your EBNF grammar (you can replace this with your own)
ebnf_grammar = """

    root   ::= "The animal is a " animal "."

    animal ::= "cat" | "fish"

    """

# Load the tokenizer matching your model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the logits processor with the new parameter.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint, adapter="llama-cpp-python")

# Define a prompt.
prompt = """The text says, "The animal is a dog." The answer is obvious. """

# Use the text completion API with the logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[grammar_processor],
    max_tokens=100,
)

for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)

@urroxyz (Contributor, Author) commented Mar 8, 2025

We should run some benchmarks to see how the speed compares to the native approach using llama-cpp-python alone. If transformers-cfg is faster, we can definitely post those results in the README. Otherwise, we'll have to work on that.
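Something like this rough timing script could be a starting point (untested; the GBNF string and model path are placeholders, and the baseline uses llama-cpp-python's built-in LlamaGrammar support):

import time

from llama_cpp import Llama, LlamaGrammar
from transformers import AutoTokenizer

from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers_cfg.adapters.llama_cpp_python import llama_cpp_python  # proposed adapter

ebnf_grammar = """
    root   ::= "The animal is a " animal "."
    animal ::= "cat" | "fish"
"""
# Hand-translated equivalent in llama.cpp's native GBNF syntax
gbnf_grammar = 'root ::= "The animal is a " ("cat" | "fish") "."'

prompt = 'The text says, "The animal is a dog." The answer is obvious. '

model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# 1) transformers-cfg processor adapted for llama-cpp-python
constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
processor = llama_cpp_python(GrammarConstrainedLogitsProcessor(constraint))
start = time.perf_counter()
model.create_completion(prompt=prompt, logits_processor=[processor], max_tokens=100)
print(f"transformers-cfg: {time.perf_counter() - start:.2f}s")

# 2) native llama-cpp-python grammar support as the baseline
grammar = LlamaGrammar.from_string(gbnf_grammar)
start = time.perf_counter()
model.create_completion(prompt=prompt, grammar=grammar, max_tokens=100)
print(f"native LlamaGrammar: {time.perf_counter() - start:.2f}s")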

@Saibo-creator (Collaborator)

Hey @urroxyz,

The example looks awesome! 😍 Just about the structure—I still think having a separate class like LLamaCPPConstrainedLogitsProcessor might be a cleaner and more maintainable approach.

I totally get your point about keeping things simple for users by letting them add an argument like adapter="llama-cpp-python"—that’s definitely user-friendly! However, I feel this might be a perfect use case for OOP principles. Encapsulating library-specific details and behaviors within individual classes could help keep things organized without adding too much complexity.

Since each adapter would effectively act as a proxy for a specific library, we would end up having a similar number of adapters or classes either way. So, it doesn’t fundamentally reduce complexity or improve reusability but is more about making the structure natural and readable for both users and developers.

Also, considering that users who want to use llamacpp would need to import the llamacpp models anyway, it seems reasonable to ask them to import a separate class as well.

While adapters do offer a layer of flexibility, they could also add a bit of complexity that could be avoided here.

What do you think?

@urroxyz (Contributor, Author) commented Mar 9, 2025

I'm a bit confused, as I actually think an adapter would be not only cleaner for the user but also cleaner for the developer and easier to maintain.

Are you saying that an entirely new logits_process.py should be created? If so, that would mean having to update both the original logits processor and the LlamaCPP-specific one, which seems inefficient.

With an adapter, it only needs to be updated if logits_process.py starts returning a different format, which I believe is unlikely and would already require major changes were it to happen.

Also, the adapter can still be loaded without the adapter parameter (which only adds a few lines of code to logits_process.py) and used instead with your approach of a separate import.

Example run (without adapter parameter):

import io
import logging
from contextlib import redirect_stderr

import torch
from llama_cpp import Llama
from transformers import AutoTokenizer

from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor
from transformers_cfg.adapters.llama_cpp_python import llama_cpp_python

# Configure logging
logging.basicConfig(level=logging.INFO)

# Define the EBNF grammar
ebnf_grammar = """
    root   ::= "The animal is a " animal "."
    animal ::= "cat" | "fish"
"""

# Load the tokenizer matching your model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5b")

# Redirect stderr and load the model via llama-cpp-python
f = io.StringIO()
with redirect_stderr(f):
    model = Llama(model_path="qwen2.5-1.5b-q8_0.gguf", n_ctx=8000, verbose=False)

# Create the grammar constraint and the logits processor without specifying an adapter.
grammar_constraint = IncrementalGrammarConstraint(ebnf_grammar, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar_constraint)

# Now, manually adapt the processor for llama-cpp-python
adapter_processor = llama_cpp_python(grammar_processor)

# Define the prompt.
prompt = 'The text says, "The animal is a dog." The answer is obvious. '

# Use the text completion API with the adapted logits processor.
response = model.create_completion(
    stream=True,
    prompt=prompt,
    logits_processor=[adapter_processor],
    max_tokens=100,
)

# Stream and print the output.
for token in response:
    token_text = token["choices"][0]["text"]
    print(token_text, end="", flush=True)

I can't verify if the code above still works right now, but you can let me know if anything goes awry.

If this doesn't change your mind, please continue to explain your thought process to me as I need more context to understand.

@Saibo-creator (Collaborator)

# Define logits processor for LlamaCPP
class LlamaCPPGrammarLogitsProcessor(LogitsProcessor):
    def __init__(self, grammar_str, start_symbol, tokenizer):
        self.grammar_str = grammar_str
        self.start_symbol = start_symbol
        self.tokenizer = tokenizer
        self.grammar = IncrementalGrammarConstraint(grammar_str, start_symbol, tokenizer)
        self.grammar_processor = GrammarConstrainedLogitsProcessor(self.grammar)
        self.finished = False
        self.started = False

Oh, I was trying to say that the class in your first draft feels more intuitive to me compared to the adapter. What do you think?

@urroxyz (Contributor, Author) commented Mar 9, 2025

Sorry for misunderstanding, but, hmmm, I've actually come to like the adapter better for its organization—cleaner for both the dev and the user, and easier to maintain. Plus, it'll be simple to change in the future if need be. At that point, we could integrate the original version.

@Saibo-creator (Collaborator)

Ok, let’s give it a try :)

Is this ready to be merged?

@urroxyz (Contributor, Author) commented Mar 9, 2025

It should be!

Thank you for the discussion.

@Saibo-creator merged commit dde1e8b into epfl-dlab:main on Mar 9, 2025
1 check passed
@Saibo-creator changed the title from "Update logits_process.py" to "Add support for LlamaCPP" on Mar 9, 2025