
GPU Not Utilized When Using llm-rs with CUDA Version #27

@andri-jpg

Description


I have installed the CUDA build of the llm-rs library. However, even though I have set use_gpu=True in the SessionConfig, the GPU is not utilized when running the code; instead, CPU usage remains at 100% during execution.

Additional Information:
I am using the "RedPajama Chat 3B" model from Rustformers. The model can be found at the following link: RedPajama Chat 3B Model.

Terminal output:

PS C:\Users\andri\Downloads\chatwaifu> python main.py
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA P106-100, compute capability 6.1

Code:

import json
from llm_rs.langchain import RustformersLLM
from llm_rs import SessionConfig, GenerationConfig, ContainerType, QuantizationType, Precision
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from pathlib import Path

class ChainingModel:
    def __init__(self, model, name, assistant_name):
        with open('config.json') as config_file:
            self.user_config = json.load(config_file)
        with open('template.json') as template_file:
            self.user_template = json.load(template_file)
        model = f"{model}.bin"
        self.model = model

        self.name = name
        self.assistant_name = assistant_name
        self.names = f"<{name}>"
        self.assistant_names = f"<{assistant_name}>"
        
        self.stop_word = ['\n<human>:', '<human>', '<bot>', '\n<bot>:']
        self.stop_words = self.change_stop_words(self.stop_word, self.name, self.assistant_name)
        session_config = SessionConfig(
            threads=self.user_config['threads'],
            context_length=self.user_config['context_length'],
            prefer_mmap=False,
            use_gpu=True
        )

        generation_config = GenerationConfig(
            top_p=self.user_config['top_p'],
            top_k=self.user_config['top_k'],
            temperature=self.user_config['temperature'],
            max_new_tokens=self.user_config['max_new_tokens'],
            repetition_penalty=self.user_config['repetition_penalty'],
            stop_words=self.stop_words
        )

        template = self.user_template['template']

        self.template = self.change_names(template, self.assistant_name, self.name)
        self.prompt = PromptTemplate(
            input_variables=["chat_history", "instruction"],
            template=self.template
        )
        self.memory = ConversationBufferMemory(memory_key="chat_history")

        self.llm = RustformersLLM(
            model_path_or_repo_id=self.model,
            session_config=session_config,
            generation_config=generation_config,
            callbacks=[StreamingStdOutCallbackHandler()]
        )

        self.chain = LLMChain(llm=self.llm, prompt=self.prompt, memory=self.memory)
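
To rule LangChain out, a reduced repro with llm-rs alone is worth trying (a sketch: the AutoModel usage follows the llm-rs README, the model path, thread count, and context length are placeholders, and make_prompt is a helper I made up for this snippet):

```python
def make_prompt(name, assistant_name, instruction):
    # Mirrors the <name>/<assistant_name> tag scheme used in the template.
    return f"<{name}>: {instruction}\n<{assistant_name}>:"

def main():
    # Imported lazily so the helper above is usable without llm-rs installed.
    from llm_rs import AutoModel, SessionConfig

    session_config = SessionConfig(
        threads=8,            # placeholder values
        context_length=2048,
        prefer_mmap=False,
        use_gpu=True,
    )
    model = AutoModel.from_pretrained(
        "redpajama-chat-3b-q4_0.bin",  # placeholder path to the GGML file
        session_config=session_config,
    )
    print(model.generate(make_prompt("human", "bot", "Hello!")))

if __name__ == "__main__":
    main()
```

If this still pegs the CPU, the problem is in llm-rs (or in which wheel is actually installed) rather than in the LangChain wrapper.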

Metadata

Labels: documentation (Improvements or additions to documentation), question (Further information is requested)
