Download ollama from: https://ollama.com/
Run the downloaded app or type:
ollama serve
Download your model, but keep in mind that the app was tested with llama3.2:1b:
ollama pull llama3.2:1b
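If you want to confirm that Ollama is serving the model before starting the app, a quick check against Ollama's local HTTP API (default port 11434) might look like the sketch below; the prompt text is just an example.

```python
import requests

# One-off prompt to the local Ollama server to confirm llama3.2:1b responds.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:1b", "prompt": "Say hello.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```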
Install dependencies:
pip install -r requirements.txt
Put your PDF documents inside the documents folder.
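The app handles ingestion itself; the following is only a rough illustration of what reading and chunking the PDFs in the documents folder involves, assuming pypdf and a fixed-size chunking strategy (the app may use a different library and approach).

```python
from pathlib import Path
from pypdf import PdfReader

def load_and_chunk(folder: str = "documents", chunk_size: int = 500) -> list[str]:
    """Read every PDF in the folder and split its text into fixed-size chunks."""
    chunks = []
    for pdf_path in Path(folder).glob("*.pdf"):
        # extract_text() can return None for image-only pages, hence the `or ""`.
        text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
        chunks += [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
```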
Activate the virtual environment using the command:
./mini-rag/Scripts/activate
Run the app:
python main.py
Go to 127.0.0.1:8000 and, if needed, wait for the system to load. Refresh from time to time or watch the console to see when loading has completed.
Now you should be able to ask your questions freely.
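If you prefer to query the app from a script rather than the browser, a request against the local server could look like this; the /ask route and the question field are assumptions, so adjust them to whatever endpoint main.py actually exposes.

```python
import requests

# Hypothetical endpoint and payload; check main.py for the real route and field names.
resp = requests.post(
    "http://127.0.0.1:8000/ask",
    json={"question": "What does the first document say about pricing?"},
    timeout=300,
)
print(resp.json())
```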
What restrictions does this app have?
- It uses the CPU for calculations, which makes processing slow,
- Embeddings and retrieval are sensitive to the exact wording of queries and sentences,
- Answers are non-deterministic,
- The number of embeddings is limited by available RAM (see the estimate after this list).
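As a rough back-of-the-envelope for the RAM limit mentioned above, assuming float32 vectors (the actual embedding dimension depends on the model used):

```python
# Rough memory estimate for an in-memory embedding store (float32 = 4 bytes per value).
num_chunks = 100_000
embedding_dim = 768  # assumed dimension; depends on the embedding model
bytes_needed = num_chunks * embedding_dim * 4
print(f"{bytes_needed / 1024**2:.0f} MiB")  # ~293 MiB for 100k chunks at dim 768
```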
What would you correct in the app if you had more time?
- Improve the embedding calculations,
- Move heavy tasks to the GPU,
- Add a vector database (see the sketch after this list),
- Make the app more responsive.
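For the vector-database item above, one possible direction (sketched with FAISS as an assumption; Chroma or Qdrant would work similarly) is to move similarity search into an index instead of comparing the query against every embedding in memory:

```python
import numpy as np
import faiss  # assumption: faiss-cpu installed via pip

dim = 768                       # must match the embedding model's output size
index = faiss.IndexFlatL2(dim)  # exact L2 search; swap for an IVF/HNSW index at scale

chunk_embeddings = np.random.rand(1000, dim).astype("float32")  # placeholder vectors
index.add(chunk_embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # ids of the 5 closest chunks
```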
How would you prepare this system for production in terms of scaling and monitoring?
- Use better hardware,
- Find and use better approaches for chunking and retrieval,
- Find and use better embedding models,
- Find and use better LLM models,
- Find and switch to better prompts,
- Allow the LLM to use external knowledge or scrape data if needed,
- Use a vector database,
- Add chat history,
- Return more detailed sources,
- Add more tests.
To run the tests:
pytest -v test_rag.py
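The existing tests live in test_rag.py; if you add more, a new case might look like this minimal sketch. The retrieve helper and its signature are assumptions about the app's internals, so adapt the import and call to what main.py actually provides.

```python
# test_rag_extra.py -- hypothetical example of an additional test case
from main import retrieve  # assumption: main.py exposes a retrieve(query, k) helper

def test_retrieve_returns_at_most_k_chunks():
    results = retrieve("example question", k=3)
    assert len(results) <= 3
```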