This database was developed as part of my engineering thesis - Design and Implementation of a Vector Database.
The implementation was created using the Rust programming language and required implementing essential database system functionalities such as data storage, B+tree based indexing, query processing, and transaction management using the write-ahead logging (WAL) mechanism. Vector space searching, a fundamental feature of vector databases, was implemented using the Hierarchical Navigable Small World (HNSW) algorithm, which applies the concept of approximate nearest-neighbor search (ANNS), enabling efficient similarity searches between vectors. The system’s architecture follows the ”Command” and ”Command and Query Responsibility Segregation” (CQRS) design patterns.
Visualization showcasing the use cases of the vRod vector database.
Diagram illustrating the high-level architecture of the vRod database.
Visualization of how the Command and CQRS design patterns are implemented to separate write and read operations for better scalability and maintainability.
Depiction of the file system structure utilized by the system.
Detailed view of the storage -- used structures and data layout.
Detailed view of the WAL -- used structures and data layout.
Detailed view of the modified B+tree -- used structures and data layout.
Example of how data insertion is handled within the implemented modified B+tree structure.
Visualization of selecting the next neighbor using the heuristic method1.
Visualization of the in-memory HNSW graph structure during the construction phase.
Sequential visualization of the HNSW index construction.
Visualization of the operation of searching for similar points in HNSW2.
Identify a vector database for a Proof of Concept (PoC) that is efficient, scalable, user-friendly, and capable of accurate vector similarity search.
Environment Setup:
vrod -i . -n "sift_db"
Database created at: "./sift_db"
vrod -e "CREATE" -a "sift_1m_col" -d sift_db
Executing command: "CREATE sift_1m_col"
Collection created at: "sift_db/sift_1m_col"Data Insertion: Using the ANN_SIFT1M dataset3 (1 million 128-dimensional vectors).
vrod -e "BULK_INSERT" -f sift_dataset/sift1m_input.txt -d sift_db -c sift_1m_col
Parsing vectors and payloads from file...
Vectors and payloads parsed in 3.4844644s.
Executing command: "BULK_INSERT"
Collecting vectors and payloads...
Vectors and payloads collected.
Inserting vectors and payloads...
Vectors and payloads inserted in 3.932022s.Index Creation: Euclidean distance-based vector index.
vrod -e "CREATE_VECTOR_INDEX" -a "EUCLID" -d sift_db -c sift_1m_col
Executing command: "CREATE_VECTOR_INDEX EUCLID"
Creating HNSW Index...
HNSW Index created in 284.26566s.Similarity Search:
vrod -e "SEARCH_SIMILAR" -a "EUCLID 1.0,(...)" -d sift_db -c sift_1m_col
Executing query: "SEARCH_SIMILAR EUCLID"
Loading HNSW Index...
HNSW Index loaded in 0.27991423s.
Searching similar vectors...
id | similarity | payload
--------- | ------------ | -------
932086 | 232.87122 | 932085
934877 | 234.71472 | 934876
561814 | 243.98976 | 561813
708178 | 255.46037 | 708177
706772 | 256.31427 | 706771
695757 | 258.86288 | 695756
435346 | 261.24127 | 435345
701259 | 264.28015 | 701258
455538 | 267.2845 | 455537
872729 | 268.06903 | 872728The vRod database effectively handled 1 million vectors.
Accurate nearest neighbor search with efficient operations.
Enable customization of similarity search results to align with specific preferences.
Environment Setup:
vrod -i . -n "db"
Database created at: "./db"
vrod -e "CREATE" -a "top_uni" -d db
Executing command: "CREATE top_uni"
Collection created at: "db/top_uni"Data Vectorization and Insertion: Using all-MiniLM-L6-v24 for embeddings.
vrod -g 20 -a $'\n' -f to_be_embedded.txt
File content acquired.
Embeddings length: 20
Embedding dimension: 384
Size of the embeddings file: 0.09 MB
vrod -e "BULK_INSERT" -f embeddings.txt -c top_uni -d db
Parsing vectors and payloads from file...
Vectors and payloads parsed in 0.001920939s.
Executing command: "BULK_INSERT"
Collecting vectors and payloads...
Vectors and payloads collected.
Inserting vectors and payloads...
Vectors and payloads inserted in 0.00743906s.
vrod -e "CREATE_VECTOR_INDEX" -a "COSINE" -c top_uni -d db
Executing command: "CREATE_VECTOR_INDEX COSINE"
Creating HNSW Index...
HNSW Index created in 0.003545391s.Initial Similarity Search:
vrod -e "SEARCH_SIMILAR" -c top_uni -d db -a "COSINE -0.03908783,(...)"
Executing query: "SEARCH_SIMILAR COSINE"
Loading HNSW Index...
HNSW Index loaded in 0.000171699s.
Searching similar vectors...
id | similarity | payload
--------- | ------------ | -------
1 | 0.9999999 | What is the best university in poland?
2 | 0.9130575 | University of Warsaw
5 | 0.87840545 | Warsaw University of Technology
14 | 0.86457586 | Poznań University of Technology
9 | 0.8604667 | Gdańsk University of Technology
3 | 0.86006516 | Jagiellonian University in Kraków
13 | 0.8541289 | Łódź University of Technology
11 | 0.83903986 | Adam Mickiewicz University in Poznań
12 | 0.83505535 | Wrocław University of Technology
7 | 0.8201228 | AGH University of Science and Technology in KrakówRecord Modification: Adjust search results to reflect preferences.
vrod -e "UPDATE" -a "2;;Warsaw University of Technology" -c top_uni -d db
Executing command: "UPDATE 2;;Warsaw University of Technology"
Embedding updated successfully.
vrod -e "UPDATE" -a "5;;University of Warsaw" -c top_uni -d db
Executing command: "UPDATE 5;;University of Warsaw"
Embedding updated successfully.Post-Modification Search:
vrod -e "SEARCH_SIMILAR" -c top_uni -d db -a "COSINE -0.03908783,(...)"
Executing query: "SEARCH_SIMILAR COSINE"
Loading HNSW Index...
HNSW Index loaded in 0.000234967s.
Searching similar vectors...
id | similarity | payload
--------- | ------------ | -------
1 | 0.9999999 | What is the best university in poland?
2 | 0.9130575 | Warsaw University of Technology
5 | 0.87840545 | University of Warsaw
14 | 0.86457586 | Poznań University of Technology
9 | 0.8604667 | Gdańsk University of Technology
3 | 0.86006516 | Jagiellonian University in Kraków
13 | 0.8541289 | Łódź University of Technology
11 | 0.83903986 | Adam Mickiewicz University in Poznań
12 | 0.83505535 | Wrocław University of Technology
7 | 0.8201228 | AGH University of Science and Technology in KrakówSuccessful customization of search results based on preferences.
Efficient and flexible data manipulation within the vRod database.
To display the user guide and available commands, run:
vrod
vrod -h
vrod --helpFootnotes
-
Malkov, Y. A., and Yashunin, D. A.,"Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, Apr. 2020, ISSN: 0162-8828, 2160-9292, 1939-3539. doi: 10.1109 TPAMI.2018.2889473. Accessed Nov. 17, 2024. Available at: https://ieeexplore.ieee.org/document/8594636/. ↩
-
https://www.pinecone.io/learn/series/faiss/hnsw (Accessed Dec. 14, 2024). ↩
-
Available at: http://corpus-texmex.irisa.fr (Accessed Dec. 14, 2024) ↩
-
Available at: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (Accessed Dec. 9, 2024). ↩