Search Engine
Author: Richard Dzreke
This project was built incrementally in separate parts, each of which had to pass stringent tests and code reviews along the way. It was a rewarding experience that taught me best practices in code design and structure, with a focus on efficiency.
Key Components of Search Engine:
- Process text files into a consistent form
- Build a data structure that supports search
- Implement search algorithms through a data structure
- Switch from a single-threaded approach to a multi-threaded building and searching
- Process webpages into a consistent form
- Create a web server and a front end
Search Engine Part 1: Processes text files and builds a custom data structure
Features
Cleans and stems text using OpenNLP’s Snowball stemmer
Builds two data structures (see the sketch after this list):
  - Word Count Map – counts word stems per file
  - Inverted Index – maps each stemmed word to file(s) and position(s)
Accepts command-line arguments for input/output file paths
Outputs results in pretty-printed JSON format
Handles directories recursively and validates user input
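The nested layout of the inverted index is easiest to see in code. Below is a minimal sketch, assuming nested TreeMaps keyed by stem and file path with a sorted set of positions; the method names are illustrative rather than the project's exact API.

import java.util.TreeMap;
import java.util.TreeSet;

// Sketch of the inverted index layout: stem -> file path -> sorted 1-based positions.
public class InvertedIndex {
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    // Records that a stemmed word appears in a file at the given position.
    public void add(String stem, String location, int position) {
        index.computeIfAbsent(stem, k -> new TreeMap<>())
             .computeIfAbsent(location, k -> new TreeSet<>())
             .add(position);
    }

    // Returns true if the stem was indexed for the given file.
    public boolean contains(String stem, String location) {
        TreeMap<String, TreeSet<Integer>> files = index.get(stem);
        return files != null && files.containsKey(location);
    }
}

Using TreeMap and TreeSet keeps both the words and the positions sorted, which makes the pretty-printed JSON output deterministic.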
Example Usage
java Driver -text input/text/simple -counts counts.json -index index.json
Sample Output (index.json):
{ "hello": { "input/text/simple/hello.txt": [1] } }
Core Components
ArgumentParser – parses and validates flags
TextFileFinder – locates text files recursively
TextFileStemmer – cleans and stems words (see the sketch after this list)
InvertedIndex – stores mappings of word → file → positions
JsonWriter – exports data in JSON format
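For reference, here is a minimal sketch of how cleaning and stemming with OpenNLP's Snowball stemmer can look; the regular expressions and the helper method name are illustrative assumptions, not the project's exact code.

import opennlp.tools.stemmer.snowball.SnowballStemmer;
import opennlp.tools.stemmer.snowball.SnowballStemmer.ALGORITHM;

public class TextFileStemmer {
    // Lowercases a line, strips everything that is not a letter or whitespace,
    // splits on whitespace, and stems each remaining word.
    public static String[] cleanAndStem(String line) {
        SnowballStemmer stemmer = new SnowballStemmer(ALGORITHM.ENGLISH);
        String cleaned = line.toLowerCase().replaceAll("[^\\p{L}\\s]+", "").strip();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        String[] words = cleaned.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            words[i] = stemmer.stem(words[i]).toString();
        }
        return words;
    }
}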
Lessons Learned
Modular software design and testing
Recursive algorithms and file I/O
Data structure implementation in Java
Real-world text processing challenges
How to Run
javac -cp ".:libs/*" Driver.java
java -cp ".:libs/*" Driver -text input/text/simple -counts counts.json -index index.json
Search Engine Part 2: Processes multi-word queries from a query file, conducts an exact or partial search through the inverted index for those queries, and outputs the search results ranked by term frequency as pretty-printed JSON to a file.
Features
Processes query files to perform exact or partial search using an existing inverted index
Cleans and stems query text using OpenNLP’s Snowball stemmer
Ranks results based on relevance score, match count, and file name (see the comparator sketch after this list)
Supports command-line flags for input/output paths and search mode selection
Outputs results in pretty-printed JSON format mapping each query to its ranked results
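The ordering described above maps naturally onto a comparator: higher score first, then higher match count, then file name. Below is a minimal sketch, assuming the score is the match count divided by the total words in the file; the field and constructor shapes are illustrative.

// Sketch of a ranked search result: best score first, ties broken by count, then path.
public class SearchResult implements Comparable<SearchResult> {
    private final String where;   // file (or later, URL) that matched the query
    private final int count;      // total matches found in that location
    private final double score;   // count divided by the total words in the location

    public SearchResult(String where, int count, double score) {
        this.where = where;
        this.count = count;
        this.score = score;
    }

    @Override
    public int compareTo(SearchResult other) {
        int cmp = Double.compare(other.score, this.score);     // higher score first
        if (cmp == 0) {
            cmp = Integer.compare(other.count, this.count);     // then more matches first
        }
        if (cmp == 0) {
            cmp = this.where.compareToIgnoreCase(other.where);  // then alphabetically by name
        }
        return cmp;
    }
}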
Example Usage
java Driver -index index.json -query queries.txt -results results.json
Sample Output (results.json):
{ "hello world": [ { "where": "input/text/simple/hello.txt", "count": 2, "score": 0.345 } ] }
Core Components
QueryParser – reads and cleans query lines from input file
SearchResult – stores and compares result data (count, score, path)
InvertedIndex – performs exact and partial searches (see the partial-search sketch after this list)
JsonWriter – formats and exports query results to JSON
Driver – handles flag parsing and overall program execution
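As an illustration of how a partial search can exploit the sorted index, the sketch below starts at the first key greater than or equal to each query stem and walks forward until keys no longer share that prefix; this is an assumption about the approach, not the project's exact code.

import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PartialSearchSketch {
    // Collects every location whose indexed stems start with one of the query stems.
    public static Set<String> partialSearch(
            TreeMap<String, TreeMap<String, TreeSet<Integer>>> index,
            Set<String> queryStems) {
        Set<String> locations = new TreeSet<>();
        for (String stem : queryStems) {
            // tailMap begins at the first key >= stem; stop once keys stop matching.
            for (var entry : index.tailMap(stem).entrySet()) {
                if (!entry.getKey().startsWith(stem)) {
                    break;
                }
                locations.addAll(entry.getValue().keySet());
            }
        }
        return locations;
    }
}

An exact search is the same idea without the prefix walk: look up each stem directly and collect its locations.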
Lessons Learned
Efficient searching and result ranking
JSON serialization and file output handling
Text normalization and stemming
Extending modular software architecture across projects
How to Run
javac -cp ".:libs/*" Driver.java
java -cp ".:libs/*" Driver -index index.json -query queries.txt -results results.json
Search Engine Part 3: Supports multithreading using a custom conditional read/write lock to create a thread-safe inverted index and a bespoke work queue to manage worker threads. The build process is multithreaded, with each worker thread processing a single file. The search process is likewise multithreaded, with each worker thread processing a single multi-word query line.
Features
Implements multithreading to improve index building and query performance.
Adds a custom work queue for managing worker threads and a reentrant read/write lock for thread-safe access to shared data (see the lock sketch after this list).
Supports both single-threaded and multithreaded execution modes via the -threads [num] flag.
Multithreaded build assigns one file per worker thread; multithreaded search assigns one query line per thread.
Ensures safe concurrent reads and exclusive writes with a custom conditional locking mechanism.
Maintains all functionality from previous projects, including inverted index construction, search processing, and JSON output.
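A minimal sketch of the conditional locking idea: any number of readers may proceed while no writer holds the lock, and a writer waits until it has exclusive access. This simplified version is not reentrant (the project's lock is), and the class name is illustrative.

// Simplified conditional read/write lock: many concurrent readers OR one writer.
public class SimpleReadWriteLock {
    private int readers = 0;
    private int writers = 0;

    public synchronized void lockRead() throws InterruptedException {
        while (writers > 0) {
            this.wait();          // block while a writer holds the lock
        }
        readers++;
    }

    public synchronized void unlockRead() {
        readers--;
        this.notifyAll();         // a waiting writer may now proceed
    }

    public synchronized void lockWrite() throws InterruptedException {
        while (readers > 0 || writers > 0) {
            this.wait();          // block until no readers and no other writer remain
        }
        writers++;
    }

    public synchronized void unlockWrite() {
        writers--;
        this.notifyAll();         // wake any waiting readers and writers
    }
}

The thread-safe inverted index can then wrap every read operation in lockRead/unlockRead and every mutation in lockWrite/unlockWrite.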
Example Usage
Multithreaded mode:
java Driver -text input/text/simple -query input/query/simple.txt -results results.json -threads 5
Single-threaded mode:
java Driver -text input/text/simple -query input/query/simple.txt -results results.json
Core Components
WorkQueue – manages worker threads and tracks pending tasks (see the sketch after this list)
ReadWriteLock – custom reentrant conditional lock for concurrent access control
ThreadSafeInvertedIndex – extends inverted index with thread-safe operations
MultithreadedBuilder – builds the index concurrently from multiple files
MultithreadedQueryHandler – performs parallel searches for query lines
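For context, here is a compact sketch of a work queue in the spirit of the one described above: worker threads pull Runnable tasks from a shared queue, and finish() blocks until every submitted task has completed. Names and details are illustrative, not the project's exact implementation.

import java.util.LinkedList;

// Sketch of a work queue with a fixed pool of worker threads and pending-work tracking.
public class WorkQueueSketch {
    private final LinkedList<Runnable> tasks = new LinkedList<>();
    private final Thread[] workers;
    private boolean shutdown = false;
    private int pending = 0;   // tasks submitted but not yet finished

    public WorkQueueSketch(int threads) {
        workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(this::run);
            workers[i].start();
        }
    }

    // Queues a task and wakes an idle worker.
    public void execute(Runnable task) {
        synchronized (tasks) {
            pending++;
            tasks.addLast(task);
            tasks.notifyAll();
        }
    }

    // Blocks until every submitted task has run to completion.
    public void finish() throws InterruptedException {
        synchronized (tasks) {
            while (pending > 0) {
                tasks.wait();
            }
        }
    }

    // Asks the workers to exit once the queue drains.
    public void shutdown() {
        synchronized (tasks) {
            shutdown = true;
            tasks.notifyAll();
        }
    }

    private void run() {
        while (true) {
            Runnable task;
            synchronized (tasks) {
                while (tasks.isEmpty() && !shutdown) {
                    try {
                        tasks.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (tasks.isEmpty()) {
                    return;        // shutdown requested and no work left
                }
                task = tasks.removeFirst();
            }
            try {
                task.run();
            } finally {
                synchronized (tasks) {
                    pending--;
                    tasks.notifyAll();   // wake finish() once the last task completes
                }
            }
        }
    }
}

During a multithreaded build, the driver can call execute once per file, then finish before writing any JSON output.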
Sample Output (results.json)
{ "hello world": [ { "where": "input/text/simple/hello.txt", "count": 2, "score": 0.75 } ] }
Lessons Learned
Practical implementation of multithreading for performance improvement
Designing thread-safe data structures using locks and condition variables
Building and testing concurrent systems in Java
Measuring speedup and efficiency across single- and multithreaded modes
How to Run
javac -cp ".:libs/*" Driver.java
java -cp ".:libs/*" Driver -text input/text/simple -query input/query/simple.txt -results results.json -threads 5
Search Engine Part 4: Creates a web crawler that builds the inverted index from a seed URL.
Features
Builds an inverted index from a seed URL, supporting both single-page and multi-page crawls.
Implements multithreading to improve crawl speed and index building efficiency.
Uses a thread-safe inverted index and work queue to manage concurrent processing of multiple web pages.
Supports crawl limits via the -crawl [total] flag and thread count via the -threads [num] flag.
Processes HTML pages efficiently (see the cleaning sketch after this list):
Removes comments, block elements (head, style, script, noscript, svg), and HTML tags.
Converts HTML entities to Unicode symbols.
Cleans, parses, and stems text before adding it to the index.
Follows HTTP/S standards:
Handles redirects (up to 3) and fetches only HTML content with 200 OK status.
Ensures URLs are absolute, normalized, and URL-encoded.
Maintains full search functionality from previous projects.
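As a rough illustration of the cleaning steps listed above, the sketch below removes comments, the listed block elements, and any remaining tags with regular expressions, then converts a handful of common named entities. The exact expressions and the full entity handling in the project differ.

// Rough sketch of HTML cleaning. Order matters: comments and block elements go first
// so that script and style contents are removed whole, then any remaining tags.
public class HtmlCleanerSketch {
    public static String stripHtml(String html) {
        String text = html;
        // Remove comments, including multi-line ones.
        text = text.replaceAll("(?s)<!--.*?-->", " ");
        // Remove block elements and everything inside them.
        text = text.replaceAll("(?is)<(head|style|script|noscript|svg)\\b.*?</\\1\\s*>", " ");
        // Remove any remaining tags.
        text = text.replaceAll("(?s)<[^>]*>", " ");
        // Convert a few common named entities (the project handles many more).
        return text.replace("&nbsp;", " ")
                   .replace("&amp;", "&")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"");
    }
}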
Example Usage
Multithreaded crawl (default 5 threads if -threads not provided):
java Driver -html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -crawl 15 -threads 3 -index index-crawl.json
Single-page crawl (default crawl = 1, default threads = 5 for v4.1+):
java Driver -html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -index index-crawl.json
Core Components
WebCrawler – manages downloading, parsing, and indexing web pages.
WorkQueue – handles worker threads and queues crawl tasks.
ThreadSafeInvertedIndex – extends the inverted index with thread-safe operations.
HtmlFetcher – downloads HTML content over HTTP/S using sockets and handles redirects.
HtmlCleaner – processes HTML content to remove unwanted elements and tags.
LinkFinder – extracts links from cleaned HTML for further crawling (see the sketch below).
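A minimal sketch of link extraction: pull href values out of anchor tags, resolve each one against the page's own URL so relative links become absolute, and drop the fragment so page and page#section count as the same URL. The regex and normalization here are simplified assumptions, not the project's exact rules.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of finding absolute, normalized links inside already-fetched HTML.
public class LinkFinderSketch {
    private static final Pattern HREF =
            Pattern.compile("(?is)<a\\b[^>]*?href\\s*=\\s*\"([^\"]*)\"");

    public static List<URI> findLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            try {
                URI resolved = base.resolve(matcher.group(1).strip());
                // Rebuild the URI without its fragment, then normalize the path.
                URI normalized = new URI(resolved.getScheme(), resolved.getAuthority(),
                        resolved.getPath(), resolved.getQuery(), null);
                links.add(normalized.normalize());
            } catch (URISyntaxException | IllegalArgumentException e) {
                // Skip links that cannot be parsed as valid URIs.
            }
        }
        return links;
    }
}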
Sample Output (index-crawl.json)
{ "hello": [ { "where": "https://usf-cs272-fall2022.github.io/project-web/input/simple/hello.html", "count": 2 } ], "world": [ { "where": "https://usf-cs272-fall2022.github.io/project-web/input/simple/hello.html", "count": 1 } ] }
Lessons Learned
Practical implementation of multithreading in web crawling and indexing.
Design of thread-safe data structures using work queues and synchronized operations.
Handling HTTP/S redirects, content types, and efficient page fetching.
HTML parsing, cleaning, and text normalization for indexing.
Incremental development and testing for large-scale multithreaded systems.
How to Run
Compile the project:
javac -cp ".:libs/*" Driver.java
Run the crawler with a seed URL and optional flags:
java -cp ".:libs/*" Driver -html [seed-url] -crawl [total] -threads [num] -index [output-file]
Example for multithreaded crawl:
java -cp ".:libs/*" Driver -html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -crawl 15 -threads 3 -index index-crawl.json
Search Engine Part 5: Creates a web server and a front end for the search engine.