Search Engine
Author: Richard Dzreke
This project was built incrementally in separate parts, each of which had to pass stringent tests and code reviews along the way. It was a rewarding experience that taught me best practices in code design and structure, with a focus on efficiency.
Key Components of Search Engine:
- Process text files into a consistent form
- Build a data structure that supports search
- Implement search algorithms through a data structure
- Switch from a single-threaded approach to a multi-threaded building and searching
- Process webpages into a consistent form
- Create a web server and a front end
Search Engine Part 1: Processes text files and builds a custom data structure
Features
Cleans and stems text using OpenNLP’s Snowball stemmer
Builds two data structures (see the sketch after this list):
  - Word Count Map – counts word stems per file
  - Inverted Index – maps each stemmed word to file(s) and position(s)
Accepts command-line arguments for input/output file paths
Outputs results in pretty-printed JSON format
Handles directories recursively and validates user input
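The nested layout of the inverted index is easiest to see in code. Below is a minimal sketch, assuming nested TreeMaps keyed by stem and file path with a sorted set of positions; the method names are illustrative rather than the project's exact API.

import java.util.TreeMap;
import java.util.TreeSet;

// Sketch of the inverted index layout: stem -> file path -> sorted 1-based positions.
public class InvertedIndex {
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    // Records that a stemmed word appears in a file at the given position.
    public void add(String stem, String location, int position) {
        index.computeIfAbsent(stem, k -> new TreeMap<>())
             .computeIfAbsent(location, k -> new TreeSet<>())
             .add(position);
    }

    // Returns true if the stem was indexed for the given file.
    public boolean contains(String stem, String location) {
        TreeMap<String, TreeSet<Integer>> files = index.get(stem);
        return files != null && files.containsKey(location);
    }
}

Using TreeMap and TreeSet keeps both the words and the positions sorted, which makes the pretty-printed JSON output deterministic.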
Example Usage
java Driver -text input/text/simple -counts counts.json -index index.json
Sample Output (index.json):
{ "hello": { "input/text/simple/hello.txt": [1] } }
Core Components
ArgumentParser – parses and validates flags
TextFileFinder – locates text files recursively
TextFileStemmer – cleans and stems words (see the sketch after this list)
InvertedIndex – stores mappings of word → file → positions
JsonWriter – exports data in JSON format
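For reference, here is a minimal sketch of how cleaning and stemming with OpenNLP's Snowball stemmer can look; the regular expressions and the helper method name are illustrative assumptions, not the project's exact code.

import opennlp.tools.stemmer.snowball.SnowballStemmer;
import opennlp.tools.stemmer.snowball.SnowballStemmer.ALGORITHM;

public class TextFileStemmer {
    // Lowercases a line, strips everything that is not a letter or whitespace,
    // splits on whitespace, and stems each remaining word.
    public static String[] cleanAndStem(String line) {
        SnowballStemmer stemmer = new SnowballStemmer(ALGORITHM.ENGLISH);
        String cleaned = line.toLowerCase().replaceAll("[^\\p{L}\\s]+", "").strip();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        String[] words = cleaned.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            words[i] = stemmer.stem(words[i]).toString();
        }
        return words;
    }
}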
Lessons Learned
Modular software design and testing
Recursive algorithms and file I/O
Data structure implementation in Java
Real-world text processing challenges
How to Run
javac -cp ".:libs/*" Driver.java
java -cp ".:libs/*" Driver -text input/text/simple -counts counts.json -index index.json
Search Engine Part 2: Processes multi-word queries from a query file, conducts an exact or partial search through the inverted index for those queries, and outputs the search results ranked by term frequency as pretty-printed JSON to a file.
Features
Processes query files to perform exact or partial search using an existing inverted index
Cleans and stems query text using OpenNLP’s Snowball stemmer
Ranks results based on relevance score, match count, and file name (see the comparator sketch after this list)
Supports command-line flags for input/output paths and search mode selection
Outputs results in pretty-printed JSON format mapping each query to its ranked results
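The ordering described above maps naturally onto a comparator: higher score first, then higher match count, then file name. Below is a minimal sketch, assuming the score is the match count divided by the total words in the file; the field and constructor shapes are illustrative.

// Sketch of a ranked search result: best score first, ties broken by count, then path.
public class SearchResult implements Comparable<SearchResult> {
    private final String where;   // file (or later, URL) that matched the query
    private final int count;      // total matches found in that location
    private final double score;   // count divided by the total words in the location

    public SearchResult(String where, int count, double score) {
        this.where = where;
        this.count = count;
        this.score = score;
    }

    @Override
    public int compareTo(SearchResult other) {
        int cmp = Double.compare(other.score, this.score);     // higher score first
        if (cmp == 0) {
            cmp = Integer.compare(other.count, this.count);     // then more matches first
        }
        if (cmp == 0) {
            cmp = this.where.compareToIgnoreCase(other.where);  // then alphabetically by name
        }
        return cmp;
    }
}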
Example Usage
java Driver -index index.json -query queries.txt -results results.json
Sample Output (results.json):
{ "hello world": [ { "where": "input/text/simple/hello.txt", "count": 2, "score": 0.345 } ] }
Core Components
QueryParser – reads and cleans query lines from input file
SearchResult – stores and compares result data (count, score, path)
InvertedIndex – performs exact and partial searches (see the partial-search sketch after this list)
JsonWriter – formats and exports query results to JSON
Driver – handles flag parsing and overall program execution
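As an illustration of how a partial search can exploit the sorted index, the sketch below starts at the first key greater than or equal to each query stem and walks forward until keys no longer share that prefix; this is an assumption about the approach, not the project's exact code.

import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PartialSearchSketch {
    // Collects every location whose indexed stems start with one of the query stems.
    public static Set<String> partialSearch(
            TreeMap<String, TreeMap<String, TreeSet<Integer>>> index,
            Set<String> queryStems) {
        Set<String> locations = new TreeSet<>();
        for (String stem : queryStems) {
            // tailMap begins at the first key >= stem; stop once keys stop matching.
            for (var entry : index.tailMap(stem).entrySet()) {
                if (!entry.getKey().startsWith(stem)) {
                    break;
                }
                locations.addAll(entry.getValue().keySet());
            }
        }
        return locations;
    }
}

An exact search is the same idea without the prefix walk: look up each stem directly and collect its locations.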
Lessons Learned
Efficient searching and result ranking
JSON serialization and file output handling
Text normalization and stemming
Extending modular software architecture across projects
How to Run
javac -cp ".:libs/*" Driver.java
java -cp ".:libs/*" Driver -index index.json -query queries.txt -results results.json
Search Engine Part 3: Supports multithreading using a custom conditional read/write lock to create a thread-safe inverted index and a bespoke work queue to manage worker threads. The build process is multithreaded, with each worker thread processing a single file. The search process is likewise multithreaded, with each worker thread processing a single multi-word query line.
Features
Implements multithreading to improve index building and query performance.
Adds a custom work queue for managing worker threads and a reentrant read/write lock for thread-safe access to shared data (see the lock sketch after this list).
Supports both single-threaded and multithreaded execution modes via the -threads [num] flag.
Multithreaded build assigns one file per worker thread; multithreaded search assigns one query line per thread.
Ensures safe concurrent reads and exclusive writes with a custom conditional locking mechanism.
Maintains all functionality from previous projects, including inverted index construction, search processing, and JSON output.
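A minimal sketch of the conditional locking idea: any number of readers may proceed while no writer holds the lock, and a writer waits until it has exclusive access. This simplified version is not reentrant (the project's lock is), and the class name is illustrative.

// Simplified conditional read/write lock: many concurrent readers OR one writer.
public class SimpleReadWriteLock {
    private int readers = 0;
    private int writers = 0;

    public synchronized void lockRead() throws InterruptedException {
        while (writers > 0) {
            this.wait();          // block while a writer holds the lock
        }
        readers++;
    }

    public synchronized void unlockRead() {
        readers--;
        this.notifyAll();         // a waiting writer may now proceed
    }

    public synchronized void lockWrite() throws InterruptedException {
        while (readers > 0 || writers > 0) {
            this.wait();          // block until no readers and no other writer remain
        }
        writers++;
    }

    public synchronized void unlockWrite() {
        writers--;
        this.notifyAll();         // wake any waiting readers and writers
    }
}

The thread-safe inverted index can then wrap every read operation in lockRead/unlockRead and every mutation in lockWrite/unlockWrite.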
Example Usage
Multithreaded mode:
java Driver -text input/text/simple -query input/query/simple.txt -results results.json -threads 5
Single-threaded mode:
java Driver -text input/text/simple -query input/query/simple.txt -results results.json
Core Components
WorkQueue – manages worker threads and tracks pending tasks (see the sketch after this list)
ReadWriteLock – custom reentrant conditional lock for concurrent access control
ThreadSafeInvertedIndex – extends inverted index with thread-safe operations
MultithreadedBuilder – builds the index concurrently from multiple files
MultithreadedQueryHandler – performs parallel searches for query lines
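For context, here is a compact sketch of a work queue in the spirit of the one described above: worker threads pull Runnable tasks from a shared queue, and finish() blocks until every submitted task has completed. Names and details are illustrative, not the project's exact implementation.

import java.util.LinkedList;

// Sketch of a work queue with a fixed pool of worker threads and pending-work tracking.
public class WorkQueueSketch {
    private final LinkedList<Runnable> tasks = new LinkedList<>();
    private final Thread[] workers;
    private boolean shutdown = false;
    private int pending = 0;   // tasks submitted but not yet finished

    public WorkQueueSketch(int threads) {
        workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(this::run);
            workers[i].start();
        }
    }

    // Queues a task and wakes an idle worker.
    public void execute(Runnable task) {
        synchronized (tasks) {
            pending++;
            tasks.addLast(task);
            tasks.notifyAll();
        }
    }

    // Blocks until every submitted task has run to completion.
    public void finish() throws InterruptedException {
        synchronized (tasks) {
            while (pending > 0) {
                tasks.wait();
            }
        }
    }

    // Asks the workers to exit once the queue drains.
    public void shutdown() {
        synchronized (tasks) {
            shutdown = true;
            tasks.notifyAll();
        }
    }

    private void run() {
        while (true) {
            Runnable task;
            synchronized (tasks) {
                while (tasks.isEmpty() && !shutdown) {
                    try {
                        tasks.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (tasks.isEmpty()) {
                    return;        // shutdown requested and no work left
                }
                task = tasks.removeFirst();
            }
            try {
                task.run();
            } finally {
                synchronized (tasks) {
                    pending--;
                    tasks.notifyAll();   // wake finish() once the last task completes
                }
            }
        }
    }
}

During a multithreaded build, the driver can call execute once per file, then finish before writing any JSON output.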
Sample Output (results.json)
{ "hello world": [ { "where": "input/text/simple/hello.txt", "count": 2, "score": 0.75 } ] }
Lessons Learned
Practical implementation of multithreading for performance improvement
Designing thread-safe data structures using locks and condition variables
Building and testing concurrent systems in Java
Measuring speedup and efficiency across single- and multithreaded modes
How to Run
javac -cp ".:libs/*" Driver.java
java -cp ".:libs/*" Driver -text input/text/simple -query input/query/simple.txt -results results.json -threads 5
Search Engine Part 4: Creates a web crawler that builds the inverted index from a seed URL.
Features
Builds an inverted index from a seed URL, supporting both single-page and multi-page crawls.
Implements multithreading to improve crawl speed and index building efficiency.
Uses a thread-safe inverted index and work queue to manage concurrent processing of multiple web pages.
Supports crawl limits via the -crawl [total] flag and thread count via the -threads [num] flag.
Processes HTML pages efficiently (see the cleaning sketch after this list):
Removes comments, block elements (head, style, script, noscript, svg), and HTML tags.
Converts HTML entities to Unicode symbols.
Cleans, parses, and stems text before adding it to the index.
Follows HTTP/S standards:
Handles redirects (up to 3) and fetches only HTML content with 200 OK status.
Ensures URLs are absolute, normalized, and URL-encoded.
Maintains full search functionality from previous projects.
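As a rough illustration of the cleaning steps listed above, the sketch below removes comments, the listed block elements, and any remaining tags with regular expressions, then converts a handful of common named entities. The exact expressions and the full entity handling in the project differ.

// Rough sketch of HTML cleaning. Order matters: comments and block elements go first
// so that script and style contents are removed whole, then any remaining tags.
public class HtmlCleanerSketch {
    public static String stripHtml(String html) {
        String text = html;
        // Remove comments, including multi-line ones.
        text = text.replaceAll("(?s)<!--.*?-->", " ");
        // Remove block elements and everything inside them.
        text = text.replaceAll("(?is)<(head|style|script|noscript|svg)\\b.*?</\\1\\s*>", " ");
        // Remove any remaining tags.
        text = text.replaceAll("(?s)<[^>]*>", " ");
        // Convert a few common named entities (the project handles many more).
        return text.replace("&nbsp;", " ")
                   .replace("&amp;", "&")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"");
    }
}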
Example Usage
Multithreaded crawl (default 5 threads if -threads not provided):
java Driver -html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -crawl 15 -threads 3 -index index-crawl.json
Single-page crawl (default crawl = 1, default threads = 5 for v4.1+):
java Driver -html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -index index-crawl.json
Core Components
WebCrawler – manages downloading, parsing, and indexing web pages.
WorkQueue – handles worker threads and queues crawl tasks.
ThreadSafeInvertedIndex – extends the inverted index with thread-safe operations.
HtmlFetcher – downloads HTML content over HTTP/S using sockets and handles redirects.
HtmlCleaner – processes HTML content to remove unwanted elements and tags.
LinkFinder – extracts links from cleaned HTML for further crawling (see the sketch below).
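A minimal sketch of link extraction: pull href values out of anchor tags, resolve each one against the page's own URL so relative links become absolute, and drop the fragment so page and page#section count as the same URL. The regex and normalization here are simplified assumptions, not the project's exact rules.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of finding absolute, normalized links inside already-fetched HTML.
public class LinkFinderSketch {
    private static final Pattern HREF =
            Pattern.compile("(?is)<a\\b[^>]*?href\\s*=\\s*\"([^\"]*)\"");

    public static List<URI> findLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            try {
                URI resolved = base.resolve(matcher.group(1).strip());
                // Rebuild the URI without its fragment, then normalize the path.
                URI normalized = new URI(resolved.getScheme(), resolved.getAuthority(),
                        resolved.getPath(), resolved.getQuery(), null);
                links.add(normalized.normalize());
            } catch (URISyntaxException | IllegalArgumentException e) {
                // Skip links that cannot be parsed as valid URIs.
            }
        }
        return links;
    }
}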
Sample Output (index-crawl.json)
{ "hello": [ { "where": "https://usf-cs272-fall2022.github.io/project-web/input/simple/hello.html", "count": 2 } ], "world": [ { "where": "https://usf-cs272-fall2022.github.io/project-web/input/simple/hello.html", "count": 1 } ] }
Lessons Learned
Practical implementation of multithreading in web crawling and indexing.
Design of thread-safe data structures using work queues and synchronized operations.
Handling HTTP/S redirects, content types, and efficient page fetching.
HTML parsing, cleaning, and text normalization for indexing.
Incremental development and testing for large-scale multithreaded systems.
How to Run
Compile the project:
javac -cp ".:libs/*" Driver.java
Run the crawler with a seed URL and optional flags:
java -cp ".:libs/*" Driver -html [seed-url] -crawl [total] -threads [num] -index [output-file]
Example for multithreaded crawl:
java -cp ".:libs/*" Driver -html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -crawl 15 -threads 3 -index index-crawl.json
Search Engine Part 5: Creates a web server and a front end for the search engine.