HyperCrawlX

🚀 About the Project

HyperCrawlX is a distributed web crawler designed to efficiently discover product URLs on e-commerce websites.
It uses Playwright and HtmlAgilityPack for scraping and stores its results in PostgreSQL.

🔥 Features

Distributed crawling for scalable and efficient data discovery
Asynchronous processing with multiple workers
High-performance scraping using Playwright and HtmlAgilityPack
PostgreSQL for structured data storage, with connection pooling for efficient database connection management
Containerized using Docker
Cloud-based deployment on AWS ECS and Render

High-Level Design

🛠️ Tech Stack

Backend: .NET Core, C#
Database: PostgreSQL
Scraping: Playwright, HtmlAgilityPack
Deployment: Docker, AWS ECS, Render

📌 Usage

HyperCrawlX operates as a web application. Users can interact with the crawler via its API endpoints.

1️⃣ Submit a Crawl Request

POST /hypercrawlx/submitCrawlRequest
Content-Type: application/json
{
  "url": "<E-com website URL>"
}

✅ Response

{
  "requestId": "<unique Id of the request>",
  "url": "<E-com website URL>"
}

This API stores a crawl request for the given URL in the database. The URL is validated for structural correctness before the request is submitted. If the URL is malformed, the request is rejected and an appropriate error is returned to the user.
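As an illustration of the request described above, here is a minimal Python client sketch using only the standard library. The base URL `http://localhost:8080` is an assumption (the README does not state where the service is hosted), and `is_structurally_valid` is a hypothetical client-side pre-check, not the server's actual validation rule, which is not documented here.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Hypothetical base URL; the README does not say where the service runs.
BASE_URL = "http://localhost:8080"


def is_structurally_valid(url: str) -> bool:
    """Client-side pre-check: an absolute http(s) URL with a host.

    This is only a guess at what "structural correctness" covers;
    the server remains the authority on what it accepts.
    """
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_submit_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for submitCrawlRequest."""
    if not is_structurally_valid(url):
        raise ValueError(f"not a well-formed http(s) URL: {url!r}")
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/hypercrawlx/submitCrawlRequest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_crawl_request(url: str, base_url: str = BASE_URL) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(build_submit_request(url, base_url)) as resp:
        return json.load(resp)
```

Calling `submit_crawl_request("https://shop.example.com")` against a running instance should return a dictionary containing the `requestId` and `url` fields shown in the response above. Checking the URL locally first only avoids an obviously doomed round trip.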

2️⃣ Check Crawl Status

GET /hypercrawlx/getRequestStatus/<request-Id>

✅ Response

{
  "requestId": "<unique id of the request>",
  "status": "Completed",
  "url": "<E-com website URL>",
  "productUrlsCount": "<count-of-the-urls-found>",
  "productUrls": [
    "https://example.com/product1",
    "https://example.com/product2"
  ]
}

This API returns the current status of a crawl request. The status is one of the following:

Queued
InProgress
Completed
Failed
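Because a request moves through these states over time, a client typically polls until it reaches `Completed` or `Failed`. The sketch below shows one way to do that in Python. It assumes a base URL of `http://localhost:8080` and treats `Completed` and `Failed` as the only terminal states; the polling interval and attempt limit are arbitrary illustrative defaults, and `wait_for_crawl` and `fetch_status` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

# Hypothetical base URL; the README does not say where the service runs.
BASE_URL = "http://localhost:8080"

# Assumption: Queued and InProgress are transient, these two are final.
TERMINAL_STATUSES = {"Completed", "Failed"}


def fetch_status(request_id: str, base_url: str = BASE_URL) -> dict:
    """GET the status document for one crawl request."""
    url = f"{base_url}/hypercrawlx/getRequestStatus/{request_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_crawl(
    request_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
) -> dict:
    """Poll until the request reaches a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(request_id)
        if result["status"] in TERMINAL_STATUSES:
            return result
        # Sleep between polls, but not after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(
        f"request {request_id} still not finished after {max_attempts} polls"
    )
```

The `fetch` parameter is injectable so the loop can be exercised without a live server. On a `Completed` result, the `productUrls` list from the response above is available as `result["productUrls"]`.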

🐳 Docker Image

The Docker image for this app is published under the tag anu1201d/apps-pub-repo:hypercrawlx

💡 HyperCrawlX - Simplifying e-commerce product discovery at scale!
