A Web-Crawler API for discovering Product URLs on E-commerce Websites

dev-anubhavj/HyperCrawlX

HyperCrawlX

🚀 About the Project

HyperCrawlX is a distributed web crawler designed to efficiently discover product URLs on e-commerce websites.
It uses Playwright and HtmlAgilityPack to extract product URLs from the pages it crawls.

🔥 Features

  • Distributed crawling for scalable and efficient data discovery
  • Asynchronous processing with multiple workers
  • High-performance scraping using Playwright and HtmlAgilityPack
  • PostgreSQL for structured data storage, with connection pooling for efficient DB connection management
  • Containerized using Docker
  • Cloud-based deployment on AWS ECS and Render

High-Level Design

Architecture

🛠️ Tech Stack

  • Backend: .NET Core, C#
  • Database: PostgreSQL
  • Scraping: Playwright, HtmlAgilityPack
  • Deployment: Docker, AWS ECS, Render

📌 Usage

HyperCrawlX operates as a web application. Users can interact with the crawler via its API endpoints.

1️⃣ Submit a Crawl Request

POST /hypercrawlx/submitCrawlRequest
Content-Type: application/json

{
  "url": "<E-com website URL>"
}

✅ Response

{
  "requestId": "<unique Id of the request>",
  "url": "<E-com website URL>"
}

This API stores a crawl request for the given URL in the database. The URL is validated for structural correctness before the request is submitted. If the URL is malformed, the request is rejected and an appropriate error is returned to the user.
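
The submit step can be sketched from the client side in Python. This is a minimal example under stated assumptions, not the service's own code: `BASE_URL` is a hypothetical address (this README does not give a host or port), and `is_structurally_valid` is an illustrative pre-check that only approximates whatever validation the API performs.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Assumption: the host and port are not stated in this README.
BASE_URL = "http://localhost:8080"


def is_structurally_valid(url: str) -> bool:
    """Illustrative pre-check: accept only absolute http(s) URLs with a host."""
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_submit_request(site_url: str) -> urllib.request.Request:
    """Build the POST request for /hypercrawlx/submitCrawlRequest without sending it."""
    if not is_structurally_valid(site_url):
        raise ValueError(f"not a valid http(s) URL: {site_url!r}")
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/hypercrawlx/submitCrawlRequest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_crawl_request(site_url: str) -> dict:
    """Send the request and return the parsed JSON response (requestId, url)."""
    with urllib.request.urlopen(build_submit_request(site_url)) as resp:
        return json.load(resp)
```

Calling `submit_crawl_request("https://example.com")` against a running instance should return a dictionary with the `requestId` to use in the status check. Checking the URL locally first avoids a round trip for obviously malformed input, but the server's response remains authoritative.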

2️⃣ Check Crawl Status

GET /hypercrawlx/getRequestStatus/<request-Id>

✅ Response

{
  "requestId": "<unique id of the request>",
  "status": "Completed",
  "url": "<E-com website URL>",
  "productUrlsCount": "<count-of-the-urls-found>",
  "productUrls": [
    "https://example.com/product1",
    "https://example.com/product2"
  ]
}

This API is used to check the status of the request. The status can be one of the following:

  1. Queued
  2. InProgress
  3. Completed
  4. Failed
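
Because crawling is asynchronous, a client typically polls the status endpoint until the request finishes. The sketch below shows one way to do that in Python. It assumes that `Completed` and `Failed` are the terminal states and that the service runs at a hypothetical `BASE_URL`; the polling interval and retry limit are arbitrary illustrative values.

```python
import json
import time
import urllib.request
from typing import Callable
from urllib.parse import quote

# Assumption: the host and port are not stated in this README.
BASE_URL = "http://localhost:8080"
# Assumption: these two statuses mean the crawl will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed"}


def fetch_status(request_id: str) -> dict:
    """GET /hypercrawlx/getRequestStatus/<request-Id> and parse the JSON body."""
    url = f"{BASE_URL}/hypercrawlx/getRequestStatus/{quote(request_id, safe='')}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_crawl(
    request_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval_s: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the status is terminal, or raise TimeoutError after max_polls."""
    for attempt in range(max_polls):
        result = fetch(request_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"request {request_id} not finished after {max_polls} polls")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running service. In normal use, `wait_for_crawl(request_id)` returns the final response, whose `productUrls` list holds the discovered links when the status is `Completed`.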

🐳 Docker Image

The Docker image for this app is published under the tag anu1201d/apps-pub-repo:hypercrawlx


💡 HyperCrawlX - Simplifying e-commerce product discovery at scale!
