HyperCrawlX is a distributed web crawler designed to efficiently discover product URLs on e-commerce websites.
It is built on Playwright for browser automation and HtmlAgilityPack for HTML parsing.
- Distributed crawling for scalable and efficient data discovery
- Asynchronous processing with multiple workers
- High-performance scraping using Playwright and HtmlAgilityPack
- PostgreSQL for structured data storage, with connection pooling for efficient database connection management
- Containerized using Docker
- Cloud-based deployment on AWS ECS and Render
- Backend: .NET Core, C#
- Database: PostgreSQL
- Scraping: Playwright, HtmlAgilityPack
- Deployment: Docker, AWS ECS, Render
HyperCrawlX operates as a web application. Users can interact with the crawler via its API endpoints.
POST /hypercrawlx/submitCrawlRequest
Content-Type: application/json

Request body:
{
  "url": "<E-com website URL>"
}

Response:
{
  "requestId": "<unique Id of the request>",
  "url": "<E-com website URL>"
}

This API submits a crawl request to the database for the given URL. The URL is validated for structural correctness before the request is submitted. If the URL is malformed, the request is rejected and an appropriate error is returned to the user.
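
For illustration, here is a minimal client-side sketch of this call in Python, using only the standard library. The base URL `http://localhost:8080` and the helper names (`is_structurally_valid`, `build_submit_request`, `submit_crawl_request`) are assumptions made for this example, not part of HyperCrawlX. The pre-check is only a guess at what "structural correctness" covers; the server's own validation may differ.

```python
import json
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Hypothetical base URL; replace with wherever the service is actually hosted.
BASE_URL = "http://localhost:8080"


def is_structurally_valid(url: str) -> bool:
    """Client-side pre-check: absolute http(s) URL with a host.

    Assumption: the server may apply stricter or different rules.
    """
    try:
        parts = urlsplit(url)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:
        return False


def build_submit_request(site_url: str, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) the POST request for submitCrawlRequest."""
    if not is_structurally_valid(site_url):
        raise ValueError(f"not a well-formed http(s) URL: {site_url!r}")
    body = json.dumps({"url": site_url}).encode("utf-8")
    return Request(
        f"{base_url}/hypercrawlx/submitCrawlRequest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_crawl_request(site_url: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the requestId from the JSON response."""
    with urlopen(build_submit_request(site_url, base_url)) as resp:
        return json.load(resp)["requestId"]
```

Calling `submit_crawl_request("https://example.com")` against a running instance should return the `requestId` to use with the status endpoint.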
GET /hypercrawlx/getRequestStatus/<request-Id>

Response:
{
  "requestId": "<unique id of the request>",
  "status": "Completed",
  "url": "<E-com website URL>",
  "productUrlsCount": "<count-of-the-urls-found>",
  "productUrls": [
    "https://example.com/product1",
    "https://example.com/product2"
  ]
}

This API is used to check the status of a request. The status can be one of the following:
- Queued
- InProgress
- Completed
- Failed
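
Since a crawl moves through these states over time, a client will typically poll the status endpoint until it reaches `Completed` or `Failed`. The sketch below shows one way to do that in Python. The base URL, the polling interval, the poll limit, and the helper names (`fetch_status`, `wait_for_result`) are all assumptions chosen for this example; only the endpoint path, the status values, and the response fields come from the description above.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base URL; replace with wherever the service is actually hosted.
BASE_URL = "http://localhost:8080"

# Statuses after which the request will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed"}


def fetch_status(request_id: str, base_url: str = BASE_URL) -> dict:
    """GET the current status document for one crawl request."""
    url = f"{base_url}/hypercrawlx/getRequestStatus/{quote(request_id, safe='')}"
    with urlopen(url) as resp:
        return json.load(resp)


def wait_for_result(request_id, fetch=fetch_status, interval=5.0,
                    max_polls=120, sleep=time.sleep) -> dict:
    """Poll until the request is Completed or Failed, then return the response.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    for _ in range(max_polls):
        result = fetch(request_id)
        if result["status"] in TERMINAL_STATUSES:
            return result
        sleep(interval)  # still Queued or InProgress; wait and retry
    raise TimeoutError(f"request {request_id} did not finish in {max_polls} polls")
```

On a `Completed` response, the discovered links are in `result["productUrls"]`; on `Failed`, the caller can decide whether to resubmit.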
The Docker image for this app is published under the following tag:
anu1201d/apps-pub-repo:hypercrawlx
💡 HyperCrawlX - Simplifying e-commerce product discovery at scale!
