A high-performance, serverless Node.js application that extracts article data from URLs while bypassing Cloudflare's anti-bot measures. Built with modern JavaScript (ES2022) and optimized for Node.js v18+ and Vercel deployment.
- Dual Bypass Strategy: Two-tier anti-bot bypass system
  - Primary: `humanoid-js` for basic-medium Cloudflare protection
  - Secondary: `impit` with browser fingerprint spoofing
  - Automatic fallback if primary method fails
- Smart Caching: Redis-based caching with configurable TTL (default: 10 days)
- Content Extraction: Extracts title, content, images, author, published date, and metadata
- Quality Validation: Automatic detection of cookie walls, paywalls, and invalid content
- Dual HTTP Methods: Supports both GET and POST requests
- Secret Key Authentication: Multi-key support with comma-separated values
- Input Validation: URL format validation and sanitization
- Redis Fallback: Service continues without cache if Redis is unavailable
- Timeout Handling: 25-second timeout prevents hanging requests
- Error Sanitization: Production-safe error messages
- Strategy Reporting: Shows which bypass method succeeded (`humanoid` or `impit`)
- Performance Tracking: Response time measurement for every request
- Content Validation: Reports on article quality and detected blockers
- Health Endpoint: Service health and Redis connectivity monitoring
- Cache Status: Indicates if content was served from cache or freshly fetched
- CORS Support: Cross-origin requests enabled for all methods
- RESTful API: Clean, consistent JSON responses
- Comprehensive Testing: 7 automated checks for project integrity
- Modern Tooling: ESLint v9, Prettier, ES2022 features
- Serverless Ready: Optimized for Vercel free tier (<50MB)
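
The input validation and 25-second timeout listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: `isValidHttpUrl` and `withTimeout` are hypothetical helper names, and only the 25-second figure comes from the feature list.

```javascript
// Hypothetical helpers; the names are illustrative, not taken from the project source.

// Accept only absolute http(s) URLs.
function isValidHttpUrl(input) {
  try {
    const { protocol } = new URL(String(input).trim());
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // new URL() throws on malformed input
  }
}

// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms = 25_000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying request. A real implementation would likely also pass an abort signal to the HTTP client.
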
- Node.js: v18.0.0 or later (fully compatible with Node.js v22)
- Vercel CLI: Install globally with `npm i -g vercel`
- Upstash Redis: For caching (free tier available at upstash.com)
1. Clone the repository: `git clone https://github.com/davodm/article-export.git`, then `cd article-export`
2. Install dependencies: `npm install`
3. Set up environment variables. Create a `.env.local` file:

       UPSTASH_REDIS_REST_TOKEN=your_redis_token
       UPSTASH_REDIS_REST_URL=your_redis_url
       SECRET_KEY=your_secret_key1,your_secret_key2
       REDIS_CACHE_DAYS=10

4. Run tests to verify setup: `npm test`
5. Start local development: `vercel dev`
`/api`: Extracts article content from a given URL. Supports both GET and POST methods.
`/api/health`: Monitors service health and Redis connection status.
The API supports both GET and POST methods with the same parameters:
GET Request (Query Parameters):
GET /api?key=your_secret_key&url=https://example.com/article
POST Request (JSON Body):
{
"key": "your_secret_key",
"url": "https://example.com/article"
}
Success Response (200):
{
"status": 0,
"article": {
"title": "Article Title",
"content": "Article content...",
"image": "https://example.com/image.jpg",
"author": "Author Name",
"publishedTime": "2024-01-01T00:00:00.000Z"
},
"cached": false,
"strategy": "humanoid",
"validation": {
"isValid": true,
"hasBlocker": false,
"issues": [],
"quality": {
"hasValidTitle": true,
"hasValidContent": true,
"hasValidDescription": true,
"contentLength": 2540
}
},
"processingTime": "1250ms",
"timestamp": "2024-01-01T00:00:00.000Z"
}
Response Fields:
- `status`: `0` for success, `-1` for error
- `article`: Extracted article data (title, content, author, etc.)
- `cached`: `true` if served from cache, `false` if freshly fetched
- `strategy`: Which fetch method was used (`"humanoid"` or `"impit"`), `null` if from cache
- `validation`: Content quality and blocker detection (see below)
- `processingTime`: Total processing time in milliseconds, returned as a string such as `"1250ms"`
- `timestamp`: ISO timestamp of the response
Validation Object:
- `isValid`: `true` if content is valid, `false` if issues detected
- `hasBlocker`: `true` if cookie wall or paywall detected
- `issues`: Array of detected issues (cookie walls, paywalls, etc.)
- `quality`: Quality metrics (title, content, description validity)
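
A client might condense these fields into a single decision before using the article. The helper below is a sketch under that assumption: `summarizeResponse` is a hypothetical name, and only the field names are taken from the response format documented above.

```javascript
// Summarize an /api response body for client-side handling.
// `summarizeResponse` is a hypothetical helper; field names follow the documented response.
function summarizeResponse(body) {
  if (!body || body.status !== 0) {
    return { ok: false, reason: body?.error ?? "Unknown error" };
  }
  const { article, cached, strategy, validation } = body;
  return {
    ok: true,
    title: article?.title ?? null,
    // `strategy` is null when the article came from the cache.
    source: cached ? "cache" : (strategy ?? "unknown"),
    // Treat the article as usable only if it validated and no blocker was detected.
    usable: Boolean(validation?.isValid) && !validation?.hasBlocker,
    issues: validation?.issues ?? [],
  };
}
```
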
Error Response (4xx/5xx):
{
"status": -1,
"error": "Error message",
"timestamp": "2024-01-01T00:00:00.000Z"
}
Health Check Response:
{
"status": 0,
"message": "Service is healthy",
"timestamp": "2024-01-01T00:00:00.000Z",
"environment": "production",
"nodeVersion": "v22.15.1",
"redis": "connected",
"uptime": 123.456
}
# Test health endpoint
curl https://your-app.vercel.app/api/health
# Extract article content (GET method - simple and easy)
curl "https://your-app.vercel.app/api?key=your_secret_key&url=https://example.com/article"
# Extract article content (POST method - recommended for long URLs)
curl -X POST https://your-app.vercel.app/api \
-H "Content-Type: application/json" \
-d '{
"key": "your_secret_key",
"url": "https://example.com/article"
}'

- `vercel dev` - Start local development server
- `npm run build` - Build the project (creates public directory for Vercel)
- `npm run deploy` - Deploy to production
- `npm run deploy:staging` - Deploy to staging
- `npm run lint` - Run ESLint for code quality
- `npm run format` - Format code with Prettier
- `npm test` - Run project validation tests
- `npm run clean` - Clean Vercel build files
The project uses modern development tools:
- ESLint v9 with flat config for code linting
- Prettier for consistent code formatting
- ES2022 features for modern JavaScript
- Comprehensive testing with automated validation
1. Install Vercel CLI globally: `npm i -g vercel`
2. Link your project: `vercel link`
3. Run locally: `vercel dev`
| Package | Version | Status | Purpose |
|---|---|---|---|
| `@extractus/article-extractor` | ^8.0.20 | Active | Extracts article content, metadata, and structured data from HTML |
| `@upstash/redis` | ^1.35.6 | Active | Serverless Redis client for caching with REST API |
| `humanoid-js` | ^1.0.1 | | Primary Cloudflare bypass (7 years old, but still functional) |
| `impit` | ^0.6.0 | Active | HTTP client with browser impersonation for secondary bypass |

| Package | Version | Status | Purpose |
|---|---|---|---|
| `eslint` | ^9.38.0 | Active | Code linting with flat config support |
| `globals` | ^16.4.0 | Active | ESLint global variables for Node.js v24 compatibility |
| `prettier` | ^3.6.2 | Active | Code formatting |
humanoid-js:
- Last updated: 7 years ago (2018)
- Status: Works for basic-medium Cloudflare protection
- Why we keep it: Simple, lightweight, no browser needed
- Fallback: `impit` automatically used if humanoid-js fails
- Future: Will replace when it stops working or better alternatives emerge
Why This Approach Works:
- Two bypass strategies provide redundancy
- Automatic fallback ensures reliability
- All dependencies work on Vercel free tier
- No browser automation needed (keeps function size <50MB)
- Total package size: ~15MB (well under 50MB limit)
# Update all dependencies (safe - follows semver)
npm update
# Check for outdated packages
npm outdated
# Rebuild native modules after Node.js upgrade
npm rebuild

Request → Validate Key & URL
    ↓
Check Redis Cache
    ↓
Cache Hit? → Return Cached Article
    ↓
Cache Miss? → Fetch with Bypass Strategy
    ↓
Try humanoid-js → Success? → Extract & Cache → Return
    ↓
Failed? → Try impit → Success? → Extract & Cache → Return
    ↓
Failed? → Return Error
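
The request flow above can be sketched as a single function with its dependencies injected. This is an illustration only: `getArticle`, `cache`, `fetchers`, and `extract` are hypothetical names, and the sketch deliberately does not call the real `humanoid-js`, `impit`, or `@upstash/redis` APIs.

```javascript
// Sketch of the request flow with injected dependencies.
// `cache`, `fetchers`, and `extract` are hypothetical interfaces, not the project's real modules.
async function getArticle(url, { cache, fetchers, extract }) {
  // 1. Serve from cache when possible.
  const cached = await cache.get(url);
  if (cached) {
    return { article: cached, cached: true, strategy: null };
  }

  // 2. Try each strategy in order, e.g. [["humanoid", fn], ["impit", fn]].
  const errors = [];
  for (const [name, fetchHtml] of fetchers) {
    try {
      const html = await fetchHtml(url);
      const article = await extract(html, url);
      await cache.set(url, article);
      return { article, cached: false, strategy: name };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }

  // 3. Every strategy failed.
  throw new Error(`All strategies failed (${errors.join("; ")})`);
}
```

Keeping the strategies in an ordered list means a third fallback could be added without touching the control flow.
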
// Automatic fallback system
1. Try humanoid-js (fast, lightweight)
   Success → Cache & Return
   Fail → next step
2. Try impit (browser impersonation)
   Success → Cache & Return
   Fail → next step
3. Return error with details

Extract Article → Validate Content
    ↓
Check for:
- Cookie walls (40+ confidence threshold)
- Paywalls (30+ confidence threshold)
- Short content (< 200 chars)
- Missing title (< 10 chars)
    ↓
Return validation object with:
- isValid: boolean
- hasBlocker: boolean
- issues: array
- quality: metrics
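
One way such threshold-based validation could work is to sum weights for matched phrases and compare the totals with the thresholds listed above. The sketch below assumes that approach; the phrase lists, weights, issue names, and the `validateArticle` function are invented for illustration and are not the project's actual detection rules. Only the thresholds (40, 30, 200 characters, 10 characters) come from the flow above, and `hasValidDescription` is omitted for brevity.

```javascript
// Hypothetical scoring sketch; the phrases and weights are invented for illustration.
const COOKIE_PHRASES = { "accept cookies": 25, "cookie policy": 15, "manage consent": 20 };
const PAYWALL_PHRASES = { "subscribe to continue": 30, "subscribers only": 20, "sign in to read": 15 };

// Sum the weights of every phrase that appears in the text.
function score(text, phrases) {
  const lower = text.toLowerCase();
  return Object.entries(phrases).reduce(
    (sum, [phrase, weight]) => (lower.includes(phrase) ? sum + weight : sum),
    0,
  );
}

function validateArticle(article = {}) {
  const title = String(article.title ?? "").trim();
  const content = String(article.content ?? "").trim();

  const issues = [];
  if (score(content, COOKIE_PHRASES) >= 40) issues.push("cookie_wall");
  if (score(content, PAYWALL_PHRASES) >= 30) issues.push("paywall");
  const hasBlocker = issues.length > 0;

  const hasValidTitle = title.length >= 10;
  const hasValidContent = content.length >= 200;
  if (!hasValidTitle) issues.push("missing_title");
  if (!hasValidContent) issues.push("short_content");

  return {
    isValid: issues.length === 0,
    hasBlocker,
    issues,
    quality: { hasValidTitle, hasValidContent, contentLength: content.length },
  };
}
```
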
- News aggregators
- RSS feed readers
- Bookmark managers with content preview
- Content analysis tools
- Research bots
- Article archiving services
- Content discovery platforms
- Cookie Walls: Detects but cannot automatically accept (requires browser automation)
- Paywalls: Detects but cannot bypass (premium content protected)
- JavaScript-heavy sites: May return incomplete content
- Rate limiting: Subject to target site's rate limits
- Dynamic content: May miss content loaded via AJAX after initial render
- Cache aggressively (10-day default is reasonable for most content)
- Handle `validation.hasBlocker` in your client code
- Monitor `strategy` field to track bypass success rates
- Use POST for long URLs (avoid URL length limits)
- Implement retry logic with exponential backoff
- Check `cached` field to understand performance
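
As an illustration of the retry and POST recommendations above, a client could wrap its request in a small backoff helper. This is a sketch rather than an official client: `withRetry`, `extractArticle`, and the default retry settings are assumptions, and it relies on the global `fetch` available in Node.js v18+.

```javascript
// Retry an async operation with exponential backoff.
// `withRetry` and its options are hypothetical; tune them for your client.
async function withRetry(operation, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1000ms, 2000ms, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Example: POST to the API and treat a non-zero status as a failure to retry.
async function extractArticle(apiUrl, key, url) {
  return withRetry(async () => {
    const res = await fetch(apiUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ key, url }),
    });
    const body = await res.json();
    if (body.status !== 0) throw new Error(body.error ?? `HTTP ${res.status}`);
    return body;
  });
}
```

For brevity this retries on every failure. A production client would probably skip retries for authentication or validation errors, where repeating the request cannot succeed.
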
| Variable | Required | Default | Description |
|---|---|---|---|
| `UPSTASH_REDIS_REST_TOKEN` | Yes | - | Your Upstash Redis REST token |
| `UPSTASH_REDIS_REST_URL` | Yes | - | Your Upstash Redis REST URL (https://...) |
| `SECRET_KEY` | Yes | - | Comma-separated API keys for authentication |
| `REDIS_CACHE_DAYS` | No | `10` | Cache duration in days (recommend 10-30) |
| `NODE_ENV` | No | `development` | Environment (development, production) |
.env.local for local development:
UPSTASH_REDIS_REST_TOKEN=xxxx...
UPSTASH_REDIS_REST_URL=https://frank-lizard-12345.upstash.io
SECRET_KEY=my_dev_key_123,another_key_456
REDIS_CACHE_DAYS=10
NODE_ENV=development
Vercel Environment Variables:
- Go to your Vercel project β Settings β Environment Variables
- Add each variable for Production, Preview, and Development
- Vercel will automatically inject them during deployment
| Content Type | Recommended TTL | Setting |
|---|---|---|
| News articles | 1-3 days | REDIS_CACHE_DAYS=1 |
| Blog posts | 7-14 days | REDIS_CACHE_DAYS=7 |
| Static content | 30+ days | REDIS_CACHE_DAYS=30 |
| General use (default) | 10 days | REDIS_CACHE_DAYS=10 |
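
The `SECRET_KEY` and `REDIS_CACHE_DAYS` settings could be turned into usable values along these lines. This is a minimal sketch, assuming the variables are read from `process.env` as strings; the function names are hypothetical and do not come from the project source.

```javascript
// Hypothetical config helpers; variable names match the documented settings, function names do not.

// Split "key1, key2" into ["key1", "key2"], ignoring blanks.
function parseSecretKeys(raw = "") {
  return raw
    .split(",")
    .map((key) => key.trim())
    .filter(Boolean);
}

// A request is authorized if its key matches any configured key.
function isAuthorized(key, raw) {
  return typeof key === "string" && key.length > 0 && parseSecretKeys(raw).includes(key);
}

// Convert REDIS_CACHE_DAYS to seconds, falling back to the 10-day default.
function cacheTtlSeconds(rawDays, defaultDays = 10) {
  const days = Number(rawDays);
  const valid = Number.isFinite(days) && days > 0;
  return Math.round((valid ? days : defaultDays) * 24 * 60 * 60);
}
```

For example, `cacheTtlSeconds(process.env.REDIS_CACHE_DAYS)` yields a value in seconds, the unit Redis expiry options typically expect.
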
Quick Deploy:
# Production deployment
npm run deploy
# Staging deployment
npm run deploy:staging
First-time Setup:
- Install Vercel CLI: `npm i -g vercel`
- Link project: `vercel link`
- Add environment variables in Vercel dashboard
- Deploy: `npm run deploy`
Serverless functions can go "cold" after inactivity. To keep your function and Upstash Redis connection active, we've configured a daily cron job that pings the health endpoint.
Built-in Solution (Vercel Cron Jobs):
- Already configured in `vercel.json`
- Runs daily at 12:00 UTC
- Included at no extra cost on the Vercel Pro plan (or use the alternatives below)
- No external dependencies
The cron job is configured to call /api/health once per day, which:
- Keeps the serverless function warm
- Tests Redis connectivity
- Ensures the database stays active
Alternative Free Solutions:
If you're on Vercel's free tier (which doesn't include cron jobs), use one of these free external services:
1. UptimeRobot (Recommended - Free tier: 50 monitors)
   - URL: https://uptimerobot.com
   - Setup: Create a monitor → HTTP(s) → Your health endpoint URL
   - Interval: Set to check every 24 hours (or minimum 5 minutes)
   - Free tier: 50 monitors, 5-minute intervals
2. Cron-Job.org (Free)
   - URL: https://cron-job.org
   - Setup: Create job → HTTP Request → Your health endpoint URL
   - Schedule: `0 12 * * *` (daily at 12:00 UTC)
   - Free tier: Unlimited jobs, 1-minute minimum interval
3. EasyCron (Free tier available)
   - URL: https://www.easycron.com
   - Setup: Create cron job → HTTP GET → Your health endpoint URL
   - Schedule: Daily
   - Free tier: 1 job, 1-hour minimum interval
4. GitHub Actions (If your repo is public)
   - Create `.github/workflows/keep-alive.yml`:

         name: Keep Alive
         on:
           schedule:
             - cron: '0 12 * * *' # Daily at 12:00 UTC
         jobs:
           ping:
             runs-on: ubuntu-latest
             steps:
               - name: Ping health endpoint
                 run: curl -f ${{ secrets.HEALTH_ENDPOINT_URL }} || exit 1

Health Endpoint URL:
https://your-app.vercel.app/api/health
Replace your-app with your actual Vercel deployment URL.
The project includes 7 automated validation checks:
npm test
What's tested:
- Project structure (all required files exist)
- Code quality (ESLint passes)
- Package scripts (deploy, test, lint, etc.)
- Dependencies (all installed correctly)
- Node.js compatibility (v18+)
- Module exports (fetcher functions work)
- Environment template (all variables documented)
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request

- Follow ESLint rules (run `npm run lint`)
- Use Prettier for formatting (run `npm run format`)
- Write meaningful commit messages
- Test your changes locally before submitting
- Ensure all tests pass (`npm test`)
- Update README if adding new features
- Keep dependencies up to date
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by Davod Mozafari