Web Scraper with Playwright

This project demonstrates how to deploy a web scraper that collects all the links from a given webpage using Playwright in a Node.js environment. It is designed for use with Leapcell (leapcell.io) and is meant to help users learn how to deploy projects that depend on web scraping.

Prerequisites

Before running the application, you need to prepare the Playwright environment. To do so, execute the following script:

sh prepare_playwright_env.sh

This will:

Install the pinned version of Playwright together with Chromium and the system dependencies Chromium needs.
Install the required Node.js modules by running npm install.

Project Structure

.
├── LICENSE                           # License file for the project
├── package.json                      # Contains metadata and dependencies for the Node.js project
├── prepare_playwright_env.sh          # Script for setting up the Playwright environment
└── src
    ├── app.js                        # Main application entry point using Express and Playwright
    └── views
        ├── error.ejs                 # Error page template displayed when something goes wrong
        ├── partials
        │   └── header.ejs            # Header template shared across pages
        └── success.ejs               # Success page template, showing the scraped links

Running the Application

Once you've prepared the environment, you can start the web service with the following command:

npm start

The service will be available at http://localhost:3000. Enter the URL of the page you want to scrape, and the service returns a list of all links on that page.
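
The link collection described above might look roughly like the following. This is a minimal sketch, not the code in src/app.js: the function names scrapeLinks and normalizeLinks are hypothetical, and it assumes the playwright package is installed along with the Chromium build set up by prepare_playwright_env.sh.

```javascript
// Hypothetical helper (not from src/app.js): resolve raw href values against
// the page URL, keep only http(s) links, drop fragments, and remove duplicates.
function normalizeLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    try {
      const url = new URL(href, baseUrl);
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        url.hash = '';
        seen.add(url.href);
      }
    } catch {
      // Ignore href values that cannot be parsed as URLs.
    }
  }
  return [...seen];
}

// Hypothetical scraper: opens the page in headless Chromium and returns its links.
async function scrapeLinks(targetUrl) {
  // Loaded lazily so normalizeLinks can be used without Playwright installed.
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(targetUrl, { waitUntil: 'domcontentloaded' });
    // Read the raw href attribute of every anchor on the page.
    const hrefs = await page
      .locator('a[href]')
      .evaluateAll((anchors) => anchors.map((a) => a.getAttribute('href')));
    return normalizeLinks(hrefs, page.url());
  } finally {
    // Always release the browser, even if navigation fails.
    await browser.close();
  }
}
```

Closing the browser in a finally block keeps a failed navigation from leaving a Chromium process behind, which matters for a long-running web service.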

Explanation of `prepare_playwright_env.sh`

This script is responsible for setting up the environment necessary for Playwright to run. Here's a breakdown of what each line does:

#!/bin/sh

# Install playwright and its dependencies
npx -y playwright@1.50.1 install --with-deps chromium

# Install node modules
npm install

npx -y playwright@1.50.1 install --with-deps chromium: Runs the Playwright 1.50.1 command-line tool through npx (the -y flag skips npx's confirmation prompt), downloads the matching Chromium build, and installs the operating-system dependencies Chromium needs. Pinning the version keeps the scraping environment reproducible.
npm install: Installs the Node.js modules listed in package.json.
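
After the script finishes, one way to confirm that the environment works is to try launching Chromium once. The following is a minimal sketch under stated assumptions: canLaunchChromium is a hypothetical helper that is not part of this repository, and it takes the browser type as an argument so it can be called with the chromium object exported by the playwright package.

```javascript
// Hypothetical smoke test (not part of this repository): try to launch the
// given browser type once and report whether it worked.
async function canLaunchChromium(chromium) {
  try {
    const browser = await chromium.launch();
    const version = browser.version(); // browser version string
    await browser.close();
    return { ok: true, version };
  } catch (err) {
    // A missing browser build or missing system library typically surfaces here.
    return { ok: false, error: err.message };
  }
}
```

With Playwright installed, it could be called as canLaunchChromium(require('playwright').chromium) and the result logged before starting the service.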

Contact Support

If you have any issues or questions, feel free to reach out to support@leapcell.io.
