scrapes files (jpg, png, gif, webm) from website
Build & share delightful machine learning apps easily
Website | Documentation | Guides | Getting Started | Examples | 中文
This Python script allows you to scrape images from various websites such as Instagram, Reddit, 4channel, Warosu, and Desuarchive. The script uses the Chrome browser for scraping and requires certain dependencies to be installed.
demo.mp4
Before using the image scraper, please make sure you have the following prerequisites:
- Python 3.x installed on your system
- Chrome browser (if not using headless option, ensure that Chrome is closed)
- Google Chrome Driver compatible with your Chrome browser version
- Download the appropriate Chrome Driver from https://sites.google.com/chromium.org/driver/?pli=1
- Extract the Chrome Driver executable file from the downloaded archive.
- Move the Chrome Driver executable file to the "driver" folder within the project directory.
To install the required Python packages, run the following command:
pip install -r requirements.txt
To use the image scraper, execute the main.py script with the following command:
python app.py
The script provides several options that can be specified via command-line arguments or within the script file itself.
-
--injected: Enable this option to handle websites that inject their content during the initial page load. The script will wait until the page is fully loaded before scraping images. By default, this option is disabled. -
--max-images: Specify the maximum number of images you want to scrape from the website. This limits the number of images retrieved. If not specified, all available images will be scraped. -
--bulk: Enable this option to scrape images from multiple URLs provided in a text file. The URLs in the text file should be separated by commas. The file path must be provided as an argument. -
--headless: Enable this option to run the scraper in the background without opening a visible Chrome browser window. By default, the scraper opens a visible browser window. -
--types: Specify the types of files you want to scrape. This option allows you to filter the file types to be downloaded. Supported file types include JPG, PNG, GIF, and WebM. Specify the types as a comma-separated list. -
--pause: Enable this option to introduce a delay (in seconds) between opening each URL and downloading each file. This can be useful to prevent excessive requests to the website. By default, there is no pause between requests. -
--user-agent: Specify your user agent string. Some websites require a specific user agent to access their content. To find your user agent, visit https://www.whatismybrowser.com/detect/what-is-my-user-agent/ and copy the user agent string. Paste the user agent string into the app GUI's input field called "User Agent".
The script will display the scraped images in a gallery format, allowing you to view and interact with the downloaded images conveniently.
Please note that scraping images from websites may violate the terms of service of those websites. Make sure to use this script responsibly and respect the rights of the website owners. The developers of this script are not responsible for any misuse or legal issues arising from the use of this tool.
Happy scraping!