Command-line tool to recursively download images from a given website.
The command below downloads every picture on the site https://mugshots.com/.
python mega_scraper.py -s https://mugshots.com/
MegaScraper provides a set of flags that can be used to customize its behaviour. For example, the following command downloads 100 images that are at least 300px wide and 300px tall from https://mugshots.com/.
python mega_scraper.py -s https://mugshots.com/ -hm 100 -mw 300 -mh 300
A full list of flags can be found below.
In order to use MegaScraper you'll need to install all its dependencies. The easiest way to do so is to follow these steps:
- Download the package from GitHub.
- Use virtualenv to create a virtual environment:
  virtualenv -p python3 venv
- Activate the virtual environment:
  source venv/bin/activate
- Install the requirements:
  pip install -r requirements.txt
That's it.
While MegaScraper was developed to be used as a command-line tool, nothing prevents you from importing it as if it were a regular Python package. For example:
import mega_scraper
url = 'https://mugshots.com/'
scraper = mega_scraper.MegaScraper(url)
scraper.scrape(max_pages=10)
scraper.download(how_many=100)
Note, however, that MegaScraper is not currently on PyPI, so you won't be able to install it via pip. Just download mega_scraper.py from GitHub.
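Since the module is a single downloaded file rather than an installed package, Python has to be told where to find it. The snippet below is a minimal sketch of one way to do that, assuming mega_scraper.py was saved in a folder of your choice. The helper name make_importable and the folder ~/downloads/MegaScraper are hypothetical and not part of MegaScraper.

```python
import sys
from pathlib import Path


def make_importable(folder):
    """Put `folder` at the front of sys.path so modules inside it can be imported.

    Hypothetical helper for illustration; it is not part of MegaScraper.
    """
    resolved = str(Path(folder).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumed location of the downloaded mega_scraper.py; adjust to your setup.
scraper_dir = make_importable("~/downloads/MegaScraper")

try:
    import mega_scraper
except ImportError:
    # mega_scraper.py (or one of its dependencies) was not found.
    mega_scraper = None
```

Placing mega_scraper.py next to your own script, or in the current working directory, usually makes this step unnecessary.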
Below you can find a list of all flags currently supported by MegaScraper. You can find additional and more granular documentation in the mega_scraper.py file itself.
- --seed or -s to specify the seed URL. It's the only non-optional flag.
- --regex_pages or -rp to specify from which pages to download images.
- --regex_images or -ri to specify which image URLs to consider.
- --min_width or -mw to specify the minimum width an image must have to be downloaded.
- --min_height or -mh to specify the minimum height an image must have to be downloaded.
- --output_folderpath or -of to specify the output folder.
- --output_structure or -os to specify the output structure (either flat or grouped).
- --output_naming or -on to specify the naming system for the output files (either keep or numerical).
- --images_per_folder or -if to specify how many images per folder when grouped.
- --folder_initial_num or -fn to specify the number for the first folder when grouped.
- --max_pages or -mp to specify the maximum number of pages to crawl.
- --how_many or -hm to specify how many images to download.
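To make the flags more concrete, here is a rough sketch of how an interface like this could be declared with argparse, and how the grouped output options might map an image to a file path. It is not the code in mega_scraper.py. The default values, the 0-based image index, and the helper names build_parser and target_path are assumptions made for illustration only.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare the documented flags. All default values are assumptions."""
    p = argparse.ArgumentParser(prog="mega_scraper.py")
    p.add_argument("-s", "--seed", required=True)
    p.add_argument("-rp", "--regex_pages")
    p.add_argument("-ri", "--regex_images")
    p.add_argument("-mw", "--min_width", type=int, default=0)
    p.add_argument("-mh", "--min_height", type=int, default=0)
    p.add_argument("-of", "--output_folderpath", default="images")
    p.add_argument("-os", "--output_structure",
                   choices=["flat", "grouped"], default="flat")
    p.add_argument("-on", "--output_naming",
                   choices=["keep", "numerical"], default="keep")
    p.add_argument("-if", "--images_per_folder", type=int, default=100)
    p.add_argument("-fn", "--folder_initial_num", type=int, default=0)
    p.add_argument("-mp", "--max_pages", type=int)
    p.add_argument("-hm", "--how_many", type=int)
    return p


def target_path(index, original_name, args):
    """Return an assumed save path for the image at 0-based position `index`."""
    root = Path(args.output_folderpath)
    if args.output_naming == "numerical":
        # Assumed scheme: replace the file name with the image's position.
        name = f"{index}{Path(original_name).suffix}"
    else:
        name = original_name
    if args.output_structure == "grouped":
        # Assumed scheme: fill numbered folders one after another.
        folder = args.folder_initial_num + index // args.images_per_folder
        return root / str(folder) / name
    return root / name


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(target_path(0, "example.jpg", args))
```

Under these assumptions, running the sketch with -os grouped -if 50 -fn 1 would place the first 50 images in folder 1, the next 50 in folder 2, and so on.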
