This project is a Python-based web scraping tool that retrieves articles from a specific webpage, filters them based on user-defined criteria, and saves them into organized directories. It uses BeautifulSoup for HTML parsing and requests for HTTP requests, and it automates file handling to build a structured directory of the articles.
Creating a New Conda Environment:
To create a new isolated environment for the project:
conda create --name articlescraper python=3.12
This command creates a new environment named articlescraper with Python 3.12.
Activating the Environment:
Activate the created environment with the following command:
conda activate articlescraper
This ensures that any Python operations or package installations are confined to this environment.
Installing Necessary Packages:
Install the required packages using the following command:
conda install beautifulsoup4 requests lxml
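To confirm the packages are available in the active environment, you can run a quick import check (note that beautifulsoup4 is imported under the name bs4):

python -c "import bs4, requests, lxml; print('All packages installed')"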
This project allows users to scrape and save articles from a webpage. The user inputs the number of pages to scrape and the desired article type. The scraper retrieves the articles, formats their titles as filenames, and saves the article text to .txt files in directories labeled by page number.
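The core flow can be sketched as follows. This is an illustrative outline, not the exact code in article_scraper.py: the URL pattern and the tag and class names are placeholders that would need to match the target site's actual HTML.

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.com/articles?page={}"  # placeholder URL pattern

    def scrape_page(page_number, article_type):
        # Fetch the page and parse it with the lxml parser.
        response = requests.get(BASE_URL.format(page_number))
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        articles = []
        # Placeholder selectors: adjust to the target page's markup.
        for item in soup.find_all("article"):
            type_tag = item.find("span", class_="type")
            if type_tag and type_tag.get_text(strip=True) == article_type:
                title = item.find("h2").get_text(strip=True)
                body = item.get_text(strip=True)
                articles.append((title, body))
        return articles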
Core Skills Demonstrated:
- Web Scraping: Uses BeautifulSoup to parse the HTML content of webpages and locate specific elements like article titles and links.
- HTTP Requests: Utilizes the requests library to fetch webpage content.
- File Management: Automates the creation of directories and the saving of files, ensuring content is properly organized.
- String Manipulation: Cleans and formats article titles for use as filenames, ensuring they are valid and consistent (see the sketch after this list).
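As an illustration of the string-manipulation step, here is one common way to turn an article title into a safe filename. This is a sketch of the general technique; the exact cleaning rules in article_scraper.py may differ:

    def title_to_filename(title):
        # Keep letters and digits; replace everything else with underscores,
        # then collapse repeated underscores so names stay readable.
        cleaned = "".join(c if c.isalnum() else "_" for c in title.strip())
        while "__" in cleaned:
            cleaned = cleaned.replace("__", "_")
        return cleaned.strip("_") + ".txt"

    # Example: title_to_filename("Latest AI Research: 2024!") -> "Latest_AI_Research_2024.txt"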
Running the Article Scraper:
To run the project, navigate to the directory containing article_scraper.py and execute the script:
python article_scraper.py
Functionality:
- User Input: Prompts the user to input the number of pages to scrape and the type of articles to save.
- Web Scraping: Retrieves articles based on the input criteria, including filtering by article type.
- File Creation: Automatically creates directories and saves the article text into .txt files named after the article titles (a minimal sketch follows this list).
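The file-creation step can be sketched like this, assuming the Page_N naming shown in the directory layout below; the helper name save_article is hypothetical:

    from pathlib import Path

    def save_article(page_number, filename, text):
        # Create the Page_N directory on demand, then write the article text.
        page_dir = Path(f"Page_{page_number}")
        page_dir.mkdir(exist_ok=True)
        (page_dir / filename).write_text(text, encoding="utf-8")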
The program will prompt the user for the number of pages to scrape and the type of articles they want to save. Here's an example session:
Enter the number of pages to scrape: 2
Enter the article type to filter (e.g., 'News'): Research Highlights
Saved all articles.
After running the script, your directory will be organized like this:
Project/
├── Page_1/
│   ├── Article_1.txt
│   └── Article_2.txt
└── Page_2/
    ├── Article_1.txt
    └── Article_2.txt
Contributions to this project are welcome. Please maintain the environment specifications and follow the coding standards used in this project.
This project is licensed under the MIT License - see the LICENSE file for details.