This project is a Python-based web scraping tool that retrieves articles from a specific webpage, filters them based on user-defined criteria, and saves them into organized directories. It uses BeautifulSoup for HTML parsing and requests for HTTP requests, and it automates file handling to build a structured directory of the articles.
Creating a New Conda Environment:
To create a new isolated environment for the project:
conda create --name articlescraper python=3.12
This command creates a new environment named articlescraper with Python 3.12.
Activating the Environment:
Activate the created environment with the following command:
conda activate articlescraper
This ensures that any Python operations or package installations are confined to this environment.
Installing Necessary Packages:
Install the required packages using the following command:
conda install beautifulsoup4 requests lxml
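To confirm the packages are available in the active environment, you can run a quick import check (note that beautifulsoup4 is imported under the name bs4):

python -c "import bs4, requests, lxml; print('All packages installed')"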
This project allows users to scrape and save articles from a webpage. The user inputs the number of pages to scrape and the desired article type. The scraper retrieves the articles, formats their titles as filenames, and saves the article text to .txt files in directories labeled by page number.
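The core flow can be sketched as follows. This is an illustrative outline, not the exact code in article_scraper.py: the URL pattern and the tag and class names are placeholders that would need to match the target site's actual HTML.

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.com/articles?page={}"  # placeholder URL pattern

    def scrape_page(page_number, article_type):
        # Fetch the page and parse it with the lxml parser.
        response = requests.get(BASE_URL.format(page_number))
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        articles = []
        # Placeholder selectors: adjust to the target page's markup.
        for item in soup.find_all("article"):
            type_tag = item.find("span", class_="type")
            if type_tag and type_tag.get_text(strip=True) == article_type:
                title = item.find("h2").get_text(strip=True)
                body = item.get_text(strip=True)
                articles.append((title, body))
        return articles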
Core Skills Demonstrated:
- Web Scraping: Uses BeautifulSoup to parse the HTML content of webpages and locate specific elements like article titles and links.
- HTTP Requests: Utilizes the requests library to fetch webpage content.
- File Management: Automates the creation of directories and the saving of files, ensuring content is properly organized.
- String Manipulation: Cleans and formats article titles for use as filenames, ensuring they are valid and consistent (see the sketch after this list).
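As an illustration of the string-manipulation step, here is one common way to turn an article title into a safe filename. This is a sketch of the general technique; the exact cleaning rules in article_scraper.py may differ:

    def title_to_filename(title):
        # Keep letters and digits; replace everything else with underscores,
        # then collapse repeated underscores so names stay readable.
        cleaned = "".join(c if c.isalnum() else "_" for c in title.strip())
        while "__" in cleaned:
            cleaned = cleaned.replace("__", "_")
        return cleaned.strip("_") + ".txt"

    # Example: title_to_filename("Latest AI Research: 2024!") -> "Latest_AI_Research_2024.txt"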
Running the Article Scraper:
To run the project, navigate to the directory containing article_scraper.py and execute the script:
python article_scraper.py
Functionality:
- User Input: Prompts the user to input the number of pages to scrape and the type of articles to save.
- Web Scraping: Retrieves articles based on the input criteria, including filtering by article type.
- File Creation: Automatically creates directories and saves the article text into .txt files named after the article titles (a minimal sketch follows this list).
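The file-creation step can be sketched like this, assuming the Page_N naming shown in the directory layout below; the helper name save_article is hypothetical:

    from pathlib import Path

    def save_article(page_number, filename, text):
        # Create the Page_N directory on demand, then write the article text.
        page_dir = Path(f"Page_{page_number}")
        page_dir.mkdir(exist_ok=True)
        (page_dir / filename).write_text(text, encoding="utf-8")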
The program will prompt the user for the number of pages to scrape and the type of articles they want to save. Here's an example session:
Enter the number of pages to scrape: 2
Enter the article type to filter (e.g., 'News'): Research Highlights
Saved all articles.
After running the script, your directory will be organized like this:
Project/
├── Page_1/
│   ├── Article_1.txt
│   └── Article_2.txt
└── Page_2/
    ├── Article_1.txt
    └── Article_2.txt
Contributions to this project are welcome. Please maintain the environment specifications and follow the coding standards used in this project.
This project is licensed under the MIT License - see the LICENSE file for details.