arifgit12/web-scraper-app

🕷️ Web Scraper App - Spring Boot

A Spring Boot application with a professional Swing GUI for scraping news websites and analyzing articles with AI-powered sentiment analysis.

🚀 Features

Tab 1: Website Link Scraper (📰)

  • Smart Link Extraction: Scrapes and lists latest news articles from websites
  • Content Preview: Click any link to view full article content
  • Image Display: Shows article images with proper loading and scaling
  • News Focus: Filters out navigation/footer links, shows only articles
  • Performance Optimized: Background loading prevents UI freezing
  • Professional UI: Modern blue color scheme with numbered article list
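
As a rough illustration of the "News Focus" filtering, here is a minimal sketch of a link heuristic. The class name `ArticleLinkFilter`, the hint list, and the four-word headline rule are all assumptions made for this example; the app's actual rules are not shown in this README. The cap of 25 mirrors the "Max article links" limit listed under Configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; the heuristics are assumptions, not the app's actual rules. */
final class ArticleLinkFilter {
    static final int MAX_LINKS = 25; // mirrors the documented "Max article links" limit

    // Path fragments that usually indicate navigation or footer pages (assumed list).
    private static final List<String> NON_ARTICLE_HINTS = List.of(
            "/about", "/contact", "/privacy", "/terms", "/login", "/signin", "/subscribe");

    /** Returns true when the link text and URL look like a news article. */
    static boolean isLikelyArticle(String linkText, String url) {
        if (linkText == null || url == null) return false;
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        if (!lowerUrl.startsWith("http")) return false;              // skips mailto:, javascript:, anchors
        if (linkText.trim().split("\\s+").length < 4) return false;  // very short text is rarely a headline
        for (String hint : NON_ARTICLE_HINTS) {
            if (lowerUrl.contains(hint)) return false;
        }
        return true;
    }

    /** Keeps up to MAX_LINKS unique article URLs, preserving page order. Each pair is {text, url}. */
    static List<String> selectArticles(List<String[]> textAndUrlPairs) {
        Set<String> unique = new LinkedHashSet<>();
        for (String[] pair : textAndUrlPairs) {
            if (unique.size() >= MAX_LINKS) break;
            if (isLikelyArticle(pair[0], pair[1])) unique.add(pair[1]);
        }
        return new ArrayList<>(unique);
    }
}
```

A substring check like this can misclassify an article whose path happens to contain one of the hints, so a real filter would likely need site-specific tuning.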

Tab 2: Article Analyzer (📊)

  • Detailed Article Parsing: Extract headline, author, publish date, and content
  • Sentiment Analysis: AI-powered emotion detection (😊 Positive/😟 Negative/😐 Neutral)
  • Word Count: Automatic article statistics
  • Image Extraction: Finds and displays article images
  • Keyword Analysis: Shows positive/negative sentiment keywords
  • Professional UI: Modern green color scheme with structured report format

📤 Export & Batch Features (NEW!)

  • CSV Export: Export individual article analysis to CSV format
  • PDF Export: Generate professional PDF reports with complete analysis
  • Batch Analysis: Analyze multiple articles and store them in memory
  • Batch Export: Export all analyzed articles at once to CSV or PDF
  • Flexible Workflow: Add articles to batch, export when ready, or clear batch

🎨 User Interface

The application features a professional, industry-standard UI with:

  • Modern Color Scheme: Blue and green themes with professional grays
  • Intuitive Icons: Emoji icons for better visual recognition
  • Clear Feedback: Loading states, success/error messages with helpful suggestions
  • Responsive Layout: Split panes with adjustable dividers
  • Enhanced Typography: Clear fonts and proper spacing throughout
  • Hover Tooltips: Full information on hover for truncated text
  • Numbered Lists: Easy-to-follow article enumeration
  • Status Bar: Real-time application status with tips

🛠️ Technology Stack

  • Spring Boot 3.4.8 - Application framework
  • Java Swing - Desktop GUI with custom styling
  • JSoup 1.18.1 - HTML parsing and web scraping
  • Apache HTTP Client - HTTP connections
  • Apache Commons CSV 1.10.0 - CSV export functionality
  • iText7 7.2.5 - PDF generation and export
  • Java 17 - Runtime environment

📋 Prerequisites

  • Java 17 or higher
  • Maven 3.6+
  • Internet connection for web scraping

🏃‍♂️ Running the Application

Option 1: Maven

mvn spring-boot:run

Option 2: JAR

mvn clean package
java -jar target/web-scraper-app-1.0.0.jar

Option 3: IDE

Run the WebScraperApplication.java main class

🌐 Supported Websites

✅ Confirmed Working:

  • BBC News (https://www.bbc.com/)
  • CNN (https://www.cnn.com/)
  • Reuters (https://www.reuters.com/)
  • NBC News (https://www.nbcnews.com/)
  • The Guardian (https://www.theguardian.com/)

⚠️ May Block Automated Requests:

  • Telegraph India
  • Many paywalled news sites
  • Sites with heavy JavaScript content loading

📖 How to Use

Website Link Scraper (Tab 1 - 📰)

  1. Enter a news website URL (e.g., https://www.bbc.com/)
  2. Click the "🔍 Get Latest Articles" button to scrape the site
  3. Browse the numbered list of articles on the left
  4. Select any article to view its content and images
  5. Images load automatically in the background
  6. Clear success/error messages guide you throughout

Article Analyzer (Tab 2 - 📊)

  1. Paste a specific article URL in the input field
  2. Click the "🧠 Analyze with AI" button
  3. View the comprehensive analysis in the left panel:
    • Headline with hover for full text
    • Author information
    • Publication Date
    • Sentiment Analysis with emoji indicator and color coding
  4. Read the structured report in the main area:
    • Formatted headline and metadata
    • Sentiment analysis with score
    • Sentiment keywords (positive/negative)
    • Full article content
  5. View extracted images below the content
  6. All processing happens in the background for smooth experience
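
Background processing in a Swing application is commonly done with `SwingWorker`. The sketch below assumes that approach; the README does not say which mechanism the app uses. The class `BackgroundTask` and its callback parameters are hypothetical names. Word counting stands in for the slower fetch-and-analyze step.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import javax.swing.SwingWorker;

/** Hypothetical helper showing background work with SwingWorker. */
final class BackgroundTask {
    /** Stand-in for real analysis work: counts whitespace-separated words. */
    static int countWords(String text) {
        String trimmed = text == null ? "" : text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Runs `work` off the Event Dispatch Thread (EDT) and delivers the result
     * or the failure back on the EDT, where touching Swing components is safe.
     */
    static <I, O> SwingWorker<O, Void> run(I input,
                                           Function<I, O> work,
                                           Consumer<O> onSuccess,
                                           Consumer<Exception> onError) {
        SwingWorker<O, Void> worker = new SwingWorker<O, Void>() {
            @Override
            protected O doInBackground() {
                return work.apply(input);   // runs on a worker thread
            }

            @Override
            protected void done() {         // runs on the EDT
                try {
                    onSuccess.accept(get());
                } catch (Exception e) {     // covers ExecutionException and InterruptedException
                    onError.accept(e);
                }
            }
        };
        worker.execute();
        return worker;
    }
}
```

A caller might use it as `BackgroundTask.run(articleText, BackgroundTask::countWords, n -> statusLabel.setText(n + " words"), e -> statusLabel.setText("Analysis failed"))`, where `statusLabel` is an assumed `JLabel`.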

Export & Batch Analysis (NEW! 📤)

Exporting Single Articles:

  1. After analyzing an article, use the export buttons in the left panel
  2. Click "💾 CSV" to export to CSV format
  3. Click "📄 PDF" to export to PDF format
  4. Choose the save location in the file dialog
  5. Get confirmation when export is successful

Batch Analysis Workflow:

  1. Analyze an article as usual
  2. Click "➕ Add to Batch" to store it for batch processing
  3. Repeat steps 1-2 for multiple articles
  4. Click "📦 Export Batch" when ready
  5. Choose CSV or PDF format
  6. All articles are exported to a single file
  7. Use "🗑️ Clear Batch" to start fresh
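
The batch workflow above can be sketched as an in-memory list with add, export, and clear operations. This is an illustration only: the class `ArticleBatch`, the `Analysis` record, and its fields are assumed names. To stay self-contained, the sketch writes CSV by hand with RFC 4180 quoting instead of Apache Commons CSV, which the app lists as its CSV library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory batch; field names are assumptions for illustration. */
final class ArticleBatch {
    /** One analyzed article (assumed shape). */
    record Analysis(String headline, String author, String sentiment, double score, int wordCount) {}

    private final List<Analysis> items = new ArrayList<>();

    void add(Analysis a) { items.add(a); }   // "Add to Batch"
    void clear() { items.clear(); }          // "Clear Batch"
    int size() { return items.size(); }

    /** "Export Batch": renders every stored article into one CSV document. */
    String toCsv() {
        StringBuilder sb = new StringBuilder("headline,author,sentiment,score,word_count\r\n");
        for (Analysis a : items) {
            sb.append(escape(a.headline())).append(',')
              .append(escape(a.author())).append(',')
              .append(escape(a.sentiment())).append(',')
              .append(a.score()).append(',')
              .append(a.wordCount()).append("\r\n");
        }
        return sb.toString();
    }

    /** RFC 4180 quoting: wrap the field in quotes when needed and double embedded quotes. */
    static String escape(String field) {
        if (field == null) return "";
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }
}
```

Keeping the batch in memory matches the described workflow, and it also means the batch is lost when the application closes unless it is exported first.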

Benefits of Batch Analysis:

  • Compare multiple articles at once
  • Generate consolidated reports
  • Save time with bulk exports
  • Perfect for research and analysis tasks

🧠 Sentiment Analysis

The built-in sentiment analyzer:

  • Analyzes the emotional tone of articles using word-based analysis
  • Scores each article from -1.0 (most negative) to +1.0 (most positive)
  • Color-codes the result: 🟢 Positive (Green), 🔴 Negative (Red), 🔵 Neutral (Blue)
  • Shows an emoji indicator: 😊 for positive, 😟 for negative, 😐 for neutral
  • Lists the sentiment-bearing keywords found in the article
  • Reports word count and sentiment score metrics
  • Presents a structured report with clear sections and formatting
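
To make the word-based approach concrete, here is a minimal sketch of a lexicon scorer that produces values in the -1.0 to +1.0 range. The class `SentimentSketch`, the tiny word lists, the normalization formula, and the 0.1 neutral band are all assumptions. The app's real lexicon and formula are not documented here, so this sketch is not expected to reproduce its exact scores.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical word-based scorer; the word lists are tiny illustrative samples. */
final class SentimentSketch {
    private static final Set<String> POSITIVE = Set.of("progress", "improve", "success", "growth", "win");
    private static final Set<String> NEGATIVE = Set.of("crisis", "problem", "decline", "concern", "loss");
    private static final double NEUTRAL_BAND = 0.1; // assumed threshold around zero

    /** Returns (positive - negative) / (positive + negative), which lies in [-1.0, +1.0]. */
    static double score(String text) {
        int pos = 0;
        int neg = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(word)) pos++;
            else if (NEGATIVE.contains(word)) neg++;
        }
        int hits = pos + neg;
        return hits == 0 ? 0.0 : (double) (pos - neg) / hits; // no sentiment words means neutral
    }

    /** Maps a score to the three labels used in the UI. */
    static String label(double score) {
        if (score > NEUTRAL_BAND) return "Positive";
        if (score < -NEUTRAL_BAND) return "Negative";
        return "Neutral";
    }
}
```

Dividing by the number of sentiment words keeps the score bounded regardless of article length. A scorer that divides by total word count would give smaller magnitudes for long articles.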

⚙️ Configuration

Timeouts and Limits

  • Connection timeout: 5-8 seconds
  • Read timeout: 8-10 seconds
  • Max article links: 25 (for performance)
  • Max images per article: 3-5
  • Max image size: 300x200px (scaled automatically)
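
The automatic scaling to 300x200px can be illustrated with a small aspect-ratio calculation. This is a sketch under assumptions: the class `ImageFit` is hypothetical, and it assumes images are only shrunk, never enlarged, which the README does not confirm.

```java
import java.awt.Dimension;

/** Hypothetical helper that fits an image inside the documented 300x200 box. */
final class ImageFit {
    static final int MAX_WIDTH = 300;
    static final int MAX_HEIGHT = 200;

    /** Shrinks (never enlarges) to fit within MAX_WIDTH x MAX_HEIGHT, keeping the aspect ratio. */
    static Dimension fit(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        // Use the tighter of the two ratios, capped at 1.0 so small images stay as they are.
        double scale = Math.min(1.0, Math.min((double) MAX_WIDTH / width, (double) MAX_HEIGHT / height));
        return new Dimension(Math.max(1, (int) Math.round(width * scale)),
                             Math.max(1, (int) Math.round(height * scale)));
    }
}
```

The resulting size could then be passed to `Image.getScaledInstance(width, height, Image.SCALE_SMOOTH)` before the image is shown in a label.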

Request Headers

The application sends browser-like request headers to reduce the chance of being blocked:

  • Modern Chrome User-Agent
  • Accept headers for HTML/images
  • Referer headers for legitimacy
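
A minimal sketch of such a request, using only the JDK's `HttpURLConnection` so it stays self-contained (the app itself lists JSoup and Apache HTTP Client). The class `BrowserLikeRequest`, the User-Agent string, and the Accept value are assumed examples, not the app's real values. The timeouts use the upper ends of the ranges listed under Timeouts and Limits.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; header values are illustrative assumptions. */
final class BrowserLikeRequest {
    // Assumed example value; the app's real User-Agent string is not shown in this README.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    /** Builds the three header groups described above. */
    static Map<String, String> headers(String referer) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", USER_AGENT);
        headers.put("Accept", "text/html,application/xhtml+xml,image/*;q=0.8,*/*;q=0.5");
        headers.put("Referer", referer);
        return headers;
    }

    /** Prepares a GET request; nothing is sent until the caller reads the response. */
    static HttpURLConnection prepare(String url, String referer) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(8_000);   // upper end of the 5-8 second connection timeout
        conn.setReadTimeout(10_000);     // upper end of the 8-10 second read timeout
        headers(referer).forEach(conn::setRequestProperty);
        return conn;
    }
}
```

Sending browser-like headers does not override a site's terms of use or its robots.txt, so the usual scraping etiquette still applies.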

🔧 Dependencies

Add to your pom.xml:

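<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <!-- HTML parsing; version aligned with the Technology Stack section -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.18.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents.client5</groupId>
        <artifactId>httpclient5</artifactId>
    </dependency>
    <!-- CSV export -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.10.0</version>
    </dependency>
    <!-- PDF export -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itext7-core</artifactId>
        <version>7.2.5</version>
        <type>pom</type>
    </dependency>
</dependencies>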

🚨 Error Handling

The application handles common issues:

  • 403 Forbidden: Website blocks automated requests
  • Connection timeouts: Network or server issues
  • SSL errors: Certificate problems with HTTPS sites
  • Image loading failures: Graceful fallbacks with error messages
  • Content extraction failures: Clear user feedback
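
One way to turn these failure cases into clear user feedback is a small mapping from status codes and exceptions to messages. The sketch below is an assumption about how that could look: the class `ScrapeErrors` and the message wording are hypothetical, not the app's actual text.

```java
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

/** Hypothetical mapping from failures to user-facing messages (wording is assumed). */
final class ScrapeErrors {
    /** Describes a non-success HTTP status code. */
    static String describeStatus(int status) {
        if (status == 403) {
            return "403 Forbidden: the website blocks automated requests. Try a different site.";
        }
        return "Unexpected HTTP status " + status + ".";
    }

    /** Describes an exception raised while fetching a page or image. */
    static String describe(Throwable error) {
        if (error instanceof SocketTimeoutException) {
            return "Connection timed out: the network or server may be slow. Try again.";
        }
        if (error instanceof SSLException) {
            return "SSL error: the site's HTTPS certificate could not be verified.";
        }
        if (error instanceof UnknownHostException) {
            return "Unknown host: check the URL and your internet connection.";
        }
        return "Could not load content: " + error.getMessage();
    }
}
```

Keeping the mapping in one place makes it easier to show consistent messages in both tabs.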

📊 Example Output

Article Analysis Report Format:

═══════════════════════════════════════════════════════════════
                    ARTICLE ANALYSIS REPORT
═══════════════════════════════════════════════════════════════

📰 HEADLINE
───────────────────────────────────────────────────────────────
Breaking: Major Economic Policy Changes Announced

✍️  METADATA
───────────────────────────────────────────────────────────────
Author:    John Smith
Published: 2024-08-07 10:30:00
Words:     847 words

💭 SENTIMENT ANALYSIS
───────────────────────────────────────────────────────────────
Overall Sentiment: 😟 Negative (Score: -0.23)

🔑 SENTIMENT KEYWORDS
───────────────────────────────────────────────────────────────
✅ Positive: progress, improve, success
❌ Negative: crisis, problem, decline, concern

📄 ARTICLE CONTENT
═══════════════════════════════════════════════════════════════
[Full article text here...]

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐛 Known Issues

  • JavaScript-heavy sites: JSoup cannot execute JavaScript, so dynamic content may not be captured
  • Anti-bot protection: Some sites actively block automated requests
  • Image loading: Some images may fail to load because of hotlink protection or authentication requirements

💡 Tips for Best Results

  1. Use major news sites: BBC, CNN, and Reuters work best
  2. Check robots.txt: Respect website scraping policies
  3. Don't overwhelm servers: Built-in delays prevent server overload
  4. Try different URLs: If one site blocks, try alternatives
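
The idea of spacing requests out can be sketched as a minimum interval between consecutive fetches. The class `PoliteDelay` and the interval value are assumptions; the README does not describe how the app's built-in delays are implemented.

```java
/** Hypothetical limiter that enforces a minimum gap between requests. */
final class PoliteDelay {
    private final long minIntervalMillis;
    private boolean hasPrevious = false;
    private long lastSendAt = 0L;

    PoliteDelay(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns how many milliseconds the caller should wait before sending the
     * next request, and records the planned send time for the following call.
     */
    synchronized long reserve(long nowMillis) {
        long earliest = hasPrevious ? lastSendAt + minIntervalMillis : nowMillis;
        long sendAt = Math.max(nowMillis, earliest);
        hasPrevious = true;
        lastSendAt = sendAt;
        return sendAt - nowMillis;
    }
}
```

A caller would typically run `Thread.sleep(delay.reserve(System.currentTimeMillis()))` on a background thread before each fetch, never on the Swing Event Dispatch Thread.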

🔮 Future Enhancements

  • Export analysis results to PDF/CSV ✅ COMPLETED
  • Batch article analysis ✅ COMPLETED
  • Advanced sentiment analysis with machine learning
  • Support for RSS feeds
  • Custom keyword tracking
  • Article comparison features

Built with ❤️ using Spring Boot and Java Swing

For questions or issues, please open a GitHub issue or contact the maintainer.

About

Scrapes news from websites
