@devZenta
Owner

@devZenta devZenta commented Dec 8, 2025

This pull request introduces a new data cleaning script and refactors the IMDB scraper to improve code consistency and readability. The main changes include the addition of a cleaner.py script for preprocessing movie data, standardizing string formatting and selector usage across the scraper, and minor improvements to path handling in the main entry point.

New functionality:

  • Added src/cleaner/cleaner.py to clean and preprocess raw movie metadata, selecting relevant columns and removing rows with missing values. The output is saved to imdb_cleaned.csv and a preview is printed.
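A minimal sketch of the cleaning step described above, written with only the standard library so it runs anywhere. The column names in `KEEP_COLUMNS` and the helper names `clean_rows` and `clean_csv` are assumptions for illustration; the actual `cleaner.py` may use different columns or a library such as pandas.

```python
import csv
import io

# Hypothetical column names; the PR description does not list the columns kept.
KEEP_COLUMNS = ["title", "rating", "duration", "genres"]


def clean_rows(rows, keep=KEEP_COLUMNS):
    """Keep only the `keep` columns and drop rows where any of them is missing or blank."""
    cleaned = []
    for row in rows:
        # Missing keys and None values are treated as empty strings.
        selected = {col: (row.get(col) or "").strip() for col in keep}
        if all(selected.values()):
            cleaned.append(selected)
    return cleaned


def clean_csv(src, dst, keep=KEEP_COLUMNS):
    """Read raw CSV from the file-like `src` and write the cleaned CSV to `dst`."""
    cleaned = clean_rows(csv.DictReader(src), keep)
    writer = csv.DictWriter(dst, fieldnames=keep)
    writer.writeheader()
    writer.writerows(cleaned)
    return cleaned


# In-memory stand-ins for the raw input file and imdb_cleaned.csv.
raw = io.StringIO(
    "title,rating,duration,genres,url\n"
    "Movie A,8.1,120,Drama,http://example.com/a\n"
    "Movie B,,95,Comedy,http://example.com/b\n"
)
out = io.StringIO()
rows = clean_csv(raw, out)
print(rows[:5])  # preview of the first cleaned rows
```

Here the second row is dropped because its rating is empty, and the `url` column is discarded because it is not in the kept set.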

Scraper refactoring and consistency improvements:

  • Standardized all string quotes to double quotes and reformatted selectors and dictionary keys throughout src/scrapping/scraper.py for consistency and readability.
  • Improved selector formatting and handling for movie details extraction (title, rating, duration, genres, director, actors) to make the code easier to maintain and less error-prone.
  • Updated the CSV export logic in the scraper to use consistent fieldnames and quoting.
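One way the CSV export with consistent fieldnames and quoting could look, as a hedged sketch rather than the code in `scraper.py`. The field list is inferred from the details named in this description, and `export_movies` is a hypothetical helper name; the quoting mode actually used in the PR is not stated, so `csv.QUOTE_ALL` is an assumption.

```python
import csv
import io

# Hypothetical field list, inferred from the details named in the PR description.
FIELDNAMES = ["title", "rating", "duration", "genres", "director", "actors"]


def export_movies(movies, fh, fieldnames=FIELDNAMES):
    """Write movie dicts to `fh` with a fixed column order and every field quoted."""
    writer = csv.DictWriter(
        fh,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_ALL,    # assumption: quote every field
        extrasaction="ignore",    # silently skip keys not in `fieldnames`
    )
    writer.writeheader()
    for movie in movies:
        # Keys missing from a movie dict are written as empty strings.
        writer.writerow(movie)


buf = io.StringIO()
export_movies(
    [{"title": "Movie A", "rating": "8.1", "genres": "Drama, Crime", "url": "x"}],
    buf,
)
print(buf.getvalue())
```

Fixing the field list in one place keeps the column order stable across runs, and quoting every field avoids ambiguity when a value such as a genre list contains commas.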

Minor improvements:

  • Cleaned up path handling in src/main.py by standardizing quotes and import order.

@devZenta devZenta added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Dec 8, 2025
@devZenta devZenta merged commit 8a22420 into main Dec 8, 2025
4 checks passed

3 participants