jonathan-kee/examTopicScraper

EXAMTOPICSCRAPER

I built this because I don't want to pay the expensive fee to see the certification dumps lol.

Resources to learn webscraping

  1. Website on webscraping & crawling https://webscraping.fyi/

^ https://webscraping.fyi/overview/browser-automation/

^ I am currently using browser automation instead of http clients

^ https://webscraping.fyi/overview/languages/#http-clients

  2. YouTube channels on webscraping & crawling: https://www.youtube.com/@MichaelMintz and https://www.youtube.com/@JohnWatsonRooney/playlists

  3. Airflow being used for webscraping https://www.youtube.com/watch?v=CraPKax37lo

Resources to learn Data Engineering

  1. Official Reddit Data Engineering website https://dataengineering.wiki/Tools/Tools

  2. Kodekloud https://kodekloud.com/courses/data-engineering-fundamentals

  3. Architecture for Data https://www.youtube.com/watch?v=gsUqW1IookY

  4. Workflow Orchestration

  5. Data Ingestion
  6. Data Processing
  7. Data Analytics
  • Excel
  • Power BI

12 Factor App Methodology

Github Actions solves

System Packages / Project Setup

  1. Install node version manager:

^ This won't work if ~/.profile exists. In that case, manually add the nvm loader to .bashrc by doing the following:

  • echo 'export NVM_DIR="$HOME/.nvm"
    [ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh" # This loads nvm
    [ -s "$NVM_DIR/bash_completion" ] && . "$NVM_DIR/bash_completion" # This loads nvm bash_completion' >> ~/.bashrc
  2. Make .bashrc take effect immediately by sourcing it:
  • source ~/.bashrc
  3. Install Node 20:
  • nvm install 20
  4. Verify Node:
  • node -v
  5. Install TypeScript:
  • npm install -g typescript
  6. Install dependencies:
  • npm install
  7. Compile TypeScript, then launch with Node and sample arguments:
  • tsc && node ./build/index.js

Github Actions / Forgejo Actions Setup

Docker Setup

  • Minio
  • Postgres

^ Both of these need a volume, otherwise their data would be lost when the container is removed

I am not entirely sure if I should create an image for the webscraper

^ If I am not mistaken, the 12 factor app methodology mentions containerizing scripts

^ The famous voting app has an image that is not a web server; it is a script that loops forever until it errors

Docker is actually a client-server system. It comes with:

  • Docker Daemon (dockerd)
  • REST API
  • Docker CLI (docker)

^ Based on Kodekloud's Docker-Certified-Associate-Exam-Course, you still need in-depth knowledge of Linux

Link to the Docker Certified Associate (DCA) exam study guide: https://a.storyblok.com/f/146871/x/2001ce939c/docker-study-guide_v1-5-jan-2025.pdf

Mirantis took over the Docker certification exam: https://training.mirantis.com/certification/dca-certification-exam/

An alternative to Kodekloud's docker courses https://labs.iximiuz.com/roadmaps/docker

Not sure why it matters that the process tree looks like that: https://labs.iximiuz.com/challenges/docker-101-container-run-in-background

Download an image without starting a container:

  1. docker image pull minio/minio

  2. docker image pull postgres:14.21-trixie

  3. Build Image for the webscraper

  4. Create docker compose file

DBT Setup

  1. Install the dbt Fusion CLI https://docs.getdbt.com/docs/fusion/install-fusion-cli

  2. Install the dbt extension by "dbt Labs Inc" https://docs.getdbt.com/docs/install-dbt-extension

Features to add

  • Rescrape pages that result in dirty data; this needs to update / merge existing data.

^ Partially added for answers, never needed for questions & discussions

  • Error handling for missing src images

  • Workflow Orchestration with Airflow

  • Column lineage with dbt

  • Schema Drift from upstream (The HTML, Javascript from examtopics)

  • SQL backup dump in the repository, so the data can be restored when switching computers

  • Maybe instead of processing the data while scraping, take all the raw data first and dump it into a data lake, then process it at some future point

^ Try to use MinIO as the data lake; Hadoop also works

  • Convert the project into a CLI
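
A minimal sketch of the rescrape-and-merge item at the top of this list, assuming each scraped answer can be identified by a question id plus a choice letter. The `Answer` shape, its field names, and `mergeAnswers` are hypothetical and not taken from this repository's code.

```typescript
// Hypothetical record shape; the real schema in this project may differ.
interface Answer {
  questionId: string;
  choice: string; // e.g. "A", "B"
  text: string;
}

// Build a stable key so a rescraped row can be matched to an existing one.
function answerKey(a: Answer): string {
  return a.questionId + ":" + a.choice;
}

// Upsert: a rescraped row replaces the existing row with the same key,
// unless its text is blank (treated here as dirty data and skipped).
function mergeAnswers(existing: Answer[], rescraped: Answer[]): Answer[] {
  const byKey: Record<string, Answer> = {};
  for (const a of existing) {
    byKey[answerKey(a)] = a;
  }
  for (const a of rescraped) {
    if (a.text.trim() === "") continue; // keep the old row
    byKey[answerKey(a)] = a;
  }
  return Object.keys(byKey).map((k) => byKey[k]);
}
```

In a database-backed setup the same idea would likely become an upsert on the key columns; the in-memory version only shows the matching rule.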

List of bugs to fix

Launch a browser that Google does not CAPTCHA

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/chrome-profile
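
Once Chrome is running with `--remote-debugging-port=9222`, an automation library can attach to it instead of launching its own browser. A minimal sketch, assuming Node 18+ (for the built-in `fetch`) and Chrome's DevTools HTTP endpoint `/json/version`. The helper names are hypothetical, and which library then consumes the WebSocket URL is left open.

```typescript
// Build the DevTools HTTP endpoint for a locally running Chrome.
function versionEndpoint(port: number, host: string = "127.0.0.1"): string {
  return "http://" + host + ":" + port + "/json/version";
}

// Ask the running browser for its WebSocket debugger URL.
// The promise rejects if nothing is listening on the port.
function getDebuggerUrl(port: number): Promise<string> {
  return fetch(versionEndpoint(port))
    .then((res) => {
      if (!res.ok) {
        throw new Error("DevTools endpoint returned " + res.status);
      }
      return res.json();
    })
    .then((info: { webSocketDebuggerUrl?: string }) => {
      if (!info.webSocketDebuggerUrl) {
        throw new Error("No webSocketDebuggerUrl in response");
      }
      return info.webSocketDebuggerUrl;
    });
}
```

Calling `getDebuggerUrl(9222)` only succeeds while the Chrome instance started above is still open.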

How to remove popup block

Apparently, if you edit class="popup-overlay show" to class="popup-overla show" (so the class name no longer matches), the popup will break
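
The same trick can be applied from an automation script rather than by hand in DevTools. A minimal sketch, assuming the overlay element carries the `popup-overlay` and `show` classes quoted above; `withoutClass` is a hypothetical helper, not something from this repository.

```typescript
// Remove one class token from a space-separated class attribute value.
function withoutClass(className: string, token: string): string {
  return className
    .split(/\s+/)
    .filter((t) => t.length > 0 && t !== token)
    .join(" ");
}
```

In a page context, assigning `withoutClass(el.className, "popup-overlay")` back to the overlay element's `className` should have the same effect as the manual edit, since the original class name no longer matches.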

Question with Screenshot (Unsure how to deal with images in question)

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/p

Screenshot

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/p/img

Question Screenshot full link

https://www.examtopics.com/assets/media/exam-media/04351/0000200001.png
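
If the `src` attribute on the `img` element is relative or missing, it helps to normalise it before downloading. A minimal sketch, assuming the site root is `https://www.examtopics.com` (taken from the full link above) and that a missing `src` should yield `null` rather than throw. `resolveImageUrl` is a hypothetical helper.

```typescript
// Assumed site root, taken from the full screenshot link.
const SITE_ROOT = "https://www.examtopics.com";

// Turn an img src (relative, absolute, or missing) into an absolute URL.
// Returns null when there is nothing usable, so callers can skip the image.
function resolveImageUrl(src: string | null | undefined): string | null {
  if (!src || src.trim() === "") return null;
  try {
    return new URL(src.trim(), SITE_ROOT).href;
  } catch {
    return null; // malformed src
  }
}
```

Returning `null` instead of throwing lets the scraper record a question without its image and move on.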

Answers

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[1]/text()

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[2]/text()

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[3]/text()

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[4]/text()

/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[5]/text()
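
The five answer paths above differ only in the `li` index, so they can be generated. A minimal sketch; `answerXPath` is a hypothetical helper, and the base path is copied from the list above, so it breaks whenever the page layout changes.

```typescript
// Base path of the answer list, copied from the XPaths above.
const ANSWER_LIST = "/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul";

// XPath indices are 1-based, so the first answer is n = 1.
function answerXPath(n: number): string {
  if (Math.floor(n) !== n || n < 1) {
    throw new RangeError("answer index must be a positive integer");
  }
  return ANSWER_LIST + "/li[" + n + "]/text()";
}
```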

Discussion texts

/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[1]/div/div[2]/div[2]

/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div[2]

Discussion upvotes

/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[1]/div/div[2]/div[3]/span[2]/span

/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div[3]/span[2]/span
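
The discussion paths above differ only in the comment index, with the trailing segment selecting either the text or the upvote count. A minimal sketch; both function names are hypothetical, and the base path is copied from the paths above.

```typescript
// Container that holds one div per comment, copied from the XPaths above.
const COMMENTS = "/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]";

// XPath for the text of the i-th comment (1-based).
function discussionTextXPath(i: number): string {
  return COMMENTS + "/div[" + i + "]/div/div[2]/div[2]";
}

// XPath for the upvote count of the i-th comment (1-based).
function discussionUpvotesXPath(i: number): string {
  return COMMENTS + "/div[" + i + "]/div/div[2]/div[3]/span[2]/span";
}
```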
