I'm doing this because I don't want to pay the expensive fee to see the certification dumps lol.
- Website on webscraping & crawling https://webscraping.fyi/
^ https://webscraping.fyi/overview/browser-automation/
^ I am currently using browser automation instead of HTTP clients
^ https://webscraping.fyi/overview/languages/#http-clients
- Youtube channels on webscraping & crawling https://www.youtube.com/@MichaelMintz https://www.youtube.com/@JohnWatsonRooney/playlists
- Airflow being used for webscraping https://www.youtube.com/watch?v=CraPKax37lo
- Official Reddit Data Engineering wiki https://dataengineering.wiki/Tools/Tools
- Kodekloud https://kodekloud.com/courses/data-engineering-fundamentals
- Architecture for Data https://www.youtube.com/watch?v=gsUqW1IookY
- Workflow Orchestration
  - Airflow certification link: https://academy.astronomer.io/page/astronomer-certification
  - Notes on Airflow certification: https://substack.com/@michaelsalata/p-181463528
- Data Ingestion
  - Dlt Certification links: https://dlthub.learnworlds.com/course/dlt-fundamentals https://dlthub.learnworlds.com/course/dlt-advanced
- Data Processing
  - DBT Certification links: https://learn.getdbt.com/learning-paths/dbt-certified-developer https://learn.getdbt.com/learning-paths/dbt-certified-cloud-architect
  - Apache Spark
- Data Analytics
  - Excel
  - Power BI
GitHub Actions solves
- Codebase
- Build, Release and Run
^ You have to learn Git before GitHub Actions makes sense ^ Can try: https://blinry.itch.io/oh-my-git
- Install node version manager:
- curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
^^^^^^ The above won't work if ~/.profile exists; then you need to manually add the nvm lines to ~/.bashrc:
- cat >> ~/.bashrc <<'EOF'
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh" # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && . "$NVM_DIR/bash_completion" # This loads nvm bash_completion
EOF
- Make .bashrc take effect immediately by sourcing:
- source ~/.bashrc
- Install Node 20:
- nvm install 20
- Verify node:
- node -v
- Install TypeScript:
- npm install -g typescript
- Install Dependencies:
- npm install
- Compile TypeScript and launch with Node (pass arguments as needed):
- tsc && node ./build/index.js
- Minio
- Postgres ^ Both of these need to use a volume, otherwise data would be lost
I am not entirely sure if I should create an image for the webscraper ^ If I am not mistaken, the 12-factor app mentions containerizing scripts ^ The famous voting app has an image that is not a web server; it's more like a script that loops forever until it errors
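If I do containerize it, a minimal Dockerfile sketch (assuming the TypeScript layout from the setup notes above, with the entrypoint at ./build/index.js):

```dockerfile
FROM node:20-slim
WORKDIR /app
# Copy manifests first so the npm install layer is cached across code changes
COPY package*.json ./
RUN npm install
COPY . .
# Compile TypeScript at build time
RUN npx tsc
# Run the scraper as a long-lived script (voting-app style), not a web server
CMD ["node", "./build/index.js"]
```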
Docker is actually client-server; it comes with
- Docker Daemon (dockerd)
- REST API
- Docker CLI (docker) ^ Based on Kodekloud's Docker-Certified-Associate-Exam-Course, you still need in-depth knowledge of Linux
Link to Docker Certified Associate (DCA) Exam: https://a.storyblok.com/f/146871/x/2001ce939c/docker-study-guide_v1-5-jan-2025.pdf
Mirantis took over the Docker Certification Exam: https://training.mirantis.com/certification/dca-certification-exam/
An alternative to Kodekloud's docker courses https://labs.iximiuz.com/roadmaps/docker
Not sure why it matters that the process tree looks like that https://labs.iximiuz.com/challenges/docker-101-container-run-in-background
Download an image without starting a container:
- docker image pull minio/minio
- docker image pull postgres:14.21-trixie
- Build image for the webscraper
- Create docker compose file
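A minimal compose sketch for the two images above, with named volumes so the Minio and Postgres data survive container removal (service names, ports, and the placeholder password are my assumptions, not from an actual setup):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - minio-data:/data
  postgres:
    image: postgres:14.21-trixie
    environment:
      POSTGRES_PASSWORD: example   # placeholder, change it
    ports:
      - "5432:5432"
    volumes:
      - pg-data:/var/lib/postgresql/data
volumes:
  minio-data:
  pg-data:
```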
- Install DBT fusion https://docs.getdbt.com/docs/fusion/install-fusion-cli
- Install DBT extension by "dbt Labs Inc" https://docs.getdbt.com/docs/install-dbt-extension
- Rescrape pages that result in dirty data; need to update / merge existing data ^ Partially added for answers, never needed for questions & discussions
- Error handling for missing src images
- Workflow Orchestration with Airflow
- Column lineage with dbt
- Schema drift from upstream (the HTML/JavaScript from examtopics)
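One cheap way to notice schema drift early is to assert on the fields a scraped record should have; a sketch, where the field names are my placeholder assumptions, not the project's actual schema:

```typescript
// Hypothetical expected shape of one scraped question record.
const EXPECTED_FIELDS = ["question", "choices", "answer", "discussion"];

// Return the expected fields a scraped record is missing, so a scrape run
// can fail loudly when examtopics' HTML changes instead of storing dirty data.
function missingFields(record: Record<string, unknown>): string[] {
  return EXPECTED_FIELDS.filter((field) => !(field in record));
}
```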
- SQL backup dump in the repository, so when switching computers we get the data back
- Maybe instead of scraping the data directly, grab all the raw pages first and dump them into a data lake, then only process them at some future point ^ Try to use MinIO as the data lake; Hadoop also works
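Sketch of that raw-first idea: write every fetched page to the lake under a date-partitioned key and process it later. The key layout is my assumption; in practice the write would go through MinIO's S3 API instead of the filesystem:

```typescript
// Build a date-partitioned object key for a raw scraped page, so the lake
// keeps every day's snapshot and processing can happen at any later point.
function rawPageKey(examId: string, questionNo: number, scrapedAt: Date): string {
  const day = scrapedAt.toISOString().slice(0, 10); // YYYY-MM-DD partition
  return `raw/examtopics/${examId}/dt=${day}/question-${questionNo}.html`;
}
```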
- Convert the project into a CLI
- Answer cannot be scraped: https://www.examtopics.com/discussions/oracle/view/92435-exam-1z0-071-topic-1-question-24-discussion/ ^ The answer was already scraped, but it is contained within the questions
- Need to handle https://www.examtopics.com/assets/media/exam-media/04351/0002400002.jpg ^ Apparently there was nothing wrong with my scraping code; the image's src just did not appear, meaning the resource did not load
- Need to rescrape images from 103, 119, 120, 127, 128, 131, 133, 146, 166, 228, 236, 245, 256
- If 'pngMost' exists in a filename, replace 'pngMost' with 'png' and replace 'Voted' with ''
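That rule as one helper, on my assumption that the "Most Voted" badge text gets fused onto the extension, so both 'pngMost' and the stray 'Voted' have to go:

```typescript
// Clean an image filename where the "Most Voted" badge text leaked into it,
// e.g. "0002400002.pngMost Voted" -> "0002400002.png".
function normalizeSrc(src: string): string {
  if (!src.includes("pngMost")) return src; // already clean
  return src.replace("pngMost", "png").replace("Voted", "").trim();
}
```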
Launch Chrome with remote debugging enabled (so the browser automation can attach over CDP):
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/chrome-profile
Apparently if you edit class="popup-overlay show" to class="popup-overla show", the popup breaks (the renamed class no longer matches the page's selectors)
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/p
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/p/img
https://www.examtopics.com/assets/media/exam-media/04351/0000200001.png
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[1]/text()
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[2]/text()
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[3]/text()
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[4]/text()
/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul/li[5]/text()
/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[1]/div/div[2]/div[2]
/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div[2]
/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[1]/div/div[2]/div[3]/span[2]/span
/html/body/div[2]/div/div[4]/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div[3]/span[2]/span
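The five choice XPaths above differ only in the li index, so they can be generated instead of hard-coded; a sketch, with the base path copied from the list above:

```typescript
// Base XPath of the <ul> holding the answer choices (from the notes above).
const CHOICES_UL = "/html/body/div[2]/div/div[4]/div/div[1]/div[2]/div[2]/ul";

// Generate the text() XPath for each of the first n choices (li is 1-indexed).
function choiceXPaths(n: number): string[] {
  return Array.from({ length: n }, (_, i) => `${CHOICES_UL}/li[${i + 1}]/text()`);
}
```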