Coursework repository (2024) — includes report, notebooks, and datasets for 3 tasks:
- Optimization methods (GD/Momentum/Adagrad + extensions)
- Stock price forecasting (sequence length = 60) with neural networks & baselines
- MNIST image classification using a CNN
## Table of Contents

- Project Overview
- Repository Structure
- Tech Stack
- Datasets
- Tasks & Methodology
- How to Run
- Reproducibility Notes
- Authors
- License
## Project Overview

This repo stores the deliverables of a Machine Learning final project:
- A PDF report
- Three Jupyter notebooks (one for each question)
- Two CSV datasets
Each notebook is designed to be run independently.
## Repository Structure

Recommended structure after reorganizing:

```
.
├─ README.md
├─ requirements.txt
├─ .gitignore
├─ report/
│ └─ 52200207_52200214_52200216.pdf
├─ notebooks/
│ ├─ cau1_optimization_methods.ipynb
│ ├─ cau2_stock_forecasting.ipynb
│ └─ cau3_mnist_cnn.ipynb
└─ data/
├─ HousingData.csv
└─ data_src_2.csv
```

If you keep the original file names, everything still works — the structure above is only for cleanliness when publishing on GitHub.
## Tech Stack

- Python (recommended 3.10+)
- NumPy / Pandas for data processing
- Matplotlib for visualization
- Scikit-learn for metrics & baseline models
- TensorFlow / Keras for neural networks
- Jupyter Notebook / JupyterLab for experiments
## Datasets

### `data/HousingData.csv`

Columns:
- Features: `CRIM`, `ZN`, `INDUS`, `CHAS`, `NOX`, `RM`, `AGE`, `DIS`, `RAD`, `TAX`, `PTRATIO`, `B`, `LSTAT`
- Target: `MEDV`

Used in Q1 for testing optimization algorithms on a regression problem.
### `data/data_src_2.csv`

Columns: `Date`, `Adj Close`, `Close`, `High`, `Low`, `Open`, `Volume`, `Industry`, `Ticker`, `GDP`

Used in Q2 for time-series forecasting of `Open` prices by ticker.
## Tasks & Methodology

### Q1: Optimization Methods

Notebook: `notebooks/cau1_optimization_methods.ipynb`

Goal: implement and compare optimization methods for minimizing a regression loss.
Implemented methods include (as seen in the notebook):
- Batch Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini-batch Gradient Descent
- Momentum
- Adagrad
- (Extra extensions for comparison) RMSProp, Adam
Workflow (high-level; a minimal sketch follows the outputs below):
- Load `HousingData.csv`
- Standardize/prepare features and target
- Train a simple regression model (linear-regression style)
- Track loss per epoch for each optimizer
- Plot loss curves and compare final loss values
Outputs:
- Loss curves (matplotlib)
- Printed final losses per method
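The following is a minimal NumPy sketch of that loop, not the notebook's exact code: the file path assumes the reorganized layout above, and the learning rates are illustrative placeholders.

```python
# Sketch of the Q1 loop: linear regression trained with plain GD,
# Momentum, and Adagrad, tracking loss per epoch.
import numpy as np
import pandas as pd

df = pd.read_csv("data/HousingData.csv").dropna()  # drop rows with missing values, if any
X = df.drop(columns=["MEDV"]).to_numpy(dtype=float)
y = df["MEDV"].to_numpy(dtype=float)

# Standardize features so every optimizer sees comparable gradients
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.hstack([np.ones((len(X), 1)), X])  # bias column

def loss_and_grad(w):
    err = X @ w - y
    return np.mean(err**2), 2 * X.T @ err / len(y)

def train(update, epochs=200):
    w, state, losses = np.zeros(X.shape[1]), {}, []
    for _ in range(epochs):
        loss, g = loss_and_grad(w)
        w = update(w, g, state)
        losses.append(loss)
    return losses

def gd(w, g, state, lr=0.01):
    return w - lr * g

def momentum(w, g, state, lr=0.01, beta=0.9):
    state["v"] = beta * state.get("v", 0.0) - lr * g  # velocity accumulates past gradients
    return w + state["v"]

def adagrad(w, g, state, lr=0.5, eps=1e-8):
    state["G"] = state.get("G", 0.0) + g**2  # per-coordinate squared-gradient history
    return w - lr * g / (np.sqrt(state["G"]) + eps)

for name, upd in [("GD", gd), ("Momentum", momentum), ("Adagrad", adagrad)]:
    print(name, "final loss:", train(upd)[-1])
```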
### Q2: Stock Price Forecasting

Notebook: `notebooks/cau2_stock_forecasting.ipynb`

Goal: forecast future `Open` prices using a sliding window.
Core preprocessing (sketched below):
- Filter by `Ticker`
- Sort by `Date`
- Scale `Open` using `MinMaxScaler`
- Build sequences with `sequence_length = 60`
  - Input `X`: last 60 `Open` values
  - Target `y`: next `Open` value
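A minimal sketch of this windowing; the file path assumes the reorganized layout, and `"AAPL"` is a placeholder ticker, not necessarily present in the dataset.

```python
# Build (X, y) sliding-window pairs for one ticker.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

SEQ_LEN = 60

df = pd.read_csv("data/data_src_2.csv", parse_dates=["Date"])
one = df[df["Ticker"] == "AAPL"].sort_values("Date")  # placeholder ticker

scaler = MinMaxScaler()
open_scaled = scaler.fit_transform(one[["Open"]])  # shape (n, 1), values in [0, 1]

X, y = [], []
for i in range(SEQ_LEN, len(open_scaled)):
    X.append(open_scaled[i - SEQ_LEN:i, 0])  # last 60 Open values
    y.append(open_scaled[i, 0])              # next Open value
X, y = np.asarray(X), np.asarray(y)

X_lstm = X[..., np.newaxis]  # (samples, 60, 1) for the RNN/LSTM models
```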
Models covered in the notebook (an LSTM sketch follows the list):
- Linear Regression (neural form): single Dense layer output
- Feedforward Neural Network (FFNN): Dense(50) → Dense(50) → Dense(1)
- RNN/LSTM: stacked LSTM layers (sequence input)
- Decision Tree Regressor baseline (scikit-learn)
- Regularization utilities: Dropout, L2 Regularization, EarlyStopping (used for NN variants)
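One plausible Keras version of the stacked LSTM with the listed regularization utilities. This is a sketch: the dropout rate, L2 factor, and callback settings are assumptions, not the notebook's exact values. It continues from `X_lstm` / `y` in the preprocessing sketch above.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(60, 1)),
    layers.LSTM(50, return_sequences=True),  # stacked LSTM layers
    layers.Dropout(0.2),
    layers.LSTM(50),
    layers.Dense(1, kernel_regularizer=regularizers.l2(1e-4)),  # L2 on the output layer
])
model.compile(optimizer="adam", loss="mse")

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

history = model.fit(X_lstm, y, validation_split=0.1,
                    epochs=100, batch_size=32,
                    callbacks=[early_stop], verbose=0)
```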
Evaluation (metric calls sketched below):
- MSE (Mean Squared Error)
- R² score
- Plot predicted vs actual (by date) for each ticker
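The two metrics are one-liners in scikit-learn. In the notebook they are computed per ticker on a held-out split; for brevity this sketch reuses the arrays from above.

```python
# Metrics here are on scaled values; apply scaler.inverse_transform
# to report them in price units.
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_lstm).ravel()
print("MSE:", mean_squared_error(y, y_pred))
print("R^2:", r2_score(y, y_pred))
```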
Outputs:
- Metrics printed per ticker and per model
- Plots for actual vs predicted time series
- Training histories (loss curves) for neural nets
### Q3: MNIST Classification with a CNN

Notebook: `notebooks/cau3_mnist_cnn.ipynb`

Goal: classify handwritten digits (0–9) with a CNN.
Pipeline (sketched below):
- Load MNIST from `keras.datasets`
- Normalize images to `[0, 1]`, reshape to `(28, 28, 1)`
- One-hot encode labels
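These steps, expressed in Keras (a sketch mirroring the list, not copied from the notebook):

```python
import numpy as np
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., np.newaxis] / 255.0  # (60000, 28, 28, 1), in [0, 1]
x_test = x_test.astype("float32")[..., np.newaxis] / 255.0
y_train = keras.utils.to_categorical(y_train, 10)  # one-hot labels
y_test = keras.utils.to_categorical(y_test, 10)
```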
CNN architecture (from the notebook; expressed in Keras below):
- Input `(28, 28, 1)`
- `Conv2D(32, 3×3, ReLU)` → `MaxPool(2×2)`
- `Conv2D(64, 3×3, ReLU)` → `MaxPool(2×2)`
- `Flatten`
- `Dense(128, ReLU)` → `Dropout(0.5)`
- `Dense(10, Softmax)`
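The same stack as a Keras `Sequential` model:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
```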
Training (sketched below):
- Optimizer: Adam
- Loss: `categorical_crossentropy`
- Batch size: 128
- Epochs: 50
- Includes validation tracking
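A training call matching those settings; the `validation_split` value is an assumption, since the notebook only states that validation is tracked.

```python
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    batch_size=128, epochs=50,
                    validation_split=0.1)  # validation tracking

model.evaluate(x_test, y_test)  # test-set evaluation
```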
Outputs:
- Accuracy & loss curves
- Test set evaluation
- Sample predictions visualization
## How to Run

```bash
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate

pip install -r requirements.txt
jupyter lab
```

Open the notebooks in `notebooks/` and run cells from top to bottom.
## Reproducibility Notes

- For consistent results, keep random seeds fixed (some notebooks already include seeding).
- Neural networks may still show minor variance depending on:
- CPU vs GPU
- TensorFlow version
- Non-deterministic GPU kernels
If you want fully deterministic runs, you can additionally configure TensorFlow's deterministic ops (optional), as sketched below.
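A typical seeding block for reference; the `enable_op_determinism` call assumes TensorFlow 2.9+, and deterministic kernels can slow training.

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

tf.config.experimental.enable_op_determinism()  # optional: fully deterministic ops
```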
## Authors

- 52200207 – Phạm Tuấn Đạt
- 52200214 – Trần Hồ Hoàng Vũ
- 52200216 – Trần Khiết Lôi