ehuang08/Datathon-2025

Data Sprouts

Video link: https://youtu.be/uZu5tsH0Z0Q

Steps to run the code:

  • Make sure all datasets are in the working directory
  • Run the code in R in this order: transactions_cleaning_visualization file, Prophet Model, classifier_data_creation file, Classifier Model

transactions_cleaning_visualization file:

  • removed columns with a high number of missing values
  • plotted transaction_date vs amount for all data
  • the data was heavily right-skewed; applied a natural log transform to reduce the skew, then removed outliers using the IQR rule
  • filtered the data to the previous 8 months only
  • plotted transaction_date vs amount for those 8 months, plus a bar graph of total amount per month
  • wrote the cleaned data to a new csv: cleaned_transaction.csv
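
The cleaning steps above can be sketched in base R as follows. This is a minimal illustration on synthetic data, not the repository's script: the toy data frame, the `log_amount` column, and the choice to apply the IQR rule on the log scale are all assumptions.

```r
# Synthetic stand-in for the transactions data (column names are assumptions).
set.seed(1)
tx <- data.frame(
  transaction_date = as.Date("2024-01-01") + sample(0:420, 500, replace = TRUE),
  amount = rlnorm(500, meanlog = 4, sdlog = 1)  # right-skewed amounts
)

# Natural log transform to reduce the right skew.
tx$log_amount <- log(tx$amount)

# IQR rule: keep values within 1.5 * IQR of the first and third quartiles.
q <- unname(quantile(tx$log_amount, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
cleaned <- tx[tx$log_amount >= lower & tx$log_amount <= upper, ]

# Keep the 8 most recent calendar months (the latest month plus the 7 before it).
latest_month <- as.Date(format(max(cleaned$transaction_date), "%Y-%m-01"))
cutoff <- seq(latest_month, by = "-7 months", length.out = 2)[2]
recent <- cleaned[cleaned$transaction_date >= cutoff, ]

# Total amount per month, as used for the bar graph.
recent$month <- format(recent$transaction_date, "%Y-%m")
monthly <- aggregate(amount ~ month, data = recent, FUN = sum)
print(monthly)
```

Whether the outlier filter runs before or after the log transform changes which rows are dropped, so the order is worth confirming against the actual script.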

Prophet Model:

  • https://facebook.github.io/prophet/
  • using the cleaned data, we first converted the months into Date format
  • filtered to the past 8 months (from July 2024)
  • Prophet trains on a data frame with two columns: ds (the date, our independent variable) and y (the value to forecast, our dependent variable)
  • set ds = month
  • set y = total amount
  • trained the model
  • made a future dataframe for the next 10 months so that the forecast includes Q4 (October to December)
  • predicted future spending
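
A rough sketch of these steps with the `prophet` R package is shown below. The monthly totals are made-up placeholder numbers, the `"YYYY-MM"` month format is an assumption, and the 8 months are assumed to start in July 2024. The Prophet calls run only if the package is installed.

```r
# Placeholder monthly totals (made-up values, for illustration only).
monthly <- data.frame(
  month = format(seq(as.Date("2024-07-01"), by = "month", length.out = 8), "%Y-%m"),
  total_amount = c(120000, 135000, 128000, 142000, 150000, 171000, 139000, 133000)
)

# Prophet expects a data frame with columns ds (Date) and y (numeric).
df <- data.frame(
  ds = as.Date(paste0(monthly$month, "-01")),
  y = monthly$total_amount
)

# With data ending in February, a 10-month horizon reaches December,
# so the forecast covers Q4 (October to December).
horizon <- seq(max(df$ds), by = "month", length.out = 11)[-1]
print(range(horizon))

if (requireNamespace("prophet", quietly = TRUE)) {
  m <- prophet::prophet(df)  # fit the model
  future <- prophet::make_future_dataframe(m, periods = 10, freq = "month")
  forecast <- predict(m, future)
  print(tail(forecast[, c("ds", "yhat", "yhat_lower", "yhat_upper")], 10))
}
```

With only 8 monthly observations the model has very little to learn from, so the uncertainty interval (`yhat_lower`, `yhat_upper`) is likely more informative than the point forecast.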

classifier_data_creation file:

  • merged columns from multiple csv files into one dataset to train k-means clustering on
  • merged account numbers, account open date, payment history of 1-12 months, and gross fraud amount (potential risk signs)
  • visualized account open date vs payment history 1-12 months as a bubble plot, with gross fraud amount as the third variable
  • wrote two new csv files: classifier_data.csv and payment_seperated_classifier_data.csv (payment_seperated_classifier_data.csv takes the payment history from 1-12 months and separates accounts whose history contains multiple letters)
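
The merge step might look roughly like this in base R. All data frames, column names, and values here are hypothetical placeholders; the payment history strings in particular are arbitrary, since the encoding used by the real datasets is not described above. Filling missing fraud amounts with 0 is also an assumption.

```r
# Hypothetical inputs, one per source csv (names and values are placeholders).
accounts <- data.frame(
  account_id = c("A1", "A2", "A3"),
  open_date = as.Date(c("2019-05-01", "2021-11-15", "2023-02-20"))
)
payments <- data.frame(
  account_id = c("A1", "A2", "A3"),
  payment_hist_1_12_mths = c("hist_1", "hist_2", "hist_3")  # placeholder strings
)
fraud <- data.frame(
  account_id = "A2",
  gross_fraud_amt = 250
)

# Left-join everything onto the account list by account number.
merged <- Reduce(
  function(x, y) merge(x, y, by = "account_id", all.x = TRUE),
  list(accounts, payments, fraud)
)

# Accounts with no fraud record get 0 (an assumption, not confirmed by the source).
merged$gross_fraud_amt[is.na(merged$gross_fraud_amt)] <- 0

print(merged)
```

Using `all.x = TRUE` keeps every account even when it has no fraud record; an inner join would silently drop those accounts from the training data.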

Classifier Model:

  • used k-means clustering
  • using classifier_data.csv, created a new feature called months_since_open
  • used only the features relevant to the four target groups: accounts eligible for a credit line increase without risk, accounts eligible for a credit line increase but at risk of default or fraud, accounts that need no credit line increase, and non-performing accounts
  • normalized the data
  • trained the model with 4 clusters as output
  • plotted the data to visualize the clusters
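
A minimal sketch of the clustering steps, using base R's `kmeans` on synthetic data. The feature set, the reference date, and the average month length of 30.44 days are assumptions made for illustration; the real model may use different features.

```r
# Synthetic stand-in for classifier_data.csv (columns and values are assumptions).
set.seed(42)
n <- 200
d <- data.frame(
  open_date = as.Date("2015-01-01") + sample(0:3500, n, replace = TRUE),
  gross_fraud_amt = rexp(n, rate = 1 / 100)
)

# Derived feature: approximate months since the account was opened.
ref_date <- as.Date("2025-03-01")  # assumed reference date
d$months_since_open <- as.numeric(ref_date - d$open_date) / 30.44

# Normalize so each feature has mean 0 and standard deviation 1;
# otherwise the feature with the largest scale dominates the distances.
features <- scale(d[, c("months_since_open", "gross_fraud_amt")])

# k-means with 4 clusters; nstart reruns from several random starts.
fit <- kmeans(features, centers = 4, nstart = 25)
d$cluster <- fit$cluster

# Visualize the clusters in the normalized feature space.
plot(features, col = fit$cluster, pch = 19,
     xlab = "months_since_open (scaled)", ylab = "gross_fraud_amt (scaled)")

print(table(d$cluster))
```

k-means assigns arbitrary cluster numbers, so mapping the 4 clusters to the four account groups requires inspecting each cluster's feature averages afterwards.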
