- Data File: `jobs.txt`
- Columns:
  - `y`: Job proficiency score (first column)
  - `t1`, `t2`, `t3`, `t4`: Scores on four aptitude tests
- Assign appropriate column headings after importing the data.
- Graphical Summaries: Generate scatterplots of job proficiency against each predictor and interpret what the plots suggest.
- All Possible Regressions: Fit all 16 possible models with the four predictors and record the model selection metrics $p$, $R^2$, $R^2_{a,p}$, $\mathrm{PRESS}_p$, $\mathrm{AIC}_p$, $\mathrm{BIC}_p$, and Mallows' $C_p$.
- Best Model Selection: Identify the best model under each criterion and determine which variable(s) can be excluded.
- Model Fitting & Diagnostics: Fit the best model according to $R^2_{a,p}$ and assess whether the model assumptions are met.
- Best Model with Two Predictors: Using $\mathrm{BIC}_p$ and Mallows' $C_p$, find the best model with two or fewer predictors.
- Validation: Compare $\mathrm{SSE}_p$ and $\mathrm{PRESS}_p$ for Model 11.
- Automated Search Procedures: Perform forward selection, backward elimination, and stepwise regression using the `step()` function and compare the selected models.
- Model Search with Conditions: Use the `regsubsets()` function from the `leaps` package to find the best models based on $R^2_{a,p}$, Mallows' $C_p$, and $\mathrm{BIC}_p$.
- DAAG Package: For computing $\mathrm{PRESS}_p$
- Leaps Package: For the `regsubsets()` function
- Ensure all plots and results are properly labeled and interpreted.
- Check that your models meet statistical assumptions and provide diagnostics when needed.
- Maintain clear and concise code structure and comments for readability.
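
The all-possible-regressions and automated-search steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: `jobs.txt` is not reproduced here, so the data frame `jobs` is simulated, and the names `fit_metrics`, `all_models`, `forward`, and `backward` are hypothetical. The sketch uses base R only; it computes PRESS from the hat values rather than through the DAAG package, and it uses the SSE-based forms $n \ln \mathrm{SSE}_p - n \ln n + 2p$ and $n \ln \mathrm{SSE}_p - n \ln n + p \ln n$ for AIC and BIC, which differ from R's `AIC()` and `BIC()` by a constant that does not affect model ranking.

```r
# Simulated stand-in for jobs.txt (hypothetical values, not the real scores)
set.seed(1)
n <- 25
jobs <- data.frame(t1 = rnorm(n, 100, 15), t2 = rnorm(n, 100, 15),
                   t3 = rnorm(n, 100, 15), t4 = rnorm(n, 100, 15))
jobs$y <- 20 + 0.3 * jobs$t1 + 0.5 * jobs$t3 + 0.4 * jobs$t4 + rnorm(n, sd = 4)

predictors <- c("t1", "t2", "t3", "t4")

# Every subset of the predictors, starting with the intercept-only model
subsets <- list(character(0))
for (k in seq_along(predictors)) {
  subsets <- c(subsets, combn(predictors, k, simplify = FALSE))
}

# The full model supplies the MSE used in Mallows' Cp
full <- lm(y ~ t1 + t2 + t3 + t4, data = jobs)
mse_full <- sum(resid(full)^2) / df.residual(full)

# Fit one candidate model and return its selection criteria
fit_metrics <- function(vars) {
  rhs   <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  fit   <- lm(as.formula(paste("y ~", rhs)), data = jobs)
  p     <- length(coef(fit))                      # parameters, incl. intercept
  sse   <- sum(resid(fit)^2)
  press <- sum((resid(fit) / (1 - hatvalues(fit)))^2)  # leave-one-out errors
  data.frame(
    model = rhs, p = p,
    R2    = summary(fit)$r.squared,
    R2adj = summary(fit)$adj.r.squared,
    SSE   = sse, PRESS = press,
    AIC   = n * log(sse) - n * log(n) + 2 * p,
    BIC   = n * log(sse) - n * log(n) + log(n) * p,
    Cp    = sse / mse_full - (n - 2 * p)
  )
}

all_models <- do.call(rbind, lapply(subsets, fit_metrics))
print(all_models, digits = 4)

# Automated searches with step(), which selects by AIC unless told otherwise
null     <- lm(y ~ 1, data = jobs)
scope    <- list(lower = ~ 1, upper = ~ t1 + t2 + t3 + t4)
forward  <- step(null, scope = scope, direction = "forward", trace = 0)
backward <- step(full, direction = "backward", trace = 0)
print(formula(forward))
print(formula(backward))
```

Sorting `all_models` by a single column (for example `all_models[order(all_models$BIC), ]`) gives the ranking under that criterion. Because $\mathrm{PRESS}_p$ is built from leave-one-out prediction errors, it is never smaller than $\mathrm{SSE}_p$ for the same model; a large gap between the two suggests the fit depends heavily on individual observations.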
This project aims to detect influential outliers in grocery data using R. The dataset contains weekly activity data from a national grocery retailer, and the analysis focuses on identifying anomalies in the response and predictor variables.
- File: `grocery.xlsx`
- Variables:
  - `labor`: Total labor hours per week (response variable)
  - `shipped`: Number of cases shipped in a week
  - `cost`: Labor cost as a percentage of total costs
  - `holiday`: Binary indicator (1 if the week includes a holiday, 0 otherwise)

We fit a first-order linear regression model without interaction effects and analyze different types of residuals:

    library(readxl)

    # Load the data and fit the first-order model (no interaction terms)
    data <- read_excel("grocery.xlsx")
    result <- lm(labor ~ shipped + cost + holiday, data = data)

    # Show the three residual plots side by side
    par(mfrow = c(1, 3))

    # Plot of residuals vs fitted values
    plot(result$fitted.values, result$residuals,
         main = "Residuals vs Fitted",
         xlab = "Fitted Values", ylab = "Residuals")

    # Plot of standardized residuals vs fitted values
    plot(result$fitted.values, rstandard(result),
         main = "Standardized Residuals vs Fitted",
         xlab = "Fitted Values", ylab = "Standardized Residuals")

    # Plot of studentized residuals vs fitted values
    plot(result$fitted.values, rstudent(result),
         main = "Studentized Residuals vs Fitted",
         xlab = "Fitted Values", ylab = "Studentized Residuals")

    # Reset graphics layout
    par(mfrow = c(1, 1))

We use studentized residuals and the Bonferroni correction to identify potential outliers:

    # Sample size and number of regression parameters (including the intercept)
    n <- nrow(data)
    p <- length(coef(result))

    # Compute externally studentized (deleted) residuals
    student.res <- rstudent(result)

    # Bonferroni-corrected critical value; each studentized deleted residual
    # follows a t distribution with n - p - 1 degrees of freedom
    alpha <- 0.05
    crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

    # Plot studentized residuals with critical thresholds
    plot(student.res,
         main = "Studentized Residuals with Critical Values",
         ylab = "Studentized Residuals")
    abline(h = c(crit, -crit), col = "red", lty = 2)

    # Identify outliers
    outliers <- which(abs(student.res) > crit)
    print("Indices of potential outliers:")
    print(outliers)

- Ensure that the `grocery.xlsx` file is in your working directory.
- Open the R environment (e.g., RStudio, VSCode) and load the provided scripts.
- Execute the code blocks in the order presented.
- Visual analysis: Residual plots highlight any patterns or anomalies in the data.
- Outlier detection: Observations whose studentized residuals fall beyond the Bonferroni-corrected thresholds are flagged as potential outliers.
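
The code above screens for outliers in the response only. Since the project description also mentions anomalies in the predictors and influential observations, one possible extension is sketched below using leverage values and Cook's distance. It is a sketch under assumptions, not part of the original scripts: the data frame `grocery` is simulated because `grocery.xlsx` is not reproduced here, the planted observation in week 7 is hypothetical, and the cutoffs ($2p/n$ for leverage, the median of the $F(p, n-p)$ distribution for Cook's distance) are common rules of thumb rather than thresholds stated in this project.

```r
# Simulated stand-in for grocery.xlsx (hypothetical values, not the real data)
set.seed(42)
n <- 52
grocery <- data.frame(
  shipped = rnorm(n, mean = 300000, sd = 50000),
  cost    = runif(n, min = 5, max = 9),
  holiday = as.integer(seq_len(n) %% 13 == 0)   # four holiday weeks
)

# Plant one week with an unusually large shipment to create a leverage point
grocery$shipped[7] <- 600000

# Response generated from a first-order model plus noise
grocery$labor <- 4000 + 0.001 * grocery$shipped - 10 * grocery$cost +
  600 * grocery$holiday + rnorm(n, sd = 150)

fit <- lm(labor ~ shipped + cost + holiday, data = grocery)
p   <- length(coef(fit))            # parameters, including the intercept

# Leverage: diagonal of the hat matrix; flags unusual predictor values
lev <- hatvalues(fit)
high_leverage <- which(lev > 2 * p / n)

# Cook's distance: how much all fitted values move when one case is dropped
cook <- cooks.distance(fit)
influential <- which(cook > qf(0.5, df1 = p, df2 = n - p))

print(high_leverage)
print(influential)

# Visual check of both measures
par(mfrow = c(1, 2))
plot(lev, type = "h", main = "Leverage", ylab = "Hat value")
abline(h = 2 * p / n, col = "red", lty = 2)
plot(cook, type = "h", main = "Cook's Distance", ylab = "Cook's D")
par(mfrow = c(1, 1))
```

A high-leverage week is not necessarily harmful: if its response still follows the fitted relationship, Cook's distance stays small. Observations flagged by both measures are the ones worth examining first.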