188 changes: 188 additions & 0 deletions Week-05-Business-Stats-Analytics/Exercise-DONT-EDIT-MAKE-COPY.ipynb
@@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 5 Exercise: Business Statistics on MovieLens (Questions Only)\n",
"\n",
"Use the MovieLens 100k ratings dataset to practice business statistics. This exercise focuses on descriptive statistics, confidence intervals, and hypothesis tests covered in Week 5. Do NOT include answers in this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset\n",
"**File**: `data/movie_ratings.csv`\n\n",
"Columns:\n",
"- `user_id`, `movie_id`, `rating` (1–5), `timestamp`\n",
"- `age`, `gender` (M/F), `occupation`, `zip_code`\n",
"- `title`, `year`, `decade`, `genres` (pipe-separated), `rating_year`\n",
"\n",
"Notes:\n",
"- Movies can belong to multiple genres.\n",
"- Use reasonable minimum-sample thresholds (e.g., n ≥ 50 per group) to avoid noisy results.\n",
"- Where you compute CIs, use 95% CIs unless stated otherwise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"You may use pandas, NumPy, SciPy/statsmodels, and seaborn/Matplotlib for analysis and basic plots."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('data/movie_ratings.csv')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Questions\n",
"Answer each using code cells and a brief interpretation. Apply appropriate tests from Week 5 (t-tests, ANOVA, chi-square, correlation) and include 95% CIs where requested."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q1. Gender rating gap (t-test)\n",
"Do average ratings differ by gender (M vs F)?\n",
"- Report group means, counts, and 95% CIs.\n",
"- Perform a two-sample t-test (Welch).\n",
"- Interpret both statistical and practical significance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q2. Age group differences (ANOVA)\n",
"Do average ratings differ across age groups?\n",
"- Define age groups (e.g., `<=17`, `18–24`, `25–34`, `35–44`, `45–54`, `55+`).\n",
"- Report means with 95% CIs per group (filter out very small groups).\n",
"- Run one-way ANOVA across age groups and interpret."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q3. Top genres for users under 25\n",
"Among users under 25 (age < 25):\n",
"- Identify the top 3 genres by average rating, using a minimum n per genre (e.g., n ≥ 50).\n",
"- Show the mean, 95% CI, and counts for those genres."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q4. Best release decade\n",
"Which `decade` has the highest average rating?\n",
"- Compute mean rating by decade with 95% CIs and counts.\n",
"- Use ANOVA to test overall differences (drop decades with very low n).\n",
"- State the top decade and discuss sample-size considerations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q5. Comedy vs. Drama popularity\n",
"Are comedies more popular than dramas? For this question, define popularity as the share of high ratings (rating >= 4).\n",
"- Create a `high_rating` flag.\n",
"- Compare the overall high-rating proportion for `Comedy` vs `Drama` (two-proportion z-test or chi-square).\n",
"- As context, show the `Comedy` vs `Drama` high-rating shares for the top occupations by rating count (no need to test each occupation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q6. Action ratings: under 25 vs 25+ (t-test)\n",
"Within the `Action` genre only, do users under 25 rate differently than users 25 and older?\n",
"- Compare group means and 95% CIs.\n",
"- Perform a two-sample t-test (Welch) and interpret the result."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q7. Occupation differences (ANOVA)\n",
"Among the top 6 occupations by volume, do mean ratings differ?\n",
"- Report means with 95% CIs for each of the top 6.\n",
"- Run one-way ANOVA and interpret results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q8. High-satisfaction genres\n",
"Across the top 10 genres by volume, which have the highest share of high ratings (>= 4)?\n",
"- Rank genres by high-rating proportion, reporting counts and 95% CIs for the proportions (method of your choice; state which one you used).\n",
"- Discuss at least one potential confound (e.g., popularity, age/occupation mix)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q9. Popularity vs. ratings\n",
"Do more popular movies receive different average ratings?\n",
"- Define popularity as the number of ratings per movie, and bin movies into popularity quartiles.\n",
"- Compare mean ratings across popularity quartiles (ANOVA; use Kruskal–Wallis if assumptions are dubious).\n",
"- Briefly interpret the pattern."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q10. Age vs. rating correlation\n",
"Is user `age` correlated with `rating`?\n",
"- Compute the Pearson correlation with a 95% CI.\n",
"- Optionally, include a simple scatter plot with a trend line.\n",
"- Interpret the strength and direction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Submission\n",
"- Include clear interpretations for each result.\n",
"- Keep plots simple and readable (no dashboards required).\n",
"- State any assumptions and thresholds used (e.g., minimum n per group)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.x"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
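The dataset notes in the notebook (pipe-separated `genres`, a minimum n per group, 95% CIs) can be sketched on toy data without answering any of the questions. This is a minimal illustration, not the course's reference method: the rows are made up, `mean_ci_by_genre` is a hypothetical helper, and the interval uses the normal approximation (z = 1.96), which is only reasonable at thresholds like the suggested n ≥ 50. It uses the standard library only; in the notebook itself, pandas `str.split("|")` followed by `explode` would likely play the same role.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical toy rows of (rating, pipe-separated genres); NOT MovieLens data.
rows = [
    (5, "Comedy|Romance"),
    (4, "Comedy"),
    (3, "Drama"),
    (4, "Drama|Romance"),
    (2, "Comedy|Drama"),
    (5, "Drama"),
]


def mean_ci_by_genre(rows, min_n=2, z=1.96):
    """Return {genre: (n, mean, ci_low, ci_high)} for genres with n >= min_n.

    A rating counts once for every genre its movie belongs to.
    The interval is mean +/- z * s / sqrt(n) (normal approximation, assumed here).
    """
    by_genre = defaultdict(list)
    for rating, genres in rows:
        for genre in genres.split("|"):
            by_genre[genre].append(rating)

    summary = {}
    for genre, ratings in by_genre.items():
        n = len(ratings)
        # The sample standard deviation needs at least two observations.
        if n < max(min_n, 2):
            continue
        mean = statistics.mean(ratings)
        half_width = z * statistics.stdev(ratings) / math.sqrt(n)
        summary[genre] = (n, mean, mean - half_width, mean + half_width)
    return summary


# A tiny threshold is used only so the toy data produces output.
summary = mean_ci_by_genre(rows, min_n=3)
for genre, (n, mean, low, high) in sorted(summary.items()):
    print(f"{genre}: n={n}, mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Because one rating contributes to several genres, the per-genre groups overlap; that is worth keeping in mind wherever a test assumes independent groups.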