Before starting this exercise, ensure you have the following:
- Python 3.11+ - Download Python
- GitHub Account - Sign up for GitHub
- Basic Python programming (functions, API calls, data structures)
- Basic understanding of GitHub PRs and Issues
Run these commands to verify your setup:
python --version # Should show Python 3.11 or higher
Collaborative code development is about multiple developers working together on a codebase. It uses tools like Git and follows practices like pair programming and code review. It also includes DevOps practices like Continuous Integration/Continuous Deployment (CI/CD). The goal is to make development within a team easier while ensuring code quality. Platforms like GitHub produce many artifacts during development that can be mined to understand how a codebase evolves. These artifacts include issues, pull requests, commits, and code reviews.
This exercise will focus on investigating these collaborative development artifacts. You will learn to:
- Manually explore a GitHub repository to extract information about PRs and Issues
- Write a Python script to automate repository mining using the GitHub API
- Compare manual findings with automated results
For this exercise, you will investigate the MarkupSafe library repository:
- Repository: pallets/markupsafe
- URL: https://github.com/pallets/markupsafe
- Description: Safely add untrusted strings to HTML/XML markup.
The use of Generative AI tools (e.g., ChatGPT, Cursor, GitHub Copilot, Claude) is permitted for this exercise, subject to the following guidelines.
Permitted uses:
- Understanding GitHub API documentation
- Debugging error messages
- Learning Python syntax for API calls
- Clarifying concepts about GitHub PRs, Issues, and API responses
Not permitted:
- Generating the complete mine_repo.py script (you must research how to mine a GitHub repository yourself by reading the API documentation)
- Using AI-enabled IDEs to generate the entire mining script
- Having AI write your investigation findings
Requirements:
- You must be able to explain any code you submit
- Document any AI assistance in your submission (brief note at the end of your PDF)
Total Time: 75 minutes
Create your own repository from this template:
- Click "Use this template" button (green button at the top of the repo)
- Select "Create a new repository"
- Name it appropriately (e.g., SAhandons-topic1-yourname)
Clone your repository:
git clone <your-repo-url>
cd <repo-name>
Note: Do NOT fork or clone this template directly. Always use the "Use this template" button to create your own copy.
Goal: Manually explore the target repository using GitHub's web interface and collect specific information about its collaborative development artifacts.
Task 1: Repository Statistics
Go to the GitHub repository and investigate it to answer the following questions:
- How many open Pull Requests are there?
- How many closed Pull Requests are there?
- How many merged (closed) Pull Requests are there? (Hint: Use filters to find merged PRs - these are a subset of closed PRs)
- How many total Issues are there? How many of those are closed vs. open?
- How many contributors have contributed to this repository?
- How many total commits are in this repository?
Task 2: Example PR with Passing Build
Find one example of a PR that has a passing build (green checkmark ✅) and record:
- PR number and title
- PR URL
- Screenshot showing the passing CI status
Task 3: Example PR with Failing Build
Find one example of a PR that has a failing build (red X ❌ or failed status) and record:
- PR number and title
- PR URL
- Screenshot showing the failing CI status
Task 4: Example PR with Inline Comments
Find one example of a PR that has inline code review comments (comments on specific lines of code) and record:
- PR number and title
- PR URL
- Screenshot showing the inline comments
- Brief description of what the comment discusses
Document your findings:
Create a document with:
- Repository statistics findings
- Links and screenshots for the three example PRs (passing build, failing build, inline comments)
- Brief record of any GenAI tools used during the investigation (if applicable)
Goal: Complete the Python script (mine_repo.py) that automatically extracts the repository statistics you collected manually in Part 1.
Note: Starter code is provided in mine_repo.py. You need to implement the mining logic using the GitHub API.
Script Requirements:
- Accept repository owner and name as command-line arguments
- Select your desired approach to access GitHub data (e.g., via PyGithub)
- Extract the following statistics:
  - Number of open PRs
  - Number of closed PRs (total closed, including merged and closed without merging)
  - Total number of issues (closed vs. open)
  - Number of contributors
  - Number of commits
- Display results in a clear, readable format. An example is provided below.
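The first requirement, accepting the repository owner and name as command-line arguments, could be met with Python's standard argparse module. The following is only a minimal sketch: the helper name parse_args and the help strings are illustrative assumptions, not part of the starter code.

```python
import argparse


def parse_args(argv=None):
    """Parse the repository owner and name from the command line.

    `parse_args` is a hypothetical helper name, not part of the starter code.
    """
    parser = argparse.ArgumentParser(
        description="Mine basic statistics from a GitHub repository."
    )
    parser.add_argument("owner", help="repository owner, e.g. pallets")
    parser.add_argument("repo", help="repository name, e.g. markupsafe")
    # With argv=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try without a shell.
args = parse_args(["pallets", "markupsafe"])
print(f"{args.owner}/{args.repo}")
```

In the real script you would call parse_args() with no argument so that the values come from the command line.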
Steps:
- Create a GitHub Personal Access Token:
  - Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)
  - Click "Generate new token (classic)"
  - Name: "Repository Mining Script"
  - Select scope: public_repo (or repo if accessing private repos)
  - Generate and copy the token
- Create a .env file:
  GITHUB_TOKEN=your_token_here
  Important: Add .env to your .gitignore file!
- Choose a library for GitHub mining:
  Research and select an approach for accessing GitHub data (e.g., PyGithub, PyDriller, or GitHub REST API with requests) and add the chosen library to requirements.txt.
- Write your script:
  Implement the mining logic in mine_repo.py to extract and display the same 6 statistics you collected manually in Part 1:
  - Number of open Pull Requests
  - Number of closed Pull Requests (total closed, broken down into merged and closed without merging)
  - Number of total Issues (both open and closed)
  - Number of Contributors
  - Number of Commits
- Create a virtual environment and install dependencies:
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
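Once the token is stored in .env, the script needs to read it without hard-coding it. The sketch below shows one possible way using only the standard library; the helper names read_env_file, get_token, and auth_headers are hypothetical, and a package such as python-dotenv could be used instead. The header layout assumes you call the GitHub REST API directly; libraries like PyGithub take the token through their own constructors.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Return KEY=VALUE pairs from a simple .env file as a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_token(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    return os.environ.get("GITHUB_TOKEN") or read_env_file(path).get("GITHUB_TOKEN")


def auth_headers(token):
    """Build request headers for the GitHub REST API."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Unauthenticated requests also work for public repositories, but they are subject to a much lower rate limit, which is why the token is worth setting up.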
Script Output Format:
Example:
$ python mine_repo.py pallets markupsafe
============================================================
REPOSITORY MINING RESULTS: pallets/markupsafe
============================================================
- Open Pull Requests: [NUMBER]
- Closed Pull Requests: [NUMBER]
--- Merged Pull Requests: [NUMBER]
--- Pull Requests closed without merging: [NUMBER]
- Total Issues: [NUMBER]
--- Open Issues: [NUMBER]
--- Closed Issues: [NUMBER]
- Contributors: [NUMBER]
- Total Commits: [NUMBER]
============================================================
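The layout above can be produced with a small formatting function. This is a sketch only: format_results and the keys of the stats dictionary are hypothetical names chosen for illustration, and the numbers in the usage example are placeholders, not real data for any repository.

```python
def format_results(owner, repo, stats):
    """Render the statistics in the layout shown in the example output.

    `stats` is a hypothetical dict; its keys are illustrative only.
    """
    rule = "=" * 60
    not_merged = stats["closed_prs"] - stats["merged_prs"]
    total_issues = stats["open_issues"] + stats["closed_issues"]
    lines = [
        rule,
        f"REPOSITORY MINING RESULTS: {owner}/{repo}",
        rule,
        f"- Open Pull Requests: {stats['open_prs']}",
        f"- Closed Pull Requests: {stats['closed_prs']}",
        f"--- Merged Pull Requests: {stats['merged_prs']}",
        f"--- Pull Requests closed without merging: {not_merged}",
        f"- Total Issues: {total_issues}",
        f"--- Open Issues: {stats['open_issues']}",
        f"--- Closed Issues: {stats['closed_issues']}",
        f"- Contributors: {stats['contributors']}",
        f"- Total Commits: {stats['commits']}",
        rule,
    ]
    return "\n".join(lines)


# Placeholder numbers only, not real data for any repository.
example_stats = {
    "open_prs": 2,
    "closed_prs": 10,
    "merged_prs": 7,
    "open_issues": 3,
    "closed_issues": 5,
    "contributors": 4,
    "commits": 100,
}
print(format_results("owner", "repo", example_stats))
```

Returning a string instead of printing inside the function keeps the formatting easy to check separately from the mining logic.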
Resources for Research:
Research the documentation for your chosen library to learn how to access repository data:
- PyGithub: Documentation | GitHub Repository
- PyDriller: Documentation | GitHub Repository
- GitHub REST API: API Documentation | Pull Requests API | Issues API | Repositories API
- Requests library: Documentation
Note: Some API calls may require pagination to get complete results. Check your library's documentation to see if it already handles paginated responses or if you have to do it yourself.
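If you call the GitHub REST API directly, list endpoints return one page at a time and point to the following page through the HTTP Link header (rel="next"). The sketch below illustrates that idea without making any network calls. The names parse_link_header and iter_items are hypothetical, and fetch_page stands in for whatever function you write to perform the request; the requests library exposes similar parsed data as response.links, and PyGithub handles pagination for you.

```python
import re


def parse_link_header(header):
    """Map each rel value in an HTTP Link header to its URL."""
    links = {}
    for match in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]+)"', header or ""):
        links[match.group(2)] = match.group(1)
    return links


def iter_items(fetch_page, first_url):
    """Yield items from every page, following rel="next" links.

    `fetch_page` is a hypothetical callable that takes a URL and returns
    (list_of_items, link_header_string_or_None).
    """
    url = first_url
    while url:
        items, link_header = fetch_page(url)
        yield from items
        # No "next" entry means the last page has been reached.
        url = parse_link_header(link_header).get("next")


# A fake two-page response used only to demonstrate the loop.
fake_pages = {
    "page1": ([1, 2], '<page2>; rel="next", <page2>; rel="last"'),
    "page2": ([3], None),
}
all_items = list(iter_items(lambda url: fake_pages[url], "page1"))
print(all_items)
```

Counting only the first page is a common source of totals that are too low, so it is worth checking how your chosen approach handles this.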
Goal: Compare and document your manual findings from the web interface with the automated script results.
- Compare each statistic
- Do the numbers match?
- If there are discrepancies, first make sure there are no bugs in your script. If a discrepancy still persists, note it in your report
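The comparison itself can be done by hand, but a short helper makes mismatches harder to overlook. This is a minimal sketch: find_discrepancies and the dictionary keys are hypothetical, and the numbers are placeholders rather than real results.

```python
def find_discrepancies(manual, automated):
    """Return {statistic: (manual_value, automated_value)} for mismatches."""
    keys = sorted(set(manual) | set(automated))
    return {
        key: (manual.get(key), automated.get(key))
        for key in keys
        if manual.get(key) != automated.get(key)
    }


# Placeholder numbers only, not real data for any repository.
manual_stats = {"open_prs": 2, "closed_prs": 10, "commits": 100}
script_stats = {"open_prs": 2, "closed_prs": 11, "commits": 100}
differences = find_discrepancies(manual_stats, script_stats)
print(differences)
```

A statistic that is missing on one side shows up with None in that position, which can help spot a value you forgot to collect.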
Goal: Test your script on a different repository to ensure it works generically.
- Run your script on the Lizard repository (terryyin/lizard):
  python mine_repo.py terryyin lizard
- Report the findings - briefly document what the script discovered about this second repository.
Submit the following to Brightspace:
Total Points: 10/10
- Manual Investigation Results (2.5 points)
  - Required repository statistics
  - Three example PRs with:
    - Links to each PR
    - Screenshots showing:
      - PR with passing build
      - PR with failing build
      - PR with inline comments
- Mining Script Code (5 points)
  - Your mine_repo.py file with implemented mining logic
  - Your requirements.txt file with any dependencies needed to run your script
- Comparison Report (2.5 points)
  - Screenshot of your script output
  - Side-by-side comparison showing:
    - Your manual statistics
    - Your script output
  - Report of any discrepancies
- Optional Repository Results (if completed)
  - Output from running your script on a second repository
- GenAI Disclosure (if applicable)
  - Brief note describing any AI tools used and how they assisted you
- Combine all documents/screenshots into a single PDF
- Attach your mine_repo.py and requirements.txt files separately
- Name your files: LastName_FirstName_Mining.pdf, LastName_FirstName_mine_repo.py, and LastName_FirstName_requirements.txt
- GitHub REST API Documentation
- GitHub REST API - Pull Requests
- GitHub REST API - Issues
- GitHub REST API - Reviews and Comments
- PyGithub Documentation
- PyGithub GitHub Repository
- PyDriller Documentation
- PyDriller GitHub Repository
This exercise was developed with the assistance of Cursor, an AI-powered code editor. Cursor was used to:
- Brainstorm ideas for the exercise structure and tasks
- Draft and refine this README documentation
MIT License - See LICENSE file for details.