A sophisticated Python tool for extracting and structuring share repurchase data from SEC 10-Q and 10-K filings. This tool enables researchers to build comprehensive, monthly-frequency repurchase datasets covering all publicly traded companies in the US.
Traditional repurchase research has relied on two separate datasets that could not be linked: COMPUSTAT, which provides quarterly aggregate repurchase figures, and SDC, which contains repurchase program announcements. COMPUSTAT tells us how many shares a firm repurchased in a given quarter but provides no information about which specific programs these repurchases were conducted under. Conversely, SDC provides announcement dates and program sizes but offers no insight into program completion rates over time.
This disconnection has limited researchers' ability to analyze the relationship between announced programs and actual execution. Beyond this fundamental connectivity issue, the accuracy of commonly used repurchase measures has been questioned. Banyi et al. (2008) found that COMPUSTAT deviates from actual repurchases reported in SEC filings by more than 30% in approximately 16% of cases.
This extractor bridges the gap by providing:
- Monthly Repurchase Figures: Complete monthly breakdowns for all publicly traded companies
- Program-Specific Data: Links repurchases to specific announced programs
- Completion Tracking: Enables analysis of program completion rates over time
- Superior Data Quality: Validation against COMPUSTAT reveals that while 95% of cases match, the remaining 5% show errors of orders of magnitude in COMPUSTAT due to unit misinterpretation (e.g., extracting "300 thousand" as "300"). This tool directly parses SEC filings with proper unit handling, providing more reliable data than existing databases.
This tool enables researchers to:
- Build comprehensive panel datasets of monthly repurchase activity
- Track program completion rates and timing patterns
- Distinguish between different types of repurchase transactions
- Analyze the relationship between program announcements and execution
- Conduct research that was previously impossible due to data limitations
This tool has been used to create comprehensive repurchase datasets for academic research. For example, Choubdaran (2025) uses data extracted with this tool to examine the puzzling heterogeneity in completion rates of open market repurchase programs, documenting that low-completion firms significantly outperform analyst expectations in years 1-2 post-announcement, while high-completion firms excel in years 3-4. The study's validation process also revealed systematic errors in COMPUSTAT: in the 5% of cases where the tool's quarterly figures differed from COMPUSTAT, manual verification showed the tool was correct in nearly all instances, with COMPUSTAT errors typically stemming from unit misinterpretation (e.g., failing to recognize that "300" in the table means "300 thousand shares").
The tool's unique capability lies in connecting monthly repurchase figures to program authorization details. For example:
- Program Authorization: "In July 2024, the Company's Board of Directors authorized a share repurchase program of up to $400.0 million"
- Monthly Tracking: January shows $359.7M remaining → $40.3M completed (10% completion rate)
- Ongoing Analysis: March shows 200,000 shares at $187.32 → $37.5M additional repurchases
- Final Balance: $322.2M remaining → Total completion: 19.5%
This granular tracking enables research questions that were previously impossible to answer due to data limitations.
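The arithmetic in the example above can be written out as a short sketch. The function name `completion_path` is hypothetical and not part of the tool; the dollar figures are the ones quoted in the example (in millions).

```python
def completion_path(authorized, remaining_by_period):
    """Return the cumulative completion rate after each reporting period.

    authorized: total program size (here in $ millions)
    remaining_by_period: remaining authorization reported at each period end
    """
    return [(authorized - remaining) / authorized for remaining in remaining_by_period]


# $400.0M program; remaining balances reported for January, February, March
rates = completion_path(400.0, [359.7, 359.7, 322.2])
for label, rate in zip(["January", "February", "March"], rates):
    print(f"{label}: {rate:.1%} of the authorization used")
```

Because each rate is computed from the reported remaining balance, it does not depend on reconstructing dollar spending from share counts and average prices.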
Beyond academic research, completion tracking has significant implications for investment strategies. Choubdaran (2025) documents that when firms are divided into quartiles based on their 3-month completion rates, both low-completion (Quartile 1) and high-completion (Quartile 4) firms outperform analyst expectations over a 4-year period, but with distinctly different timing:
- Low-completion firms: Outperform in years 1-2 post-announcement
- High-completion firms: Outperform in years 3-4 post-announcement
This temporal pattern suggests a refinement to the traditional "buyback anomaly" investment strategy. While the classic approach simply holds all announcing firms for 7-48 months post-announcement, a completion-conditioned strategy can better align investment horizons with performance timing:
- For low-completion firms (Quartile 1): Invest immediately, hold for months 7-24
- For high-completion firms (Quartile 4): Delay investment until month 31, hold through month 48
Performance Results: The completion-conditioned strategy delivers a monthly alpha of 0.54% versus 0.30% for the classic strategy (difference: 0.25% per month, p = 0.038). Over the 42-month trading window, this translates to approximately 11% higher cumulative returns, even after accounting for higher turnover and trading costs.
Practical Implementation: Completion rates are publicly observable from SEC filings and can be mechanically calculated 3 months post-announcement (accounting for filing lags). This makes the strategy implementable in real-world trading, representing an underutilized enhancement to traditional buyback-based investment approaches.
This demonstrates how the tool's data extraction capabilities translate directly into actionable investment strategies with measurable performance improvements.
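As a rough illustration of the holding rules described above, the sketch below maps a completion quartile to a holding window measured in months since announcement. The function names are hypothetical. The source gives windows only for Quartiles 1 and 4, so treating the middle quartiles as not held is an assumption of this sketch, not a statement about the study.

```python
CLASSIC_WINDOW = (7, 48)  # classic buyback strategy: months 7-48 for every announcer


def holding_window(completion_quartile):
    """Return (first_month, last_month) of the holding period, or None."""
    if completion_quartile == 1:   # low completion: hold months 7-24
        return (7, 24)
    if completion_quartile == 4:   # high completion: hold months 31-48
        return (31, 48)
    return None                    # assumption: middle quartiles are not held


def in_portfolio(completion_quartile, months_since_announcement):
    """True if a firm in this quartile is held in the given month."""
    window = holding_window(completion_quartile)
    if window is None:
        return False
    first_month, last_month = window
    return first_month <= months_since_announcement <= last_month


print(in_portfolio(1, 12))  # low-completion firm, one year after announcement
print(in_portfolio(4, 12))  # high-completion firm, same month
```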
SEC regulations (effective March 15, 2004) mandate that firms report monthly share repurchases under Item 2(e) of Form 10-Q and Item 5(c) of Form 10-K. A typical repurchase table looks like this:
Rows:
- 3 Monthly Intervals: January, February, March (or the three months of the reporting quarter)
- 1 Total Row: Quarterly aggregate (sometimes absent; can be calculated from monthly rows)
Columns (typically 4):
- Total Number of Shares Purchased: All shares bought back during the period
- Average Price Paid Per Share: Average price for those repurchases
- Total Number of Shares Purchased Under Publicly Announced Programs: Subset of column 1, excluding private transactions
- Maximum Dollar Value or Number of Shares Remaining: Remaining authorization under announced programs
Context:
- Text Above Table (`soup_before`): Program authorization details, background information
- Text Below Table (`soup_after`): Footnotes, program details, explanations
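When a filing omits the total row, the quarterly aggregate can be derived from the three monthly rows. The sketch below is one way to do that with pandas. The numbers are hypothetical, the helper `quarterly_total` is not part of the tool, and using a share-weighted average for the quarterly price is an assumption about how the aggregate should be formed.

```python
import pandas as pd

# Hypothetical monthly rows: shares in thousands, price in dollars
monthly = pd.DataFrame({
    "tot_shares": [100.0, 50.0, 50.0],
    "avg_price": [10.0, 12.0, 14.0],
})


def quarterly_total(monthly_rows):
    """Aggregate three monthly rows into a quarterly total."""
    shares = monthly_rows["tot_shares"].sum()
    if shares == 0:
        # No repurchases in the quarter: an average price is undefined
        return {"tot_shares": 0.0, "avg_price": float("nan")}
    dollars = (monthly_rows["tot_shares"] * monthly_rows["avg_price"]).sum()
    return {"tot_shares": float(shares), "avg_price": float(dollars / shares)}


total = quarterly_total(monthly)
print(total)
```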
- Information Preservation: Captures all data without loss, including missing value context
- Multi-Table Support: Handles filings with multiple repurchase programs or transaction types
- Flexible Output: Three complementary data sources for comprehensive analysis
- Research-Grade Quality: Maintains data integrity for academic and professional research
- Real-World Robustness: Handles edge cases and complex filing structures
git clone https://github.com/Ali-Choubdaran/sec-repurchase-extractor.git
cd sec-repurchase-extractor
pip install -r requirements.txt

from src.main import RepurchaseExtractor
# Initialize extractor with SEC filing URL
extractor = RepurchaseExtractor("https://www.sec.gov/Archives/edgar/data/17843/000001784325000011/crs-20250331.htm")
# Extract data (updates instance variables)
extractor.extract()
# Access the four outputs
soup_before = extractor.soup_before  # Context before table
repurchase_data = extractor.repurchase_data  # Structured table data
soup_after = extractor.soup_after  # Footnotes and context
extraction_metadata = extractor.extraction_metadata  # Extraction diagnostics

The extractor provides four key outputs that work together to give you complete information:
BeautifulSoup object containing the HTML text immediately preceding the repurchase table. Typically includes:
- Program authorization announcements
- Authorization dates and amounts
- Program descriptions
- Background information
Usage:
# Get clean text
context_before = soup_before.get_text(separator=' ', strip=True)

A pandas DataFrame containing the cleaned, structured repurchase data. This is the core output.
| Column | Description | Units |
|---|---|---|
| `row_label` | Period description or program type label | Text |
| `tot_shares` | Total number of shares purchased | Thousands |
| `avg_price` | Average price paid per share | Dollars (no conversion) |
| `prog_shares` | Shares purchased under publicly announced programs | Thousands |
| `remaining_auth` | Remaining authorization under programs | Millions |
| `id` | Row type identifier (1=Jan, 2=Feb, 3=Mar, 4=Total) | Integer |
| `table_id` | Program/table identifier (for multi-program filings) | Float |
| `table_score` | Program quality score (higher = more likely repurchase program) | Float |
| `beg_date` | Period beginning date | datetime |
| `end_date` | Period ending date | datetime |
| `tot_shares_dollar` | Dollar indicator for `tot_shares` (1=dollar, 0=shares) | Integer |
| `avg_price_dollar` | Dollar indicator for `avg_price` (1=dollar, 0=shares) | Integer |
| `prog_shares_dollar` | Dollar indicator for `prog_shares` (1=dollar, 0=shares) | Integer |
| `remaining_auth_dollar` | Dollar indicator for `remaining_auth` (1=dollar, 0=shares) | Integer |
- 1, 2, 3: Monthly intervals (first, second, third month of quarter)
- 4: Total row for the quarter (if present in original table)
When companies report multiple transaction types (e.g., repurchase programs vs. employee transactions), each type receives a unique table_id:
- `table_id = 0` or `1`: Primary repurchase program
- `table_id = 2`: Secondary transactions (often employee-related)
The table_score helps identify program quality based on contextual keywords:
- Positive score: Keywords like "repurchase", "program", "open market" (+1 each)
- Negative score: Keywords like "employee", "compensation", "option" (-1 each)
- Higher scores indicate more likely public repurchase programs
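The scoring rule above can be sketched as a simple keyword count. This is an illustration only: the function `keyword_score` is hypothetical, the keyword lists contain just the examples named above (the tool's actual lists may be longer), and plain substring matching is an assumption.

```python
# Keyword lists limited to the examples named above (assumed, not exhaustive)
POSITIVE_KEYWORDS = ("repurchase", "program", "open market")
NEGATIVE_KEYWORDS = ("employee", "compensation", "option")


def keyword_score(text):
    """Score a row label or footnote: +1 per positive keyword, -1 per negative."""
    lowered = text.lower()
    score = sum(1 for keyword in POSITIVE_KEYWORDS if keyword in lowered)
    score -= sum(1 for keyword in NEGATIVE_KEYWORDS if keyword in lowered)
    return score


print(keyword_score("Repurchase program (1)(4)"))
print(keyword_score("Shares withheld under employee stock compensation plans"))
```

Each keyword counts at most once per text, so a long footnote that repeats "program" many times does not outweigh a short label.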
BeautifulSoup object containing the HTML text immediately following the table. Typically includes:
- Footnotes explaining table entries
- Program authorization details
- Employee transaction explanations
- Program terms and conditions
Usage:
# Get clean text
footnotes = soup_after.get_text(separator=' ', strip=True)

A dictionary containing diagnostic information about the extraction process:
- `num_tables`: Number of tables found in the filing
- `num_monthly_intervals`: Number of monthly intervals identified
- `tot_row_found`: Whether a total row was found (1) or not (0)
- `unit_source`: Where unit information was found (e.g., `['table']`)
- `meta_data`: Original column headers from the raw table
- Error tracking fields for debugging
All monetary and share values are automatically converted to standardized units:
- `tot_shares`: Always in thousands of shares
- `avg_price`: No unit conversion (preserves original precision)
- `prog_shares`: Always in thousands of shares
- `remaining_auth`: Always in millions (dollars or shares, indicated by `remaining_auth_dollar`)
The tool automatically detects and converts from the original units (which may be thousands, millions, or billions) to these standardized units.
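A minimal sketch of this kind of unit standardization is shown below. The names `UNIT_MULTIPLIERS` and `convert_units` are hypothetical and do not describe the tool's internals; the example values are taken from the filings discussed in this README.

```python
# Multipliers relative to single units (one share or one dollar)
UNIT_MULTIPLIERS = {
    "ones": 1,
    "thousands": 1_000,
    "millions": 1_000_000,
    "billions": 1_000_000_000,
}


def convert_units(value, from_unit, to_unit):
    """Re-express a value reported in from_unit in terms of to_unit."""
    return value * UNIT_MULTIPLIERS[from_unit] / UNIT_MULTIPLIERS[to_unit]


# 200,000 shares reported in single units -> thousands of shares
shares_in_thousands = convert_units(200_000, "ones", "thousands")
# A $2.7 billion authorization -> millions of dollars
auth_in_millions = convert_units(2.7, "billions", "millions")
print(shares_in_thousands, auth_in_millions)
```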
The tool preserves information about why data is missing using specific notations:
- `NaN`: Originally empty (no content in the cell)
- `!d`: Empty after removal of dollar sign (cell had `$` but no number)
- `!u`: Empty after removal of unit word (cell had "thousand", "million", "billion" but no number)
- `!s`: Empty after removal of star sign (cell had `*` but no number)
- `!p` / `!P`: Empty after removal of parentheses (mainly due to footnotes)
- `!o`: Empty after removal of other meaningless content (like "N/A", "NA", "-", etc.)
These notations help researchers understand the context of missing data and make informed decisions about data usage. Because the tool avoids discarding information, each kind of empty cell is recorded with its own marker rather than collapsed into a single missing value.
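For numerical work it is often convenient to separate the markers from the numbers. The sketch below, which assumes the markers are stored as strings in an object-typed column, returns a numeric series plus a parallel series of reasons. The helper `split_values_and_reasons` and the wording of the reason strings are hypothetical.

```python
import pandas as pd

# Marker meanings as listed above (NaN itself means the cell was originally empty)
MISSING_REASONS = {
    "!d": "empty after removing dollar sign",
    "!u": "empty after removing unit word",
    "!s": "empty after removing star sign",
    "!p": "empty after removing parentheses",
    "!P": "empty after removing parentheses",
    "!o": "empty after removing other placeholder content",
}


def split_values_and_reasons(column):
    """Return (numeric values, missing-value reasons) for one data column."""
    reasons = column.map(MISSING_REASONS)              # NaN where the cell is not a marker
    values = pd.to_numeric(column, errors="coerce")    # markers become NaN
    return values, reasons


# Column shaped like tot_shares in a quarter with repurchases only in month 3
raw = pd.Series(["!o", "!o", 200.0], dtype=object)
values, reasons = split_values_and_reasons(raw)
print(values.tolist())
print(reasons.tolist())
```

Whether a marker such as `!o` should be read as zero activity is a research decision that depends on the filing's footnotes, so the sketch leaves those cells as NaN rather than filling them with zeros.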
- `beg_date` / `end_date`: Precise date ranges for each period (datetime objects)
- Monthly intervals: SEC-required monthly breakdowns (automatically parsed)
- Quarter totals: Date range calculated from monthly intervals when total row exists
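The quarter-total date range can be illustrated with a few lines of pandas. This is a sketch of the idea rather than the tool's code; the dates are those of a first-quarter 2025 filing.

```python
import pandas as pd

# Monthly intervals for a quarter ending March 31, 2025
monthly = pd.DataFrame({
    "beg_date": pd.to_datetime(["2025-01-01", "2025-02-01", "2025-03-01"]),
    "end_date": pd.to_datetime(["2025-01-31", "2025-02-28", "2025-03-31"]),
})

# The total row spans from the earliest start to the latest end
quarter_start = monthly["beg_date"].min()
quarter_end = monthly["end_date"].max()
print(quarter_start.date(), quarter_end.date())
```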
The following examples demonstrate how the tool transforms messy SEC filings into clean, structured data ready for analysis.
This is what researchers typically work with: HTML tables embedded in SEC filings with scattered context and footnotes.
Our tool extracts this messy filing into three clean, complementary outputs:
Item 2. Unregistered Sales of Equity Securities and Use of Proceeds In July 2024,
the Company's Board of Directors authorized a share repurchase program of up to
$400.0 million of the Company's outstanding common stock. There is no stated
expiration for the share repurchase program. The shares may be repurchased from
time to time at our discretion based on capital needs of the business, general
market conditions and market price of the stock. The timing or amount of the
shares to be repurchased cannot be assured. The share repurchase program may be
discontinued at any time. As of March 31, 2025, $322.2 million of the $400.0
million remained available for future purchases. During the quarter ended
March 31, 2025, the Company purchased 200,000 shares pursuant to the terms of
the share repurchase program. The following table contains information about
purchases by us of our common stock during the quarter ended March 31, 2025:
Key Information Extracted:
- 🏢 Authorization: $400M program approved in July 2024
- 💰 Remaining Balance: $322.2M as of March 31, 2025
- 📊 Completion Progress: $77.8M used (19.5% complete)
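Dollar figures like these can be pulled out of the context text with a regular expression. The sketch below is a hedged illustration, not the tool's method: the pattern and the helper `dollar_amounts_in_millions` are hypothetical, and the pattern only handles amounts written as "$X million" or "$X billion".

```python
import re

# Matches "$400.0 million", "$2.7 billion", "$1,250 million", etc.
AMOUNT_PATTERN = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)\s*(million|billion)", re.IGNORECASE)
SCALE_TO_MILLIONS = {"million": 1.0, "billion": 1000.0}


def dollar_amounts_in_millions(text):
    """Return every '$X million/billion' amount in text, expressed in $ millions."""
    amounts = []
    for number, unit in AMOUNT_PATTERN.findall(text):
        amounts.append(float(number.replace(",", "")) * SCALE_TO_MILLIONS[unit.lower()])
    return amounts


context = (
    "In July 2024, the Company's Board of Directors authorized a share "
    "repurchase program of up to $400.0 million of the Company's outstanding "
    "common stock. As of March 31, 2025, $322.2 million of the $400.0 million "
    "remained available for future purchases."
)
amounts = dollar_amounts_in_millions(context)
print(amounts)
```

The regular expression only finds the amounts; deciding which one is the authorization and which is the remaining balance still requires reading the surrounding sentence.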
| row_label | tot_shares | avg_price | prog_shares | remaining_auth | id | table_id | table_score | beg_date | end_date |
|---|---|---|---|---|---|---|---|---|---|
| January 1-31, 2025 | `!o` | `!o` | `!o` | 359.7 | 1.0 | 0.0 | NaN | 2025-01-01 | 2025-01-31 |
| February 1-28, 2025 | `!o` | `!o` | `!o` | 359.7 | 2.0 | 0.0 | NaN | 2025-02-01 | 2025-02-28 |
| March 1-31, 2025 | 200.0 | 187.32 | 200.0 | 322.2 | 3.0 | 0.0 | NaN | 2025-03-01 | 2025-03-31 |
| Quarter ended March 31, 2025 | 200.0 | 187.32 | 200.0 | 322.2 | 4.0 | 0.0 | NaN | 2025-01-01 | 2025-03-31 |
Data Features:
- ✅ Standardized Units: Shares in thousands (200.0 = 200,000 shares), dollars in millions ($322.2M)
- ✅ Missing Value Intelligence: `!o` indicates no repurchases in Jan/Feb (not missing data, but zero activity)
- ✅ Precise Dates: Exact date ranges parsed for each period
- ✅ Total Row: Quarterly aggregate automatically calculated and dated
- ✅ Single Program: `table_id = 0.0` (only one repurchase program in this filing)
- ✅ No Scoring Needed: `table_score = NaN` (no competing programs to differentiate)
Additional Columns (dollar indicators, not shown above for clarity):
- `tot_shares_dollar`: 0 (shares, not dollars)
- `avg_price_dollar`: 1 (price in dollars)
- `prog_shares_dollar`: 0 (shares, not dollars)
- `remaining_auth_dollar`: 1 (in dollars, specifically $322.2M)
In addition to the share repurchase program, for the quarter ended March 31, 2025,
1,173 shares, at an average purchase price of $193.50, were surrendered by
employees to the Company for the payment of the minimum tax liability withholding
obligations upon the vesting of shares of restricted stock as well as the cost
obligations of stock options exercised. We do not consider this a share buyback
program.
Additional Insights:
- 👥 Employee Transactions: 1,173 shares at $193.50 (separate from main program)
- 📋 Clarity: Distinguishes between repurchase programs and employee transactions
- Missing Data Intelligence: The tool correctly identifies that January and February had no repurchases, using `!o` notation rather than treating them as missing data
- Context Preservation: Authorization details from `soup_before` enable completion rate tracking
- Clean Structure: Messy HTML transformed into a pandas DataFrame with meaningful column names
- Date Parsing: Automatic extraction of precise date ranges for each period
- Unit Standardization: Shares in thousands, dollars in millions—ready for analysis
# Calculate program completion
authorized = 400.0 # million
remaining = 322.2 # million
used = authorized - remaining # $77.8M
completion_rate = used / authorized
print(f"Program completion: {completion_rate:.1%}") # Output: 19.5%
# Analyze timing
march_data = repurchase_data[repurchase_data['id'] == 3]
shares_repurchased = march_data['tot_shares'].iloc[0] * 1000 # 200,000 shares
avg_price = march_data['avg_price'].iloc[0] # $187.32
amount_spent = (shares_repurchased * avg_price) / 1e6 # $37.5M
print(f"March spending: ${amount_spent:.1f}M")

A more complex filing with multiple transaction types reported in the same table.
This example demonstrates the tool's ability to handle multi-program filings where each monthly interval has multiple rows for different transaction types.
Item 2. Unregistered Sales of Equity Securities and Use of Proceeds Issuer
Purchases of Equity Securities The following table sets forth information
regarding purchases of our common stock related to our share repurchase program
and employee transactions made by us or on our behalf during the most recent
quarter.
Key Information:
- 🏢 Two Transaction Types: Repurchase program and employee transactions
- 📊 Combined Reporting: Both types reported in a single table
| row_label | tot_shares | avg_price | prog_shares | remaining_auth | id | table_id | table_score | beg_date | end_date |
|---|---|---|---|---|---|---|---|---|---|
| Repurchase program (1)(4) | `!o` | `!o` | `!o` | 2225.091655 | 1.0 | 1.0 | 2.0 | 2024-01-01 | 2024-01-31 |
| Employee transactions (2) | 3.476 | 107.51 | NaN | NaN | 1.0 | 2.0 | -2.0 | 2024-01-01 | 2024-01-31 |
| Repurchase program (1)(4) | `!o` | `!o` | `!o` | 2225.091655 | 2.0 | 1.0 | 2.0 | 2024-02-01 | 2024-02-29 |
| Employee transactions (2) | 252.279 | 106.89 | NaN | NaN | 2.0 | 2.0 | -2.0 | 2024-02-01 | 2024-02-29 |
| Repurchase program (1)(4) | `!o` | `!o` | `!o` | 2225.091655 | 3.0 | 1.0 | 2.0 | 2024-03-01 | 2024-03-31 |
| Employee transactions (2) | 5.51 | 130.54 | NaN | NaN | 3.0 | 2.0 | -2.0 | 2024-03-01 | 2024-03-31 |
| Repurchase program (1)(4) | `!o` | `!o` | `!o` | 2225.091655 | 4.0 | 1.0 | 2.0 | 2024-01-01 | 2024-03-31 |
| Employee transactions (2) | 261.265 | 107.39 | NaN | NaN | 4.0 | 2.0 | -2.0 | 2024-01-01 | 2024-03-31 |
Data Features:
- ✅ Multi-Table Structure: 8 rows (2 per monthly interval + 2 total rows)
- ✅ table_id Separation: `1.0` for repurchase program, `2.0` for employee transactions
- ✅ Intelligent Scoring: `table_score = 2.0` for repurchase program (keywords: "repurchase" +1, "program" +1)
- ✅ Negative Scoring: `table_score = -2.0` for employee transactions (keywords: "employee" -1, "transaction" -1)
- ✅ Flexible Analysis: Researchers can filter, separate, or combine transaction types as needed
Additional Columns (dollar indicators):
- `tot_shares_dollar`: 0 (all in shares)
- `avg_price_dollar`: 1 (prices in dollars)
- `prog_shares_dollar`: 0 (shares)
- `remaining_auth_dollar`: 0 (this company reports remaining auth in shares, not dollars!)
(1) In April 2023, our Board of Directors approved a new share repurchase program
authorizing the purchase of up to $2.7 billion of our outstanding shares of common
stock through June 30, 2024. This share repurchase authorization replaced our prior
$4.2 billion share repurchase program.
(2) Reflects shares withheld (under the terms of grants under employee stock
compensation plans) to offset tax withholding obligations that occur upon the
delivery of outstanding shares underlying restricted stock units or upon the
exercise of stock options.
(3) Average price paid per share excludes any excise tax.
(4) In accordance with the Merger Agreement with Capital One, share repurchases
have been paused through the completion of the merger. See "— Liquidity and
Capital Resources — Capital" for additional information.
Critical Insights:
- 🏢 Program Details: $2.7B program authorized in April 2023, expires June 30, 2024
- ⏸️ Paused Program: Repurchases paused due to Capital One merger (explains the `!o` values)
- 💰 Remaining Balance: $2.225B remaining (only $475M used, 17.6% completion)
- 👥 Employee Transactions: 261,265 total shares for employee tax obligations (separate from program)
- Multi-Table Handling: Automatically separates two transaction types into distinct `table_id` groups
- Intelligent Scoring: Uses contextual keywords to score transaction quality (`table_score`)
- Researcher Flexibility: Different researchers can use the data differently:
  - Focus on repurchase programs only: `df[df['table_id'] == 1]`
  - Study employee transactions: `df[df['table_id'] == 2]`
  - Analyze total activity: Use both tables combined
- Context Preservation: Footnotes explain why repurchases were paused (merger) and program details
- Missing Value Intelligence: `!o` indicates no repurchases (program paused), while employee transactions continued
# Separate transaction types
repurchase_program = repurchase_data[repurchase_data['table_id'] == 1.0]
employee_transactions = repurchase_data[repurchase_data['table_id'] == 2.0]
# Analyze employee transactions only
total_employee_shares = employee_transactions[employee_transactions['id'] == 4]['tot_shares'].iloc[0]
avg_employee_price = employee_transactions[employee_transactions['id'] == 4]['avg_price'].iloc[0]
print(f"Employee transactions: {total_employee_shares * 1000:,.0f} shares at ${avg_employee_price:.2f}")
# Output: Employee transactions: 261,265 shares at $107.39
# Track program status (no actual repurchases this quarter)
program_remaining = repurchase_program[repurchase_program['id'] == 4]['remaining_auth'].iloc[0]
authorized = 2700.0 # $2.7B from footnotes (but remaining_auth is in shares here!)
print(f"Program status: {program_remaining:,.0f} shares remaining")
# Filter by transaction quality
high_quality = repurchase_data[repurchase_data['table_score'] > 0] # Only repurchase programs
print(f"High-quality repurchase rows: {len(high_quality)}")

The combination of `!o` values for the repurchase program and the footnote explaining the Capital One merger provides crucial context. Without the footnote, one might misinterpret the lack of repurchases. This demonstrates the power of combining `soup_after` with `repurchase_data` for complete understanding.
The tool provides detailed error information through the extraction_metadata:
# Check extraction status
if extractor.extraction_metadata.get('self_term_re') == 'not_3_monthly_intervals':
    print("Could not identify 3 monthly intervals")
elif extractor.extraction_metadata.get('error_term_re'):
    print(f"Error: {extractor.extraction_metadata['error_term_re']}")
else:
    print("Extraction successful")

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research, please cite both the tool and the underlying research:
% Cite the research methodology
@unpublished{choubdaran2025disagreement,
author = {Choubdaran, Ali},
title = {Disagreement Resolution Horizon and Open Market Repurchase Program Completion},
year = {2025},
note = {PhD Thesis Chapter, London School of Economics. Available at: \url{https://alichoubdaran.com/uploads/disagreement-horizon.pdf}}
}
% Cite the tool
@software{sec_repurchase_extractor,
author = {Choubdaran, Ali},
title = {SEC Repurchase Data Extractor},
year = {2025},
url = {https://github.com/Ali-Choubdaran/sec-repurchase-extractor}
}

For questions, issues, or feature requests, please open an issue on GitHub or contact the development team.
- Banyi, M. L., Dyl, E. A., & Kahle, K. M. (2008). Errors in estimating share repurchases. Journal of Corporate Finance, 14(4), 460-474.
- Choubdaran, A. (2025). Disagreement Resolution Horizon and Open Market Repurchase Program Completion. PhD Thesis Chapter, London School of Economics. Available at: https://alichoubdaran.com/uploads/disagreement-horizon.pdf


