fix: prevent memory exhaustion on large files (#25) #36
base: main
Conversation
Pull request overview
This PR addresses memory exhaustion issues when reading large data files by implementing streaming for CSV and TXT files, and adding a file size limit for JSON files (which cannot be easily streamed).
Key Changes:
- Replaced synchronous file reading with streaming using readline.createInterface() for CSV and TXT files
- Added a 100MB file size limit for JSON files with appropriate error handling
- Extracted CSV line parsing into a reusable parseCSVLine() method
```ts
// JSON cannot be streamed easily, so enforce a file size limit
if (stats.size > MAX_JSON_FILE_SIZE_BYTES) {
  const fileSizeMB = (stats.size / (1024 * 1024)).toFixed(2);
  return {
    llmContent: `JSON file is too large (${fileSizeMB} MB). Maximum supported size for JSON files is ${MAX_JSON_FILE_SIZE_MB} MB. For large JSON files, write a Python script using the 'json' module with streaming (ijson) or load in chunks.`,
    returnDisplay: `JSON file too large (${fileSizeMB} MB, max ${MAX_JSON_FILE_SIZE_MB} MB)`,
    error: {
      message: `JSON file size (${fileSizeMB} MB) exceeds ${MAX_JSON_FILE_SIZE_MB} MB limit`,
      type: ToolErrorType.FILE_TOO_LARGE,
    },
  };
}
```
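For context, a rough sketch of how a guard like this typically sits in the read path: stat the file first and only read it when it is under the limit. The constant names follow the diff; the readJsonIfSmallEnough() wrapper and its return shape are hypothetical, not the tool's actual code.

```ts
import { promises as fs } from 'node:fs';

// Limit constants assumed to live at module scope (100 MB per the PR description).
const MAX_JSON_FILE_SIZE_MB = 100;
const MAX_JSON_FILE_SIZE_BYTES = MAX_JSON_FILE_SIZE_MB * 1024 * 1024;

// Hypothetical wrapper: stat first, read only if the file is small enough.
async function readJsonIfSmallEnough(filePath: string): Promise<unknown | null> {
  const stats = await fs.stat(filePath);
  if (stats.size > MAX_JSON_FILE_SIZE_BYTES) {
    // The real tool returns a structured FILE_TOO_LARGE error here instead of null.
    return null;
  }
  const raw = await fs.readFile(filePath, 'utf-8');
  return JSON.parse(raw);
}
```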
Copilot AI (Dec 25, 2025)
The JSON file size validation logic lacks test coverage. This is a critical new feature that prevents memory exhaustion, and it should be validated with tests to ensure the size limit is enforced correctly and appropriate error messages are returned. Other tools in this directory have comprehensive test files.
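As a rough illustration of what such coverage could look like, here is a sketch using Vitest and a sparse temp file. The ReadDataFileTool class, its execute() signature, the import path, and the test framework are all assumptions about this repository, not confirmed API.

```ts
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { promises as fs } from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';
// Hypothetical imports: the actual tool class, invocation API, and error enum
// in this repository may be named and shaped differently.
import { ReadDataFileTool, ToolErrorType } from './read-data-file.js';

describe('read-data-file JSON size limit', () => {
  let tmpDir: string;
  let bigJsonPath: string;

  beforeAll(async () => {
    tmpDir = await fs.mkdtemp(path.join(os.tmpdir(), 'read-data-file-'));
    bigJsonPath = path.join(tmpDir, 'big.json');
    // A sparse file just over 100 MB keeps the test fast and light on disk.
    const handle = await fs.open(bigJsonPath, 'w');
    await handle.truncate(101 * 1024 * 1024);
    await handle.close();
  });

  afterAll(async () => {
    await fs.rm(tmpDir, { recursive: true, force: true });
  });

  it('returns FILE_TOO_LARGE for JSON files over the limit', async () => {
    const tool = new ReadDataFileTool();                        // hypothetical constructor
    const result = await tool.execute({ file_path: bigJsonPath }); // hypothetical signature
    expect(result.error?.type).toBe(ToolErrorType.FILE_TOO_LARGE);
    expect(result.returnDisplay).toContain('too large');
  });
});
```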
The core implementation has been tested manually. I can add tests in a follow-up PR if the maintainers request it.
```ts
for await (const line of rl) {
  const trimmedLine = line.trim();
  if (!trimmedLine) continue;

  if (isFirstLine) {
    headers = this.parseCSVLine(trimmedLine);
    isFirstLine = false;
    continue;
  }

  totalRows++;

  // Only store rows up to displayMaxRows for the sample
  if (sampleData.length < displayMaxRows) {
    const values = this.parseCSVLine(trimmedLine);
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      row[header] = values[index] || '';
    });
    sampleData.push(row);
  }
}
```
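parseCSVLine() is referenced above but not included in this excerpt. A minimal sketch of a quote-aware splitter is shown below as a standalone function (in the PR it is a method on the tool class, hence this.parseCSVLine above); the actual extracted method may handle more edge cases.

```ts
// Minimal quote-aware CSV splitter; a sketch only, the PR's extracted
// parseCSVLine() may treat escapes and malformed rows differently.
function parseCSVLine(line: string): string[] {
  const values: string[] = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    if (char === '"') {
      if (inQuotes && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else {
        inQuotes = !inQuotes;
      }
    } else if (char === ',' && !inQuotes) {
      values.push(current.trim());
      current = '';
    } else {
      current += char;
    }
  }
  values.push(current.trim());
  return values;
}
```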
Copilot AI (Dec 25, 2025)
The streaming implementation continues reading the entire file even after collecting the maximum number of sample rows. This defeats the purpose of preventing memory exhaustion for very large files, as the entire file is still processed line by line. Consider closing the stream early once totalRows reaches a reasonable limit (e.g., after counting enough rows for accurate statistics, or implementing a configurable max count limit).
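For reference, a self-contained sketch of what the suggested early exit could look like, using the standard node:readline close plus destruction of the underlying stream. countRowsWithCap() and maxRowsToCount are hypothetical names; this cap is not part of the PR as submitted.

```ts
import { createReadStream } from 'node:fs';
import * as readline from 'node:readline';

// Sketch of the reviewer's suggestion: stop streaming once a line cap is hit.
async function countRowsWithCap(filePath: string, maxRowsToCount = 100_000): Promise<number> {
  const fileStream = createReadStream(filePath, { encoding: 'utf-8' });
  const rl = readline.createInterface({ input: fileStream, crlfDelay: Infinity });

  let totalRows = 0;
  for await (const line of rl) {
    if (!line.trim()) continue;
    totalRows++;
    if (totalRows >= maxRowsToCount) {
      rl.close();           // stop emitting further lines
      fileStream.destroy(); // release the underlying file descriptor
      break;
    }
  }
  return totalRows;
}
```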
Intentional design choice. The tool provides accurate row counts for analysis. Memory usage stays constant since only sample rows are stored in memory.
```ts
for await (const line of rl) {
  totalLines++;
  if (sampleLines.length < maxRows) {
    sampleLines.push(line);
  }
}
```
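The rl interface consumed by this loop is created outside the excerpt; a typical setup looks like the sketch below, where filePath stands in for the resolved path of the file being read (variable names are assumptions).

```ts
import { createReadStream } from 'node:fs';
import * as readline from 'node:readline';

// Typical setup for the `rl` used above (not shown in the diff excerpt).
// crlfDelay: Infinity treats \r\n as a single line break.
const fileStream = createReadStream(filePath, { encoding: 'utf-8' });
const rl = readline.createInterface({
  input: fileStream,
  crlfDelay: Infinity,
});
```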
Copilot AI (Dec 25, 2025)
The streaming implementation continues reading the entire file even after collecting the maximum number of sample lines. For very large text files (e.g., multi-GB log files), this means the function will still process every line in the file, which can take a significant amount of time. Consider closing the stream early once enough sample lines have been collected, if the total line count is not critical for the use case.
Accurate line counts are useful, and memory stays constant regardless of file size.
Fixes #25
Problem
The read-data-file tool loads entire files into memory before parsing, causing OOM crashes for files larger than ~500MB.
Solution
- Stream CSV and TXT files line by line (readline.createInterface()) instead of loading them into memory
- Enforce a 100MB size limit for JSON files, which cannot easily be streamed

Changes
- … and parseTXTStream() methods, plus JSON size validation