
Conversation

@vyagh commented Dec 25, 2025

Fixes #25

Problem

The read-data-file tool loads entire files into memory before parsing,
causing OOM crashes for files larger than ~500MB.

Solution

  • Use streaming (readline.createInterface()) for CSV and TXT files
  • Add 100MB file size limit for JSON files (JSON can't be safely streamed)

Changes

  • packages/core/src/tools/read-data-file.ts: Added parseCSVStream()
    and parseTXTStream() methods, plus JSON size validation
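
A minimal sketch of the streaming approach described above (the helper names and shapes here are illustrative, not the exact code in read-data-file.ts):

import { createReadStream } from 'node:fs';
import { stat } from 'node:fs/promises';
import * as readline from 'node:readline';

const MAX_JSON_FILE_SIZE_MB = 100;
const MAX_JSON_FILE_SIZE_BYTES = MAX_JSON_FILE_SIZE_MB * 1024 * 1024;

// Stream a text-based file line by line. Memory stays bounded because only
// the first `maxRows` lines are retained as a sample.
async function streamLines(filePath: string, maxRows: number) {
  const rl = readline.createInterface({
    input: createReadStream(filePath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  const sampleLines: string[] = [];
  let totalLines = 0;
  for await (const line of rl) {
    totalLines++;
    if (sampleLines.length < maxRows) {
      sampleLines.push(line);
    }
  }
  return { sampleLines, totalLines };
}

// JSON.parse needs the whole document in memory, so oversized JSON files are
// rejected up front rather than streamed.
async function assertJsonWithinLimit(filePath: string): Promise<void> {
  const { size } = await stat(filePath);
  if (size > MAX_JSON_FILE_SIZE_BYTES) {
    const fileSizeMB = (size / (1024 * 1024)).toFixed(2);
    throw new Error(
      `JSON file is too large (${fileSizeMB} MB); maximum is ${MAX_JSON_FILE_SIZE_MB} MB.`,
    );
  }
}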

Copilot AI review requested due to automatic review settings December 25, 2025 21:36

Copilot AI left a comment


Pull request overview

This PR addresses memory exhaustion issues when reading large data files by implementing streaming for CSV and TXT files, and adding a file size limit for JSON files (which cannot be easily streamed).

Key Changes:

  • Replaced synchronous file reading with streaming using readline.createInterface() for CSV and TXT files
  • Added a 100MB file size limit for JSON files with appropriate error handling
  • Extracted CSV line parsing into a reusable parseCSVLine() method
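
The extracted parseCSVLine() itself is not shown in the diff excerpts below; purely as an illustration (an assumption, not the PR's implementation), a quote-aware splitter for a single line might look like:

// Illustrative only -- not the implementation from this PR.
// Splits one CSV line into fields, honoring double-quoted fields
// and escaped quotes ("") inside them.
function parseCSVLine(line: string): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    if (char === '"') {
      if (inQuotes && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else {
        inQuotes = !inQuotes;
      }
    } else if (char === ',' && !inQuotes) {
      fields.push(current);
      current = '';
    } else {
      current += char;
    }
  }
  fields.push(current);
  return fields;
}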


Comment on lines 404 to 415
// JSON cannot be streamed easily, so enforce a file size limit
if (stats.size > MAX_JSON_FILE_SIZE_BYTES) {
  const fileSizeMB = (stats.size / (1024 * 1024)).toFixed(2);
  return {
    llmContent: `JSON file is too large (${fileSizeMB} MB). Maximum supported size for JSON files is ${MAX_JSON_FILE_SIZE_MB} MB. For large JSON files, write a Python script using the 'json' module with streaming (ijson) or load in chunks.`,
    returnDisplay: `JSON file too large (${fileSizeMB} MB, max ${MAX_JSON_FILE_SIZE_MB} MB)`,
    error: {
      message: `JSON file size (${fileSizeMB} MB) exceeds ${MAX_JSON_FILE_SIZE_MB} MB limit`,
      type: ToolErrorType.FILE_TOO_LARGE,
    },
  };
}

Copilot AI Dec 25, 2025


The JSON file size validation logic lacks test coverage. This is a critical new feature that prevents memory exhaustion, and it should be validated with tests to ensure the size limit is enforced correctly and appropriate error messages are returned. Other tools in this directory have comprehensive test files.
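
A rough sketch of what a boundary test for the limit could look like, assuming a vitest-style suite and checking the guard logic in isolation (the helper below restates the check rather than calling the real tool):

import { describe, expect, it } from 'vitest';

// Illustrative stand-in for the guard added in this PR; the real check lives in
// read-data-file.ts and returns a ToolResult with ToolErrorType.FILE_TOO_LARGE
// rather than a boolean.
const MAX_JSON_FILE_SIZE_BYTES = 100 * 1024 * 1024;
const isJsonTooLarge = (sizeBytes: number) => sizeBytes > MAX_JSON_FILE_SIZE_BYTES;

describe('JSON file size limit', () => {
  it('accepts a file exactly at the 100 MB limit', () => {
    expect(isJsonTooLarge(100 * 1024 * 1024)).toBe(false);
  });

  it('rejects a file one byte over the limit', () => {
    expect(isJsonTooLarge(100 * 1024 * 1024 + 1)).toBe(true);
  });
});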

@vyagh (Author)


The core implementation has been tested manually. I can add tests in a follow-up PR if the maintainers request it.

Comment on lines 112 to 133
for await (const line of rl) {
  const trimmedLine = line.trim();
  if (!trimmedLine) continue;

  if (isFirstLine) {
    headers = this.parseCSVLine(trimmedLine);
    isFirstLine = false;
    continue;
  }

  totalRows++;

  // Only store rows up to displayMaxRows for the sample
  if (sampleData.length < displayMaxRows) {
    const values = this.parseCSVLine(trimmedLine);
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      row[header] = values[index] || '';
    });
    sampleData.push(row);
  }
}

Copilot AI Dec 25, 2025


The streaming implementation continues reading the entire file even after collecting the maximum number of sample rows. This defeats the purpose of preventing memory exhaustion for very large files, as the entire file is still processed line by line. Consider closing the stream early once totalRows reaches a reasonable limit (e.g., after counting enough rows for accurate statistics, or implementing a configurable max count limit).
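
As a sketch of that suggestion, the tail of the loop above could stop streaming once the sample is full and a hypothetical cap (maxCountRows, not part of this PR) has been reached:

  totalRows++;

  if (sampleData.length < displayMaxRows) {
    const values = this.parseCSVLine(trimmedLine);
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      row[header] = values[index] || '';
    });
    sampleData.push(row);
  } else if (totalRows >= maxCountRows) {
    // Sample is full and enough rows have been counted; report the count as a
    // lower bound and stop reading instead of scanning the rest of the file.
    rl.close();
    break; // leaving the for-await loop stops further reads from the stream
  }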

@vyagh (Author)


This is an intentional design choice: the tool provides accurate row counts for analysis, and memory usage stays constant since only the sample rows are kept in memory.

Comment on lines 213 to 218
for await (const line of rl) {
  totalLines++;
  if (sampleLines.length < maxRows) {
    sampleLines.push(line);
  }
}

Copilot AI Dec 25, 2025


The streaming implementation continues reading the entire file even after collecting the maximum number of sample lines. For very large text files (e.g., multi-GB log files), this means the function will still process every line in the file, which can take a significant amount of time. Consider closing the stream early once enough sample lines have been collected, if the total line count is not critical for the use case.
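
Sketch of the same idea for the TXT path, with maxCountLines as a hypothetical cap (not part of this PR):

for await (const line of rl) {
  totalLines++;
  if (sampleLines.length < maxRows) {
    sampleLines.push(line);
  } else if (totalLines >= maxCountLines) {
    rl.close();
    break; // the reported total then becomes a lower bound
  }
}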

@vyagh (Author)


Accurate line counts are useful, and memory stays constant regardless of file size since only the sample lines are stored.



Development

Successfully merging this pull request may close these issues.

bug: Memory Exhaustion on Large Files
