
Conversation

@vyagh commented Dec 25, 2025

Fixes #25

Problem

The read-data-file tool loads entire files into memory before parsing,
causing OOM crashes for files larger than ~500MB.

Solution

  • Use streaming (readline.createInterface()) for CSV and TXT files
  • Add 100MB file size limit for JSON files (JSON can't be safely streamed)

Changes

  • packages/core/src/tools/read-data-file.ts: Added parseCSVStream()
    and parseTXTStream() methods, plus JSON size validation
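
A minimal sketch of the streaming approach described above (the helper names and shapes here are illustrative, not the exact code in read-data-file.ts):

import { createReadStream } from 'node:fs';
import { stat } from 'node:fs/promises';
import * as readline from 'node:readline';

const MAX_JSON_FILE_SIZE_MB = 100;
const MAX_JSON_FILE_SIZE_BYTES = MAX_JSON_FILE_SIZE_MB * 1024 * 1024;

// Stream a text-based file line by line. Memory stays bounded because only
// the first `maxRows` lines are retained as a sample.
async function streamLines(filePath: string, maxRows: number) {
  const rl = readline.createInterface({
    input: createReadStream(filePath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  const sampleLines: string[] = [];
  let totalLines = 0;
  for await (const line of rl) {
    totalLines++;
    if (sampleLines.length < maxRows) {
      sampleLines.push(line);
    }
  }
  return { sampleLines, totalLines };
}

// JSON.parse needs the whole document in memory, so oversized JSON files are
// rejected up front rather than streamed.
async function assertJsonWithinLimit(filePath: string): Promise<void> {
  const { size } = await stat(filePath);
  if (size > MAX_JSON_FILE_SIZE_BYTES) {
    const fileSizeMB = (size / (1024 * 1024)).toFixed(2);
    throw new Error(
      `JSON file is too large (${fileSizeMB} MB); maximum is ${MAX_JSON_FILE_SIZE_MB} MB.`,
    );
  }
}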

Copilot AI review requested due to automatic review settings December 25, 2025 21:36

Copilot AI left a comment


Pull request overview

This PR addresses memory exhaustion issues when reading large data files by implementing streaming for CSV and TXT files, and adding a file size limit for JSON files (which cannot be easily streamed).

Key Changes:

  • Replaced synchronous file reading with streaming using readline.createInterface() for CSV and TXT files
  • Added a 100MB file size limit for JSON files with appropriate error handling
  • Extracted CSV line parsing into a reusable parseCSVLine() method
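
The extracted parseCSVLine() itself is not shown in the diff excerpts below; purely as an illustration (an assumption, not the PR's implementation), a quote-aware splitter for a single line might look like:

// Illustrative only -- not the implementation from this PR.
// Splits one CSV line into fields, honoring double-quoted fields
// and escaped quotes ("") inside them.
function parseCSVLine(line: string): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    if (char === '"') {
      if (inQuotes && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else {
        inQuotes = !inQuotes;
      }
    } else if (char === ',' && !inQuotes) {
      fields.push(current);
      current = '';
    } else {
      current += char;
    }
  }
  fields.push(current);
  return fields;
}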


Comment on lines 404 to 415
// JSON cannot be streamed easily, so enforce a file size limit
if (stats.size > MAX_JSON_FILE_SIZE_BYTES) {
  const fileSizeMB = (stats.size / (1024 * 1024)).toFixed(2);
  return {
    llmContent: `JSON file is too large (${fileSizeMB} MB). Maximum supported size for JSON files is ${MAX_JSON_FILE_SIZE_MB} MB. For large JSON files, write a Python script using the 'json' module with streaming (ijson) or load in chunks.`,
    returnDisplay: `JSON file too large (${fileSizeMB} MB, max ${MAX_JSON_FILE_SIZE_MB} MB)`,
    error: {
      message: `JSON file size (${fileSizeMB} MB) exceeds ${MAX_JSON_FILE_SIZE_MB} MB limit`,
      type: ToolErrorType.FILE_TOO_LARGE,
    },
  };
}

Copilot AI Dec 25, 2025


The JSON file size validation logic lacks test coverage. This is a critical new feature that prevents memory exhaustion, and it should be validated with tests to ensure the size limit is enforced correctly and appropriate error messages are returned. Other tools in this directory have comprehensive test files.
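
A rough sketch of what a boundary test for the limit could look like, assuming a vitest-style suite and checking the guard logic in isolation (the helper below restates the check rather than calling the real tool):

import { describe, expect, it } from 'vitest';

// Illustrative stand-in for the guard added in this PR; the real check lives in
// read-data-file.ts and returns a ToolResult with ToolErrorType.FILE_TOO_LARGE
// rather than a boolean.
const MAX_JSON_FILE_SIZE_BYTES = 100 * 1024 * 1024;
const isJsonTooLarge = (sizeBytes: number) => sizeBytes > MAX_JSON_FILE_SIZE_BYTES;

describe('JSON file size limit', () => {
  it('accepts a file exactly at the 100 MB limit', () => {
    expect(isJsonTooLarge(100 * 1024 * 1024)).toBe(false);
  });

  it('rejects a file one byte over the limit', () => {
    expect(isJsonTooLarge(100 * 1024 * 1024 + 1)).toBe(true);
  });
});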

@vyagh (Author)


The core implementation has been tested manually. I can add tests in a follow-up PR if the maintainers request it.

Comment on lines 112 to 133
for await (const line of rl) {
  const trimmedLine = line.trim();
  if (!trimmedLine) continue;

  if (isFirstLine) {
    headers = this.parseCSVLine(trimmedLine);
    isFirstLine = false;
    continue;
  }

  totalRows++;

  // Only store rows up to displayMaxRows for the sample
  if (sampleData.length < displayMaxRows) {
    const values = this.parseCSVLine(trimmedLine);
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      row[header] = values[index] || '';
    });
    sampleData.push(row);
  }
}

Copilot AI Dec 25, 2025


The streaming implementation continues reading the entire file even after collecting the maximum number of sample rows. This defeats the purpose of preventing memory exhaustion for very large files, as the entire file is still processed line by line. Consider closing the stream early once totalRows reaches a reasonable limit (e.g., after counting enough rows for accurate statistics, or implementing a configurable max count limit).
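
As a sketch of that suggestion, the tail of the loop above could stop streaming once the sample is full and a hypothetical cap (maxCountRows, not part of this PR) has been reached:

  totalRows++;

  if (sampleData.length < displayMaxRows) {
    const values = this.parseCSVLine(trimmedLine);
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      row[header] = values[index] || '';
    });
    sampleData.push(row);
  } else if (totalRows >= maxCountRows) {
    // Sample is full and enough rows have been counted; report the count as a
    // lower bound and stop reading instead of scanning the rest of the file.
    rl.close();
    break; // leaving the for-await loop stops further reads from the stream
  }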

@vyagh (Author)


This is an intentional design choice: the tool provides accurate row counts for analysis, and memory usage stays constant since only the sample rows are kept in memory.

Comment on lines 213 to 218
for await (const line of rl) {
  totalLines++;
  if (sampleLines.length < maxRows) {
    sampleLines.push(line);
  }
}

Copilot AI Dec 25, 2025


The streaming implementation continues reading the entire file even after collecting the maximum number of sample lines. For very large text files (e.g., multi-GB log files), this means the function will still process every line in the file, which can take a significant amount of time. Consider closing the stream early once enough sample lines have been collected, if the total line count is not critical for the use case.
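
Sketch of the same idea for the TXT path, with maxCountLines as a hypothetical cap (not part of this PR):

for await (const line of rl) {
  totalLines++;
  if (sampleLines.length < maxRows) {
    sampleLines.push(line);
  } else if (totalLines >= maxCountLines) {
    rl.close();
    break; // the reported total then becomes a lower bound
  }
}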

@vyagh (Author)


Accurate line counts are useful, and memory stays constant regardless of file size since only the sample lines are stored.



Development

Successfully merging this pull request may close these issues.

bug: Memory Exhaustion on Large Files
