Skip to content

Conversation

@tmccaughey
Copy link
Contributor

Hi Falak. I have fixed the chunking size, it is not exactly identical to the one in the manual upload but a lot better. Thanks for your time! :)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an S3 knowledge connector feature to the AWS extension, enabling extraction and processing of documents from S3 buckets into knowledge sources. It also updates deprecated APIs and dependencies.

Key changes:

  • Implements S3 connector with support for multiple file types (txt, pdf, docx, csv, json, jsonl, md, pptx)
  • Integrates LangChain for document loading and text chunking with configurable chunk sizes
  • Updates deprecated Buffer constructor and context API methods

Reviewed changes

Copilot reviewed 15 out of 17 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
extensions/aws/tsconfig.json Added compiler options for better module compatibility
extensions/aws/src/nodes/lambdaInvoke.ts Fixed deprecated Buffer constructor and updated context API call
extensions/aws/src/module.ts Registered S3 connector and commented out existing nodes
extensions/aws/src/knowledge-connectors/s3Connector.ts Main connector implementation for processing S3 files into knowledge sources
extensions/aws/src/knowledge-connectors/helpers/text_extractor.ts Document loader using LangChain for various file types
extensions/aws/src/knowledge-connectors/helpers/text_chunker.ts Text splitting logic with configurable chunk sizes
extensions/aws/src/knowledge-connectors/helpers/new_utils.ts S3 file download and chunk extraction utilities
extensions/aws/src/knowledge-connectors/helpers/list_files.ts S3 bucket listing functionality
extensions/aws/src/knowledge-connectors/helpers/utils/*.ts Utility functions for text processing and configuration
extensions/aws/src/knowledge-connectors/helpers/creds.env Environment variable template for AWS credentials
extensions/aws/package.json Updated dependencies to support LangChain and newer AWS SDK
extensions/aws/.npmrc Added legacy peer deps flag for dependency resolution

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +13 to +16
// sayPollyNode,
// lambdaInvokeNode,
// s3GetObjectNode,
// s3PutObjectNode
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All existing nodes have been commented out, leaving the nodes array empty. This breaks existing functionality for users of this extension. If these nodes are being temporarily disabled for testing, consider using a feature flag instead, or create a separate branch for the S3 connector work that doesn't disable production features.

Suggested change
// sayPollyNode,
// lambdaInvokeNode,
// s3GetObjectNode,
// s3PutObjectNode
sayPollyNode,
lambdaInvokeNode,
s3GetObjectNode,
s3PutObjectNode

Copilot uses AI. Check for mistakes.
};

/**
* Custom PPTXLoader class to handle pptx files. Impelementation adapted
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'Impelementation' to 'Implementation'.

Suggested change
* Custom PPTXLoader class to handle pptx files. Impelementation adapted
* Custom PPTXLoader class to handle pptx files. Implementation adapted

Copilot uses AI. Check for mistakes.
return chunkSize;
};

const getRecursiveCharacterTextSplitter = () => {
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation - this function definition has an extra tab character at the beginning of the line. The indentation should align with other top-level function definitions in the file.

Suggested change
const getRecursiveCharacterTextSplitter = () => {
const getRecursiveCharacterTextSplitter = () => {

Copilot uses AI. Check for mistakes.
});
return splitter;
};
export const splitTextUsingRecursiveCharacterTextSplitter = (text: string) => {
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation - this exported function has an extra tab character at the beginning of the line. The indentation should align with other top-level exports in the file.

Suggested change
export const splitTextUsingRecursiveCharacterTextSplitter = (text: string) => {
export const splitTextUsingRecursiveCharacterTextSplitter = (text: string) => {

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +4
import { get } from "http";


Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'get' import from 'http' is unused in this file and should be removed.

Suggested change
import { get } from "http";

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,67 @@
import { S3Client, ListObjectsV2Command, S3 } from "@aws-sdk/client-s3";
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'S3' import from '@aws-sdk/client-s3' is unused in this file and should be removed.

Suggested change
import { S3Client, ListObjectsV2Command, S3 } from "@aws-sdk/client-s3";
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +4
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_REGION=eu-central-1
BUCKET_NAME=your_bucket_name_here No newline at end of file
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment credential files should not be committed to the repository. This file should be added to .gitignore and replaced with a .env.example or .env.template file to document required variables without exposing placeholder values that could be mistaken for real credentials.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this file?

Comment on lines +1 to +4
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_REGION=eu-central-1
BUCKET_NAME=your_bucket_name_here No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this file?

});
break;

case "RecursiveCharacterTextSplitter":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just use RecursiveCharacterTextSplitter for all and remove option to switch beween splitters, to make the code simplified

return supportedExtensions.some(ext => keyLower.endsWith(`.${ext}`));
});

console.log(`Filtered to ${filteredObject.length} supported files (skipped ${s3Objects.length - filteredObject.length} other files)`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add so many console.logs? We can just add the comments instead of logs

const splitter = getRecursiveCharacterTextSplitter();

// Override separators to use default ones, to ensure chunk size limits are respected
splitter.separators = ["\n\n", "\n", " ", ""];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just use default sperator, did you try default seperator and what are the results? I think adding "", will cause chunks to be quite small.

return splitParagraphs;
}

const getChunkSizeInTokens = () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep only RecursiveCharacterTextSplitter, then this function is not needed

@@ -0,0 +1,13 @@
export const removeUnnecessaryChars = (text: string): string => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can move the utils like removeUnnecessaryChars and convertToUtf8.ts to text_extractor.ts as its only used there, no need to create separate files.

@@ -0,0 +1,97 @@
import type { IKnowledge } from "@cognigy/extension-tools";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file name new_utils should be renamed to something better.

@falak-asad
Copy link
Contributor

Suggested changes:

  • We need to minimize the code, and things that we are not using, should be removed
  • Fix some indentation issues, that copilot suggested in the review
  • Remove unnecessary files
  • Remove console logs and replace it with comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants