Merged
464 changes: 464 additions & 0 deletions .editorconfig

Large diffs are not rendered by default.

5 changes: 4 additions & 1 deletion AGENTS.md
@@ -1,9 +1,12 @@
# Repository Guidelines

# Rules to follow
## Rules to follow
- Always run `dotnet build GraphRag.slnx` (or the relevant project) before executing any `dotnet test` command.
- Default to the latest available versions (e.g., Apache AGE `latest`) when selecting dependencies, per user request ("you need latest").
- Do not create or rely on fake database stores (e.g., `FakePostgresGraphStore`); all tests must use real connectors/backing services.
- Keep default prompts in static C# classes; do not rely on prompt files under `prompts/` for built-in templates.
- Register language models through Microsoft.Extensions.AI keyed services; avoid bespoke `LanguageModelConfig` providers.
- Always run `dotnet format GraphRag.slnx` before finishing work.

# Conversations
any resulting updates to agents.md should go under the section "## Rules to follow"
6 changes: 3 additions & 3 deletions Directory.Build.props
@@ -25,8 +25,8 @@
<RepositoryUrl>https://github.com/managedcode/graphrag</RepositoryUrl>
<PackageProjectUrl>https://github.com/managedcode/graphrag</PackageProjectUrl>
<Product>Managed Code GraphRag</Product>
<Version>0.0.2</Version>
<PackageVersion>0.0.2</PackageVersion>
<Version>0.0.3</Version>
<PackageVersion>0.0.3</PackageVersion>

</PropertyGroup>
<PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">
@@ -42,7 +42,7 @@
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
</ItemGroup>

<ItemGroup>
<PackageReference Update="Microsoft.SourceLink.GitHub" Version="8.0.0" />
</ItemGroup>
3 changes: 2 additions & 1 deletion Directory.Packages.props
@@ -3,6 +3,7 @@
<PackageVersion Include="coverlet.collector" Version="6.0.4" />
<PackageVersion Include="Microsoft.Azure.Cosmos" Version="3.54.0" />
<PackageVersion Include="Microsoft.Extensions.Configuration" Version="9.0.10" />
<PackageVersion Include="Microsoft.Extensions.Caching.Memory" Version="9.0.10" />
<PackageVersion Include="Microsoft.Extensions.DependencyInjection" Version="9.0.10" />
<PackageVersion Include="Microsoft.Extensions.Logging" Version="9.0.10" />
<PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="9.0.10" />
@@ -20,4 +21,4 @@
<PackageVersion Include="xunit" Version="2.9.3" />
<PackageVersion Include="xunit.runner.visualstudio" Version="3.1.5" />
</ItemGroup>
</Project>
</Project>
68 changes: 65 additions & 3 deletions README.md
@@ -86,10 +86,72 @@ graphrag/

## Integration Testing Strategy

- **No fakes.** We removed the legacy fake Postgres store. Every graph operation in tests uses real services orchestrated by Testcontainers.
- **Security coverage.** `Integration/PostgresGraphStoreIntegrationTests.cs` includes payloads that mimic SQL/Cypher injection attempts to ensure values remain literals and labels/types are strictly validated.
- **Cross-backend validation.** `Integration/GraphStoreIntegrationTests.cs` exercises Postgres, Neo4j, and Cosmos (when available) through the shared `IGraphStore` abstraction.
- **Workflow smoke tests.** Pipelines (e.g., `IndexingPipelineRunnerTests`) and finalization steps run end-to-end with the fixture-provisioned infrastructure.
- **Prompt precedence.** `Integration/CommunitySummariesIntegrationTests.cs` proves manual prompt overrides win over auto-tuned assets while still falling back to auto templates when manual text is absent.
- **Callback and stats instrumentation.** `Runtime/PipelineExecutorTests.cs` now asserts that pipeline callbacks fire and runtime statistics are captured even when workflows fail early, so custom telemetry remains reliable.

---

## Pipeline Cache

Pipelines exchange state through the `IPipelineCache` abstraction. Every workflow step receives the same cache instance via `PipelineRunContext`, so it can reuse expensive results (LLM calls, chunk expansions, graph lookups) that were produced earlier in the run instead of recomputing them. The cache also keeps optional debug payloads per entry so you can persist trace metadata alongside the main value.

To use the built-in in-memory cache, register it alongside the standard ASP.NET Core services:

```csharp
using GraphRag.Cache;

builder.Services.AddMemoryCache();
builder.Services.AddSingleton<IPipelineCache, MemoryPipelineCache>();
```

Prefer a different backend? Implement `IPipelineCache` yourself and register it through DI—the pipeline will pick up your custom cache automatically.

- **Per-scope isolation.** `MemoryPipelineCache.CreateChild("stage")` scopes keys by prefix (`parent:stage:key`). Calling `ClearAsync` on the parent removes every nested key, so multi-step workflows do not leak data between stages.
- **Debug traces.** The cache stores optional debug payloads per entry; `DeleteAsync` and `ClearAsync` always clear these traces, preventing the diagnostic dictionary from growing unbounded.
- **Lifecycle guidance.** Create the root cache once per pipeline run (the default context factory does this for you) and spawn children inside individual workflows when you need an isolated namespace.
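
The prefix scoping described above can be sketched as follows. This is a minimal illustration only, not the `MemoryPipelineCache` implementation: `ScopedCache` and all of its members are hypothetical names, and the sketch assumes nothing beyond the `parent:stage:key` key layout mentioned in the bullets.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

// Hypothetical illustration; not the real MemoryPipelineCache.
var root = new ScopedCache("parent");
var stage = root.CreateChild("stage");

root.Set("a", 1);        // stored under "parent:a"
stage.Set("key", 2);     // stored under "parent:stage:key"

Console.WriteLine(stage.TryGet("key", out _));   // True
root.Clear();                                    // removes every key starting with "parent:"
Console.WriteLine(stage.TryGet("key", out _));   // False

sealed class ScopedCache
{
    // All scopes share one store; only the key prefix differs.
    private readonly ConcurrentDictionary<string, object> _store;
    private readonly string _prefix;

    public ScopedCache(string name)
        : this(new ConcurrentDictionary<string, object>(), name + ":") { }

    private ScopedCache(ConcurrentDictionary<string, object> store, string prefix)
    {
        _store = store;
        _prefix = prefix;
    }

    public ScopedCache CreateChild(string name) => new(_store, _prefix + name + ":");

    public void Set(string key, object value) => _store[_prefix + key] = value;

    public bool TryGet(string key, out object? value) =>
        _store.TryGetValue(_prefix + key, out value);

    // Clearing a scope removes its own keys and those of every nested child.
    public void Clear()
    {
        var doomed = _store.Keys
            .Where(k => k.StartsWith(_prefix, StringComparison.Ordinal))
            .ToList();
        foreach (var key in doomed)
            _store.TryRemove(key, out _);
    }
}
```

Because a child's prefix always starts with its parent's prefix, a single prefix scan is enough to clear a whole subtree; clearing a child leaves the parent's own keys untouched.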

---

## Language Model Registration

GraphRAG delegates language-model configuration to [Microsoft.Extensions.AI](https://learn.microsoft.com/dotnet/ai/overview). Register keyed clients for every `ModelId` you reference in configuration—pick any string key that matches your config:

```csharp
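using System;
using Azure;
using Azure.AI.OpenAI;
using GraphRag.Config;
using Microsoft.Extensions.AI;

// endpoint, key, chatDeployment, and embeddingDeployment are placeholders you supply.
var openAi = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(key));
const string chatModelId = "chat_model";
const string embeddingModelId = "embedding_model";

// Keyed factories receive the service provider and the key, hence the two-parameter lambdas.
// Depending on your SDK versions, the returned clients may need an adapter
// (for example AsIChatClient() from Microsoft.Extensions.AI.OpenAI) to satisfy these interfaces.
builder.Services.AddKeyedSingleton<IChatClient>(
    chatModelId,
    (_, _) => openAi.GetChatClient(chatDeployment));

builder.Services.AddKeyedSingleton<IEmbeddingGenerator<string, Embedding>>(
    embeddingModelId,
    (_, _) => openAi.GetEmbeddingClient(embeddingDeployment));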
```

Rate limits, retries, and other policies should be configured when you create these clients (for example by wrapping them with `Polly` handlers). `GraphRagConfig.Models` simply tracks the set of model keys that have been registered so overrides can validate references.
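
To show how a keyed registration is later looked up by its string key, here is a small self-contained sketch. It assumes only the `Microsoft.Extensions.DependencyInjection` package (version 8 or later); `ITextModel` and `EchoModel` are hypothetical stand-ins for `IChatClient` so the example needs no AI packages, and it does not show how GraphRAG itself resolves models.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register two models under different keys, mirroring the ModelId idea.
services.AddKeyedSingleton<ITextModel>("chat_model", (_, _) => new EchoModel("chat"));
services.AddKeyedSingleton<ITextModel>("summary_model", (_, _) => new EchoModel("summary"));

using var provider = services.BuildServiceProvider();

// A consumer looks the model up by the key string taken from configuration.
var model = provider.GetRequiredKeyedService<ITextModel>("chat_model");
Console.WriteLine(model.Complete("hello"));

// The non-throwing lookup returns null for a key that was never registered.
Console.WriteLine(provider.GetKeyedService<ITextModel>("missing") is null);

// Hypothetical stand-in for IChatClient.
interface ITextModel
{
    string Complete(string prompt);
}

sealed class EchoModel : ITextModel
{
    private readonly string _name;

    public EchoModel(string name) => _name = name;

    public string Complete(string prompt) => $"[{_name}] {prompt}";
}
```

A mismatch between a configured key and the registered keys surfaces at resolution time, which is one reason to validate model references up front.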

---

## Indexing, Querying, and Prompt Tuning Alignment

The .NET port mirrors the [GraphRAG indexing architecture](https://microsoft.github.io/graphrag/index/overview/) and its query workflows so downstream applications retain parity with the Python reference implementation.

- **Indexing overview.** Workflows such as `extract_graph`, `create_communities`, and `community_summaries` map 1:1 to the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/) and persist the same tables (`text_units`, `entities`, `relationships`, `communities`, `community_reports`, `covariates`). The new prompt template loader honours manual or auto-tuned prompts before falling back to the stock templates in `prompts/`.
- **Query capabilities.** The query pipeline retains global search, local search, drift search, and question generation semantics described in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/). Each orchestrator continues to assemble context from the indexed tables so you can reference [global](https://microsoft.github.io/graphrag/query/global_search/) or [local](https://microsoft.github.io/graphrag/query/local_search/) narratives interchangeably.
- **Prompt tuning.** GraphRAG’s [manual](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/) and [auto](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/) strategies are surfaced through `GraphRagConfig.PromptTuning`. Store custom templates under `prompts/` or point `PromptTuning.Manual.Directory`/`PromptTuning.Auto.Directory` at your tuning outputs. You can also skip files entirely by assigning inline text (multi-line or prefixed with `inline:`) to workflow prompt properties. Stage keys and placeholders are documented in `docs/indexing-and-query.md`.

See [`docs/indexing-and-query.md`](docs/indexing-and-query.md) for a deeper mapping between the .NET workflows and the research publications underpinning GraphRAG.

---

99 changes: 99 additions & 0 deletions docs/indexing-and-query.md
@@ -0,0 +1,99 @@
# Indexing, Querying, and Prompt Tuning in GraphRAG for .NET

GraphRAG for .NET keeps feature parity with the Python reference project described in the [Microsoft Research blog](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/) and the [GraphRAG paper](https://arxiv.org/pdf/2404.16130). This document explains how the .NET workflows map to the concepts documented on [microsoft.github.io/graphrag](https://microsoft.github.io/graphrag/), highlights the supported query modes, and shows how to customise prompts via manual or auto tuning outputs.

## Indexing Architecture

- **Workflow parity.** Each indexing stage matches the Python pipeline and the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/):
  - `load_input_documents` → `create_base_text_units` → `summarize_descriptions`
  - `extract_graph` persists `entities` and `relationships`
  - `create_communities` produces `communities`
  - `community_summaries` writes `community_reports`
  - `extract_covariates` stores `covariates`
- **Storage schema.** Tables share the column layout described under [index outputs](https://microsoft.github.io/graphrag/index/outputs/). The new strongly-typed records (`CommunityRecord`, `CovariateRecord`, etc.) mirror the JSON representation used by the Python implementation.
- **Cluster configuration.** `GraphRagConfig.ClusterGraph` exposes the same knobs as the Python `cluster_graph` settings, enabling largest-component filtering and deterministic seeding.

## Language Model Registration

Workflows resolve language models from the DI container via [Microsoft.Extensions.AI](https://learn.microsoft.com/dotnet/ai/overview). Register keyed services for every `ModelId` you plan to reference:

```csharp
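using System;
using Azure;
using Azure.AI.OpenAI;
using GraphRag.Config;
using Microsoft.Extensions.AI;

// endpoint, key, chatDeployment, and embeddingDeployment are placeholders you supply.
var openAi = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(key));
const string chatModelId = "chat_model";
const string embeddingModelId = "embedding_model";

// Keyed factories receive the service provider and the key, hence the two-parameter lambdas.
// Depending on your SDK versions, the returned clients may need an adapter to satisfy these interfaces.
services.AddKeyedSingleton<IChatClient>(chatModelId, (_, _) => openAi.GetChatClient(chatDeployment));
services.AddKeyedSingleton<IEmbeddingGenerator<string, Embedding>>(embeddingModelId, (_, _) => openAi.GetEmbeddingClient(embeddingDeployment));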
```

Configure retries, rate limits, and logging when you construct the concrete clients. `GraphRagConfig.Models` simply records the set of registered keys so configuration overrides can validate references.

## Pipeline Cache

`IPipelineCache` is intentionally infrastructure-neutral. To mirror ASP.NET Core's in-memory behaviour, register the built-in cache services alongside the provided adapter:

```csharp
services.AddMemoryCache();
services.AddSingleton<IPipelineCache, MemoryPipelineCache>();
```

Need Redis or something else? Implement `IPipelineCache` yourself and register it through DI; the pipeline will automatically consume your custom cache.

## Query Capabilities

The query layer ports the orchestrators documented in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/):

- **Global search** ([docs](https://microsoft.github.io/graphrag/query/global_search/)) traverses community summaries and graph context to craft answers spanning the corpus.
- **Local search** ([docs](https://microsoft.github.io/graphrag/query/local_search/)) anchors on the entities most relevant to the query and their graph neighbourhood when you need focused context.
- **DRIFT search** ([docs](https://microsoft.github.io/graphrag/query/drift_search/)) combines both approaches: it primes the query with community-level context, then refines the answer through local follow-up queries.
- **Question generation** ([docs](https://microsoft.github.io/graphrag/query/question_generation/)) produces follow-up questions to extend an investigation.

Every orchestrator consumes the same indexed tables as the Python project, so the .NET stack interoperates with the bring-your-own-graph (BYOG) scenarios described in the [index architecture guide](https://microsoft.github.io/graphrag/index/architecture/).

## Prompt Tuning

Manual and auto prompt tuning are both available without code changes:

1. **Manual overrides** follow the rules from [manual prompt tuning](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/).
   - Place custom templates under a directory referenced by `GraphRagConfig.PromptTuning.Manual.Directory` and set `Enabled = true`.
   - Filenames follow the stage key pattern `section/workflow/kind.txt` (see table below).
2. **Auto tuning** integrates the outputs documented in [auto prompt tuning](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/).
   - Point `GraphRagConfig.PromptTuning.Auto.Directory` at the folder containing the generated prompts and set `Enabled = true`.
   - The runtime prefers explicit paths from workflow configs, then manual overrides, then auto-tuned files, and finally the built-in defaults in `prompts/`.
3. **Inline overrides** can be injected directly from code: set `ExtractGraphConfig.SystemPrompt`, `ExtractGraphConfig.Prompt`, or the equivalent properties to either a multi-line string or a value prefixed with `inline:`. Inline values bypass template file lookups and are used as-is.
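
The precedence rules in the list above can be sketched as a single lookup function. This is a hedged illustration, not the library's prompt loader: `PromptResolver.Resolve` and its parameters are hypothetical, and stripping the `inline:` prefix before use is an assumption about how inline values are handled.

```csharp
using System;
using System.IO;

// Set up a temporary "manual" directory containing one override file.
var manualDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
var stageKey = "index/extract_graph/system.txt";
var manualFile = Path.Combine(manualDir, stageKey);
Directory.CreateDirectory(Path.GetDirectoryName(manualFile)!);
File.WriteAllText(manualFile, "manual prompt");

Console.WriteLine(PromptResolver.Resolve(null, manualDir, null, stageKey, "built-in"));
Console.WriteLine(PromptResolver.Resolve("inline:custom", manualDir, null, stageKey, "built-in"));
Console.WriteLine(PromptResolver.Resolve(null, null, null, stageKey, "built-in"));

// Hypothetical helper; the real loader's names and signatures may differ.
static class PromptResolver
{
    public static string Resolve(
        string? explicitValue, string? manualDir, string? autoDir,
        string stageKey, string builtIn)
    {
        if (!string.IsNullOrWhiteSpace(explicitValue))
        {
            // Inline text: an "inline:" prefix (assumed to be stripped) or any multi-line value.
            if (explicitValue.StartsWith("inline:", StringComparison.Ordinal))
                return explicitValue["inline:".Length..];
            if (explicitValue.Contains('\n'))
                return explicitValue;

            // Otherwise treat the value as an explicit file path.
            if (File.Exists(explicitValue))
                return File.ReadAllText(explicitValue);
        }

        // Manual overrides are checked before auto-tuned files.
        foreach (var dir in new[] { manualDir, autoDir })
        {
            if (dir is null) continue;
            var candidate = Path.Combine(dir, stageKey);
            if (File.Exists(candidate))
                return File.ReadAllText(candidate);
        }

        return builtIn;
    }
}
```

The three calls exercise the manual override, the inline value, and the built-in fallback in turn.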

### Stage Keys and Placeholders

| Workflow | Stage key | Purpose | Supported placeholders |
|----------|-----------|---------|------------------------|
| `extract_graph` (system) | `index/extract_graph/system.txt` | System prompt that instructs the extractor. | _N/A_ |
| `extract_graph` (user) | `index/extract_graph/user.txt` | User prompt template for individual text units. | `{{max_entities}}`, `{{text}}` |
| `community_summaries` (system) | `index/community_reports/system.txt` | System guidance for cluster summarisation. | _N/A_ |
| `community_summaries` (user) | `index/community_reports/user.txt` | User prompt template for entity lists. | `{{max_length}}`, `{{entities}}` |

Placeholders are replaced at runtime with values drawn from workflow configuration:

- `{{max_entities}}` → `ExtractGraphConfig.EntityTypes.Count + 5` (minimum 1)
- `{{text}}` → the original text unit content
- `{{max_length}}` → `CommunityReportsConfig.MaxLength`
- `{{entities}}` → bullet list of entity titles and descriptions

If a template is omitted, the runtime falls back to the built-in prompts defined in `GraphRagPromptLibrary`.
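
As a rough sketch of the substitution described above (not the library's actual renderer), the following replaces `{{name}}` tokens in a single pass. `PromptTemplate.Render` is a hypothetical helper, and the sample template and text are made up for illustration; only the `EntityTypes.Count + 5` (minimum 1) rule comes from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var template = "Extract up to {{max_entities}} entities from:\n{{text}}";
var entityTypes = new List<string> { "person", "organization", "location" };

var values = new Dictionary<string, string>
{
    // Mirrors the documented rule: EntityTypes.Count + 5, never below 1.
    ["max_entities"] = Math.Max(1, entityTypes.Count + 5).ToString(),
    ["text"] = "Ada Lovelace worked with Charles Babbage.",
};

Console.WriteLine(PromptTemplate.Render(template, values));

// Hypothetical helper for illustration.
static class PromptTemplate
{
    private static readonly Regex Token = new(@"\{\{(\w+)\}\}");

    // Replaces each {{name}} token in one pass; unknown tokens are left untouched,
    // and substituted values are never re-scanned for further tokens.
    public static string Render(string template, IReadOnlyDictionary<string, string> values) =>
        Token.Replace(template, m =>
            values.TryGetValue(m.Groups[1].Value, out var v) ? v : m.Value);
}
```

A single-pass regex avoids a subtle problem with chained `string.Replace` calls: if the text unit itself happens to contain `{{max_entities}}`, it is not substituted a second time.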

## Integration Tests

`tests/ManagedCode.GraphRag.Tests/Integration/CommunitySummariesIntegrationTests.cs` exercises the new prompt loader end-to-end using the file-backed pipeline storage. Combined with the existing Aspire-powered suites, the tests demonstrate how indexing, community detection, and summarisation behave with tuned prompts while remaining faithful to the [GraphRAG BYOG guidance](https://microsoft.github.io/graphrag/index/byog/).

## Further Reading

- [GraphRAG prompt tuning overview](https://microsoft.github.io/graphrag/prompt_tuning/overview/)
- [GraphRAG index methods](https://microsoft.github.io/graphrag/index/methods/)
- [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/)
- [GraphRAG default dataflow](https://microsoft.github.io/graphrag/index/default_dataflow/)

These resources underpin the .NET implementation and provide broader context for customising or extending the library.
2 changes: 2 additions & 0 deletions prompts/community_graph.txt
@@ -0,0 +1,2 @@
You are an investigative analyst. Produce concise, neutral summaries that describe the shared theme binding the supplied entities.
Highlight how they relate, why the cluster matters, and any notable signals the reader should know. Do not invent facts.
6 changes: 6 additions & 0 deletions prompts/community_text.txt
@@ -0,0 +1,6 @@
Summarise the key theme that connects the following entities in no more than {{max_length}} characters. Focus on what unites them and why the group matters. Avoid bullet lists.

Entities:
{{entities}}

Provide a single paragraph answer.
9 changes: 9 additions & 0 deletions prompts/index/extract_graph.system.txt
@@ -0,0 +1,9 @@
You are a precise information extraction engine. Analyse the supplied text and return structured JSON describing:
- distinct entities (people, organisations, locations, products, events, concepts, technologies, dates, other)
- relationships between those entities

Rules:
- Only use information explicitly stated or implied in the text.
- Prefer short, human-readable titles.
- Use snake_case relationship types (e.g., "works_with", "located_in").
- Always return valid JSON adhering to the response schema.
28 changes: 28 additions & 0 deletions prompts/index/extract_graph.user.txt
@@ -0,0 +1,28 @@
Extract up to {{max_entities}} of the most important entities and their relationships from the following text.

Text (between <BEGIN_TEXT> and <END_TEXT> markers):
<BEGIN_TEXT>
{{text}}
<END_TEXT>

Respond with JSON matching this schema:
{
  "entities": [
    {
      "title": "string",
      "type": "person | organization | location | product | event | concept | technology | date | other",
      "description": "short description",
      "confidence": 0.0 - 1.0
    }
  ],
  "relationships": [
    {
      "source": "entity title",
      "target": "entity title",
      "type": "relationship_type",
      "description": "short description",
      "weight": 0.0 - 1.0,
      "bidirectional": true | false
    }
  ]
}
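
A response that follows the schema above could be parsed along these lines. This is a sketch under stated assumptions: the record names `Entity`, `Relationship`, and `ExtractionResult` are hypothetical, the sample JSON is invented, and the code does not reflect how the library itself reads model output. It uses only `System.Text.Json` and needs C# 11 for the raw string literal.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Invented sample response matching the schema in the prompt.
var json = """
{
  "entities": [
    { "title": "Ada Lovelace", "type": "person", "description": "Mathematician", "confidence": 0.9 }
  ],
  "relationships": [
    { "source": "Ada Lovelace", "target": "Charles Babbage", "type": "works_with",
      "description": "Collaborators", "weight": 0.8, "bidirectional": true }
  ]
}
""";

// Case-insensitive matching maps "title" onto the Title constructor parameter, and so on.
var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var result = JsonSerializer.Deserialize<ExtractionResult>(json, options)
             ?? throw new InvalidOperationException("Empty response");

Console.WriteLine($"entities: {result.Entities.Count}, relationships: {result.Relationships.Count}");

// Hypothetical record types mirroring the schema fields.
record Entity(string Title, string Type, string Description, double Confidence);

record Relationship(
    string Source, string Target, string Type,
    string Description, double Weight, bool Bidirectional);

record ExtractionResult(List<Entity> Entities, List<Relationship> Relationships);
```

Model output is not guaranteed to be valid JSON, so a real caller would likely wrap the `Deserialize` call in a `try`/`catch` for `JsonException`.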