76 changes: 76 additions & 0 deletions api-reference/creating-apps.mdx
@@ -0,0 +1,76 @@
---
title: 'Creating Apps'
description: 'Provision isolated Morphik apps and generate connection URIs.'
---

Morphik apps are isolated data environments. Each app has its own documents, embeddings, and auth token, so data stays separated even when apps live on the same cluster. Think of an app as a separate Morphik instance with a shared control plane.

Common uses:
- Create one app per customer or tenant to keep data segregated.
- Split environments (prod, staging, sandbox) without running multiple clusters.
- Separate projects with different data retention or access policies.

## Create a new app (cloud)

**POST** `/cloud/generate_uri`

This endpoint creates an app and returns a Morphik URI that clients use to connect to it.

### Authentication

Provide a Bearer token in `Authorization: Bearer <JWT>`.
Use an existing Morphik API token to create apps and mint new URIs programmatically.

### Request Body

<Properties>
<Property name="app_id" type="string">
Optional client-generated app id (recommended: UUID). If omitted, the server generates one.
</Property>
<Property name="name" type="string" required={true}>
Human-friendly app name. Used in the Morphik URI.
</Property>
<Property name="expiry_days" type="integer">
Days until the token expires (default: 3650).
</Property>
</Properties>

### Example request

```bash
curl -X POST \
https://api.morphik.ai/cloud/generate_uri \
-H 'Authorization: Bearer YOUR_JWT_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
"name": "customer-acme"
}'
```

### Response

<Properties>
<Property name="uri" type="string">
Connection URI in the format `morphik://name:token@host`.
</Property>
<Property name="app_id" type="string">
The app id associated with the URI.
</Property>
</Properties>

**Example response:**

```json
{
"uri": "morphik://customer-acme:eyJhbGciOi...@api.morphik.ai",
"app_id": "f5c5e51a-7a1b-4c8d-8d7e-3c5ed3c6c7b2"
}
```
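
Clients connect to the new app with the returned URI. A minimal sketch using the Python SDK (the constructor-takes-a-URI pattern is shown elsewhere in these docs; the token below is the truncated example value from the response above):

```python
from morphik import Morphik

# Connect to the newly provisioned app using the URI from the response.
# Replace with the full URI returned by /cloud/generate_uri.
db = Morphik("morphik://customer-acme:eyJhbGciOi...@api.morphik.ai")
```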

### Notes

- The response always contains a newly minted token for the app.
- If `app_id` is omitted, the server generates one.
- `name` is required.
- App names must be unique per owner or org; duplicates return 409.
- If the account tier has reached its app limit, the API returns 403.
80 changes: 0 additions & 80 deletions api-reference/management-api.mdx

This file was deleted.

8 changes: 4 additions & 4 deletions concepts/colpali.mdx
@@ -5,7 +5,7 @@ description: 'Using Late-interaction and Contrastive learning to achieve state-o

## Introduction

Upto now, we've seen RAG techniques that **i)** parse a given document, **ii)** convert it to text, and **iii)** embed the text for retrieval. These techniques have been particualrly text-heavy. Embedding models expect text in, knowledge graphs expect text in, and prasers break down when provided with documents that aren't text-dominant. This motivates the question:
Up to now, we've seen RAG techniques that **i)** parse a given document, **ii)** convert it to text, and **iii)** embed the text for retrieval. These techniques have been particularly text-heavy. Embedding models expect text in, knowledge graphs expect text in, and parsers break down when provided with documents that aren't text-dominant. This motivates the question:

> When was the last time you looked at a document and only saw text?

@@ -17,7 +17,7 @@ In this guide, we'll explore a series of models, starting with *ColPali* that ar

## What is ColPali?

The core idea behind ColPali is simple: the core bottleneck in retrieval is not the performance of the embedding model, but **prior data ingestion pipeline**. As a result, this new techniques proposes doing away with any data preprocessing - embedding the entire document as a list of images instead.
The core idea behind ColPali is simple: the core bottleneck in retrieval is not the performance of the embedding model, but the **prior data ingestion pipeline**. As a result, this new technique proposes doing away with any data preprocessing - embedding the entire document as a list of images instead.

![ColPali Architecture](/assets/colpali.png)

@@ -26,7 +26,7 @@ The diagram above shows the ColPali pipeline when compared with traditional layo
## How does it work?

### Embedding Process
The embedding process for ColPali borrows heavily from models like CLIP. That is, the vision encoder part of the model (as seen in the diagram above) is trained via a technique called **Contrastive Learning**. As we've discussed in previous explainers, an encoder is a function (usually a neural network or a transformer) that maps a given input to a fixed-length vector. Contrastive learning is a technique that allows us to train two encoders of different input types (such as image and text) to produce vectors in the "same embedding space". That is, the embedding of the word "dog" would be very close the embedding of the image of a dog. The way we can achieve this is simple in theory:
The embedding process for ColPali borrows heavily from models like CLIP. That is, the vision encoder part of the model (as seen in the diagram above) is trained via a technique called **Contrastive Learning**. As we've discussed in previous explainers, an encoder is a function (usually a neural network or a transformer) that maps a given input to a fixed-length vector. Contrastive learning is a technique that allows us to train two encoders of different input types (such as image and text) to produce vectors in the "same embedding space". That is, the embedding of the word "dog" would be very close to the embedding of the image of a dog. The way we can achieve this is simple in theory:

1) Take a large dataset of image and text pairs.
2) Pass the image and text through the vision and text encoders respectively.
@@ -40,7 +40,7 @@ So, we have a system that, given an image, can provide a vector embedding that l

### Retrieval Process

The retrieval process for ColPali borrows from late-interaction based reranking techniques such as [ColBERT](https://arxiv.org/abs/2004.12832). The idea is that instead of directly embedding an image or an entire block of text, we can embed individual patches or tokens instead. Then, instead of using the regular dot product or the cosine similarity, we can employ a slightly different scoring function. This scoring funciton looks at the most similar patches and tokens, and then sums those similarities up to obtain a final score.
The retrieval process for ColPali borrows from late-interaction based reranking techniques such as [ColBERT](https://arxiv.org/abs/2004.12832). The idea is that instead of directly embedding an image or an entire block of text, we can embed individual patches or tokens instead. Then, instead of using the regular dot product or the cosine similarity, we can employ a slightly different scoring function. This scoring function looks at the most similar patches and tokens, and then sums those similarities up to obtain a final score.

![ColBERT Architecture](/assets/colbert.png)
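
To make the scoring concrete, here is a minimal NumPy sketch of the late-interaction (MaxSim) score described above. The shapes and normalization are illustrative assumptions, not ColPali's exact implementation:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) scoring in the style of ColBERT/ColPali.

    query_tokens: (num_query_tokens, dim), assumed L2-normalized
    doc_patches:  (num_doc_patches, dim), assumed L2-normalized
    """
    # Cosine similarity of every query token against every document patch.
    sim = query_tokens @ doc_patches.T
    # For each query token, keep only its best-matching patch,
    # then sum those maxima into the final relevance score.
    return float(sim.max(axis=1).sum())
```

Because each query token only needs to find its single best patch, fine-grained matches - a word lining up with the exact region of the page that mentions it - dominate the score.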

2 changes: 1 addition & 1 deletion concepts/metadata-filtering.mdx
@@ -196,4 +196,4 @@ response = scoped.list_documents(filters=filters, include_total_count=True)
- **“Metadata field … expects type …”** – The server couldn’t coerce the operand to the declared type. Ensure numbers/dates are valid JSON scalars or native Python types before serialization.
- **Range query returns nothing** – Confirm the target documents were ingested/updated with the corresponding `metadata_types`. Re-ingest or call `update_document_metadata` with the proper type hints if necessary (see the sketch below).
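
A hedged sketch of that repair step, assuming the Python client exposes `update_document_metadata` with a `metadata_types` hint mirroring the ingestion-side one (the exact signature may differ in your SDK version):

```python
from morphik import Morphik

db = Morphik("your-uri")

# Hypothetical repair: re-declare `published_at` as a date so that
# range filters coerce the operand correctly. `metadata_types` here
# is an assumed keyword mirroring the ingestion-side type hints.
db.update_document_metadata(
    "doc_abc123",
    metadata={"published_at": "2024-01-15"},
    metadata_types={"published_at": "date"},
)
```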

Still stuck? Share your filter payload and endpoint at `founders@morphik.ai` or on [Discord](https://discord.gg/H7RN3XdGu3).
Still stuck? Share your filter payload and endpoint at `founders@morphik.ai` or on [Discord](https://discord.com/invite/BwMtv3Zaju).
6 changes: 3 additions & 3 deletions concepts/naive-rag.mdx
@@ -15,12 +15,12 @@ it would be something like

> "seems like you're assembling chair CX-184. You may have skipped step 8 in the assembly process, since the rear leg is screwed backwards. Here is a step-by-step solution from the assembly guide: ...".

Note how both answers recognized the issue correctly, but since the LLM had additional context in the second answer, it was also able to provide a solution and more specific details. That's the jist of RAG - LLMs provide **higher-quality responses** when provided with **more context** surrounding a query.
Note how both answers recognized the issue correctly, but since the LLM had additional context in the second answer, it was also able to provide a solution and more specific details. That's the gist of RAG - LLMs provide **higher-quality responses** when provided with **more context** surrounding a query.

While the core concept itself is quite obvious, the complexity arises in _how_ we can effectively retrieve the correct information. In the following sections, we explain one way to effectively perform RAG based on the concept of vector embeddings and similarity search (we'll explain what these mean\!).

<Note>
In reality, Morphik uses a combination of different RAG techniques to achieve the best solution. We intend to talk about each of the techniques we implement in the [concepts](/concepts/) section of our documentation. If you're looking for a particular RAG technique, such as [ColPali](/concepts/colpali.mdx) or [Knowledge Graphs](/concepts/knowledge-graphs.mdx), you'll find it there. In this explainer, however, we'll restrict ourselves to talk about single vector-search based retrieval.
In reality, Morphik uses a combination of different RAG techniques to achieve the best solution. We intend to talk about each of the techniques we implement in the [concepts](/concepts/) section of our documentation. If you're looking for a particular RAG technique, such as [ColPali](/concepts/colpali) or [Knowledge Graphs](/concepts/knowledge-graphs), you'll find it there. In this explainer, however, we'll restrict ourselves to talking about single vector-search based retrieval.
</Note>

## How does RAG work?
@@ -33,7 +33,7 @@ In order to help add context to a prompt, we first need that context to exist. T

**Chunking** involves breaking down documents into smaller, manageable pieces. While LLMs have context windows that can handle thousands of tokens, we want to retrieve only the most relevant information for a given query. Chunking strategies vary based on the content type - code documentation might be chunked by function or class, while textbooks might be chunked by section or paragraph. The ideal chunk size balances granularity (smaller chunks for precise retrieval) with context preservation (larger chunks for maintaining semantic meaning).
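
As a toy illustration of the trade-off - fixed-size chunks with overlap, not Morphik's actual, more content-aware strategy:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap between neighboring chunks."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so context isn't cut mid-thought at boundaries.
        start = end - overlap
    return chunks
```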

**Embedding** transforms these text chunks into vector representations - essentially converting semantic meaning into mathematical space. This is done using embedding models that distill the essence of text into dense vectors. The [math and ML behind embeddings](https://www.3blue1brown.com/lessons/gpt#embedding) is really interesting. They have a [long history](https://en.wikipedia.org/wiki/Word_embedding) of development - with origins as old as 1957. Over time, models that produce word embeddings have gone through mulitple iterations - different domains, novel neural network architectures, as well as different training paradigms.
**Embedding** transforms these text chunks into vector representations - essentially converting semantic meaning into mathematical space. This is done using embedding models that distill the essence of text into dense vectors. The [math and ML behind embeddings](https://www.3blue1brown.com/lessons/gpt#embedding) is really interesting. They have a [long history](https://en.wikipedia.org/wiki/Word_embedding) of development - with origins as old as 1957. Over time, models that produce word embeddings have gone through multiple iterations - different domains, novel neural network architectures, as well as different training paradigms.
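
To see what "similarity in mathematical space" means, here is a toy example with hand-made 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = pointing the same way in embedding space; near 0 = unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy "embeddings" -- purely illustrative values.
dog   = np.array([0.9, 0.1, 0.0])
puppy = np.array([0.8, 0.2, 0.1])
car   = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(dog, puppy))  # high: semantically close
print(cosine_similarity(dog, car))    # low: semantically distant
```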

Here's a gif we made using [Manim](https://www.manim.community/) to explain word embeddings:

2 changes: 1 addition & 1 deletion configuration.mdx
@@ -82,5 +82,5 @@ When running Morphik in Docker:

## Need Help?

1. Join our [Discord community](https://discord.gg/BwMtv3Zaju)
1. Join our [Discord community](https://discord.com/invite/BwMtv3Zaju)
2. Check [GitHub](https://github.com/morphik-org/morphik-core) for issues
111 changes: 111 additions & 0 deletions core-functions/batch-get-chunks.mdx
@@ -0,0 +1,111 @@
---
title: "Batch Get Chunks"
description: "Retrieve specific chunks by document ID and chunk number"
---

Retrieve specific chunks by their document ID and chunk number in a single batch operation. Useful for fetching exact chunks after retrieval or for building custom pipelines.

<Tabs>
<Tab title="Python">
```python
from morphik import Morphik

db = Morphik("your-uri")

chunks = db.batch_get_chunks(
sources=[
{"document_id": "doc_abc123", "chunk_number": 0},
{"document_id": "doc_abc123", "chunk_number": 1},
{"document_id": "doc_xyz789", "chunk_number": 5}
],
folder_name="/reports",
use_colpali=True,
output_format="url"
)

for chunk in chunks:
print(f"Doc {chunk.document_id}, Chunk {chunk.chunk_number}")
print(f"Content: {chunk.content[:200]}...")
```
</Tab>
<Tab title="TypeScript">
```typescript
import Morphik from 'morphik';

// For Teams/Enterprise, use your dedicated host: https://companyname-api.morphik.ai
const client = new Morphik({
apiKey: process.env.MORPHIK_API_KEY,
baseURL: 'https://api.morphik.ai'
});

const chunks = await client.batch.retrieveChunks({
sources: [
{ document_id: 'doc_abc123', chunk_number: 0 },
{ document_id: 'doc_abc123', chunk_number: 1 },
{ document_id: 'doc_xyz789', chunk_number: 5 }
],
folder_name: '/reports',
use_colpali: true,
output_format: 'url'
});

chunks.forEach(chunk => {
console.log(`Doc ${chunk.document_id}, Chunk ${chunk.chunk_number}`);
console.log(`Content: ${chunk.content.slice(0, 200)}...`);
});
```
</Tab>
<Tab title="cURL">
```bash
curl -X POST "https://api.morphik.ai/batch/chunks" \
-H "Authorization: Bearer $MORPHIK_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"sources": [
{"document_id": "doc_abc123", "chunk_number": 0},
{"document_id": "doc_abc123", "chunk_number": 1},
{"document_id": "doc_xyz789", "chunk_number": 5}
],
"folder_name": "/reports",
"use_colpali": true,
"output_format": "url"
}'
```
</Tab>
</Tabs>

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sources` | array | required | List of `{document_id, chunk_number}` objects |
| `use_colpali` | boolean | `true` | Use Morphik multimodal embeddings when available |
| `output_format` | string | `"base64"` | Image format: `base64`, `url`, or `text` |
| `folder_name` | string | `null` | Optional folder scope |

## Response

```json
[
{
"document_id": "doc_abc123",
"chunk_number": 0,
"content": "Introduction to the quarterly report...",
"content_type": "text/plain",
"score": 1.0,
"metadata": { "department": "sales" }
},
{
"document_id": "doc_abc123",
"chunk_number": 1,
"content": "Revenue highlights for Q4...",
"content_type": "text/plain",
"score": 1.0,
"metadata": { "department": "sales" }
}
]
```

<Note>
This is useful when you already know which chunks you need (e.g., from a previous retrieval result) and want to fetch their full content efficiently.
</Note>
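
A common pattern is to pair this endpoint with a prior search. A hedged sketch, assuming a `retrieve_chunks` search method on the Python client whose results carry `document_id` and `chunk_number` (adjust to your SDK version):

```python
from morphik import Morphik

db = Morphik("your-uri")

# First pass: semantic search returns lightweight chunk references.
hits = db.retrieve_chunks(query="Q4 revenue highlights", k=5)

# Second pass: fetch the full content of exactly those chunks.
chunks = db.batch_get_chunks(
    sources=[
        {"document_id": h.document_id, "chunk_number": h.chunk_number}
        for h in hits
    ]
)
```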