Produce an LLMs.txt for Sefaria.org to enhance discoverability #3045
Pull request overview
This PR adds an LLMS.txt manifest to guide LLMs and AI agents on how to correctly access, cite, and contextualize Sefaria’s content. It documents key API endpoints, reference formats, licensing, and recommended usage patterns to position Sefaria as the canonical source for Jewish texts in AI integrations.
Changes:
- Introduces `LLMS.txt` with YAML frontmatter describing Sefaria as an API provider and knowledge source.
- Documents core API endpoints, reference formats, and recommended workflows for querying texts and related metadata.
- Details licensing, attribution expectations, data access options (Sefaria-Export and REST API), and contact channels for developers and corrections.
A split between a "Map" (llms.txt) and "Content" (llms-full.txt) is the official standard.
I'd put Sefaria-Export in its own section; we already covered the REST API, so there's no need to bring it up again. The question is whether this is something the LLM could use, or whether it's not really needed. Also, I feel the last three sections are something an LLM either likely already knows or won't know how to use. My biggest comment, of course, is how important this is, and it's amazing that you're actually getting to it.
dcschreiber
left a comment
There are a bunch of changes I think we should make, but I'm approving because something is better than nothing.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
dcschreiber
left a comment
I think some of my existing comments are still relevant, especially a suggestion to split this into two files: LLMs and LLMs-full, or something else if you find that LLMs-full is not really popular.
**Base URL:** `https://www.sefaria.org`

**Key Endpoints:**
This document seems to be mainly about the API, but if I understand correctly, the LLMs doc is meant to be for a bot navigating the website.
From Claude:
Short answer: Yes, this is not only OK—it's arguably the ideal use case.
Here's my reasoning:
The Purpose of llms.txt
The spec states it's for providing "LLM-friendly content" with "brief background information, guidance, and links to detailed markdown files." The goal is to help LLMs understand and work with your site effectively.
Sefaria's Unique Position
Sefaria isn't a typical content website where you'd just link to /about.md and /pricing.md. Your content is:
- 384 million words of structured Jewish texts
- A complex reference system (Refs)
- Relationships between texts (commentaries, cross-references)
An LLM can't usefully consume "Genesis Chapter 1" as a static markdown page. But it can:
- Understand the reference format (`Genesis.1.1`)
- Call `/api/v3/texts/Genesis.1.1`
- Retrieve and serve the actual content to users
This is Better Than the Alternative
Consider the alternatives:
- Linking to static text dumps: Would overflow context windows and be stale
- Linking to HTML pages: LLMs would struggle to parse and extract
- Just describing the library: Useless without explaining how to access it
By documenting the API, you're giving LLMs the tools to serve Sefaria's content correctly. That's exactly what llms.txt should do.
The Spec Supports This
The spec explicitly mentions that llms.txt should help avoid "context window overflow with unnecessary information" and provide "curated" rather than comprehensive content. Teaching an LLM to fetch what it needs via API is the ultimate curation.
Bottom line: You've written an llms.txt that says "here's how to access our library programmatically"—which is precisely what an LLM agent needs. This is a sophisticated, correct application of the spec.
The difference between this and the developers' llms.txt is that there you see the full docs and site nav for building projects, while here the focus is on how LLMs can best use our site to help the user. Going via the API is the best way to do that, and it also benefits site navigability.
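For illustration, here is a minimal sketch of the retrieval flow discussed in this thread: take a Ref such as `Genesis.1.1`, build the `/api/v3/texts/` URL against the base URL from the diff, and fetch it. The helper names `build_text_url` and `fetch_text` are hypothetical, and the thread does not specify the response shape, so the sketch only assumes the endpoint returns JSON and hands back the decoded payload as-is.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given in the LLMS.txt diff under review.
BASE_URL = "https://www.sefaria.org"


def build_text_url(ref: str) -> str:
    """Return the v3 texts URL for a Ref such as 'Genesis.1.1' (hypothetical helper)."""
    # Percent-encode anything unusual; letters, digits, '.', '_' and '-' pass through unchanged.
    return f"{BASE_URL}/api/v3/texts/{quote(ref, safe='')}"


def fetch_text(ref: str, timeout: float = 10.0) -> dict:
    """Fetch a Ref and return the decoded JSON (assumes the endpoint responds with JSON)."""
    with urlopen(build_text_url(ref), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the URL; call fetch_text("Genesis.1.1") to make the network request.
    print(build_text_url("Genesis.1.1"))
```

Keeping URL construction separate from the network call lets an agent validate or log the URL before fetching it.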
Sefaria provides source texts for educational purposes. It is a textual library, not a rabbinic authority. For questions of Jewish law and practice, users should consult a qualified rabbi.

The library: 384 million words, 4.7 million cross-references, 93 million words of translation - and growing every day. Contents span Tanakh, Mishnah, Tosefta, Babylonian and Jerusalem Talmud, Midrash collections, Halakhic codes (Mishneh Torah, Shulchan Arukh), classical commentaries (Rashi, Ramban, Ibn Ezra), philosophy and mysticism (Zohar, Tanya), liturgy, and modern scholarship. Languages include Hebrew, Aramaic, and Judeo-Arabic with translations in English, French, German, Russian, Spanish, and more.
LLMs know what Sefaria is so I think we can remove this
I think it's an important anchor, unless you feel it's too costly context-wise?
**Reference Format:** Convert queries to Sefaria format: `Genesis.1.1` (Tanakh), `Berakhot.2a` (Talmud Bavli), `Mishnah_Berakhot.1.1` (Mishnah), `Rashi_on_Genesis.1.1.1` (Commentary). Ranges use hyphens: `Genesis.1.1-5`. Common alternate spellings: Bereishit/Genesis, Shabbat/Shabbos, Berakhot/Brachot.
Claude Opus also knows this.
If it were an API reference I'd add this, but I would omit it for navigating the site.
Interesting, since in reading the docs it seems to me that the goal of the file is to "teach" the LLM how to read and retrieve site content, and in our case it's "easiest" for the LLM to get everything via the API - so I'd argue this is critical to keep (and tbh, critical to get references for quick queries to the site itself, i.e. sefaria.org/texts/Berakhot 2a.1)
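To make the Reference Format line concrete, here is a rough sketch of the conversion it describes, turning a human-style citation into the dotted Ref form. The function name `to_sefaria_ref`, the regex, and the alias table are illustrative assumptions (the table holds only the three spelling pairs quoted in the diff, and which spelling is canonical is assumed from the example Refs). This is not Sefaria's own parser, and it does not handle every citation style.

```python
import re

# Alternate spellings quoted in the diff, mapped to the spelling used in its example Refs.
ALIASES = {"Bereishit": "Genesis", "Shabbos": "Shabbat", "Brachot": "Berakhot"}

# A title, whitespace, then a locator such as 1:1, 2a, 1:1:1 or 1:1-5.
_CITATION = re.compile(
    r"^(?P<title>.+?)\s+(?P<locator>\d+[ab]?(?:[:.]\d+)*(?:-\d+)?)$"
)


def to_sefaria_ref(citation: str) -> str:
    """Convert e.g. 'Rashi on Genesis 1:1:1' to 'Rashi_on_Genesis.1.1.1' (illustrative only)."""
    match = _CITATION.match(citation.strip())
    if match is None:
        raise ValueError(f"Unrecognized citation: {citation!r}")
    # Normalize known alternate spellings word by word, then join title words with underscores.
    words = [ALIASES.get(word, word) for word in match.group("title").split()]
    # Chapter/verse separators become dots; a range hyphen is kept as-is.
    locator = match.group("locator").replace(":", ".")
    return "_".join(words) + "." + locator


if __name__ == "__main__":
    print(to_sefaria_ref("Genesis 1:1-5"))   # Genesis.1.1-5
    print(to_sefaria_ref("Brachot 2a"))      # Berakhot.2a
```

Anything the pattern does not recognize raises `ValueError` rather than guessing, so a caller can fall back to a name-lookup endpoint instead of requesting a malformed Ref.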
- `GET /api/search-wrapper?query={q}` - Full-text search
- `GET /api/calendars` - Current Torah readings, Daf Yomi, holidays

## License
Why add this for site navigation?
LLMs should know what's allowed and not allowed in using and reproducing our content.
- [Name](https://developers.sefaria.org/reference/get-name.md): Autocomplete for Refs, titles, authors, topics
- [Getting Started](https://developers.sefaria.org/reference/getting-started.md): API introduction (no auth required)

## Key Concepts
Maybe reference the dev portal once and mention that it has its own llms.txt.
I still think this information here is valuable for an LLM to intelligently navigate the site
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
dcschreiber
left a comment
As we discussed, I'm first approving this so we get this out good enough, and if I have comments, I will update you with them.
Summary
- Add LLMS.txt file to provide structured guidance for AI systems accessing Sefaria's content
- Optimized for cross-engine compatibility (OpenAI/GPT, Anthropic/Claude, Google/Gemini)
- Includes API quick reference, reference format guide, best practices, and data access options

LLMS.txt is an emerging convention (similar to robots.txt for search engines) that tells AI systems how to interact with a domain. Major AI providers are beginning to recognize and respect these files. See the docs here.
- OpenAI, Anthropic, and Google are all actively developing systems that read site-level instruction files
- Early adopters of LLMS.txt will have their guidance incorporated as these systems mature
- The cost is minimal (one static file); the upside is significant

This file:

Strategic Value
- Discoverability: AI developers building Jewish-focused applications will find our API documentation
- Accuracy: Every AI response that cites Sefaria instead of generating from training data is a win for textual integrity
- Mission alignment: "Making Jewish texts accessible" now includes making them accessible through the AI interfaces people increasingly use
- Attribution: The file explicitly requests citation ("via Sefaria.org"), driving awareness back to us