Biotessera is an AI-powered agent system developed for the NASA Space Apps Challenge 2025. It transforms hundreds of NASA publications on space biology into a navigable and interactive knowledge mosaic, enabling scientists and mission planners to find precise, synthesized answers backed by source data.
NASA's decades of space biology research represent a vast and invaluable resource. However, this knowledge is spread across hundreds of documents, making it difficult to quickly find specific, cross-referenced information needed for planning future long-duration missions.
Biotessera acts as an intelligent research assistant. It uses a multi-agent architecture (Coordinator/Worker pattern) to address this challenge:
- Data Preparation: An offline pipeline processes 607 full-text publications, extracts metadata, and generates vector embeddings for every text fragment (a "tessera"). These are stored in a local vector database (`TesseraStore`).
- Retrieval: The main agent, `TesseraConductor`, receives a user's question and delegates tasks to specialized tools:
  - `TesseraMiner`: Searches the local `TesseraStore` for the most relevant and diverse text fragments from the 607 publications, using Maximal Marginal Relevance (MMR) to improve result quality.
  - `DataFinder`: Performs real-time searches on the NASA Open Science Data Repository (OSDR) to find related raw datasets.
- Synthesis: The `TesseraConductor` gathers all retrieved information and uses a Large Language Model (LLM) to generate a single, coherent, and sourced answer.
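
To make the "relevant and diverse" part of the retrieval step concrete, here is a minimal, dependency-free sketch of Maximal Marginal Relevance. It is an illustration only, not the project's code: the function names `cosine` and `mmr_select` and the toy vectors are assumptions, and in practice the selection is normally delegated to the vector store rather than written by hand. The `lambda_mult` parameter trades relevance to the query (1.0) against diversity among the picks (0.0).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k fragments chosen greedily by MMR."""
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to the closest fragment already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "embeddings": fragments 0 and 1 are near-duplicates, 2 is different.
query = [1.0, 0.3]
fragments = [[1.0, 0.0], [1.0, 0.05], [0.3, 1.0]]
balanced = mmr_select(query, fragments, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, fragments, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.5` the second pick skips the near-duplicate in favor of the different fragment; with `lambda_mult=1.0` the selection degenerates to plain similarity ranking.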
- Natural Language Q&A: Ask complex questions in plain English.
- Multi-Source Answers: The agent can combine information from its internal knowledge base (`TesseraStore`) and external NASA databases (OSDR).
- Source-Backed: Every statement in the generated answer is based on the retrieved data, which is intended to limit AI "hallucinations".
- Modular Architecture: Easily extendable with new tools to search other databases.
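
The modular, Coordinator/Worker idea can be sketched as a small tool registry. This is a hypothetical illustration, not the project's implementation: the `Conductor` class, its `register` and `gather` methods, and the two stub tools are invented names, and the real agent is built with LangChain.

```python
from typing import Callable, Dict, List

# A "tool" here is any callable that takes a question and returns text snippets.
Tool = Callable[[str], List[str]]


class Conductor:
    """Hypothetical coordinator that fans a question out to registered tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        """Add a new tool; names must be unique."""
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = tool

    def gather(self, question: str) -> Dict[str, List[str]]:
        """Call every tool and collect its results, keyed by tool name."""
        return {name: tool(question) for name, tool in self._tools.items()}


# Stub tools standing in for a local fragment search and a dataset search.
def stub_miner(question: str) -> List[str]:
    return [f"fragment about {question}"]


def stub_data_finder(question: str) -> List[str]:
    return [f"dataset matching {question}"]


conductor = Conductor()
conductor.register("miner", stub_miner)
conductor.register("data_finder", stub_data_finder)
context = conductor.gather("bone loss in microgravity")
```

Under this design, supporting another database means writing one more callable and registering it; the coordinator itself does not change.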
- Backend: Python
- AI/ML: LangChain, Google Gemini (LLM & Embeddings)
- Vector Database: ChromaDB
- Data Processing: Pandas, BeautifulSoup
- UI: Streamlit
The vector database for this project (tesserastore_db) is approximately 808 MB, which exceeds GitHub's file size limits. To ensure the live Streamlit application can be deployed, the database is compressed and hosted on Hugging Face Datasets.
- Dataset Link: vero-code/biotessera-database
The application automatically downloads and unpacks this database on its first run in a new environment.
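
A first-run check of this kind might look like the sketch below. It is written under stated assumptions: the archive is taken to be a zip file, the `ensure_database` helper is a hypothetical name, and the actual download is left to a caller-supplied `fetch_archive` function because the source does not say how the app fetches the file from Hugging Face.

```python
import zipfile
from pathlib import Path
from typing import Callable


def ensure_database(db_dir: Path, fetch_archive: Callable[[Path], None]) -> bool:
    """Unpack the database into db_dir if it is missing.

    fetch_archive(dest) must write the compressed database to dest.
    Returns True if a download and unpack happened, False if db_dir
    was already populated.
    """
    if db_dir.is_dir() and any(db_dir.iterdir()):
        return False  # already unpacked on a previous run

    db_dir.parent.mkdir(parents=True, exist_ok=True)
    archive = db_dir.parent / f"{db_dir.name}.zip"  # assumed archive format
    fetch_archive(archive)

    with zipfile.ZipFile(archive) as zf:
        zf.extractall(db_dir)
    archive.unlink()  # the archive is no longer needed once unpacked
    return True
```

Passing the fetch step in as a function keeps the unpack logic testable offline; the application would supply a function that downloads the archive from the dataset repository.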
- Clone the repository:

      git clone https://github.com/vero-code/biotessera.git
      cd biotessera

- Set up the environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: .\venv\Scripts\activate
      pip install -r requirements.txt

- Add your API Key:
  - Create a `.env` file in the root directory.
  - Add your Google AI API key to it:

        GOOGLE_API_KEY="YOUR_API_KEY_HERE"

- Run the application:

      streamlit run app.py
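
If the app fails at startup, a quick way to confirm that the `.env` step worked is to parse the file and check the key. The sketch below uses only the standard library; `load_env_file` and `require_api_key` are hypothetical helpers written for this example, and the project itself may load the file differently (for instance with a library such as python-dotenv).

```python
from pathlib import Path
from typing import Dict


def load_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env: Dict[str, str], name: str = "GOOGLE_API_KEY") -> str:
    """Return the key, or raise if it is missing or still the placeholder."""
    key = env.get(name, "")
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError(f"{name} is not set in the .env file")
    return key
```

For example, `require_api_key(load_env_file(Path(".env")))` returns the key or raises with a clear message before any request is made.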
This project utilizes the following official NASA resources:
- A list of 608 full-text open-access Space Biology publications: The primary knowledge base for the `Biotessera` agent.
- NASA Open Science Data Repository (OSDR): Used by the `DataFinder` tool to perform real-time searches for raw experimental datasets.
Special thanks to the organizers, mentors, and the entire community of the NASA International Space Apps Challenge 2025 for making this event possible.
This project is licensed under the MIT License.

