🧩 Biotessera: A Space Biology Knowledge Engine

Biotessera is an AI-powered agent system developed for the NASA Space Apps Challenge 2025. It transforms hundreds of NASA publications on space biology into a navigable and interactive knowledge mosaic, enabling scientists and mission planners to find precise, synthesized answers backed by source data.

🎯 The Problem

NASA's decades of space biology research represent a vast and invaluable resource. However, this knowledge is spread across hundreds of documents, making it difficult to quickly find specific, cross-referenced information needed for planning future long-duration missions.

💡 The Solution

Biotessera acts as an intelligent research assistant. It uses a multi-agent architecture (Coordinator/Worker pattern) to address this challenge:

Data Preparation: An offline pipeline processes 607 full-text publications, extracts metadata, and generates vector embeddings for every text fragment (a "tessera"). These are stored in a local vector database (TesseraStore).
Retrieval: The main agent, TesseraConductor, receives a user's question and delegates tasks to specialized tools:
- TesseraMiner: Searches the local TesseraStore for the most relevant and diverse text fragments from the 607 publications, using Maximal Marginal Relevance (MMR) to balance relevance to the question against redundancy among the returned fragments.
- DataFinder: Performs real-time searches on the NASA Open Science Data Repository (OSDR) to find related raw datasets.
Synthesis: The TesseraConductor gathers all retrieved information and uses a Large Language Model (LLM) to generate a single, coherent, and sourced answer.
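To make the MMR step used by TesseraMiner more concrete, here is a minimal, dependency-free sketch. It is not the project's code: the function names `cosine` and `mmr_select` are hypothetical, and the toy two-dimensional vectors stand in for real embeddings. The parameter name `lambda_mult` is borrowed from LangChain's MMR search, where 1.0 means pure relevance and lower values favor diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by Maximal Marginal Relevance.

    Each round picks the remaining candidate that maximizes
        lambda_mult * sim(query, doc)
        - (1 - lambda_mult) * max(sim(doc, s) for s already selected)
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the closest match among fragments picked so far.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy data: fragments 0 and 1 are near-duplicates, fragment 2 covers a different angle.
query = [1.0, 0.0]
fragments = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.8]]
diverse = mmr_select(query, fragments, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, fragments, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.5` the second pick skips the near-duplicate in favor of the more distinct fragment; with `lambda_mult=1.0` the selection falls back to plain similarity ranking.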

✨ Features

Natural Language Q&A: Ask complex questions in plain English.
Multi-Source Answers: The agent can combine information from its internal knowledge base (TesseraStore) and external NASA databases (OSDR).
Source-Backed: Answers are generated from the retrieved fragments and datasets rather than from the model's memory alone, which reduces the risk of AI "hallucinations".
Modular Architecture: Easily extendable with new tools to search other databases.
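As an illustration of what "extendable with new tools" could look like, the sketch below shows a toy coordinator that holds named tools and fans a question out to each of them. It is an assumption-laden simplification: the `Conductor` class, `register`, `gather`, and the two stand-in tools are hypothetical names, and how the real TesseraConductor decides which tool to call is not described in this README.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], List[str]]


class Conductor:
    """Toy coordinator: keeps a registry of tools and queries all of them."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        """Add a new tool; adding a data source means adding one callable."""
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = tool

    def gather(self, question: str) -> Dict[str, List[str]]:
        """Ask every registered tool and collect the results per tool name."""
        return {name: tool(question) for name, tool in self._tools.items()}


# Stand-ins for the real tools (hypothetical behavior, no network or database access).
def fake_miner(question: str) -> List[str]:
    return [f"fragment about {question}"]


def fake_finder(question: str) -> List[str]:
    return [f"dataset matching {question}"]


conductor = Conductor()
conductor.register("TesseraMiner", fake_miner)
conductor.register("DataFinder", fake_finder)
evidence = conductor.gather("bone loss in microgravity")
```

The collected `evidence` dictionary is the kind of structure a synthesis step could pass to an LLM together with the original question.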

🛠️ Tech Stack

Backend: Python
AI/ML: LangChain, Google Gemini (LLM & Embeddings)
Vector Database: ChromaDB
Data Processing: Pandas, BeautifulSoup
UI: Streamlit

☁️ Data Hosting

The vector database for this project (tesserastore_db) is approximately 808 MB, which exceeds GitHub's file size limits. To ensure the live Streamlit application can be deployed, the database is compressed and hosted on Hugging Face Datasets.

Dataset Link: vero-code/biotessera-database

The application automatically downloads and unpacks this database on its first run in a new environment.
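The first-run behavior described above can be sketched as a "download only if missing" check. This is a minimal sketch, not the application's actual code: `ensure_database` and `fetch_archive` are hypothetical names, and the archive format is assumed to be one that `shutil.unpack_archive` recognizes by extension (for example `.zip`). In a real deployment, `fetch_archive` might wrap `huggingface_hub.hf_hub_download(..., repo_type="dataset")`; the archive's file name is not given in this README, so it is left out here.

```python
import shutil
from pathlib import Path
from typing import Callable


def ensure_database(db_dir: Path, fetch_archive: Callable[[], Path]) -> bool:
    """Make sure db_dir is populated, downloading and unpacking only when needed.

    Returns True if the archive was fetched and unpacked,
    False if an existing non-empty directory was reused.
    """
    if db_dir.is_dir() and any(db_dir.iterdir()):
        return False  # Already unpacked in this environment.

    archive = fetch_archive()  # Hypothetical hook: returns a local path to the archive.
    db_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(db_dir))
    return True
```

Passing the downloader in as a callable keeps the check itself free of network code, so it can be exercised locally with a small test archive.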

🚀 How to Run

Clone the repository:

```
git clone https://github.com/vero-code/biotessera.git
cd biotessera
```

Set up the environment:

```
python -m venv venv
source venv/bin/activate
# On Windows: .\venv\Scripts\activate
pip install -r requirements.txt
```

Add your API Key:
- Create a .env file in the root directory.
- Add your Google AI API key to it: GOOGLE_API_KEY="YOUR_API_KEY_HERE"

Run the application:
```
streamlit run app.py
```
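If the app starts but requests to the model fail, a quick check that the `.env` file actually contains a real key can help. The snippet below is a small standalone sketch under stated assumptions: `read_env_value` and `has_real_key` are hypothetical helpers, the parser only handles simple `KEY=VALUE` lines, and it says nothing about how the application itself loads the file.

```python
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value for key from a simple KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def has_real_key(env_path, key="GOOGLE_API_KEY"):
    """True if the key is present, non-empty, and not the README placeholder."""
    value = read_env_value(env_path, key)
    return bool(value) and value != "YOUR_API_KEY_HERE"
```

For example, `has_real_key(".env")` run from the repository root returns `False` while the placeholder from the step above is still in place.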

🛰️ NASA Data & Resources Used

This project utilizes the following official NASA resources:

A list of 608 full-text open-access Space Biology publications: The primary knowledge base for the Biotessera agent.
NASA Open Science Data Repository (OSDR): Used by the DataFinder tool to perform real-time searches for raw experimental datasets.

🙏 Acknowledgments

Special thanks to the organizers, mentors, and the entire community of the NASA International Space Apps Challenge 2025 for making this event possible.

📜 License

This project is licensed under the MIT License.
