# Scout: Intelligent Preprocessing of Reports and Conversion to ATT&CK Format
Scout is a tool designed to streamline cybersecurity analysts' workflows by automating the collection, summarization, correlation, and reporting of cybersecurity events. Using AI/ML and standardized formats such as MITRE ATT&CK and STIX, Scout processes open-source threat reports, can store data in DeepLynx as JSON, and generates human-readable reports.
- Overview
- Minimum Viable Product (MVP)
- Prerequisites
- Configuration
- Docker Setup
- Development Setup
- Using Scout
- RSS Feeds and Topic Modeling
- Troubleshooting
## Overview

Scout accelerates cybersecurity reporting by:
- Scraping and processing threat reports using Natural Language Processing (NLP).
- Identifying key entities like MITRE ATT&CK techniques and CyOTE observables.
- Outputting structured data in JSON/STIX formats and summarized reports via a Large Language Model (LLM).
## Minimum Viable Product (MVP)

The MVP defines Scout's core workflow:
- Analysts gather open-source threat reports.
- Scout scrapes and extracts relevant data from reports.
- NLP identifies key entities (e.g., MITRE ATT&CK techniques, CyOTE observables).
- Data is output as JSON/STIX and summarized into a human-readable report.
## Prerequisites

- Docker: required for containerized services (see the official installation guide).
- git-lfs: needed for downloading ML models (see the official installation guide).
- Node.js: for frontend and backend development (download from nodejs.org).
- Python 3+: for the microservices (download from python.org).
- pip: Python package manager for the microservices.
## Configuration

Scout requires proper environment configuration to connect to databases, AI services, and external LLM providers. Configuration differs between local development and production deployment.
Scout uses environment files (`.env` and `.env.production`) located in the `backend/` directory. Copy the example files and configure them for your deployment scenario:

```sh
cp backend/.env.example backend/.env
cp backend/.env.production.example backend/.env.production
```

Core backend variables:

| Variable | Required | Description |
|---|---|---|
| `DB_URI` | Yes | MongoDB connection string |
| `DB_NAME` | Yes | MongoDB database name |
| `PORT` | Yes | Backend server port (default: 3001) |
| `SESSION_TOKEN_SECRET` | Yes | JWT session token secret (generate a random value) |
| `REFRESH_TOKEN_SECRET` | Yes | JWT refresh token secret (generate a random value) |
AI service variables:

| Variable | Required | Description |
|---|---|---|
| `USE_REMOTE_NER_SERVICE` | Yes | `true` for remote NER, `false` for local |
| `USE_REMOTE_LLM_SERVICE` | Yes | `true` for remote LLM, `false` for local |
| `REMOTE_NER_URL` | If remote | NER endpoint |
| `REMOTE_LLM_URL` | If remote | LLM endpoint |
| `REMOTE_SERVER_API_KEY` | If remote | Authentication key |
BERTopic variables:

| Variable | Required | Description |
|---|---|---|
| `REMOTE_SERVER_BASE_URL` | If remote | LLM base URL for topic labeling |
| `REMOTE_SERVER_MODEL_ID` | If remote | Model identifier (e.g., `Mistral-Nemo-Instruct-2407`) |
| `BERTOPIC_OPENAI_API_KEY` | No | OpenAI API key (alternative to a custom/local LLM) |
| `BERTOPIC_OPENAI_MODEL` | No | OpenAI model (e.g., `gpt-3.5-turbo`) |
| `BERTOPIC_SKIP_LLM_LABELING` | No | Set `true` to skip LLM-based topic labeling |
### Local Development

For local development using containerized AI services, configure `backend/.env`:

```sh
# Database
DB_URI="mongodb://scout:admin@localhost:27017"
DB_NAME="scout"
PORT="3001"

# Use local containerized AI services
USE_REMOTE_NER_SERVICE=false
USE_REMOTE_LLM_SERVICE=false

# Security tokens (generate new ones)
SESSION_TOKEN_SECRET="your-random-session-secret-here"
REFRESH_TOKEN_SECRET="your-random-refresh-secret-here"

# Remote-server configuration (unused in local mode)
REMOTE_SERVER_API_KEY=""
REMOTE_SERVER_BASE_URL="my-model-service.com/api"
REMOTE_SERVER_MODEL_ID="Mistral-Nemo-Instruct-2407"

# BERTopic configuration
BERTOPIC_SKIP_LLM_LABELING=true
```

Docker service URLs (internal):

- STIX Service: `http://stix-microservice:8000`
- NER Service: `http://scyner:8001`
- LLM Service: `http://local-llm:8002`
- BERTopic Service: `http://bertopic:8003`
### Production

For production deployment using Remote-server's remote AI services, configure `backend/.env.production`:

```sh
# Database
DB_URI="mongodb://scout:admin@db:27017"
DB_NAME="scout"
PORT="3001"

# Use remote Remote-server AI services
USE_REMOTE_NER_SERVICE=true
USE_REMOTE_LLM_SERVICE=true

# Remote-server service endpoints
REMOTE_NER_URL="my-model-service.com/api/ner"
REMOTE_LLM_URL="my-model-service.com/api/chat/completions"
REMOTE_SERVER_API_KEY="api-key-here"
REMOTE_SERVER_BASE_URL="my-model-service.com"
REMOTE_SERVER_MODEL_ID="Mistral-Nemo-Instruct-2407"

# Security tokens (generate new ones for production)
SESSION_TOKEN_SECRET="your-production-session-secret"
REFRESH_TOKEN_SECRET="your-production-refresh-secret"

# BERTopic will use the Remote-server LLM for topic labeling
BERTOPIC_SKIP_LLM_LABELING=false
```

### Custom LLM Configuration

To use your own LLM service (e.g., local Ollama or OpenAI):
```sh
# Option 1: Use an OpenAI-compatible API
USE_REMOTE_LLM_SERVICE=true
REMOTE_LLM_URL="https://localhost:9443/api/llm/v1/chat/completions"
REMOTE_SERVER_API_KEY="xxxxx-xxxxx-xxxxx-xxxxx-xxxxx"

# Option 2: Use OpenAI directly for BERTopic
BERTOPIC_OPENAI_API_KEY="your-openai-api-key"
BERTOPIC_OPENAI_MODEL="gpt-3.5-turbo"
BERTOPIC_OPENAI_URL="https://api.openai.com/v1"
```
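For Option 2, the BERTopic library's own OpenAI representation model is the natural consumer of these variables. The sketch below illustrates how they could be wired up in Python; it is not Scout's exact internal code:

```python
# Illustrative only: maps the BERTOPIC_OPENAI_* variables onto BERTopic's
# OpenAI representation model for topic labeling. Scout's wiring may differ.
import os

import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI as OpenAIRepresentation

client = openai.OpenAI(
    api_key=os.environ["BERTOPIC_OPENAI_API_KEY"],
    base_url=os.environ.get("BERTOPIC_OPENAI_URL", "https://api.openai.com/v1"),
)

topic_model = BERTopic(
    representation_model=OpenAIRepresentation(
        client,
        model=os.environ.get("BERTOPIC_OPENAI_MODEL", "gpt-3.5-turbo"),
        chat=True,  # label topics via the chat-completions endpoint
    )
)
```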
Security best practices:

1. Generate secure secrets:

   ```sh
   node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
   ```

2. Never commit API keys to version control.
3. Use different secrets for development and production.
4. Regularly rotate API keys and secrets.
## Docker Setup

1. Clone and configure the repository:

   ```sh
   git clone <repository-url>
   cd Scout/
   ```

2. Set up environment configuration:

   ```sh
   # Copy example environment files
   cp backend/.env.example backend/.env
   cp backend/.env.production.example backend/.env.production

   # Edit configuration for your deployment / LLM setup
   nano backend/.env.production
   ```

3. Start all services with Docker Compose:

   ```sh
   docker compose up
   ```

   View logs in the terminal for debugging. For background execution, use `docker compose up --detach` and access logs via Docker Desktop.

4. Verify services are running (the health-check sketch after this list can confirm them in one pass):

   - Frontend: `http://localhost:5173`
   - Backend API: `http://localhost:3001`
   - MongoDB GUI: `http://localhost:8081` (scout:mongo)
   - STIX Service: `http://localhost:8000/docs`
   - NER Service: `http://localhost:8001/docs`
   - LLM Service: `http://localhost:8002/docs`
   - BERTopic Service: `http://localhost:8003/docs`
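To avoid clicking through each URL, a small Python script (a sketch assuming the default ports above) can check the HTTP services:

```python
# check_services.py — pings each Scout service and reports its status.
# Uses only the standard library; adjust the URLs if your ports differ.
import urllib.request

SERVICES = {
    "Frontend": "http://localhost:5173",
    "Backend API": "http://localhost:3001",
    "STIX": "http://localhost:8000/docs",
    "NER": "http://localhost:8001/docs",
    "LLM": "http://localhost:8002/docs",
    "BERTopic": "http://localhost:8003/docs",
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name:12} OK   ({resp.status})")
    except Exception as exc:  # connection refused, timeout, HTTP error, ...
        print(f"{name:12} DOWN ({exc})")
```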
5. To rebuild after code changes:

   ```sh
   docker compose up --build
   ```

   For specific services (e.g., frontend):

   ```sh
   docker compose up --build --detach frontend
   ```

6. Environment-specific deployment:

   - Local development: uses containerized AI services (no API keys required).
   - Production with Remote-server: requires `REMOTE_SERVER_API_KEY` in `.env.production`.
   - Custom LLM: configure `REMOTE_LLM_URL` and an API key.
Note: If developing specific services, pause their Docker containers and follow the Development Setup instructions.
## Development Setup

For local development, you need a MongoDB instance (Dockerized by default) and service-specific setups. Always start MongoDB via Docker unless you manage your own database.

### MongoDB

- Run Docker Compose to start MongoDB and its web GUI:

  ```sh
  docker compose up
  ```

- Access the MongoDB GUI at `http://localhost:8081` for database management.
### Backend

- Ensure MongoDB is running (see above).
- Navigate to the backend directory and install dependencies:

  ```sh
  cd Scout/backend
  npm install
  ```

- Start the Express server:

  ```sh
  npm run dev
  ```

- The server waits for the MongoDB connection and listens on the default port (check the terminal output).
### Frontend

- Navigate to the root `Scout/` directory and install dependencies:

  ```sh
  cd Scout/
  npm install
  ```

- Start the Vite frontend with hot-reloading:

  ```sh
  npm run dev
  ```

- Access the frontend at `http://localhost:<port>` (port shown in the terminal).
### Microservices

The microservices (STIX, NER, Summarization) require Python 3+ and pip.

- Navigate to the desired service directory (e.g., `Scout/backend/src/services/stix`).
- Install dependencies:

  ```sh
  pip install -r requirements.txt
  ```

  If there is no `requirements.txt`, install manually:

  ```sh
  pip install stix2 fastapi pydantic uvicorn gunicorn
  ```

- Start the FastAPI service:

  ```sh
  uvicorn main:app --reload
  ```

- Access the service at `http://localhost:8000` and its API docs at `http://localhost:8000/docs`.
- Repeat for the other services (`scyner` for NER, `localLLM` for summarization) on free ports. The sketch below shows the general shape these FastAPI services take.
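For orientation, here is a minimal FastAPI skeleton in the style of these services. The `/process` endpoint and payload model are illustrative assumptions; consult each service's `/docs` page for its real schema.

```python
# main.py — illustrative skeleton only; the /process route and its payload
# are assumptions for demonstration, not the services' actual API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Example Scout microservice")

class ReportIn(BaseModel):
    report: str

@app.post("/process")
def process(payload: ReportIn) -> dict:
    # A real Scout service would run NER, STIX conversion, or summarization here.
    return {"received_characters": len(payload.report)}
```

Run it with `uvicorn main:app --reload`, as above.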
## Using Scout

Scout provides a web-based interface and APIs to process cybersecurity threat reports, extract entities (e.g., MITRE ATT&CK techniques, CyOTE observables), and generate structured (JSON/STIX) or human-readable reports. Below are instructions for using Scout after setup.
### Accessing the Web Interface

- Ensure the frontend is running (see Frontend setup).
- Open a browser and navigate to `http://localhost:<port>` (replace `<port>` with the port shown after running `npm run dev`, typically `5173` for Vite).
- Log in or register:
  - Register: create an account with First Name, Last Name, Email, and a Password (8+ characters, with at least one uppercase letter, one lowercase letter, one number, and one special character).
  - Log In: use your credentials to access the Main Dashboard.
### Main Dashboard

The Main Dashboard is the central hub for managing reports, accessible after login. Key features include:
- Create New Report: Start a new cybersecurity report.
- My Reports: View and edit your created reports.
- Import Report: Upload existing reports for further processing.
- Configure: Manage AI models and data sources (e.g., RSS feeds).
### Creating a New Report

- From the Main Dashboard, click Create New Report or the plus icon (+) under My Reports.
- In the pop-up window, enter:
- Report Name: A descriptive name for the report (e.g., "Q3 Threat Analysis").
- Target: The audience or system (e.g., "Security Team", "SIEM").
- Requested By: The person or entity requesting the report.
- Due Date: The report deadline.
- Click Save to create the report. It will appear under My Reports.
### Managing a Report

- Navigate to My Reports and click a report to open it.
- The report interface includes tabs for managing the workflow:
  - Summary: view/edit report details (Name, Target, Due Date, Requested By, Created By, Created On).
    - Edit fields as needed (except Created By/On, which are fixed).
    - Export: download the report and assets as a ZIP file.
    - Delete: permanently remove the report (irreversible).
  - Direction: add an Assignment Synopsis (brief overview) and Requirements for the report.
  - Collection/Processing: add and process threat report sources.
  - Analysis: review AI-generated outputs (e.g., STIX, comments).
  - Dissemination: generate and distribute the final report.
### Adding Sources (Collection/Processing)

- In the Collection/Processing tab, click the plus icon (+) to add a source.
- Add a source via:
  - Drag and Drop: upload `.txt` or `.pdf` files.
  - Enter URL: input a URL to scrape online content.
  - Paste Text: paste report text directly.
  - Browse Files: select local `.txt` or `.pdf` files.
- Assign a Source Title for identification and add Bibliography details (Author, Title, Publication, Publisher, Year).
- Click Save to add the source to the Source List.
- In the Source List, view the Source Number, Name, Progress (percentage of observables reviewed), Date, and Actions (Edit, Refresh, Delete, Include).
- Use the Source Explorer to review AI-extracted entities:
  - NLP Results: view the source text with highlighted observables (e.g., MITRE ATT&CK techniques). Click highlighted text to see the Entity Number, Label, Confidence, and Status (Accept/Reject).
  - TRAM View: review a list of observables with Number, Observable, Source Text, and Status (Accept/Reject).
- Save or load analysis versions as needed.
### Reviewing Analysis

- In the Analysis tab, view AI-generated outputs:
  - Generated STIX: structured threat intelligence in JSON/STIX format.
    - Toggle between Full STIX and Trimmed STIX.
    - Switch views: Compare (STIX vs. text), Code (raw STIX), or Text (report text).
  - Analyst Comments: add observations, insights, or recommendations. Save or load versions.
- Review and edit outputs to refine the analysis.
### Dissemination

- In the Dissemination tab, click Call LLM to generate a human-readable report (requires an API key for the LLM).
- Review the generated report, which includes a Title, Introduction, Body, Conclusion, References, and an optional Disclaimer.
- Use options to Regenerate or Save the report.
- Click Disseminate to distribute the report to the target audience (e.g., via email or download).
- Export the final report in JSON/STIX or as a human-readable document.
### Using the APIs

For programmatic access:

- Ensure the microservices are running (see Microservices).
- Access API documentation at:
  - STIX: `http://localhost:8000/docs`
  - NER: `http://localhost:<ner-port>/docs` (e.g., `8001`)
  - Summarization: `http://localhost:<llm-port>/docs` (e.g., `8002`)
- Example API call to process a report:

  ```sh
  curl -X POST http://localhost:8000/process \
    -H "Content-Type: application/json" \
    -d '{"report": "Sample threat report text"}'
  ```
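The same call from Python, using the `requests` package. The endpoint and payload mirror the curl example; the fields in the response depend on the service's actual schema:

```python
# Python equivalent of the curl example above.
import requests

resp = requests.post(
    "http://localhost:8000/process",
    json={"report": "Sample threat report text"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # structured JSON/STIX output from the service
```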
### Inspecting the Database

- Access the MongoDB GUI at `http://localhost:8081` (see MongoDB Setup).
- Inspect stored data (e.g., JSON outputs, extracted entities) in the relevant database/collection.
- Use this for debugging or manual analysis.
## Troubleshooting

- No output: ensure all services (frontend, backend, microservices) are running and that ports are not conflicting.
- API errors: check the API documentation and verify the correct endpoint/port.
- Report issues: confirm sources are included and observables are accepted in the Source Explorer.
- View logs via `docker compose up` or Docker Desktop for debugging.
## RSS Feeds and Topic Modeling

Scout includes RSS feed management and BERTopic-based topic modeling for automated threat intelligence collection and analysis. This section covers how to configure RSS feeds, download articles, generate topic models, and integrate findings into your reports.
### Configuring RSS Feeds

RSS feeds provide automated collection of cybersecurity threat intelligence from various sources. Scout supports multiple RSS feed configurations for comprehensive threat monitoring.

- Access Configuration: navigate to the Configure section from the main dashboard.
- RSS Feed Management: click RSS Feeds to open the feed configuration interface.
- Add New Feed: click the Add Feed button to configure a new RSS source:
  - Feed URL: the RSS/Atom feed URL (e.g., `https://feeds.feedburner.com/eset/blog`)
  - Feed Name: a descriptive name for the feed (e.g., "ESET Threat Blog")
  - Category: a category for organization (e.g., "Vendor Blogs", "Government Alerts")
  - Active Status: enable/disable the feed for automatic collection
  - Update Frequency: how often to check for new articles (hourly, daily, weekly)
Popular Cybersecurity RSS Feeds:
- Krebs on Security: https://krebsonsecurity.com/feed/
- SANS Internet Storm Center: https://isc.sans.edu/rssfeed.xml
- US-CERT Alerts: https://www.cisa.gov/cybersecurity-advisories/all.xml
- Threatpost: https://threatpost.com/feed/
- Bleeping Computer: https://www.bleepingcomputer.com/feed/
- Save Configuration: Click Save to add the feed to your monitoring list.
- Test Feed: Use the Test Feed button to verify the RSS URL is accessible and returning articles.
Managing existing feeds:

- Edit: modify feed URLs, names, categories, or update frequencies
- Enable/Disable: Toggle feeds on/off without deleting configuration
- Delete: Remove feeds permanently from the system
- View Stats: Check article collection statistics and last update times
### Downloading Articles

Scout automatically downloads and processes RSS feed articles on your configured schedule. You can also trigger manual downloads for immediate collection.

- Scheduled Collection: RSS feeds are checked automatically based on their configured update frequency.
- Article Processing: new articles are downloaded, parsed for content, and stored in the database.
- Deduplication: Scout filters out duplicate articles based on URL and content similarity (see the sketch after this list).
- Content Extraction: full article text is extracted from RSS summaries and linked pages when possible.
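The deduplication idea can be sketched in a few lines of Python. This illustrates the URL-plus-similarity check under an assumed article shape (`url` and `text` fields); it is not the actual Node implementation in `rssreader.js`:

```python
# Illustrative duplicate check: reject an article whose URL is already stored
# or whose text is too similar to a stored article. The article dicts and the
# use of difflib here are assumptions for demonstration.
import os
from difflib import SequenceMatcher

THRESHOLD = float(os.environ.get("RSS_DUPLICATE_THRESHOLD", "0.9"))

def is_duplicate(article: dict, stored: list[dict]) -> bool:
    for existing in stored:
        if article["url"] == existing["url"]:
            return True
        if SequenceMatcher(None, article["text"], existing["text"]).ratio() >= THRESHOLD:
            return True
    return False
```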
Manual downloads:

- Access RSS Management: go to Configure → RSS Feeds.
- Trigger Download: click Download All Feeds to immediately collect new articles from all active feeds.
- Individual Feed Download: use the Download button next to a specific feed for targeted collection.
- Monitor Progress: view real-time download progress and statistics in the interface.
Command-line RSS processing:

```sh
# Navigate to the RSS reader service
cd backend/src/services/parsers

# Run manual RSS collection
node rssreader.js

# With specific environment configuration
DB_URI="mongodb://scout:admin@localhost:27017" node rssreader.js
```

Environment variables for RSS:

- `RSS_DOWNLOAD_LIMIT`: maximum articles to download per feed (default: 50)
- `RSS_CONTENT_EXTRACTION`: enable full content extraction from linked articles
- `RSS_DUPLICATE_THRESHOLD`: similarity threshold for duplicate detection (0.0–1.0)
### Generating Topic Models

BERTopic integration allows Scout to automatically discover emerging themes and trends in collected RSS articles and threat reports. Topic modeling helps identify patterns across large document collections.

- Access BERTopic Interface: navigate to Configure → Topic Modeling, or use the BERTopic tab in reports.
- Select Data Source: Choose from available document collections:
- RSS Articles: Use collected RSS feed articles
- Report Sources: Include documents from existing reports
- Combined Dataset: Merge RSS articles with report documents
- Configure Model Parameters:
- Model Type: Choose between Simple (faster) or Complex (more accurate) BERTopic models
- Number of Topics: Set target number of topics (auto-detect or specify range)
- Language: Select language for text processing (English, multi-language)
- Date Range: Filter documents by publication date
- Minimum Document Length: Exclude very short articles
- Advanced Options (see the sketch after this list):
  - Representation Model: KeyBERT for keyword extraction
  - Embedding Model: choice of text embedding approach
  - Clustering Algorithm: HDBSCAN parameters for topic clustering
  - Dimensionality Reduction: UMAP settings for visualization
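To make those knobs concrete, here is a minimal sketch of how the named components fit together in the BERTopic library. The parameter values are illustrative defaults rather than Scout's exact configuration; the embedding model name matches the metadata example later in this section.

```python
# Illustrative BERTopic assembly; values are examples, not Scout's settings.
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Replace with a real corpus; meaningful topics need hundreds of documents.
docs = ["article text one", "article text two"]

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),           # embeddings
    umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),  # dim. reduction
    hdbscan_model=HDBSCAN(min_cluster_size=10, prediction_data=True),  # clustering
    representation_model=KeyBERTInspired(),                            # KeyBERT keywords
    calculate_probabilities=True,
)
topics, probabilities = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```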
- Start Training: Click Train Model to begin the BERTopic training process
- Monitor Progress: View real-time training status including:
- Documents loaded and processed
- Current training step (embedding, clustering, topic generation)
- Elapsed time and estimated completion
- Memory usage and system resources
- Training Completion: Receive notification when model training finishes
- Automatic Saving: Trained models are automatically saved with timestamps for future use
The training interface provides detailed progress information:

```text
Training Status: In Progress
Start Time: 2024-01-15 14:30:22
Documents: 1,247 articles processed
Current Step: Generating topics
Topics Generated: 23 topics identified
Elapsed Time: 8m 34s
```
### Managing Topic Models

Scout maintains a library of trained topic models, allowing you to switch between models and compare results across datasets and time periods.

- Access Model Library: go to Configure → Topic Modeling → Model Management.
- Available Models: View list of trained models with metadata:
- Model Name: Descriptive name and creation date
- Document Count: Number of documents used for training
- Topic Count: Number of topics discovered
- Training Date: When the model was created
- Model Type: Simple or Complex BERTopic configuration
- Data Sources: RSS feeds, reports, or combined datasets used
- Select Active Model: Choose which model to use for analysis and visualization
- Model Comparison: Compare topic distributions across different models
- Set as Active: Make a model the default for topic analysis
- Rename Model: Update model names for better organization
- Export Model: Download model files for external analysis
- Delete Model: Remove unused models to free storage space
- Model Details: View comprehensive training parameters and statistics
Each model stores detailed metadata:

```json
{
  "model_id": "rss_model_2024_01_15",
  "creation_date": "2024-01-15T14:30:22Z",
  "document_count": 1247,
  "topic_count": 23,
  "model_type": "complex",
  "data_sources": ["rss_feeds", "manual_uploads"],
  "training_parameters": {
    "embedding_model": "all-MiniLM-L6-v2",
    "clustering_algorithm": "hdbscan",
    "representation_model": "keybert"
  }
}
```

### Visualizing and Analyzing Topics

Scout provides visualization and analysis tools for exploring topic model results, including topic distributions, document assignments, and trend analysis.
- Access Topic Analysis: navigate to Analysis → Topic Modeling, or use the BERTopic tab in reports.
- Topic Summary: View high-level topic statistics:
- Total number of topics discovered
- Document distribution across topics
- Top keywords and representative terms
- Topic coherence scores and quality metrics
Topic Distribution Plot:
- 2D Visualization: Interactive scatter plot showing topic relationships
- Topic Clusters: Visual grouping of related topics
- Hover Details: View topic keywords and document counts
- Zoom and Pan: Explore topic space in detail
Topic Hierarchical View:
- Topic Tree: Hierarchical clustering of topics
- Merge History: See how topics were combined during training
- Sub-topic Analysis: Drill down into topic components
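Both views correspond to interactive Plotly figures that the BERTopic library itself can produce, so they can be reproduced outside Scout's UI. Loading an exported model (see Model Management above) is one way to obtain a fitted model; the model path below is illustrative.

```python
# Reproduce the topic visualizations outside Scout's UI (illustrative path).
from bertopic import BERTopic

topic_model = BERTopic.load("rss_model_2024_01_15")

fig_map = topic_model.visualize_topics()      # 2D inter-topic distance map
fig_tree = topic_model.visualize_hierarchy()  # hierarchical topic clustering
fig_map.write_html("topic_map.html")          # open in a browser to hover/zoom/pan
fig_tree.write_html("topic_tree.html")
```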
- Browse by Topic: Select topics to view assigned documents
- Document List: Sortable table showing:
- Document Title: Article headline or document name
- Source: RSS feed or report origin
- Publication Date: When the document was published
- Topic Probability: Confidence score for topic assignment
- Preview: First few sentences of document content
- Document Details: Click documents to view:
- Full content with topic-relevant highlights
- Entity extraction results (if processed)
- Source metadata and bibliography information
- Topic assignment probabilities across all topics
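Per-document topic probabilities like these can also be computed with the library's distribution approximation. A sketch, continuing from the earlier examples (`topic_model` and `docs` as before):

```python
# Approximate how strongly each document belongs to each topic.
# Rows of `topic_distr` are documents, columns are topics.
topic_distr, _ = topic_model.approximate_distribution(docs)

for doc, distr in zip(docs, topic_distr):
    best = int(distr.argmax())
    print(f"{doc[:40]!r} -> topic {best} (p={distr[best]:.2f})")
```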
Keyword Analysis:
- Representative Words: Most important terms defining each topic
- KeyBERT Extraction: Automatically extracted key phrases
- TF-IDF Scores: Statistical importance of terms
- Custom Labels: User-defined topic names and descriptions
Temporal Analysis:
- Topic Trends: Track topic popularity over time
- Emerging Topics: Identify rapidly growing themes
- Seasonal Patterns: Discover cyclical topic patterns
- Event Correlation: Connect topics to security events
Document Quality Metrics:
- Topic Coherence: How well documents fit their assigned topics
- Outlier Detection: Identify documents that don't fit well
- Duplicate Analysis: Find similar documents across topics
### Integrating Topic Results into Reports

Scout enables direct integration of topic modeling results into threat intelligence reports, allowing analysts to incorporate relevant documents discovered through topic analysis.

- Access Topic Interface: from within a report, navigate to the Analysis → Topic Modeling tab.
- Browse Topics: Explore available topics and their document collections
- Filter Documents: Use filters to narrow document selection:
- Date Range: Select documents from specific time periods
- Source Type: Filter by RSS feeds, manual uploads, or existing reports
- Topic Probability: Include only high-confidence topic assignments
- Content Length: Filter by document size or complexity
- Entity Presence: Select documents containing specific entities or IOCs
Bulk Document Addition:
- Select Topic: Choose a relevant topic from the topic visualization
- Review Documents: Browse the document list for the selected topic
- Multi-Select: Use checkboxes to select multiple relevant documents
- Add to Report: Click Add Selected to Report to include documents as sources
- Automatic Processing: Selected documents are automatically processed for entity extraction
Individual Document Addition:
- Document Review: Click on individual documents to read full content
- Relevance Assessment: Evaluate document relevance to current report scope
- Add Single Document: Use Add to Report button for individual documents
- Custom Metadata: Add custom tags or notes when adding documents
Automatic Processing Pipeline:
- Source Creation: Selected documents become new sources in the report
- Entity Extraction: CyNER automatically processes documents for cybersecurity entities
- STIX Generation: Extracted entities are converted to STIX format
- Review Queue: Documents enter the analyst review workflow
- Report Integration: Approved entities are included in final report generation
Metadata Preservation:
- Original Source: RSS feed URL and publication information
- Topic Assignment: Topic labels and probability scores
- Discovery Method: Topic modeling as discovery mechanism
- Processing History: Timeline of analysis and extraction steps
Trend Analysis Integration:
- Topic Trends: Include topic popularity charts in reports
- Emerging Threats: Highlight rapidly growing security topics
- Cross-Reference Analysis: Connect topics to known threat campaigns
- Temporal Correlation: Link topic emergence to security events
Automated Report Sections:
- Topic Summary: Auto-generated sections describing relevant topics
- Document Statistics: Quantitative analysis of document sources
- Keyword Extraction: Key terms and phrases from topic analysis
- Related Documents: Suggestions for additional relevant sources
Quality Assurance:
- Relevance Scoring: Automatic assessment of document relevance to report scope
- Duplicate Detection: Prevent inclusion of duplicate or similar content
- Entity Validation: Cross-reference extracted entities with existing sources
- Analyst Review: Human validation of AI-selected documents
Best practices:

- Topic Selection: choose topics that align with report objectives and the threat landscape.
- Document Diversity: Include documents from multiple sources and time periods
- Quality Over Quantity: Prioritize high-quality, relevant documents over volume
- Human Validation: Always review AI-selected documents for relevance and accuracy
- Metadata Documentation: Maintain clear records of how documents were discovered and selected
- Regular Updates: Refresh topic models periodically to capture evolving threat landscape
## Contributing

This is a private, archived project; contributions are not currently accepted. Please download the project code if you wish to build upon it.

## License

MIT License. Copyright 2025, Battelle Energy Alliance, LLC. All Rights Reserved.
