Merged
19 commits
29c7a3a
Initial plan
Copilot Dec 23, 2025
e742209
Add Figshare API token support and comprehensive documentation
Copilot Dec 23, 2025
ef9d296
Add implementation summary and next steps
Copilot Dec 23, 2025
d39681d
Refactor 403 error handling to reduce code duplication
Copilot Dec 23, 2025
b4bb7bc
Improve logging message to avoid potential security concerns
Copilot Dec 23, 2025
e310467
Add rate limiting and make force-refresh optional in workflow
Copilot Dec 23, 2025
a8178da
Refactor Figshare API workflow to use author cache and enhance error …
marc-hanheide Dec 23, 2025
44bd3bb
Update figshare processing workflow to use actions/cache@v5 and chang…
marc-hanheide Dec 23, 2025
69ccbd4
Refactor caching mechanism in FigShare class to use shelve for persis…
marc-hanheide Dec 23, 2025
2b6138f
Fix output path formatting in artifact upload step
marc-hanheide Dec 23, 2025
125a97e
Add max_retries parameter and implement retry logic for 403 errors in…
marc-hanheide Dec 23, 2025
87fe3ef
Update retrieve method to use cache in Author class and change defaul…
marc-hanheide Dec 23, 2025
afa7583
Update figshare processing workflow to correctly restore and save cac…
marc-hanheide Dec 23, 2025
2dc753e
Ensure cache is always saved from the output folder in figshare proce…
marc-hanheide Dec 23, 2025
5bdcba1
Create output directory if it doesn't exist and list contents
marc-hanheide Dec 23, 2025
d7f1167
Enhance caching logging in FigShare class and improve hash key genera…
marc-hanheide Dec 23, 2025
26aeac2
Update cache key generation to include run ID and add restore keys fo…
marc-hanheide Dec 23, 2025
73d134c
Change cache key logging from info to debug level in FigShare class
marc-hanheide Dec 23, 2025
2d9e4b3
Update cron schedule for figshare processing workflow to run every 4 …
marc-hanheide Dec 23, 2025
45 changes: 39 additions & 6 deletions .github/workflows/figshare-processing.yaml
@@ -2,8 +2,17 @@ name: figshare-cache

 on:
   workflow_dispatch:
+    inputs:
+      use_author_cache:
+        description: 'Use cached author data (instead of refreshing)'
+        required: false
+        default: 'false'
+        type: choice
+        options:
+          - 'true'
+          - 'false'
   schedule:
-    - cron: "30 2 * * 2"
+    - cron: "30 */4 * * *"
   push:
     branches:
       - main
@@ -20,13 +29,18 @@ jobs:
           fetch-depth: 1

       - name: Use Cache in folder ./output
-        uses: actions/cache@v3
+        id: cache-restore-output
+        uses: actions/cache/restore@v5
         with:
           path: ./output
-          key: cache-files
+          key: cache-files-${{ github.run_id }}
+          restore-keys: |
+            cache-files-

       - name: Create output directory if it doesn't exist
-        run: mkdir -p output
+        run: |
+          mkdir -p output
+          find ./output

       - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event."

@@ -46,11 +60,30 @@ jobs:
           pip install -r requirements-frozen.txt

       - name: Run figshare exporter
+        env:
+          FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }}
         run: |
           set -e
           cd ./output
-          python ../figshare.py --force-refresh
+          if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.use_author_cache }}" = "true" ]; then
+            echo "Running with --use-author-cache (manually triggered)"
+            python ../figshare.py --use-author-cache
+          else
+            echo "Running without cache (default behavior)"
+            python ../figshare.py
+          fi

+      - name: Save Cache from folder ./output
+        uses: actions/cache/save@v5
+        if: always()
+        with:
+          path: ./output
+          key: ${{ steps.cache-restore-output.outputs.cache-primary-key || 'cache-files' }}

+      - name: Generate publication statistics
+        run: |
+          cd ./output
+          python ../generate_stats.py --all-csv figshare_articles_all.csv --dedup-csv figshare_articles.csv >> $GITHUB_STEP_SUMMARY

       - name: Nexus Repo Publish bibtex
         if: ${{ github.event_name != 'pull_request' }}
@@ -97,6 +130,6 @@ jobs:
         with:
           name: outputs
           path: |
-            ./output/*.csv
+            ./output/*.csv
             ./output/*.bib
           retention-days: 30
187 changes: 187 additions & 0 deletions README.md
@@ -0,0 +1,187 @@
# LCAS EPrint Cache

This repository automatically exports and caches publication data from Figshare for LCAS (Lincoln Centre for Autonomous Systems) researchers.

## Overview

The system:
- Retrieves publication metadata from the Figshare repository
- Processes author information and generates BibTeX entries
- Exports data in CSV and BibTeX formats
- Publishes to the Nexus repository for public access

## Setup

### Prerequisites

- Python 3.10+
- Figshare API token (required)

### Configuration

#### Figshare API Token

This application requires a Figshare API token to function properly. To set up:

1. **Create a Figshare account**: Visit [https://figshare.com](https://figshare.com) and create an account
2. **Generate an API token**:
   - Log in to Figshare
   - Go to Account Settings → Applications
   - Create a new personal token
   - Copy the token and store it securely
3. **For local development**: Set the environment variable
   ```bash
   export FIGSHARE_TOKEN="your_token_here"
   ```
4. **For GitHub Actions**: Add the token as a repository secret named `FIGSHARE_TOKEN`
   - Go to repository Settings → Secrets and variables → Actions
   - Create a new secret named `FIGSHARE_TOKEN`
   - Paste your Figshare API token

**Note**: Without a valid API token, requests to the Figshare API will fail with 403 errors.
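
As a minimal sketch of how the token can be turned into request headers: the helper name `build_figshare_headers` is hypothetical, and how `figshare.py` actually reads the token is not shown in this README. The `Authorization: token <value>` scheme is the one Figshare documents for personal tokens.

```python
import os


def build_figshare_headers(environ=None):
    """Return request headers carrying the Figshare token (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    token = environ.get("FIGSHARE_TOKEN", "").strip()
    if not token:
        # Fail early with a clear message rather than waiting for a 403 response.
        raise RuntimeError("FIGSHARE_TOKEN is not set")
    # Figshare personal tokens are sent as "Authorization: token <value>".
    return {"Authorization": f"token {token}"}
```

Passing a dictionary instead of relying on `os.environ` makes the helper easy to exercise without touching the real environment.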

### Installation

```bash
# Install dependencies
pip install -r requirements-frozen.txt
```

## Usage

### Command Line

```bash
# Run with default authors list
python figshare.py

# Run with specific authors
python figshare.py --authors "Marc Hanheide" "Tom Duckett"

# Run with authors from file
python figshare.py --authors-file staff.json

# Force refresh (ignore cache)
python figshare.py --force-refresh

# Adjust rate limiting (default is 1 second delay between requests)
python figshare.py --rate-limit-delay 2.0

# Enable debug logging
python figshare.py --debug

# Custom output filenames
python figshare.py --output my_articles.csv --output-all my_articles_all.csv
```

### Arguments

- `-a, --authors`: List of author names to process
- `-f, --authors-file`: Path to file containing author names (one per line)
- `-s, --since`: Process only publications since this date (YYYY-MM-DD), default: 2021-01-01
- `-o, --output`: Output CSV filename for deduplicated publications, default: figshare_articles.csv
- `-O, --output-all`: Output CSV filename for all publications (with duplicates), default: figshare_articles_all.csv
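- `--force-refresh`: Force refresh data instead of loading from cache
- `--use-author-cache`: Use cached author data instead of refreshing it (the flag the GitHub Actions workflow passes when `use_author_cache` is `true`)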
- `--rate-limit-delay`: Delay in seconds between Figshare API requests, default: 1.0
- `--debug`: Enable debug logging
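
The arguments listed above could be declared with `argparse` roughly as follows. This is an illustrative sketch built only from the list in this section; the function name `build_parser` is hypothetical and the real parser in `figshare.py` may differ in details such as help texts.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Export Figshare publications")
    parser.add_argument("-a", "--authors", nargs="+", help="author names to process")
    parser.add_argument("-f", "--authors-file", help="file containing author names")
    parser.add_argument("-s", "--since", default="2021-01-01",
                        help="only publications since this date (YYYY-MM-DD)")
    parser.add_argument("-o", "--output", default="figshare_articles.csv")
    parser.add_argument("-O", "--output-all", default="figshare_articles_all.csv")
    parser.add_argument("--force-refresh", action="store_true")
    parser.add_argument("--rate-limit-delay", type=float, default=1.0)
    parser.add_argument("--debug", action="store_true")
    return parser


# Example: two authors and a slower request rate; everything else keeps its default.
args = build_parser().parse_args(
    ["--authors", "Marc Hanheide", "Tom Duckett", "--rate-limit-delay", "2.0"]
)
print(args.authors, args.since, args.rate_limit_delay)
```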

## Output Files

The script generates several output files:

- `lcas.bib`: Combined BibTeX file with all publications (deduplicated)
- `figshare_articles.csv`: CSV with deduplicated articles
- `figshare_articles_all.csv`: CSV with all articles (includes duplicates when multiple authors)
- `{author_name}.bib`: Individual BibTeX files per author
- `{author_name}.csv`: Individual CSV files per author
- `{author_name}.db`: Cached data per author (shelve database)
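
The difference between `figshare_articles_all.csv` and `figshare_articles.csv` can be illustrated with a small sketch. It assumes each article is a dictionary with an `id` field that is identical regardless of which author it was retrieved for; both the field name and the `deduplicate` helper are hypothetical, and the script's actual deduplication rule may differ.

```python
def deduplicate(rows, key="id"):
    """Keep the first occurrence of each article, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


# A co-authored article appears once per author in the "all" list.
all_rows = [
    {"id": 101, "author": "Marc Hanheide"},
    {"id": 101, "author": "Tom Duckett"},
    {"id": 202, "author": "Tom Duckett"},
]
dedup_rows = deduplicate(all_rows)
```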

## Cache Files

The application uses several cache files to minimize API calls:

- `figshare_cache.pkl`: Cached Figshare API responses
- `bibtext_cache`: Cached BibTeX entries from DOI lookups
- `shortdoi_cache`: Cached short DOI mappings
- `crossref_cache.db`: Cached Crossref API responses for DOI guessing
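
To show how a persistent response cache of this kind can work, here is a minimal sketch using the standard-library `shelve` module with a hashed request key. The function names `cache_key` and `cached_fetch`, the key layout, and the stand-in `fake_fetch` are assumptions for illustration, not the code in `figshare.py`.

```python
import hashlib
import json
import os
import shelve
import tempfile


def cache_key(url, params=None):
    """Derive a stable key from the request URL and its parameters."""
    payload = json.dumps({"url": url, "params": params or {}}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_fetch(db_path, url, params, fetch):
    """Return the cached response for (url, params); call fetch() only on a miss."""
    key = cache_key(url, params)
    with shelve.open(db_path) as db:
        if key in db:
            return db[key]
        result = fetch(url, params)
        db[key] = result
        return result


# Usage with a stand-in fetch function, so no network access is needed.
calls = []


def fake_fetch(url, params):
    calls.append(url)
    return {"items": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cache")
    first = cached_fetch(path, "https://api.figshare.com/v2/articles/search", {"page": 1}, fake_fetch)
    second = cached_fetch(path, "https://api.figshare.com/v2/articles/search", {"page": 1}, fake_fetch)
```

Sorting the keys before hashing means the same parameters always map to the same cache entry, whatever order they were supplied in.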

## GitHub Actions Workflow

The workflow runs automatically:
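- Every 4 hours, at minute 30 (cron `30 */4 * * *`)
- On push to main branch
- On pull requests
- Can be manually triggered via workflow_dispatch with an optional `use_author_cache` input

Scheduled, push, and pull-request runs call `figshare.py` without `--use-author-cache`, so author data is refreshed. The `./output` folder is still restored from the GitHub Actions cache before each run.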

### Manual Workflow Trigger

When manually triggering the workflow:
1. Go to Actions → figshare-cache workflow
2. Click "Run workflow"
3. Choose a value for `use_author_cache`:
   - **false** (default): Runs `figshare.py` without `--use-author-cache`, refreshing author data from the Figshare API
   - **true**: Runs `figshare.py --use-author-cache`, reusing cached author data instead of refreshing it

**Note**: A refresh makes many API requests and takes longer to complete. Choose **true** when the cached author data is recent enough for your purpose.

### Workflow Steps

1. Checkout repository
2. Restore cache
3. Install Python dependencies
4. Run Figshare exporter (with or without `--use-author-cache`, depending on the trigger)
5. Save cache
6. Generate publication statistics
7. Publish results to Nexus repository
8. Upload artifacts

### Rate Limiting

The script includes built-in rate limiting with a 1-second delay between API requests to avoid hitting Figshare API rate limits. This helps ensure reliable operation even with authenticated requests.
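
A fixed delay between requests can be enforced with a small helper like the one below. This is a sketch under stated assumptions: the `RateLimiter` class is hypothetical, and the injectable `clock` and `sleep` parameters exist only to make it testable without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each API request keeps requests at least `delay` seconds apart while adding no extra pause when the previous request was already long ago.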

## Troubleshooting

### 403 Forbidden Errors

If you encounter 403 errors when accessing the Figshare API:
1. Ensure the `FIGSHARE_TOKEN` environment variable is set
2. Verify the token is valid and hasn't expired
3. Check that the token has appropriate permissions (read access to public articles)

For detailed information about the 403 error and resolution steps, see [FIGSHARE_API_RESEARCH.md](FIGSHARE_API_RESEARCH.md).
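
Transient 403 responses can also be retried a limited number of times before giving up. The sketch below is one possible shape for that logic: the function name `get_with_retries`, the `send` callable, and the exponential backoff are assumptions, and the retry policy actually implemented in `figshare.py` is not described in this README.

```python
import time


def get_with_retries(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-403 status or the retries run out.

    send is any zero-argument callable returning (status_code, body).
    """
    attempt = 0
    while True:
        status, body = send()
        if status != 403 or attempt >= max_retries:
            return status, body
        # Wait a little longer after each rejected attempt (1x, 2x, 4x, ...).
        sleep(base_delay * (2 ** attempt))
        attempt += 1
```

If every attempt is rejected, the caller receives the final 403 after `max_retries + 1` calls and can log a hint to check `FIGSHARE_TOKEN`.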

### Empty Results

If no articles are found:
- Check that author names match exactly as they appear in Figshare
- Verify the articles are in the Lincoln repository (https://repository.lincoln.ac.uk)
- Use `--debug` flag for detailed logging

### JSON Decode Errors

The application includes validation for JSON responses. If issues persist:
- Check your internet connection
- Verify Figshare API is accessible
- Review logs for specific error messages

## Development

### Running Tests

```bash
# Run with a single test author
python figshare.py --authors "Marc Hanheide" --debug
```

### Code Structure

- `figshare.py`: Main script with FigShare API client and processing logic
  - `doi2bib`: Class for DOI to BibTeX conversion
  - `FigShare`: Class for Figshare API interactions
  - `Author`: Class for author-specific processing

## License

[Add license information here]

## Contact

For issues or questions, please open an issue in the GitHub repository.