From 29c7a3afb6848fe342afb64965aa437141d2071a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 23 Dec 2025 13:55:33 +0000 Subject: [PATCH 01/19] Initial plan From e742209c9eeed1b1ad53890f522508e7378631c7 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 23 Dec 2025 14:02:42 +0000 Subject: [PATCH 02/19] Add Figshare API token support and comprehensive documentation Co-authored-by: marc-hanheide <1153084+marc-hanheide@users.noreply.github.com> --- .github/workflows/figshare-processing.yaml | 2 + FIGSHARE_API_RESEARCH.md | 261 +++++++++++++++++++++ README.md | 168 +++++++++++++ figshare.py | 30 +++ 4 files changed, 461 insertions(+) create mode 100644 FIGSHARE_API_RESEARCH.md create mode 100644 README.md diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index 53fad7e..e20d508 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -46,6 +46,8 @@ jobs: pip install -r requirements-frozen.txt - name: Run figshare exporter + env: + FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} run: | set -e cd ./output diff --git a/FIGSHARE_API_RESEARCH.md b/FIGSHARE_API_RESEARCH.md new file mode 100644 index 0000000..bfe845e --- /dev/null +++ b/FIGSHARE_API_RESEARCH.md @@ -0,0 +1,261 @@ +# Figshare API 403 Error Research + +## Issue Description +The workflow is experiencing 403 (Forbidden) errors when calling the Figshare API `/articles/search` endpoint. + +## API Endpoint Information + +### Endpoint: POST /v2/articles/search +- **Base URL**: https://api.figshare.com/v2 +- **Method**: POST +- **Purpose**: Search for articles in Figshare repository + +## Common Causes of 403 Errors in REST APIs + +### 1. 
Authentication Required +Many public APIs require authentication even for read operations to: +- Prevent abuse and enforce rate limits +- Track usage +- Control access to certain features + +### 2. Rate Limiting +APIs may return 403 when: +- Too many requests come from the same IP +- The allowed request rate is exceeded +- No authentication token is provided (anonymous users get lower rate limits) + +### 3. Geographic Restrictions +Some APIs block requests from certain regions or IP ranges + +### 4. User-Agent Blocking +APIs may block requests that don't include proper User-Agent headers + +## Figshare API Authentication + +### Public vs Private Endpoints +Figshare API has two types of endpoints: +- **Public endpoints**: Generally don't require authentication (GET requests for public data) +- **Private endpoints**: Require authentication + +### Authentication Methods +Figshare API supports OAuth2 authentication: +- Uses personal access tokens +- Token should be included in the Authorization header: `Authorization: token YOUR_TOKEN` + +### POST /articles/search Endpoint +This endpoint performs a search operation using POST method (to allow complex search queries in the body). + +**Key Issue**: While some Figshare search operations may work without authentication, anonymous POST requests to `/articles/search` may be rejected or subject to stricter rate limits than authenticated requests. + +## Current Implementation Analysis + +Looking at `figshare.py` lines 125-176: + +```python +def __init__(self, page_size=100): + self.token = os.getenv('FIGSHARE_TOKEN') + # ... 
token is optional + +def __post(self, url, params=None, use_cache=True): + headers = { "Authorization": "token " + self.token } if self.token else {} + response = post(self.base_url + url, headers=headers, json=params) +``` + +**Current behavior**: +- Token is optional (read from environment variable) +- If no token is provided, requests are made anonymously +- This may work sometimes but fail with 403 when: + - Rate limits are hit + - API policy changes + - IP-based restrictions apply + +## Recommendations + +### 1. Obtain a Figshare API Token + +**How to get a token**: +1. Create a Figshare account at https://figshare.com +2. Go to Account Settings +3. Navigate to "Applications" or "API" section +4. Create a new application/token +5. Generate a personal access token +6. Copy and store the token securely + +**Token Permissions**: +- For read-only operations (searching, retrieving articles), read permissions are sufficient +- No write permissions needed for this use case + +### 2. Add Token to GitHub Secrets + +**Steps**: +1. Go to repository Settings +2. Navigate to Secrets and variables → Actions +3. Create a new repository secret named `FIGSHARE_TOKEN` +4. Paste the Figshare API token +5. The workflow already references this secret in the environment (if added) + +**Note**: Check if workflow file needs to be updated to pass the secret as an environment variable. + +### 3. Update Workflow (if needed) + +If not already present, add to `.github/workflows/figshare-processing.yaml`: + +```yaml +env: + FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} +``` + +Or in the specific job/step that runs the Python script. + +## Alternative Solutions + +### 1. Add Retry Logic with Exponential Backoff +If 403 is intermittent, add retry logic to handle temporary rate limit issues. + +### 2. Add User-Agent Header +Some APIs require a proper User-Agent header. 
Update the request headers to include: +```python +headers = { + "User-Agent": "LCAS-eprint-cache/1.0", + **({"Authorization": f"token {self.token}"} if self.token else {}) +} +``` + +### 3. Implement Caching More Aggressively +The code already has caching, but ensure it's used effectively to minimize API calls. + +### 4. Use GET endpoint if available +Check if there's a GET version of the articles/search endpoint that might have different authentication requirements. + +## Workflow Configuration Issue + +**Current Status**: The workflow file does NOT pass the `FIGSHARE_TOKEN` environment variable to the Python script. + +Looking at `.github/workflows/figshare-processing.yaml`: +- Lines 48-52: The "Run figshare exporter" step does not include any environment variables +- The Python script expects `FIGSHARE_TOKEN` via `os.getenv('FIGSHARE_TOKEN')` (figshare.py line 125) +- Without the token, all requests are anonymous and more likely to hit rate limits or be rejected + +## Conclusion + +**Root Cause**: The 403 error is most likely caused by missing authentication when calling the Figshare API `/articles/search` endpoint. + +**Evidence**: +1. The Python code supports token authentication (lines 125, 158, 175) +2. The workflow file does not pass the `FIGSHARE_TOKEN` environment variable +3. Anonymous requests to POST endpoints are more restricted and likely to fail with 403 + +**Recommended Solution**: + +### Step 1: Obtain a Figshare API Token +1. Create a Figshare account at https://figshare.com +2. Log in to your account +3. Go to Account Settings (click your profile icon → Settings) +4. Navigate to "Applications" section +5. Click "Create Personal Token" or "Create New Application" +6. Give it a descriptive name (e.g., "LCAS eprint cache GitHub Actions") +7. Select appropriate permissions (read access to public articles is sufficient) +8. Generate the token and copy it securely + +### Step 2: Add Token to GitHub Repository Secrets +1. 
Go to the GitHub repository: https://github.com/LCAS/eprint_cache +2. Navigate to Settings → Secrets and variables → Actions +3. Click "New repository secret" +4. Name: `FIGSHARE_TOKEN` +5. Value: Paste the Figshare API token +6. Click "Add secret" + +### Step 3: Update Workflow to Pass Token +Add the environment variable to the "Run figshare exporter" step in `.github/workflows/figshare-processing.yaml`: + +```yaml +- name: Run figshare exporter + env: + FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} + run: | + set -e + cd ./output + python ../figshare.py --force-refresh +``` + +### Step 4: Test the Changes +1. Create a pull request with the workflow change +2. The workflow should run automatically +3. Verify that the 403 error no longer occurs +4. Check that articles are successfully retrieved + +## Additional Recommendations + +### 1. Add Better Error Handling +Update the `__post` method to provide more informative error messages: + +```python +def __post(self, url, params=None, use_cache=True): + hash_key = f"POST{url}?{params}" + if hash_key in self.__cache and use_cache: + return self.__cache[hash_key] + else: + headers = { "Authorization": "token " + self.token } if self.token else {} + response = post(self.base_url + url, headers=headers, json=params) + + if response.status_code == 403: + self.logger.error(f"403 Forbidden: Authentication may be required. " + f"Ensure FIGSHARE_TOKEN environment variable is set. " + f"Response: {response.text}") + return [] + + if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): + result = response.json() + self.__cache[hash_key] = result + self.save_cache() + return result + else: + self.logger.warning(f"Received empty or invalid JSON response for POST {self.base_url + url} (status: {response.status_code})") + return [] +``` + +### 2. 
Add Retry Logic +Consider adding retry logic with exponential backoff for transient errors: + +```python +import requests +from requests.adapters import HTTPAdapter +from urllib3.util.retry import Retry + +def __init__(self, page_size=100): + self.logger = getLogger("FigShare") + self.token = os.getenv('FIGSHARE_TOKEN') + self.page_size = page_size + self.base_url = "https://api.figshare.com/v2" + + # Configure retry strategy + retry_strategy = Retry( + total=3, + backoff_factor=1, + status_forcelist=[429, 500, 502, 503, 504], + allowed_methods=["GET", "POST"] + ) + adapter = HTTPAdapter(max_retries=retry_strategy) + self.session = requests.Session() + self.session.mount("https://", adapter) +``` + +### 3. Log Token Status +Add logging to indicate whether token authentication is being used: + +```python +def __init__(self, page_size=100): + self.logger = getLogger("FigShare") + self.token = os.getenv('FIGSHARE_TOKEN') + if self.token: + self.logger.info("Using authenticated requests with FIGSHARE_TOKEN") + else: + self.logger.warning("No FIGSHARE_TOKEN found - using anonymous requests (may hit rate limits)") + # ... rest of init +``` + +## References +- Figshare API Documentation: https://docs.figshare.com/ +- Figshare API Reference: https://docs.figshare.com/#figshare-documentation-api-description +- Figshare API Authentication: https://docs.figshare.com/#authentication +- GitHub Actions Secrets: https://docs.github.com/en/actions/security-guides/encrypted-secrets diff --git a/README.md b/README.md new file mode 100644 index 0000000..420f486 --- /dev/null +++ b/README.md @@ -0,0 +1,168 @@ +# LCAS EPrint Cache + +This repository automatically exports and caches publication data from Figshare for LCAS (Lincoln Centre for Autonomous Systems) researchers. 
+ +## Overview + +The system: +- Retrieves publication metadata from Figshare repository +- Processes author information and generates BibTeX entries +- Exports data in CSV and BibTeX formats +- Publishes to Nexus repository for public access + +## Setup + +### Prerequisites + +- Python 3.10+ +- Figshare API token (required) + +### Configuration + +#### Figshare API Token + +This application requires a Figshare API token to function properly. To set up: + +1. **Create a Figshare account**: Visit [https://figshare.com](https://figshare.com) and create an account +2. **Generate an API token**: + - Log in to Figshare + - Go to Account Settings → Applications + - Create a new personal token + - Copy the token securely +3. **For local development**: Set the environment variable + ```bash + export FIGSHARE_TOKEN="your_token_here" + ``` +4. **For GitHub Actions**: Add the token as a repository secret named `FIGSHARE_TOKEN` + - Go to repository Settings → Secrets and variables → Actions + - Create a new secret named `FIGSHARE_TOKEN` + - Paste your Figshare API token + +**Note**: Without a valid API token, requests to the Figshare API will fail with 403 errors. 
+ +### Installation + +```bash +# Install dependencies +pip install -r requirements-frozen.txt +``` + +## Usage + +### Command Line + +```bash +# Run with default authors list +python figshare.py + +# Run with specific authors +python figshare.py --authors "Marc Hanheide" "Tom Duckett" + +# Run with authors from file +python figshare.py --authors-file staff.json + +# Force refresh (ignore cache) +python figshare.py --force-refresh + +# Enable debug logging +python figshare.py --debug + +# Custom output filenames +python figshare.py --output my_articles.csv --output-all my_articles_all.csv +``` + +### Arguments + +- `-a, --authors`: List of author names to process +- `-f, --authors-file`: Path to file containing author names (one per line) +- `-s, --since`: Process only publications since this date (YYYY-MM-DD), default: 2021-01-01 +- `-o, --output`: Output CSV filename for deduplicated publications, default: figshare_articles.csv +- `-O, --output-all`: Output CSV filename for all publications (with duplicates), default: figshare_articles_all.csv +- `--force-refresh`: Force refresh data instead of loading from cache +- `--debug`: Enable debug logging + +## Output Files + +The script generates several output files: + +- `lcas.bib`: Combined BibTeX file with all publications (deduplicated) +- `figshare_articles.csv`: CSV with deduplicated articles +- `figshare_articles_all.csv`: CSV with all articles (includes duplicates when multiple authors) +- `{author_name}.bib`: Individual BibTeX files per author +- `{author_name}.csv`: Individual CSV files per author +- `{author_name}.db`: Cached data per author (shelve database) + +## Cache Files + +The application uses several cache files to minimize API calls: + +- `figshare_cache.pkl`: Cached Figshare API responses +- `bibtext_cache`: Cached BibTeX entries from DOI lookups +- `shortdoi_cache`: Cached short DOI mappings +- `crossref_cache.db`: Cached Crossref API responses for DOI guessing + +## GitHub Actions Workflow + +The 
workflow runs automatically: +- Weekly on Tuesdays at 02:30 UTC +- On push to main branch +- On pull requests +- Can be manually triggered via workflow_dispatch + +### Workflow Steps + +1. Checkout repository +2. Restore cache +3. Install Python dependencies +4. Run Figshare exporter +5. Publish results to Nexus repository +6. Upload artifacts + +## Troubleshooting + +### 403 Forbidden Errors + +If you encounter 403 errors when accessing the Figshare API: +1. Ensure the `FIGSHARE_TOKEN` environment variable is set +2. Verify the token is valid and hasn't expired +3. Check that the token has appropriate permissions (read access to public articles) + +For detailed information about the 403 error and resolution steps, see [FIGSHARE_API_RESEARCH.md](FIGSHARE_API_RESEARCH.md). + +### Empty Results + +If no articles are found: +- Check that author names match exactly as they appear in Figshare +- Verify the articles are in the Lincoln repository (https://repository.lincoln.ac.uk) +- Use `--debug` flag for detailed logging + +### JSON Decode Errors + +The application includes validation for JSON responses. If issues persist: +- Check your internet connection +- Verify Figshare API is accessible +- Review logs for specific error messages + +## Development + +### Running Tests + +```bash +# Run with a single test author +python figshare.py --authors "Marc Hanheide" --debug +``` + +### Code Structure + +- `figshare.py`: Main script with FigShare API client and processing logic +- `doi2bib`: Class for DOI to BibTeX conversion +- `FigShare`: Class for Figshare API interactions +- `Author`: Class for author-specific processing + +## License + +[Add license information here] + +## Contact + +For issues or questions, please open an issue in the GitHub repository. 
diff --git a/figshare.py b/figshare.py index 7725139..efd952a 100644 --- a/figshare.py +++ b/figshare.py @@ -123,6 +123,10 @@ class FigShare: def __init__(self, page_size=100): self.logger = getLogger("FigShare") self.token = os.getenv('FIGSHARE_TOKEN') + if self.token: + self.logger.info("Using authenticated requests with FIGSHARE_TOKEN") + else: + self.logger.warning("No FIGSHARE_TOKEN found - using anonymous requests (may hit rate limits or receive 403 errors)") self.page_size = page_size self.base_url = "https://api.figshare.com/v2" @@ -157,6 +161,19 @@ def __get(self, url, params=None, use_cache=True): else: headers = { "Authorization": "token " + self.token } if self.token else {} response = get(self.base_url + url, headers=headers, params=params) + + # Handle 403 Forbidden errors with helpful message + if response.status_code == 403: + if not self.token: + self.logger.error(f"403 Forbidden for GET {self.base_url + url}: " + f"Authentication required. Set FIGSHARE_TOKEN environment variable. " + f"See README.md for instructions.") + else: + self.logger.error(f"403 Forbidden for GET {self.base_url + url}: " + f"Token may be invalid or lack permissions. " + f"Response: {response.text[:200]}") + return {} + # Check if response is valid and contains JSON if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): result = response.json() @@ -174,6 +191,19 @@ def __post(self, url, params=None, use_cache=True): else: headers = { "Authorization": "token " + self.token } if self.token else {} response = post(self.base_url + url, headers=headers, json=params) + + # Handle 403 Forbidden errors with helpful message + if response.status_code == 403: + if not self.token: + self.logger.error(f"403 Forbidden for POST {self.base_url + url}: " + f"Authentication required. Set FIGSHARE_TOKEN environment variable. 
" + f"See README.md for instructions.") + else: + self.logger.error(f"403 Forbidden for POST {self.base_url + url}: " + f"Token may be invalid or lack permissions. " + f"Response: {response.text[:200]}") + return [] + # Check if response is valid and contains JSON if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): result = response.json() From ef9d296a3f05d7684e0781f0f54280ac5fd0f8c6 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 23 Dec 2025 14:03:39 +0000 Subject: [PATCH 03/19] Add implementation summary and next steps Co-authored-by: marc-hanheide <1153084+marc-hanheide@users.noreply.github.com> --- IMPLEMENTATION_SUMMARY.md | 113 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 113 insertions(+) create mode 100644 IMPLEMENTATION_SUMMARY.md diff --git a/IMPLEMENTATION_SUMMARY.md b/IMPLEMENTATION_SUMMARY.md new file mode 100644 index 0000000..7981479 --- /dev/null +++ b/IMPLEMENTATION_SUMMARY.md @@ -0,0 +1,113 @@ +# SUMMARY: Figshare API 403 Error Resolution + +## Problem Identified +Your GitHub Actions workflow was failing with a **403 Forbidden** error when trying to access the Figshare API `/articles/search` endpoint. + +## Root Cause +The Figshare API requires authentication for the `/articles/search` POST endpoint. While your Python code already supported token authentication through the `FIGSHARE_TOKEN` environment variable, the GitHub Actions workflow was not passing this token to the script. + +## Changes Made + +### 1. Updated GitHub Actions Workflow +**File**: `.github/workflows/figshare-processing.yaml` + +Added the `FIGSHARE_TOKEN` environment variable to the Python script execution step: +```yaml +- name: Run figshare exporter + env: + FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} + run: | + set -e + cd ./output + python ../figshare.py --force-refresh +``` + +### 2. 
Enhanced Error Handling +**File**: `figshare.py` + +- Added logging on initialization to warn if no token is present +- Enhanced error handling in `__get()` and `__post()` methods to detect 403 errors +- Provides helpful error messages directing users to setup instructions + +### 3. Comprehensive Documentation +Created two new documentation files: + +**FIGSHARE_API_RESEARCH.md** +- Detailed analysis of 403 error causes +- Explanation of Figshare API authentication +- Step-by-step token setup instructions +- Additional recommendations for retry logic and error handling + +**README.md** +- Complete project overview and setup guide +- How to obtain a Figshare API token +- Usage instructions and command-line arguments +- Troubleshooting section +- Output files explanation + +## REQUIRED ACTION: Setup Figshare API Token + +To resolve the 403 error, you **must** add a Figshare API token to your GitHub repository: + +### Step 1: Obtain a Figshare API Token +1. Go to https://figshare.com and create an account (or log in) +2. Navigate to **Account Settings** → **Applications** +3. Click **"Create Personal Token"** or **"Create New Application"** +4. Name it (e.g., "LCAS eprint cache GitHub Actions") +5. Select **read permissions** for public articles +6. Generate and copy the token + +### Step 2: Add Token to GitHub Secrets +1. Go to your repository: https://github.com/LCAS/eprint_cache +2. Click **Settings** → **Secrets and variables** → **Actions** +3. Click **"New repository secret"** +4. **Name**: `FIGSHARE_TOKEN` +5. **Value**: Paste the token from Figshare +6. Click **"Add secret"** + +### Step 3: Test the Fix +Once you've added the secret: +1. The workflow will automatically use it on the next run +2. You can manually trigger a workflow run to test it immediately +3. 
Go to **Actions** tab → Select the workflow → Click **"Run workflow"** + +## What Happens Now + +✅ **With the token configured**: +- The workflow will authenticate with Figshare API +- Requests will succeed without 403 errors +- Higher rate limits will apply +- Reliable access to publication data + +❌ **Without the token**: +- The code will still run but issue warnings +- Anonymous requests may fail with 403 errors +- Lower rate limits apply +- Workflow will likely fail + +## Benefits of These Changes + +1. **Clear Error Messages**: If the token is missing or invalid, you'll see helpful error messages +2. **Better Logging**: The script now logs whether it's using authenticated or anonymous requests +3. **Complete Documentation**: README provides full setup and usage instructions +4. **Research Documentation**: Detailed analysis of the issue for future reference + +## Testing Locally + +To test the changes locally: +```bash +export FIGSHARE_TOKEN="your_token_here" +python figshare.py --authors "Marc Hanheide" --debug +``` + +## Questions or Issues? + +If you encounter any problems after setting up the token: +1. Check that the secret name is exactly `FIGSHARE_TOKEN` +2. Verify the token hasn't expired in Figshare +3. Review the workflow logs for specific error messages +4. See `FIGSHARE_API_RESEARCH.md` for detailed troubleshooting + +--- + +**Next Step**: Please add the `FIGSHARE_TOKEN` secret to your repository as described above. This is the only remaining action needed to fully resolve the 403 error. 
From d39681d2e87a3c9f1be2eb9ad2b07829be25fecc Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 23 Dec 2025 14:05:23 +0000 Subject: [PATCH 04/19] Refactor 403 error handling to reduce code duplication Co-authored-by: marc-hanheide <1153084+marc-hanheide@users.noreply.github.com> --- figshare.py | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/figshare.py b/figshare.py index efd952a..fc65196 100644 --- a/figshare.py +++ b/figshare.py @@ -154,6 +154,17 @@ def __init_params(self): "page_size": self.page_size } + def __handle_403_error(self, url, method="GET"): + """Handle 403 Forbidden errors with helpful messages""" + if not self.token: + self.logger.error(f"403 Forbidden for {method} {self.base_url + url}: " + f"Authentication required. Set FIGSHARE_TOKEN environment variable. " + f"See README.md for instructions.") + else: + self.logger.error(f"403 Forbidden for {method} {self.base_url + url}: " + f"Token may be invalid or lack permissions. " + f"Check token in Figshare account settings.") + def __get(self, url, params=None, use_cache=True): hash_key = f"GET{url}?{params}" if hash_key in self.__cache and use_cache: @@ -164,14 +175,7 @@ def __get(self, url, params=None, use_cache=True): # Handle 403 Forbidden errors with helpful message if response.status_code == 403: - if not self.token: - self.logger.error(f"403 Forbidden for GET {self.base_url + url}: " - f"Authentication required. Set FIGSHARE_TOKEN environment variable. " - f"See README.md for instructions.") - else: - self.logger.error(f"403 Forbidden for GET {self.base_url + url}: " - f"Token may be invalid or lack permissions. 
" - f"Response: {response.text[:200]}") + self.__handle_403_error(url, "GET") return {} # Check if response is valid and contains JSON @@ -194,14 +198,7 @@ def __post(self, url, params=None, use_cache=True): # Handle 403 Forbidden errors with helpful message if response.status_code == 403: - if not self.token: - self.logger.error(f"403 Forbidden for POST {self.base_url + url}: " - f"Authentication required. Set FIGSHARE_TOKEN environment variable. " - f"See README.md for instructions.") - else: - self.logger.error(f"403 Forbidden for POST {self.base_url + url}: " - f"Token may be invalid or lack permissions. " - f"Response: {response.text[:200]}") + self.__handle_403_error(url, "POST") return [] # Check if response is valid and contains JSON From b4bb7bcf2117c81ff17c6a6c7dcb728c9f1dc85e Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 23 Dec 2025 14:06:50 +0000 Subject: [PATCH 05/19] Improve logging message to avoid potential security concerns Co-authored-by: marc-hanheide <1153084+marc-hanheide@users.noreply.github.com> --- figshare.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/figshare.py b/figshare.py index fc65196..36b4d3b 100644 --- a/figshare.py +++ b/figshare.py @@ -124,9 +124,9 @@ def __init__(self, page_size=100): self.logger = getLogger("FigShare") self.token = os.getenv('FIGSHARE_TOKEN') if self.token: - self.logger.info("Using authenticated requests with FIGSHARE_TOKEN") + self.logger.info("Figshare API: Using authenticated requests") else: - self.logger.warning("No FIGSHARE_TOKEN found - using anonymous requests (may hit rate limits or receive 403 errors)") + self.logger.warning("Figshare API: No authentication token found - using anonymous requests (may hit rate limits or receive 403 errors)") self.page_size = page_size self.base_url = "https://api.figshare.com/v2" From e3104678ac9e4a4527870d6c890df4b167bd8a85 Mon Sep 17 00:00:00 2001 From: 
"copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 23 Dec 2025 14:48:48 +0000 Subject: [PATCH 06/19] Add rate limiting and make force-refresh optional in workflow - Add configurable rate_limit_delay parameter (default 1 second) to FigShare class - Add sleep after each API request to avoid rate limits - Add --rate-limit-delay CLI argument - Update workflow to only use --force-refresh when manually dispatched with force_refresh=true - By default, workflow now uses cache for all automated runs - Update README with rate limiting documentation and workflow behavior Co-authored-by: marc-hanheide <1153084+marc-hanheide@users.noreply.github.com> --- .github/workflows/figshare-processing.yaml | 17 ++++++++++++- README.md | 29 ++++++++++++++++++---- figshare.py | 23 ++++++++++++++--- 3 files changed, 59 insertions(+), 10 deletions(-) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index e20d508..1e93db1 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -2,6 +2,15 @@ name: figshare-cache on: workflow_dispatch: + inputs: + force_refresh: + description: 'Force refresh data (ignore cache)' + required: false + default: 'false' + type: choice + options: + - 'true' + - 'false' schedule: - cron: "30 2 * * 2" push: @@ -51,7 +60,13 @@ jobs: run: | set -e cd ./output - python ../figshare.py --force-refresh + if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.force_refresh }}" = "true" ]; then + echo "Running with --force-refresh (manually triggered)" + python ../figshare.py --force-refresh + else + echo "Running with cache (default behavior)" + python ../figshare.py + fi - name: Nexus Repo Publish bibtex diff --git a/README.md b/README.md index 420f486..edf95d4 100644 --- a/README.md +++ b/README.md @@ -64,6 +64,9 @@ python figshare.py --authors-file staff.json # Force refresh (ignore cache) python figshare.py 
--force-refresh +# Adjust rate limiting (default is 1 second delay between requests) +python figshare.py --rate-limit-delay 2.0 + # Enable debug logging python figshare.py --debug @@ -79,6 +82,7 @@ python figshare.py --output my_articles.csv --output-all my_articles_all.csv - `-o, --output`: Output CSV filename for deduplicated publications, default: figshare_articles.csv - `-O, --output-all`: Output CSV filename for all publications (with duplicates), default: figshare_articles_all.csv - `--force-refresh`: Force refresh data instead of loading from cache +- `--rate-limit-delay`: Delay in seconds between Figshare API requests, default: 1.0 - `--debug`: Enable debug logging ## Output Files @@ -104,20 +108,35 @@ The application uses several cache files to minimize API calls: ## GitHub Actions Workflow The workflow runs automatically: -- Weekly on Tuesdays at 02:30 UTC -- On push to main branch -- On pull requests -- Can be manually triggered via workflow_dispatch +- Weekly on Tuesdays at 02:30 UTC (uses cache by default) +- On push to main branch (uses cache by default) +- On pull requests (uses cache by default) +- Can be manually triggered via workflow_dispatch with optional force refresh + +### Manual Workflow Trigger + +When manually triggering the workflow: +1. Go to Actions → figshare-cache workflow +2. Click "Run workflow" +3. Choose whether to force refresh: + - **false** (default): Uses cached data, faster and respects rate limits + - **true**: Ignores cache and fetches fresh data from Figshare API + +**Note**: Force refresh should only be used when you need to ensure the latest data, as it makes many API requests and takes longer to complete. ### Workflow Steps 1. Checkout repository 2. Restore cache 3. Install Python dependencies -4. Run Figshare exporter +4. Run Figshare exporter (with or without --force-refresh based on trigger) 5. Publish results to Nexus repository 6. 
Upload artifacts +### Rate Limiting + +The script includes built-in rate limiting with a 1-second delay between API requests to avoid hitting Figshare API rate limits. This helps ensure reliable operation even with authenticated requests. + ## Troubleshooting ### 403 Forbidden Errors diff --git a/figshare.py b/figshare.py index 36b4d3b..8497b0b 100644 --- a/figshare.py +++ b/figshare.py @@ -27,6 +27,7 @@ import argparse from datetime import datetime from difflib import SequenceMatcher +import time basicConfig(level=INFO) @@ -120,7 +121,7 @@ def entries_to_str(self, entries): class FigShare: - def __init__(self, page_size=100): + def __init__(self, page_size=100, rate_limit_delay=1.0): self.logger = getLogger("FigShare") self.token = os.getenv('FIGSHARE_TOKEN') if self.token: @@ -128,7 +129,11 @@ def __init__(self, page_size=100): else: self.logger.warning("Figshare API: No authentication token found - using anonymous requests (may hit rate limits or receive 403 errors)") self.page_size = page_size + self.rate_limit_delay = rate_limit_delay self.base_url = "https://api.figshare.com/v2" + + if self.rate_limit_delay > 0: + self.logger.info(f"Rate limiting enabled: {self.rate_limit_delay} second delay between API requests") # if cache file exist, load it self.cache_file = "figshare_cache.pkl" @@ -173,6 +178,10 @@ def __get(self, url, params=None, use_cache=True): headers = { "Authorization": "token " + self.token } if self.token else {} response = get(self.base_url + url, headers=headers, params=params) + # Rate limiting: sleep after each API request + if self.rate_limit_delay > 0: + time.sleep(self.rate_limit_delay) + # Handle 403 Forbidden errors with helpful message if response.status_code == 403: self.__handle_403_error(url, "GET") @@ -196,6 +205,10 @@ def __post(self, url, params=None, use_cache=True): headers = { "Authorization": "token " + self.token } if self.token else {} response = post(self.base_url + url, headers=headers, json=params) + # Rate limiting: 
sleep after each API request + if self.rate_limit_delay > 0: + time.sleep(self.rate_limit_delay) + # Handle 403 Forbidden errors with helpful message if response.status_code == 403: self.__handle_403_error(url, "POST") @@ -233,12 +246,12 @@ def get_article(self, article_id, use_cache=True): return self.__get(f"/articles/{article_id}", use_cache=use_cache) class Author: - def __init__(self, name, debug=False): + def __init__(self, name, debug=False, rate_limit_delay=1.0): self.logger = getLogger("Author") if debug: self.logger.setLevel(DEBUG) self.name = name - self.fs = FigShare() + self.fs = FigShare(rate_limit_delay=rate_limit_delay) self.articles = {} self.public_html_prefix = "https://repository.lincoln.ac.uk" self.df = None @@ -481,6 +494,8 @@ def parse_args(): # help='Output CSV filename for publications since specified date') parser.add_argument('--force-refresh', action='store_true', help='Force refresh data instead of loading from cache') + parser.add_argument('--rate-limit-delay', type=float, default=1.0, + help='Delay in seconds between Figshare API requests (default: 1.0)') parser.add_argument('--debug', action='store_true', help='Enable debug logging') @@ -541,7 +556,7 @@ def figshare_processing(): for author_name in authors_list: logger.info(f"*** Processing {author_name}...") - authors[author_name] = Author(author_name, debug=args.debug) + authors[author_name] = Author(author_name, debug=args.debug, rate_limit_delay=args.rate_limit_delay) cache_exists = os.path.exists(f"{author_name}.db") if cache_exists and not args.force_refresh: From a8178dafcf2f2c0b9ab98be54bdabe2333927782 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 15:26:34 +0000 Subject: [PATCH 07/19] Refactor Figshare API workflow to use author cache and enhance error handling; add publication statistics generation script --- .github/workflows/figshare-processing.yaml | 15 +- FIGSHARE_API_RESEARCH.md | 261 --------------------- IMPLEMENTATION_SUMMARY.md | 113 --------- 
figshare.py | 12 +- generate_stats.py | 111 +++++++++ 5 files changed, 126 insertions(+), 386 deletions(-) delete mode 100644 FIGSHARE_API_RESEARCH.md delete mode 100644 IMPLEMENTATION_SUMMARY.md create mode 100755 generate_stats.py diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index 1e93db1..1c73a20 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -3,8 +3,8 @@ name: figshare-cache on: workflow_dispatch: inputs: - force_refresh: - description: 'Force refresh data (ignore cache)' + use_author_cache: + description: 'Use cached author data (instead of refreshing)' required: false default: 'false' type: choice @@ -60,14 +60,17 @@ jobs: run: | set -e cd ./output - if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.force_refresh }}" = "true" ]; then - echo "Running with --force-refresh (manually triggered)" - python ../figshare.py --force-refresh + if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.use_author_cache }}" = "true" ]; then + echo "Running with --use-author-cache (manually triggered)" + python ../figshare.py --use-author-cache else - echo "Running with cache (default behavior)" + echo "Running without cache (default behavior)" python ../figshare.py fi + - name: Generate publication statistics + run: | + python ../generate_stats.py --all-csv figshare_articles_all.csv --dedup-csv figshare_articles.csv >> $GITHUB_STEP_SUMMARY - name: Nexus Repo Publish bibtex if: ${{ github.event_name != 'pull_request' }} diff --git a/FIGSHARE_API_RESEARCH.md b/FIGSHARE_API_RESEARCH.md deleted file mode 100644 index bfe845e..0000000 --- a/FIGSHARE_API_RESEARCH.md +++ /dev/null @@ -1,261 +0,0 @@ -# Figshare API 403 Error Research - -## Issue Description -The workflow is experiencing 403 (Forbidden) errors when calling the Figshare API `/articles/search` endpoint. 
- -## API Endpoint Information - -### Endpoint: POST /v2/articles/search -- **Base URL**: https://api.figshare.com/v2 -- **Method**: POST -- **Purpose**: Search for articles in Figshare repository - -## Common Causes of 403 Errors in REST APIs - -### 1. Authentication Required -Many public APIs require authentication even for read operations to: -- Prevent abuse and rate limiting -- Track usage -- Control access to certain features - -### 2. Rate Limiting -APIs may return 403 when: -- Too many requests from the same IP -- Exceeding the allowed request rate -- No authentication token provided (forcing lower rate limits for anonymous users) - -### 3. Geographic Restrictions -Some APIs block requests from certain regions or IP ranges - -### 4. User-Agent Blocking -APIs may block requests that don't include proper User-Agent headers - -## Figshare API Authentication - -### Public vs Private Endpoints -Figshare API has two types of endpoints: -- **Public endpoints**: Generally don't require authentication (GET requests for public data) -- **Private endpoints**: Require authentication - -### Authentication Methods -Figshare API supports OAuth2 authentication: -- Uses personal access tokens -- Token should be included in the Authorization header: `Authorization: token YOUR_TOKEN` - -### POST /articles/search Endpoint -This endpoint performs a search operation using POST method (to allow complex search queries in the body). - -**Key Issue**: While some Figshare search operations may work without authentication, the POST method to `/articles/search` may require authentication or have different rate limits compared to anonymous access. - -## Current Implementation Analysis - -Looking at `figshare.py` lines 125-176: - -```python -def __init__(self, page_size=100): - self.token = os.getenv('FIGSHARE_TOKEN') - # ... 
token is optional - -def __post(self, url, params=None, use_cache=True): - headers = { "Authorization": "token " + self.token } if self.token else {} - response = post(self.base_url + url, headers=headers, json=params) -``` - -**Current behavior**: -- Token is optional (read from environment variable) -- If no token is provided, requests are made anonymously -- This may work sometimes but fail with 403 when: - - Rate limits are hit - - API policy changes - - IP-based restrictions apply - -## Recommendations - -### 1. Obtain a Figshare API Token - -**How to get a token**: -1. Create a Figshare account at https://figshare.com -2. Go to Account Settings -3. Navigate to "Applications" or "API" section -4. Create a new application/token -5. Generate a personal access token -6. Copy and store the token securely - -**Token Permissions**: -- For read-only operations (searching, retrieving articles), read permissions are sufficient -- No write permissions needed for this use case - -### 2. Add Token to GitHub Secrets - -**Steps**: -1. Go to repository Settings -2. Navigate to Secrets and variables → Actions -3. Create a new repository secret named `FIGSHARE_TOKEN` -4. Paste the Figshare API token -5. The workflow already references this secret in the environment (if added) - -**Note**: Check if workflow file needs to be updated to pass the secret as an environment variable. - -### 3. Update Workflow (if needed) - -If not already present, add to `.github/workflows/figshare-processing.yaml`: - -```yaml -env: - FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} -``` - -Or in the specific job/step that runs the Python script. - -## Alternative Solutions - -### 1. Add Retry Logic with Exponential Backoff -If 403 is intermittent, add retry logic to handle temporary rate limit issues. - -### 2. Add User-Agent Header -Some APIs require a proper User-Agent header. 
Update the request headers to include: -```python -headers = { - "Authorization": f"token {self.token}" if self.token else "", - "User-Agent": "LCAS-eprint-cache/1.0" -} -``` - -### 3. Implement Caching More Aggressively -The code already has caching, but ensure it's used effectively to minimize API calls. - -### 4. Use GET endpoint if available -Check if there's a GET version of the articles/search endpoint that might have different authentication requirements. - -## Workflow Configuration Issue - -**Current Status**: The workflow file does NOT pass the `FIGSHARE_TOKEN` environment variable to the Python script. - -Looking at `.github/workflows/figshare-processing.yaml`: -- Line 48-52: The "Run figshare exporter" step does not include any environment variables -- The Python script expects `FIGSHARE_TOKEN` via `os.getenv('FIGSHARE_TOKEN')` (figshare.py line 125) -- Without the token, all requests are anonymous and more likely to hit rate limits or be rejected - -## Conclusion - -**Root Cause**: The 403 error is caused by missing authentication when calling the Figshare API `/articles/search` endpoint. - -**Evidence**: -1. The Python code supports token authentication (line 125, 158, 175) -2. The workflow file does not pass the `FIGSHARE_TOKEN` environment variable -3. Anonymous requests to POST endpoints are more restricted and likely to fail with 403 - -**Recommended Solution**: - -### Step 1: Obtain a Figshare API Token -1. Create a Figshare account at https://figshare.com -2. Log in to your account -3. Go to Account Settings (click your profile icon → Settings) -4. Navigate to "Applications" section -5. Click "Create Personal Token" or "Create New Application" -6. Give it a descriptive name (e.g., "LCAS eprint cache GitHub Actions") -7. Select appropriate permissions (read access to public articles is sufficient) -8. Generate the token and copy it securely - -### Step 2: Add Token to GitHub Repository Secrets -1. 
Go to the GitHub repository: https://github.com/LCAS/eprint_cache -2. Navigate to Settings → Secrets and variables → Actions -3. Click "New repository secret" -4. Name: `FIGSHARE_TOKEN` -5. Value: Paste the Figshare API token -6. Click "Add secret" - -### Step 3: Update Workflow to Pass Token -Add the environment variable to the "Run figshare exporter" step in `.github/workflows/figshare-processing.yaml`: - -```yaml -- name: Run figshare exporter - env: - FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} - run: | - set -e - cd ./output - python ../figshare.py --force-refresh -``` - -### Step 4: Test the Changes -1. Create a pull request with the workflow change -2. The workflow should run automatically -3. Verify that the 403 error no longer occurs -4. Check that articles are successfully retrieved - -## Additional Recommendations - -### 1. Add Better Error Handling -Update the `__post` method to provide more informative error messages: - -```python -def __post(self, url, params=None, use_cache=True): - hash_key = f"POST{url}?{params}" - if hash_key in self.__cache and use_cache: - return self.__cache[hash_key] - else: - headers = { "Authorization": "token " + self.token } if self.token else {} - response = post(self.base_url + url, headers=headers, json=params) - - if response.status_code == 403: - self.logger.error(f"403 Forbidden: Authentication may be required. " - f"Ensure FIGSHARE_TOKEN environment variable is set. " - f"Response: {response.text}") - return [] - - if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): - result = response.json() - self.__cache[hash_key] = result - self.save_cache() - return result - else: - self.logger.warning(f"Received empty or invalid JSON response for POST {self.base_url + url} (status: {response.status_code})") - return [] -``` - -### 2. 
Add Retry Logic -Consider adding retry logic with exponential backoff for transient errors: - -```python -import time -from requests.adapters import HTTPAdapter -from requests.packages.urllib3.util.retry import Retry - -def __init__(self, page_size=100): - self.logger = getLogger("FigShare") - self.token = os.getenv('FIGSHARE_TOKEN') - self.page_size = page_size - self.base_url = "https://api.figshare.com/v2" - - # Configure retry strategy - retry_strategy = Retry( - total=3, - backoff_factor=1, - status_forcelist=[429, 500, 502, 503, 504], - allowed_methods=["GET", "POST"] - ) - adapter = HTTPAdapter(max_retries=retry_strategy) - self.session = requests.Session() - self.session.mount("https://", adapter) -``` - -### 3. Log Token Status -Add logging to indicate whether token authentication is being used: - -```python -def __init__(self, page_size=100): - self.logger = getLogger("FigShare") - self.token = os.getenv('FIGSHARE_TOKEN') - if self.token: - self.logger.info("Using authenticated requests with FIGSHARE_TOKEN") - else: - self.logger.warning("No FIGSHARE_TOKEN found - using anonymous requests (may hit rate limits)") - # ... rest of init -``` - -## References -- Figshare API Documentation: https://docs.figshare.com/ -- Figshare API Reference: https://docs.figshare.com/#figshare-documentation-api-description -- Figshare API Authentication: https://docs.figshare.com/#authentication -- GitHub Actions Secrets: https://docs.github.com/en/actions/security-guides/encrypted-secrets diff --git a/IMPLEMENTATION_SUMMARY.md b/IMPLEMENTATION_SUMMARY.md deleted file mode 100644 index 7981479..0000000 --- a/IMPLEMENTATION_SUMMARY.md +++ /dev/null @@ -1,113 +0,0 @@ -# SUMMARY: Figshare API 403 Error Resolution - -## Problem Identified -Your GitHub Actions workflow was failing with a **403 Forbidden** error when trying to access the Figshare API `/articles/search` endpoint. - -## Root Cause -The Figshare API requires authentication for the `/articles/search` POST endpoint. 
While your Python code already supported token authentication through the `FIGSHARE_TOKEN` environment variable, the GitHub Actions workflow was not passing this token to the script. - -## Changes Made - -### 1. Updated GitHub Actions Workflow -**File**: `.github/workflows/figshare-processing.yaml` - -Added the `FIGSHARE_TOKEN` environment variable to the Python script execution step: -```yaml -- name: Run figshare exporter - env: - FIGSHARE_TOKEN: ${{ secrets.FIGSHARE_TOKEN }} - run: | - set -e - cd ./output - python ../figshare.py --force-refresh -``` - -### 2. Enhanced Error Handling -**File**: `figshare.py` - -- Added logging on initialization to warn if no token is present -- Enhanced error handling in `__get()` and `__post()` methods to detect 403 errors -- Provides helpful error messages directing users to setup instructions - -### 3. Comprehensive Documentation -Created two new documentation files: - -**FIGSHARE_API_RESEARCH.md** -- Detailed analysis of 403 error causes -- Explanation of Figshare API authentication -- Step-by-step token setup instructions -- Additional recommendations for retry logic and error handling - -**README.md** -- Complete project overview and setup guide -- How to obtain a Figshare API token -- Usage instructions and command-line arguments -- Troubleshooting section -- Output files explanation - -## REQUIRED ACTION: Setup Figshare API Token - -To resolve the 403 error, you **must** add a Figshare API token to your GitHub repository: - -### Step 1: Obtain a Figshare API Token -1. Go to https://figshare.com and create an account (or log in) -2. Navigate to **Account Settings** → **Applications** -3. Click **"Create Personal Token"** or **"Create New Application"** -4. Name it (e.g., "LCAS eprint cache GitHub Actions") -5. Select **read permissions** for public articles -6. Generate and copy the token - -### Step 2: Add Token to GitHub Secrets -1. Go to your repository: https://github.com/LCAS/eprint_cache -2. 
Click **Settings** → **Secrets and variables** → **Actions** -3. Click **"New repository secret"** -4. **Name**: `FIGSHARE_TOKEN` -5. **Value**: Paste the token from Figshare -6. Click **"Add secret"** - -### Step 3: Test the Fix -Once you've added the secret: -1. The workflow will automatically use it on the next run -2. You can manually trigger a workflow run to test it immediately -3. Go to **Actions** tab → Select the workflow → Click **"Run workflow"** - -## What Happens Now - -✅ **With the token configured**: -- The workflow will authenticate with Figshare API -- Requests will succeed without 403 errors -- Higher rate limits will apply -- Reliable access to publication data - -❌ **Without the token**: -- The code will still run but issue warnings -- Anonymous requests may fail with 403 errors -- Lower rate limits apply -- Workflow will likely fail - -## Benefits of These Changes - -1. **Clear Error Messages**: If the token is missing or invalid, you'll see helpful error messages -2. **Better Logging**: The script now logs whether it's using authenticated or anonymous requests -3. **Complete Documentation**: README provides full setup and usage instructions -4. **Research Documentation**: Detailed analysis of the issue for future reference - -## Testing Locally - -To test the changes locally: -```bash -export FIGSHARE_TOKEN="your_token_here" -python figshare.py --authors "Marc Hanheide" --debug -``` - -## Questions or Issues? - -If you encounter any problems after setting up the token: -1. Check that the secret name is exactly `FIGSHARE_TOKEN` -2. Verify the token hasn't expired in Figshare -3. Review the workflow logs for specific error messages -4. See `FIGSHARE_API_RESEARCH.md` for detailed troubleshooting - ---- - -**Next Step**: Please add the `FIGSHARE_TOKEN` secret to your repository as described above. This is the only remaining action needed to fully resolve the 403 error. 
diff --git a/figshare.py b/figshare.py index 8497b0b..3e70972 100644 --- a/figshare.py +++ b/figshare.py @@ -481,9 +481,9 @@ def parse_args(): formatter_class=argparse.ArgumentDefaultsHelpFormatter ) parser.add_argument('-a', '--authors', nargs='+', - help='List of author names to process') + help='List of author names to process (uses default list if not specified)') parser.add_argument('-f', '--authors-file', type=str, - help='Path to file containing list of authors (one per line)') + help='Path to file containing list of authors, one per line (uses default list if not specified)') parser.add_argument('-s', '--since', type=str, default='2021-01-01', help='Process only publications since this date (YYYY-MM-DD)') parser.add_argument('-o', '--output', type=str, default='figshare_articles.csv', @@ -492,8 +492,8 @@ def parse_args(): help='Output CSV filename for all publications by authors (includes duplicates when multiple authors per output)') # parser.add_argument('-r', '--recent-output', type=str, default='figshare_articles_recent.csv', # help='Output CSV filename for publications since specified date') - parser.add_argument('--force-refresh', action='store_true', - help='Force refresh data instead of loading from cache') + parser.add_argument('--use-author-cache', action='store_true', + help='Use cached author data instead of refreshing from API') parser.add_argument('--rate-limit-delay', type=float, default=1.0, help='Delay in seconds between Figshare API requests (default: 1.0)') parser.add_argument('--debug', action='store_true', @@ -559,12 +559,12 @@ def figshare_processing(): authors[author_name] = Author(author_name, debug=args.debug, rate_limit_delay=args.rate_limit_delay) cache_exists = os.path.exists(f"{author_name}.db") - if cache_exists and not args.force_refresh: + if cache_exists and args.use_author_cache: logger.info(f"Loading cached data for {author_name}") authors[author_name].load() else: logger.info(f"Retrieving data for {author_name}") - 
authors[author_name].retrieve(not args.force_refresh) + authors[author_name].retrieve(args.use_author_cache) authors[author_name].save() if authors[author_name].df is not None: diff --git a/generate_stats.py b/generate_stats.py new file mode 100755 index 0000000..653eac1 --- /dev/null +++ b/generate_stats.py @@ -0,0 +1,111 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +""" +Generate publication statistics from figshare articles CSV. +Outputs a markdown table showing publications per author per year. +""" + +import pandas as pd +import sys +import argparse +from pathlib import Path + +def generate_statistics(all_csv='figshare_articles_all.csv', dedup_csv='figshare_articles.csv'): + """ + Read the figshare articles CSVs and generate statistics. + + Args: + all_csv: CSV file with all publications (includes duplicates for multi-author papers) + dedup_csv: CSV file with deduplicated publications (for calculating true totals) + + Returns: + A markdown table string showing statistics. + """ + try: + # Read the per-author CSV file (includes duplicates for multi-author papers) + df_all = pd.read_csv(all_csv) + + # Read the deduplicated CSV file (for accurate totals) + df_dedup = pd.read_csv(dedup_csv) + + if df_all.empty: + return "No publication data available." + + # Ensure we have the required columns + if 'author' not in df_all.columns or 'online_year' not in df_all.columns: + return "Error: Required columns (author, online_year) not found in all articles CSV." + + if 'online_year' not in df_dedup.columns: + return "Error: Required column (online_year) not found in deduplicated CSV." 
+ + # Group by author and year, count publications per author + stats = df_all.groupby(['author', 'online_year']).size().reset_index(name='count') + + # Pivot to get years as columns + pivot = stats.pivot(index='author', columns='online_year', values='count').fillna(0).astype(int) + + # Sort columns (years) in descending order (most recent first) + pivot = pivot[sorted(pivot.columns, reverse=True)] + + # Calculate total per author (from their individual publications) + pivot['Total'] = pivot.sum(axis=1) + + # Sort by total publications (descending) + pivot = pivot.sort_values('Total', ascending=False) + + # Calculate actual yearly totals from deduplicated data + dedup_by_year = df_dedup.groupby('online_year').size() + + # Generate markdown table + md_lines = ["# Publication Statistics by Author and Year", ""] + md_lines.append(f"**Total Authors:** {len(pivot)}\n") + md_lines.append(f"**Total Publications (deduplicated):** {len(df_dedup)}\n") + md_lines.append("") + + # Create table header + headers = ['**Author**', '**Total**'] + [str(year) for year in pivot.columns if year != 'Total'] + md_lines.append('| ' + ' | '.join(headers) + ' |') + md_lines.append('| ' + ' | '.join(['---' for _ in headers]) + ' |') + + # Create table rows + for author, row in pivot.iterrows(): + values = [f"**{author}**", f"**{int(row['Total'])}**"] + [str(int(row[year])) if row[year] > 0 else '-' for year in pivot.columns if year != 'Total'] + md_lines.append('| ' + ' | '.join(values) + ' |') + + # Add yearly totals row using deduplicated data + year_columns = [year for year in pivot.columns if year != 'Total'] + year_totals = ['**Total (unique)**', f"**{len(df_dedup)}**"] + [str(int(dedup_by_year.get(year, 0))) for year in year_columns] + md_lines.append('| ' + ' | '.join(year_totals) + ' |') + + return '\n'.join(md_lines) + + except FileNotFoundError as e: + return f"Error: File not found - {e.filename}" + except Exception as e: + return f"Error generating statistics: {str(e)}" + +if 
__name__ == "__main__": + parser = argparse.ArgumentParser( + description="Generate publication statistics from FigShare articles CSV files.", + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument( + '--all-csv', + type=str, + default='figshare_articles_all.csv', + help='Path to CSV file with all publications (includes duplicates for multi-author papers)' + ) + parser.add_argument( + '--dedup-csv', + type=str, + default='figshare_articles.csv', + help='Path to CSV file with deduplicated publications (for accurate total counts)' + ) + + args = parser.parse_args() + + # Generate and print statistics + stats = generate_statistics(args.all_csv, args.dedup_csv) + print(stats) + From 44bd3bb989646fb39620fb3e7554e839b691ed78 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 15:38:54 +0000 Subject: [PATCH 08/19] Update figshare processing workflow to use actions/cache@v5 and change directory before generating publication statistics --- .github/workflows/figshare-processing.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index 1c73a20..44eb2e2 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -29,7 +29,7 @@ jobs: fetch-depth: 1 - name: Use Cache in folder ./output - uses: actions/cache@v3 + uses: actions/cache@v5 with: path: ./output key: cache-files @@ -70,6 +70,7 @@ jobs: - name: Generate publication statistics run: | + cd ./output python ../generate_stats.py --all-csv figshare_articles_all.csv --dedup-csv figshare_articles.csv >> $GITHUB_STEP_SUMMARY - name: Nexus Repo Publish bibtex From 69ccbd4edca117ee094163735354c98ef2cf655b Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 15:51:46 +0000 Subject: [PATCH 09/19] Refactor caching mechanism in FigShare class to use shelve for persistent storage --- figshare.py | 48 
++++++++++++++++++------------------------------ 1 file changed, 18 insertions(+), 30 deletions(-) diff --git a/figshare.py b/figshare.py index 3e70972..9b0bb70 100644 --- a/figshare.py +++ b/figshare.py @@ -5,12 +5,9 @@ from json import loads from pprint import pformat import pandas as pd -from functools import lru_cache, wraps -from datetime import datetime from logging import getLogger, basicConfig, INFO, DEBUG import os -from pickle import load, dump from flatten_dict import flatten @@ -135,23 +132,8 @@ def __init__(self, page_size=100, rate_limit_delay=1.0): if self.rate_limit_delay > 0: self.logger.info(f"Rate limiting enabled: {self.rate_limit_delay} second delay between API requests") - # if cache file exist, load it - self.cache_file = "figshare_cache.pkl" - if os.path.exists(self.cache_file): - try: - with open(self.cache_file, "rb") as f: - self.__cache = load(f) - self.logger.debug(f"Loaded cache from {self.cache_file} with {len(self.__cache)} entries") - except Exception as e: - self.logger.warning(f"Failed to load cache: {e}") - self.__cache = {} - else: - self.logger.info(f"No cache file found at {self.cache_file}") - self.__cache = {} - - def save_cache(self): - with open(self.cache_file,"wb") as f: - dump(self.__cache, f) + # Use shelve for persistent caching + self.cache_file = "figshare_cache.db" def __init_params(self): @@ -172,9 +154,12 @@ def __handle_403_error(self, url, method="GET"): def __get(self, url, params=None, use_cache=True): hash_key = f"GET{url}?{params}" - if hash_key in self.__cache and use_cache: - return self.__cache[hash_key] - else: + + with shelve.open(self.cache_file) as cache: + if hash_key in cache and use_cache: + self.logger.info(f"Cache hit for GET {url}") + return cache[hash_key] + headers = { "Authorization": "token " + self.token } if self.token else {} response = get(self.base_url + url, headers=headers, params=params) @@ -190,8 +175,8 @@ def __get(self, url, params=None, use_cache=True): # Check if response is 
valid and contains JSON if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): result = response.json() - self.__cache[hash_key] = result - self.save_cache() + cache[hash_key] = result + self.logger.debug(f"Cached result for GET {url}") return result else: self.logger.warning(f"Received empty or invalid JSON response for GET {self.base_url + url} (status: {response.status_code})") @@ -199,9 +184,12 @@ def __get(self, url, params=None, use_cache=True): def __post(self, url, params=None, use_cache=True): hash_key = f"POST{url}?{params}" - if hash_key in self.__cache and use_cache: - return self.__cache[hash_key] - else: + + with shelve.open(self.cache_file) as cache: + if hash_key in cache and use_cache: + self.logger.debug(f"Cache hit for POST {url}") + return cache[hash_key] + headers = { "Authorization": "token " + self.token } if self.token else {} response = post(self.base_url + url, headers=headers, json=params) @@ -217,8 +205,8 @@ def __post(self, url, params=None, use_cache=True): # Check if response is valid and contains JSON if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): result = response.json() - self.__cache[hash_key] = result - self.save_cache() + cache[hash_key] = result + self.logger.debug(f"Cached result for POST {url}") return result else: self.logger.warning(f"Received empty or invalid JSON response for POST {self.base_url + url} (status: {response.status_code})") From 2b6138f160bc119f5dbecfb47ef2b9a3825f02b6 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 15:53:46 +0000 Subject: [PATCH 10/19] Fix output path formatting in artifact upload step --- .github/workflows/figshare-processing.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index 44eb2e2..c4d2884 100644 --- 
a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -118,6 +118,6 @@ jobs: with: name: outputs path: | - ./output/*.csv + ./output/*.csv ./output/*.bib retention-days: 30 From 125a97ef5651393100aaf0f3b73107578b739cc6 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:03:14 +0000 Subject: [PATCH 11/19] Add max_retries parameter and implement retry logic for 403 errors in FigShare class --- figshare.py | 67 +++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 50 insertions(+), 17 deletions(-) diff --git a/figshare.py b/figshare.py index 9b0bb70..59b4e38 100644 --- a/figshare.py +++ b/figshare.py @@ -118,7 +118,7 @@ def entries_to_str(self, entries): class FigShare: - def __init__(self, page_size=100, rate_limit_delay=1.0): + def __init__(self, page_size=100, rate_limit_delay=1.0, max_retries=5): self.logger = getLogger("FigShare") self.token = os.getenv('FIGSHARE_TOKEN') if self.token: @@ -127,6 +127,7 @@ def __init__(self, page_size=100, rate_limit_delay=1.0): self.logger.warning("Figshare API: No authentication token found - using anonymous requests (may hit rate limits or receive 403 errors)") self.page_size = page_size self.rate_limit_delay = rate_limit_delay + self.max_retries = max_retries self.base_url = "https://api.figshare.com/v2" if self.rate_limit_delay > 0: @@ -141,7 +142,7 @@ def __init_params(self): "page_size": self.page_size } - def __handle_403_error(self, url, method="GET"): + def __handle_403_error(self, url, method="GET", response_text=""): """Handle 403 Forbidden errors with helpful messages""" if not self.token: self.logger.error(f"403 Forbidden for {method} {self.base_url + url}: " @@ -151,6 +152,8 @@ def __handle_403_error(self, url, method="GET"): self.logger.error(f"403 Forbidden for {method} {self.base_url + url}: " f"Token may be invalid or lack permissions. 
" f"Check token in Figshare account settings.") + if response_text: + self.logger.error(f"Response text: {response_text}") def __get(self, url, params=None, use_cache=True): hash_key = f"GET{url}?{params}" @@ -161,17 +164,31 @@ def __get(self, url, params=None, use_cache=True): return cache[hash_key] headers = { "Authorization": "token " + self.token } if self.token else {} - response = get(self.base_url + url, headers=headers, params=params) + # Retry logic for 403 errors + for attempt in range(self.max_retries): + response = get(self.base_url + url, headers=headers, params=params) + + # Handle 403 Forbidden errors with retry logic + if response.status_code == 403: + if attempt < self.max_retries - 1: + # Exponential backoff: 1s, 2s, 4s, 8s, 16s + wait_time = 2 ** attempt + self.logger.warning(f"403 Forbidden for GET {url} (attempt {attempt + 1}/{self.max_retries}), retrying in {wait_time}s...") + time.sleep(wait_time) + continue + else: + # Final attempt failed, log error and return + self.__handle_403_error(url, "GET", response.text) + return {} + + # Success - break out of retry loop + break + # Rate limiting: sleep after each API request if self.rate_limit_delay > 0: time.sleep(self.rate_limit_delay) - # Handle 403 Forbidden errors with helpful message - if response.status_code == 403: - self.__handle_403_error(url, "GET") - return {} - # Check if response is valid and contains JSON if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): result = response.json() @@ -191,17 +208,31 @@ def __post(self, url, params=None, use_cache=True): return cache[hash_key] headers = { "Authorization": "token " + self.token } if self.token else {} - response = post(self.base_url + url, headers=headers, json=params) + + # Retry logic for 403 errors + for attempt in range(self.max_retries): + response = post(self.base_url + url, headers=headers, json=params) + + # Handle 403 Forbidden errors with retry logic + 
if response.status_code == 403: + if attempt < self.max_retries - 1: + # Exponential backoff: 1s, 2s, 4s, 8s, 16s + wait_time = 2 ** attempt + self.logger.warning(f"403 Forbidden for POST {url} (attempt {attempt + 1}/{self.max_retries}), retrying in {wait_time}s...") + time.sleep(wait_time) + continue + else: + # Final attempt failed, log error and return + self.__handle_403_error(url, "POST", response.text) + return [] + + # Success - break out of retry loop + break # Rate limiting: sleep after each API request if self.rate_limit_delay > 0: time.sleep(self.rate_limit_delay) - # Handle 403 Forbidden errors with helpful message - if response.status_code == 403: - self.__handle_403_error(url, "POST") - return [] - # Check if response is valid and contains JSON if response.ok and response.headers.get('Content-Type', '').lower().startswith('application/json') and response.text.strip(): result = response.json() @@ -234,12 +265,12 @@ def get_article(self, article_id, use_cache=True): return self.__get(f"/articles/{article_id}", use_cache=use_cache) class Author: - def __init__(self, name, debug=False, rate_limit_delay=1.0): + def __init__(self, name, debug=False, rate_limit_delay=1.0, max_retries=5): self.logger = getLogger("Author") if debug: self.logger.setLevel(DEBUG) self.name = name - self.fs = FigShare(rate_limit_delay=rate_limit_delay) + self.fs = FigShare(rate_limit_delay=rate_limit_delay, max_retries=max_retries) self.articles = {} self.public_html_prefix = "https://repository.lincoln.ac.uk" self.df = None @@ -484,6 +515,8 @@ def parse_args(): help='Use cached author data instead of refreshing from API') parser.add_argument('--rate-limit-delay', type=float, default=1.0, help='Delay in seconds between Figshare API requests (default: 1.0)') + parser.add_argument('--max-retries', type=int, default=5, + help='Maximum number of retry attempts for 403 errors (default: 5)') parser.add_argument('--debug', action='store_true', help='Enable debug logging') @@ -544,7 
+577,7 @@ def figshare_processing(): for author_name in authors_list: logger.info(f"*** Processing {author_name}...") - authors[author_name] = Author(author_name, debug=args.debug, rate_limit_delay=args.rate_limit_delay) + authors[author_name] = Author(author_name, debug=args.debug, rate_limit_delay=args.rate_limit_delay, max_retries=args.max_retries) cache_exists = os.path.exists(f"{author_name}.db") if cache_exists and args.use_author_cache: From 87fe3ef1229b9b2e9a263368fd53bf319b13ee82 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:11:35 +0000 Subject: [PATCH 12/19] Update retrieve method to use cache in Author class and change default max_retries to 1 --- figshare.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/figshare.py b/figshare.py index 59b4e38..919d31a 100644 --- a/figshare.py +++ b/figshare.py @@ -454,7 +454,7 @@ def _flatten(self): def retrieve(self, use_cache=True): self._retrieve_figshare(use_cache=use_cache) self._remove_non_repository() - self._retrieve_details() + self._retrieve_details(use_cache=True) self._custom_fields_to_dicts() self._flatten() self._create_dataframe() @@ -515,8 +515,8 @@ def parse_args(): help='Use cached author data instead of refreshing from API') parser.add_argument('--rate-limit-delay', type=float, default=1.0, help='Delay in seconds between Figshare API requests (default: 1.0)') - parser.add_argument('--max-retries', type=int, default=5, - help='Maximum number of retry attempts for 403 errors (default: 5)') + parser.add_argument('--max-retries', type=int, default=1, + help='Maximum number of retry attempts for 403 errors (default: 1)') parser.add_argument('--debug', action='store_true', help='Enable debug logging') From afa758350660872c929e01f4c9e9f55f20b68287 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:16:41 +0000 Subject: [PATCH 13/19] Update figshare processing workflow to correctly restore and save cache for output directory --- 
.github/workflows/figshare-processing.yaml | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index c4d2884..3732eed 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -29,7 +29,8 @@ jobs: fetch-depth: 1 - name: Use Cache in folder ./output - uses: actions/cache@v5 + id: cache-restore-output + uses: actions/cache/restore@v5 with: path: ./output key: cache-files @@ -68,6 +69,12 @@ jobs: python ../figshare.py fi + - name: Save Cache from folder ./output + uses: actions/cache/save@v5 + with: + path: ./output + key: ${{ steps.cache-restore-output.outputs.cache-primary-key || 'cache-files' }} + - name: Generate publication statistics run: | cd ./output From 2dc753eaf9dd5f7e69d1dc72f3e313597f19b165 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:18:30 +0000 Subject: [PATCH 14/19] Ensure cache is always saved from the output folder in figshare processing workflow --- .github/workflows/figshare-processing.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index 3732eed..35bc29b 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -71,6 +71,7 @@ jobs: - name: Save Cache from folder ./output uses: actions/cache/save@v5 + if: always() with: path: ./output key: ${{ steps.cache-restore-output.outputs.cache-primary-key || 'cache-files' }} From 5bdcba17bf9450a615dc0945cd9181f5f646185e Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:23:21 +0000 Subject: [PATCH 15/19] Create output directory if it doesn't exist and list contents --- .github/workflows/figshare-processing.yaml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/figshare-processing.yaml 
b/.github/workflows/figshare-processing.yaml index 35bc29b..cf94ca8 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -36,7 +36,9 @@ jobs: key: cache-files - name: Create output directory if it doesn't exist - run: mkdir -p output + run: | + mkdir -p output + find ./output - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event." From d7f11679f132c1b4737043eb460edbba84be117b Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:34:25 +0000 Subject: [PATCH 16/19] Enhance caching logging in FigShare class and improve hash key generation for GET/POST requests --- figshare.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/figshare.py b/figshare.py index 919d31a..e4013c0 100644 --- a/figshare.py +++ b/figshare.py @@ -136,6 +136,11 @@ def __init__(self, page_size=100, rate_limit_delay=1.0, max_retries=5): # Use shelve for persistent caching self.cache_file = "figshare_cache.db" + with shelve.open(self.cache_file) as cache: + self.logger.info(f"Figshare API: Using cache file {self.cache_file} with {len(cache.keys())} entries") + for key in list(cache.keys()): + self.logger.info(f" existing cache key: {key}") + def __init_params(self): return { @@ -156,7 +161,7 @@ def __handle_403_error(self, url, method="GET", response_text=""): self.logger.error(f"Response text: {response_text}") def __get(self, url, params=None, use_cache=True): - hash_key = f"GET{url}?{params}" + hash_key = f"GET{url}{'?' + str(params) if params else ''}" with shelve.open(self.cache_file) as cache: if hash_key in cache and use_cache: @@ -200,7 +205,7 @@ def __get(self, url, params=None, use_cache=True): return {} def __post(self, url, params=None, use_cache=True): - hash_key = f"POST{url}?{params}" + hash_key = f"POST{url}{'?' 
+ str(params) if params else ''}" with shelve.open(self.cache_file) as cache: if hash_key in cache and use_cache: From 26aeac28f0e6ef21cc080448957eb299f9adae13 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:39:52 +0000 Subject: [PATCH 17/19] Update cache key generation to include run ID and add restore keys for improved cache retrieval --- .github/workflows/figshare-processing.yaml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index cf94ca8..5ef39cd 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -33,7 +33,9 @@ jobs: uses: actions/cache/restore@v5 with: path: ./output - key: cache-files + key: cache-files-${{ github.run_id }} + restore-keys: | + cache-files- - name: Create output directory if it doesn't exist run: | From 73d134c6d081cff8b58429558f04fd5512bac52a Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:45:36 +0000 Subject: [PATCH 18/19] Change cache key logging from info to debug level in FigShare class --- figshare.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/figshare.py b/figshare.py index e4013c0..8ea2c14 100644 --- a/figshare.py +++ b/figshare.py @@ -139,7 +139,7 @@ def __init__(self, page_size=100, rate_limit_delay=1.0, max_retries=5): with shelve.open(self.cache_file) as cache: self.logger.info(f"Figshare API: Using cache file {self.cache_file} with {len(cache.keys())} entries") for key in list(cache.keys()): - self.logger.info(f" existing cache key: {key}") + self.logger.debug(f" existing cache key: {key}") def __init_params(self): From 2d9e4b394ff73516b4edd0aa2a8759aac71ed785 Mon Sep 17 00:00:00 2001 From: Marc Hanheide Date: Tue, 23 Dec 2025 16:49:01 +0000 Subject: [PATCH 19/19] Update cron schedule for figshare processing workflow to run every 4 hours --- .github/workflows/figshare-processing.yaml | 2 +- 1 file 
changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/figshare-processing.yaml b/.github/workflows/figshare-processing.yaml index 5ef39cd..67dd566 100644 --- a/.github/workflows/figshare-processing.yaml +++ b/.github/workflows/figshare-processing.yaml @@ -12,7 +12,7 @@ on: - 'true' - 'false' schedule: - - cron: "30 2 * * 2" + - cron: "30 */4 * * *" push: branches: - main
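Note on the retry loop added to `__get` and `__post` in this series: the sketch below isolates the 403 backoff behaviour so it can be checked without network access. It is a minimal illustration, not the code in `figshare.py`; the names `backoff_delays` and `call_with_retry` and the injectable `sleep` parameter are assumptions introduced here. It suggests that with `max_retries=5` the loop waits 1s, 2s, 4s and 8s (the final attempt returns without sleeping, so a 16s wait is never reached), and that with the later default of `max_retries=1` a 403 is not retried at all.

```python
import time
from typing import Callable, List


def backoff_delays(max_retries: int) -> List[int]:
    """Waits in seconds between attempts; there is one fewer wait than attempts."""
    return [2 ** attempt for attempt in range(max(max_retries - 1, 0))]


def call_with_retry(request: Callable[[], object],
                    max_retries: int = 5,
                    sleep: Callable[[float], None] = time.sleep):
    """Call request() until it returns a non-403 response or attempts run out.

    `request` is any zero-argument callable returning an object with a
    `status_code` attribute (hypothetical stand-in for the HTTP call).
    Returns the last response, or None when max_retries < 1, which avoids
    reading an unassigned `response` after an empty loop.
    """
    response = None
    for attempt in range(max_retries):
        response = request()
        if response.status_code != 403:
            # Anything other than 403 ends the retry loop.
            break
        if attempt < max_retries - 1:
            # Exponential backoff: 2**attempt seconds before the next try.
            sleep(2 ** attempt)
    return response
```

The caller still has to inspect the returned response: a 403 on the last attempt is returned as-is, mirroring the point where the patched methods log the error and return an empty result.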