A centralised Python implementation of InterPro production procedures.
- Python 3.11+, with packages `oracledb`, `mysqlclient`, `psycopg3`, and `mundone`
- GCC with the `sqlite3.h` header

$ pip install .

The pyinterprod package relies on three configuration files:
- `main.conf`: contains database connection strings, paths to files provided by/to UniProtKB, and various workflow parameters.
- `members.conf`: contains paths to files used to update InterPro's member databases (e.g. files containing signatures, HMM files, etc.).
- `analyses.conf`: contains settings for the InterProScan match calculation (`ipr-calc`).
All files can be renamed. main.conf is passed as a command line argument, and the paths to members.conf and analyses.conf are defined in main.conf.
The expected format for database connection strings is `user/password@host:port/service`. For Oracle databases, `user/password@service` may work as well, depending on `tnsnames.ora`.
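As an illustration of the two connection string formats, here is a minimal sketch of a parser. The `parse_connection_string` helper and the `ConnectionInfo` type are hypothetical names introduced for this example (they are not part of pyinterprod), and the sketch assumes that the password contains no `@` character.

```python
import re
from typing import NamedTuple, Optional


class ConnectionInfo(NamedTuple):
    user: str
    password: str
    host: Optional[str]
    port: Optional[int]
    service: str


# Accepts user/password@host:port/service, or the shorter
# user/password@service form (Oracle, resolved through tnsnames.ora).
_PATTERN = re.compile(
    r"^(?P<user>[^/]+)/(?P<password>[^@]+)@"
    r"(?:(?P<host>[^:/]+):(?P<port>\d+)/)?"
    r"(?P<service>[^/:]+)$"
)


def parse_connection_string(value: str) -> ConnectionInfo:
    """Split a connection string into its components (hypothetical helper)."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unexpected connection string format: {value!r}")
    port = match["port"]
    return ConnectionInfo(
        user=match["user"],
        password=match["password"],
        host=match["host"],
        port=int(port) if port is not None else None,
        service=match["service"],
    )


# Placeholder credentials and hosts, for demonstration only
print(parse_connection_string("scott/tiger@db.example.org:1521/ORCL"))
print(parse_connection_string("scott/tiger@ORCL"))
```

With the short form, `host` and `port` are `None`, which mirrors the fact that the host is resolved elsewhere (through `tnsnames.ora`).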
`main.conf` contains the following sections and properties:

- oracle:
  - ipro-interpro: connection string for the `interpro` user in the InterPro database
  - ipro-iprscan: connection string for the `iprscan` user in the InterPro database
  - ipro-uniparc: connection string for the `uniparc` user in the InterPro database
  - iscn-iprscan: connection string for the `iprscan` user in the InterProScan database
  - iscn-uniparc: connection string for the `uniparc` user in the InterProScan database
  - unpr-goapro: connection string for the GOA database
  - unpr-swpread: connection string for the Swiss-Prot database
  - unpr-uapro: connection string for the UniParc production database
  - unpr-uaread: connection string for the UniParc database
- postgresql:
  - pronto: connection string
- uniprot:
  - version: release number (e.g. `2019_08`)
  - date: date for the public release (e.g. `18-Sep-2019`)
  - swiss-prot: path to Swiss-Prot flat file
  - trembl: path to TrEMBL flat file
  - unirule: path to file listing InterPro entries and member database signatures used in UniRule
  - xrefs: path to directory where to export InterPro cross-references (generated for UniProt)
- emails:
  - server: outgoing server (format: `host:port`)
  - sender: sender's email address (e.g. user running the workflow)
  - aa: email address of the Automatic Annotation team
  - aa_dev: email address of the Automatic Annotation development team
  - interpro: email address of the InterPro team
  - uniprot_db: email address of the UniProt database team
  - uniprot_db: email address of the UniProt production team
  - unirule: email address of the UniRule team (curators from EMBL-EBI, SIB, and PIR)
  - sib: email address of the Swiss-Prot team
- misc:
  - analyses: path to the `analyses.conf` config file
  - members: path to the `members.conf` config file
  - scheduler: scheduler and queue (format: `scheduler:queue`, e.g. `lsf:production`)
  - pronto_url: URL of the Pronto curation application
  - data_dir: directory where to store staging files
  - match_calc_dir: directory where to run InterProScan match calculation
  - temporary_dir: directory for temporary files
  - workflows_dir: directory for workflows SQLite files, and jobs' input/output files
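
To make the layout above more concrete, the sketch below parses a small, made-up `main.conf` with Python's standard `configparser` module. It assumes that the file uses INI syntax; every value (hosts, credentials, paths) is a placeholder, and only a few of the properties are shown.

```python
from configparser import ConfigParser

# Hypothetical main.conf content: all values are placeholders.
SAMPLE = """
[oracle]
ipro-interpro = interpro/secret@ora.example.org:1521/IPRO

[postgresql]
pronto = pronto/secret@pg.example.org:5432/pronto

[uniprot]
version = 2019_08
date = 18-Sep-2019

[misc]
members = /path/to/members.conf
analyses = /path/to/analyses.conf
scheduler = lsf:production
"""

config = ConfigParser()
config.read_string(SAMPLE)

# The paths to the two other configuration files come from the misc section
members_path = config["misc"]["members"]
analyses_path = config["misc"]["analyses"]

# Split the "scheduler:queue" value into its two parts
scheduler, queue = config["misc"]["scheduler"].split(":", 1)

print(members_path, analyses_path, scheduler, queue)
```

In practice the file would be loaded with `config.read(path)` rather than from a string; the string is used here only to keep the example self-contained.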
In `members.conf`, each section corresponds to a member database (or a sequence feature database), e.g.
[profile]
signatures =
Supported properties are:
| Name | Description |
|---|---|
| `signatures` | Path to the source of database signatures. |
| `hmm` | Path to an HMM file, used for databases that employ HMMER3-based models. Required when running `ipr-hmms`. |
| `fasta` | Path to sequences used by models, in the FASTA format. |
| `members` | Path to file containing the clan-signature mapping. |
| `go-terms` | Path to file or directory of GO annotations. PANTHER and NCBIFAM only. |
| `summary` | Path to file of summary information. CDD only. |
| `seed` | Path to file of SEED alignments. Pfam only. |
| `full` | Path to file of full alignments. Pfam only. |
| `clans` | Path to file of clan information. Pfam only. |
| `mapping` | Path to file of model-signature mapping. CATH-Gene3D only. |
| `classes` | Path to file of information about classes. ELM only. |
| `instances` | Path to file of information about instances. ELM only. |
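
The properties above can be checked before launching an update. The following sketch assumes that `members.conf` uses INI syntax readable by `configparser`; the `missing_properties` helper is a hypothetical name introduced for this example, and all paths are placeholders.

```python
from configparser import ConfigParser

# Hypothetical members.conf content: paths are placeholders.
SAMPLE = """
[pfam]
signatures = /path/to/pfam/signatures
hmm = /path/to/pfam/models.hmm
clans = /path/to/pfam/clans

[cdd]
signatures = /path/to/cdd/signatures
summary = /path/to/cdd/summary
"""

config = ConfigParser()
config.read_string(SAMPLE)


def missing_properties(config, section, required):
    """Return the required properties that are absent or empty in a section."""
    return [
        name
        for name in required
        if not config.get(section, name, fallback="").strip()
    ]


# For instance, a database needs `hmm` before its HMMs can be loaded
for database in ("pfam", "cdd"):
    print(database, missing_properties(config, database, ["signatures", "hmm"]))
```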
In `analyses.conf`, the `DEFAULT` section defines the default values for the following properties:

- `job_cpu`: number of processes to request when submitting a job.
- `job_mem`: the maximum amount of memory a job should be allowed to use (in MB).
- `job_size`: the number of sequences to process in each job.
- `job_timeout`: the number of hours a job is allowed to run for before being killed. Any value lower than 1 disables the timeout.
The default values can be overridden. For instance, adding the following block under the `DEFAULT` section ensures that MobiDB-lite jobs time out after 48 hours and that PRINTS jobs are allocated 16 GB of memory:
[mobidb-lite]
job_timeout = 48
[prints]
job_mem = 16384
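
The override mechanism can be reproduced with Python's `configparser`, in which every section inherits the values of `DEFAULT`. This is a sketch that assumes `analyses.conf` is parsed this way; the values in the `DEFAULT` section and the `job_settings` helper are made up for the example.

```python
from configparser import ConfigParser

# Hypothetical analyses.conf: the DEFAULT values below are placeholders.
SAMPLE = """
[DEFAULT]
job_cpu = 8
job_mem = 8192
job_size = 10000
job_timeout = 24

[mobidb-lite]
job_timeout = 48

[prints]
job_mem = 16384
"""

config = ConfigParser()
config.read_string(SAMPLE)


def job_settings(name: str) -> dict:
    """Return the effective settings for an analysis, falling back to DEFAULT."""
    section = config[name] if config.has_section(name) else config["DEFAULT"]
    timeout = section.getint("job_timeout")
    return {
        "cpu": section.getint("job_cpu"),
        "mem_mb": section.getint("job_mem"),
        "size": section.getint("job_size"),
        # Any value lower than 1 disables the timeout
        "timeout_hours": timeout if timeout >= 1 else None,
    }


for analysis in ("mobidb-lite", "prints", "pfam"):
    print(analysis, job_settings(analysis))
```

Only the overridden property changes in each section: `prints` keeps the default timeout, and `mobidb-lite` keeps the default memory.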
Update proteins and matches to the latest private UniProt release.
$ ipr-uniprot [OPTIONS] main.conf

The optional arguments are:

- `-t, --tasks`: list of tasks to run, by default all tasks are run (see the table below for a description of available tasks)
- `--dry-run`: do not run tasks, only list those about to be run
| Name | Description | Dependencies |
|---|---|---|
| update-uniparc | Import UniParc cross-references | |
| taxonomy | Import the latest taxonomy data from UniProt | |
| update-ipm-matches | Update protein matches from ISPRO | |
| update-ipm-sites | Update protein site matches from ISPRO | |
| update-proteins | Import the new Swiss-Prot and TrEMBL proteins, and compare with the current ones | |
| delete-proteins | Delete obsolete proteins in all production tables | update-proteins |
| check-proteins | Track UniParc sequences (UPI) associated to UniProt entries that need to be imported (e.g. new or updated sequence) | delete-proteins, update-uniparc |
| update-matches | Update protein matches for new or updated sequences, run various checks, and track changes in protein counts for InterPro entries | update-ipm-matches, check-proteins |
| update-fmatches | Update protein matches for sequence features (e.g. MobiDB-lite, Coils, etc.) | update-matches |
| update-tmatches | Remove TOAD matches for recently changed sequences | update-matches |
| export-sib | Export Oracle tables required by the Swiss-Prot team | update-matches |
| report-changes | Report recent integration changes to the UniRule team | update-matches |
| aa-iprscan | Build the AA_IPRSCAN table, required by the Automatic Annotation team | update-matches |
| xref-condensed | Build the XREF_CONDENSED table for the Automatic Annotation team (contains representations of protein matches for InterPro entries) | update-matches |
| xref-summary | Build the XREF_SUMMARY table for the Automatic Annotation team (contains protein matches for integrated member database signatures) | report-changes |
| export-xrefs | Export text files containing protein matches for the UniProt database team | xref-summary |
| notify-interpro | Notify the InterPro team that all tables required by the Automatic Annotation team are ready, so we can take a snapshot of our database | update-fmatches, aa-iprscan, xref-condensed, xref-summary |
| swissprot-de | Export Swiss-Prot descriptions associated to member database signatures in the public release of UniProt (i.e. the release we are updating *from*) | |
| unirule | Update the list of signatures used by UniRule, so InterPro curators are warned if they attempt to unintegrate one of these signatures. | |
| update-varsplic | Update splice variant matches | update-ipm-matches |
| update-sites | Update residue annotations | update-ipm-sites, update-matches |
| Pronto | Update the Pronto PostgreSQL table | taxonomy, update-fmatches, swissprot-de, unirule |
| send-report | Send reports to curators, and inform them that Pronto is ready | Pronto tasks |
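
The Dependencies column defines a directed acyclic graph: a task can only start once all the tasks it depends on have completed. The sketch below derives one valid execution order for a subset of the tasks listed above, using `graphlib` from the standard library. It only illustrates the ordering constraint and is not a description of how the workflow is actually scheduled; `run_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Subset of the ipr-uniprot tasks, mapped to their dependencies
DEPENDENCIES = {
    "update-uniparc": set(),
    "update-ipm-matches": set(),
    "update-proteins": set(),
    "delete-proteins": {"update-proteins"},
    "check-proteins": {"delete-proteins", "update-uniparc"},
    "update-matches": {"update-ipm-matches", "check-proteins"},
    "update-fmatches": {"update-matches"},
}


def run_order(dependencies):
    """Return the tasks in an order where dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
print(order)
```

Several orders can satisfy the same constraints, so the exact output may differ between runs; tasks without a dependency between them could also run in parallel.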
Update models and protein matches for one or more member databases.
Before running the update, the following command must be run for each member database, where `-n` is the name of the database (case-insensitive), `-d` is the release date (of the member database), and `-v` is the release version:

$ ipr-pre-memdb main.conf -n DATABASE -d YYYY-MM-DD -v VERSION

Then, the actual update can be run:

$ ipr-memdb [OPTIONS] main.conf database [database ...]

The optional arguments are:

- `-t, --tasks`: list of tasks to run, by default all tasks are run (see the table below for a description of available tasks)
- `--dry-run`: do not run tasks, only list those about to be run
| Name | Description | Dependencies |
|---|---|---|
| update-ipm-matches | Update protein matches from ISPRO | |
| load-signatures | Import member database signatures for the version to update to | |
| track-changes | Compare signatures between versions (e.g. name, description, matched proteins) | load-signatures |
| delete-obsoletes | Remove signatures that are not in the latest version of the member database(s) | track-changes |
| update-signatures | Update metadata for existing signatures, and add new signatures | delete-obsoletes |
| update-matches | Update and check matches in production tables | update-ipm-matches, update-signatures |
| update-tmatches | Update TOAD matches | update-signatures |
| update-varsplic | Update splice variant matches | update-ipm-matches, update-signatures |
| persist-pfam-a | Parse Pfam-A files and store relevant information (only when updating Pfam) | update-ipm-matches, update-signatures |
| persist-pfam-c | Parse Pfam-C to store clan information (only when updating Pfam) | update-ipm-matches, update-signatures |
| update-features | Update sequence features for non-member databases (e.g. MobiDB-lite, COILS, etc.) | update-ipm-matches |
| update-fmatches | Update matches for sequence features | update-features |
| update-ipm-sites | Update protein site matches from ISPRO | |
| update-sites | Update residue annotations (if updating a member database with residue annotations) | update-ipm-sites, update-matches |
| Pronto | Update the Pronto PostgreSQL tables | update-matches |
| send-report | Send reports to curators, and inform them that Pronto is ready | Pronto tasks |
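
Since `ipr-pre-memdb` has to be repeated for each member database before `ipr-memdb` is run, it can be convenient to generate the commands from a list of releases. The sketch below only builds and prints the command lines; the `pre_memdb_command` helper, the database names, the dates, and the versions are all made up for the example.

```python
import shlex
from datetime import datetime


def pre_memdb_command(config_path, name, release_date, version):
    """Build the ipr-pre-memdb argument list for one member database."""
    # Fail early if the release date cannot be parsed as year-month-day
    datetime.strptime(release_date, "%Y-%m-%d")
    return [
        "ipr-pre-memdb", config_path,
        "-n", name,
        "-d", release_date,
        "-v", version,
    ]


# Hypothetical databases, release dates and versions
releases = [("pfam", "2000-01-31", "1.0"), ("cdd", "2000-02-29", "2.0")]

for name, release_date, version in releases:
    print(shlex.join(pre_memdb_command("main.conf", name, release_date, version)))
```

Each argument list could be passed to `subprocess.run(..., check=True)` instead of being printed, so that the loop stops at the first failing database.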
$ ipr-pronto [OPTIONS] main.conf

The optional arguments are:

- `-t, --tasks`: list of tasks to run, by default all tasks are run (see the table below for a description of available tasks)
- `--dry-run`: do not run tasks, only list those about to be run
| Name | Description | Dependencies |
|---|---|---|
| go-terms | Import publications associated to protein annotations | |
| go-constraints | Import GO taxonomic constraints | |
| proteins-similarities | Import UniProt general annotations (comments) on sequence similarities | |
| proteins-names | Import UniProt sequence names | |
| databases | Import database information (e.g. version, release date) | |
| proteins | Import general information on proteins (e.g. accession, length, species) | |
| init-matches | Create the match table (empty) | |
| export-matches | Export protein matches for member database signatures | init-matches |
| insert-matches | Insert protein matches for member database signatures | export-matches |
| insert-fmatches | Insert protein matches for sequence features (AntiFam, etc.) | init-matches |
| index-matches | Index and cluster the match table | insert-matches, insert-fmatches |
| insert-signature2proteins | Associate member database signatures with UniProt proteins, UniProt descriptions, taxonomic origins, and GO terms | export-matches, proteins-names |
| index-signature2proteins | Index the signature2proteins table | insert-signature2proteins |
| signatures | Import and compare member database signatures | databases, export-matches |
| taxonomy | Import UniProt taxonomy | |
| structures | Import structural matches | |
$ ipr-calc main.conf [COMMAND] [OPTIONS]

The available commands (and their optional arguments) are:

- `import`: import sequences from the UniParc Oracle database
  - `--top-up`: import new sequences only
- `clean`: delete obsolete data
  - `-a, --analyses`: IDs of analyses to clean (default: all)
- `search`: scan sequences using InterProScan
  - `-l, --list`: list active analyses and exit
  - `-a, --analyses`: IDs of analyses to run (default: all)
  - `--concurrent-jobs`: maximum number of concurrently running InterProScan jobs (default: 1000)
  - `--max-jobs`: maximum number of jobs to run per analysis before exiting (default: disabled)
  - `--max-retries`: number of times a failed job is resubmitted (default: disabled)
  - `--keep none|all|failed`: keep input/output files (default: none)
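
Because `--max-retries` counts resubmissions, a job can run up to `1 + max-retries` times in total. The sketch below illustrates that counting with a plain Python callable standing in for a job; `run_with_retries` is a hypothetical function written for this example, not the logic used by `ipr-calc`.

```python
from typing import Callable, Tuple


def run_with_retries(job: Callable[[], bool], max_retries: int = 0) -> Tuple[bool, int]:
    """Run `job` until it succeeds or the retries are exhausted.

    Returns a (succeeded, attempts) tuple.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if job():
            return True, attempts
    return False, attempts


# A stand-in job that fails twice before succeeding
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), max_retries=2))
```

With `max_retries=2`, the stand-in job succeeds on its third and last allowed attempt; with the default of 0, it would have been run only once.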
Import new UniParc sequences:

$ ipr-calc main.conf import --top-up

Process jobs for analysis 42 only, allow each job to run three times (i.e. restart twice), and keep all temporary files, regardless of job success or failure:

$ ipr-calc main.conf search -a 42 --max-retries 2 --keep all

Run 10 jobs per analysis, and keep failed jobs to investigate:

$ ipr-calc main.conf search --max-jobs 10 --keep failed

Update clans and run profile-profile alignments.
$ ipr-clans [OPTIONS] main.conf database [database ...]

The optional arguments are:

- `-t, --threads`: number of alignment workers
- `-T, --tempdir`: directory to use for temporary files
Load HMMs in the database.
$ ipr-hmms main.conf database [database ...]