This is the public code and data repository for Steichen et al. 2024.
There are both local and remote datasets in this repository.
These are datasets in this repository. They are stored in the data/ directory.
data/all_processed_combined.csv.gz is the processed 10X dataset with paired heavy and light chains as well as the single-cell sort paired data from Sanger sequencing. In addition to all AIRR-compliant fields, this dataset contains the following:
| Field | Description |
|---|---|
| sequence_id | A unique ID describing the sequence, takes the form: <animal id>_<weeks post>_<tissue>_<sequencing method>_<index_number>_<probe> |
| barcode | The barcode from the 10X GEM |
| NHP | The NHP animal ID |
| weeks_post | Weeks post vaccination: 3, 4, 7, 10, 13, or 710 (week 7 or 10) |
| tissue | GC or MBC |
| seq_method | Sequencing Method, 10X or Sanger |
| prime_boost | Whether the sequence was isolated after the prime, after the boost, or from a control (MD39) |
| probe | Probe population of this sequence [Env+, Env-, Env++KO-] |
| KO. | Number of counts of the barcode-labeled KO probe |
| well_id | Well ID for the single cell sort |
| IgG | The ELISA signal for IgG |
| N332-GT5 WT | The ELISA signal for GT5 |
| N332-GT5 KO | The ELISA signal for the KO |
| B23 | The boosting reagent ELISA |
| IgG.1 | Boolean indicating whether the sequence is an IgG |
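As a minimal sketch of how the table above could be used, the snippet below reads a gzip-compressed CSV with the Python standard library and splits a `sequence_id` into the six components listed in its description. The helper names `read_rows` and `parse_sequence_id` and the dictionary keys are hypothetical and not part of this repository. The parser also assumes that no component of the ID contains an underscore, which has not been checked against the real data.

```python
import csv
import gzip

# Component order follows the sequence_id description in the field table.
# The key names themselves are illustrative, not taken from the repository.
SEQUENCE_ID_FIELDS = (
    "animal_id",
    "weeks_post",
    "tissue",
    "seq_method",
    "index_number",
    "probe",
)


def parse_sequence_id(sequence_id):
    """Split a sequence_id into its six named components.

    Assumption: no individual component contains an underscore.
    """
    parts = sequence_id.split("_")
    if len(parts) != len(SEQUENCE_ID_FIELDS):
        raise ValueError(
            f"expected {len(SEQUENCE_ID_FIELDS)} components, "
            f"got {len(parts)}: {sequence_id!r}"
        )
    return dict(zip(SEQUENCE_ID_FIELDS, parts))


def read_rows(path):
    """Yield each row of a gzip-compressed CSV as a dict keyed by column name."""
    with gzip.open(path, mode="rt", newline="") as handle:
        yield from csv.DictReader(handle)
```

With the repository checked out, `next(read_rows("data/all_processed_combined.csv.gz"))` would return the first row as a dictionary, and its `sequence_id` value could then be passed to `parse_sequence_id`.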
data/all_processed_combined_personalized.csv.gz is the same as data/all_processed_combined.csv.gz but with the sequences personalized to the IGHD3-43 allele haplotypes found in data/genotypes.xlsx. The haplotypes are:
a. IGHD3-43*01/IGHD3-43*01
b. IGHD3-43*01/IGHD3-43*01_S8240
c. IGHD3-43*01_S8240/IGHD3-43*01_S8240
These are datasets that are too large to store in this repository. They are stored in an AWS S3 bucket.
74 Macaque naive BCR sequences found in this study
a. Annotated in feather format and tar zipped here. This data does not contain the animal IDs but contains the SRA number.
b. Annotated in parquet format and tar zipped here. This data has the animal IDs and, after it is uncompressed, it can be used in AWS EMR.
We are also adding the local notebooks and EMR notebooks that were used in this study.
Personalize Seqs.ipynb reads in all 10X data from data/all_processed_combined.feather and reannotates the sequences based on the IGHD3-43 allele haplotypes found in data/genotypes.xlsx
Process Seqs.ipynb reads in all 10X and single-cell sorting paired and personalized sequences from data/all_processed_combined_personalized.feather and performs the following steps:
- Add closest human ortholog to the V and J genes
- Add the HCDR3 and LCDR3 length
- Annotate sequences with BG18 type I criteria.
- Annotate sequences with BG18 type I criteria with alternative D3-41 reading frame.
- Assign precursor definitions
- Run mutational analysis
- Run clustering on BG18 sequences using the criteria from the paper.
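One of the simpler steps above, adding the HCDR3 and LCDR3 lengths, can be sketched as follows. This is not the notebook's code. The column names `cdr3_aa_heavy` and `cdr3_aa_light` and the output names `hcdr3_len` and `lcdr3_len` are assumptions made for illustration. Note also that the reported length depends on the convention: an AIRR `junction_aa` field includes the conserved flanking residues, while `cdr3_aa` does not.

```python
def add_cdr3_lengths(row, heavy_col="cdr3_aa_heavy", light_col="cdr3_aa_light"):
    """Return a copy of `row` with hcdr3_len and lcdr3_len added.

    `row` is a dict for one paired sequence. The column names are
    hypothetical defaults; pass the real ones if they differ.
    A missing or empty CDR3 gives a length of None rather than 0.
    """
    out = dict(row)
    for source, target in ((heavy_col, "hcdr3_len"), (light_col, "lcdr3_len")):
        sequence = (row.get(source) or "").strip()
        out[target] = len(sequence) if sequence else None
    return out
```

Returning a copy keeps the input row unchanged, so the function can be mapped over rows without side effects.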
BG18 human precursors searches for human BG18-like precursors and calculates their frequencies in NGS datasets of 1.1 billion human BCR heavy chain sequences from 14 human donors that were previously described (Briney et al., 2019; Steichen et al., 2019; Willis et al., 2023).
BG18 macaque precursors searches for rhesus macaque BG18-like precursors and calculates their frequencies in 154 datasets of 95.4 million macaque BCR sequences from 60 macaques.
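The frequency calculation in both precursor notebooks reduces to counting matching sequences against the total searched. The sketch below shows only that arithmetic. The function name `precursor_frequency` is hypothetical, and the BG18-like criteria are deliberately left to a caller-supplied predicate because they are defined in the paper and notebooks rather than here.

```python
def precursor_frequency(sequences, is_precursor):
    """Count sequences that satisfy `is_precursor` and return the frequency.

    `sequences` can be any iterable, including a generator over a large
    dataset, so nothing is held in memory beyond two counters.
    `is_precursor` is a caller-supplied function returning True or False.
    Returns a tuple (hits, total, hits / total).
    """
    total = 0
    hits = 0
    for sequence in sequences:
        total += 1
        if is_precursor(sequence):
            hits += 1
    if total == 0:
        raise ValueError("cannot compute a frequency over zero sequences")
    return hits, total, hits / total
```

Returning the raw counts alongside the ratio makes it possible to sum hits and totals across several datasets before dividing, rather than averaging per-dataset frequencies.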