Update training workflow to handle CRISPR data with multiple cell types by mayasheth · Pull Request #85 · EngreitzLab/ENCODE_rE2G

mayasheth · 2025-06-09T15:04:57Z

No description provided.

kaybrand

I like how you pulled the model directories out of the dataset directories. It is also nice that you supply a single merged CRISPRi training dataset to the train_model rule, using the CellType column to combine distinct datasets.

These are my suggestions from my first read-through:
In line 58 of the README, please correct 'saples' to 'samples'.
You refer several times to 'ct' and 'cd' in variable names. Renaming them to 'cell_type' and 'crispr_dataset' (or another name) would make your code more readable.
On line 19 of Snakefile_training, you set config["results_dir"] to be an absolute path. But if it were already an absolute path, the results dir might end up looking something like /oak/stanford/groups/engreitz/Users/kaybrand/ENCODE_rE2G/oak/stanford/groups/engreitz/Users/kaybrand/ENCODE_rE2G/results. I advise checking whether the path is already absolute (starts from the root), checking whether the directory exists and creating it if it does not, then saving the path to config["results_dir"].
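
The path handling suggested above could be sketched roughly as follows. This is only an illustration, not the code in Snakefile_training: the helper name `resolve_results_dir` and its `base_dir` parameter are hypothetical, and which base directory the workflow should use for relative paths is an assumption.

```python
import os


def resolve_results_dir(results_dir, base_dir):
    """Return an absolute results path, creating the directory if needed.

    Hypothetical helper: `base_dir` is whatever directory relative paths
    should be resolved against (an assumption, not taken from the PR).
    """
    # Only prepend base_dir when the configured path is relative, so an
    # already-absolute path is never prefixed a second time.
    if not os.path.isabs(results_dir):
        results_dir = os.path.join(base_dir, results_dir)
    results_dir = os.path.normpath(results_dir)
    # Create the directory if it does not exist; no error if it does.
    os.makedirs(results_dir, exist_ok=True)
    return results_dir
```

In the Snakefile this might be used as `config["results_dir"] = resolve_results_dir(config["results_dir"], os.getcwd())`, assuming the current working directory is the intended base.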

Maya Sheth and others added 6 commits June 6, 2025 10:07

first attempt at implementation, untested

393b751

fix some typos

a0a10c6

allow for multiple crispr datasets

fc84661

update snakefile

88396be

update to fully working

9fa5b11

update readme and example configs

e8bc14a

kaybrand reviewed Jun 9, 2025

Maya Sheth and others added 18 commits August 7, 2025 10:29

fix pandas indexing

110e62f

Merge pull request #87 from EngreitzLab/ms_indexing_fix

e39d27e

fix indexing error in utils

fix typo

a85df94

clarify variable names

77d1e05

actually clarify variable names

6b81c64

handle crispr_dataset as dict

8cd2472

set cpus-per-task = # threads to mitigate srun: fatal: cpus-per-task set by two different environment variables

661056f

improve time+memory efficiency by utilizing bedtools native overlap counting capability

8722750

black v26.1.0 reformatting

85e1ed8

Exclude ABC submodule from black checks

009cd97

remove extraneous comments from code counting num TSS btw enh and gene

8952800

Only apply black to workflow dir

6e7b2c8

unified crispr_dataset handling

c6e004f

pulled gen_num_candidate_enh_gene.py from branch train_multiple_celltypes

7dde7c6

vectorized approach to counting enh btw E-G runs in 4% as much time but requires 70-80% more RAM

c4cb574

allow user to specify cluster_max_cores for jobs

e00074b

black reformatting

4083e9a

Merge pull request #92 from kaybrand/pr/multicell-clean2

d17239e

improve efficiency of calculating num TSS btw E&G + black formatting adjustments

mayasheth wants to merge 24 commits into dev from train_multiple_celltypes