
Guidance on V3 branch #212

@jjia1

Hello, thanks so much for making pyCistopic and putting all the work into it!

I've switched to the v3 branch after running into memory issues on the main branch (see issue #179). A lot of the functionality has changed, and while reading through the code I think I have most of it down, but I wanted some clarification on how certain steps should be done with the new code.

Can you verify, briefly, if my understanding is correct?
In v2 (following the tutorial), we impute accessibility, normalize the scores, find highly variable features, and then run find_diff_features. find_diff_features identifies DARs for our data, either cell-type-specific DARs or DARs for contrasts of our own interest. The DARs can then be converted to DAGs by computing gene activity with get_gene_activity and running find_diff_features on the gene activity results.
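
For reference, this is roughly how the v2 / main-branch DAR workflow looked for me, paraphrased from the tutorial (parameter values and the "celltype" column name are just my examples):

import numpy as np
from pycisTopic.diff_features import (
    impute_accessibility,
    normalize_scores,
    find_highly_variable_features,
    find_diff_features,
)

# Impute region accessibility from the topic model, then normalize.
imputed_acc = impute_accessibility(cistopic_obj, scale_factor=10**6)
normalized_acc = normalize_scores(imputed_acc, scale_factor=10**4)

# Select highly variable regions on the normalized imputed accessibility.
variable_regions = find_highly_variable_features(normalized_acc, plot=False)

# Cell-type-specific DARs (contrasts=None means one group vs the rest).
dars = find_diff_features(
    cistopic_obj,
    imputed_acc,
    variable="celltype",           # column in cistopic_obj.cell_data
    var_features=variable_regions,
    contrasts=None,
    adjpval_thr=0.05,
    log2fc_thr=np.log2(1.5),
)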

For v2 find_diff_features, we provide both the pycisTopic object and the imputed accessibility object for DARs, and the gene activity object for DAGs. It seems this has changed in v3: find_diff_features has become find_diff_accessible_regions, and the imputed accessibility apparently is no longer an input to the function?

My code is as follows (using v3 branch) where data is my QCed pyCistopic object:

import numpy as np

topic_region_ = data.selected_model.topic_region.to_numpy(dtype=np.float32)
cell_topic_ = data.selected_model.cell_topic.to_numpy(dtype=np.float32)
regions = data.selected_model.topic_region.index.tolist()
cell_names = data.cell_names


import math

import pandas as pd

from pycisTopic.cistopic_class import *
from pycisTopic.diff_features import *
## Get mean and dispersion of normalized imputed accessibility per region.
(
    region_names_to_keep,
    per_region_means_on_normalized_imputed_acc,
    per_region_dispersions_on_normalized_imputed_acc,
) = calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc(
    region_topic=topic_region_,
    cell_topic=cell_topic_,
    region_names=regions,
    scale_factor1=10**6,
    scale_factor2=10**4,
    regions_chunk_size=20000,
)

## Find highly variable features.
var_regions = find_highly_variable_regions(
    regions=region_names_to_keep,
    per_region_means_on_normalized_imputed_acc=per_region_means_on_normalized_imputed_acc,
    per_region_dispersions_on_normalized_imputed_acc=per_region_dispersions_on_normalized_imputed_acc,
    min_disp=0.05,
    min_mean=0.0125,
    max_disp=float("inf"),
    max_mean=3,
    n_bins=20,
    n_top_features=None,
    plot=True,
)

## Optional: save regions
pd.DataFrame(var_regions, columns=["regions"]).to_csv(f"{out_dir}/variable_regions.csv", index=False)

markers_dict = find_diff_accessible_regions(
    region_topic                   = topic_region_.astype("float32"),
    cell_topic                     = cell_topic_.astype("float32"),
    region_names                   = regions,     # calculated from binarize_topics()
    cell_names                     = cell_names,  # from data.cell_names
    highly_variable_regions        = var_regions,
    contrasts                      = contrasts,   # some defined contrasts; I don't think it can be left empty
    scale_factor1                  = 10**6,
    regions_chunk_size             = 20000,
    adjusted_pvalue_threshold      = 0.05,
    log2_fold_change_threshold     = math.log2(1.5),
)
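
For completeness, this is how I'm constructing contrasts. I'm assuming the v2-style nested-list format ([[foreground groups], [background groups]] per contrast, with labels taken from a column of data.cell_data) still applies in v3:

# My contrast construction (assumption: v2-style nested lists still apply,
# and "celltype" is my annotation column in data.cell_data).
celltypes = sorted(set(data.cell_data["celltype"]))

# One-vs-rest contrast per cell type.
contrasts = [
    [[ct], [other for other in celltypes if other != ct]]
    for ct in celltypes
]

# Plus a custom condition/timepoint contrast, e.g. treated vs control.
contrasts.append([["treated"], ["control"]])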

Additionally, I wanted to transform the DARs into DAGs, but I'm not sure I'm utilizing v3 correctly.

imputed = impute_accessibility(
    cistopic_obj     = data,
    selected_cells   = data.cell_names,
    # selected_regions = region_names_to_keep,
    scale_factor     = 10**6,
    chunk_size       = 20000,
)

# copied directly from the tutorial
gene_act, weights = get_gene_activity(
    imputed,
    pr_annotation,  # followed steps from the tutorial
    chromsizes,
    use_gene_boundaries=True,   # Whether to use the whole search space or stop when encountering another gene
    upstream=[1000, 100000],    # Search space upstream; the minimum is taken even if another gene is right next to it (1 kb here)
    downstream=[1000, 100000],  # Search space downstream
    distance_weight=True,       # Whether to add a distance weight (an exponential function; the weight decreases with distance)
    decay_rate=1,               # Exponent of the distance exponential function (the higher, the faster the decrease)
    extend_gene_body_upstream=10000,   # Number of bp upstream immune to the distance weight (maximum weight in this window)
    extend_gene_body_downstream=500,   # Number of bp downstream immune to the distance weight
    gene_size_weight=False,     # Whether to add a weight based on the length of the gene
    gene_size_scale_factor='median',   # Dividend to calculate the gene size weight; default is the median of all genes in the genome
    remove_promoters=False,     # Whether to remove promoters when computing gene activity scores
    average_scores=True,        # Whether to divide by the number of regions assigned to a gene when calculating the gene activity score
    scale_factor=1,             # Value to multiply the final gene activity matrix by
    extend_tss=[10, 10],        # Space to consider a promoter
    gini_weight=True,           # Whether to add a Gini index weight; the more unique the region, the higher this weight
    return_weights=True,        # Whether to return the final weights
    project='Gene_activity')    # Project name for the gene activity object

# these are my custom steps
# 1) pull the matrix and name lists
mat         = gene_act.mtx                   # already (n_genes × n_cells)
genes       = gene_act.feature_names         # length = 19 059
cells       = gene_act.cell_names            # length = 85 593

# 2) sanity-check
assert mat.shape      == (len(genes), len(cells))
assert len(genes)     == mat.shape[0]
assert len(cells)     == mat.shape[1]

# 3) build your inputs (no transpose!)
region_topic_gene = mat.astype(np.float32)                # (n_genes × n_cells)
cell_topic_gene   = np.eye(mat.shape[1], dtype=np.float32) # (n_cells × n_cells)

# 4) call the v3 function
dags = find_diff_accessible_regions(
    region_topic        = region_topic_gene,
    cell_topic          = cell_topic_gene,
    region_names        = genes,
    cell_names          = cells,
    highly_variable_regions = genes,  # or a subset if you pre-filter
    contrasts           = contrasts_celltype,
    scale_factor1       = 10**6,      # gene_act is already scaled
    adjusted_pvalue_threshold = 0.05,
    log2_fold_change_threshold = np.log2(1.5),
)
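
For comparison, on the main branch I would have obtained DAGs by running find_diff_features directly on the gene activity object, roughly like this (tutorial-style call from memory; "celltype" is my metadata column):

from pycisTopic.diff_features import find_diff_features

# v2 / main-branch style: differential gene activity (DAGs) per cell type.
dags_v2 = find_diff_features(
    cistopic_obj=data,
    imputed_acc_obj=gene_act,   # the gene activity object from get_gene_activity
    variable="celltype",        # column in data.cell_data
    var_features=None,          # or a pre-filtered gene list
    contrasts=None,             # None = one-vs-rest per cell type
    adjpval_thr=0.05,
    log2fc_thr=np.log2(1.5),
)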

Does this look correct? Would the logic need to change if, for example, I'm looking at DARs and DAGs for a specific set of contrasts (e.g. conditions, timepoints) rather than for all cell types alone? (I did manually feed those in as contrasts.)

Lastly, how do you use the topic_qcmetrics to choose the best method for binarizing a topic?
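
In case it helps clarify the question, this is what I've been trying so far, using what I believe are the main-branch names (compute_topic_metrics, plot_topic_qc, binarize_topics); I'm not sure how these map onto v3:

from pycisTopic.topic_binarization import binarize_topics
from pycisTopic.topic_qc import compute_topic_metrics, plot_topic_qc

# Compare binarization methods on the same model.
region_bin_otsu = binarize_topics(data, method="otsu", plot=True, num_columns=5)
region_bin_ntop = binarize_topics(data, method="ntop", ntop=3000, plot=True, num_columns=5)

# Topic QC metrics (coherence, assignments, regions per binarized topic, Gini index, ...).
topic_qc_metrics = compute_topic_metrics(data)
plot_topic_qc(topic_qc_metrics, var_x="Coherence", var_y="Log10_Assignments", var_color="Gini_index")
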
Thanks so much for taking the time to answer my questions.
