
Guidance on V3 branch #212

@jjia1

Hello, thanks so much for making pyCistopic and putting all the work into it!

I've switched to the v3 branch after running into memory issues on the main branch (see issue #179). A lot of the functionality has changed, and while reading through the code I think I have most of it down, but I wanted some clarification on how certain steps should be done with the new code.

Can you verify, briefly, if my understanding is correct?
In v2 (following the tutorial), we impute accessibility, normalize the scores, find highly variable features, and then run find_diff_features. find_diff_features identifies DARs for our data, either cell-type-specific DARs or DARs for contrasts of our own interest. The DARs can then be converted to DAGs by computing gene activity with get_gene_activity and running find_diff_features on the gene activity results.
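
For reference, this is roughly how the v2 / main-branch DAR workflow looked for me, paraphrased from the tutorial (parameter values and the "celltype" column name are just my examples):

import numpy as np
from pycisTopic.diff_features import (
    impute_accessibility,
    normalize_scores,
    find_highly_variable_features,
    find_diff_features,
)

# Impute region accessibility from the topic model, then normalize.
imputed_acc = impute_accessibility(cistopic_obj, scale_factor=10**6)
normalized_acc = normalize_scores(imputed_acc, scale_factor=10**4)

# Select highly variable regions on the normalized imputed accessibility.
variable_regions = find_highly_variable_features(normalized_acc, plot=False)

# Cell-type-specific DARs (contrasts=None means one group vs the rest).
dars = find_diff_features(
    cistopic_obj,
    imputed_acc,
    variable="celltype",           # column in cistopic_obj.cell_data
    var_features=variable_regions,
    contrasts=None,
    adjpval_thr=0.05,
    log2fc_thr=np.log2(1.5),
)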

For v2 find_diff_features, we provide both the pycisTopic object and the imputed accessibility object for DARs, and the gene activity object for DAGs. It seems this has changed in v3: find_diff_features has become find_diff_accessible_regions, and the imputed accessibility apparently is no longer an input to the function?

My code is as follows (using v3 branch) where data is my QCed pyCistopic object:

import numpy as np

topic_region_ = data.selected_model.topic_region.to_numpy(dtype=np.float32)
cell_topic_ = data.selected_model.cell_topic.to_numpy(dtype=np.float32)
regions = data.selected_model.topic_region.index.tolist()
cell_names = data.cell_names


import math

import pandas as pd

from pycisTopic.cistopic_class import *
from pycisTopic.diff_features import *
## Get mean and dispersion of normalized imputed accessibility per region.
(
    region_names_to_keep,
    per_region_means_on_normalized_imputed_acc,
    per_region_dispersions_on_normalized_imputed_acc,
) = calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc(
    region_topic=topic_region_,
    cell_topic=cell_topic_,
    region_names=regions,
    scale_factor1=10**6,
    scale_factor2=10**4,
    regions_chunk_size=20000,
)

## Find highly variable features.
var_regions = find_highly_variable_regions(
    regions=region_names_to_keep,
    per_region_means_on_normalized_imputed_acc=per_region_means_on_normalized_imputed_acc,
    per_region_dispersions_on_normalized_imputed_acc=per_region_dispersions_on_normalized_imputed_acc,
    min_disp=0.05,
    min_mean=0.0125,
    max_disp=float("inf"),
    max_mean=3,
    n_bins=20,
    n_top_features=None,
    plot=True,
)

## Optional: save regions
pd.DataFrame(var_regions, columns=["regions"]).to_csv(f"{out_dir}/variable_regions.csv", index=False)

markers_dict = find_diff_accessible_regions(
    region_topic                   = topic_region_.astype("float32"),
    cell_topic                     = cell_topic_.astype("float32"),
    region_names                   = regions,     # calculated from binarize_topics()
    cell_names                     = cell_names,  # from data.cell_names
    highly_variable_regions        = var_regions,
    contrasts                      = contrasts,   # some defined contrasts; I don't think it can be left empty
    scale_factor1                  = 10**6,
    regions_chunk_size             = 20000,
    adjusted_pvalue_threshold      = 0.05,
    log2_fold_change_threshold     = math.log2(1.5),
)
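
For completeness, this is how I'm constructing contrasts. I'm assuming the v2-style nested-list format ([[foreground groups], [background groups]] per contrast, with labels taken from a column of data.cell_data) still applies in v3:

# My contrast construction (assumption: v2-style nested lists still apply,
# and "celltype" is my annotation column in data.cell_data).
celltypes = sorted(set(data.cell_data["celltype"]))

# One-vs-rest contrast per cell type.
contrasts = [
    [[ct], [other for other in celltypes if other != ct]]
    for ct in celltypes
]

# Plus a custom condition/timepoint contrast, e.g. treated vs control.
contrasts.append([["treated"], ["control"]])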

Additionally, I wanted to transform the DARs into DAGs, but I'm not sure I'm utilizing v3 correctly.

imputed = impute_accessibility(
    cistopic_obj     = data,
    selected_cells   = data.cell_names,
    # selected_regions = region_names_to_keep,
    scale_factor     = 10**6,
    chunk_size       = 20000,
)

# copied directly from the tutorial
gene_act, weights = get_gene_activity(
    imputed,
    pr_annotation,  # followed steps from the tutorial
    chromsizes,
    use_gene_boundaries=True,   # Whether to use the whole search space or stop when encountering another gene
    upstream=[1000, 100000],    # Search space upstream; the minimum is taken even if another gene is right next to it (1 kb here)
    downstream=[1000, 100000],  # Search space downstream
    distance_weight=True,       # Whether to add a distance weight (an exponential function; the weight decreases with distance)
    decay_rate=1,               # Exponent of the distance exponential function (the higher, the faster the decrease)
    extend_gene_body_upstream=10000,   # Number of bp upstream immune to the distance weight (maximum weight in this window)
    extend_gene_body_downstream=500,   # Number of bp downstream immune to the distance weight
    gene_size_weight=False,     # Whether to add a weight based on the length of the gene
    gene_size_scale_factor='median',   # Dividend to calculate the gene size weight; default is the median of all genes in the genome
    remove_promoters=False,     # Whether to remove promoters when computing gene activity scores
    average_scores=True,        # Whether to divide by the number of regions assigned to a gene when calculating the gene activity score
    scale_factor=1,             # Value to multiply the final gene activity matrix by
    extend_tss=[10, 10],        # Space to consider a promoter
    gini_weight=True,           # Whether to add a Gini index weight; the more unique the region, the higher this weight
    return_weights=True,        # Whether to return the final weights
    project='Gene_activity')    # Project name for the gene activity object

# these are my custom steps
# 1) pull the matrix and name lists
mat         = gene_act.mtx                   # already (n_genes × n_cells)
genes       = gene_act.feature_names         # length = 19 059
cells       = gene_act.cell_names            # length = 85 593

# 2) sanity-check
assert mat.shape      == (len(genes), len(cells))
assert len(genes)     == mat.shape[0]
assert len(cells)     == mat.shape[1]

# 3) build your inputs (no transpose!)
region_topic_gene = mat.astype(np.float32)                # (n_genes × n_cells)
cell_topic_gene   = np.eye(mat.shape[1], dtype=np.float32) # (n_cells × n_cells)

# 4) call the v3 function
dags = find_diff_accessible_regions(
    region_topic        = region_topic_gene,
    cell_topic          = cell_topic_gene,
    region_names        = genes,
    cell_names          = cells,
    highly_variable_regions = genes,  # or a subset if you pre-filter
    contrasts           = contrasts_celltype,
    scale_factor1       = 10**6,      # gene_act is already scaled
    adjusted_pvalue_threshold = 0.05,
    log2_fold_change_threshold = np.log2(1.5),
)
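
For comparison, on the main branch I would have obtained DAGs by running find_diff_features directly on the gene activity object, roughly like this (tutorial-style call from memory; "celltype" is my metadata column):

from pycisTopic.diff_features import find_diff_features

# v2 / main-branch style: differential gene activity (DAGs) per cell type.
dags_v2 = find_diff_features(
    cistopic_obj=data,
    imputed_acc_obj=gene_act,   # the gene activity object from get_gene_activity
    variable="celltype",        # column in data.cell_data
    var_features=None,          # or a pre-filtered gene list
    contrasts=None,             # None = one-vs-rest per cell type
    adjpval_thr=0.05,
    log2fc_thr=np.log2(1.5),
)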

Does this look correct? Would the logic need to change if, for example, I'm looking at DARs and DAGs for a specific set of contrasts (e.g. conditions, timepoints) rather than for all cell types alone? (I did manually feed those in as contrasts.)

Lastly, how do you use the topic_qcmetrics to choose the best method for binarizing a topic?
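
In case it helps clarify the question, this is what I've been trying so far, using what I believe are the main-branch names (compute_topic_metrics, plot_topic_qc, binarize_topics); I'm not sure how these map onto v3:

from pycisTopic.topic_binarization import binarize_topics
from pycisTopic.topic_qc import compute_topic_metrics, plot_topic_qc

# Compare binarization methods on the same model.
region_bin_otsu = binarize_topics(data, method="otsu", plot=True, num_columns=5)
region_bin_ntop = binarize_topics(data, method="ntop", ntop=3000, plot=True, num_columns=5)

# Topic QC metrics (coherence, assignments, regions per binarized topic, Gini index, ...).
topic_qc_metrics = compute_topic_metrics(data)
plot_topic_qc(topic_qc_metrics, var_x="Coherence", var_y="Log10_Assignments", var_color="Gini_index")
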
Thanks so much for taking the time to answer my questions.
