From d93381daf0b56ddc1462b1eab6a2c8d2bb5a039e Mon Sep 17 00:00:00 2001 From: German Beldorati Stark Date: Fri, 15 Jul 2022 12:16:14 -0300 Subject: [PATCH 1/3] add folder for vignettes/blogposts. and h5 vignette --- vignettes/h5_processing_blog.Rmd | 289 +++++++++++++++++++++++++++++++ 1 file changed, 289 insertions(+) create mode 100644 vignettes/h5_processing_blog.Rmd diff --git a/vignettes/h5_processing_blog.Rmd b/vignettes/h5_processing_blog.Rmd new file mode 100644 index 0000000..862f48d --- /dev/null +++ b/vignettes/h5_processing_blog.Rmd @@ -0,0 +1,289 @@ +--- +title: "Converting H5 files to upload to Cellenics" +output: + pdf_document: default + html_document: + df_print: paged + highlight: kate + theme: + version: 4 + code_font: + google: JetBrains Mono +editor_options: + chunk_output_type: console + markdown: + wrap: 72 +--- + +```{r, setup, include=FALSE} +knitr::opts_chunk$set(eval = F, message = F, warning = F) + +``` + +# Introduction + +The H5 format (short for HDF5, which stands for Hierarchical Data Format +version 5) is an increasingly common data format to store single-cell +RNA-seq data. Cellranger, for example, defaults its output in that +format. One of its advantages is the ability to store both the count +matrices and all metadata in a single file (versus using +features/barcodes/matrix files.) + +In this article we'll show how to convert H5 files to the +features/barcodes/matrix format to be able to upload and analyze your +data using Cellenics while we work on adding native support for H5 +files. + +The H5 file format is a container which can have many different things +inside. This implies that your H5 file may be different that what output +directly from Cellranger. So this article is divided in two parts: + +The first section will show how to process standard H5 files, the direct +output from +[`cellranger count`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct). + +The second section will show how to take an arbitrary single-cell +RNA-seq H5 file, manually inspect its contents, pick what is necessary +and convert them to a Cellenics supported format. Be mindful that this +section is a bit more involved, and might require a bit of manual code +editing. + +# Processing standard H5 files - cellranger output + +Standard cellranger HDF5 files can be processed using functionality +already implemented in the `Seurat` package. It should work out of the +box for cellranger output. If this were to fail, refer to the +Non-Standard HDF5 file section of this document. + +## Libraries + +We need to have Seurat and DropletUtils installed. Seurat can be +installed from CRAN, and [DropletUtils is available on +Bioconductor](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html) + +```{r} +library(Seurat) +library(DropletUtils) +``` + +## Processing + +Set the data_dir to the folder that contains the h5 files. After that, +we create a list of all H5 files in the directory, which will be +converted. + +```{r} +data_dir <- "./" +setwd(data_dir) +h5_files <- list.files(data_dir, pattern = "*h5$") +``` + +Create an output directory, to store the converted files. + +```{r} +output_dir <- "out" +dir.create(output_dir) +``` + +Convert the H5 files. The sample_name is going to be the folder name for +each sample, feel free to edit as desired. + +```{r} + +for (file in h5_files) { + + # make sample names, removing .h5 + sample_name <- sub("\\.h5$", "", basename(file)) + sample_path <- file.path(output_dir, sample_name) + + # to show progress + print(sample_name) + + # load the count matrices + gene_names <- rownames(Seurat::Read10X_h5(file)) + counts <- Seurat::Read10X_h5(file, use.names = F) + + # convert + DropletUtils::write10xCounts(sample_path, counts, version = "3", gene.symbol = gene_names) +} +``` + +# Processing non-standard H5 files + +Non-standard h5's should be treated with care. The general idea is to +manually inspect the file, using the `hdf5r` R package, or using the GUI +program [HDFView](https://www.hdfgroup.org/downloads/hdfview/) and take +note of the names of the slots that contain the necessary data. This is +the actual **counts**, the slot with the **genes** and the slot with the +**barcodes** (the names of the cells). The problem is that depending on +previous processing of these files, the slots could be named +differently, which means that there's no easy way to automate this, and +manual decisions have to be made. **All of them should be single columns +of integer numbers**. If there are decimals, these are not the slots +you're looking for. + +The counts slot could be called anything from "`data`", "`counts`" +"`reads`" to "`umi_corrected_reads`" (we prefer UMI corrected counts if +available). Genes and barcodes are usually named like that, "`genes`" +and "`barcodes`". + +In addition, we will need two extra slots with metadata, the gene IDs +(usually ensemblIDs, look for a vector of strings that start with "ENS" +and have a number), and the gene names (gene symbols). + +More details are provided in the **Define Parameters** section. + +## Libraries + +We need to install `hdf5r`, `data.table` and `Matrix` packages. + +```{r} +library(hdf5r) +library(Matrix) +library(DropletUtils) +``` + +## Define parameters + +Define slot names, by inspecting the H5 files to be processed, using +either `hdf5r` or HDFView. The slot names are the paths inside the H5 +file that point to different pieces of information required to convert +the files, such as the data, or the gene names. + +These slot names MUST be changed before processing. They are specific to +each non-standard h5 file. + +Lengths of the counts, genes and barcodes slots must be the same. These +three are used to build the sparse count matrix. + +- `counts_slot` should point to the actual data. +- `genes_slot` should point to an integer vector with row indices +- `barcodes_slot` should point to an integer vector with column + indices + +```{r} +counts_slot <- "umi_corrected_reads" +genes_slot <- "gene" +barcodes_slot <- "barcode" +``` + +The gene_names and ids should be the same length and most likely smaller +than the counts/genes/barcodes slots. These are the gene labels, used +when creating the 10x files. + +Like the previous slots, these MUST be renamed according to the +structure of the specific h5 file being processed. + +- `gene_ids_slot` should point to a character vector of gene ids. +- `gene_names_slot` should point to a character vector of gene symbols + +```{r} +gene_ids_slot <- "gene_ids" +gene_names_slot <- "gene_names" +``` + +## Bulk processing + +Use this section to bulk process h5 files. + +Set the data_dir to the folder that contains the h5 files. After that, +we create a list of all H5 files in the directory, which will be +converted. It's important to print the `h5_files` variable, and check if +the file names are correct, and that we're processing the h5 files that +we want to process. + +```{r} +data_dir <- "./" +setwd(data_dir) +h5_files <- list.files(data_dir, pattern = "*h5$") +``` + +Create an output directory, to store the converted files. + +```{r} +output_dir <- file.path(data_dir, "out") +dir.create(output_dir) +``` + +### Required functions + +These are the functions that do the actual work, so we need to load +them. They extract the slots we defined earlier and build the sparse +count matrix using them. + +```{r} +extract_slots <- function(h5_path) { + h5 <- H5File$new(h5_path, mode = "r") + + counts <- h5[[counts_slot]][] + genes <- h5[[genes_slot]][] + barcodes <- h5[[barcodes_slot]][] + + gene_ids <- h5[[gene_ids_slot]][] + gene_names <- h5[[gene_names_slot]][] + + r_barcodes <- data.table::frankv(barcodes, ties.method = "dense") + + if(min(genes) == 0 || min(barcodes) == 0) {index1 <- F} else {index1 <- T} + + return( + list( + "counts" = counts, + "genes" = genes, + "barcodes" = barcodes, + "r_barcodes" = r_barcodes, + "gene_ids" = gene_ids, + "gene_names" = gene_names, + "index1" = index1 + ) + ) +} + +build_sparse_matrix <- function(slots) { + sparse_matrix <- + sparseMatrix( + i = slots[["genes"]], + j = slots[["r_barcodes"]], + x = slots[["counts"]], + repr = "C", + index1 = slots[["index1"]] + ) + + return(sparse_matrix) +} + +``` + +### Processing + +This block converts all the h5 files detected and stored in the +`h5_files` variable using our previously defined parameters (slot names) +and functions. The sample_name is going to be the folder name for each +sample, feel free to edit as desired. + +```{r} +for (file in h5_files) { + print(file) + + # make sample names, removing .h5 + sample_name <- sub("\\.h5$", "", basename(file)) + sample_path <- file.path(output_dir, sample_name) + + # to show progress + print(sample_name) + + # read h5 files and build sparse matrix + slots <- extract_slots(file) + counts <- build_sparse_matrix(slots) + + # write to files. + DropletUtils::write10xCounts(sample_path, + counts, + barcodes = paste0("cell_", unique(slots[["barcodes"]])), + gene.id = slots[["gene_ids"]], + gene.symbol = slots[["gene_names"]], + version = "3") +} +``` + +Now you should be able to upload your data to Cellenics! From 618ec3f383841085c66c125e73bf8da130661d46 Mon Sep 17 00:00:00 2001 From: German Beldorati Stark Date: Fri, 15 Jul 2022 13:28:28 -0300 Subject: [PATCH 2/3] initial iteration of csv conversion vignette --- vignettes/convert_csv_files.Rmd | 268 ++++++++++++++++++++++++++++++++ 1 file changed, 268 insertions(+) create mode 100644 vignettes/convert_csv_files.Rmd diff --git a/vignettes/convert_csv_files.Rmd b/vignettes/convert_csv_files.Rmd new file mode 100644 index 0000000..0abed7b --- /dev/null +++ b/vignettes/convert_csv_files.Rmd @@ -0,0 +1,268 @@ +--- +title: "Converting csv/tsv files to upload to Cellenics" +output: + pdf_document: default + html_document: + df_print: paged + highlight: kate + theme: + version: 4 + code_font: + google: JetBrains Mono +editor_options: + chunk_output_type: console + markdown: + wrap: 72 +--- + +```{r, setup, include=FALSE} +knitr::opts_chunk$set(eval = F, message = F, warning = F) + +``` + +# Introduction + +"Comma (or Tab) Separated Value" files (CSV or TSV) are a common file type used +for the storage of tabular data. In general it is not recommended to use them, +and there are better, more robust alternatives for storing and sharing biological +data (such as H5 files), but they are very widely used and supported. + +The main issue that concerns us, with respect to uploading your data to Cellenics, +is that there is no well-defined standard as to how the single-cell RNA-seq information +is represented. The genes and barcodes might be on rows or columns, the sample information +could be represented in one file per sample (best case scenario) but it could be +encoded in many different ways (in the barcode name, in an extra column, etc). All of this +requires careful examination of the input files, to decide what the processing should be, +which could potentially involve some modification of the code presented in this document. + +We will make some generalizing assumptions: + +1. Genes are stored in rows +2. Barcodes (cells) are stored in columns +3. Sample information is encoded in the name of the barcode + +In case the sample assignment is not in the barcode (stored as different files +for example), leaving the `sample_regex` variable as `NULL` should be enough. + +# Libraries + +We need to have `data.table`, `DropletUtils` and the `Matrix` packages installed. +[DropletUtils is available on Bioconductor](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html), +while both `data.table` and `Matrix` are available on CRAN. + + +```{r} +library(data.table) +library(DropletUtils) +library(Matrix) +``` + +# Function definition + +These are the functions that will do the work for us, so we have to load them. + +```{r} +#' clean original data.table CSV column names +#' +#' Removes sample information from column names. It modifies in place! +#' +#' @param dt +#' @param sample_barcode_tab +#' +clean_dt_colnames <- function(dt, clean_barcodes) { + setnames(dt, base::colnames(dt), clean_barcodes) +} + +#' make sample <-> barcode table +#' +#' Extracts sample name from "sample_barcode" encoded column names in csv table. +#' Creates table with barcode - sample association. +#' Users should manually check if the regex is correct for the particular dataset +#' being demultiplexed. +#' +#' @param dt data.table original csv/tsv dataset +#' @param sample_regex chr regex to parse column names for sample and barcodes +#' +#' @return data.table +#' +make_sample_barcode_tab <- function(dt, sample_regex = NA) { + samp_bc <- colnames(dt) + + if (!is.na(sample_regex)) { + sample_names <- gsub(sample_regex, "\\1", samp_bc) + barcodes <- gsub(sample_regex, "\\2", samp_bc) + + clean_dt_colnames(dt, barcodes) + } else { + barcodes <- samp_bc + sample_names <- rep_len("single_sample", length(barcodes)) + } + + # first var in dt is the gene_names var (data.tables don't have rownames) + data.table( + sample = sample_names[-1], + barcode = barcodes[-1] + ) +} + + +#' Create list of barcodes in samples +#' +#' @param sample_barcode_tab data.table sample/barcode table +#' +#' @return list one element per sample, with every barcode in sample +#' +list_barcodes_in_sample <- function(sample_barcode_tab) { + # nest each barcode group to separate data.table + nested_sample_dt <- sample_barcode_tab[, .(bc_list = list(.SD)), by = sample] + + # convert nested data table to list + lapply(nested_sample_dt[["bc_list"]], unlist) +} + + +#' subset data.table +#' +#' Subsets cleaned (clean_dt_colnames) data.table, provided character vector of +#' barcodes in sample. +#' Helper function to simplify lapply calls. +#' +#' @param dt data.table cleaned count csv +#' @param columns character vector +#' +#' @return data.table subsetted data.table +#' +sub_dt <- function(columns, dt) { + # subset a data table by character vector, to ease lapply + columns <- c("V1", columns) + dt[, ..columns] +} + + +#' export demultiplexed data +#' +#' exports 10X files in a folder per sample. +#' +#' @param sample_dt data.table sample <-> barcode table +#' @param sparse_matrix_list list of count matrices per sample +#' @param data_dir chr root dir to export +#' +export_demultiplexed_data <- function(sample_dt, sparse_matrix_list, data_dir) { + + nested_sample_dt <- sample_dt[, .(bc_list = list(.SD)), by = sample] + + for (row in 1:nrow(nested_sample_dt)) { + fname <- file.path(data_dir, "out", nested_sample_dt[row][["sample"]]) + + # unnest barcodes in sample + expected_barcodes_in_sample <- nested_sample_dt[row, bc_list[[1]]][["barcode"]] + + if (!identical(expected_barcodes_in_sample, colnames(sparse_matrix_list[[row]]))) { + stop("not the same barcodes") + } + + DropletUtils::write10xCounts(fname, + sparse_matrix_list[[row]], + version = "3" + ) + } +} +``` + +# Parameter definition + +## Files and Folders + +Set the data_dir to the folder that contains the CSV/TSV file or files. After that, +we create a list of all CSV/TSV files in the directory, which will be +converted. We will refer to them as CSV files, but this applies to both types. If they +are compressed, you should uncompress them beforehand. + +After creating the list of csv/tsv files to process, we should manually checked if +it contains the correct files by printing it. + +```{r} +data_dir <- "./" +setwd(data_dir) +csv_files <- list.files(data_dir, pattern = "*[ct]sv$") + +print(csv_files) +``` + +Create an output directory, to store the converted files. + +```{r} +output_dir <- file.path(data_dir, "out") +dir.create(output_dir) +``` + +## Sample Information + +If the samples are encoded in the barcode names, you should write a regular expression +(regex) that captures the sample name/id and the barcodes. For example, if the barcodes looked like +"sampleX_AAACTAGCTCGCGA" our regex should have two groups (surrounded by parentheses), and +match "sampleX" and "AAACTAGCTCGCGA". + +Explaining regex in depth is out of the scope of this document, but this should +get you started: + +The example regex has two groups, separated by an underscore: +1. The first group captures the sample ID: `(sample[[:digit:]]+)` captures the +word "sample" folowed by any number "[[:digit:]]" repeated 1 or more times "+" + +2. The second group captures the barcode, which usually is the cDNA sequence, so using +`([ACTG]+)` we match any of ACTG ("[ACTG]") that appears one or more times "+" + +3. Finally, we expect them to be separated by an underscore "_". + + +```{r} +sample_regex <- NA +# example regex: "(sample[[:digit:]]+)_([ACGT]+)" +``` + + +# Processing the files + +After we loaded our packages, sourced our functions and defined our parameters, it's +time to actually process our files, by running the next block. + +NOTE: Since CSV/TSV files can be pretty big, we have to be careful with the RAM +usage, which is why there are some calls to the `rm()` function (to remove +unnecessary objects) and `gc()` to force R's garbage collection. + +```{r} + +for (file in csv_files) { + csv_table <- fread(file) + setnames(csv_table, old = 1, new = "V1") + + sample_tab <- make_sample_barcode_tab(csv_table, sample_regex) + + gc() + # subset the original count data.table + dt_subset <- + lapply(list_barcodes_in_sample(sample_tab), sub_dt, csv_table) + rm(csv_table) + gc() + + dt_subset <- list(csv_table) + rm(csv_table) + gc() + + # convert each subsetted count data.table to count matrix + counts <- lapply(dt_subset, as.matrix, rownames = "V1") + rm(dt_subset) + gc() + + # convert each count matrix to sparse matrices + sparse_counts <- lapply(counts, Matrix, sparse = T) + rm(counts) + gc() + + export_demultiplexed_data(sample_tab, sparse_counts, data_dir) + +} + +``` + From 2f4b637104417e98881eab1184782cde76685b88 Mon Sep 17 00:00:00 2001 From: German Beldorati Stark Date: Tue, 19 Jul 2022 11:18:17 -0300 Subject: [PATCH 3/3] add more text --- vignettes/convert_csv_files.Rmd | 36 ++++++++++++++++++++++++++------- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/vignettes/convert_csv_files.Rmd b/vignettes/convert_csv_files.Rmd index 0abed7b..ea05ed3 100644 --- a/vignettes/convert_csv_files.Rmd +++ b/vignettes/convert_csv_files.Rmd @@ -196,6 +196,29 @@ output_dir <- file.path(data_dir, "out") dir.create(output_dir) ``` +## Manual inspection + +We should read in at least one of the csv files and take a look at them. We're +especially interested in the column names, to see if they contain sample information. + +We can take a look at the output of some useful R functions, such as `str`, `colnames` + +```{r} +csv_example <- fread(csv_files[1]) + +# Look at the general structure of the matrix. +str(csv_example) + +# print the column names, usually the barcodes +colnames(csv_example) + +# print the first 20 rows of the first column (usually gene names) +head(csv_example[, 1], 20) +``` + +Looking at the column names, we should be able to tell if there's sample information +encoded, which will inform our decision in the next section. + ## Sample Information If the samples are encoded in the barcode names, you should write a regular expression @@ -240,16 +263,13 @@ for (file in csv_files) { sample_tab <- make_sample_barcode_tab(csv_table, sample_regex) gc() - # subset the original count data.table + + # subset the original count data.table, separating by samples if present dt_subset <- lapply(list_barcodes_in_sample(sample_tab), sub_dt, csv_table) rm(csv_table) gc() - - dt_subset <- list(csv_table) - rm(csv_table) - gc() - + # convert each subsetted count data.table to count matrix counts <- lapply(dt_subset, as.matrix, rownames = "V1") rm(dt_subset) @@ -260,9 +280,11 @@ for (file in csv_files) { rm(counts) gc() + # export the data to one folder per sample export_demultiplexed_data(sample_tab, sparse_counts, data_dir) - } ``` +After this, you should have an "out" folder containing all the samples in a format +compatible with Cellenics! \ No newline at end of file