From d93381daf0b56ddc1462b1eab6a2c8d2bb5a039e Mon Sep 17 00:00:00 2001
From: German Beldorati Stark <gerbeldo@gmail.com>
Date: Fri, 15 Jul 2022 12:16:14 -0300
Subject: [PATCH 1/3] add folder for vignettes/blogposts. and h5 vignette

---
 vignettes/h5_processing_blog.Rmd | 289 +++++++++++++++++++++++++++++++
 1 file changed, 289 insertions(+)
 create mode 100644 vignettes/h5_processing_blog.Rmd

diff --git a/vignettes/h5_processing_blog.Rmd b/vignettes/h5_processing_blog.Rmd
new file mode 100644
index 0000000..862f48d
--- /dev/null
+++ b/vignettes/h5_processing_blog.Rmd
@@ -0,0 +1,289 @@
+---
+title: "Converting H5 files to upload to Cellenics"
+output:
+  pdf_document: default
+  html_document:
+    df_print: paged
+    highlight: kate
+    theme:
+      version: 4
+      code_font:
+        google: JetBrains Mono
+editor_options:
+  chunk_output_type: console
+  markdown:
+    wrap: 72
+---
+
+```{r, setup, include=FALSE}
+knitr::opts_chunk$set(eval = F, message = F, warning = F)
+
+```
+
+# Introduction
+
+The H5 format (short for HDF5, which stands for Hierarchical Data Format
+version 5) is an increasingly common data format to store single-cell
+RNA-seq data. Cellranger, for example, defaults its output in that
+format. One of its advantages is the ability to store both the count
+matrices and all metadata in a single file (versus using
+features/barcodes/matrix files.)
+
+In this article we'll show how to convert H5 files to the
+features/barcodes/matrix format to be able to upload and analyze your
+data using Cellenics while we work on adding native support for H5
+files.
+
+The H5 file format is a container which can have many different things
+inside. This implies that your H5 file may be different that what output
+directly from Cellranger. So this article is divided in two parts:
+
+The first section will show how to process standard H5 files, the direct
+output from
+[`cellranger count`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct).
+
+The second section will show how to take an arbitrary single-cell
+RNA-seq H5 file, manually inspect its contents, pick what is necessary
+and convert them to a Cellenics supported format. Be mindful that this
+section is a bit more involved, and might require a bit of manual code
+editing.
+
+# Processing standard H5 files - cellranger output
+
+Standard cellranger HDF5 files can be processed using functionality
+already implemented in the `Seurat` package. It should work out of the
+box for cellranger output. If this were to fail, refer to the
+Non-Standard HDF5 file section of this document.
+
+## Libraries
+
+We need to have Seurat and DropletUtils installed. Seurat can be
+installed from CRAN, and [DropletUtils is available on
+Bioconductor](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html)
+
+```{r}
+library(Seurat)
+library(DropletUtils)
+```
+
+## Processing
+
+Set the data_dir to the folder that contains the h5 files. After that,
+we create a list of all H5 files in the directory, which will be
+converted.
+
+```{r}
+data_dir <- "./"
+setwd(data_dir)
+h5_files <- list.files(data_dir, pattern = "*h5$")
+```
+
+Create an output directory, to store the converted files.
+
+```{r}
+output_dir <- "out"
+dir.create(output_dir)
+```
+
+Convert the H5 files. The sample_name is going to be the folder name for
+each sample, feel free to edit as desired.
+
+```{r}
+
+for (file in h5_files) {
+  
+  # make sample names, removing .h5
+  sample_name <- sub("\\.h5$", "", basename(file))
+  sample_path <- file.path(output_dir, sample_name)
+
+  # to show progress
+  print(sample_name)
+  
+  # load the count matrices
+  gene_names <- rownames(Seurat::Read10X_h5(file))
+  counts <- Seurat::Read10X_h5(file, use.names = F)
+  
+  # convert
+  DropletUtils::write10xCounts(sample_path, counts, version = "3", gene.symbol = gene_names)
+}
+```
+
+# Processing non-standard H5 files
+
+Non-standard h5's should be treated with care. The general idea is to
+manually inspect the file, using the `hdf5r` R package, or using the GUI
+program [HDFView](https://www.hdfgroup.org/downloads/hdfview/) and take
+note of the names of the slots that contain the necessary data. This is
+the actual **counts**, the slot with the **genes** and the slot with the
+**barcodes** (the names of the cells). The problem is that depending on
+previous processing of these files, the slots could be named
+differently, which means that there's no easy way to automate this, and
+manual decisions have to be made. **All of them should be single columns
+of integer numbers**. If there are decimals, these are not the slots
+you're looking for.
+
+The counts slot could be called anything from "`data`", "`counts`"
+"`reads`" to "`umi_corrected_reads`" (we prefer UMI corrected counts if
+available). Genes and barcodes are usually named like that, "`genes`"
+and "`barcodes`".
+
+In addition, we will need two extra slots with metadata, the gene IDs
+(usually ensemblIDs, look for a vector of strings that start with "ENS"
+and have a number), and the gene names (gene symbols).
+
+More details are provided in the **Define Parameters** section.
+
+## Libraries
+
+We need to install `hdf5r`, `data.table` and `Matrix` packages.
+
+```{r}
+library(hdf5r)
+library(Matrix)
+library(DropletUtils)
+```
+
+## Define parameters
+
+Define slot names, by inspecting the H5 files to be processed, using
+either `hdf5r` or HDFView. The slot names are the paths inside the H5
+file that point to different pieces of information required to convert
+the files, such as the data, or the gene names.
+
+These slot names MUST be changed before processing. They are specific to
+each non-standard h5 file.
+
+Lengths of the counts, genes and barcodes slots must be the same. These
+three are used to build the sparse count matrix.
+
+-   `counts_slot` should point to the actual data.
+-   `genes_slot` should point to an integer vector with row indices
+-   `barcodes_slot` should point to an integer vector with column
+    indices
+
+```{r}
+counts_slot <- "umi_corrected_reads"
+genes_slot <- "gene"
+barcodes_slot <- "barcode"
+```
+
+The gene_names and ids should be the same length and most likely smaller
+than the counts/genes/barcodes slots. These are the gene labels, used
+when creating the 10x files.
+
+Like the previous slots, these MUST be renamed according to the
+structure of the specific h5 file being processed.
+
+-   `gene_ids_slot` should point to a character vector of gene ids.
+-   `gene_names_slot` should point to a character vector of gene symbols
+
+```{r}
+gene_ids_slot <- "gene_ids"
+gene_names_slot <- "gene_names"
+```
+
+## Bulk processing
+
+Use this section to bulk process h5 files.
+
+Set the data_dir to the folder that contains the h5 files. After that,
+we create a list of all H5 files in the directory, which will be
+converted. It's important to print the `h5_files` variable, and check if
+the file names are correct, and that we're processing the h5 files that
+we want to process.
+
+```{r}
+data_dir <- "./"
+setwd(data_dir)
+h5_files <- list.files(data_dir, pattern = "*h5$")
+```
+
+Create an output directory, to store the converted files.
+
+```{r}
+output_dir <- file.path(data_dir, "out")
+dir.create(output_dir)
+```
+
+### Required functions
+
+These are the functions that do the actual work, so we need to load
+them. They extract the slots we defined earlier and build the sparse
+count matrix using them.
+
+```{r}
+extract_slots <- function(h5_path) {
+  h5 <- H5File$new(h5_path, mode = "r")
+  
+  counts <- h5[[counts_slot]][]
+  genes <- h5[[genes_slot]][]
+  barcodes <- h5[[barcodes_slot]][]
+  
+  gene_ids <- h5[[gene_ids_slot]][]
+  gene_names <- h5[[gene_names_slot]][]
+  
+  r_barcodes <- data.table::frankv(barcodes, ties.method = "dense")
+  
+  if(min(genes) == 0 || min(barcodes) == 0) {index1 <- F} else {index1 <- T}
+  
+  return(
+    list(
+      "counts" = counts,
+      "genes" = genes,
+      "barcodes" = barcodes,
+      "r_barcodes" = r_barcodes,
+      "gene_ids" = gene_ids,
+      "gene_names" = gene_names,
+      "index1" = index1
+    )
+  )
+}
+
+build_sparse_matrix <- function(slots) {
+  sparse_matrix <-
+    sparseMatrix(
+      i = slots[["genes"]],
+      j = slots[["r_barcodes"]],
+      x = slots[["counts"]],
+      repr = "C",
+      index1 = slots[["index1"]]
+    )
+  
+  return(sparse_matrix)
+}
+
+```
+
+### Processing
+
+This block converts all the h5 files detected and stored in the
+`h5_files` variable using our previously defined parameters (slot names)
+and functions. The sample_name is going to be the folder name for each
+sample, feel free to edit as desired.
+
+```{r}
+for (file in h5_files) {
+  print(file)
+  
+  # make sample names, removing .h5
+  sample_name <- sub("\\.h5$", "", basename(file))
+  sample_path <- file.path(output_dir, sample_name)
+
+  # to show progress
+  print(sample_name)
+  
+  # read h5 files and build sparse matrix
+  slots <- extract_slots(file)
+  counts <- build_sparse_matrix(slots)
+  
+  # write to files.
+  DropletUtils::write10xCounts(sample_path,
+                               counts,
+                               barcodes = paste0("cell_", unique(slots[["barcodes"]])),
+                               gene.id = slots[["gene_ids"]],
+                               gene.symbol = slots[["gene_names"]],
+                               version = "3")
+}
+```
+
+Now you should be able to upload your data to Cellenics!

From 618ec3f383841085c66c125e73bf8da130661d46 Mon Sep 17 00:00:00 2001
From: German Beldorati Stark <gerbeldo@gmail.com>
Date: Fri, 15 Jul 2022 13:28:28 -0300
Subject: [PATCH 2/3] initial iteration of csv conversion vignette

---
 vignettes/convert_csv_files.Rmd | 268 ++++++++++++++++++++++++++++++++
 1 file changed, 268 insertions(+)
 create mode 100644 vignettes/convert_csv_files.Rmd

diff --git a/vignettes/convert_csv_files.Rmd b/vignettes/convert_csv_files.Rmd
new file mode 100644
index 0000000..0abed7b
--- /dev/null
+++ b/vignettes/convert_csv_files.Rmd
@@ -0,0 +1,268 @@
+---
+title: "Converting csv/tsv files to upload to Cellenics"
+output:
+  pdf_document: default
+  html_document:
+    df_print: paged
+    highlight: kate
+    theme:
+      version: 4
+      code_font:
+        google: JetBrains Mono
+editor_options:
+  chunk_output_type: console
+  markdown:
+    wrap: 72
+---
+
+```{r, setup, include=FALSE}
+knitr::opts_chunk$set(eval = F, message = F, warning = F)
+
+```
+
+# Introduction
+
+"Comma (or Tab) Separated Value" files (CSV or TSV) are a common file type used 
+for the storage of tabular data. In general it is not recommended to use them, 
+and there are better, more robust alternatives for storing and sharing biological
+data (such as H5 files), but they are very widely used and supported. 
+
+The main issue that concerns us, with respect to uploading your data to Cellenics,
+is that there is no well-defined standard as to how the single-cell RNA-seq information
+is represented. The genes and barcodes might be on rows or columns, the sample information
+could be represented in one file per sample (best case scenario) but it could be 
+encoded in many different ways (in the barcode name, in an extra column, etc). All of this
+requires careful examination of the input files, to decide what the processing should be,
+which could potentially involve some modification of the code presented in this document.
+
+We will make some generalizing assumptions:
+
+1. Genes are stored in rows
+2. Barcodes (cells) are stored in columns
+3. Sample information is encoded in the name of the barcode
+
+In case the sample assignment is not in the barcode (stored as different files 
+for example), leaving the `sample_regex` variable as `NULL` should be enough.
+
+# Libraries
+
+We need to have `data.table`, `DropletUtils` and the `Matrix` packages installed.
+[DropletUtils is available on Bioconductor](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html),
+while both `data.table` and `Matrix` are available on CRAN.
+
+
+```{r}
+library(data.table)
+library(DropletUtils)
+library(Matrix)
+```
+
+# Function definition
+
+These are the functions that will do the work for us, so we have to load them.
+
+```{r}
+#' clean original data.table CSV column names
+#' 
+#' Removes sample information from column names. It modifies in place!
+#'
+#' @param dt 
+#' @param sample_barcode_tab 
+#'
+clean_dt_colnames <- function(dt, clean_barcodes) {
+  setnames(dt, base::colnames(dt), clean_barcodes)
+}
+
+#' make sample <-> barcode table
+#'
+#' Extracts sample name from "sample_barcode" encoded column names in csv table. 
+#' Creates table with barcode - sample association.
+#' Users should manually check if the regex is correct for the particular dataset
+#' being demultiplexed.
+#'
+#' @param dt data.table original csv/tsv dataset
+#' @param sample_regex chr regex to parse column names for sample and barcodes
+#'
+#' @return data.table
+#'
+make_sample_barcode_tab <- function(dt, sample_regex = NA) {
+  samp_bc <- colnames(dt)
+
+  if (!is.na(sample_regex)) {
+    sample_names <- gsub(sample_regex, "\\1", samp_bc)
+    barcodes <- gsub(sample_regex, "\\2", samp_bc)
+    
+    clean_dt_colnames(dt, barcodes)
+  } else {
+    barcodes <- samp_bc
+    sample_names <- rep_len("single_sample", length(barcodes))
+  }
+  
+  # first var in dt is the gene_names var (data.tables don't have rownames)
+  data.table(
+    sample = sample_names[-1],
+    barcode = barcodes[-1]
+  )
+}
+
+
+#' Create list of barcodes in samples
+#'
+#' @param sample_barcode_tab data.table sample/barcode table
+#'
+#' @return list one element per sample, with every barcode in sample
+#'
+list_barcodes_in_sample <- function(sample_barcode_tab) {
+  # nest each barcode group to separate data.table
+  nested_sample_dt <- sample_barcode_tab[, .(bc_list = list(.SD)), by = sample]
+
+  # convert nested data table to list
+  lapply(nested_sample_dt[["bc_list"]], unlist)
+}
+
+
+#' subset data.table
+#'
+#' Subsets cleaned (clean_dt_colnames) data.table, provided character vector of 
+#' barcodes in sample. 
+#' Helper function to simplify lapply calls. 
+#'
+#' @param dt data.table cleaned count csv
+#' @param columns character vector 
+#'
+#' @return data.table subsetted data.table
+#'
+sub_dt <- function(columns, dt) {
+  # subset a data table by character vector, to ease lapply
+  columns <- c("V1", columns)
+  dt[, ..columns]
+}
+
+
+#' export demultiplexed data
+#' 
+#' exports  10X files in a folder per sample.
+#' 
+#' @param sample_dt data.table sample <-> barcode table
+#' @param sparse_matrix_list list of count matrices per sample
+#' @param data_dir chr root dir to export
+#'
+export_demultiplexed_data <- function(sample_dt, sparse_matrix_list, data_dir) {
+
+  nested_sample_dt <- sample_dt[, .(bc_list = list(.SD)), by = sample]
+  
+  for (row in 1:nrow(nested_sample_dt)) {
+    fname <- file.path(data_dir, "out", nested_sample_dt[row][["sample"]])
+
+    # unnest barcodes in sample
+    expected_barcodes_in_sample <- nested_sample_dt[row, bc_list[[1]]][["barcode"]]
+
+    if (!identical(expected_barcodes_in_sample, colnames(sparse_matrix_list[[row]]))) {
+      stop("not the same barcodes")
+    }
+
+    DropletUtils::write10xCounts(fname,
+      sparse_matrix_list[[row]],
+      version = "3"
+    )
+  }
+}
+```
+
+# Parameter definition
+
+## Files and Folders
+
+Set the data_dir to the folder that contains the CSV/TSV file or files. After that,
+we create a list of all CSV/TSV files in the directory, which will be
+converted. We will refer to them as CSV files, but this applies to both types. If they
+are compressed, you should uncompress them beforehand.
+
+After creating the list of csv/tsv files to process, we should manually checked if
+it contains the correct files by printing it.
+
+```{r}
+data_dir <- "./"
+setwd(data_dir)
+csv_files <- list.files(data_dir, pattern = "*[ct]sv$")
+
+print(csv_files)
+```
+
+Create an output directory, to store the converted files.
+
+```{r}
+output_dir <- file.path(data_dir, "out")
+dir.create(output_dir)
+```
+
+## Sample Information
+
+If the samples are encoded in the barcode names, you should write a regular expression
+(regex) that captures the sample name/id and the barcodes. For example, if the barcodes looked like 
+"sampleX_AAACTAGCTCGCGA" our regex should have two groups (surrounded by parentheses), and
+match "sampleX" and "AAACTAGCTCGCGA". 
+
+Explaining regex in depth is out of the scope of this document, but this should
+get you started:
+
+The example regex has two groups, separated by an underscore:
+1. The first group captures the sample ID: `(sample[[:digit:]]+)` captures the
+word "sample" folowed by any number "[[:digit:]]" repeated 1 or more times "+"
+
+2. The second group captures the barcode, which usually is the cDNA sequence, so using
+`([ACTG]+)` we match any of ACTG ("[ACTG]") that appears one or more times "+"
+
+3. Finally, we expect them to be separated by an underscore "_".
+
+
+```{r}
+sample_regex <- NA
+# example regex: "(sample[[:digit:]]+)_([ACGT]+)"
+```
+
+
+# Processing the files
+
+After we loaded our packages, sourced our functions and defined our parameters, it's
+time to actually process our files, by running the next block. 
+
+NOTE: Since CSV/TSV files can be pretty big, we have to be careful with the RAM
+usage, which is why there are some calls to the `rm()` function (to remove
+unnecessary objects) and `gc()` to force R's garbage collection.
+
+```{r}
+
+for (file in csv_files) {
+  csv_table <- fread(file)
+  setnames(csv_table, old = 1, new = "V1")
+  
+  sample_tab <- make_sample_barcode_tab(csv_table, sample_regex)
+  
+  gc()
+  # subset the original count data.table
+  dt_subset <-
+    lapply(list_barcodes_in_sample(sample_tab), sub_dt, csv_table)
+  rm(csv_table)
+  gc()
+  
+  dt_subset <- list(csv_table)
+  rm(csv_table)
+  gc()
+  
+  # convert each subsetted count data.table to count matrix
+  counts <- lapply(dt_subset, as.matrix, rownames = "V1")
+  rm(dt_subset)
+  gc()
+  
+  # convert each count matrix to sparse matrices
+  sparse_counts <- lapply(counts, Matrix, sparse = T)
+  rm(counts)
+  gc()
+  
+  export_demultiplexed_data(sample_tab, sparse_counts, data_dir)
+  
+}
+
+```
+

From 2f4b637104417e98881eab1184782cde76685b88 Mon Sep 17 00:00:00 2001
From: German Beldorati Stark <gerbeldo@gmail.com>
Date: Tue, 19 Jul 2022 11:18:17 -0300
Subject: [PATCH 3/3] add more text

---
 vignettes/convert_csv_files.Rmd | 36 ++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/vignettes/convert_csv_files.Rmd b/vignettes/convert_csv_files.Rmd
index 0abed7b..ea05ed3 100644
--- a/vignettes/convert_csv_files.Rmd
+++ b/vignettes/convert_csv_files.Rmd
@@ -196,6 +196,29 @@ output_dir <- file.path(data_dir, "out")
 dir.create(output_dir)
 ```
 
+## Manual inspection
+
+We should read in at least one of the csv files and take a look at them. We're
+especially interested in the column names, to see if they contain sample information.
+
+We can take a look at the output of some useful R functions, such as `str`, `colnames`
+
+```{r}
+csv_example <- fread(csv_files[1])
+
+# Look at the general structure of the matrix.
+str(csv_example)
+
+# print the column names, usually the barcodes
+colnames(csv_example)
+
+# print the first 20 rows of the first column (usually gene names)
+head(csv_example[, 1], 20)
+```
+
+Looking at the column names, we should be able to tell if there's sample information
+encoded, which will inform our decision in the next section.
+
 ## Sample Information
 
 If the samples are encoded in the barcode names, you should write a regular expression
@@ -240,16 +263,13 @@ for (file in csv_files) {
   sample_tab <- make_sample_barcode_tab(csv_table, sample_regex)
   
   gc()
-  # subset the original count data.table
+  
+  # subset the original count data.table, separating by samples if present
   dt_subset <-
     lapply(list_barcodes_in_sample(sample_tab), sub_dt, csv_table)
   rm(csv_table)
   gc()
-  
-  dt_subset <- list(csv_table)
-  rm(csv_table)
-  gc()
-  
+
   # convert each subsetted count data.table to count matrix
   counts <- lapply(dt_subset, as.matrix, rownames = "V1")
   rm(dt_subset)
@@ -260,9 +280,11 @@ for (file in csv_files) {
   rm(counts)
   gc()
   
+  # export the data to one folder per sample
   export_demultiplexed_data(sample_tab, sparse_counts, data_dir)
-  
 }
 
 ```
 
+After this, you should have an "out" folder containing all the samples in a format
+compatible with Cellenics!
\ No newline at end of file