Open
Description
I just ran into this same issue (aertslab/pycisTopic#158) while running single_cell_toolkit as part of the PUMATAC pipeline. In short, when a genome mixes integer-like and string chromosome names, Polars can infer column 1 as Int64 from the first rows it reads. Parsing then fails on the first string chromosome name, before the column is ever cast to categorical. The proposed fix is to do what pycisTopic does: set the dtypes directly in the pl.read_csv call.
Traceback (most recent call last):
  File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 257, in <module>
    main()
  File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 248, in main
    calculate_jaccard_index_cbs(
  File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 25, in calculate_jaccard_index_cbs
    pl.read_csv(
  File "/opt/venv/lib/python3.9/site-packages/polars/io.py", line 348, in read_csv
    df = DataFrame._read_csv(
  File "/opt/venv/lib/python3.9/site-packages/polars/internals/frame.py", line 593, in _read_csv
    self._df = PyDataFrame.read_csv(
exceptions.ComputeError: Could not parse `CAADRL010001205.1` as dtype Int64 at column 1.
Code in single_cell_toolkit:
# Read fragments file, count number of fragments per CB and
# keep only those fragments which have a CB above or equal to min_frags_per_CB threshold.
fragments_df = (
    pl.read_csv(
        fragments_tsv_filename,
        has_header=False,
        separator="\t",
        new_columns=["chrom", "start", "end", "CB", "CB_count"],
    )
    .with_columns(
        pl.col("chrom").cast(pl.Categorical),
        pl.col("CB").cast(pl.Categorical),
    )
    .with_columns(pl.col("CB").count().over("CB").alias("per_CB_count"))
    .filter(pl.col("per_CB_count") >= min_frags_per_CB)
)
Better code in pycisTopic:
if engine == "polars":
    # Read fragments BED file with Polars.
    df = pl.read_csv(
        fragments_bed_filename,
        has_header=False,
        skip_rows=skip_rows,
        separator="\t",
        use_pyarrow=False,
        new_columns=bed_column_names[:column_count],
        dtypes={
            bed_column: dtype
            for bed_column, dtype in {
                "Chromosome": pl.Categorical,
                "Start": pl.Int32,
                "End": pl.Int32,
                "Name": pl.Categorical,
                "Strand": pl.Categorical,
            }.items()
            if bed_column in bed_column_names[:column_count]
        },
    ).to_pandas()