polars incorrectly inferring dtype #6

@camiel-m

I just ran into this same issue (aertslab/pycisTopic#158) while running single_cell_toolkit as part of the PUMATAC pipeline. In short, when a genome mixes integer-like and string chromosome names, polars can infer column 1 as an integer and then fail on a later string name, before the column is ever cast to categorical. A possible fix is to set the dtypes in the read_csv call, as pycisTopic does.

    Traceback (most recent call last):
      File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 257, in <module>
        main()
      File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 248, in main
        calculate_jaccard_index_cbs(
      File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 25, in calculate_jaccard_index_cbs
        pl.read_csv(
      File "/opt/venv/lib/python3.9/site-packages/polars/io.py", line 348, in read_csv
        df = DataFrame._read_csv(
      File "/opt/venv/lib/python3.9/site-packages/polars/internals/frame.py", line 593, in _read_csv
        self._df = PyDataFrame.read_csv(
    exceptions.ComputeError: Could not parse `CAADRL010001205.1` as dtype Int64 at column 1.
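
To illustrate why the error only shows up at a late row, here is a toy model in plain Python. It is not polars' implementation: it only mimics the general idea that a reader guesses a column type from a limited number of leading rows (in polars this is controlled by the `infer_schema_length` argument of `read_csv`) and then applies that guess to the whole file. The function names and the sample data are made up for the illustration.

```python
def infer_column_type(values, infer_rows=100):
    """Toy inference: guess int vs str from the first `infer_rows` values only."""
    sample = values[:infer_rows]
    looks_integer = all(v.lstrip("-").isdigit() for v in sample)
    return int if looks_integer else str


def parse_column(values, infer_rows=100):
    """Parse every value with the type guessed from the leading rows."""
    dtype = infer_column_type(values, infer_rows)
    return [dtype(v) for v in values]


# Hypothetical fragments file: many rows on chromosome "1",
# followed by a fragment on an unplaced contig.
chroms = ["1"] * 150 + ["CAADRL010001205.1"]

try:
    parse_column(chroms)
    failed = False
except ValueError:
    # The leading rows all looked like integers, so the contig name cannot be parsed.
    failed = True
```

Looking at more rows before guessing would avoid the failure in this toy case, but stating the column type up front removes the guess entirely, which is what the pycisTopic code does.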

Current code in single_cell_toolkit:

    # Read fragments file, count number of fragments per CB and
    # keep only those fragments which have a CB above or equal to min_frags_per_CB threshold.
    fragments_df = (
        pl.read_csv(
            fragments_tsv_filename,
            has_header=False,
            separator="\t",
            new_columns=["chrom", "start", "end", "CB", "CB_count"],
        )
        .with_columns(
            pl.col("chrom").cast(pl.Categorical),
            pl.col("CB").cast(pl.Categorical),
        )
        .with_columns(pl.col("CB").count().over("CB").alias("per_CB_count"))
        .filter(pl.col("per_CB_count") >= min_frags_per_CB)

Better code in pycisTopic:

    if engine == "polars":
        # Read fragments BED file with Polars.
        df = pl.read_csv(
            fragments_bed_filename,
            has_header=False,
            skip_rows=skip_rows,
            separator="\t",
            use_pyarrow=False,
            new_columns=bed_column_names[:column_count],
            dtypes={
                bed_column: dtype
                for bed_column, dtype in {
                    "Chromosome": pl.Categorical,
                    "Start": pl.Int32,
                    "End": pl.Int32,
                    "Name": pl.Categorical,
                    "Strand": pl.Categorical,
                }.items()
                if bed_column in bed_column_names[:column_count]
            },
        ).to_pandas()
