polars incorrectly inferring dtype #6

I just ran into the same issue as https://github.com/aertslab/pycisTopic/pull/158 while running single_cell_toolkit as part of the PUMATAC pipeline. In short, when a genome mixes integer-like and string chromosome names, polars can infer column 1 as an integer, so parsing fails before the column is ever cast to categorical. The proposed fix is to set the dtypes directly in the `pl.read_csv` call, as pycisTopic does.

```
  Traceback (most recent call last):
    File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 257, in <module>
      main()
    File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 248, in main
      calculate_jaccard_index_cbs(
    File "/opt/single_cell_toolkit/barcard/calculate_jaccard_index_cbs.py", line 25, in calculate_jaccard_index_cbs
      pl.read_csv(
    File "/opt/venv/lib/python3.9/site-packages/polars/io.py", line 348, in read_csv
      df = DataFrame._read_csv(
    File "/opt/venv/lib/python3.9/site-packages/polars/internals/frame.py", line 593, in _read_csv
      self._df = PyDataFrame.read_csv(
  exceptions.ComputeError: Could not parse `CAADRL010001205.1` as dtype Int64 at column 1.
```

Current code in single_cell_toolkit:
```
    # Read fragments file, count number of fragments per CB and
    # keep only those fragments which have a CB above or equal to min_frags_per_CB threshold.
    fragments_df = (
        pl.read_csv(
            fragments_tsv_filename,
            has_header=False,
            separator="\t",
            new_columns=["chrom", "start", "end", "CB", "CB_count"],
        )
        .with_columns(
            pl.col("chrom").cast(pl.Categorical),
            pl.col("CB").cast(pl.Categorical),
        )
        .with_columns(pl.col("CB").count().over("CB").alias("per_CB_count"))
        .filter(pl.col("per_CB_count") >= min_frags_per_CB)
```

Better code in pycisTopic:
```
if engine == "polars":
    # Read fragments BED file with Polars.
    df = pl.read_csv(
        fragments_bed_filename,
        has_header=False,
        skip_rows=skip_rows,
        separator="\t",
        use_pyarrow=False,
        new_columns=bed_column_names[:column_count],
        dtypes={
            bed_column: dtype
            for bed_column, dtype in {
                "Chromosome": pl.Categorical,
                "Start": pl.Int32,
                "End": pl.Int32,
                "Name": pl.Categorical,
                "Strand": pl.Categorical,
            }.items()
            if bed_column in bed_column_names[:column_count]
        },
    ).to_pandas()
```
