OpenVariant Bug | Windows input files not parsed properly #47

@CarlosLopezElorduy

Describe the issue

I encountered an issue where two files with identical content behave differently when parsed with openvariant: one is parsed successfully and the other is not. After some debugging, I found that the difference is the OS on which each file was created.

  • If the file was created in Linux, openvariant parses it successfully ✅
  • If the file was created in Windows, openvariant doesn't parse the input properly ❌

This is caused by how line endings are handled on different operating systems.

OS       Line-ending character
Linux    \n (LF)
Windows  \r\n (CRLF)
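
A minimal sketch of one plausible mechanism, not confirmed against the openvariant source: if a reader splits the file on `\n` only, every field of a Windows file keeps a trailing `\r`, so the header `protein\r` no longer matches any `fieldSource` entry. The `accepted` set below simply copies the `PROTEIN` entries from the annotation file; the splitting logic is an assumption for illustration.

```python
# Bytes as they would appear in a file saved with Windows (CRLF) line endings.
raw = b"protein\r\nKRAS:p.G12V\r\n"

# Splitting on "\n" alone leaves the carriage return attached to each line.
lines = raw.decode("utf-8").split("\n")
header = lines[0]
print(repr(header))  # 'protein\r'

# Header names accepted for the PROTEIN field in the annotation file.
accepted = {"protein", "Protein", "PROTEIN"}

print(header in accepted)               # False
print(header.rstrip("\r") in accepted)  # True
```

If this is what happens internally, the column would never be found, which would be consistent with the `nan` values reported below.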

Steps to reproduce the bug

One of the input files needs to be created using a Windows machine, while the other should be created using Linux. The content should be identical.

Content of the TXT files

protein
KRAS:p.G12V
KRAS:p.G12C
KRAS:p.G12S
KRAS:p.G12F
KRAS:p.G12A
KRAS:p.G13D
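
If no Windows machine is at hand, both files can probably be generated from any OS by forcing the line terminator when writing. This is a sketch, assuming the only relevant difference between the two files is the line ending; the helper name `write_with_line_ending` is made up for this example.

```python
# Content shared by both input files (header plus six protein changes).
ROWS = [
    "protein",
    "KRAS:p.G12V",
    "KRAS:p.G12C",
    "KRAS:p.G12S",
    "KRAS:p.G12F",
    "KRAS:p.G12A",
    "KRAS:p.G13D",
]


def write_with_line_ending(path, rows, line_ending):
    """Write one row per line, terminated by the given line ending."""
    # In text mode, newline="\r\n" makes Python translate every "\n"
    # written below into "\r\n"; newline="\n" disables any translation.
    with open(path, "w", newline=line_ending, encoding="utf-8") as handle:
        for row in rows:
            handle.write(row + "\n")


write_with_line_ending("KRAS_linux.txt", ROWS, "\n")
write_with_line_ending("KRAS_windows.txt", ROWS, "\r\n")
```

The file names match the ones used in the Python script further down.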

Annotation file

pattern:
  - '.*\.txt$'
  - '.*\.tsv$'
  - '.*\.txt\.gz$'
  - '.*\.tsv\.gz$'

columns:
  - PROTEIN
  - TYPE

annotation:

  - type: internal
    field: GENE
    fieldSource:
      - 'gene'
      - 'Gene'
      - 'GENE'
      - 'Symbol'
      - 'SYMBOL'
      - 'GENE_SYMBOL'
  - type: internal
    field: PROTEIN
    fieldSource:
      - "protein"
      - "Protein"
      - "PROTEIN"
  - type: internal
    field: TRANSCRIPT
    fieldSource:
      - 'transcript'
      - 'TRANSCRIPT'
      - 'ENSEMBL_TRANSCRIPT'
      - 'ensembl_transcript'
      - 'MANE_SELECT'
      - 'mane_select'
      - 'MANE'
      - 'mane'
      - 'refseq_transcript'
      - 'REFSEQ_TRANSCRIPT'
      - 'refseq'
      - 'REFSEQ'
  - type: static
    field: "TYPE"
    value: "protein"

Python script (`openvariant_read.py`)

from openvariant import Annotation, Variant

dataset_file_linux = "KRAS_linux.txt"
dataset_file_windows = "KRAS_windows.txt"

annotation_file = "protein.yaml"

annotation = Annotation(annotation_path=annotation_file)
result_linux = Variant(path=dataset_file_linux, annotation=annotation)
result_windows = Variant(path=dataset_file_windows, annotation=annotation)

for n_line, line in enumerate(result_linux.read()):
    print(f'Line (linux) {n_line}: {line}')
    if n_line == 9:
        break

for n_line, line in enumerate(result_windows.read()):
    print(f'Line (windows) {n_line}: {line}')
    if n_line == 9:
        break

Error encountered or actual result

By running the previous script, this is the output that we obtain:

$ python openvariant_read.py
Line (linux) 0: {'PROTEIN': 'KRAS:p.G12V', 'TYPE': 'protein'}
Line (linux) 1: {'PROTEIN': 'KRAS:p.G12C', 'TYPE': 'protein'}
Line (linux) 2: {'PROTEIN': 'KRAS:p.G12S', 'TYPE': 'protein'}
Line (linux) 3: {'PROTEIN': 'KRAS:p.G12F', 'TYPE': 'protein'}
Line (linux) 4: {'PROTEIN': 'KRAS:p.G12A', 'TYPE': 'protein'}
Line (linux) 5: {'PROTEIN': 'KRAS:p.G13D', 'TYPE': 'protein'}
Line (windows) 0: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 1: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 2: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 3: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 4: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 5: {'PROTEIN': 'nan', 'TYPE': 'protein'}

The protein changes in the Linux input file are parsed correctly, while every row of the Windows input file returns "nan".
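
To confirm that two input files really differ only in their line endings, the terminators can be counted on the raw bytes. A small sketch; `count_line_endings` is a hypothetical helper, not part of openvariant.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF terminators and bare LF terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract those to get bare LFs.
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lf}


windows_sample = b"protein\r\nKRAS:p.G12V\r\n"
linux_sample = b"protein\nKRAS:p.G12V\n"

print(count_line_endings(windows_sample))  # {'crlf': 2, 'lf': 0}
print(count_line_endings(linux_sample))    # {'crlf': 0, 'lf': 2}
```

On the real files, the bytes can be read with `open("KRAS_windows.txt", "rb").read()` and passed to the same function.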

Expected result

The input file should be parsed the same way regardless of the OS on which it was created:

$ python openvariant_read.py
Line (linux) 0: {'PROTEIN': 'KRAS:p.G12V', 'TYPE': 'protein'}
Line (linux) 1: {'PROTEIN': 'KRAS:p.G12C', 'TYPE': 'protein'}
Line (linux) 2: {'PROTEIN': 'KRAS:p.G12S', 'TYPE': 'protein'}
Line (linux) 3: {'PROTEIN': 'KRAS:p.G12F', 'TYPE': 'protein'}
Line (linux) 4: {'PROTEIN': 'KRAS:p.G12A', 'TYPE': 'protein'}
Line (linux) 5: {'PROTEIN': 'KRAS:p.G13D', 'TYPE': 'protein'}
Line (windows) 0: {'PROTEIN': 'KRAS:p.G12V', 'TYPE': 'protein'}
Line (windows) 1: {'PROTEIN': 'KRAS:p.G12C', 'TYPE': 'protein'}
Line (windows) 2: {'PROTEIN': 'KRAS:p.G12S', 'TYPE': 'protein'}
Line (windows) 3: {'PROTEIN': 'KRAS:p.G12F', 'TYPE': 'protein'}
Line (windows) 4: {'PROTEIN': 'KRAS:p.G12A', 'TYPE': 'protein'}
Line (windows) 5: {'PROTEIN': 'KRAS:p.G13D', 'TYPE': 'protein'}
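
Until this is handled inside the library, one possible workaround is to rewrite the Windows file with LF line endings before passing it to `Variant`. This is only a sketch of a user-side workaround, not a proposed fix for openvariant; `normalize_line_endings` is a hypothetical helper, and it only covers uncompressed files (not the `.gz` patterns in the annotation).

```python
import tempfile
from pathlib import Path


def normalize_line_endings(src, dst):
    """Copy src to dst, rewriting Windows (CRLF) line endings as LF."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data.replace(b"\r\n", b"\n"))


# Demonstration on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    windows_file = Path(tmp) / "KRAS_windows.txt"
    fixed_file = Path(tmp) / "KRAS_windows_lf.txt"
    windows_file.write_bytes(b"protein\r\nKRAS:p.G12V\r\n")

    normalize_line_endings(windows_file, fixed_file)
    normalized = fixed_file.read_bytes()

print(normalized)  # b'protein\nKRAS:p.G12V\n'
```

The normalized copy can then be used as the `path` argument in the script above instead of the original Windows file.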

OpenVariant version

1.0.1

Python version

3.10.12

Installation method

pip install open-variant

Environment

OpenVariant installed via pip in a conda environment and run from a regular Python script (shown in the "Steps to reproduce the bug" section).

OS

Windows vs Linux

Other commentaries (optional)

No response

Contact details (optional)

No response

Labels

bug (Something isn't working)
