Description
Describe the issue
I encountered an issue where two files with identical content behave differently when parsed with openvariant: one is parsed successfully and the other is not. After some debugging, I found that the difference comes from the OS on which each file was created.
- If the file was created on Linux, openvariant parses it successfully ✅
- If the file was created on Windows, openvariant does not parse the input properly ❌
This is caused by how line endings are handled across operating systems.
| OS | Line ending |
|---|---|
| Linux | \n |
| Windows | \r\n |
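To illustrate why the extra `\r` matters, here is a minimal sketch. It does not reproduce openvariant's internal reader; `header_of` is a hypothetical helper that splits on `\n` only, which is one plausible way for a stray `\r` to end up attached to the header name.

```python
# Illustration only: `header_of` is a hypothetical naive reader,
# not openvariant's actual implementation.
linux_bytes = b"protein\nKRAS:p.G12V\n"
windows_bytes = b"protein\r\nKRAS:p.G12V\r\n"


def header_of(raw: bytes) -> str:
    # Return the first line, splitting on the LF character only.
    return raw.decode("utf-8").split("\n")[0]


linux_header = header_of(linux_bytes)
windows_header = header_of(windows_bytes)

print(repr(linux_header))    # 'protein'
print(repr(windows_header))  # 'protein\r'

# A fieldSource entry such as "protein" matches only the Linux header.
print(linux_header == "protein")                  # True
print(windows_header == "protein")                # False
# Stripping the carriage return makes the two headers identical.
print(windows_header.rstrip("\r") == "protein")   # True
```

If the header is read as `protein\r`, it no longer matches any of the `fieldSource` names, which would be consistent with every PROTEIN value coming back as `nan`.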
Steps to reproduce the bug
One of the input files needs to be created using a Windows machine, while the other should be created using Linux. The content should be identical.
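If only one machine is available, both variants can probably be generated from a single script by writing the line endings explicitly. This is a sketch, not the way the original files were produced: the helper `write_with_newline` is hypothetical, and ending the last row with a line terminator is an assumption.

```python
from pathlib import Path

# Same rows as the TXT content listed in this report.
rows = [
    "protein",
    "KRAS:p.G12V",
    "KRAS:p.G12C",
    "KRAS:p.G12S",
    "KRAS:p.G12F",
    "KRAS:p.G12A",
    "KRAS:p.G13D",
]


def write_with_newline(path: str, newline: str) -> bytes:
    # Write in binary mode so Python does not translate line endings,
    # whatever platform the script runs on.
    data = "".join(row + newline for row in rows).encode("utf-8")
    Path(path).write_bytes(data)
    return data


linux_data = write_with_newline("KRAS_linux.txt", "\n")        # LF endings
windows_data = write_with_newline("KRAS_windows.txt", "\r\n")  # CRLF endings
```

The two files then differ only in the bytes that terminate each line.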
Content of the TXT files
protein
KRAS:p.G12V
KRAS:p.G12C
KRAS:p.G12S
KRAS:p.G12F
KRAS:p.G12A
KRAS:p.G13D
Annotation file
pattern:
  - '.*\.txt$'
  - '.*\.tsv$'
  - '.*\.txt\.gz$'
  - '.*\.tsv\.gz$'
columns:
  - PROTEIN
  - TYPE
annotation:
  - type: internal
    field: GENE
    fieldSource:
      - 'gene'
      - 'Gene'
      - 'GENE'
      - 'Symbol'
      - 'SYMBOL'
      - 'GENE_SYMBOL'
  - type: internal
    field: PROTEIN
    fieldSource:
      - "protein"
      - "Protein"
      - "PROTEIN"
  - type: internal
    field: TRANSCRIPT
    fieldSource:
      - 'transcript'
      - 'TRANSCRIPT'
      - 'ENSEMBL_TRANSCRIPT'
      - 'ensembl_transcript'
      - 'MANE_SELECT'
      - 'mane_select'
      - 'MANE'
      - 'mane'
      - 'refseq_transcript'
      - 'REFSEQ_TRANSCRIPT'
      - 'refseq'
      - 'REFSEQ'
  - type: static
    field: "TYPE"
    value: "protein"
Python script (`openvariant_read.py`)
from openvariant import Annotation, Variant

dataset_file_linux = "KRAS_linux.txt"
dataset_file_windows = "KRAS_windows.txt"
annotation_file = "protein.yaml"

annotation = Annotation(annotation_path=annotation_file)
result_linux = Variant(path=dataset_file_linux, annotation=annotation)
result_windows = Variant(path=dataset_file_windows, annotation=annotation)

for n_line, line in enumerate(result_linux.read()):
    print(f'Line (linux) {n_line}: {line}')
    if n_line == 9:
        break

for n_line, line in enumerate(result_windows.read()):
    print(f'Line (windows) {n_line}: {line}')
    if n_line == 9:
        break
Error encountered or actual result
Running the script above produces the following output:
$ python openvariant_read.py
Line (linux) 0: {'PROTEIN': 'KRAS:p.G12V', 'TYPE': 'protein'}
Line (linux) 1: {'PROTEIN': 'KRAS:p.G12C', 'TYPE': 'protein'}
Line (linux) 2: {'PROTEIN': 'KRAS:p.G12S', 'TYPE': 'protein'}
Line (linux) 3: {'PROTEIN': 'KRAS:p.G12F', 'TYPE': 'protein'}
Line (linux) 4: {'PROTEIN': 'KRAS:p.G12A', 'TYPE': 'protein'}
Line (linux) 5: {'PROTEIN': 'KRAS:p.G13D', 'TYPE': 'protein'}
Line (windows) 0: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 1: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 2: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 3: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 4: {'PROTEIN': 'nan', 'TYPE': 'protein'}
Line (windows) 5: {'PROTEIN': 'nan', 'TYPE': 'protein'}
The protein changes in the Linux input file are parsed correctly, while the Windows input file returns "nan" for every row.
Expected result
The input file should be parsed identically regardless of the OS on which it was created:
$ python openvariant_read.py
Line (linux) 0: {'PROTEIN': 'KRAS:p.G12V', 'TYPE': 'protein'}
Line (linux) 1: {'PROTEIN': 'KRAS:p.G12C', 'TYPE': 'protein'}
Line (linux) 2: {'PROTEIN': 'KRAS:p.G12S', 'TYPE': 'protein'}
Line (linux) 3: {'PROTEIN': 'KRAS:p.G12F', 'TYPE': 'protein'}
Line (linux) 4: {'PROTEIN': 'KRAS:p.G12A', 'TYPE': 'protein'}
Line (linux) 5: {'PROTEIN': 'KRAS:p.G13D', 'TYPE': 'protein'}
Line (windows) 0: {'PROTEIN': 'KRAS:p.G12V', 'TYPE': 'protein'}
Line (windows) 1: {'PROTEIN': 'KRAS:p.G12C', 'TYPE': 'protein'}
Line (windows) 2: {'PROTEIN': 'KRAS:p.G12S', 'TYPE': 'protein'}
Line (windows) 3: {'PROTEIN': 'KRAS:p.G12F', 'TYPE': 'protein'}
Line (windows) 4: {'PROTEIN': 'KRAS:p.G12A', 'TYPE': 'protein'}
Line (windows) 5: {'PROTEIN': 'KRAS:p.G13D', 'TYPE': 'protein'}
OpenVariant version
1.0.1
Python version
3.10.12
Installation method
pip install open-variant
Environment
OpenVariant was installed via pip in a conda environment and run from a regular Python script (shown in the "Steps to reproduce the bug" section).
OS
Windows vs Linux
Other commentaries (optional)
No response
Contact details (optional)
No response

