-
Notifications
You must be signed in to change notification settings - Fork 9
Open
Description
This is something that tripped me up when I was getting started with BURST. BURST seems to let you build a database + accelerator from a gzipped Fasta file without an error, but the result is non-functional. I think it would be helpful if in calls to build a new database BURST checks that the file passed to -r is an unzipped Fasta and if not, exits with an error (perhaps even suggesting to the user to unzip their fasta file), rather than proceeding. (It would also be useful if BURST suggested unzipping the fasta file when using the mapping function, as the "malformed Fasta file" error is confusing if you are used to programs that take plain or gzipped Fastas)
test ❯ burst_linux_DB12 -r GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -o db.edx -d DNA -s
This is BURST [v1.0 DB 12]
--> Using accelerator file db.acx
--> Creating DNA database (assuming max query length 500)
--> Shearing references longer than 500
Using up to AVX-128 with 4 threads.
Parsed 21 references.
Initiating database shearing procedure [shear 515, ov 515].
Using compressive optimization (1 partitions; ~21 refs).
[0] First pass: populated bin counters [0.218691]
--> Out of the 966143 original places, 0 are eligible.
[0] Second pass: shift count vector [0.059528]
[0] Third pass: add pointers to Ptr vector [0.000518]
[0] Fourth pass: sort segments [1.010988]
[0] Fifth pass: tally 0=-nan (0=-nan) [1.022772]
--> Max chain = 0, max sh = 0
--> Bounds: max 0 -> sh3 0 -> sh2 0 -> sh1 0
[0] Sixth pass: populate duplicates [1.035079]
-->CompDB: All sequences tallied.
-->CompDB: Beginning rebase...
Rebased [21 orig] --> [1768 rebase] --> [1819 adj]
Duplicate-guided shearing [0.003846]
Producing edb scaffold...
Avg. R Divergence: 5.521165 (0 dupes, 1819 uniq)
There are 1819 references and hence 113 clumps (+1)
Average R pack length = 1016.578947
Writing database...
--> neoEDB: Original pack table: 115890 [now 57963]
--> neoEDB: Number read: 115890, written: 57963
Database written.
Generating accelerator 'db.acx'
Note: N-penalized accelerator not usable for unpenalized alignment
Finished scanning pass [0.007691].
Number bad (initial) = 0
Filling accelerator [0 / 114]
Done scanning [0.004500]; writing...
Total accelerants stored: 0 (0 bad)
--> [Re-Accel] Writing SMALL format acx...
Wrote accelerator (0.389736).
test ❯ l
total 67M
drwxr-xr-x 2 michael michael 4.0K Sep 29 17:26 .
drwx------ 95 michael michael 12K Sep 29 17:26 ..
-rw-r--r-- 1 michael michael 65M Sep 29 17:26 db.acx
-rw-r--r-- 1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r-- 1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna.gz -a db.acx -r db.edx -o output.txt
This is BURST [v1.0 DB 12]
--> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
--> [Accel] Accelerator found. Parsing...
--> [Accel] Total accelerants: 0 [bytes = 0]
--> [Accel] Reading 0 ambiguous entries
EDB database provided. Parsing...
--> EDB: Fingerprints are DISABLED
--> EDB: Parsing compressed headers
--> EDB: Sheared database (shear size = 515)
--> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: Malformatted FASTA file.
test ❯ l
total 67M
drwxr-xr-x 2 michael michael 4.0K Sep 29 17:27 .
drwx------ 95 michael michael 12K Sep 29 17:27 ..
-rw-r--r-- 1 michael michael 65M Sep 29 17:26 db.acx
-rw-r--r-- 1 michael michael 931K Sep 29 17:26 db.edx
-rw-r--r-- 1 michael michael 1.8M Sep 29 17:25 GCF_000011065.1_ASM1106v1_genomic.fna.gz
-rw-r--r-- 1 michael michael 0 Sep 29 17:27 output.txt
test ❯ gunzip GCF_000011065.1_ASM1106v1_genomic.fna.gz
test ❯ burst_linux_DB12 -q GCF_000011065.1_ASM1106v1_genomic.fna -a db.acx -r db.edx -o output.txt
This is BURST [v1.0 DB 12]
--> Using accelerator file db.acx
Using up to AVX-128 with 4 threads.
--> [Accel] Accelerator found. Parsing...
--> [Accel] Total accelerants: 0 [bytes = 0]
--> [Accel] Reading 0 ambiguous entries
EDB database provided. Parsing...
--> EDB: Fingerprints are DISABLED
--> EDB: Parsing compressed headers
--> EDB: Sheared database (shear size = 515)
--> EDB: 1819 refs [1819 orig], 114 clumps, 1030 maxR
ERROR: line count != '>' * 2
Metadata
Metadata
Assignees
Labels
No labels