pandas -> polars problems + missing peptides with repeats #21

Hello,
I was interested in using your repo and running your benchmark suite, but parts of it still use pandas functions or mention pandas (see the `rg` output at the end of this issue). Some of the documentation on how to run the benchmarks is also out of date: the `-s` option doesn't seem to exist.

I also tried a test sequence and found that exact matching (`max_mismatches=0`) did not work for a query that is a subsequence of it.
A reproducible script is included below, with its stdout in the docstring at the top.
This seems to come from https://github.com/IEDB/PEPMatch/blob/master/pepmatch/matcher.py#L308, as `target_kmers` does not include duplicates.
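
To illustrate what I suspect is going on, here is a minimal sketch. It is not PEPMatch's code: `overlapping_kmers` is a hypothetical helper, and I am assuming overlapping 5-mers purely for illustration. It only shows that this query contains the same 5-mer at several offsets, so any structure that keeps one entry per distinct k-mer drops some positions.

```python
from collections import defaultdict


def overlapping_kmers(peptide, k):
    """Return (offset, k-mer) pairs for every window of length k.

    Hypothetical helper for illustration; not part of PEPMatch.
    """
    return [(i, peptide[i:i + k]) for i in range(len(peptide) - k + 1)]


query = "GGGGGGGLGSGGSIRSSY"
kmers = overlapping_kmers(query, k=5)

# Collapsing to distinct k-mers loses the repeated occurrences.
distinct = {kmer for _, kmer in kmers}

# Keeping every offset per k-mer preserves them.
offsets = defaultdict(list)
for offset, kmer in kmers:
    offsets[kmer].append(offset)

print(len(kmers), len(distinct))
print(offsets["GGGGG"])
```

For this query there are 14 windows but only 12 distinct 5-mers, because `GGGGG` occurs at offsets 0, 1 and 2. If the matcher expects one hit per k-mer position, deduplicating could explain the missing match.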


```python
"""
stdout:
Matching peptides:   0%|                                                                                                                                                                   | 0/1 [00:00<?, ?peptide/s]
Missing preprocessed file or table. Creating table for k=5. This may take a bit...
Matching peptides: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.88peptide/s]
None
"""
from pepmatch import Matcher
if __name__ == "__main__":
	
	seq_fn = "keratin.fasta"
	keratin_seq = """>sp|P35527|K1C9_HUMAN Keratin, type I cytoskeletal 9 OS=Homo sapiens OX=9606 GN=KRT9 PE=1 SV=3
MSCRQFSSSYLSRSGGGGGGGLGSGGSIRSSYSRFSSSGGGGGGGRFSSSSGYGGGSSRV
CGRGGGGSFGYSYGGGSGGGFSASSLGGGFGGGSRGFGGASGGGYSSSGGFGGGFGGGSG
GGFGGGYGSGFGGFGGFGGGAGGGDGGILTANEKSTMQELNSRLASYLDKVQALEEANND
LENKIQDWYDKKGPAAIQKNYSPYYNTIDDLKDQIVDLTVGNNKTLLDIDNTRMTLDDFR
IKFEMEQNLRQGVDADINGLRQVLDNLTMEKSDLEMQYETLQEELMALKKNHKEEMSQLT
GQNSGDVNVEINVAPGKDLTKTLNDMRQEYEQLIAKNRKDIENQYETQITQIEHEVSSSG
QEVQSSAKEVTQLRHGVQELEIELQSQLSKKAALEKSLEDTKNRYCGQLQMIQEQISNLE
AQITDVRQEIECQNQEYSLLLSIKMRLEKEIETYHNLLEGGQEDFESSGAGKIGLGGRGG
SGGSYGRGSRGGSGGSYGGGGSGGGYGGGSGSRGGSGGSYGGGSGSGGGSGGGYGGGSGG
GHSGGSGGGHSGGSGGNYGGGSGSGGGSGGGYGGGSGSRGGSGGSHGGGSGFGGESGGSY
GGGEEASGSGGGYGGGSGKSSHS"""
	with open(seq_fn, "w") as fh:
		fh.write(keratin_seq)
	seq = ["GGGGGGGLGSGGSIRSSY"]
	m = Matcher(query=seq, proteome_file=seq_fn, max_mismatches=0, k=5)
	print(m.match())
```



```
 % rg 'pandas|pd\.'
benchmarking/benchmarking.py
6:import pandas as pd
23:) -> pd.DataFrame:
49:  benchmark_df = pd.DataFrame(columns = columns)
115:    expected_df = pd.read_csv(inputs['expected'], sep='\t')
128:    new_df = pd.DataFrame([benchmark_stats], columns = columns)
129:    benchmark_df = pd.concat([benchmark_df, new_df], ignore_index = True)
137:def recall(results_df: pd.DataFrame, expected_df: pd.DataFrame) -> float:
142:    results: pandas dataframe with results from the benchmarking.
143:    expected_df: pandas dataframe with expected matches for the benchmarking."""
151:  matched_rows = pd.merge(results, expected, how='inner', on=columns)
187:  master_df['Searching Time (s)'] = pd.to_numeric(master_df['Searching Time (s)'])

README.md
24:- [Pandas](https://pandas.pydata.org/)
141:If specifying `dataframe`, the ```match()``` method will return a pandas dataframe which can be stored as a variable:

benchmarking/methods/blast.py
5:import pandas as pd
74:    df = pd.read_csv(
157:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/z.py
3:import pandas as pd
114:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/mmseqs2.py
6:import pandas as pd
118:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/horspool.py
3:import pandas as pd
106:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/boyer_moore.py
3:import pandas as pd
280:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/diamond.py
4:import pandas as pd
111:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/NmerMatch.py
9:import pandas as pd
254:    results_df = pd.DataFrame([s.split(',') for s in results], columns = columns)

benchmarking/methods/knuth_morris_pratt.py
3:import pandas as pd
125:    return pd.DataFrame(all_matches, columns = columns)

```
