pandas -> polars problems + missing peptides with repeats #21

@danpf

Hello,
I was interested in using your repo and running your benchmark suite, but parts of it still use pandas functions or mention pandas (shown below), and some of the documentation on how to run it is out of date ("-s" doesn't seem to exist).

I also tried a test sequence and found that an exact-match search for a subsequence of it didn't work (reproducible script below).
This seems to come from https://github.com/IEDB/PEPMatch/blob/master/pepmatch/matcher.py#L308, as target_kmers does not include duplicates.

"""
stdout:
Matching peptides:   0%|                                                                                                                                                                   | 0/1 [00:00<?, ?peptide/s]
Missing preprocessed file or table. Creating table for k=5. This may take a bit...
Matching peptides: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.88peptide/s]
None
"""
from pepmatch import Matcher
if __name__ == "__main__":
	
	seq_fn = "keratin.fasta"
	keratin_seq = """>sp|P35527|K1C9_HUMAN Keratin, type I cytoskeletal 9 OS=Homo sapiens OX=9606 GN=KRT9 PE=1 SV=3
MSCRQFSSSYLSRSGGGGGGGLGSGGSIRSSYSRFSSSGGGGGGGRFSSSSGYGGGSSRV
CGRGGGGSFGYSYGGGSGGGFSASSLGGGFGGGSRGFGGASGGGYSSSGGFGGGFGGGSG
GGFGGGYGSGFGGFGGFGGGAGGGDGGILTANEKSTMQELNSRLASYLDKVQALEEANND
LENKIQDWYDKKGPAAIQKNYSPYYNTIDDLKDQIVDLTVGNNKTLLDIDNTRMTLDDFR
IKFEMEQNLRQGVDADINGLRQVLDNLTMEKSDLEMQYETLQEELMALKKNHKEEMSQLT
GQNSGDVNVEINVAPGKDLTKTLNDMRQEYEQLIAKNRKDIENQYETQITQIEHEVSSSG
QEVQSSAKEVTQLRHGVQELEIELQSQLSKKAALEKSLEDTKNRYCGQLQMIQEQISNLE
AQITDVRQEIECQNQEYSLLLSIKMRLEKEIETYHNLLEGGQEDFESSGAGKIGLGGRGG
SGGSYGRGSRGGSGGSYGGGGSGGGYGGGSGSRGGSGGSYGGGSGSGGGSGGGYGGGSGG
GHSGGSGGGHSGGSGGNYGGGSGSGGGSGGGYGGGSGSRGGSGGSHGGGSGFGGESGGSY
GGGEEASGSGGGYGGGSGKSSHS"""
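	# Write the test proteome to disk so Matcher can preprocess it.
	with open(seq_fn, "w") as fh:
		fh.write(keratin_seq)
	# 18-residue peptide copied from the first line of the sequence above;
	# it starts with a run of seven G's, so its first three 5-mers are identical.
	seq = ["GGGGGGGLGSGGSIRSSY"]
	m = Matcher(query=seq, proteome_file=seq_fn, max_mismatches=0, k=5)
	print(m.match())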
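
To illustrate what I mean by duplicates being dropped, here is a toy sketch of one way a k-mer voting search can miss a peptide with repeats. This is not PEPMatch's actual code: `kmers_with_offsets`, `find_exact`, and the `dedupe` switch are hypothetical names made up for the example, and I am only assuming the real cause is similar. The protein string is the first line of the keratin sequence from the script above.

```python
from collections import Counter, defaultdict

# First FASTA line of the keratin sequence and the query peptide from the issue.
PROTEIN = "MSCRQFSSSYLSRSGGGGGGGLGSGGSIRSSYSRFSSSGGGGGGGRFSSSSGYGGGSSRV"
PEPTIDE = "GGGGGGGLGSGGSIRSSY"


def kmers_with_offsets(seq, k):
    """Return every overlapping k-mer of seq paired with its start offset."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]


def find_exact(peptide, protein, k, dedupe):
    """Return 0-based start positions where peptide occurs in protein.

    Hypothetical voting scheme: each peptide k-mer votes for the start
    position it implies, and a start counts as a match only if every
    k-mer position of the peptide voted for it.
    """
    # Index: k-mer -> all positions where it occurs in the protein.
    index = defaultdict(list)
    for kmer, pos in kmers_with_offsets(protein, k):
        index[kmer].append(pos)

    targets = kmers_with_offsets(peptide, k)
    needed = len(targets)  # votes required for a full match
    if dedupe:
        # Collapsing to distinct k-mers keeps only one offset per k-mer,
        # so repeated k-mers (here "GGGGG" at offsets 0, 1, 2) lose votes.
        targets = list(dict(targets).items())

    votes = Counter()
    for kmer, offset in targets:
        for pos in index.get(kmer, []):
            votes[pos - offset] += 1
    return sorted(s for s, n in votes.items() if s >= 0 and n == needed)


print(find_exact(PEPTIDE, PROTEIN, k=5, dedupe=False))  # [14]
print(find_exact(PEPTIDE, PROTEIN, k=5, dedupe=True))   # []
```

With duplicates kept, the peptide is found at position 14 (0-based). With duplicates collapsed, only 12 of the 14 required votes can ever be cast, so nothing is reported even though the peptide is present.
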

% rg 'pandas|pd\.'
benchmarking/benchmarking.py
6:import pandas as pd
23:) -> pd.DataFrame:
49:  benchmark_df = pd.DataFrame(columns = columns)
115:    expected_df = pd.read_csv(inputs['expected'], sep='\t')
128:    new_df = pd.DataFrame([benchmark_stats], columns = columns)
129:    benchmark_df = pd.concat([benchmark_df, new_df], ignore_index = True)
137:def recall(results_df: pd.DataFrame, expected_df: pd.DataFrame) -> float:
142:    results: pandas dataframe with results from the benchmarking.
143:    expected_df: pandas dataframe with expected matches for the benchmarking."""
151:  matched_rows = pd.merge(results, expected, how='inner', on=columns)
187:  master_df['Searching Time (s)'] = pd.to_numeric(master_df['Searching Time (s)'])

README.md
24:- [Pandas](https://pandas.pydata.org/)
141:If specifying `dataframe`, the ```match()``` method will return a pandas dataframe which can be stored as a variable:

benchmarking/methods/blast.py
5:import pandas as pd
74:    df = pd.read_csv(
157:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/z.py
3:import pandas as pd
114:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/mmseqs2.py
6:import pandas as pd
118:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/horspool.py
3:import pandas as pd
106:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/boyer_moore.py
3:import pandas as pd
280:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/diamond.py
4:import pandas as pd
111:    return pd.DataFrame(all_matches, columns = columns)

benchmarking/methods/NmerMatch.py
9:import pandas as pd
254:    results_df = pd.DataFrame([s.split(',') for s in results], columns = columns)

benchmarking/methods/knuth_morris_pratt.py
3:import pandas as pd
125:    return pd.DataFrame(all_matches, columns = columns)
