Skip to content

Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring #11

@schmamps

Description

@schmamps

I expect that the issues and PRs I've posted here are going to my own fork and no further, but if anyone out there is reading, your feedback is welcome.

In Parser.getKeywords():

# ...
uniqueWords = list(set(words))

keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords] 
keywords = sorted(keywords, key=lambda x: -x['count'])

# ... returns: ...
keywords[0]  = {'word': 'foo', 'count': 100}  # rank: 1st (most common keyword)
keywords[1]  = {'word': 'eggs', 'count': 25}  # rank: 9th
keywords[2]  = {'word': 'bacon', 'count': 32} # rank: 8th
# ...
keywords[10] = {'word': 'spam', 'count': 25}  # rank: 9th
keywords[11] = {'word': 'bar', 'count': 19}   # rank: 13th
keywords[12] = {'word': 'ham', 'count': 25}   # rank: 9th
# ...

set() does not return an ordered enumerable, so uniqueWords is an unordered list. Iterating it to build keywords means this list is also unordered, so the return order of sorted() on the count key alone is inconsistent.

Down in Summarizer.summarize() it means sentence scores are also inconsistent:

# ... 
(keywords, wordCount) = self.parser.getKeywords(text)

topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)

# ... iterating sentences ...
sbsFeature = self.sbs(words, topKeywords, keywordList)
dbsFeature = self.dbs(words, topKeywords, keywordList)

# ... calculate sentence score based on these features ...
# ... 

For consistency, when the summarize method calls getTopKeywords(), should it still pass a fixed list of ten keywords that's been sorted with more tiebreakers…

# replace: keywords = sorted(keywords, key=lambda x: -x['count'])
keywords = sorted(keywords, key=lambda x: (-x['count'], -len(x['word'], x['word']))

…every keyword that ranks 10th or better, i.e. this comprehension…

topKeywordSlice = [kw for kw in keywords if kw['count'] >= keywords[9]['count']]

… or maybe both?

Computationally, the advanced sort seems unnecessarily expensive, but I don't know if there's a rationale for exactly ten top keywords. What's the best way to make this work?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions