@timminator timminator commented Jun 23, 2025

This PR solves issue #10 by reducing the penalty term and by preventing the algorithm from choosing a "one-time" payment strategy. A detailed explanation can be found in the issue thread.

The penalty term of 25 was not chosen arbitrarily: it is based on the maximum cost a word can be assigned. This maximum cost can be calculated as follows, where $N$ is the number of entries in the dictionary:

$$\text{Cost}_{\text{max}} = \ln(N \cdot \ln (N))$$

Setting the maximum cost to 25 and solving the equation for $N$ shows that a dictionary would need more than roughly 3.3 billion entries before a known word's cost could exceed the penalty term assigned to an unknown word.
A penalty term of 20 corresponds to roughly 28 million entries, but 25 is definitely on the safe side and works perfectly fine with the algorithm.
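
The figures above can be checked numerically. The following is only a minimal sketch, not code from this PR: the function names `max_word_cost` and `dictionary_size_for_cost` are hypothetical, and it assumes the maximum cost is exactly $\ln(N \cdot \ln(N))$ as stated above. Since the equation has no closed-form solution for $N$, the sketch uses simple bisection.

```python
import math


def max_word_cost(n: float) -> float:
    """Maximum word cost for a dictionary with n entries: ln(n * ln(n))."""
    return math.log(n * math.log(n))


def dictionary_size_for_cost(cost: float) -> float:
    """Solve ln(N * ln(N)) = cost for N by bisection (hypothetical helper).

    The left-hand side grows monotonically for N > 1, so bisection on a
    wide bracket is sufficient.
    """
    lo, hi = 2.0, 1e15  # bracket assumed wide enough for the costs of interest
    for _ in range(200):
        mid = (lo + hi) / 2
        if max_word_cost(mid) < cost:
            lo = mid
        else:
            hi = mid
    return hi


# Dictionary size at which the maximum word cost reaches each penalty term.
for penalty in (20, 25):
    print(penalty, f"{dictionary_size_for_cost(penalty):.3g}")
```

Running this should give values close to the 28 million and 3.3 billion entries quoted above for penalty terms of 20 and 25.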

This took me quite some time to figure out, so I would appreciate it if this could be merged.

Fix #10

Successfully merging this pull request may close these issues.

LanguageModel split fails when there is unrecognized characters