Fix splitting algorithm for unknown characters #32
This PR solves issue #10 by reducing the penalty term and by not allowing the algorithm to choose a "one-time" payment strategy. A detailed explanation can be found in the issue thread.
The penalty term of 25 was not chosen arbitrarily: it takes into account the maximum cost a word can be assigned. The max_cost can be calculated like this:
If you set the max cost value to 25 and solve the equation, you find that a dictionary would need more than roughly 3.3 billion entries before its most expensive word crosses the penalty term that an unknown word gets.
A penalty term of 20 corresponds to roughly 28 million entries, but 25 is definitely on the safe side and works perfectly fine with the algorithm.
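
For anyone who wants to check these numbers, here is a minimal sketch. It assumes the word cost follows a Zipf-style formula, `cost(rank) = log(rank * log(N))` with natural logarithms, so that the least frequent of `N` words costs `log(N * log(N))`. That formula is an assumption on my part, chosen because it reproduces the figures quoted above. The helper names `max_cost` and `entries_needed` are hypothetical and not part of the library.

```python
import math


def max_cost(num_words: int) -> float:
    """Cost of the least frequent word in a dictionary of num_words entries.

    Assumption: cost(rank) = log(rank * log(N)), evaluated at rank = N.
    """
    return math.log(num_words * math.log(num_words))


def entries_needed(penalty: float) -> int:
    """Smallest dictionary size whose max_cost reaches the given penalty."""
    lo, hi = 2, 2
    # Grow the upper bound until the max cost reaches the penalty.
    while max_cost(hi) < penalty:
        hi *= 2
    # Bisect; max_cost is increasing for num_words >= 2.
    while lo < hi:
        mid = (lo + hi) // 2
        if max_cost(mid) < penalty:
            lo = mid + 1
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for penalty in (20, 25):
        print(penalty, entries_needed(penalty))
```

Under that assumption, a penalty of 25 comes out at roughly 3.3 billion entries and a penalty of 20 at roughly 28 million, in line with the estimates above.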
This took me quite some time to figure out, so I would appreciate it if this could be merged.
Fix #10