Skip to content

Conversation

@goerlitz
Copy link

As mentioned in #29 Unicode strings are not properly handled when using Python 2 and lead to exceptions.
This fix makes sure that Unicode string encoding works for Python 2 and Python 3 as well.

data = text.encode()
# ensure proper encoding of python 3 strings
if sys.version_info.major >= 3:
text = text.encode('utf-8')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not text.encode('utf-8') for python 2?

@goerlitz
Copy link
Author

goerlitz commented May 10, 2018

Short Explanation:

Python 3 uses unicode by default while python 2 differentiates between ASCII and unicode.

Therefore, if text.encode('utf-8') is applied on an (encoded) unicode string in python 2 it will result in
UnicodeDecodeError: 'ascii' codec can't decode byte ... in position 1: ordinal not in range(128).

Longer Explanation:

In python 3, type(u'Köln') and type('Köln') both give <class 'str'> and class 'str' is encoded to class 'bytes'.
In python 2, type(u'Köln') gives <type 'unicode'>, type('Köln') gives <type 'str'>, and type 'unicode' is encoded to type 'str'.

Since function annotate() requires input of type 'str' any unicode string in python 2 has to be passed as encoded 'str' and further calls of text.encode('utf-8') will result in UnicodeDecodeError.

Consequently, text.encode('utf-8') should only be called in python 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants