Fixed Unicode string encoding issues for python 2/3. #30

goerlitz · 2018-03-18T16:02:39Z

As mentioned in #29, Unicode strings are not handled properly under Python 2 and lead to exceptions.
This fix makes sure that Unicode string encoding works under both Python 2 and Python 3.

TheJefe · 2018-04-24T17:36:56Z

pycorenlp/corenlp.py

-        data = text.encode()
+        # ensure proper encoding of python 3 strings
+        if sys.version_info.major >= 3:
+            text = text.encode('utf-8')


why not text.encode('utf-8') for python 2?

goerlitz · 2018-05-10T00:49:06Z

Short Explanation:

Python 3 strings are Unicode by default, while Python 2 distinguishes between byte strings (type 'str', treated as ASCII by default) and Unicode strings (type 'unicode').

Therefore, if text.encode('utf-8') is applied to an already encoded string (type 'str') in Python 2, Python first decodes it implicitly with the ASCII codec, and any non-ASCII byte results in
UnicodeDecodeError: 'ascii' codec can't decode byte ... in position 1: ordinal not in range(128).
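
The implicit decoding step can be made visible by performing it explicitly. The snippet below is only an illustrative sketch, not code from this PR: it encodes 'Köln' to UTF-8 bytes and then decodes those bytes with the ASCII codec, which is what Python 2 does behind the scenes before it can encode a 'str' again. It behaves the same way under Python 2 and Python 3.

```python
# Illustrative sketch: reproduce the implicit ASCII decode explicitly.
# u'K\xf6ln' is 'Köln' written with an escape so the source file stays ASCII.
encoded = u'K\xf6ln'.encode('utf-8')  # UTF-8 bytes: 'ö' becomes two bytes

try:
    # Python 2 performs this step implicitly when .encode() is called
    # on a byte string ('str') instead of a 'unicode' object.
    encoded.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first byte of the encoded 'ö' sits at index 1, which is why the error message reports position 1.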

Longer Explanation:

In Python 3, type(u'Köln') and type('Köln') both give <class 'str'>, and class 'str' is encoded to class 'bytes'.
In Python 2, type(u'Köln') gives <type 'unicode'>, type('Köln') gives <type 'str'>, and type 'unicode' is encoded to type 'str'.
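
As a quick check of the type relationships described above, the following sketch (an illustration, not part of the PR) encodes a Unicode string and inspects the result. The printed type names differ between interpreters as explained in the text, but the encoded result is a byte string in both.

```python
# Illustrative sketch: what encoding a Unicode string produces.
text = u'K\xf6ln'               # 'Köln' as a Unicode string
encoded = text.encode('utf-8')  # byte string holding the UTF-8 encoding

# Python 3 prints 'str' then 'bytes'; Python 2 prints 'unicode' then 'str'.
print(type(text).__name__)
print(type(encoded).__name__)

# The Unicode string has 4 characters, the UTF-8 encoding has 5 bytes,
# because 'ö' needs two bytes in UTF-8.
print(len(text))
print(len(encoded))
```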

Since annotate() requires input of type 'str', any Unicode string in Python 2 has to be passed as an encoded 'str', and a further call to text.encode('utf-8') will result in a UnicodeDecodeError whenever the text contains non-ASCII characters.

Consequently, text.encode('utf-8') should only be called in python 3.
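
For comparison, a type-based check is one possible alternative to the version check. The sketch below is an assumption for illustration only and is not what this PR implements: the helper name to_utf8_bytes and the text_type alias are hypothetical. It encodes only objects that are Unicode text and passes already encoded byte strings through unchanged.

```python
import sys

# Hypothetical alias for "the Unicode text type" of the running interpreter.
if sys.version_info.major >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)


def to_utf8_bytes(text):
    """Hypothetical helper: return UTF-8 bytes for text or byte input."""
    if isinstance(text, text_type):
        # Unicode text: safe to encode on both Python 2 and Python 3.
        return text.encode('utf-8')
    # Already a byte string: return as-is to avoid the implicit ASCII decode.
    return text


print(to_utf8_bytes(u'K\xf6ln') == to_utf8_bytes(b'K\xc3\xb6ln'))
```

Unlike the version check, this variant would also accept Unicode objects under Python 2, at the cost of assuming that any byte string passed in is already UTF-8.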

fixed Unicode string encoding issues for python 2/3.

1f0f240

TheJefe reviewed Apr 24, 2018
