String performance #5

@walterdejong

The performance of the String class is rather poor, because its methods call utf8_decode() all the time. This is a consequence of the design decision to store the String internally as UTF-8 while presenting it as a string of characters rather than bytes.
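To illustrate why this design is slow, here is a minimal sketch in Python, not the library's actual code. The names utf8_seq_len, char_at and char_len are hypothetical. It assumes a valid UTF-8 buffer and shows that finding the n-th character means decoding every sequence before it, so both indexing and length are linear in the size of the string.

```python
def utf8_seq_len(lead: int) -> int:
    """Return the byte length of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def char_len(buf: bytes) -> int:
    """Count characters by walking every sequence: O(n) in the byte length."""
    count = 0
    pos = 0
    while pos < len(buf):
        pos += utf8_seq_len(buf[pos])
        count += 1
    return count


def char_at(buf: bytes, index: int) -> str:
    """Return the index-th character by scanning from the start of the buffer."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0
    # Skip over `index` complete sequences; this loop is the hidden cost.
    for _ in range(index):
        if pos >= len(buf):
            raise IndexError("character index out of range")
        pos += utf8_seq_len(buf[pos])
    if pos >= len(buf):
        raise IndexError("character index out of range")
    n = utf8_seq_len(buf[pos])
    return buf[pos:pos + n].decode("utf-8")


data = "普通话/普通話".encode("utf-8")
print(len(data), char_len(data), char_at(data, 3))
```

A loop that calls char_at for every index therefore does quadratic work, which is the kind of overhead described above.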

It would probably be better to provide two classes, a String that is a UTF-8 byte string and a String32 (or uString) that holds UTF-32 code points, and let the programmer decide which one she wants to use.
Python 2 makes the same distinction between its str and unicode types:

>>> s = '普通话/普通話'
>>> s
'\xe6\x99\xae\xe9\x80\x9a\xe8\xaf\x9d/\xe6\x99\xae\xe9\x80\x9a\xe8\xa9\xb1'
>>> len(s)
19
>>> s[0]
'\xe6'
>>> s[1]
'\x99'
>>> s[2]
'\xae'

>>> us = u'普通话/普通話'
>>> len(us)
7
>>> us[0]
u'\u666e'

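(This example demonstrates the behavior of len and operator[]: the byte string counts and indexes bytes, while the unicode string counts and indexes characters.)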
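For readers on Python 3, a rough equivalent of the session above is sketched below. The same split exists there as bytes versus str; the variable names text and raw are only for illustration. Note that indexing a bytes object in Python 3 yields an integer rather than a one-byte string.

```python
text = "普通话/普通話"          # str: a sequence of code points
raw = text.encode("utf-8")      # bytes: the UTF-8 encoded form

# The byte string counts and indexes bytes.
print(len(raw))                 # 19
print(hex(raw[0]))              # 0xe6
print(raw[:3])                  # b'\xe6\x99\xae'

# The text string counts and indexes characters.
print(len(text))                # 7
print(ascii(text[0]))           # '\u666e'
```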

Note that changing the design of String is a major change that would break backwards compatibility.
