I don’t know how Python 3.10’s str works internally. Does it choose between 8-bit, 16-bit, and 32-bit per-character representations at runtime?
For example:
for line in open('read1.py'):
    print(line)
Can the line string be an 8-bit, 16-bit, or 32-bit string in each iteration? Would line be 8-bit by default and become a 32-bit string if that line contains an emoji?
Python uses UTF-8 as its default text encoding, and UTF-8 is a variable-width format: each character takes between one and four bytes.
A decoder first checks the high bits of each byte. If the first bit is 0, the byte is a single-byte ASCII character. A two-byte sequence starts with 110, and its second byte starts with 10. A three-byte sequence starts with 1110, and each continuation byte again starts with 10. The largest, four-byte sequences start with 11110, and each of the three continuation bytes starts with 10. The Wikipedia page on UTF-8 explains and visualizes this quite nicely.
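You can make these lead-byte patterns visible with a few lines of Python; this is just an illustration using the standard encode() method, nothing implementation-specific:

# Print the width and lead byte of the UTF-8 encoding of a few characters.
for ch in ('A', 'é', '€', '😀'):
    encoded = ch.encode('utf-8')
    print(f'{ch!r}: {len(encoded)} byte(s), lead byte {encoded[0]:08b}')

The lead bytes come out as 01000001, 11000011, 11100010, and 11110000, matching the 0, 110, 1110, and 11110 prefixes described above.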
However, CPython does not store strings as UTF-8 internally. It uses the PEP 393 flexible string representation, fixing 1, 2, or 4 bytes per character when each string is created; if it used UTF-8 internally, it wouldn’t need four versions of the split function:
case PyUnicode_1BYTE_KIND:
    if (PyUnicode_IS_ASCII(self))
        return asciilib_split_whitespace(
            self, PyUnicode_1BYTE_DATA(self), len1, maxcount);
    else
        return ucs1lib_split_whitespace(
            self, PyUnicode_1BYTE_DATA(self), len1, maxcount);
case PyUnicode_2BYTE_KIND:
    return ucs2lib_split_whitespace(
        self, PyUnicode_2BYTE_DATA(self), len1, maxcount);
case PyUnicode_4BYTE_KIND:
    return ucs4lib_split_whitespace(
        self, PyUnicode_4BYTE_DATA(self), len1, maxcount);
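You can observe the 1-, 2-, and 4-byte kinds indirectly from Python by comparing object sizes. The exact overheads are CPython implementation details and vary by version, so treat this as a sketch, not a guaranteed interface:

import sys

# 1000 characters each; only the widest character in the string matters.
ascii_s = 'a' * 1000          # every char fits in 1 byte (Latin-1 kind)
bmp_s   = 'a' * 999 + 'π'     # 'π' forces 2 bytes per char (UCS-2 kind)
emoji_s = 'a' * 999 + '😀'    # '😀' forces 4 bytes per char (UCS-4 kind)

for s in (ascii_s, bmp_s, emoji_s):
    print(len(s), sys.getsizeof(s))

The per-character cost roughly doubles at each step, which answers the original question: each line read from a file gets its own representation, chosen when the string is created, so an ASCII-only line is stored at 1 byte per character while a line containing an emoji is stored at 4.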