The internal representation is one of latin-1, UCS-2 or UCS-4, depending on the widest character in the string. UCS here means that each code unit is a fixed 2 or 4 bytes long and is numerically equal to the corresponding code point. We can check this by finding where the size of the code units changes.
To show that they range from 1 byte per character in latin-1 to 4 bytes per character in UCS-4:
>>> from sys import getsizeof
>>> getsizeof('')
49
>>> getsizeof('a')  # + 1 byte, as the representation here is latin-1
50
>>> getsizeof('\U0010ffff')
80
>>> getsizeof('\U0010ffff\U0010ffff')  # + 4 bytes, as the representation here is UCS-4
84
We can check that the initial representation is indeed latin-1 and not UTF-8, because the change to a 2-byte code unit happens at the byte boundary (\U000000ff - \U00000100) and not at the \U0000007f - \U00000080 boundary as in UTF-8:
>>> getsizeof('\U0000007f')
50
>>> getsizeof('\U00000080')  # The size of the string changes at the \x7f - \x80 boundary (larger header for non-ASCII strings), but..
74
>>> getsizeof('\U00000080\U00000080')  # ..the code unit is still 1 byte, so not UTF-8
75
>>> getsizeof('\U000000ff')
74
>>> getsizeof('\U000000ff\U000000ff')  # (+ 1 byte)
75
>>> getsizeof('\U00000100')
76
>>> getsizeof('\U00000100\U00000100')  # Size changes at the byte boundary (+ 2 bytes). Representation is UCS-2.
78
>>> getsizeof('\U0000ffff')
76
>>> getsizeof('\U0000ffff\U0000ffff')  # (+ 2 bytes)
78
>>> getsizeof('\U00010000')
80
>>> getsizeof('\U00010000\U00010000')  # (+ 4 bytes) The size of the code unit changes to 4 at a byte boundary again.
84
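
The same check can be automated rather than read off by hand. The sketch below is only an illustration: `bytes_per_char` is a hypothetical helper name, and it assumes CPython 3.3 or later, where `sys.getsizeof` reports the string's actual storage. The absolute sizes (49, 74, ...) differ between versions and platforms, so the helper only looks at the difference between two strings with the same header.

```python
import sys

def bytes_per_char(ch):
    """Estimate how many bytes CPython stores per character of ch.

    Hypothetical helper: it compares two freshly built strings that
    differ by exactly one character, so the fixed header cancels out
    and only the width of one code unit remains.
    """
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)

# Probe both sides of each boundary discussed above.
samples = [
    ("ASCII", "\x7f"),
    ("latin-1", "\x80"),
    ("latin-1", "\xff"),
    ("UCS-2", "\u0100"),
    ("UCS-2", "\uffff"),
    ("UCS-4", "\U00010000"),
    ("UCS-4", "\U0010ffff"),
]

for label, ch in samples:
    print(f"U+{ord(ch):06X}  {label:8} {bytes_per_char(ch)} byte(s) per character")
```

If the width changed at U+0080 instead of U+0100, that would point to UTF-8; under the assumptions above the 1-byte width should hold up to U+00FF.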