| utf8Conversion {base} | R Documentation |
Convert Integer Vectors to or from UTF-8-encoded Character Vectors
Description
Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.
Usage
utf8ToInt(x)
intToUtf8(x, multiple = FALSE, allow_surrogate_pairs = FALSE)
Arguments
x | object to be converted. |
multiple | logical: should the conversion be to a single character string or multiple individual characters? |
allow_surrogate_pairs | logical: should interpretation of surrogate pairs be attempted? (See ‘Details’.)
Only supported for |
Details
These will work in any locale, including on platforms that do not otherwise support multi-byte character sets.
Unicode defines a name and a number for each of the glyphs it
encompasses: the numbers are called code points. Since RFC 3629
they run from 0 to 0x10FFFF (with about 13% being
assigned by version 13.0 of the Unicode standard and 12% reserved for
‘private use’).
intToUtf8 does not by default handle surrogate pairs: inputs in
the surrogate ranges are mapped to NA. They might occur if a
UTF-16 byte stream has been read as 2-byte integers (in the correct
byte order), in which case allow_surrogate_pairs = TRUE will
try to interpret them (with unmatched surrogate values still treated
as NA).
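The arithmetic behind interpreting a surrogate pair can be sketched as below. This is an illustration of the standard UTF-16 calculation, not the internal implementation of intToUtf8; the helper name pair_to_code_point is hypothetical.

```r
## Hypothetical helper: combine a UTF-16 high/low surrogate pair
## into a single Unicode code point (standard UTF-16 arithmetic).
pair_to_code_point <- function(hi, lo) {
    stopifnot(hi >= 0xD800, hi <= 0xDBFF, lo >= 0xDC00, lo <= 0xDFFF)
    0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
}

cp <- pair_to_code_point(0xD801, 0xDC37)
sprintf("U+%X", as.integer(cp))            # the code point in hex

## The same pair passed to intToUtf8() directly
s <- intToUtf8(c(0xD801, 0xDC37), allow_surrogate_pairs = TRUE)
identical(utf8ToInt(s), as.integer(cp))    # same code point either way
is.na(intToUtf8(c(0xD801, 0xDC37)))        # default: surrogates give NA
```

The comparison with utf8ToInt checks that the hand-computed value agrees with what intToUtf8 produces when allow_surrogate_pairs = TRUE.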
Value
utf8ToInt converts a length-one character string encoded in
UTF-8 to an integer vector of Unicode code points.
intToUtf8 converts a numeric vector of Unicode code points
either (default) to a single character string or a character vector of
single characters. Non-integral numeric values are truncated to
integers. For output to a single character string 0 is
silently omitted: otherwise 0 is mapped to "". The
Encoding of a non-NA return value is declared as
"UTF-8".
Invalid and NA inputs are mapped to NA output.
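A short sketch of the behaviour described in this section; the input values are chosen purely for illustration.

```r
## Round trip: string -> code points -> string
cps <- utf8ToInt("caf\u00e9")
cps
back <- intToUtf8(cps)
identical(back, "caf\u00e9")
Encoding(back)                     # declared encoding of the result

## One string versus one element per code point
chars <- intToUtf8(cps, multiple = TRUE)
length(chars)

## Zero is dropped from a single string, kept as "" otherwise
one <- intToUtf8(c(72, 0, 105))
many <- intToUtf8(c(72, 0, 105), multiple = TRUE)

## Non-integral values are truncated; NA stays NA
trunc_res <- intToUtf8(65.9)
na_res <- intToUtf8(NA)
```

A non-ASCII string is used for the Encoding check because ASCII-only strings are not given an encoding mark by R.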
Validity
Which code points are regarded as valid has changed over the lifetime
of UTF-8. Originally all 32-bit unsigned integers were potentially
valid and could be converted to up to 6 bytes in UTF-8. Since 2003 it
has been stated that there will never be valid code points larger than
0x10FFFF, and so valid UTF-8 encodings are never more than 4
bytes.
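The 4-byte limit can be checked with a small sketch that counts the bytes in the UTF-8 encoding of a few code points of increasing size (the code points themselves are arbitrary picks).

```r
## Bytes used by the UTF-8 encoding of code points of increasing size
cps <- c(0x41, 0x3B2, 0x20AC, 0x10437, 0x10FFFF)
nbytes <- vapply(cps,
                 function(cp) length(charToRaw(intToUtf8(cp))),
                 integer(1))
data.frame(code_point = sprintf("U+%04X", as.integer(cps)),
           bytes = nbytes)
```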
The code points in the surrogate-pair range 0xD800 to
0xDFFF are prohibited in UTF-8 and so are regarded as invalid
by utf8ToInt and by default by intToUtf8.
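A brief sketch of these rules, assuming a version of R that follows the limits described in this section:

```r
## Surrogate code points and values above 0x10FFFF give NA
sur_lo <- intToUtf8(0xD800)
sur_hi <- intToUtf8(0xDFFF)
too_big <- intToUtf8(0x110000)
c(is.na(sur_lo), is.na(sur_hi), is.na(too_big))

## The neighbours of the surrogate range are accepted
edges <- intToUtf8(c(0xD7FF, 0xE000), multiple = TRUE)
anyNA(edges)

## The largest code point is accepted by intToUtf8
top <- intToUtf8(0x10FFFF)
is.na(top)
```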
The status of ‘noncharacters’ (notably 0xFFFE and
0xFFFF) was clarified by ‘Corrigendum 9’ in 2013. These
are valid but will never be given an official interpretation. (In some
earlier versions of R utf8ToInt treated them as invalid.)
References
https://www.rfc-editor.org/rfc/rfc3629, the current standard for UTF-8.
https://www.unicode.org/versions/corrigendum9.html for noncharacters.
Examples
## will only display in some locales and fonts
intToUtf8(0x03B2L) # Greek beta
utf8ToInt("bi\u00dfchen")
utf8ToInt("\xfa\xb4\xbf\xbf\x9f")
## A valid UTF-16 surrogate pair (for U+10437)
x <- c(0xD801, 0xDC37)
intToUtf8(x)
intToUtf8(x, TRUE)
(xx <- intToUtf8(x, , TRUE)) # will only display in some locales and fonts
charToRaw(xx)
## An example of how surrogate pairs might occur
x <- "\U10437"
charToRaw(x)
foo <- tempfile()
writeLines(x, file(foo, encoding = "UTF-16LE"))
## next two are OS-specific, but are mandated by POSIX
system(paste("od -x", foo)) # 2-byte units, correct on little-endian platforms
system(paste("od -t x1", foo)) # single bytes as hex
y <- readBin(foo, "integer", 2, 2, FALSE, endian = "little")
sprintf("%X", y)
intToUtf8(y, , TRUE)