UTF-8
Description
Unicode is a standard for the consistent encoding, representation, and handling of text in the world's writing systems. In Unicode, code points are written as “U+” followed by a hexadecimal number. For example, U+2623 is the biohazard sign ☣. Many, but not all, code points map one-to-one to single characters.
Note
For several years, the software industry held unfortunate misconceptions about what “Unicode” meant. Many people came to associate the word “Unicode” with nothing more than a limited, fixed 16-bit character encoding.
UTF-8 is part of the Unicode Standard. It is an octet-based, variable-length character encoding used to encode Unicode code points. UTF-8 is backwards compatible with 7-bit ASCII. When the eighth, most significant bit of an octet is set, the code point data was split across two to four octets; the fixed bits adjacent to that bit indicate how many, as depicted in the following figure:
Embedul.ar natively supports UTF-8 character strings as regular string literals stored in UTF-8 encoded files, as in the following example:
// Save the source code file with UTF-8 encoding to store a proper
// multi-octet sequence.
const char * utf8_string = "こんにちはせかい!";
By design, the framework only supports the Basic Multilingual Plane or BMP (code points U+0000 to U+FFFF) mainly to save memory and decrease implementation complexity. UTF-8 four-octet encoded sequences (code points above U+FFFF) will raise errors or assertions, depending on context.
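The octet layout described above, together with the three-octet limit that BMP-only support implies, can be sketched in C as follows. This is a minimal illustration and not framework code: `SketchSequenceLength` and `SketchLeadIsBmp` are hypothetical helper names, and the classification looks only at the fixed leading bits of the first octet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the framework API). Returns how many
 * octets a sequence starting with Lead should span according to its fixed
 * leading bits, or 0 if Lead cannot start a sequence. */
uint8_t SketchSequenceLength (const uint8_t Lead)
{
    if ((Lead & 0x80) == 0x00) return 1;    /* 0xxxxxxx: 7-bit ASCII     */
    if ((Lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx: U+0080..U+07FF  */
    if ((Lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx: U+0800..U+FFFF  */
    if ((Lead & 0xF8) == 0xF0) return 4;    /* 11110xxx: above the BMP   */
    return 0;                               /* 10xxxxxx or 11111xxx      */
}

/* Hypothetical helper: nonzero when the lead octet announces a sequence
 * of one to three octets, the only lengths a BMP code point can need. */
int SketchLeadIsBmp (const uint8_t Lead)
{
    const uint8_t Length = SketchSequenceLength (Lead);
    return Length >= 1 && Length <= 3;
}

void SketchSequenceLengthDemo (void)
{
    assert (SketchSequenceLength ('!') == 1);     /* ASCII                  */
    assert (SketchSequenceLength (0xE3) == 3);    /* lead octet of hiragana */
    assert (!SketchLeadIsBmp (0xF0));             /* four-octet sequence    */
}
```

A lead octet of the form 11110xxx is the case that the framework reports as an error or assertion; the sketch merely flags it.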
API guide
Code points and UTF-8 encoded buffers
UTF-8 is a variable-octet encoding: depending on the code point represented, a character may need more than one octet. The following functions encode a code point to, or decode it from, a UTF-8 encoded buffer.
Handling UTF-8 encoded buffer data
The variable-octet encoding nature of UTF-8 requires specialized functions to check for ill-formed data, resync the data stream to the first valid character in the event of an invalid sequence, count the number of UTF-8 characters, and remove characters without breaking the encoding.
Design and development status
Feature-complete.
Changelog
Version | Date* | Author | Comment |
---|---|---|---|
1.0.0 | 2022.9.7 | sgermino | Initial release. |
* Date format is Year.Month.Day.
API reference
-
struct UTF8_GetCodePointResult
Information returned by UTF8_GetCodePoint().
-
uint16_t dataLength
Data octets used to extract UTF8_GetCodePointResult.codepoint.
-
uint16_t codepoint
Code point in the Basic Multilingual Plane.
-
struct UTF8_CodePointRange
Define a code point range.
-
uint16_t begin
From which code point (inclusive).
-
uint16_t end
To which code point (inclusive).
-
struct UTF8_CheckResult
Information returned by UTF8_Check().
-
uint32_t validChars
Number of valid Basic Multilingual Plane characters.
-
uint32_t invalidOctets
Number of invalid octets in the variable-width encoding.
-
_Bool rangePassed
true if every character is inside the specified ranges, false otherwise.
-
uint8_t UTF8_SetCodePoint(uint8_t *const Data, const uint32_t Octets, const uint16_t CodePoint)
Encodes a UTF-8 character. This function will only encode a variable-width character up to three octets in length, that is, a code point from the Unicode Basic Multilingual Plane (U+0000 through U+FFFF).
- Parameters
Data – An array of uint8_t containing enough space to store the encoded code point (as many as three octets).
Octets – Element count in Data.
CodePoint – Unicode code point from the Basic Multilingual Plane.
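To illustrate the encoding step, here is a minimal stand-alone sketch that follows the same parameter list. `SketchSetCodePoint` is a hypothetical name and not the framework's implementation. The return value (number of octets written, or 0 when the buffer is too small) is an assumption, since the reference above does not document it, and surrogate code points receive no special treatment here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in with the documented parameter list.
 * ASSUMPTION: returns the number of octets written, 0 if Data is too small. */
uint8_t SketchSetCodePoint (uint8_t *const Data, const uint32_t Octets,
                            const uint16_t CodePoint)
{
    if (CodePoint < 0x80)
    {
        /* One octet: 0xxxxxxx */
        if (Octets < 1) return 0;
        Data[0] = (uint8_t) CodePoint;
        return 1;
    }

    if (CodePoint < 0x800)
    {
        /* Two octets: 110xxxxx 10xxxxxx */
        if (Octets < 2) return 0;
        Data[0] = (uint8_t) (0xC0 | (CodePoint >> 6));
        Data[1] = (uint8_t) (0x80 | (CodePoint & 0x3F));
        return 2;
    }

    /* Three octets: 1110xxxx 10xxxxxx 10xxxxxx */
    if (Octets < 3) return 0;
    Data[0] = (uint8_t) (0xE0 | (CodePoint >> 12));
    Data[1] = (uint8_t) (0x80 | ((CodePoint >> 6) & 0x3F));
    Data[2] = (uint8_t) (0x80 | (CodePoint & 0x3F));
    return 3;
}

void SketchSetCodePointDemo (void)
{
    uint8_t buffer[3];

    /* U+2623 (biohazard sign) encodes as E2 98 A3. */
    assert (SketchSetCodePoint (buffer, sizeof buffer, 0x2623) == 3);
    assert (buffer[0] == 0xE2 && buffer[1] == 0x98 && buffer[2] == 0xA3);
}
```

Because the `CodePoint` parameter is a `uint16_t`, values above U+FFFF cannot be passed at all, which matches the BMP-only design.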
-
struct UTF8_GetCodePointResult UTF8_GetCodePoint(const uint8_t *const Data, const uint32_t Octets)
Decodes a UTF-8 character. This function will only assemble variable-width characters up to three octets in length, enough to decode all code points from the Unicode Basic Multilingual Plane (U+0000 through U+FFFF).
- Parameters
Data – An array of uint8_t containing UTF-8 encoded code points.
Octets – Element count in Data.
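The decoding step can be sketched the same way. The code below is a stand-alone illustration, not the framework's implementation: `SketchGetCodePoint` and `struct SketchCodePointResult` are hypothetical names that mirror the documented fields. Reporting an ill-formed or unsupported sequence with a `dataLength` of 0 is an assumption made for this sketch; the framework itself may raise an error or assertion instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the documented result structure. */
struct SketchCodePointResult
{
    uint16_t dataLength;    /* Octets consumed; 0 = ill-formed (ASSUMPTION) */
    uint16_t codepoint;     /* Decoded BMP code point                       */
};

/* Nonzero when Octet has the continuation form 10xxxxxx. */
static int SketchIsContinuation (const uint8_t Octet)
{
    return (Octet & 0xC0) == 0x80;
}

struct SketchCodePointResult SketchGetCodePoint (const uint8_t *const Data,
                                                 const uint32_t Octets)
{
    struct SketchCodePointResult r = { 0, 0 };
    uint16_t cp;

    if (Octets < 1) return r;

    if (Data[0] < 0x80)
    {
        r.dataLength = 1;
        r.codepoint  = Data[0];
    }
    else if ((Data[0] & 0xE0) == 0xC0)
    {
        if (Octets < 2 || !SketchIsContinuation (Data[1])) return r;
        cp = (uint16_t) (((Data[0] & 0x1F) << 6) | (Data[1] & 0x3F));
        if (cp < 0x80) return r;                    /* overlong form */
        r.dataLength = 2;
        r.codepoint  = cp;
    }
    else if ((Data[0] & 0xF0) == 0xE0)
    {
        if (Octets < 3 || !SketchIsContinuation (Data[1])
                       || !SketchIsContinuation (Data[2])) return r;
        cp = (uint16_t) (((Data[0] & 0x0F) << 12)
                       | ((Data[1] & 0x3F) << 6)
                       |  (Data[2] & 0x3F));
        if (cp < 0x800) return r;                   /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return r; /* surrogates    */
        r.dataLength = 3;
        r.codepoint  = cp;
    }

    /* Continuation octets, four-octet leads and invalid octets fall
     * through with dataLength still 0. */
    return r;
}

void SketchGetCodePointDemo (void)
{
    const uint8_t biohazard[] = { 0xE2, 0x98, 0xA3 };    /* U+2623 */
    const struct SketchCodePointResult r =
        SketchGetCodePoint (biohazard, sizeof biohazard);

    assert (r.dataLength == 3 && r.codepoint == 0x2623);
}
```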
-
struct UTF8_CheckResult UTF8_Check(const uint8_t *const Data, const size_t Octets, const struct UTF8_CodePointRange *const Ranges, const uint32_t RangeCount)
Checks that UTF-8 encoded characters fall within the given code point ranges and returns the number of valid Basic Multilingual Plane characters. This function detects and assembles each well-formed UTF-8 character by using UTF8_GetCodePoint().
- Parameters
Data – An array of uint8_t elements containing UTF-8 encoded characters.
Octets – Element count in Data.
Ranges – An array of UTF8_CodePointRange with code point ranges. An out-of-range character will fail the range test. This parameter may be NULL, in which case there will be no range checking.
RangeCount – Number of UTF8_CodePointRange ranges in the Ranges array, or zero if Ranges is NULL.
- Returns
UTF8_CheckResult with results.
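The range test applied to each decoded character might look like the sketch below. `struct SketchRange` and `SketchInRanges` are hypothetical names that mirror `UTF8_CodePointRange`. The sketch assumes that a character passes when it lies inside at least one of the supplied ranges; the description above does not spell this out, so treat that reading as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of UTF8_CodePointRange: both ends are inclusive. */
struct SketchRange
{
    uint16_t begin;
    uint16_t end;
};

/* ASSUMPTION: a code point passes when it lies inside at least one range.
 * A NULL Ranges pointer disables range checking, as documented. */
bool SketchInRanges (const uint16_t CodePoint,
                     const struct SketchRange *const Ranges,
                     const uint32_t RangeCount)
{
    if (Ranges == NULL) return true;

    for (uint32_t i = 0; i < RangeCount; ++i)
    {
        if (CodePoint >= Ranges[i].begin && CodePoint <= Ranges[i].end)
        {
            return true;
        }
    }

    return false;
}

void SketchInRangesDemo (void)
{
    /* Printable ASCII plus the Hiragana block. */
    static const struct SketchRange allowed[] =
    {
        { 0x0020, 0x007E },
        { 0x3040, 0x309F }
    };

    assert (SketchInRanges (0x3053, allowed, 2));     /* hiragana "ko"     */
    assert (SketchInRanges ('!', allowed, 2));        /* ASCII             */
    assert (!SketchInRanges (0x2623, allowed, 2));    /* biohazard sign    */
    assert (SketchInRanges (0x2623, NULL, 0));        /* no range checking */
}
```

Under this reading, the string from the earlier example (hiragana followed by an ASCII exclamation mark) would pass a check against these two ranges.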
-
uint32_t UTF8_Count(const uint8_t *const Data, const size_t Octets)
Checks a UTF-8 encoded stream and returns the number of valid Basic Multilingual Plane characters plus invalid octets. This function is a shortcut to calling UTF8_Check() with no range checking.
- Parameters
Data – An array of uint8_t elements containing UTF-8 encoded characters.
Octets – Element count in Data.
- Returns
Number of valid characters plus invalid octets.
-
uint32_t UTF8_ReSync(const uint8_t *const Data, const uint32_t Octets)
Resynchronizes an ill-formed multi-octet character sequence to the next single-octet character or to the first octet of the next multi-octet character.
- Parameters
Data – An array of uint8_t elements.
Octets – Element count in Data.
- Returns
Offset to the next single-octet character or to the first octet of the next multi-octet character sequence, or Octets when there is none.
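One way to picture resynchronization is a forward scan for the first octet that can start a BMP character. The sketch below is an illustration under stated assumptions, not the framework's algorithm: `SketchReSync` is a hypothetical name, and the scan is assumed to begin at offset 0 of the buffer it receives, so a caller would pass a pointer just past the octet that was found to be invalid.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when Octet can start a one-, two- or three-octet sequence. */
static int SketchIsStartOctet (const uint8_t Octet)
{
    return (Octet & 0x80) == 0x00       /* 0xxxxxxx: single octet       */
        || (Octet & 0xE0) == 0xC0       /* 110xxxxx: two-octet lead     */
        || (Octet & 0xF0) == 0xE0;      /* 1110xxxx: three-octet lead   */
}

/* Hypothetical stand-in. ASSUMPTION: scanning starts at offset 0.
 * Returns the offset of the first start octet, or Octets if none. */
uint32_t SketchReSync (const uint8_t *const Data, const uint32_t Octets)
{
    uint32_t offset = 0;

    while (offset < Octets && !SketchIsStartOctet (Data[offset]))
    {
        ++offset;
    }

    return offset;
}

void SketchReSyncDemo (void)
{
    /* Two orphaned continuation octets followed by ASCII 'A'. */
    const uint8_t damaged[] = { 0x81, 0x93, 'A' };

    assert (SketchReSync (damaged, sizeof damaged) == 2);
}
```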
-
uint32_t UTF8_RemoveChars(struct ARRAY *const A, const uint32_t Count)
Removes a number of UTF-8 characters starting from the end of an ARRAY.
- Parameters
A – The ARRAY instance containing UTF-8 characters.
Count – Number of UTF-8 characters to remove from the end.
- Returns
Number of characters effectively removed.
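Removing characters without breaking the encoding means always cutting at a character boundary. Since the layout of `ARRAY` is not described here, the sketch below works on a plain octet buffer and its length instead. `SketchRemoveChars` is a hypothetical name, and the buffer is assumed to hold well-formed UTF-8.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in operating on a plain buffer instead of an ARRAY.
 * ASSUMPTION: Data holds well-formed UTF-8. Shrinks *Octets so that up to
 * Count trailing characters are dropped; returns how many were removed. */
uint32_t SketchRemoveChars (const uint8_t *const Data, uint32_t *const Octets,
                            const uint32_t Count)
{
    uint32_t removed = 0;

    while (removed < Count && *Octets > 0)
    {
        /* Step back one octet, then keep stepping back while standing on a
         * continuation octet (10xxxxxx), so the cut lands on the first
         * octet of the character being removed. */
        do
        {
            --(*Octets);
        }
        while (*Octets > 0 && (Data[*Octets] & 0xC0) == 0x80);

        ++removed;
    }

    return removed;
}

void SketchRemoveCharsDemo (void)
{
    /* 'a' followed by hiragana "ko" (E3 81 93): two characters, four octets. */
    const uint8_t text[] = { 'a', 0xE3, 0x81, 0x93 };
    uint32_t length = sizeof text;

    assert (SketchRemoveChars (text, &length, 1) == 1);
    assert (length == 1);    /* only 'a' remains */
}
```

Asking for more characters than the buffer holds simply empties it, which is consistent with the return value being the number of characters effectively removed.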