UTF-8
Description
Unicode is a standard for the consistent encoding, representation, and handling of text from the world's writing systems. In Unicode, code points are written as “U+” followed by a hexadecimal number. For example, U+2623 is the biohazard sign ☣. Many, but not all, code points map one-to-one to single characters.
Note
The software industry long held unfortunate misconceptions about what “Unicode” meant. For several years, many people took the word to mean nothing more than a limited, fixed-width 16-bit character encoding.
UTF-8 is part of the Unicode Standard. It is an octet-based, variable-length character encoding for Unicode code points, and it is backwards compatible with 7-bit ASCII. When the eighth, most significant bit of an octet is set, the code point data has been split across two to four octets; the fixed bits adjacent to it determine the sequence length, as depicted in the following figure:
Embedul.ar natively supports UTF-8 character strings as regular string literals stored in UTF-8 encoded files, as in the following example:
// Save the source code file with UTF-8 encoding to store a proper
// multi-octet sequence.
const char * utf8_string = "こんにちはせかい!";
By design, the framework supports only the Basic Multilingual Plane or BMP (code points U+0000 to U+FFFF), mainly to save memory and reduce implementation complexity. Four-octet UTF-8 sequences (code points above U+FFFF) raise errors or assertions, depending on context.
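To make the three-octet limit concrete, the following is a minimal sketch of how a BMP code point can be packed into one to three octets using the standard UTF-8 bit layout. The function name sketch_utf8_encode_bmp is hypothetical and not part of the framework; rejecting surrogate code points (U+D800 to U+DFFF) is an assumption of this sketch, not documented framework behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper for illustration only; not part of the framework API.
// Encodes one BMP code point into `out` and returns the number of octets
// written, or 0 if the buffer is too small or the value is a surrogate.
size_t sketch_utf8_encode_bmp(uint8_t *out, size_t capacity, uint16_t cp)
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0; // Surrogates are not encodable (assumption of this sketch).
    }
    if (cp < 0x80) {
        // 0xxxxxxx: plain 7-bit ASCII.
        if (capacity < 1) return 0;
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        // 110xxxxx 10xxxxxx: 11 payload bits.
        if (capacity < 2) return 0;
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    // 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits, enough for the whole BMP.
    if (capacity < 3) return 0;
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}
```

For U+2623, the sketch writes the octets E2 98 A3. A 16-bit code point parameter can never produce a four-octet sequence, which is one way the BMP restriction keeps an encoder small.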
API guide
Code points and UTF-8 encoded buffers
UTF-8 is a variable-length encoding: depending on the code point, a character may need more than one octet. The following functions encode a code point to, or decode it from, a UTF-8 encoded buffer.
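As an illustration of the decoding direction, here is a minimal, self-contained sketch of a BMP-only decoder. The type sketch_decode_result and the function sketch_utf8_decode_bmp are hypothetical names loosely modelled on the result structure documented in the API reference; the choice to report ill-formed input with a zero length is an assumption of this sketch, not the framework's documented behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical result type for illustration only.
struct sketch_decode_result {
    uint16_t length;    // Octets consumed; 0 signals an ill-formed sequence.
    uint16_t codepoint; // Decoded BMP code point when length > 0.
};

// Decodes the first character in `data`, accepting sequences of one to
// three octets only (the Basic Multilingual Plane).
struct sketch_decode_result sketch_utf8_decode_bmp(const uint8_t *data,
                                                   size_t octets)
{
    struct sketch_decode_result r = {0, 0};
    assert(data != NULL);
    if (octets == 0) return r;

    const uint8_t lead = data[0];
    if (lead < 0x80) {
        // Single octet: 7-bit ASCII.
        r.length = 1;
        r.codepoint = lead;
        return r;
    }
    if ((lead & 0xE0) == 0xC0) {
        // Two octets: 110xxxxx 10xxxxxx.
        if (octets < 2 || (data[1] & 0xC0) != 0x80) return r;
        const uint16_t cp = (uint16_t)(((lead & 0x1F) << 6) | (data[1] & 0x3F));
        if (cp < 0x80) return r; // Overlong encoding.
        r.length = 2;
        r.codepoint = cp;
        return r;
    }
    if ((lead & 0xF0) == 0xE0) {
        // Three octets: 1110xxxx 10xxxxxx 10xxxxxx.
        if (octets < 3 || (data[1] & 0xC0) != 0x80 ||
            (data[2] & 0xC0) != 0x80) return r;
        const uint16_t cp = (uint16_t)(((lead & 0x0F) << 12) |
                                       ((data[1] & 0x3F) << 6) |
                                       (data[2] & 0x3F));
        if (cp < 0x800) return r;                  // Overlong encoding.
        if (cp >= 0xD800 && cp <= 0xDFFF) return r; // Surrogate.
        r.length = 3;
        r.codepoint = cp;
        return r;
    }
    // Stray continuation octet, four-octet lead, or invalid lead octet.
    return r;
}
```

Feeding the octets E2 98 A3 to this sketch yields a length of 3 and the code point U+2623, while a four-octet sequence such as F0 9F 98 80 is rejected because its code point lies above U+FFFF.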
Handling UTF-8 encoded buffer data
Because UTF-8 is a variable-length encoding, specialized functions are required to check for ill-formed data, resync the data stream to the first valid character after an invalid sequence, count the number of UTF-8 characters, and remove characters without breaking the encoding.
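The idea behind resynchronizing and counting can be sketched with a single observation: continuation octets always have the bit pattern 10xxxxxx, so any other octet can start a character. The helpers below are hypothetical (they are not the framework's functions), and the counting helper assumes its input is already well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// True for continuation octets, which always have the form 10xxxxxx.
int sketch_is_continuation(uint8_t octet)
{
    return (octet & 0xC0) == 0x80;
}

// Hypothetical helper: returns the offset of the first octet that can
// start a character, or `octets` when there is none.
size_t sketch_utf8_resync(const uint8_t *data, size_t octets)
{
    assert(data != NULL);
    size_t i = 0;
    while (i < octets && sketch_is_continuation(data[i])) {
        ++i;
    }
    return i;
}

// Hypothetical helper: counts characters by counting the octets that are
// not continuation octets. Only meaningful for well-formed input.
size_t sketch_utf8_count_well_formed(const uint8_t *data, size_t octets)
{
    assert(data != NULL);
    size_t count = 0;
    for (size_t i = 0; i < octets; ++i) {
        if (!sketch_is_continuation(data[i])) {
            ++count;
        }
    }
    return count;
}
```

Given a buffer that begins in the middle of a character, such as 98 A3 41, the resync sketch returns offset 2, which points at the ASCII letter A. A full validity check needs more than this, since it must also verify that each lead octet is followed by the right number of continuation octets.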
Design and development status
Feature-complete.
Changelog
| Version | Date* | Author | Comment | 
|---|---|---|---|
| 1.0.0 | 2022.9.7 | sgermino | Initial release. | 
* Date format is Year.Month.Day.
API reference
struct UTF8_GetCodePointResult
Information returned by UTF8_GetCodePoint().
- uint16_t dataLength – Data octets used to extract UTF8_GetCodePointResult.codepoint.
- uint16_t codepoint – Code point in the Basic Multilingual Plane.

struct UTF8_CodePointRange
Defines a code point range.
- uint16_t begin – From which code point (inclusive).
- uint16_t end – To which code point (inclusive).

struct UTF8_CheckResult
Information returned by UTF8_Check().
- uint32_t validChars – Number of valid Basic Multilingual Plane characters.
- uint32_t invalidOctets – Number of invalid octets in the variable-width encoding.
- _Bool rangePassed – true if every character is inside the specified ranges, false otherwise.

uint8_t UTF8_SetCodePoint(uint8_t *const Data, const uint32_t Octets, const uint16_t CodePoint)
Encodes a UTF-8 character. This function only encodes variable-width characters up to three octets in length, that is, code points from the Unicode Basic Multilingual Plane (U+0000 through U+FFFF).
Parameters
- Data – An array of uint8_t with enough space to store the encoded code point (up to three octets).
- Octets – Element count in Data.
- CodePoint – Unicode code point from the Basic Multilingual Plane.

struct UTF8_GetCodePointResult UTF8_GetCodePoint(const uint8_t *const Data, const uint32_t Octets)
Decodes a UTF-8 character. This function only assembles variable-width characters up to three octets in length, enough to decode all code points from the Unicode Basic Multilingual Plane (U+0000 through U+FFFF).
Parameters
- Data – An array of uint8_t containing UTF-8 encoded data.
- Octets – Element count in Data.

struct UTF8_CheckResult UTF8_Check(const uint8_t *const Data, const size_t Octets, const struct UTF8_CodePointRange *const Ranges, const uint32_t RangeCount)
Checks that UTF-8 encoded characters fall inside the specified code point ranges and returns the number of valid Basic Multilingual Plane characters. This function detects and assembles each well-formed UTF-8 character by using UTF8_GetCodePoint().
Parameters
- Data – An array of uint8_t elements containing UTF-8 encoded characters.
- Octets – Element count in Data.
- Ranges – An array of UTF8_CodePointRange with code point ranges. An out-of-range character will fail the range test. This parameter may be NULL, in which case there will be no range checking.
- RangeCount – Number of UTF8_CodePointRange elements in the Ranges array, or zero if Ranges is NULL.
Returns
- A UTF8_CheckResult with the results.

uint32_t UTF8_Count(const uint8_t *const Data, const size_t Octets)
Checks a UTF-8 encoded stream and returns the number of valid Basic Multilingual Plane characters plus invalid octets. This function is a shortcut to calling UTF8_Check() with no range checking.
Parameters
- Data – An array of uint8_t elements containing UTF-8 encoded characters.
- Octets – Element count in Data.
Returns
- Number of valid characters plus invalid octets.

uint32_t UTF8_ReSync(const uint8_t *const Data, const uint32_t Octets)
Resynchronizes an ill-formed multi-octet character sequence to the next single-octet character or to the first octet of the next multi-octet character.
Parameters
- Data – An array of uint8_t elements.
- Octets – Element count in Data.
Returns
- Offset to the next single-octet character or to the start of the next multi-octet character sequence, or Octets when there is none.

uint32_t UTF8_RemoveChars(struct ARRAY *const A, const uint32_t Count)
Removes a number of UTF-8 characters starting from the end of an ARRAY.
Parameters
- A – The ARRAY instance containing UTF-8 characters.
- Count – Number of UTF-8 characters to remove from the end.
Returns
- Number of characters effectively removed.
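
Removing characters from the end without breaking the encoding means stepping back over whole octet sequences rather than single octets. The following is a minimal sketch of that idea on a plain buffer. The function sketch_utf8_trim_end is hypothetical, works on a raw pointer and length instead of an ARRAY instance, and assumes the buffer holds well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper for illustration only; not part of the framework API.
// Returns the new length of `data` after dropping up to `count` characters
// from its end. The buffer itself is not modified.
size_t sketch_utf8_trim_end(const uint8_t *data, size_t octets, size_t count)
{
    assert(data != NULL);

    while (count > 0 && octets > 0) {
        // Step back onto the last octet of the final character.
        --octets;
        // Keep stepping back while standing on a continuation octet
        // (10xxxxxx), so the cut lands on the character's lead octet.
        while (octets > 0 && (data[octets] & 0xC0) == 0x80) {
            --octets;
        }
        --count;
    }
    return octets;
}
```

With the six-octet buffer 41 E2 98 A3 C3 A9 (three characters), trimming one character gives a new length of 4 and trimming two gives a new length of 1. Asking for more characters than the buffer holds simply returns 0.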