UTF-8

Description

Unicode is a standard for the consistent encoding, representation, and handling of text in the world's writing systems. In Unicode, code points are written as “U+” followed by a hexadecimal number. For example, U+2623 is the biohazard sign ☣. Many, but not all, code points map one-to-one to single characters.

Note

For several years, the software industry held unfortunate misconceptions about what “Unicode” meant. Many people came to associate the word “Unicode” exclusively with a limited, fixed-width 16-bit character encoding.

UTF-8 is part of the Unicode Standard. It is an octet-based, variable-length character encoding used to encode Unicode code points. UTF-8 is backwards compatible with 7-bit ASCII. When the eighth, most significant bit of an octet is set, the code point data is split across two to four octets, depending on the adjacent fixed bits, as depicted in the following figure:

Embedul.ar natively supports UTF-8 character strings as regular string literals stored in UTF-8 encoded files, as in the following example:

// Save the source code file with UTF-8 encoding to store a proper
// multi-octet sequence.
const char * utf8_string = "こんにちはせかい!";

By design, the framework only supports the Basic Multilingual Plane, or BMP (code points U+0000 to U+FFFF), mainly to save memory and reduce implementation complexity. UTF-8 four-octet encoded sequences (code points above U+FFFF) will raise errors or assertions, depending on context.
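To illustrate the BMP restriction, the following minimal sketch classifies a lead octet by its fixed high bits. The helper name sketch_bmp_sequence_length() is hypothetical and not part of the framework API. It deliberately ignores finer rules such as overlong encodings and only shows how a four-octet lead can be told apart from the shorter forms.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only, not a framework function.
 * Returns the sequence length announced by a lead octet: 1 to 3 for the
 * forms that can carry a BMP code point, 0 for anything else. */
static uint8_t sketch_bmp_sequence_length(const uint8_t lead)
{
    if ((lead & 0x80u) == 0x00u) return 1;  /* 0xxxxxxx: 7-bit ASCII       */
    if ((lead & 0xE0u) == 0xC0u) return 2;  /* 110xxxxx: two-octet form    */
    if ((lead & 0xF0u) == 0xE0u) return 3;  /* 1110xxxx: three-octet form  */
    return 0;  /* 10xxxxxx continuation, 11110xxx four-octet lead, or invalid */
}
```

A caller that receives 0 for a lead octet such as 0xF0 knows the sequence lies outside what a BMP-only decoder can handle.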

API guide

Code points and UTF-8 encoded buffers

UTF-8 is a variable-length encoding: depending on the code point, a character may need more than one octet. The following functions deal with encoding (or decoding) a code point to (or from) a UTF-8 encoded buffer.

UTF8_SetCodePoint()

UTF8_GetCodePoint()

Handling UTF-8 encoded buffer data

The variable-octet encoding nature of UTF-8 requires specialized functions to check for ill-formed data, resync the data stream to the first valid character in the event of an invalid sequence, count the number of UTF-8 characters, and remove characters without breaking the encoding.

UTF8_Check()
UTF8_Count()
UTF8_ReSync()
UTF8_RemoveChars()

Design and development status

Feature-complete.

Changelog

Version	Date*	Author	Comment
1.0.0	2022.9.7	sgermino	Initial release.

* Date format is Year.Month.Day.

API reference

struct UTF8_GetCodePointResult

Information returned by UTF8_GetCodePoint().

uint16_t dataLength: Number of data octets used to extract UTF8_GetCodePointResult.codepoint.

uint16_t codepoint: Code point in the Basic Multilingual Plane.

struct UTF8_CodePointRange

Defines a code point range.

uint16_t begin: From which code point (inclusive).

uint16_t end: To which code point (inclusive).

struct UTF8_CheckResult

Information returned by UTF8_Check().

uint32_t validChars: Number of valid Basic Multilingual Plane characters.

uint32_t invalidOctets: Number of invalid octets in the variable-width encoding.

_Bool rangePassed: true if every character is inside the specified ranges, false otherwise.

uint8_t UTF8_SetCodePoint(uint8_t *const Data, const uint32_t Octets, const uint16_t CodePoint)

Encodes a UTF-8 character. This function will only encode a variable-width character up to three octets in length, that is, a code point from the Unicode Basic Multilingual Plane (U+0000 through U+FFFF).

Parameters

Data – An array of uint8_t with enough space to store the encoded code point (up to three octets).
Octets – Element count in Data.
CodePoint – Unicode code point from the Basic Multilingual Plane.
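As an illustration of what encoding a BMP code point involves, here is a minimal standalone sketch. The function sketch_encode_bmp() is a hypothetical stand-in, not the framework's implementation. It assumes that the return value is the number of octets written and that 0 signals a buffer that is too small. It does not reject surrogate code points (U+D800 through U+DFFF), and the real UTF8_SetCodePoint() may behave differently on all of these points.

```c
#include <stdint.h>

/* Hypothetical stand-in for illustration. Writes the UTF-8 form of a BMP
 * code point into data and returns the octets written, or 0 when the
 * buffer is too small (an assumption of this sketch). */
static uint8_t sketch_encode_bmp(uint8_t *const data, const uint32_t octets,
                                 const uint16_t code_point)
{
    if (code_point < 0x80u) {               /* one octet: 0xxxxxxx */
        if (octets < 1u) return 0;
        data[0] = (uint8_t)code_point;
        return 1;
    }
    if (code_point < 0x800u) {              /* two octets: 110xxxxx 10xxxxxx */
        if (octets < 2u) return 0;
        data[0] = (uint8_t)(0xC0u | (code_point >> 6));
        data[1] = (uint8_t)(0x80u | (code_point & 0x3Fu));
        return 2;
    }
    if (octets < 3u) return 0;              /* three octets: 1110xxxx 10xxxxxx 10xxxxxx */
    data[0] = (uint8_t)(0xE0u | (code_point >> 12));
    data[1] = (uint8_t)(0x80u | ((code_point >> 6) & 0x3Fu));
    data[2] = (uint8_t)(0x80u | (code_point & 0x3Fu));
    return 3;
}
```

With this sketch, encoding U+2623 (the biohazard sign) into a three-octet buffer yields the octets 0xE2 0x98 0xA3.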

struct UTF8_GetCodePointResult UTF8_GetCodePoint(const uint8_t *const Data, const uint32_t Octets)

Decodes a UTF-8 character. This function will only assemble variable-width characters up to three octets in length, enough to decode all code points from the Unicode Basic Multilingual Plane (U+0000 through U+FFFF).

Parameters

Data – An array of uint8_t containing UTF-8 encoded code points.
Octets – Element count in Data.
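The decoding direction can be sketched in the same spirit. The names below (struct sketch_decode_result, sketch_decode_bmp()) are hypothetical. The result struct only mirrors the two documented fields, and reporting failure as a data length of 0 is an assumption of this sketch, since the way UTF8_GetCodePoint() reports ill-formed input is not described here.

```c
#include <stdint.h>

/* Hypothetical result type mirroring the documented fields. */
struct sketch_decode_result {
    uint16_t data_length;  /* octets consumed; 0 means ill-formed (assumption) */
    uint16_t codepoint;    /* decoded BMP code point when data_length > 0 */
};

/* True for continuation octets of the form 10xxxxxx. */
static int sketch_is_continuation(const uint8_t octet)
{
    return (octet & 0xC0u) == 0x80u;
}

/* Decodes one BMP character from the start of data. Rejects truncated
 * sequences, four-octet leads, overlong forms and surrogates. */
static struct sketch_decode_result sketch_decode_bmp(const uint8_t *const data,
                                                     const uint32_t octets)
{
    struct sketch_decode_result r = { 0, 0 };
    if (octets == 0u) return r;
    const uint8_t lead = data[0];
    if (lead < 0x80u) {
        r.data_length = 1;
        r.codepoint = lead;
    } else if ((lead & 0xE0u) == 0xC0u && octets >= 2u
               && sketch_is_continuation(data[1])) {
        const uint16_t cp = (uint16_t)(((lead & 0x1Fu) << 6)
                                       | (data[1] & 0x3Fu));
        if (cp >= 0x80u) {                  /* below U+0080 would be overlong */
            r.data_length = 2;
            r.codepoint = cp;
        }
    } else if ((lead & 0xF0u) == 0xE0u && octets >= 3u
               && sketch_is_continuation(data[1])
               && sketch_is_continuation(data[2])) {
        const uint16_t cp = (uint16_t)(((lead & 0x0Fu) << 12)
                                       | ((data[1] & 0x3Fu) << 6)
                                       | (data[2] & 0x3Fu));
        if (cp >= 0x800u && (cp < 0xD800u || cp > 0xDFFFu)) {
            r.data_length = 3;              /* not overlong, not a surrogate */
            r.codepoint = cp;
        }
    }
    return r;
}
```

Decoding the octets 0xE3 0x81 0x93 with this sketch gives U+3053 (こ) with a data length of 3.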

struct UTF8_CheckResult UTF8_Check(const uint8_t *const Data, const size_t Octets, const struct UTF8_CodePointRange *const Ranges, const uint32_t RangeCount)

Checks that the UTF-8 encoded characters fall within the given code point ranges and returns the number of valid Basic Multilingual Plane characters. This function detects and assembles each well-formed UTF-8 character by using UTF8_GetCodePoint().

Parameters

Data – An array of uint8_t elements containing UTF-8 encoded characters.
Octets – Element count in Data.
Ranges – An array of UTF8_CodePointRange with code point ranges. An out-of-range character will fail the range test. This parameter may be NULL, in which case there will be no range checking.
RangeCount – Number of UTF8_CodePointRange elements in the Ranges array, or zero if Ranges is NULL.

Returns

UTF8_CheckResult with results.
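The range test on a single decoded code point can be sketched as follows. The struct sketch_range and the function sketch_in_ranges() are hypothetical names. The sketch assumes that a character passes when it lies inside at least one of the ranges, with both bounds inclusive, and that a NULL array disables the test. How UTF8_Check() combines several ranges is not stated here, so treat this as one plausible reading.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type: both bounds are inclusive. */
struct sketch_range {
    uint16_t begin;
    uint16_t end;
};

/* Returns true when cp lies inside at least one range (an assumption of
 * this sketch), or when ranges is NULL and no checking is requested. */
static bool sketch_in_ranges(const uint16_t cp,
                             const struct sketch_range *const ranges,
                             const uint32_t range_count)
{
    if (ranges == NULL) return true;
    for (uint32_t i = 0; i < range_count; ++i) {
        if (cp >= ranges[i].begin && cp <= ranges[i].end) return true;
    }
    return false;
}
```

For example, with the ranges U+0020 to U+007E and U+3040 to U+309F, the code point U+3053 passes while U+2623 fails.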

uint32_t UTF8_Count(const uint8_t *const Data, const size_t Octets)

Checks a UTF-8 encoded stream and returns the number of valid Basic Multilingual Plane characters plus invalid octets. This function is a shortcut to calling UTF8_Check() with no range checking.

Parameters

Data – An array of uint8_t elements containing UTF-8 encoded characters.
Octets – Element count in Data.

Returns

Number of valid characters plus invalid octets.
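A count of "valid characters plus invalid octets" can be sketched as a single pass over the buffer. The names sketch_sequence_at() and sketch_count() are hypothetical. The sketch assumes that each octet that does not start a well-formed sequence counts as one invalid octet and is skipped individually. For brevity it omits overlong and surrogate checks, so it is looser than a full validator and may differ from UTF8_Count().

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the BMP sequence starting at data[0], or 0 if the lead octet
 * is not acceptable, the buffer is truncated, or a continuation octet is
 * missing. Overlong and surrogate checks are omitted in this sketch. */
static size_t sketch_sequence_at(const uint8_t *const data, const size_t octets)
{
    if (octets == 0u) return 0;
    const uint8_t lead = data[0];
    if (lead < 0x80u) return 1;
    size_t need;
    if ((lead & 0xE0u) == 0xC0u) need = 2;
    else if ((lead & 0xF0u) == 0xE0u) need = 3;
    else return 0;
    if (octets < need) return 0;
    for (size_t i = 1; i < need; ++i) {
        if ((data[i] & 0xC0u) != 0x80u) return 0;
    }
    return need;
}

/* Counts well-formed characters plus stray octets in one pass. */
static uint32_t sketch_count(const uint8_t *const data, const size_t octets)
{
    uint32_t total = 0;
    size_t i = 0;
    while (i < octets) {
        const size_t len = sketch_sequence_at(&data[i], octets - i);
        i += (len > 0u) ? len : 1u;  /* an invalid octet is skipped alone */
        ++total;
    }
    return total;
}
```

Under these assumptions, the buffer 0xE3 0x81 0x93 0x41 0xFF counts as 3: two valid characters and one invalid octet.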

uint32_t UTF8_ReSync(const uint8_t *const Data, const uint32_t Octets)

Resynchronizes after an ill-formed multi-octet character sequence by locating the next single-octet character or the start of the next multi-octet character.

Parameters

Data – An array of uint8_t elements.
Octets – Element count in Data.

Returns

Offset to the next single-octet character or to the start of the next multi-octet character sequence, or Octets when there is none.
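Resynchronization relies on the fact that continuation octets (10xxxxxx) can never start a character. The hypothetical sketch_resync() below skips such octets from the start of the buffer and returns the offset of the first octet that could begin a character. Whether UTF8_ReSync() also inspects the first octet, or skips it unconditionally, is not specified here, so this is only one possible interpretation.

```c
#include <stdint.h>

/* Hypothetical sketch: returns the offset of the first octet that is not
 * a continuation octet, or octets when every octet is a continuation. */
static uint32_t sketch_resync(const uint8_t *const data, const uint32_t octets)
{
    uint32_t offset = 0;
    while (offset < octets && (data[offset] & 0xC0u) == 0x80u) {
        ++offset;  /* 10xxxxxx cannot start a character: skip it */
    }
    return offset;
}
```

For a buffer that starts in the middle of a character, such as 0x81 0x93 0x41, the sketch returns 2, the offset of the ASCII octet.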

uint32_t UTF8_RemoveChars(struct ARRAY *const A, const uint32_t Count)

Removes a number of UTF-8 characters starting from the end of an ARRAY.

Parameters

A – The ARRAY instance containing UTF-8 characters.
Count – Number of UTF-8 characters to remove from the end.

Returns

Number of characters effectively removed.
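Removing characters from the end without breaking the encoding means walking backwards over continuation octets until the octet that starts the character is reached. The sketch below works on a plain buffer and a length variable instead of an ARRAY, because the ARRAY interface is not described here. The name sketch_remove_chars() and its signature are hypothetical, and the input is assumed to be well-formed UTF-8.

```c
#include <stdint.h>

/* Hypothetical sketch on a plain buffer. Shrinks *length so that up to
 * count characters are dropped from the end, and returns how many
 * characters were actually removed. Assumes well-formed input. */
static uint32_t sketch_remove_chars(const uint8_t *const data,
                                    uint32_t *const length,
                                    const uint32_t count)
{
    uint32_t removed = 0;
    while (removed < count && *length > 0u) {
        /* Step back one octet at a time until the new end of the buffer
         * falls on a lead or ASCII octet, which is then dropped too. */
        do {
            --*length;
        } while (*length > 0u && (data[*length] & 0xC0u) == 0x80u);
        ++removed;
    }
    return removed;
}
```

On the buffer 0x41 0xE3 0x81 0x93 with a length of 4, removing one character leaves a length of 1. Requesting more characters than remain returns the smaller number actually removed.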