Unicode is an international standard for encoding text. It assigns a unique integer value, called a code point, to every character, symbol, and emoji from virtually all of the world's written languages and scripts. This universal mapping lets computers process, store, and display text correctly across different platforms.
Code points can be encoded in several standardized ways. Two of them are of interest here: UTF-8 and UCS-4 (the latter is equivalent to UTF-32).
UTF-8 encoding has emerged as the de facto standard format for representing Unicode in text files on all major operating systems.
Not all code points are stored with the same number of bytes in UTF-8. The characters of 7-bit ASCII are represented by the same single byte in UTF-8, while all other characters require two to four bytes of storage. ASCII is therefore a subset of UTF-8, but UTF-8 can represent far more characters. This compatibility with ASCII is a major advantage of UTF-8 over other code point encodings as a file format, and it contributed to UTF-8 becoming a de facto standard.
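The variable width can be illustrated with a minimal sketch that reports how many bytes UTF-8 needs for a given code point. The function name utf8_width is hypothetical and chosen only for this example; it is not taken from any of the modules described in these guides.

```fortran
program utf8_width_demo
   implicit none
   ! Sample code points: 'A', e-acute, euro sign, and an emoji.
   integer, parameter :: samples(4) = &
      [int(z'41'), int(z'E9'), int(z'20AC'), int(z'1F600')]
   integer :: i

   do i = 1, size(samples)
      print '(a,z6.6,a,i0,a)', 'U+', samples(i), ' needs ', &
         utf8_width(samples(i)), ' byte(s)'
   end do

contains

   ! Hypothetical helper: number of bytes UTF-8 uses for one code point.
   ! Returns 0 for values outside the Unicode range 0..10FFFF.
   ! Simplification: the surrogate range D800..DFFF is not rejected here.
   pure function utf8_width(codepoint) result(nbytes)
      integer, intent(in) :: codepoint
      integer :: nbytes
      if (codepoint < 0) then
         nbytes = 0
      else if (codepoint <= int(z'7F')) then
         nbytes = 1        ! 7-bit ASCII range
      else if (codepoint <= int(z'7FF')) then
         nbytes = 2
      else if (codepoint <= int(z'FFFF')) then
         nbytes = 3
      else if (codepoint <= int(z'10FFFF')) then
         nbytes = 4
      else
         nbytes = 0
      end if
   end function utf8_width

end program utf8_width_demo
```

For the four sample code points the reported widths are 1, 2, 3, and 4 bytes respectively.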
UCS-4 encoding is simpler and homogeneous. Each code point is stored as a 32-bit value, so every character uses the same number of bytes (unlike UTF-8). This format is often used to represent Unicode code points internally in various programming languages.
UCS-4 encoding shares the trait of constant storage size per element with all Fortran intrinsic types, making it a natural fit for the internal representation of code points in the Fortran language.
Since the release of the 2003 standard, Fortran has optionally supported processing UTF-8-encoded files in this manner. Data is stored internally using UCS-4 encoding but translated to and from UTF-8 encoding during formatted I/O. This option will be referred to as the Fortran ISO_10646 standard.
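A minimal sketch of this mechanism follows. It assumes a compiler that supports the optional ISO_10646 character kind (gfortran is one that does); on a processor without that support, SELECTED_CHAR_KIND returns a negative value and the program will not compile. Reopening the preconnected output unit with ENCODING='UTF-8' follows a commonly shown idiom, but whether standard output can be changed this way may vary between processors.

```fortran
program utf8_io_demo
   use, intrinsic :: iso_fortran_env, only : output_unit
   implicit none
   ! Kind value of the UCS-4 character type; negative if unsupported.
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   character(kind=ucs4, len=:), allocatable :: text

   ! Build the string from code points with CHAR so that the source
   ! file itself contains only ASCII characters.
   text = ucs4_'Price: ' // char(int(z'20AC'), ucs4) // ucs4_'100'

   ! LEN counts code points, because each UCS-4 element is one code point.
   print '(a,i0)', 'length in code points: ', len(text)

   ! Request translation from UCS-4 to UTF-8 on formatted output.
   open (output_unit, encoding='UTF-8')
   write (output_unit, '(a)') text
end program utf8_io_demo
```

The internal string holds one fixed-size element per code point; only the formatted WRITE produces the variable-width UTF-8 byte sequence.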
A character encoded using UCS-4 or UTF-8 is often referred to as a “glyph” to differentiate it from an ASCII character. Technically, “glyph” names the visual appearance of a character as rendered by a font, but the term is also used here to mean a Unicode “character”.
The following guides describe using UTF-8 files from Fortran codes. They not only include examples using the standard-specified ISO_10646 extension, but also describe how to process UTF-8-encoded data without the extension. They discuss what is standardized and what is not, which commonly used compiler extensions address some of the current gaps in Unicode support, and which behaviors of various compilers/processors are useful but known to be potentially non-portable.
The resulting methods are incorporated into Fortran modules available via GitHub repositories.
The selection of methods to employ breaks down along these major divides:
Using the optional Fortran ISO_10646 standard.
The first guide set assumes you want to use the ISO_10646 extension, would prefer to conform to the Fortran standard as portably as is reasonable, and will probably avoid using UTF-8-encoded constant strings.
Processing UTF-8 data without using the ISO_10646 extension.
Using UTF-8-encoded source files versus using only Fortran source files that strictly adhere to the Fortran character set.
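As a rough illustration of the second approach, UTF-8 data can be held in ordinary default-kind character variables and examined byte by byte. The sketch below counts code points by skipping continuation bytes (those with the bit pattern 10xxxxxx). The function name utf8_length is hypothetical and is not the interface of any module described here. The sketch also assumes that default characters are single bytes, that CHAR accepts values up to 255, and that ICHAR returns values in the range 0 to 255, which holds for common compilers but is not guaranteed by the standard. The UTF-8 bytes are built with CHAR so the source file stays within the Fortran character set.

```fortran
program utf8_count_demo
   implicit none
   ! The euro sign U+20AC as its three UTF-8 bytes E2 82 AC.
   character(len=*), parameter :: euro = char(226) // char(130) // char(172)
   character(len=:), allocatable :: line

   line = 'cost: ' // euro // '5'
   print '(a,i0)', 'bytes      : ', len(line)
   print '(a,i0)', 'code points: ', utf8_length(line)

contains

   ! Hypothetical helper: count code points in a UTF-8 byte string by
   ! counting every byte that is not a continuation byte.
   ! No validation of the byte sequence is attempted in this sketch.
   pure function utf8_length(bytes) result(n)
      character(len=*), intent(in) :: bytes
      integer :: n, i
      n = 0
      do i = 1, len(bytes)
         ! Continuation bytes lie in 128..191 (top two bits are 10).
         if (iand(ichar(bytes(i:i)), 192) /= 128) n = n + 1
      end do
   end function utf8_length

end program utf8_count_demo
```

Here LEN reports 10 bytes while the count of code points is 8, which shows why byte-oriented intrinsics alone are not enough when UTF-8 data is stored in default character variables.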
M_ucs4 - A module supporting use of the ISO_10646 extension
A module supplementing the ISO_10646 extension, including
M_unicode - Processing UTF-8 data without depending on the ISO_10646 extension
A module defining a user-defined type that allows for ragged arrays of Unicode data, together with overloads of intrinsic functions and many common character methods that provide additional functions such as case conversion, sorting, and padding. This is a very complete interface for processing UTF-8-encoded data that does not require the optional ISO_10646 extension. It provides both a functional and an object-oriented (OOP) interface.