Fortran Wiki
no_iso_10646

Processing Unicode when ISO-10646 is not supported

Lesson I: converting UTF-8 codes to and from INTEGER values

If a Fortran compiler does not provide the optional ISO-10646 support, you can still do more than just copy UTF-8 byte streams to and from files.

The most general approach is to convert the UTF-8 data into Unicode integer code points.

Fortran also lacks an intrinsic variable-length string type: every element of a CHARACTER array has the same length.

This all points to creating a user-defined type that stores a Unicode string as an array of integer code points, with functions similar to the Fortran CHARACTER intrinsics.

UTF-8 bytes to codes

The first thing to do with the UTF-8 encoded data is to convert it to Unicode code point values; that is, to find the integer value that identifies each character in Unicode. Fortran is not aware of UTF-8 encoding except via I/O routines when the optional ISO-10646 supplement is supported. So routines need to be created that convert UTF-8 encoded data to and from Unicode code point values. These procedures are available in the M_unicode module as

utf8_to_codepoints()
codepoints_to_utf8()

They are public, but generally not expected to be called directly by user code.
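
To make the conversion concrete, here is a minimal sketch of a UTF-8 decoder. It is not the M_unicode implementation; the module name utf8_decode_sketch and the function sketch_utf8_to_codes are hypothetical, and the code assumes that a default CHARACTER holds one byte and that ICHAR returns that byte's value.

```fortran
module utf8_decode_sketch
   implicit none
   private
   public :: sketch_utf8_to_codes
contains
   ! Hypothetical helper (not part of M_unicode): decode UTF-8 bytes held
   ! in a default CHARACTER variable into Unicode code points.
   ! Malformed sequences become U+FFFD. Overlong forms, surrogates, and
   ! values above U+10FFFF are NOT rejected in this sketch.
   function sketch_utf8_to_codes(bytes) result(codes)
      character(len=*), intent(in) :: bytes
      integer, allocatable :: codes(:)
      integer, parameter :: replacement = 65533   ! U+FFFD
      integer :: buf(len(bytes))                  ! at most one code per byte
      integer :: i, k, n, b, nfollow, cp
      i = 1
      n = 0
      do while (i <= len(bytes))
         ! Assumption: one byte per default character; mask to 0..255
         b = iand(ichar(bytes(i:i)), 255)
         if (b < 128) then                         ! 0xxxxxxx: ASCII
            cp = b
            nfollow = 0
         else if (b >= 192 .and. b < 224) then     ! 110xxxxx: 2-byte form
            cp = iand(b, 31)
            nfollow = 1
         else if (b >= 224 .and. b < 240) then     ! 1110xxxx: 3-byte form
            cp = iand(b, 15)
            nfollow = 2
         else if (b >= 240 .and. b < 248) then     ! 11110xxx: 4-byte form
            cp = iand(b, 7)
            nfollow = 3
         else                                      ! stray or invalid byte
            cp = replacement
            nfollow = 0
         end if
         i = i + 1
         do k = 1, nfollow
            if (i > len(bytes)) then               ! truncated sequence
               cp = replacement
               exit
            end if
            b = iand(ichar(bytes(i:i)), 255)
            if (iand(b, 192) /= 128) then          ! expected 10xxxxxx
               cp = replacement
               exit
            end if
            cp = ior(ishft(cp, 6), iand(b, 63))    ! append 6 payload bits
            i = i + 1
         end do
         n = n + 1
         buf(n) = cp
      end do
      codes = buf(1:n)
   end function sketch_utf8_to_codes
end module utf8_decode_sketch
```

With this sketch, the bytes 65, 195, 169 (the UTF-8 encoding of "Aé") would decode to the two code points 65 and 233.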

To encapsulate this data, a user-defined type called UNICODE_TYPE is defined. This allows for creating ragged arrays of character data where each element may be a different length.

Assignment is defined such that UNICODE_TYPE variables can be defined by assigning to them a UTF-8 encoded stream of bytes or even an integer array containing Unicode code point values.
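
The idea of a type holding code points, with assignment defined for it, can be sketched as follows. This is only an illustration under stated assumptions: the type name ustring_sketch, its component codes, and the module name are hypothetical and do not reflect how UNICODE_TYPE is laid out in M_unicode. Only assignment from an integer array is shown; assignment from UTF-8 bytes would additionally need a decoder.

```fortran
module ustring_sketch_m
   implicit none
   private
   public :: ustring_sketch, assignment(=)

   ! Hypothetical stand-in for a Unicode string type: one allocatable
   ! array of code points per variable, so array elements can differ
   ! in length (a "ragged" array).
   type :: ustring_sketch
      integer, allocatable :: codes(:)
   end type ustring_sketch

   ! Defined assignment: ustring = integer array of code points
   interface assignment(=)
      module procedure assign_codes
   end interface
contains
   subroutine assign_codes(lhs, rhs)
      type(ustring_sketch), intent(out) :: lhs
      integer, intent(in) :: rhs(:)
      lhs%codes = rhs          ! allocates lhs%codes to size(rhs)
   end subroutine assign_codes
end module ustring_sketch_m
```

Given a declaration such as type(ustring_sketch) :: lines(2), the statements lines(1) = [72, 105] and lines(2) = [1055, 1088, 1080, 1074, 1077, 1090] would store strings of length 2 and 6 in the same array, which a CHARACTER array cannot do.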

A function called CHARACTER is needed to convert the type back to a stream of UTF-8 bytes, for passing to other procedures or for printing as ordinary CHARACTER data.
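
The conversion back to bytes that such a function relies on might look roughly like the sketch below. It is not the M_unicode code; the names utf8_encode_sketch and sketch_codes_to_utf8 are hypothetical, and the sketch assumes a one-byte default CHARACTER kind whose CHAR intrinsic accepts values 0 to 255.

```fortran
module utf8_encode_sketch
   implicit none
   private
   public :: sketch_codes_to_utf8
contains
   ! Hypothetical helper (not part of M_unicode): encode Unicode code
   ! points as UTF-8 bytes in a default CHARACTER string.
   function sketch_codes_to_utf8(codes) result(bytes)
      integer, intent(in) :: codes(:)
      character(len=:), allocatable :: bytes
      character(len=4*size(codes)) :: buf     ! UTF-8 needs at most 4 bytes
      integer :: i, n, cp
      n = 0
      do i = 1, size(codes)
         cp = codes(i)
         ! Out-of-range values and surrogates are replaced by U+FFFD
         if (cp < 0 .or. cp > 1114111 .or. &
             (cp >= 55296 .and. cp <= 57343)) cp = 65533
         if (cp < 128) then                   ! 1 byte: 0xxxxxxx
            buf(n+1:n+1) = char(cp)
            n = n + 1
         else if (cp < 2048) then             ! 2 bytes: 110xxxxx 10xxxxxx
            buf(n+1:n+1) = char(ior(192, ishft(cp, -6)))
            buf(n+2:n+2) = char(ior(128, iand(cp, 63)))
            n = n + 2
         else if (cp < 65536) then            ! 3 bytes: 1110xxxx + 2 more
            buf(n+1:n+1) = char(ior(224, ishft(cp, -12)))
            buf(n+2:n+2) = char(ior(128, iand(ishft(cp, -6), 63)))
            buf(n+3:n+3) = char(ior(128, iand(cp, 63)))
            n = n + 3
         else                                 ! 4 bytes: 11110xxx + 3 more
            buf(n+1:n+1) = char(ior(240, ishft(cp, -18)))
            buf(n+2:n+2) = char(ior(128, iand(ishft(cp, -12), 63)))
            buf(n+3:n+3) = char(ior(128, iand(ishft(cp, -6), 63)))
            buf(n+4:n+4) = char(ior(128, iand(cp, 63)))
            n = n + 4
         end if
      end do
      bytes = buf(1:n)                        ! trim to the bytes written
   end function sketch_codes_to_utf8
end module utf8_encode_sketch
```

The result is an ordinary CHARACTER string, so it can be written with a plain A edit descriptor; whether it displays correctly depends on the terminal or file being treated as UTF-8.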

Now, with this type defined, we can overload all the character-related intrinsics to provide a familiar interface, add an OOP interface to the type, and add additional functions for sorting, advanced string manipulation, and case conversion.
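
As an illustration of overloading an intrinsic and adding an OOP binding, the sketch below extends the generic name LEN for a hypothetical type. The names ustring_sketch, codes, and ustr_len are assumptions made for this example and are not taken from M_unicode.

```fortran
module ustring_len_sketch
   implicit none
   private
   public :: ustring_sketch, len

   ! Hypothetical string type holding Unicode code points
   type :: ustring_sketch
      integer, allocatable :: codes(:)
   contains
      procedure :: len => ustr_len      ! OOP form: s%len()
   end type ustring_sketch

   ! Extend the intrinsic generic LEN; len('abc') still reaches the
   ! intrinsic because the argument types differ.
   interface len
      module procedure ustr_len
   end interface len
contains
   ! Length in code points, not in bytes
   pure integer function ustr_len(self)
      class(ustring_sketch), intent(in) :: self
      if (allocated(self%codes)) then
         ustr_len = size(self%codes)
      else
         ustr_len = 0
      end if
   end function ustr_len
end module ustring_len_sketch
```

A program that uses this module could then write len(s) or s%len() for a ustring_sketch variable s, while len applied to a CHARACTER argument keeps its usual meaning.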

The result is an interface that is arguably simpler to use than the ISO-10646 supplement and considerably more powerful.

Summary

M_unicode

The M_unicode GitHub repository contains not only the module code but also build methods using fpm(1), make(1), and cmake(1); a unit test; example programs for each method provided; and documentation in HTML, man-page, and flat-text formats.

Created on November 3, 2025 02:17:40 by Anonymous Coward (73.214.44.198) (2518 characters / 1.0 pages)
