At its simplest, Unicode assigns a unique integer, called a code point, to each character. ASCII files are simply sequences of one-byte unsigned integer values, which limits them to 256 distinct characters (ASCII itself defines only 128). Unicode code points go far beyond that range.
So the ISO/IEC 10646 specification defines several ways of encoding each Unicode code point as a sequence of bytes (i.e., UTF-8, UCS-2/UTF-16, and UCS-4/UTF-32). Each method has pros and cons, primarily trade-offs between the amount of data required and how simply or efficiently the data can be processed, but also including factors such as how well it integrates, and how little it conflicts, with existing common text encodings such as ASCII.
To make it simple to identify which encoding a text file is using, a special byte sequence called the Byte Order Mark (BOM) was defined: placed at the start of a file, it does not act as a character itself but indicates which encoding the file uses.
That is why some Unicode files, particularly those originating from Windows systems, begin with a Byte Order Mark (BOM). This is an encoding of the code point U+FEFF, and it provides a strong indicator of both the encoding and the byte order (endianness).
If a BOM is absent, one can attempt to decode the file using each encoding and check for validity. For instance, a sequence of bytes that is valid UTF-8 might be invalid or produce nonsensical characters when interpreted as UTF-16 or UTF-32. Because of Unicode's design this can be done with a high degree of reliability, decreasing the need for the BOM "Unicode signature", the "magic string" of bytes placed at the beginning of the file.
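As a sketch of the BOM side of this kind of detection, the following program (an illustration, not part of the standard; the filename 'testfile.txt' is a placeholder) reads the first few bytes of a file via stream access and compares them against the well-known BOM byte patterns. Note the UTF-32 little-endian pattern (FF FE 00 00) must be tested before the UTF-16 little-endian pattern (FF FE), which is its prefix.

program sniff_bom
! hypothetical sketch: identify a file's encoding from its leading BOM bytes
use iso_fortran_env, only : error_unit
implicit none
character(len=*),parameter :: filename = 'testfile.txt' ! placeholder name
character(len=4) :: head
integer :: ios, lun
head = achar(0)//achar(0)//achar(0)//achar(0)
open(newunit=lun,file=filename,access='stream',form='unformatted', &
& status='old',iostat=ios)
if(ios /= 0)then
   write(error_unit,'(a)') 'cannot open '//filename
   stop 1
endif
read(lun,iostat=ios) head  ! a short file leaves head as initialized
close(lun)
if(head(1:3) == achar(239)//achar(187)//achar(191))then             ! EF BB BF
   print '(a)', 'UTF-8 with BOM'
elseif(head(1:4) == achar(255)//achar(254)//achar(0)//achar(0))then ! FF FE 00 00
   print '(a)', 'UTF-32 little-endian'
elseif(head(1:4) == achar(0)//achar(0)//achar(254)//achar(255))then ! 00 00 FE FF
   print '(a)', 'UTF-32 big-endian'
elseif(head(1:2) == achar(255)//achar(254))then                     ! FF FE
   print '(a)', 'UTF-16 little-endian'
elseif(head(1:2) == achar(254)//achar(255))then                     ! FE FF
   print '(a)', 'UTF-16 big-endian'
else
   print '(a)', 'no BOM; assume UTF-8 or ASCII'
endif
end program sniff_bom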
In practice UTF-8 has many advantages over the other encodings when used for file data. It is not affected by endianness; it contains 7-bit ASCII as a subset; it can avoid being misconstrued as an extended-ASCII character set such as Latin-1 or Latin-2 (commonly used with modern European languages); and it can represent all Unicode characters while remaining as compact as ASCII when a file is composed predominantly of ASCII characters, which is still often the case.
All the Unicode encodings are sensitive to byte order except UTF-8. That alone might make UTF-8 the preferred text file format.
UTF-8 has become so dominant as the Unicode file encoding scheme that use of a BOM is no longer even recommended unless required to work properly with particular applications. Even when a file is opened with encoding='UTF-8', a Byte Order Mark (BOM) is not generated automatically by any (current) Fortran compiler by default.
However, note that the NAG Fortran compiler has a -bom=Asis|Remove|Insert option.
BOM characters are found most often in Microsoft Windows environments. The BOM as a "magic string" was used in virtually all Unicode files on Windows when Unicode was initially introduced, partly because Microsoft supported multiple Unicode text file formats early on, before UTF-8 was seen as the de facto text file encoding.
Note that to qualify as a BOM the sequence must appear at the beginning of the file, not in the middle of a data stream. The Unicode Standard says that if U+FEFF does not appear first it should be interpreted as a normal code point, a zero-width no-break space (a word-joining function now served by U+2060 WORD JOINER), not as a BOM.
The Unicode Standard permits the BOM in UTF-8 files, but does not require or recommend its use.
Some references state that if it is encountered "its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream".
One such place might be a Fortran source file! Multi-byte characters are non-standard as part of the code itself, but are often tolerated when they appear in comments and quoted character literals.
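As an illustration (a minimal sketch; whether it compiles is compiler-dependent, as the standard does not require support for non-ASCII source bytes), UTF-8 sequences confined to a comment and a quoted literal are typically passed through as opaque bytes:

program utf8_passthrough
! a non-ASCII comment is usually tolerated: résumé, naïve, °C
implicit none
   ! the literal below contains multi-byte UTF-8 sequences, which most
   ! compilers treat as opaque bytes inside quoted strings
   write(*,'(a)') 'temperature: 20 °C (micro sign: µ)'
end program utf8_passthrough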
Some applications may require it. Most relevant here is that the NAG Fortran compiler has an extension that formally supports UTF-8 source files, which are required to start with a BOM to distinguish them from ASCII files.
The GNU/Linux or Unix command file(1) will usually identify a file starting with a BOM as UTF-8 encoded; but otherwise determining whether a text is encoded in UTF-8, UTF-16, or UTF-32, especially without explicit metadata, relies on analyzing the byte sequence for patterns specific to each encoding.
The Unicode Standard also does not recommend removing a BOM when one is present, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.
Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance many non-Fortran programming languages permit non-ASCII bytes in string literals but not at the start of the file.
This program creates a Fortran source file starting with a UTF-8 BOM. Try compiling the output program to see whether your compiler accepts it. Compilation could fail because the BOM is a character outside the Fortran character set appearing outside of a comment or literal string.
program bom_bytes
use iso_fortran_env, only : stdout => output_unit
implicit none
character(len=*),parameter :: &
& A_bom = char(int(z'EF'))// char(int(z'BB'))// char(int(z'BF'))
write(stdout,'(a)') &
A_bom//'program testit ! Unicode BOM as utf-8 bytes' ,&
' write(*,*)"File starts with BOM from ""bytes"" write!"' ,&
'end program testit'
end program bom_bytes
---
This program also generates a program source file whose first character is the BOM, but it requires the compiler to support the optional ISO 10646 (UCS-4) character kind.
program bom_ucs4
use iso_fortran_env, only : stdout => output_unit
implicit none
intrinsic selected_char_kind
integer,parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=*,kind=ucs4),parameter :: U_bom=char(int(z'FEFF'),kind=ucs4)
open(stdout,encoding='UTF-8')
write(stdout,'(a)',advance='no')U_bom
write(stdout,'(a)') &
ucs4_'program testit ! Unicode BOM encoded to utf-8 bytes by Fortran' ,&
ucs4_' write(*,*)"File starts with BOM from UCS-4 write!"' ,&
ucs4_'end program testit'
end program bom_ucs4
See the Wikipedia entry for more information on the BOM Unicode character, U+FEFF (a.k.a. ZERO WIDTH NO-BREAK SPACE).