Not all Fortran compilers provide high-level ISO-10646 (i.e. “Unicode”) support. To determine whether a compiler does, try to compile and run the following program:
program test_for_iso_10646
implicit none
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
write(*,*) trim(merge('ISO-10646 SUPPORTED    ', & ! both strings must be the same length
                      'ISO-10646 NOT SUPPORTED', &
                      ucs4>0))
end program test_for_iso_10646
If the supplemental ISO-10646 standard is supported, select a terminal emulator, font, and system locale such that this next program prints an emoji to the screen:
program test_for_iso_10646
use iso_fortran_env, only : output_unit
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
open(output_unit,encoding='utf-8')
write(output_unit,'(*(g0,1x))') & ! 😃
& 'Smiling face with open mouth',char(int(z'1F603'),kind=ucs4)
end program test_for_iso_10646
If the terminal is not set up that way, the extensions will still work with files, but not with standard input and output on the screen. The defaults are likely to work; if they do not, how to display UTF-8 data on the screen is generally well documented but very system-dependent.
If the ISO-10646 supplement is not supported, Unicode usage will require lower-level knowledge of byte-level Fortran processing and I/O and of the hosting operating system, which is covered in a different guide. This introduction applies only to compilers providing ISO-10646 support.
Fortran Unicode support is straightforward when just reading and writing UTF-8-encoded files. Very little differs from processing ASCII files.
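In many cases all that is required is to declare the character variables with the ISO_10646 kind and to open the files with ENCODING='UTF-8'.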
Fortran will then convert the data from UTF-8 files to whichever Unicode encoding it uses internally (UTF-8, UTF-32, UTF-16, …) on input, and convert back to UTF-8 on output.
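As a minimal sketch of this round trip, assuming a compiler with ISO_10646 support, the following program writes a two-glyph string to a UTF-8 file and reads it back. The file name roundtrip.utf and the variable names are illustrative choices only.

```fortran
program utf8_roundtrip
! sketch: write UCS-4 data to a UTF-8 file and read it back
implicit none
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=2,kind=ucs4)  :: sent
character(len=80,kind=ucs4) :: received
integer :: lun
   ! build a two-glyph string from the code points U+4F60 and U+597D
   sent = char(int(z'4F60'),kind=ucs4) // char(int(z'597D'),kind=ucs4)
   ! 'roundtrip.utf' is a hypothetical scratch file name
   open (newunit=lun, file='roundtrip.utf', encoding='UTF-8', &
      & action='readwrite', status='replace')
   write(lun,'(a)') sent        ! UCS-4 in memory -> UTF-8 in the file
   rewind(lun)
   read(lun,'(a)') received     ! UTF-8 in the file -> UCS-4 in memory
   close(lun, status='delete')
   ! only ASCII is written to the screen, so no terminal setup is needed
   write(*,'(a,l1)') 'round trip preserved the string: ', trim(received) == sent
   write(*,'(a,i0)') 'glyphs read: ', len_trim(received)
end program utf8_roundtrip
```

Because the results are reported as a logical and a count, the sketch can be run even on a terminal that is not configured for UTF-8.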
If standard-conforming, the internal representation will be UCS-4, as the standard description of the intrinsic SELECTED_CHAR_KIND() states:
If NAME has the value “ISO_10646”, then the result has a value equal to that of the kind type parameter of the ISO_10646 character kind (corresponding to UCS-4 as specified in ISO/IEC 10646) if the processor supports such a kind; otherwise the result has the value -1.
This automatic conversion between UCS-4 (also known as UTF-32) and UTF-8 encoding is not so different from what occurs when reading and writing numeric values from ASCII files. The binary representation of the numbers (REAL, INTEGER, COMPLEX, …) used internally by the program is very different from the human-readable ASCII representation, but Fortran performs that conversion automatically as well whenever formatted I/O is requested.
Why not just use UTF-8 encoding directly? UCS-4 encoding represents each glyph or character using four bytes. That is, each character is basically represented as a 32-bit value. This makes it much easier to provide arrays and optimized intrinsics than when using UTF-8 encoding, which requires supporting multi-byte characters from one to four bytes.
So it is assumed here that “ISO_10646” implies standard-conforming UCS-4 encoding internally, but the same rules apply if your compiler supports a UCS-2 encoding extension and you select it instead (except that UCS-2 requires less storage but cannot represent as wide a range of Unicode glyphs).
For the purposes of this tutorial what matters is that you know the internal representation is encoded differently than in the UTF-8 files, and that one kind cannot be converted to the other simply by copying bytes from one representation to the other.
Also note that the memory required to hold UCS-4 characters is four times greater than if they were ASCII characters, as all UCS-4 characters are 4-byte values and all ASCII characters are 1-byte.
Many useful programs can adhere to these restrictions.
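The storage cost noted above can be checked with a short sketch, assuming the compiler supports both the ISO_10646 kind and the Fortran 2008 STORAGE_SIZE() intrinsic. The program and variable names are illustrative.

```fortran
program ucs4_storage
! sketch: compare the storage used by one character of each kind
implicit none
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=1)           :: a   ! default character kind
character(len=1,kind=ucs4) :: u   ! ISO_10646 character kind
   a = 'A'
   u = ucs4_'A'
   ! STORAGE_SIZE() reports bits, so divide by 8 for bytes
   write(*,'(a,i0)') 'bits per default character: ', storage_size(a)
   write(*,'(a,i0)') 'bits per UCS-4 character:   ', storage_size(u)
end program ucs4_storage
```

On a compiler that stores default characters in one byte and UCS-4 characters in four, the two values reported should be 8 and 32.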
A simplistic example that reads a UTF-8 file with lines up to 4096 glyphs and outputs the file prefixing each line with a glyph/character count demonstrates that very little differs from a similar program which processes ASCII files:
program count_glyphs
! @(#) read utf-8 file and write it back out prefixed with line glyph counts
use, intrinsic :: iso_fortran_env,only : stdout=>output_unit, stdin=>input_unit
implicit none
intrinsic selected_char_kind
intrinsic is_iostat_end
intrinsic len_trim
!------
! DIFFERENCE: we will be using the kind name "ucs4" for Unicode variables
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
!------
character(len=*),parameter :: g= '(*(g0))'
integer :: length
integer :: i
integer :: iostat
!------
! DIFFERENCE: string declared with KIND=UCS4. This statement
! specifies a maximum line length of 4096 glyphs not bytes
! as this character variable is Unicode ISO_10646, not ASCII
character(len=4096,kind=ucs4) :: uline
!------
character(len=255) :: iomsg
!------
! DIFFERENCE: you can change the encoding used for a file dynamically,
! even on pre-assigned files so make sure stdin and stdout are set to
! expect to format UCS4-encoded internal data as UTF-8 encoded files:
open (stdin, encoding='UTF-8')
open (stdout, encoding='UTF-8')
!------
! copy file to stdout, prefixing each line with a glyph/character count
do
   read(stdin,'(a)',iostat=iostat,iomsg=iomsg)uline
   if(iostat.eq.0)then
      !------
      ! NOTE: LEN_TRIM() works with UCS-4 just as with ASCII
      length=len_trim(uline)
      !------
      !------
      ! NOTE: String substrings work just as with ASCII
      write(stdout,'(i9,": ",a)')length,uline(:length)
      !------
   elseif(is_iostat_end(iostat))then
      exit
   else
      !------
      ! NOTE:
      ! does the ASCII message have to be converted to UCS-4?
      ! This will be discussed in detail later, but for now
      ! remember you can change the encoding of a file dynamically
      ! anyway
      open (stdout, encoding='DEFAULT')
      !------
      write(stdout,g)'<ERROR>',trim(iomsg)
      stop
   endif
   ! and the answer is that unless you are going to output a series
   ! of bytes in the message that do not represent an ASCII-7 or
   ! UCS-4 character (which would be a very unusual thing to be
   ! doing) you can leave the encoding set to UTF-8 and output
   ! traditional CHARACTER(kind=DEFAULT) variables just fine.
enddo
end program count_glyphs
So if we create a file called “upagain.utf” containing
七転び八起き。
転んでもまた立ち上がる。
くじけずに前を向いて歩いていこう。

Romanization:

   Nanakorobi yaoki.
   Koronde mo mata tachiagaru.
   Kujikezu ni mae o muite aruite ikou.

English translation:

   "Fall seven times, stand up eight.
   Even if you fall down, you will get up again.
   Don't be discouraged, just keep walking forward."
and make sure that our terminal displays UTF-8 files properly by displaying that file to the screen, then running the program
./count_glyphs < upagain.utf
should produce
        7: 七転び八起き。
       12: 転んでもまた立ち上がる。
       17: くじけずに前を向いて歩いていこう。
        0:
       13: Romanization:
        0:
       20:    Nanakorobi yaoki.
       30:    Koronde mo mata tachiagaru.
       39:    Kujikezu ni mae o muite aruite ikou.
        0:
       19: English translation:
        0:
       37:    "Fall seven times, stand up eight.
       48:    Even if you fall down, you will get up again.
       52:    Don't be discouraged, just keep walking forward."
That is how simple basic Unicode usage is in Fortran. The data will be converted from UTF-8 files to UCS-4 internal representation and back again transparently. The CHARACTER substring indexing and intrinsic functions such as LEN(), TRIM(), VERIFY(), INDEX(), and SCAN() are generic, and will work with Unicode as simply as with ASCII.
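To illustrate that the generic intrinsics and substring indexing count glyphs rather than bytes, here is a small sketch, again assuming ISO_10646 support. It builds the first line of the sample file from its Unicode code points, so the source file itself stays pure ASCII. The program and variable names are illustrative.

```fortran
program ucs4_intrinsics
! sketch: generic intrinsics and substrings applied to a UCS-4 string
use, intrinsic :: iso_fortran_env, only : stdout=>output_unit
implicit none
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
! code points of the seven glyphs in the first line of the sample file
integer, parameter :: codes(*) = [ int(z'4E03'), int(z'8EE2'), int(z'3073'), &
                                 & int(z'516B'), int(z'8D77'), int(z'304D'), &
                                 & int(z'3002') ]
character(len=10,kind=ucs4) :: proverb
character(len=1,kind=ucs4)  :: eight
integer :: i
   proverb = ucs4_' '                         ! blank-fill all ten positions
   do i = 1, size(codes)
      proverb(i:i) = char(codes(i),kind=ucs4) ! one glyph per position
   enddo
   eight = char(int(z'516B'),kind=ucs4)       ! the fourth glyph
   open (stdout, encoding='UTF-8')
   write(stdout,'(a,i0)') 'len      = ', len(proverb)          ! declared length: 10
   write(stdout,'(a,i0)') 'len_trim = ', len_trim(proverb)     ! glyphs stored: 7
   write(stdout,'(a,i0)') 'index    = ', index(proverb, eight) ! position: 4
   write(stdout,'(a,a)')  'substring= ', proverb(4:6)          ! glyphs 4 through 6
end program ucs4_intrinsics
```

The three counts are 10, 7, and 4 regardless of how many bytes each glyph occupies in UTF-8. Whether the final line displays correctly depends on the terminal setup described earlier.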