Fortran Wiki
lesson3_ucs4

Introduction to Fortran Unicode support

Lesson III: mixing ASCII and UCS4 kinds as regards concatenation and assignments

Concatenation, Assignment, and automatic conversion

Assignment

Concerning assignment, the Fortran standard states:

if the variable is of type character and of ISO 10646, ASCII, or default character kind, expr shall be of ISO 10646, ASCII, or default character kind, otherwise if the variable is of type character expr shall have the same kind type parameter,

For an intrinsic assignment statement where the variable is of type character, if expr has a different kind type parameter, each character “c” in expr is converted to the kind type parameter of the variable by

   ACHAR(IACHAR(c),KIND(variable)).

NOTES

For nondefault character kinds, the blank padding character is processor dependent.

When assigning a character expression to a variable of a different kind, each character of the expression that is not representable in the kind of the variable is replaced by a processor-dependent character.

Unfortunately, this means UTF-8 data is not recognized as such. If a default CHARACTER string holds UTF-8 encoded text, assigning it to a UCS-4 string converts it character by character rather than decoding it, so multi-byte sequences do not become the intended characters. Conversely, assigning a UCS-4 value to an ASCII variable replaces every non-ASCII character with a processor-dependent “not represented” character.
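The lack of UTF-8 decoding can be seen in the lengths alone. The following is a minimal sketch, assuming a processor that supports the ISO_10646 kind and a default character kind that accepts codes 0 to 255 in CHAR(3) (as gfortran does). The program and variable names are illustrative only. The UTF-8 bytes are built explicitly so the result does not depend on how the source file itself is encoded.

```fortran
program utf8_bytes_are_not_decoded
implicit none
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=:), allocatable            :: bytes
character(len=:, kind=ucs4), allocatable :: wide
   ! U+00E9 (e with acute accent) is the two bytes C3 A9 in UTF-8,
   ! so "cafe" with an accented final letter occupies five bytes
   bytes = 'caf' // char(int(z'C3')) // char(int(z'A9'))
   ! assignment converts character by character; nothing is decoded
   wide  = bytes
   print '(a,i0)', 'len(bytes) = ', len(bytes)
   print '(a,i0)', 'len(wide)  = ', len(wide)
end program utf8_bytes_are_not_decoded
```

Both lengths are 5. A conversion that understood UTF-8 would have produced a four-character UCS-4 string.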

program assignment
use iso_fortran_env, only : stdout=>output_unit, stdin=>input_unit
implicit none

intrinsic selected_char_kind

integer, parameter :: default = selected_char_kind ("default")
integer, parameter :: ascii =   selected_char_kind ("ascii")
integer, parameter :: ucs4  =   selected_char_kind ('ISO_10646')

character(len=:),allocatable           :: aline, a1, a2
character(len=:,kind=ucs4),allocatable :: uline, u1, u2
character(len=1),allocatable           :: ch(:), ch2(:)
character(len=1,kind=ucs4),allocatable :: glyph(:)
integer                                :: i
integer                                :: iostat
integer                                :: nerr
character(len=1)                       :: paws
character(len=1,kind=ucs4)             :: smiley=char(int(z'1F603'),kind=ucs4) ! 😃 Smiling face with open mouth

   open (stdout, encoding='DEFAULT')
   open (stdout, encoding='UTF-8')
   !
   ! characters not representable in the other kind are replaced on assignment

   write(stdout,'(A)')repeat(' ',80)
   write(stdout,'(A)')'assign RHS ucs4 to LHS ascii'
   uline=char(int(z'261B'),ucs4) // ucs4_'UCS-4 string' // char(int(z'261A'),ucs4)
   write(stdout,'(a)')trim(uline)
   aline=uline ! non-ASCII characters are replaced by a processor-dependent character
   write(stdout,'(a)')trim(aline) // ' assigned to ASCII'

   write(stdout,'(A)')repeat(' ',80)
   write(stdout,'(A)')'assign RHS ascii to LHS ucs4'
   aline=ascii_'ASCII string'
   write(stdout,'(a)')trim(aline)
   uline=aline ! all ASCII 7-bit characters can be represented in UCS-4
   write(stdout,'(a)')trim(uline)//ucs4_' assigned to UCS4'

   write(stdout,'(A)')'round trip for all default-kind characters 0 to 255'

   write(stdout,'(A)')repeat(ucs4_'=',80)
   ch=[(char(i),i=0,255)] ! the constructed array has bounds 1:256
   open (stdout, encoding='DEFAULT')
   write(stdout,'(10(g0,1x,g0,1x))')(ch(i),i=1,size(ch))
   open (stdout, encoding='UTF-8')
   write(stdout,'(10(g0,1x,g0,1x))')(ch(i),i=1,size(ch))
   read(stdin,'(a)',iostat=iostat)paws

   write(stdout,'(A)')repeat(ucs4_'=',80)
   glyph=ch ! default kind to UCS-4
   write(stdout,'(10(g0,1x,g0,1x))')(glyph(i),i=1,size(glyph))
   read(stdin,'(a)',iostat=iostat)paws

   write(stdout,'(A)')repeat(ucs4_'=',80)
   ch2=glyph ! UCS-4 back to default kind
   write(stdout,'(10(g0,1x,g0,1x))')(ch2(i),i=1,size(ch2))
   read(stdin,'(a)',iostat=iostat)paws

   write(stdout,'(A)')repeat(ucs4_'=',80)

   write(stdout,'(a,L1)') 'round trip returned all values unchanged? ',all( ch .eq. ch2)

end program assignment

The output of this example takes some study, but the main lesson is simple: assigning a quoted constant string to a UCS-4 variable only works reliably when the constant contains nothing but ASCII characters.

Fortran statements other than READ and WRITE are unaware of any non-ASCII encoding a constant string may carry; they treat its contents as individual default-kind characters.
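By contrast, a READ from a unit opened with ENCODING='UTF-8' does decode the data. The following sketch assumes a processor that supports the ISO_10646 kind and UTF-8 encoded formatted files, and that writes default-kind characters to a file as raw bytes (as gfortran does). The file name utf8_demo.txt is a hypothetical scratch name chosen for this illustration.

```fortran
program read_decodes_utf8
implicit none
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=*), parameter  :: fname = 'utf8_demo.txt' ! hypothetical file name
character(len=10, kind=ucs4) :: wide
integer                      :: lun
   ! write the raw UTF-8 bytes for U+00E9 (C3 A9) as default-kind characters
   open (newunit=lun, file=fname, status='replace')
   write (lun, '(a)') char(int(z'C3')) // char(int(z'A9'))
   close (lun)
   ! reopen the same file as UTF-8 and read into a UCS-4 variable
   open (newunit=lun, file=fname, status='old', encoding='UTF-8')
   read (lun, '(a)') wide
   close (lun, status='delete')
   print '(a,i0)', 'characters read: ', len_trim(wide)
   print '(a,z0)', 'code point (hex): ', ichar(wide(1:1))
end program read_decodes_utf8
```

On such a processor the two bytes should arrive as a single UCS-4 character with code point E9, which is the decoding that a plain assignment does not perform.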

Concatenation

A limitation of concatenation is that all the strings have to be of the same KIND, so you cannot simply append UCS-4 and ASCII-7 strings.

And we have already seen that assignment between the kinds preserves only the representable characters; the others are replaced.

But the definition of assignment includes an equivalent conversion defined in terms of ACHAR(3) and IACHAR(3):

   ACHAR(IACHAR(c),KIND(variable)).

So we can write functions that do what an assignment does, and use them inside an expression to overcome the first limitation, that everything concatenated must be of the same kind.

We will do that in the following concatenation example; but that function will still not transfer UTF-8 encoded data properly.

program concatenate
use iso_fortran_env, only : stdout=>output_unit, stdin=>input_unit
implicit none

intrinsic selected_char_kind

integer, parameter :: default = selected_char_kind ("default")
integer, parameter :: ascii =   selected_char_kind ("ascii")
integer, parameter :: ucs4  =   selected_char_kind ('ISO_10646')

character(len=*),parameter             :: g='(*(g0))'
character(len=:),allocatable           :: aline, a1, a2
character(len=:,kind=ucs4),allocatable :: uline, u1, u2
character(len=1),allocatable           :: ch(:), ch2(:)
character(len=1,kind=ucs4),allocatable :: glyph(:)
integer                                :: i
integer                                :: iostat
integer                                :: nerr
character(len=1)                       :: paws
                                       !  😃 Smiling face with open mouth
character(len=1,kind=ucs4)             :: smiley=char(int(z'1F603'),kind=ucs4) 

   open (stdout, encoding='DEFAULT')
   open (stdout, encoding='UTF-8')
   !
   ! Concatenation:
   !
   write(stdout,'(A)')repeat('=',80)
   write(stdout,'(a)')'strings of different kinds cannot be concatenated.'
   !uline='ascii string'// smiley // 'ascii string' ! NO. Kinds must match

   write(stdout,'(a)') 'Of course constants can have their KIND specified.'
   uline=ucs4_'first UCS4 string' // smiley // ucs4_'another UCS4 string '
   write(stdout,'(A)') uline
   !
   write(stdout,'(A)')repeat('=',80)
   write(stdout,'(a)') 'you can use simple assigns to do conversions'
   ! so if I have a UCS4 string
   u1=smiley // ucs4_'UCS4 strings' // smiley // ucs4_'appended together' // smiley
   ! and an ASCII string
   a1='ascii strings' // 'appended together'
   ! the ASCII string can be converted to UCS4 with an assign
   u2=a1 ! use assignment to convert ASCII to UCS4
   ! now with a copy of everything as UCS4 the append will work
   uline=u1//u2 ! now append together the two strings which are now of the same kind
   write(stdout,'(a)') uline
   !
   write(stdout,'(A)')repeat('=',80)
   write(stdout,'(a)') 'we can make functions to convert to and from ASCII and UCS4'
   ! using the same conversions as used by an assign.
   uline=smiley // ascii_to_ucs4('ascii string') // smiley // ucs4_'unicode string' // smiley
   write(stdout,'(a)') uline
   !
   write(stdout,'(A)')'unrepresentable characters:'
   write(stdout,'(a)')'what about characters that have no equivalent in the other kind?'
   write(stdout,'(A)')'conversion by assignment'
   aline=uline
   write(stdout,g) aline,' ',len(aline),' ',len(uline)
   write(stdout,'(a)') 'conversion by ACHAR/ICHAR:'
   aline=ucs4_to_ascii(uline) ! is "smiley" replaced with a character used for errors?
   write(stdout,g) aline,' ',len(aline),' ',len(uline)
   write(stdout,'(a)') 'which character replaces the unrepresentable characters is processor-dependent'
   write(stdout,'(a)') 'and might be unprintable'
   aline=smiley
   write(stdout,'(a,i0,a)') 'ADE:',ichar(aline),' CHARACTER:',aline
   write(stdout,'(A)')repeat('=',80)

contains

function ascii_to_ucs4(astr) result(ustr)
! @(#) make the same conversion as an assignment statement from ASCII to UCS4
character(len=*,kind=ascii),intent(in) :: astr
character(len=len(astr),kind=ucs4)     :: ustr
integer                                :: i
   do i=1,len(astr)
      ustr(i:i)=achar(iachar(astr(i:i)),kind=ucs4)
   enddo
end function ascii_to_ucs4

function ucs4_to_ascii(ustr) result(astr)
! @(#) make the same conversion as an assignment statement from UCS4 to ASCII
character(len=*,kind=ucs4),intent(in)  :: ustr
character(len=len(ustr),kind=ascii)    :: astr
integer                                :: i
   do i=1,len(ustr)
      astr(i:i)=achar(iachar(ustr(i:i)),kind=ascii)
   enddo
end function ucs4_to_ascii

end program concatenate

Expected Output

================================================================================
strings of different kinds cannot be concatenated.
Of course constants can have their KIND specified.
first UCS4 string😃another UCS4 string 
================================================================================
you can use simple assigns to do conversions
😃UCS4 strings😃appended together😃ascii stringsappended together
================================================================================
we can make functions to convert to and from ASCII and UCS4
😃ascii string😃unicode string😃
unrepresentable characters:
what about characters that have no equivalent in the other kind?
conversion by assignment
?ascii string?unicode string? 29 29
conversion by ACHAR/ICHAR:
?ascii string?unicode string? 29 29
which character replaces the unrepresentable characters is processor-dependent
and might be unprintable
ADE:3 CHARACTER:
?
================================================================================

Summary

Assignment easily converts ASCII-7 to UCS-4 and extracts ASCII-7 from UCS-4 strings, but it does not account for UTF-8 encoding in any way.

Concatenation is only allowed between strings of the same KIND.

It is easy to write functions that perform the same conversion as assignment. These make it simpler to pass UCS-4 values as ASCII INTENT(IN) arguments on procedure calls and statements, which is a common need. For example, a UCS-4 variable cannot be used as a filename on an OPEN(3), even if all of its characters are in the ASCII-7 set. You can assign the string to an ASCII scratch variable, or call a function such as UCS4_TO_ASCII(3) above, to resolve that and similar issues where your data is encoded as UCS-4 but another procedure expects ASCII.
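The OPEN(3) case can be sketched as follows, using a scratch variable of default kind. This is a minimal illustration, assuming a processor that supports the ISO_10646 kind; the file name ucs4_demo.txt and the variable names are hypothetical.

```fortran
program ucs4_filename
implicit none
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=:, kind=ucs4), allocatable :: uname
character(len=:), allocatable            :: aname
integer                                  :: lun, ios
   uname = ucs4_'ucs4_demo.txt'  ! hypothetical name, ASCII characters only
   ! open (newunit=lun, file=uname) ! NO. FILE= requires default character kind
   aname = uname                 ! assignment converts UCS-4 to default kind
   open (newunit=lun, file=aname, status='replace', iostat=ios)
   if (ios == 0) then
      write (lun, '(a)') 'opened via a default-kind copy of the name'
      close (lun, status='delete') ! remove the demonstration file again
   endif
   print '(a,i0)', 'iostat from OPEN: ', ios
end program ucs4_filename
```

A conversion function such as UCS4_TO_ASCII(3) could be called directly in the FILE= specifier instead, which avoids the scratch variable when the default and ASCII kinds are the same.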