Character-handling in Fortran

Version 2021 April 8

Introduction

Although Fortran programs are mostly used to process numbers, it is often necessary to handle strings of characters as well, for example to manipulate names of files or data objects, to read or write text files, or to read from the keyboard or write to the screen.

The character data type was introduced in Fortran 77, replacing the very primitive facilities of the Hollerith constant.  At the time this was regarded as a major advance, especially because in Fortran a character variable could hold as long a string as you wanted, in contrast to the C language where a variable of type char could (and still can) only hold a single character.  The main limitations of Fortran's original character type are:
  1. The length of each character variable, that is the number of characters that it holds, has to be chosen when writing the program and it cannot be altered at run-time.
  2. In an assignment where the destination variable has a length that is too great, spaces are automatically appended up to the declared length. This is a considerable convenience and these extra spaces do little harm except to increase execution time and memory usage a little.
  3. On the other hand if the destination variable is too short then the string being assigned to it is simply truncated to fit, and there is no warning or error message when this happens: you simply lose data.  This is a more serious disadvantage.
In Fortran 2003 character variables could be declared allocatable, that is, to have a length that can be varied at run-time.  Even better, the length of a variable is set or re-set automatically in an assignment statement.  Allocatable strings are easier to use, generally more efficient, and also less error-prone than fixed-length ones. Unfortunately at present it is only in scalar assignment statements that the length gets set automatically, which seriously limits their usefulness in practice. This means that Fortran programmers still need to understand the rules for both fixed-length and allocatable strings. These rules are, as a result, quite complicated.  I have tried quite hard to discover an adequate description of them either online or in printed form, but could not find one, not even for ready money.  This is what prompted me to write my own description here.

Character constants

Character constants may be enclosed either in a pair of double-quotes or in a pair of apostrophes (single-quotes):
    print *, "Hello, World!" 
    print *, 'Hello, World!'


If you need to use a double-quote or apostrophe character within a constant it is simplest just to use the alternative character to enclose the string.  
        print *, "Don't do that"
The other way is to double up the enclosing character where it appears within the string, which produces the same result:
    print *, 'Don''t do that'


If a literal constant is too long to fit conveniently on one line it may be continued by putting an ampersand at the end of one line and another ampersand as the first non-space character of the next line, and so on, up to the limit of 255 continuation lines.
        print *, "This constant should appear on one line &
            &even though it appears in the source-code on two lines."

There is no escape character in Fortran as there is in some other languages. Constants in source-code can only contain characters which are in the Fortran character set. This is specified in section 6.1 of the Fortran 2018 Standard, but in practice all printing ASCII characters between codes 32 and 126 are included.  Control characters, such as horizontal tab or form-feed, should not be used.  To include a control character in a string see the section below on Character Sets and Kinds.

Named Character Constants

Named constants are declared with a length and the parameter attribute, but an asterisk may be used for the length, which saves having to count the characters in the constant itself:
        character(len=*), parameter :: version = "Myprog version 1st April 2021"
If the length is instead declared with a number and this does not match the actual length of the string, the value is padded with spaces or silently truncated to fit, so the asterisk form is safer.  In all declarations the "len=" part is optional, so this could be specified using character(*) but in this document the longer form will be used for clarity.

Scalar Variables

Each variable of type character has an additional property, its length, which is the number of characters that it contains.  The alternatives of fixed length and allocatable length will be described separately.  The maximum length of a character variable is system-dependent, but it is likely that all modern systems will cope with strings of up to 2**31 (over 2 billion) characters if there is enough memory available.

Allocatable Length

This is specified by giving a colon for the length and also an attribute of allocatable.  This example shows how to join two strings together (plus an extra space between them) using the concatenation operator // 

        character(len=:), allocatable :: firstname, surname, full_name
        firstname = "Jane"
        surname = "Doe"
        full_name = firstname // " " // surname

The length of each of these variables is set or reset automatically in every assignment statement. An allocatable variable which has not yet been assigned a value does not have a length of zero, it is simply unset and cannot be used in expressions.
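
The difference between an unset variable and one of zero length can be checked with the allocated intrinsic.  The short program below is only a sketch to illustrate the point; the program and variable names are my own invention.

```fortran
program alloc_demo
    implicit none
    character(len=:), allocatable :: name
    ! Before any assignment the variable is unallocated: it has no length at all.
    print '(a,l1)', "allocated before assignment: ", allocated(name)
    name = "Jane"
    print '(a,i0)', "length after first assignment: ", len(name)
    name = name // " Doe"          ! the length is reset automatically
    print '(a,i0)', "length after second assignment: ", len(name)
    name = ""                      ! allocated, but with a length of zero
    print '(a,l1,a,i0)', "allocated: ", allocated(name), "  length: ", len(name)
end program alloc_demo
```

This should report F before the first assignment, then lengths of 4 and 8, and finally T with a length of zero.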

Fixed Length

The syntax for declaring fixed-length strings is simpler but the programmer has to choose a length for each variable that suits the worst case.
        character(len=20) :: firstname, surname
        character(len=40) :: full_name
        firstname = "Jane"
        surname = "Doe"
        full_name = trim(firstname) // " " // trim(surname)

The liberal use of the trim function to remove the added padding is also required.  The rules for fixed-length strings (padding with spaces when the destination is too long, and truncation when too short) were noted earlier.
        character(len=20) :: surname
        surname = "Twistleton-Wickham-Feinnes"
        print *, "Surname = ", surname
The results of this print statement show that the length was inadequate:
 Surname = Twistleton-Wickham-F

An alternative way of declaring strings allows different fixed lengths for each one (the length given before the double-colon is the default for subsequent variables not given an explicit length).
        character(len=20) :: firstname, surname, fullname*40,  postcode*8
If no length is given the string has a length of one.  Strings of length zero are permitted, if rarely useful, and may arise as the result of calculations. In a fixed-length declaration (but not a variable-length one) an initial value can be specified, either in the same statement:
        character(len=4) :: status = "OK  "
or with a subsequent data statement
        character(len=4) :: status
        data status / "OK" /
In both cases a constant shorter than the declared length is padded with trailing spaces automatically, just as in an assignment, so the spaces written out in the first example are optional; a constant that is too long is truncated.  Declaring a length of 4 here simply leaves room for the later assignment of, say, an error code.  Note that in either case a variable given an initial value acquires the save attribute, that is if used in a procedure the value will be preserved from one invocation of the procedure to the next.  If it gets updated then it is the updated value which is saved; the value in the variable declaration (or data statement) is only used once, at the start of execution.
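
The effect of the implied save attribute is easiest to see in a small test.  This is a minimal sketch with made-up names (save_demo, report, ncalls): the initial values are used on the first call only.

```fortran
program save_demo
    implicit none
    call report()
    call report()
    call report()
contains
    subroutine report()
        ! Giving initial values here implies the save attribute for both variables.
        character(len=8) :: status = "first   "
        integer :: ncalls = 0
        ncalls = ncalls + 1
        print '(a,i0,2a)', "call ", ncalls, ": status = ", trim(status)
        status = "repeat"      ! this updated value is kept for the next call
    end subroutine report
end program save_demo
```

The first call prints "first" and the later ones print "repeat", while the counter goes 1, 2, 3 rather than being reset to zero each time.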

Substrings

The substring notation allows any consecutive set of characters to be extracted or replaced.  The general form is variable(i:n) where i and n are integer expressions specifying a substring from positions i to n inclusive. Characters in strings are always counted from one at the left-most end.
        character(len=:), allocatable :: city
        city = "Washington"
        print *, city(3:6)
This should print "shin".  
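The rules of substring notation are:
  1. If i is omitted, the substring starts at position 1.
  2. If n is omitted, the substring extends to the end of the string.
  3. If n is less than i, the substring has zero length.
  4. Otherwise both i and n must lie in the range from 1 to the length of the string.
For example city(:4) has the value "Wash" while city(8:) has the value "ton".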

Substring notation can also be used on the left-hand side of an assignment, in which case it causes the relevant character positions to be replaced, while all other characters in the string are unaffected.  So that after executing
        city(1:4) = "Kens"
the new value of the variable will be "Kensington".    Note that a substring reference on the left-hand-side cannot change the length of an item even if allocatable.  Thus
    city = "Rome"
    city(:) = "London"
will result in city holding only the four characters "Lond" if its length is allocatable (or a fixed-length of four).
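
The substring examples in this section can be collected into one small program for experiment.  It is only a sketch (the program name is arbitrary); the comments show the output I would expect.

```fortran
program substring_demo
    implicit none
    character(len=:), allocatable :: city
    city = "Washington"
    print '(a)', city(3:6)                    ! shin
    print '(a)', city(:4)                     ! Wash  (start defaults to 1)
    print '(a)', city(8:)                     ! ton   (end defaults to the length)
    print '(a,i0)', "length of city(5:4) = ", len(city(5:4))   ! a zero-length substring
    city(1:4) = "Kens"
    print '(a)', city                         ! Kensington
    city(:) = "London"                        ! the length stays at 10, so spaces are added
    print '(a,i0)', "length is still ", len(city)
end program substring_demo
```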

Intrinsic Functions

All the intrinsic functions in this section work equally well on fixed-length and allocatable strings.

trim(string) Returns the string with all trailing spaces removed - very useful with fixed-length strings
len_trim(string) Returns an integer with the length of the string not counting any trailing spaces
len(string) Returns an integer with the current length of the string (useful for allocatable lengths and in procedures)
adjustl(string) Returns the string with any leading spaces moved to the end
adjustr(string) Returns the string with any trailing spaces moved to the start
repeat(string, ntimes) Returns a string with the 1st argument repeated as many times as specified by the 2nd arg.
new_line(c) Returns the new-line character for the current operating system of the same kind as its argument.
achar(ipos) Returns the single character corresponding to integer position ipos in the ASCII code table
iachar(c) Returns the integer code for that single character in the ASCII code table (must have length of one)

The function char works like achar and ichar works like iachar but using the local character code.  These will be identical to achar and iachar if the local code is ASCII.
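
A quick way to get a feel for these functions is to apply them to a fixed-length variable containing both leading and trailing spaces.  The following is a sketch only; the square brackets are there just to make the spaces visible.

```fortran
program intrinsic_demo
    implicit none
    character(len=10) :: word
    word = "  abc"                                   ! stored as "  abc     "
    print '(a,i0)', "len      = ", len(word)         ! 10
    print '(a,i0)', "len_trim = ", len_trim(word)    ! 5
    print '(3a)', "[", trim(word), "]"               ! [  abc]
    print '(3a)', "[", adjustl(word), "]"            ! [abc       ]
    print '(3a)', "[", adjustr(word), "]"            ! [       abc]
    print '(a)', repeat("-=", 5)                     ! -=-=-=-=-=
    print '(a,i0)', "iachar('A') = ", iachar("A")    ! 65
    print '(2a)', "achar(97)   = ", achar(97)        ! a
end program intrinsic_demo
```

Note that trim only removes trailing spaces, so the two leading spaces survive; trim(adjustl(word)) is the usual way to remove both.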

String Searches

There are three intrinsic functions useful in searching the contents of strings: index, scan, and verify.  All search the first argument string for a substring or sub-set which is the second argument.  There is a third optional argument of type logical.  If this is .true. then the search goes backwards from the end.  A successful search will return a positive integer which is the character position of the first match found; an unsuccessful one returns 0.

The index function searches for a substring which may be one or more characters long.  For example to take a filename and replace the old extension (assumed to follow the last dot in the name) with a new one, you might think of doing something like this:
    character(len=:), allocatable :: filename
    integer :: dotpos
    !
    filename = "mydata.in"

    dotpos = index(filename, ".", .true.)
    filename(dotpos:) = ".output"
But there are two things wrong with this.  Firstly, if there happens not to be any dot in the filename then the value of dotpos will be zero, so that the last assignment will have an invalid substring reference likely to result in an error exit. If you were to replace it with this:
    filename(dotpos+1:) = "output"
it will not do what is required either, since the filename variable has allocatable length, and the substring reference on the left-hand side means that the result will have the same length as before, so only the first two characters of the new extension will be stored.  This might work if filename had been declared to have a suitable fixed length, but still does not seem good practice.  A slightly more robust solution would be this:
    filename = filename(1:dotpos) // "output"
In this case if there happens to be no dot, so that dotpos is zero, there will not be an invalid substring but one of zero length, so the new value of filename will just be "output".  In more robust code one would obviously check for the presence of a dot before applying a new extension.
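
One way of making that check is sketched below.  The module and function names (filename_tools, with_extension) are hypothetical, and the choice to append a dot and the extension when no dot is found is merely one reasonable policy, not the only one.

```fortran
module filename_tools
    implicit none
contains
    ! Hypothetical helper: replace (or add) the extension of a file name.
    function with_extension(filename, ext) result(newname)
        character(len=*), intent(in)  :: filename, ext
        character(len=:), allocatable :: newname
        integer :: dotpos
        dotpos = index(filename, ".", back=.true.)
        if (dotpos > 0) then
            newname = filename(1:dotpos) // ext      ! keep everything up to the last dot
        else
            newname = trim(filename) // "." // ext   ! no dot found, so append one
        end if
    end function with_extension
end module filename_tools

program extension_demo
    use filename_tools
    implicit none
    print '(a)', with_extension("mydata.in", "output")   ! mydata.output
    print '(a)', with_extension("mydata", "output")      ! mydata.output
end program extension_demo
```

Because the result is an allocatable string assigned as a whole, its length is set automatically whatever the lengths of the two arguments.  The sketch does not deal with a dot that appears only in a directory name earlier in the path.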

The scan function searches for the first instance of any of the characters given in the second argument. For example if given a file path which may include directory separators, which are backslashes on Windows but forward slashes on Linux and MacOS, one could find the position of the end of the path in a portable way with:
    pathend = scan(filepath, "/\", .true.)
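
As a sketch of how the result of scan might then be used, the program below splits a path into its directory and file name parts.  The path itself and the variable names directory and basename are invented for the example.

```fortran
program path_demo
    implicit none
    character(len=:), allocatable :: filepath, directory, basename
    integer :: pathend
    filepath = "/home/jane/results/mydata.in"
    pathend = scan(filepath, "/\", back=.true.)   ! position of the last separator, or 0
    directory = filepath(1:pathend)                ! zero length if there is no separator
    basename  = filepath(pathend+1:)
    print '(2a)', "directory = ", directory        ! /home/jane/results/
    print '(2a)', "basename  = ", basename         ! mydata.in
end program path_demo
```

If there is no separator at all then pathend is zero, the directory part is a zero-length string, and the whole path is taken as the file name.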

The verify function is the inverse of scan in that it searches for the first instance of a character which is not in the set specified.  For example if reading an input string which is supposed to be a whole number so it only contains digits or a sign, one could check that with a verify which returns 0 only if all characters are valid:
    if(verify(string, "0123456789+-") /= 0) then
        print *, "Invalid integer",  string
    end if

String Comparisons

These may be needed if you are searching a set of strings or sorting them.  The simplest way to do this is to use the appropriate relational operator just as when comparing numbers.  This works equally well for fixed length or allocatable strings.  For example:
        if( string1 >= string2 ) then
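The rules are:
  1. If the two strings have different lengths, the shorter one is treated as if it were padded on the right with spaces to the length of the longer one.
  2. The strings are then compared one character at a time from the left until a difference is found; the result depends on the relative positions of those two characters in the collating sequence of the system.
  3. If no difference is found the strings are equal.

In the past computers used many different character codes (I have encountered at least six), so Fortran has for many years provided a way of comparing strings according to the ASCII collating sequence even if that is not the native one by using instead of the relational operator one of four intrinsic functions.  These are named LLT (less than), LLE (less than or equals), LGT (greater than), LGE (greater than or equals).  Each function takes two string arguments and returns a logical result.  Now that ASCII is almost universal these are rarely needed.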
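
A few comparisons can be tried out as follows.  This is only a sketch, and the results given in the comments for the relational operators assume that the native character code is ASCII, as it is on practically all current systems.

```fortran
program compare_demo
    implicit none
    character(len=3) :: short
    character(len=6) :: long
    short = "abc"
    long  = "abc"                                    ! stored as "abc   "
    ! Trailing spaces do not affect the comparison.
    print '(a,l1)', 'short == long:          ', short == long          ! T
    print '(a,l1)', '"apple" < "banana":     ', "apple" < "banana"     ! T
    ! A space collates before any letter in ASCII.
    print '(a,l1)', '"abc" < "abcd":         ', "abc" < "abcd"         ! T
    ! Upper-case letters come before lower-case ones in ASCII.
    print '(a,l1)', 'llt("Zebra", "apple"):  ', llt("Zebra", "apple")  ! T
end program compare_demo
```

The last line shows why a simple sort using these operators puts all the capitalised words first, which is not always what is wanted.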

Case Conversions

Fortran does not have any intrinsic procedures to change the letters in a string from upper to lower-case or vice-versa.  This is slightly odd as they exist in most other programming languages. My guess is that in the good old days Fortran systems only used upper-case, while more recently it was probably thought that a good case-conversion intrinsic ought to cope with other alphabets as well, such as Greek and Cyrillic, so perhaps the whole issue was put off as being just too difficult.  Anyhow, for the special case of the Roman alphabet and US-ASCII, it is not too hard to write one's own function to do this.

    function to_upper(string) result (upcase)
    character(len=*), intent(in) :: string
    character(len=len(string))   :: upcase
    integer, parameter :: offset = iachar("a") - iachar("A")
    integer :: j
    upcase = string
    do j = 1, len(upcase)
       select case(upcase(j:j))
       case("a":"z")
          upcase(j:j) = achar(iachar(upcase(j:j)) - offset)
       end select
    end do
    end function to_upper

As explained elsewhere in this note, a function that uses the passed-length notation for its input argument is more flexible as it will work on both fixed length and allocatable strings, and the length of the returned string is the same as that of its argument, so that can also be fixed in the procedure interface.  I leave as an exercise for the reader the rather similar function to convert to all lower-case.

Intrinsic Subroutines

A number of intrinsic subroutines return information to the calling unit via character arguments, but these present more of a problem as they were designed in the era of fixed-length strings so do not assign a suitable length to an allocatable string if one is used as an argument.

The date_and_time subroutine is a simple case, as the date and time strings always return exactly 8 and 10 characters respectively.  It is perhaps sensible to call them with fixed-length strings but allocatable ones can be used if a suitable length is allocated in advance.
    character(len=:), allocatable :: date, time
    allocate(character(len=10) :: date, time)  ! use same length for convenience
    call date_and_time(date, time)
    print *, 'date=', date, 'time=', time
The date string will have two trailing spaces because it was made longer than the minimum needed.

Some of the other subroutines, including get_command, get_command_argument, and get_environment_variable have an alternative call which allows the length of the string of interest to be determined in advance, and then a suitable length allocated.  Thus to get the entire command-line:
    character(len=:), allocatable :: command
    integer :: cmdlen
    call get_command(length=cmdlen)
    if(cmdlen > 0) then
        allocate(character(cmdlen) :: command)
        call get_command(command)
        print *, "Command line: ", command
    end if
This is more cumbersome than if a fixed-length string had been used, but it entirely avoids the problem of having to estimate how long that would need to be.  Similar methods can be used for get_command_argument and get_environment_variable.
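
For example, the same two-call approach applied to an environment variable might look like the sketch below.  The variable name PATH is just a likely candidate on most systems, and envval, vlen and stat are names of my own choosing.

```fortran
program env_demo
    implicit none
    character(len=:), allocatable :: envval
    integer :: vlen, stat
    ! First call: ask only for the length of the value and a status code.
    call get_environment_variable("PATH", length=vlen, status=stat)
    if (stat == 0 .and. vlen > 0) then
        allocate(character(len=vlen) :: envval)
        ! Second call: the string is now exactly long enough to hold the value.
        call get_environment_variable("PATH", value=envval)
        print '(2a)', "PATH = ", envval
    else
        print '(a)', "PATH is not set or is empty"
    end if
end program env_demo
```

The status argument is zero when the variable exists, so testing it avoids allocating a string for a variable that is not defined.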

Data Type Conversions

Unlike some other languages, Fortran has no built-in functions to convert numbers to character strings or vice-versa, but its internal file read and write system is more flexible as it employs the power of the format specification.  An internal file write can transfer a number, or indeed any I/O list, into a string given as the data destination.  It is possible to use an allocatable length string but the write does not itself assign a length, so in this case a suitable length has to be chosen and allocated in advance, like this:
    character(len=:), allocatable :: vstring
    real :: pi = 3.14159e2
    allocate(character(len=15) :: vstring)   
    write(vstring, "(es15.3)") pi
Note that as with all formatted output of numbers, if the format width chosen is not wide enough for the data value then the resulting string, instead of holding a number, will be filled with asterisks, a condition which does not generate an I/O error.  List-directed transfers, i.e. with an asterisk in place of the format specification, are allowed but estimating the resulting string length can be harder.

Type conversion in the other direction, i.e. reading a string of characters and interpreting it as a number, can be done with an internal file read, for example:
    character(len=:), allocatable :: vstring
    real :: rvalue
    vstring = "123.456"
    read(vstring, '(f7.0)') rvalue

Note that the presence of an explicit decimal point in the string over-rides the default number of decimals (here 0) in the format specification. Since there are many things that can go wrong when reading characters that are supposed to represent a number, it will often be sensible to include in the read statement iostat and perhaps iomsg keywords to detect errors and provide information on them.
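
A sketch of such error checking is shown below.  It uses a list-directed read so that the width of the field does not matter, and a fixed-length variable for the message; the names parse, ios and msg are my own, and the wording of the message is entirely up to the compiler.

```fortran
program read_demo
    implicit none
    character(len=200) :: msg          ! fixed length, since iomsg does not allocate
    real    :: rvalue
    integer :: ios
    call parse("123.456")
    call parse("12x.456")
contains
    subroutine parse(text)
        character(len=*), intent(in) :: text
        ! iostat is set to zero on success and to a non-zero code on failure.
        read(text, *, iostat=ios, iomsg=msg) rvalue
        if (ios == 0) then
            print '(3a,f0.3)', '"', text, '" -> ', rvalue
        else
            print '(4a)', '"', text, '" is not a valid number: ', trim(msg)
        end if
    end subroutine parse
end program read_demo
```

The first string should be converted successfully while the second should be rejected, with the program continuing instead of stopping with a run-time error.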

The merge intrinsic provides a handy way of converting a logical value to a human-readable string, for example
    logical :: myvalue
    character(len=:), allocatable :: result
    myvalue = 2 > 3           ! this is false except for very large values of 2
    result = merge("yes", "no ", myvalue)
Note that the lengths of the first two arguments of merge must be the same, which is why the "no" value above has a space appended.

Procedure Interfaces

When writing your own function or subroutine which includes one or more character arguments there are several options. Firstly each dummy argument may have its direction of data-flow specified as intent (in), or (out), or (inout).  Then there are three options when declaring the string-length of each character dummy argument: it can have allocatable length (len=:), passed length (len=*), or fixed length (len=integer-expression).  If a dummy argument has allocatable length, then the actual argument must have allocatable length too.  But the converse is not true. If a dummy argument has fixed length then the actual argument has to be at least as long: if it is shorter this may generate a run-time error; if it is longer and the procedure also returns a value through it then the extra characters will not have been altered.

The descriptions here cover both functions and subroutines.  It is assumed here that the procedure has what is called an explicit interface, for example that it is in a module accessed via a use statement or is an internal procedure.  With implicit interfaces, that is when you effectively compile the calling and called program units independently, the options are more limited and the risks of interfacing errors going undetected are much greater.

The only essential difference between a function and a subroutine is that a function always returns a value through its name so is used in an expression not a call statement. It is generally regarded as good practice to write a function only when all arguments have intent (in), but this is not required by the rules of Fortran (except for pure functions).

For intent(in) dummy arguments allocatable length is not very useful as the length cannot be changed within the procedure and the actual argument must always be allocatable and actually allocated, thus disallowing an actual argument which is a constant or expression.  A passed-length dummy is more useful as it means that the length of the actual argument is passed over automatically.  This actual argument may be a constant, expression or a variable of fixed or allocatable length provided it already has a defined value.  Within the procedure the length cannot be changed but it may be different from one call to another.  For example:
    program test1
    call mysub("a constant")
    call mysub("a rather longer string")
    CONTAINS
        subroutine mysub(arg1)
        character(len=*), intent(in) :: arg1
        print *,'length=', len(arg1), ' value=', arg1
        end subroutine mysub
    end program test1


For intent(out) dummy arguments where the length of the result is not predictable when the procedure is invoked, as it depends on calculations within the procedure, declaring the dummy to have allocatable length often makes sense, but then this restricts calls to ones where the corresponding actual argument also has allocatable length:
    program test2
    character(len=:), allocatable :: result
    call mysub(result)
    print *, "result=", result
    CONTAINS
        subroutine mysub(arg1)
        character(len=:), allocatable, intent(out) :: arg1
        arg1 = "result of a calculation"
        end subroutine mysub
    end program test2

The same applies to intent(inout) dummy arguments but with these a value must be assigned to the actual argument before the procedure is called.

On the other hand if you want to have a dummy argument with intent (out) or (inout) and to allow calls where the actual argument may or may not be allocatable, then the best alternative is to use a passed-length dummy argument.  The length of the actual argument at the time of the call will be that used within the procedure, and this cannot be altered.  If the actual argument is allocatable then it must have a length set before the procedure call, in an assignment or by using an allocate statement.  Within the procedure the usual rules about string truncation or padding with spaces will apply.
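
The following sketch shows a passed-length output argument being called with both kinds of actual argument.  The names greet, short and flexible are made up for the illustration.

```fortran
program passed_length_demo
    implicit none
    character(len=5)              :: short
    character(len=:), allocatable :: flexible
    call greet(short)                        ! there is only room for 5 characters
    print '(3a)', "[", short, "]"            ! [Hello]
    allocate(character(len=15) :: flexible)  ! a length must be set before the call
    call greet(flexible)
    print '(3a)', "[", flexible, "]"         ! [Hello, World!  ]
contains
    subroutine greet(text)
        character(len=*), intent(out) :: text
        text = "Hello, World!"               ! truncated or padded to len(text)
    end subroutine greet
end program passed_length_demo
```

In neither case does the procedure change the length of the actual argument: the 13-character value is cut to 5 characters in the first call and padded to 15 in the second.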

Where the function name returns a character value there is no actual argument requiring a length match so it will usually be best to declare the function name to have allocatable length.  If the value is assigned in an assignment statement nothing more is required, but if it gets a value from an I/O operation it may need to have its length allocated first (or else write first to a temporary fixed-length string).

    function int_to_string(intval) result (funcname)
    integer, intent(in)           :: intval
    character(len=:), allocatable :: funcname
    ! initially allocate enough length for largest possible integer
    allocate(character(len=12) :: funcname)
    write(funcname, '(i0)') intval
    ! then reallocate
    funcname = trim(funcname)
    end function int_to_string

Alternatively, if the length of the function name or an output dummy argument can be computed easily from the inputs of the procedure, such as values of one or more of the input arguments (or from variables accessible in a host program unit), then a fixed-length string might work just as well.  An example of this is shown in the to_upper routine in an earlier section.

Input/Output Operations

Write and print statements present no problems when using either allocatable or fixed length strings in their data transfer lists, but the trim function will be useful rather often with fixed-length values.  Formatted writes using the A format descriptor work as you would expect producing a field width which is the same as the length of the data item in the I/O list.  If you use Aw to produce a field of width w then if the length of the item in the data transfer list is less than w the value will be right-justified in the field of width w, i.e. spaces are inserted on the left.  If the width of the data item is greater than w then the data string is truncated on the right after w characters.

Read statements reading into variables which are allocatable strings do not allocate a length.  A suitable length could be allocated in advance but in most cases one might as well use a string of the required fixed length.  Formatted reads using the simple A format descriptor work as you would expect: the number of characters read is the same as the current length of the variable.  If you use Aw to read exactly w characters then if the corresponding variable has a length greater than w it is padded out on the right with spaces, but if the variable length is less than w only the right-most characters of the field, as many as the variable can hold, are read in, the leading ones being lost.
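
The behaviour of the Aw descriptor in both directions can be checked with a small test like this one.  It is a sketch only: the input comes from an internal file (the variable field) so that no data file is needed, and the brackets make the spaces visible.

```fortran
program a_format_demo
    implicit none
    character(len=6) :: field = "abcdef"
    character(len=3) :: three
    character(len=8) :: eight
    ! Output: a field wider than the item is padded on the left ...
    print '(a,a6,a)', "[", "abc", "]"         ! [   abc]
    ! ... and a field narrower than the item keeps the left-most characters.
    print '(a,a2,a)', "[", "abc", "]"         ! [ab]
    ! Input: a six-character field read into variables of other lengths.
    read(field, '(a6)') three                 ! variable shorter than the field
    read(field, '(a6)') eight                 ! variable longer than the field
    print '(3a)', "[", three, "]"             ! [def]
    print '(3a)', "[", eight, "]"             ! [abcdef  ]
end program a_format_demo
```

Note the asymmetry: on output it is the right-hand end of a long item that is lost, whereas on input it is the left-hand end of a long field.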

Iomsg: Since I/O operations are particularly subject to errors, some of them beyond the control of the programmer, all Fortran I/O statements also allow an iostat specifier to return an integer error code when an exception or error is detected.  If iostat is present then an iomsg specifier can also be provided to return a short text describing the error.  At present, however, this does not allocate a suitable length if an allocatable length character item is used with iomsg: it needs to have a suitable length pre-allocated or else use a fixed-length item with the iomsg keyword.  

The inquire statement has many optional keywords which return character values associated with a file or I/O unit, for example the name of a file, access codes, etc.  At present none of these allocate a length so one must either provide a fixed-length character variable for them or an allocatable one where a suitable length has already been assigned.

Character Arrays

Arrays of character strings are needed from time to time, for example to hold an array of filenames or data objects, or just to split a line of text into words or tokens.  One of the long-standing principles of Fortran, however, is that in an array all elements have identical properties, and this includes the string length.  So if you have an array of allocatable length, as soon as you set the length of any array element, this sets the length of all of them. This means that allocatable length arrays are much less flexible than one might have hoped, and in practice rarely worth the trouble.
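
The point about a single shared length can be seen in this sketch, where the length is set once in an allocate statement (the array name words is arbitrary).

```fortran
program char_array_demo
    implicit none
    character(len=:), allocatable :: words(:)
    integer :: j
    ! The length of 6 applies to every one of the three elements.
    allocate(character(len=6) :: words(3))
    words(1) = "one"
    words(2) = "three"
    words(3) = "eleven"
    print '(a,i0,a,i0)', "elements = ", size(words), "  len = ", len(words)
    do j = 1, size(words)
        print '(3a)', "[", words(j), "]"      ! shorter values are padded to 6
    end do
    ! Assigning to one element cannot change the common length, so this is truncated.
    words(1) = "seventeen"
    print '(3a)', "[", words(1), "]"          ! [sevent]
end program char_array_demo
```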

If you need an array where the length of every element can be altered at run time then the best way is to create a data structure, that is create a derived type which has a component which is an allocatable character variable.  For example:
    type string
       character(len=:), allocatable :: s
    end type string
    ! create an array of strings where each element is separately allocatable
    type(string) :: month(4)
    integer :: j   
    month(1)%s = 'January'   
    month(2)%s = 'February'
    month(3)%s = 'March'
    month(4)%s = 'April'
    print *, (month(j)%s, ' ', j=1,size(month))
    print *, (len(month(j)%s), j=1, size(month))

You can, of course, get extra flexibility by declaring such an array to have an allocatable number of elements.
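
As a sketch of that idea, the program below grows such an array one element at a time.  The append subroutine is a hypothetical helper of my own, and copying into a temporary array followed by move_alloc is just one simple way of extending the list, not necessarily the most efficient.

```fortran
program string_list_demo
    implicit none
    type string
        character(len=:), allocatable :: s
    end type string
    type(string), allocatable :: words(:)
    integer :: j
    call append(words, "alpha")
    call append(words, "be")
    call append(words, "gamma-ray")
    do j = 1, size(words)
        print '(a,i0,2a)', "len=", len(words(j)%s), "  value=", words(j)%s
    end do
contains
    ! Hypothetical helper: add one string to the end of the list.
    subroutine append(list, text)
        type(string), allocatable, intent(inout) :: list(:)
        character(len=*), intent(in) :: text
        type(string), allocatable :: tmp(:)
        integer :: n
        n = 0
        if (allocated(list)) n = size(list)
        allocate(tmp(n+1))
        if (n > 0) tmp(1:n) = list        ! each element keeps its own length
        tmp(n+1)%s = text
        call move_alloc(tmp, list)        ! list now refers to the longer array
    end subroutine append
end program string_list_demo
```

Each element ends up with exactly the length of the text stored in it, here 5, 2 and 9 characters.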

A standard module called ISO_VARYING_STRING was defined in the Fortran95 Standard together with a possible implementation, based on the idea of using a derived type containing an allocatable array component.  This module provided overloaded definitions of many operators and intrinsic functions to allow fixed-length and varying-strings to be used interchangeably in many situations.  But it turned out that the initial implementation leaked memory in some situations.  After Fortran 2003 came out with its improved control over dynamic entities, some new implementations were produced without the memory leaks; perhaps the best-known was by Rich Townsend.  In Fortran 2018, however, the varying-string module specification has been dropped from the Standard on the grounds that the allocatable character type does the job much more satisfactorily, if only for scalars.

If you are declaring an array of constants, or even an array of initial values for a fixed-length character array, it may be useful to point out that the latest Standard provides a way of avoiding the tedious need to give all string constants the same length by including your own padding: you simply specify the length a second time within the array constructor:
    character(len=9), parameter :: dayname(7) = [character(len=9) :: &
      'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', &
      'Saturday', 'Sunday' ]

Character Sets and Kinds

As noted earlier, most current Fortran implementations are based on the US-ASCII character set which defines 33 control characters (codes 0 to 31 and 127) and 95 printing characters (codes 32 to 126).  If you need to include a control character in a string, the simplest way to do this is to use the achar function with the appropriate decimal code.  The table below gives these codes for a few of the more useful control characters:
  Character          Decimal code
  Null                0
  Bell                7
  Horizontal tab      9
  Line Feed          10
  Form Feed          12
  Carriage Return    13

Note that in Fortran the null character can be stored within a string without problems, unlike in C where nulls are used as string terminators. This example shows how control characters may be used: on many systems this statement will make the user's terminal emit a sound such as a beep.
    print '(a)', achar(7)
Using control characters in this way is somewhat system-dependent, but to produce a new-line character (which may be needed when using stream I/O) a specific intrinsic is provided:  new_line.  This should be entirely portable.

Virtually all current computers store a character in an 8-bit byte, but the ASCII codes 0 to 127 only occupy half of the number range. What the codes from 128 to 255 are used for is entirely system-dependent.  

Unicode

Fortran compilers are permitted, but not required, to support other character sets.  Some of them, including gfortran, have support for ISO 10646, which is commonly called Unicode.  Since this includes glyphs for all the languages that one can think of and a huge range of other symbols, this seems a good solution.  The program below shows the steps necessary, with numbered notes below:

    program test_unicode
    implicit none                                                          ! Notes
    integer, parameter :: ucs4  = selected_char_kind('ISO_10646')          ! (1)
    integer :: out
    character(kind=ucs4, len=:), allocatable :: demo                       ! (2)                  
    open(newunit=out, file='uni.html', status='replace', encoding='UTF-8') ! (3)
    demo = ucs4_"Demo of Unicode with Euro " // char(8364,ucs4) // &       ! (4)
       ucs4_" and Pounds " // char(163,ucs4) // &
       ucs4_" and maths symbols like cube root " // char(8731,ucs4) // &
       ucs4_' and volume integral '// char(int(z'2230'),ucs4)              ! (5)
    write(out,'(a)') demo
    close(unit=out)
    end program test_unicode
  1. Set a named constant ucs4 for the alternative (4-byte) character kind.
  2. Declare a variable of ucs4 character kind
  3. Open a file using UTF-8 encoding
  4. Create a character string which includes some Unicode symbols.  These are inserted in the string using the char function with a second argument which specifies the kind.  Note that for character constants the kind selector and underscore appear before the character constant, not after it as in constants of all other data types.
  5. The volume integral symbol shows how to make use of a hexadecimal constant - they are only allowed as arguments of int and a few other intrinsic functions.
A web browser accessing the resulting file shows the output:
Demo of Unicode with Euro € and Pounds £ and maths symbols like cube root ∛ and volume integral ∰

A good listing of Unicode symbols with their decimal and hexadecimal codes can be found at  https://www.w3schools.com/charsets/ref_html_utf8.asp

Possible Future Improvements

It is obvious that the development of variable-length strings in Fortran is a work in progress.  Current proposals for Fortran 201x include extending the automatic assignment of allocatable string length to all intrinsic procedures, and to those I/O statements that return strings.  This will make life easier for many programmers.  Another desirable feature, being able to use an allocatable string in the data-transfer list of a read statement, was regarded as very difficult to achieve while retaining compatibility with existing software, so, however desirable, this is not likely to happen.