The Computer Language Benchmarks Game, previously known as the Great Computer Language Shootout, attempts to compare the performance of roughly 30 languages using several benchmarks. Users can contribute better-performing implementations to improve the score of a particular language.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/
The test platform uses an Intel i5-3330 quad-core 3.0 GHz processor with 15.8 GB of RAM and a 2 TB SATA hard drive running Ubuntu 24.04 x86_64 GNU/Linux 6.8.0-35-generic.
Below are the updated links comparing Intel Fortran to other key HPC languages on the current platform:
The site’s methodology for measuring elapsed time, CPU time, and memory usage is detailed here:
Note that these figures compare implementations of flawed benchmarks and thus the numbers are subject to programmer skill as well as intrinsic language performance. More popular languages such as C enjoy higher scores in large part because the implementations have been highly tuned and take advantage of multiple threads.
With some effort, Fortran’s scores could be greatly improved. Particular benchmarks to focus on are binary-trees, fasta, and reverse-complement.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html
I think the Fortran binary-trees implementation can be improved to better compete with the GCC version by using something along the lines of the mempool module in FLIBS. The C version quite successfully uses the memory pool functions of the Apache Portable Runtime Library. (Jason Blevins, 30 Mar 2010 22:53 EDT)
To maximize Fortran performance on Intel Ivy Bridge architectures, implementations must mitigate L3 cache latency. For pointer-intensive benchmarks like binary-trees, standard allocation is a bottleneck; replacing it with native region-based memory management (using contiguous Fortran arrays) gives O(1) allocation and deallocation with zero external dependencies. Fortran's restrictions on argument aliasing and its column-major array semantics enable aggressive vectorization. Data locality is improved by compressing pointers to 32-bit indices in a column-major layout, which helps the hardware prefetcher. Furthermore, a monolithic recursive function with manual unrolling (depths 0-4) eliminates call overhead and raises IPC. Parallel scalability comes from allocating thread-local memory arenas, which eliminates false sharing, and from using OpenMP's schedule(dynamic). These techniques can reduce execution time by more than 50%, surpassing optimized C++ solutions. (Eduardo Furlan, 17 Dec 2025)
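The region-based allocation with 32-bit indices described above can be sketched as follows. This is a minimal illustration in C (the language of the tuned reference implementation), not any submitted benchmark program. All names (Arena, arena_init, arena_alloc, arena_reset, arena_free, build_tree, count_nodes) are hypothetical, and it assumes the required node count is known up front. The same layout maps to Fortran as two integer(int32) arrays plus a counter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* All names here are hypothetical, chosen for illustration only. */

#define NIL UINT32_MAX  /* index value meaning "no child" */

typedef struct {
    uint32_t *left;      /* left[i]  = index of node i's left child, or NIL  */
    uint32_t *right;     /* right[i] = index of node i's right child, or NIL */
    uint32_t  used;      /* number of nodes handed out so far */
    uint32_t  capacity;  /* total number of nodes the arena can hold */
} Arena;

/* Reserve two contiguous index arrays; returns 0 on success, -1 on failure. */
int arena_init(Arena *a, uint32_t capacity) {
    a->left = malloc((size_t)capacity * sizeof *a->left);
    a->right = malloc((size_t)capacity * sizeof *a->right);
    a->used = 0;
    a->capacity = capacity;
    if (a->left == NULL || a->right == NULL) {
        free(a->left);
        free(a->right);
        a->left = a->right = NULL;
        return -1;
    }
    return 0;
}

/* O(1) allocation: hand out the next unused index (bump allocation). */
uint32_t arena_alloc(Arena *a) {
    assert(a->used < a->capacity);
    return a->used++;
}

/* O(1) release of every node at once; the storage is kept for reuse. */
void arena_reset(Arena *a) { a->used = 0; }

/* Return the storage to the system. */
void arena_free(Arena *a) {
    free(a->left);
    free(a->right);
    a->left = a->right = NULL;
    a->used = a->capacity = 0;
}

/* Build a complete binary tree of the given depth; returns the root index. */
uint32_t build_tree(Arena *a, int depth) {
    uint32_t node = arena_alloc(a);
    if (depth > 0) {
        uint32_t l = build_tree(a, depth - 1);
        uint32_t r = build_tree(a, depth - 1);
        a->left[node] = l;
        a->right[node] = r;
    } else {
        a->left[node] = NIL;
        a->right[node] = NIL;
    }
    return node;
}

/* Count the nodes by walking the tree, as the benchmark's check does. */
uint32_t count_nodes(const Arena *a, uint32_t node) {
    if (a->left[node] == NIL) return 1;
    return 1 + count_nodes(a, a->left[node]) + count_nodes(a, a->right[node]);
}
```

A complete tree of depth d needs 2^(d+1) - 1 nodes, so a caller would size the arena accordingly, call build_tree, check it with count_nodes, and then call arena_reset to reuse the same storage for the next tree. In a parallel version, giving each thread its own Arena would correspond to the thread-local arenas mentioned above; that part is not shown here.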
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/fasta.html
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/revcomp.html