
Cross-language inlining on GPU through LLVM

Allen MacFarland, Jed Brown, Jeremy Thompson

Abstract

I introduce a method of inlining functions from an external programming language by individually manipulating the steps of LLVM compilation. Despite slightly slower compilation, it yields runtime performance nearly identical to single-language alternative compilation schemes. This paper is intended for developers who wish to integrate cross-language inlining into their own systems.

1 Introduction

I have been working on a C library called libCEED, which provides an efficient framework for matrix-free discretizations on CPU and GPU. This library requires users to write a small C function to define their mathematical operators; I changed this to allow users to write this function in Rust instead. While libCEED frames why I needed to do this work, it is not the focus of this paper: the compilation scheme is explained in enough detail for anyone wishing to do the same between any other LLVM-based languages. For a reference implementation of the method described here, please see the libCEED source code.

// Two functions before inlining...
fn calculate_position(t: i32) -> i32 {
   return square(t) + t - 5;
}
fn square(x: i32) -> i32 {
   return x * x;
}

// ...turn into one function after inlining
fn calculate_position(t: i32) -> i32 {
   return (t * t) + t - 5;
}
Figure 1: A Rust-style pseudocode example of inlining.

Inlining is a compile-time optimization that replaces a call to a function with the body of that function (Figure 1). This eliminates function calls, which is especially important when compiling to GPU targets, where function calls are extremely expensive.

Typically, automatic inlining optimizations stay within language boundaries, and it was generally accepted that cross-language calls should not appear in very performance-critical sections of code. However, thanks to shared intermediate representations (IR), it is now possible to inline across the language boundaries that previously made cross-language GPU calls impractical.

IR was designed as an intermediate step between languages and targets, so that not every language needs its own compiler for every target; languages can instead compile to LLVM, the most widely used IR, and gain access to all of its targets. However, this compiler design comes with a hidden benefit: it is possible to combine languages at the LLVM IR level and optimize with knowledge of the entire codebase, spanning multiple languages.

clang and clang++ are LLVM-based C and C++ compilers whose development is closely tied to that of LLVM itself. Rust is a modern programming language with an LLVM-based compiler, rustc, and a package manager, cargo, whose packages are organized as “crates”.

2 The Pipeline

[Flowchart of the compilation pipeline from Rust and C++ source to GPU code.]
Figure 2: A diagram of the compilation scheme.

Our approach is to split compilation into each of the traditional LLVM compilation steps and add the Rust LLVM IR as though it were part of a regular single-language LTO compile; see Figure 2.

In other words, the Rust and C++ sources are first compiled individually to LLVM IR with their respective compilers; then both LLVM files are linked with llvm-link, optimized (including inlining) with opt, and finally compiled to GPU code with llc. This produces a .ptx file which can be fed directly into CUDA.
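As a concrete sketch of these steps (the file names, the nvptx64-nvidia-cuda target triple, and the sm_80 GPU architecture are illustrative assumptions; a real build must also handle Rust's standard library as described in section 3.1):

```shell
# Compile each language to LLVM bitcode with its own front end,
# both targeting the same GPU triple.
clang++ --target=nvptx64-nvidia-cuda -O2 -emit-llvm -c kernel.cpp -o kernel.bc
rustc --crate-type=lib --emit=llvm-bc --target=nvptx64-nvidia-cuda \
  -O qfunction.rs -o qfunction.bc

# Link the two modules into a single one, then optimize it;
# cross-language inlining happens in this opt step.
llvm-link kernel.bc qfunction.bc -o combined.bc
opt -O3 combined.bc -o optimized.bc

# Lower the optimized module to PTX, which CUDA can load directly.
llc -O3 -mcpu=sm_80 optimized.bc -o kernel.ptx
```

Note that all five tools here must come from the same LLVM release; see section 3.2.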

3 Potential Pitfalls and Limitations

When working on a manual LLVM compilation system like this one, there are a number of problems that special care must be taken to avoid. Three of them are described here.

3.1 Generating Valid LLVM Output

Many programming languages that can emit LLVM IR intended that output only for debugging, and never considered that it could instead be routed into a compilation pipeline. Depending on the language, this can lead to unforeseen problems.

For example, in Rust, the well-documented --emit=llvm-ir option is not capable of emitting dependencies, including core or std, at least one of which is required to compile almost anything. Instead, developers must use the linker-plugin-lto rustflag and the nightly build-std feature to generate a staticlib. This archive contains the LLVM bitcode among other members, so it must be passed to llvm-link with the ignore-non-bitcode flag.
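A minimal sketch of such a build, assuming a hypothetical crate named qfunctions (the crate name, target triple, and output paths are illustrative):

```shell
# Build the crate as a staticlib whose members include LLVM bitcode.
# Requires a nightly toolchain with the rust-src component installed.
RUSTFLAGS="-Clinker-plugin-lto" \
  cargo +nightly build --release \
  -Z build-std=core \
  --target nvptx64-nvidia-cuda

# The resulting archive also contains non-bitcode members, so llvm-link
# must be told to skip them.
llvm-link --ignore-non-bitcode \
  target/nvptx64-nvidia-cuda/release/libqfunctions.a \
  -o qfunctions.bc
```

The crate's Cargo.toml would also need crate-type = ["staticlib"] under its [lib] section for cargo to produce the .a archive.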

3.2 LLVM Version Mismatches

New versions of LLVM are released frequently, and they are not backwards-compatible. This means the entire pipeline must run the same version of LLVM, including the compilers for other languages. For example, Rust nightly frequently updates its LLVM toolchain and maintains its own slightly modified branch, so the entire libCEED pipeline depends on the Rust-provided LLVM tools.

Those who wish to implement this system with another programming language should take care to ensure that the LLVM versions of all relevant tools match. Version mismatches do not trigger clear version-mismatch errors; they may instead surface as seemingly unrelated errors. Additionally, it is possible to get “lucky” and have LLVM generated by one version work in another. Relying on this is never recommended, because small configuration changes could break code in hard-to-trace ways.
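One way to guard against this is to compare the LLVM version reported by every tool in the pipeline before building; a sketch (substitute whichever binaries your pipeline actually invokes):

```shell
# Each of these should report the same LLVM major version;
# abort the build if any of them disagree.
rustc --version --verbose | grep 'LLVM version'
clang++ --version | head -n 1
llvm-link --version | grep -i 'LLVM version'
llc --version | grep -i 'LLVM version'
```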

While it would be convenient for LLVM to be a stable platform to target for cross-compilation, this is not one of the goals of their project and is unlikely to change soon. Developers using LLVM in this way should be aware that this is not the intended use of the LLVM tools.

3.3 Distro Support for new LLVM Versions

Many distros ship only an outdated version of LLVM, which can cause frustrations for users with incompatible distros.

For example, Rust is typically installed with a standalone script that fetches the latest version regardless of distro; at the time of writing, Rust nightly uses LLVM version 20 while Ubuntu LTS ships only LLVM version 19. Because we require the nightly release channel for the build-std feature (which is never expected to land in stable), and it is not reasonable to pin an older Rust nightly, our solution is effectively limited to bleeding-edge distros.

4 Performance

In libCEED, GPU compilation is done just-in-time (JIT), so the cost of compilation is included in the cost of runtime execution. Our scheme can be compared against a reference implementation using the proprietary nvrtc compiler and against a single-language variant of the clang compile process.


[Line graph: “Performance of libCEED benchmark by language and build process.” Execution time (seconds, lower is better) versus problem size (millions of unknowns) for three build configurations: NVRTC/C++, Clang/C++, and Clang/Rust/C++. NVRTC/C++ is consistently fastest, at roughly 10–30 seconds; Clang/C++ and Clang/Rust/C++ follow similar trends but run roughly 5–10 seconds slower. Benchmarked on an AMD EPYC 7452 / NVIDIA A30 system.]

Figure 3: A performance benchmark comparing the compile-plus-execution time of the new compilation scheme against two controls: a single-language compilation scheme with the same compiler, and the proprietary nvrtc.


As shown in Figure 3, clang takes longer to compile, but this is a constant cost independent of problem size, so as the problem size increases, the relative gap between all implementations shrinks.

5 Conclusion

Combining languages with LLVM is a promising new compilation technique, especially on GPU targets, where inlining is essential.

The process described here for inlining Rust device functions into C++ kernels should be roughly applicable to inlining between any two LLVM-based languages. Further work could be done on implementing such integrations.

Further work could also be done on improving the pain points described in section 3 on the LLVM side. Improving LLVM error messages or committing to a more stable IR could significantly simplify development of many integrations.

6 Acknowledgments

This work was funded by the United States Department of Energy.

This work was completed by an undergraduate researcher funded by the SPUR program of the University of Colorado Boulder, which is in turn funded by the Engineering Excellence Fund.

References

[03] LLVM. 2003. url: https://llvm.org/.

[LA04] Chris Lattner and Vikram Adve. “LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation”. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO’04). San Jose, CA, USA, Mar. 2004, pp. 75–88.

[07] Clang. 2007. url: https://clang.llvm.org/.

[15] The Rust Programming Language. 2015. url: https://www.rust-lang.org/.

[Bro+21] Jed Brown et al. “libCEED: Fast algebra for high-order element-based discretizations”. In: Journal of Open Source Software 6.63 (2021), p. 2945. doi: 10.21105/joss.02945.

[21] libCEED development site. 2021. url: https://github.com/ceed/libceed.