23
$\begingroup$

Using something like LLVM when designing a compiler has a lot of advantages, since tons of work can be saved by using an existing optimizer and code generation for a variety of platforms. Even large, popular languages like Rust and Julia use LLVM, and LLVM implementations of languages like Go and Kotlin exist.

However, I'd assume some downsides exist to using an existing back-end. Compile time is one I've seen mentioned in the context of Rust, and I'd assume you lose some flexibility in how you compile and optimize your code when using a pre-existing compiler infrastructure.

What are some pitfalls/downsides of using an existing IR/optimizer/compiler infrastructure for a compiled language, as opposed to writing your own (rather than transpiling or targeting bytecode)?

$\endgroup$

8 Answers

30
$\begingroup$

Very large and very complicated

The LLVM project has over 2 million lines, or 1.3 GB, of code, making it one of the largest projects on GitHub. It's so large that (at least at one point) you couldn't even push it to GitHub without workarounds.

With this, LLVM carries a lot of unnecessary complexity. There are a lot of features your compiler isn't going to use unless it's a C++ compiler, and that a hobbyist or prototype compiler is definitely not going to use.

LLVM was created and is maintained by lots of talented developers, but nonetheless, a repo that large and complex is going to have issues. The following don't just apply to LLVM, but to any large and complicated existing IR:

  • Steep learning curve: Because LLVM is so complex, it takes a lot of effort to learn how to use it. See the reference for llvm::Value: there are so many methods and subclasses that it's hard to figure out which ones you want and which ones to ignore. There are also many subtle details that can trip up beginners (though I can't think of any examples off the top of my head, I remember this being the case).

  • Lack of freedom: If you use LLVM as an IR, you're practically required to adhere to LLVM conventions: e.g. Module/Function/BB/Instruction, using builders like IRBuilder to construct values, and using LLVM Types, Values, and Metadata. If you use your own IR, you can implement techniques and optimizations which are hard or straight-up impossible in LLVM.

  • Framework problem (AKA more lack of freedom): It's not just the IR. LLVM provides many generic types in the ADT module, including its own string and bit-vector; though you don't have to, it can be hard to avoid StringRef and BitVector in your own, non-IR code. More likely, you'll unintentionally write your AST and other earlier phases to be easier to translate into LLVM, which may make your language less unique and more like C++. Like with most big frameworks, you could accidentally make your entire project revolve around LLVM, especially if it's a small project and the majority of its code ends up being IR generation and LLVM integration.

  • Large dependency: LLVM is 1.3 GB in source, but a full install build is about 12 GB. This can probably be stripped down for a language which only relies on IR and assembly generation, but even so, LLVM is going to take up a lot of space in your app's distribution.

  • Bugs: In over 2 million lines of code there are definitely a lot of bugs. The GitHub repo has over 20,000 open issues.

  • Hard to debug: Because LLVM is so large, when there's an internal bug (or even when there's not), it's hard to figure out where the bug originated and how to fix it, because you don't understand LLVM's inner workings.

As mentioned, these problems don't just apply to LLVM, but any other large general-purpose IR. If you write your own IR, you have control: you can make a small IR which only supports your language's features, and is written in your compiler's code style and design philosophy. This IR will be easier to understand, easier to integrate, easier to extend with features which would be challenging or impossible with LLVM, and easier to debug.


Of course this skips over the many benefits of LLVM and other popular IRs, and the challenges of using a custom IR, like:

  • Writing your own IR is hard

  • LLVM has a very good design philosophy and encourages good practices which may be better than the ones you would use

  • LLVM has plenty of general-purpose features and optimizations which would be hard to implement yourself

  • LLVM has a team of experts fixing bugs and constantly making improvements; if you use LLVM, you won't have to maintain the IR yourself

  • and so on

But there are good reasons why you may want to write your own IR. And even if you do end up using LLVM, these are good issues to prepare for and avoid.

$\endgroup$
9
  • 1
    $\begingroup$ OK, it's very large and very complicated: but what are the alternatives? $\endgroup$ Commented Jul 2, 2023 at 7:05
  • 3
    $\begingroup$ Speaking from real experience: develop your own IR or multiple IRs, and lower that to LLVM. This is what Rust (rustc-dev-guide.rust-lang.org/overview.html#mir-lowering), Zig (github.com/ziglang/zig/blob/master/src/Air.zig), and I believe most other languages do. $\endgroup$ Commented Jul 2, 2023 at 19:51
  • 1
    $\begingroup$ I just listed the downsides. In practice LLVM is usually much easier and better than generating your own assembly, hence why many production languages use it. You often just want to make sure not to be too dependent on LLVM, and simplify the IR generation so there's less room for bugs and it's easier to debug. $\endgroup$ Commented Jul 2, 2023 at 19:52
  • 2
    $\begingroup$ Thanks for that. I find myself wondering whether there's scope for an "LLVM-lite" which uses basically the same code generation stuff but has a simplified API with a less brutal learning curve. $\endgroup$ Commented Jul 2, 2023 at 20:05
  • 3
    $\begingroup$ @MarkMorganLloyd obvious alternatives to LLVM: libFIRM, Cranelift, asmJIT, libJIT. Also MLIR provides a nice middle-ground between building your own IR completely from scratch and using predefined one, but it's based on LLVM anyway. $\endgroup$ Commented Jul 3, 2023 at 9:54
9
$\begingroup$

Using existing compiler infrastructure comes with a certain degree of either buying into the assumptions it makes or working around them. It may not be obvious at first what those assumptions are unless you're already an expert in the relevant infrastructure, so it can be difficult to defend against them.

LLVM is maybe not too bad in that regard (and improving), but there have still been incidents, such as optimizing out some infinite loops (making certain assumptions about forward progress in general) or infinite recursions as a default behaviour, requiring workarounds if your language does not allow those transformations (this may have changed by now).

The memory model, the pointer provenance model (even the decision of whether the concept of "provenance" is used at all), and the floating-point model are also candidates for "sneaking into your language" via this backdoor rather than via a conscious decision. Into your compiler, really, but you may then feel pressured to retroactively bless such semantics to avoid having to work around them.

Transpilation is even more dangerous in this regard.

$\endgroup$
3
  • $\begingroup$ Is LLVM able to express the notion that code as written does not rely upon an infinite loop blocking for correctness, without inviting an optimizer to assume that all possible responses to inputs that would cause an infinite loop should be viewed as equally acceptable? $\endgroup$ Commented Aug 3, 2023 at 22:23
  • $\begingroup$ @supercat not that I know of, AFAIK the only options are (1) not explicitly specifying anything and getting the "anything goes" semantics or (2) putting a dummy side effect (@llvm.sideeffect) in the loop, which is a bit too strong. $\endgroup$ Commented Aug 3, 2023 at 23:00
  • $\begingroup$ I'd say that's a good argument against using LLVM, since there are many situations where it would be useful to skip loops that compute values that end up being ignored, but only if such loops can be guaranteed not to generate extraneous side effects. Otherwise, the "optimization" strikes me as silly, because all correct programs would contain dummy side effects, meaning the optimization would have no effect except in erroneous programs. Do you know of anything definitive that would say LLVM can't express such semantics? If so, I think that might be worth including in my answer. $\endgroup$ Commented Aug 4, 2023 at 16:23
8
$\begingroup$

There's not always a whole lot you can do if the backend has bugs. Consider this recent issue for Zig, which discusses using their own backend instead of LLVM. One of the advantages listed is, and I quote:

  • All our bugs are belong to us.

Further down the thread, a different user also brings up "the existence of many bugs that are LLVM's fault".

$\endgroup$
4
  • $\begingroup$ They also mention that LLVM is big, compile times are long and it doesn't support as many targets as you might hope. $\endgroup$ Commented Jul 1, 2023 at 6:55
  • $\begingroup$ Not sure it's really such an issue. LLVM is open source and got a very permissive license. If you find bugs there - just fix bugs, easy. Bring your own LLVM build with your project, it's always a bad idea to rely on some pre-packaged version from elsewhere. Getting your fixes back into upstream can be a somewhat more complicated story though, but, again, just maintaining your own fork is also not such a big deal. $\endgroup$ Commented Jul 1, 2023 at 8:26
  • 6
    $\begingroup$ @SK-logic OTOH, fixing the bugs requires you to understand how LLVM works under the hood, which is not always going to be the case (especially if you chose to use LLVM so that you didn’t need to worry about that stuff). $\endgroup$ Commented Jul 1, 2023 at 12:15
  • $\begingroup$ @AustinHemmelgarn true, but I guess writing half of LLVM on your own and knowing that code base well enough to debug still requires a bit more effort than simply learning LLVM. $\endgroup$ Commented Jul 1, 2023 at 12:44
7
$\begingroup$

Speed (and other resource usage) probably is the biggest factor in many cases. LLVM, specifically, is slow.

Unfortunately, "help speed up the slow bits" is not a realistic answer, because LLVM's whole infrastructure is slow for an extremely good reason: It was designed primarily as a research and development platform.

You see, LLVM is not a compiler backend. It is a toolkit of pieces that you can build your own compiler backend from. It is designed for flexibility and generality, so that new compilation techniques and new platforms are relatively easy to add.

This flexibility and generality has a cost, and given that compilation speed (or lack thereof) is one of the most important things to a programmer, reducing that cost can pay off in terms of user experience.

$\endgroup$
14
  • 1
    $\begingroup$ I am really baffled by all the claims of compilation speed being important. People keep saying it, but I did not see any justification or even data to back this claim up. Do you have any links? $\endgroup$ Commented Jul 1, 2023 at 6:53
  • $\begingroup$ Are you asking if people don't think LLVM is slow, or are you asking if people aren't using that as a reason to sidestep around LLVM? It's true that "slow" is relative, but it's also true that computers are several orders of magnitude faster than they were a couple of decades ago, while compilers, even in "debug/no optimisation" mode, aren't several orders of magnitude faster. $\endgroup$ Commented Jul 1, 2023 at 8:34
  • 1
    $\begingroup$ @SK-logic data point of one, but it matters so much to me that it's the second-biggest reason I don't use Rust. (First is complexity, and third is async.) I also don't think you need to have evidence on compilation times in particular; you could look up general research on feedback loop sizes, and that's clear: the smaller the feedback loop, the easier it is to make changes. Data point of one, but my quick compilation loop is one reason I refactor so much and don't have technical debt. $\endgroup$ Commented Jul 1, 2023 at 14:08
  • 1
    $\begingroup$ @GavinD.Howard I see, fair enough. Probably my approach is just way too different as I don't see much value in a very fast feedback loop - but I imagine there might be people who need it for productivity. $\endgroup$ Commented Jul 1, 2023 at 14:32
  • 1
    $\begingroup$ @SK-logic: It depends what one is doing. If one is writing code to produce graphics, sounds, or other kinds of user experience, there's often no single "right" or "wrong" way of doing things, but rather a continuum of better and worse ways whose quality can best be judged through experimentation. If one were stuck with overnight builds, then maybe one could use a pen, graph paper, and a stop-motion camera to figure out what a program's animation would look like without having to wait all night for the program to build, but having a faster compiler would seem more practical. $\endgroup$ Commented Jul 2, 2023 at 17:52
7
$\begingroup$

It's hard to speak about all the IR frameworks/backends out there at once, but pretty much all of them are inflexible in some way and thus suitable only for certain types of languages and not others.

In particular, LLVM is kinda famous for being C/C++-centric: even Rust encountered a number of bugs/shortcomings/problems/inefficiencies that weren't an issue for C/C++ and thus had been hanging there unnoticed for years, and that's despite Rust being a low-level language pretty close to C/C++. The further you go, the more trouble you encounter: it's well known that adding a GC to LLVM is still hard and cumbersome. Implementing a lazy language is even harder and less efficient: an LLVM backend for GHC Haskell does exist, but it's slower than going through GCC via C--.

From the opposite side, I've heard it's hard to add code generators for somewhat unconventional architectures, like VLIW, to LLVM. Again, I've heard people more often use libFIRM for such projects (specialized microcontrollers, DSPs, etc.).

I guess you can find this or that shortcoming in any backend/framework. On the other hand, if you're implementing a pretty conventional language, targeting common hardware, and/or not aiming for the ultimate in performance and efficiency, any reasonable framework (LLVM, Cranelift, asmJIT, etc.) might serve you just fine and save enormous effort.

$\endgroup$
4
  • $\begingroup$ A lot of the behaviors that were objectionable for Rust weren't appropriate in C compilers intended for low-level programming either, and some are fundamentally broken even in C. $\endgroup$ Commented Aug 3, 2023 at 22:24
  • $\begingroup$ @supercat that's true, but that's a bit of a different story. We're talking about bugs here, and nobody's surprised there are bugs in LLVM/Clang affecting C/C++. The fact that there are bugs that don't affect C/C++ but affect other languages is kinda more surprising. But quite expected in hindsight: you can't test LLVM for a language that doesn't yet exist. That means if you're developing a new language, you're likely to uncover new bugs like Rust did. That's the point. And the bigger the framework, the more hidden bugs it's expected to have. $\endgroup$ Commented Aug 4, 2023 at 10:10
  • $\begingroup$ The issue isn't just bugs, but deliberate assumptions that were engineered in based on either a faulty belief that they're sound under the rules of C, or a belief that any situations where the C Standard would render them unsound represent defects in the Standard, ignoring the possibility that the Standard may have been intended not to allow such assumptions in the first place. $\endgroup$ Commented Aug 4, 2023 at 16:25
  • $\begingroup$ For example, clang will assume that if two pointers P and Q compare equal, they may be used interchangeably, at the same time as it exploits assumptions that objects based on different objects won't alias, ignoring the possibility that a comparison between pointers based on different objects may report them equal. This assumption might be legitimate if clang ensured that there was always at least one padding byte between objects identified by external symbols, and specified that any other tools generating code to be linked with clang must do likewise, but clang doesn't do that. $\endgroup$ Commented Aug 4, 2023 at 16:31
5
$\begingroup$

Well, for my AEC-to-WebAssembly compiler, I decided not to use LLVM because I think WebAssembly is an even easier compilation target than LLVM is. To target LLVM, you need to understand what PHI nodes and SSA are, whereas you don't need to understand those things if you are targeting WebAssembly.
Had I targeted x86 and/or ARM, I would have used LLVM, since x86 and ARM are not designed to be easy compilation targets (they are designed to be easy to implement in hardware, without much consideration for the ease of writing a compiler targeting them), but WebAssembly is designed to be an easy compilation target.

$\endgroup$
5
$\begingroup$

Languages which use a back end like LLVM will be limited to the choice of optimization semantics it offers. Compilers based upon LLVM, for example, seem to analyze the possibility of pointer aliasing by assuming that if some pointer X can be shown to equal some pointer Y, and Y is not based upon Z, then X isn't based upon Z. Consider the following:

int x[1],y[1];
int *volatile vp;
int test1(int *p)
{
    y[0] = 1;
    if (p==x+1)
        *p = 2;
    return y[0];
}
int test2(int *p)
{
    x[0] = 1;
    if (p==y+1)
        *p = 2;
    return x[0];
}
#include <stdio.h>
int main(void)
{
    int (*volatile vtest2)(int*) = test2;
    int rx = vtest2(x);
    printf("%d/%d\n", rx, x[0]);
}

There are two possible, equally correct behaviors for this program: output 2/2 if x happens to immediately follow y, or output 1/1 if it does not. Although test1 is never called, it seems to influence the placement of x immediately before y.

Both the above code, and an equivalent program in Rust, would find that the passed pointer is equal to y+1 and write a value to it, without recognizing that the pointer could have been formed by taking the address of x. Perhaps the problem isn't with LLVM but a coincidental bug in both clang and the Rust compiler, but it seems more likely that the LLVM back-end semantics are involved.

$\endgroup$
18
  • $\begingroup$ A lot of projects use LLVM as middle IR and provide their own back-ends instead. So you're not really that limited, you can cherry-pick what you want from LLVM and do the rest on your own. $\endgroup$ Commented Jul 1, 2023 at 8:27
  • $\begingroup$ @SK-logic: I thought the whole idea of using LLVM was to avoid the need for a machine-dependent back-end. $\endgroup$ Commented Jul 2, 2023 at 15:05
  • 1
    $\begingroup$ This answer is very difficult to get anything out of when it simultaneously makes a claim about LLVM's technical capabilities, and an unrelated value judgement about how someone might choose to define their language semantics. $\endgroup$ Commented Jan 25 at 22:14
  • 1
    $\begingroup$ The mechanism is this unfixed bug. You can see on compiler explorer that LLVM's IPSCCPPass is using an incorrect result from canReplacePointersInUseIfEqual to rewrite if (p == x + 1) *p = 2; to if (p == x + 1) x[1] = 2;, which tricks the subsequent EarlyCSEPass. This is not exploiting any loophole or paying any attention to the uncalled function: it is simply a bug. The maintainers have not fixed it because it has deep implications and is not breaking too much in practice. $\endgroup$ Commented Jan 27 at 19:14
  • 1
    $\begingroup$ The statement "And it won't be fixed because ATM, you can show that propagating any equalities at all (pointer or not) is enough to break variants of this testcase" suggests that anyone designing a language to use LLVM as a back-end is either going to have to impose constraints on programmers that would legitimize LLVM's behavior, disable many kinds of optimizations wholesale, or accept that LLVM is not going to process the language reliably. $\endgroup$ Commented Jan 27 at 19:34
3
$\begingroup$

One downside of using LLVM specifically is that it was not designed as a portable backend for multiple languages. As long as you're compiling something with more-or-less C semantics, you're fine, but try Pascal, Algol, or similar languages and you'll find most of your development time goes into working around shortcomings such as the lack of nested procedure declarations. There are a few intermediate codes that target multiple languages, but they're few and far between or very old, since Pascal- and Algol-style languages have been out of favour for some years now, and a lot of people brought up on C-style languages don't even understand why you would want nested procedures or just how useful they actually are.

$\endgroup$
