Roc code is designed to build fast and run fast...but what does "fast" mean here? And how close is Roc's current implementation to realizing that goal?
What "fast" means in embedded systems is different from what it means in games, which in turn is different from what it means on the Web. To better understand Roc's performance capabilities, let's look at the upper bound of how fast optimized Roc programs are capable of running, and the lower bound of what types of languages Roc should generally outperform.
Limiting factors: memory management and async I/O
Roc is a memory-safe language with automatic memory management. Automatic memory management has some unavoidable runtime overhead, and memory safety based on static analysis rules out certain performance optimizations—which is why unsafe Rust can outperform safe Rust. This gives Roc a lower performance ceiling than languages which support memory unsafety and manual memory management, such as C, C++, Zig, and Rust.
Another part of Roc's design is that all I/O operations are done using a lightweight state machine so that they can be asynchronous. This has potential performance benefits compared to synchronous I/O, but it also has some unavoidable overhead.
Generally faster than dynamic or gradual languages
As a general rule, Roc programs should have almost strictly less runtime overhead than equivalent programs written in languages with dynamic types and automatic memory management. This doesn't mean all Roc programs will outperform all programs in these languages, but it does mean Roc should have a higher ceiling on what performance is achievable.
This is because dynamic typing (and gradual typing) requires tracking types at runtime, which has overhead. Roc tracks types only at compile time, and tends to have minimal (often zero) runtime overhead for language constructs compared to the top performers in industry. For example, Roc's generics, records, functions, numbers, and tag unions have no more runtime overhead than they would in their Rust or C++ equivalents.
Domain-specific memory management
Roc's "platforms and applications" design means its automatic memory management can take advantage of domain-specific properties to improve performance.
For example, if you build an application on the basic-cli platform compared to the basic-webserver platform, each of those platforms may use a different memory management strategy under the hood that's tailored to their respective use cases. Your application's performance can benefit from this, even though building on either of those platforms feels like using ordinary automatic memory management.
This is because Roc platforms get to determine how memory gets allocated and deallocated in applications built on them. (basic-cli and basic-webserver are examples of platforms, but anyone can build their own platform.) Here are some examples of how platforms can use this to improve application performance:
- A platform for noninteractive command-line scripts can skip deallocations altogether, since any allocated memory will be cheaply reclaimed by the operating system anyway once the script exits. (This strategy is domain-specific; it would not work well for a long-running, interactive program!)
- A platform for Web servers can put all allocations for each request into a particular region of memory (this is known as "arena allocation" or "bump allocation") and then deallocate the entire region in one cheap operation after the response has been sent. This would essentially drop memory reclamation times to zero. (This strategy relies on Web servers' request/response architecture, and wouldn't make sense in other use cases. nea is a platform in early development that is working towards implementing this.)
- A platform for applications that have very long-lived state could implement meshing compaction to decrease memory fragmentation. (Compaction would probably be a net negative for performance in the previous two examples.)
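The arena strategy from the Web server bullet above can be sketched in a few lines. This is a toy model in Python, not Roc or platform code: `Arena`, `alloc`, `reset`, and `handle_request` are hypothetical names, and a real platform would hand out raw memory rather than offsets into a `bytearray`.

```python
class Arena:
    """Toy bump allocator: hands out offsets into one preallocated buffer."""

    def __init__(self, capacity: int) -> None:
        self.buffer = bytearray(capacity)
        self.offset = 0  # index of the next free byte

    def alloc(self, size: int, align: int = 8) -> int:
        """Reserve `size` bytes and return the offset where they start."""
        # Round the current offset up to the requested alignment.
        start = (self.offset + align - 1) // align * align
        if start + size > len(self.buffer):
            raise MemoryError("arena exhausted")
        self.offset = start + size  # "bump" the pointer past this allocation
        return start

    def reset(self) -> None:
        """Release every allocation at once by rewinding the bump pointer."""
        self.offset = 0


def handle_request(arena: Arena, payload: bytes) -> bytes:
    """Hypothetical handler: all scratch memory for one request lives in the arena."""
    start = arena.alloc(len(payload))
    end = start + len(payload)
    arena.buffer[start:end] = payload
    response = bytes(arena.buffer[start:end]).upper()
    arena.reset()  # one cheap operation once the response has been built
    return response
```

Individual allocations are never freed one by one; the only reclamation work is the single assignment in `reset`, which is why this approach suits a request/response lifecycle and not a long-running interactive program.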
This talk has more information about platforms and applications, including demos and examples of other benefits they unlock besides performance.
Roc's "platforms and applications" design already works, including the domain-specific memory management. Most of Roc's data structures are already close to their theoretical limit in terms of performance, at least without changing their behavior or introducing memory unsafety. This talk explains how they're implemented under the hood.
That said, the current implementation of defunctionalization (based on this paper)—which unlocks stack-allocated closures, among other optimizations—has significant known gaps, and has a ways to go before it works across the board. (If you're interested in getting involved in that implementation, we'd love to hear from you!)
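To make the term concrete: defunctionalization replaces first-class functions with plain data plus a single dispatch function. The sketch below performs that rewrite by hand in Python. It only illustrates the general technique and is not how the Roc compiler implements it; the names `AddN`, `MulBy`, `apply`, and `run_pipeline` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Higher-order version: each returned closure captures its environment.
def make_adder(n: int) -> Callable[[int], int]:
    return lambda x: x + n


def make_multiplier(factor: int) -> Callable[[int], int]:
    return lambda x: x * factor


# Defunctionalized version: one data variant per closure, holding exactly
# the values that closure would have captured.
@dataclass(frozen=True)
class AddN:
    n: int


@dataclass(frozen=True)
class MulBy:
    factor: int


Closure = Union[AddN, MulBy]


def apply(closure: Closure, x: int) -> int:
    """Single dispatch point that stands in for calling a closure."""
    if isinstance(closure, AddN):
        return x + closure.n
    if isinstance(closure, MulBy):
        return x * closure.factor
    raise TypeError(f"unknown closure variant: {closure!r}")


def run_pipeline(steps: List[Closure], x: int) -> int:
    """Apply each defunctionalized step in order."""
    for step in steps:
        x = apply(step, x)
    return x
```

For example, `run_pipeline([AddN(2), MulBy(3)], 1)` computes the same value as `make_multiplier(3)(make_adder(2)(1))`. Because every variant and its captured fields are known ahead of time, the "closures" are ordinary values with a known size, which is the property that makes optimizations such as stack allocation possible in a compiled language.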
Current optimizations that are completely implemented (give or take the occasional bug) include LLVM, Morphic, Perceus, and tail recursion optimization (including modulo cons). Promising examples of potential future optimizations include closure-aware inlining, automatic deforestation, and full compile-time evaluation of top-level declarations.
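Tail recursion "modulo cons" covers functions whose recursive call is the last thing they do apart from wrapping the result in a data constructor. The Python sketch below shows the shape of the rewrite by hand: `map_recursive` is the source form, and `map_trmc` is the kind of loop an optimizer could turn it into. The `Cons` cell and all function names are hypothetical, and this illustrates the general technique, not Roc's actual compiler output.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cons:
    head: int
    tail: Optional["Cons"] = None


def from_list(items: List[int]) -> Optional[Cons]:
    """Build a linked list from a Python list (iteratively)."""
    node = None
    for item in reversed(items):
        node = Cons(item, node)
    return node


def to_list(node: Optional[Cons]) -> List[int]:
    """Flatten a linked list back into a Python list (iteratively)."""
    out = []
    while node is not None:
        out.append(node.head)
        node = node.tail
    return out


def map_recursive(f: Callable[[int], int], xs: Optional[Cons]) -> Optional[Cons]:
    # Not a tail call: the Cons cell is built after the recursive call
    # returns, so every element costs one stack frame.
    if xs is None:
        return None
    return Cons(f(xs.head), map_recursive(f, xs.tail))


def map_trmc(f: Callable[[int], int], xs: Optional[Cons]) -> Optional[Cons]:
    # Same result, but each cell is allocated first with an empty tail
    # (the "hole"), and the hole is filled in on the next iteration.
    if xs is None:
        return None
    result = Cons(f(xs.head))
    hole = result
    xs = xs.tail
    while xs is not None:
        hole.tail = Cons(f(xs.head))
        hole = hole.tail
        xs = xs.tail
    return result
```

Both functions produce the same list, but the loop version uses constant stack space, so it keeps working on inputs long enough to exhaust the call stack in the recursive version.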
We're also interested in improving the performance of the Morphic alias analysis pass itself; if contributing to that project (or any other optimization project) interests you, please let us know in the
Fast Feedback Loops
One of Roc's goals is to provide fast feedback loops by making builds normally feel "instant" except on truly enormous projects.
It's a concrete goal to have them almost always complete in under 1 second on the median computer being used to write Roc (assuming that system is not bogged down with other programs using up its resources), and ideally under the threshold at which humans typically find latency perceptible (around 100 milliseconds). In the future, hot code loading can make the feedback loop even faster, by letting you see changes without having to restart your program.
Note that although having fast "clean" builds (without the benefit of caching) is a goal, the "normally feels instant" goal refers to builds where caching was involved. After all, the main downside of build latency is that it comes up over and over in a feedback loop; a fast initial "clean" build is valuable too, but it comes up rarely by comparison.
roc check checks your code for errors (such as invalid syntax, naming errors, and type mismatches) and reports problems it finds. On typical development laptops, this usually takes well under 1 second for small projects (for very small projects, it can be around 10 milliseconds on some popular machines). To date, the largest known Roc projects have lines of code numbering in the low thousands, so there's no data yet on roc check times for larger projects.
roc build does everything roc check does, but it additionally builds a runnable binary of your program. You may notice that roc build takes much longer to complete! This is because of two projects that are underway but not completed yet:
- Development backend refers to generating machine code directly instead of asking LLVM to generate it. LLVM is great at generating optimized machine code, but it takes a long time to generate it, even if you turn off all the optimizations (and roc only has LLVM perform optimizations when the --optimize flag is set). The dev backend is currently implemented for WebAssembly, which you can see in the Web REPL, and in roc repl except on Windows. Work is underway to implement it for roc run, as well as macOS, Windows, and the ARM versions of all of these.
- Surgical linking refers to a fast way of combining the platform and application into one binary. Today, this works on x64 Linux, x64 Windows, and WebAssembly. roc build on macOS is noticeably slower because it falls back on non-surgical linking.
Here's a table summarizing the current progress:
| Target | Dev backend | Surgical linking |
|---|---|---|
| macOS ARM | repl only | |
| macOS x64 | repl only | |
| Linux ARM | repl only | |
| Linux x64 | repl only | yes |
Once we have full coverage, roc build (and roc run and roc test, which also perform builds) should take only a bit longer than
The next major performance improvement will be caching. Currently, roc always builds everything from scratch. Most of the time, it could benefit from caching some of the work it had done in a previous build, but today it doesn't do that. There's a design for the caching system, but essentially none of the implementation has started yet. Hot code loading will be the next major improvement after caching, but it requires full dev backend coverage, and does not have a concrete design yet.
In addition to being fast, Roc also aims to be a friendly programming language.
What does friendly mean here?