Statically Sized, dynamically sized, and other.


While working on my Rust to .NET compiler, I have encountered countless bugs.

Some trivial, some a bit difficult, and some that made me want to become a wild man, living in the mountains.

Today, I would like to tell you about one such bug.

About how one simple mistake - just one wrong line of code - can be enough to cause a multi-week hunt.

// Checks if a type is dynamically sized
!pointed_type.is_sized(tyctx, ParamEnv::reveal_all())

So, sit down as I tell you about the most bizarre bug I have ever encountered.

My formatter is corrupt

The symptom of this particular miscompilation is glaringly obvious: any Rust program will crash when attempting to format a string.

Even something as simple as this:

let msg = format!("Hi {name}!");

would throw this exception:

Unreachable reached at STD_PATH/std/fmt/rt.rs.

You probably can see that there is some kind of issue there, but I bet you don't yet know what it is, exactly.

What is "Unreachable"

When compiling Rust code, the compiler will make some pretty straightforward assumptions about the values of certain types. For example, it will assume that a bool has a value of either 0 or 1, and not something crazy like 134.
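
For example, this innocent-looking transmute is undefined behavior precisely because it breaks that assumption:

// UB: a bool must be 0 or 1, so the compiler may assume this never happens.
let evil: bool = unsafe { std::mem::transmute(134_u8) };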

Similarly, for enums, it assumes their discriminant (or tag) has a value corresponding to a variant of an enum.

So this enum:

enum Pet {
    Dog(Breed, Color),
    Fish(Species),
}

Which is represented by the compiler roughly like this:

union VariantData {
    dog: (Breed, Color),
    fish: Species,
}
struct PetTagged {
    tag: usize, // has a value of 0 (Dog) or 1 (Fish)
    data: VariantData,
}

should have a tag of either 0 or 1. Any other value would be UB, so the compiler is free to assume such a value can't exist.

When you write a match statement like this:

match animal {
    Pet::Dog(b, c) => bark(),
    Pet::Fish(s) => gulp(),
}

To optimize and properly compile this piece of code, the frontend of the compiler will turn it from Rust into a simplified form called MIR.

bb0: {
    _2 = discriminant(_1);
    switchInt(move _2) -> [0: bb3, 1: bb2, otherwise: bb1];
}
bb1: {
    unreachable;
}
bb2: {
    _0 = gulp() -> [return: bb4, unwind continue];
}
bb3: {
    _0 = bark() -> [return: bb4, unwind continue];
}
bb4: {
    return;
}

As you can see, the compiler frontend will tell the backend what to do if the enum animal has tag 0 (is a Dog), tag 1 (is a Fish), or some other tag.

Normally, this "other" case may be used to match multiple variants:

_=>todo!("Unsupported animal {animal:?}"),

However, since the compiler frontend knows there is no other variant, it will tell the backend (the part turning MIR into the final executable) that it can assume the tag is either 0 or 1.

This information is encoded using the Unreachable Block Terminator. If a block ends with unreachable, the compiler may assume the block itself is unreachable, so it can safely remove this block.

My backend, rustc_codegen_clr, is still far from mature. So, instead of removing unreachable blocks, I replace them with a throw. That way, if my compiler bugs out and an unreachable is reached, the program will stop and tell me something went very wrong.
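
To illustrate the idea (this is a simplified sketch, not the actual rustc_codegen_clr code, and the CIL strings here are only illustrative):

// Two of the MIR block terminators the backend has to handle.
enum Terminator {
    Return,
    Unreachable,
}

fn lower_terminator(term: &Terminator, cil: &mut Vec<String>) {
    match term {
        Terminator::Return => cil.push("ret".into()),
        // A mature backend would simply drop the unreachable block. Emitting a
        // throw instead turns a miscompilation into a loud, easy-to-spot crash.
        Terminator::Unreachable => {
            cil.push(r#"ldstr "Unreachable reached!""#.into());
            cil.push("newobj instance void [System.Runtime]System.Exception::.ctor(string)".into());
            cil.push("throw".into());
        }
    }
}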

Reaching unreachable

OK, so we now know that this particular issue is caused by an "impossible" value. That narrows things down, but we are still far from deducing the exact cause of the problem. Knowing a bit more context would be helpful.

There is one small problem: when I tried fixing this issue, my stack traces still didn't contain source file information. This information was emitted, but it could not be used by the .NET Runtime. Why?

Cooking debug info with ILASM

There are 2 "flavors" of ILASM. The Mono one is a bit less feature-rich, but its error messages are a bit nicer.

The one bundled with CoreCLR is (or at least should be) slightly faster, and more modern. In theory, there should be no difference between the two.

Well, life is hard and theory does not equal practice. Mono ILASM supports only the standard-specified way of declaring source file info:

.line 64:5 'add.rs'

This is not shocking. It is an older tool, so I should not expect it to implement extensions to the standard. The Mono debug info format (.mdb) also only works with the Mono runtime. So, in order to support debug info in the new .NET runtime, I need to use a different version of ILASM.

The CoreCLR version of ILASM used to support the standard-specified way of providing source-file info. The key phrase here being "used to".

There exists an extension to the standard-specified .line directive. In CoreCLR, you can specify both the lines and columns as ranges. You can write something like this:

.line 64,65:5,6 'add.rs'

This extension, however, needed to keep backward compatibility with the standard. So, a directive like this:

.line 64:5 'add.rs'

was treated as

.line 64,63:5,5 'add.rs'

Seems fine, right? A nice, backward-compatible way of providing richer info. Well, there is just one small issue: this does not work with PDBs.

You see, PDBs mandate that if the start line is equal to the end line (the source-file info spans less than one full line), then the start column must be smaller than the end column.

All this techy-sounding stuff basically boils down to "source file info must cover at least one character". The problem is that, since ilasm treats .line line:column as meaning .line line,line:column,column, it creates .line directives which cover 0 characters (column_start = column_end). Such a sequence point is not valid, so ILASM will refuse to assemble the file.

This got me in a bit of a pickle: the standard-compliant way of doing things was supported by Mono, but the version of ILASM bundled with modern .NET did not support what the standard mandated.

In the end, I found a very stupid, inefficient, but working solution. Before creating a .il file, I call ilasm to get its "help" message, which contains the list of command-line options supported by that version of ilasm. If that message contains the phrase "PDB", I am dealing with the "modern" flavor of ilasm, and need to emit the extended debug info.

This solution is not pretty and has a lot of unneeded overhead (since ILASM takes a long time to even start up), but hey - it works.
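
Roughly, the detection looks like this (a simplified sketch; the exact invocation and parsing in the real code may differ):

use std::process::Command;

/// Heuristic: run `ilasm` with no input so it prints its usage/help text, then
/// check whether that text mentions PDBs. If it does, this is the CoreCLR
/// flavor, and the extended `.line` syntax needs to be emitted.
fn ilasm_wants_extended_line_directives() -> bool {
    match Command::new("ilasm").output() {
        Ok(out) => {
            let help = format!(
                "{}{}",
                String::from_utf8_lossy(&out.stdout),
                String::from_utf8_lossy(&out.stderr)
            );
            help.contains("PDB")
        }
        // If ilasm can't even be launched, fall back to the standard syntax.
        Err(_) => false,
    }
}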

None, Some and 15167372159

Alright, let's get back on track. After I got full debug info to finally work, I traced the issue to a piece of code, deep within the formatting machinery. It looked roughly like this:

match fmt.width() {
    Some(width) => do_something(width),
    None => do_something_else(),
}

The field width has the type Option<usize>. Since this match statement fails, it seems like width has an incorrect value. In order to check the value of width, I created a mock implementation of std::fmt::Debug, which cast the Option<usize> to a pair of two usizes (the tag and the value). It then printed the value of those fields.

extern "C" { fn printf(fmt: *const i8, ...) -> i32; } // libc's printf, declared by hand for this hack

let raw_width: (usize, usize) = unsafe { std::mem::transmute(fmt.width()) };
unsafe { printf("raw_width is %p, tag is %p.\n\0".as_ptr() as *const i8, raw_width.0, raw_width.1) };

The result seemed to suggest the Formatter struct got corrupted: both the value and tag fields of an enum had nonsensical values, like 15167372159.

After a bunch more tests, I confirmed a few things. First of all, the Formatter was not corrupted before it was passed to write_fmt, meaning my issue must be somewhere in the formatting machinery. This is both a good thing and a bad thing. On one hand, I was closer to solving my problem. On the other hand, I would have to delve into the arcane depths of Rust string formatting. A place full of compiler magic and bizarre micro-optimizations, the meaning of which has been lost to time.

Procrastinating, with style!

I really don’t like debugging issues within the depths of std - for a lot of reasons. The first of them is its sheer size. I still have not fully implemented dead code elimination, and my solution for initializing static data is less than optimal. Because of that, the compiled std is around 5 MB of .NET CIL. Most of this is just CIL instructions.

While this is not too crazy, it is enough to make ILSpy choke and lag a bit. To be honest, it works surprisingly well (I expected it to just crash), considering the sheer volume of methods I emit.

Still, debugging std within .NET is not pleasant. Because of that, I decided to try some lower-effort alternatives first.

Checking checked mul

Since the problematic variable had a type of Option<usize>, I thought the issue could have something to do with checked multiplication. Implementing fast, Rust-compatible overflow detection in .NET is not easy (especially when dealing with multiplication), so I could have made a stupid mistake there.
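
For reference, checked multiplication is one of the main sources of Option<usize> values - it yields None on overflow:

// `checked_mul` returns None on overflow, and Some(product) otherwise.
assert_eq!(10_usize.checked_mul(4), Some(40));
assert_eq!(usize::MAX.checked_mul(2), None);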

Sadly(?), it turned out that the issue was not there.

Automatic validation

So, the most obvious cause had been ruled out, and the problem was a bit more complicated. Now, I could try one of two things. I could attempt to separate fmt out of std, and see if I could replicate the issue within my own copy of fmt. Alternatively, I could try improving the sanity checks I emit, in order to catch the issue earlier.

I went with the second option since it seemed easier, and the work I did could be reused to catch future issues.

Inserting runtime checks, while not trivial, was far easier than I expected. I created a validate function, which takes in a CILNode (a node from my AST) and inserts code checking its value in place. Since the CILNode returned by this function evaluates to the same value as the original node, I could simply use this function when handling an assignment, and it would add the checks without the need to change any other code.

My checks were, and still are, rather primitive. For each struct, I check if its fields are valid. For references, I check that they point to valid, readable memory. For enums, I simply check if their tag is OK. Nothing fancy, just some basic sanity checks.
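
In very rough pseudo-Rust, the trick looks something like this (the real CILNode and check kinds in rustc_codegen_clr are different; this only shows the shape of the idea):

// A made-up, minimal CILNode, for illustration only.
enum CILNode {
    LdLoc(u32),
    // ...many other node kinds...
    /// Evaluates `inner`, runs `check` against the resulting value,
    /// and then yields that value unchanged.
    Validated { inner: Box<CILNode>, check: Check },
}

enum Check {
    /// The enum tag must be smaller than the number of variants.
    EnumTag { variants: u64 },
    /// The reference must point to readable memory of at least `size` bytes.
    ReadableRef { size: u64 },
}

/// Because the returned node evaluates to the same value as `node`, it can be
/// slotted into every assignment without changing any other part of the codegen.
fn validate(node: CILNode, check: Check) -> CILNode {
    CILNode::Validated { inner: Box::new(node), check }
}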

The advantage of this approach is rather obvious: it is lightweight, does not change the final program too much, and can catch many errors. It also did not require too much effort to implement.

This approach, of course, is far from perfect. For example, it can’t check if a reference points to memory it should not point to. So, things like heap corruption are still not caught.

Still, thanks to these checks, I had a bunch more info. The Formatter was in a good, non-corrupt state right before it was passed to my Debug::fmt implementation. I also noticed that other fields of the Formatter, such as the alignment, had incorrect values. Weird.

This issue has become far more puzzling now. How on earth could the formatter be OK just before a function call, and get completely borked immediately after? Was I somehow overwriting stack memory? But how? The formatter was located way above the stack frame of the problematic function - so, if a function call could corrupt it, it would have happened way earlier.

Open-heart surgery

Despite my best efforts, I was unable to avoid the daunting task of debugging a part of std. Since my debugging tools are not up to the task, I had to do it in a bit of a weird way.

When debugging a part of std, I like to separate the module in question. Modifying fmt alone is far easier than trying to change the whole std. By operating on a vertical slice of std, I can minimize the impact of changes I make, and ensure I do not introduce any unrelated issues.

In general, working on std feels kind of like open-heart surgery. One wrong step and the issue decides to disappear.

Separating fmt was not easy but, since it was my only hope, I clenched my teeth and, after a bit of work, replicated the exact issue in my own version of fmt. After a bit of probing, I discovered that the address of the Formatter in the caller and the callee did not match. The underlying data was not corrupted - I was somehow passing a wrong pointer.

At this point, I felt like I had hit a brick wall. When I got the address of the formatter in the internals of the formatting machinery, everything seemed fine - I got the exact address I expected. However, as soon as a call happened, the address I received was different. It was still a valid stack address that pointed to some piece of data, but I could not wrap my head around why.

The caller and callee were both pieces of functioning CIL. The function signatures did not match exactly, but that was expected. std sometimes does weird stuff, like transmuting a function pointer.

// SAFETY: `mem::transmute(x)` is safe because
//     1. `&'b T` keeps the lifetime it originated with `'b`
//              (so as to not have an unbounded lifetime)
//     2. `&'b T` and `&'b Opaque` have the same memory layout
//              (when `T` is `Sized`, as it is here)
// `mem::transmute(f)` is safe since `fn(&T, &mut Formatter<'_>) -> Result`
// and `fn(&Opaque, &mut Formatter<'_>) -> Result` have the same ABI
// (as long as `T` is `Sized`)
unsafe {
    Argument {
        ty: ArgumentType::Placeholder {
            formatter: mem::transmute(f),
            value: mem::transmute(x),
        },
    }
}

This lured me into a false sense of security. The arguments provided matched the specified signature, and types seemed to roughly match up. My additional, built-in type/signature checks did not report any issues, so everything seemed to be fine.

I had ruled out problem after problem and was no closer to finding a solution.

Strawberries

At this point in time, I had worked on this issue for about 2 weeks. Sure, I did work on some other stuff, like multithreading, but was mostly focused on this problem.

After a couple of hours of work, I decided I needed a change of pace. I went gardening with my sister. While picking strawberries, we talked a bit about what I was working on.

I told her that I was working on a particularly nasty issue.

I bet it is something ridiculously simple - and I just can’t see it. In a week, or two, when I finally fix this stupid bug, it will turn out to be so obvious, that I will wonder how I could have been so blind.

At this point in the story, you probably expect an epiphany. I see something, and the issue becomes immediately obvious. Nope. I went back home, maybe a little bit less stressed, but the problem remained. I poked and prodded the issue for a couple more hours, but - I still could not solve it.

PtrRepr and PtrComponents

The very next day, I noticed that a recent build of my project had failed. This is relatively common - it happens almost every time the nightly version of rustc gets updated. This change was a bit bigger than I expected. Besides the usual things, like changes to data layout, and some things getting renamed, this update removed two things - PtrComponents and PtrRepr.

Previously, I used PtrComponents to represent fat pointers. You may ask: What are fat pointers, and what is their purpose? A fat pointer is nothing more than a pointer with some metadata. Slices store their length as such metadata. dyn Trait objects store their VTable there. Overall, fat pointers are a pretty important part of Rust.
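
You can see the "fat" part directly in the sizes of references (these relationships hold on all current Rust targets):

use std::mem::size_of;

fn main() {
    // A reference to a sized type is a thin pointer: one machine word.
    assert_eq!(size_of::<&u8>(), size_of::<usize>());
    // A slice reference carries its length as metadata: two words.
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());
    // A trait object reference carries a vtable pointer: also two words.
    assert_eq!(size_of::<&dyn std::fmt::Debug>(), 2 * size_of::<usize>());
}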

My reaction to this change was a bit mixed, at first. On one hand - I never intended to use PtrComponents. The only reason I used this particular type was a technical limitation. The compiler frontend assumed that assigning from &[T] to PtrComponents<T> was possible, and did not require any conversion. .NET does not allow you to assign a value of a different type without any casts. So, I was kind of forced to use PtrComponents, even though I did not want to.

On the other hand, changing my codegen to not use PtrComponents would require some effort. It was not too hard, but still - I had to spend some precious time updating, and I was not pleased about this.

After a bit of work, I finished moving away from PtrComponents. This change allowed me to delete a bit of hacky code and made debugging easier, since I could use my own name-mangling scheme.

Now instead of a slice being encoded as something like:

core.ptr.metadata.PtrComponents.hf37c9675d456df2f

It would be represented as:

FatPtru8

This is especially important in stack traces since only the last part of a type name is displayed there.

RustModule._ZN9build_std14test_fomratter17h971767c9409dc567E(hf37c966f05c9ef6f* self, he0133fba3c66f1d1* f)

Now, slices in the backtrace should look cleaner, and I should be able to understand them at a glance.

RustModule._ZN9build_std14test_fomratter17h971767c9409dc567E(FatPtrg, hbe47ff09aa56fc6f)

Wait a second. FatPtrg? That can’t be right!

Foreign types

Well, what does FatPtrg mean? And what does such a strange type have to do with this bizarre bug?

When designing my simple name mangler, I sacrificed speed and efficiency for clarity. The “g” after “FatPtr” is simply a one-letter type designation. In this case, it refers to a… foreign type?

What is a foreign type?

Foreign types in Rust are a bit of a rarity. Come to think of it, I think they are still behind a feature gate ([extern_types](https://github.com/rust-lang/rust/issues/43467)).

They are very handy since they allow you to define an external type for FFI purposes, and ensure you don’t do stupid things with them.

So you can do something like this:

extern "C" {
    type FFIType; // requires the unstable `extern_types` feature
}

You will be able to take references to this type, but the compiler will prevent you from dereferencing it. This is nice for FFI purposes since it allows you to safely store references to types of unknown size or layout.
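
A minimal sketch of what this buys you (nightly-only, and all the names here are made up):

#![feature(extern_types)]

extern "C" {
    type FFIType; // an opaque type defined on the C side
}

// References and raw pointers to FFIType can be passed around freely...
fn keep(handle: &FFIType) -> *const FFIType {
    handle
}

// ...but anything that needs the type's size or layout is rejected:
// let copy = *handle;               // error: the size of `FFIType` is unknown
// std::mem::size_of::<FFIType>();   // error: `FFIType` is not `Sized`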

Alright, that is nice, but what does this kind of type do in the formatting machinery?

Recall this piece of code:

// SAFETY: `mem::transmute(x)` is safe because
//     1. `&'b T` keeps the lifetime it originated with `'b`
//              (so as to not have an unbounded lifetime)
//     2. `&'b T` and `&'b Opaque` have the same memory layout
//              (when `T` is `Sized`, as it is here)
// `mem::transmute(f)` is safe since `fn(&T, &mut Formatter<'_>) -> Result`
// and `fn(&Opaque, &mut Formatter<'_>) -> Result` have the same ABI
// (as long as `T` is `Sized`)
unsafe {
    Argument {
        ty: ArgumentType::Placeholder {
            formatter: mem::transmute(f),
            value: mem::transmute(x),
        },
    }
}

Can you see the Opaque type there? It is a foreign type used to represent all the formatting arguments. Since the concrete type of a formatting argument is not known at this point, the arguments are stored as references to Opaque.

A pointer to a foreign type should not be fat. It does not require any metadata. It should be just a regular pointer - and nothing more.
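
You can check this on nightly (the extern_types feature is required; Foreign here is just a stand-in for Opaque):

#![feature(extern_types)]
use std::mem::size_of;

extern "C" {
    type Foreign;
}

fn main() {
    // Unsized, yet with no metadata: a reference to an extern type stays thin.
    assert_eq!(size_of::<&Foreign>(), size_of::<usize>());
    // Contrast with a real DST, whose reference also carries a length.
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());
}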

So, why on earth did my codegen use a fat pointer there? And how did I not notice?

Explaining the bug

You may remember the line that caused this whole issue:

// Checks if a type is dynamically sized
!pointed_type.is_sized(tyctx, ParamEnv::reveal_all())

It makes a seemingly innocent, but fundamentally flawed assumption.

It assumes there are two kinds of types: those which have a statically known size, and those with dynamic sizes. So, if a type does not have a known size, it must have a dynamic size, and must therefore require a fat pointer (to store its metadata).

Foreign types are not sized - neither statically, nor dynamically. Their whole purpose relies on them not having a known size.

A foreign type is not sized, but it is not a DST either.

To fix this bug, all I needed was to make this small change:

!pointed_type.is_sized(tyctx, ParamEnv::reveal_all()) && !matches!(pointed_type.kind(), TyKind::Foreign(_))

And the whole Rust formatting machinery started to work, which opened up a whole world of opportunities. You would be shocked how much of Rust relies on string formatting.
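
Almost every diagnostic path in Rust funnels through the formatting machinery, one way or another:

println!("{}", 42);                       // fmt
let s = format!("{:?}", vec![1, 2, 3]);   // fmt
assert_eq!(2 + 2, 4);                     // the failure path formats a message
panic!("something went wrong: {s}");      // panics format their payload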

After I got string formatting to work, I could finally attempt running some more complex code.

Explaining how the bug worked

So, you now know how I found the bug, and how I fixed it, but the exact way it worked may still be a bit unclear.

Let me explain. When std transmuted values of type &T to &Opaque, my compiler emitted the following sequence of CIL ops:

ldarga.s 0 // Load the address of argument 0
ldobj valuetype FatPtrg // Load 16 bytes at the address of argument 0 (invalid)
stloc.0 // Set the local variable to the result of the transmute

So, it created an invalid fat pointer to a foreign type.

Since this was a transmute, my CIL validator assumed this operation was OK, even though the types did not match. The same thing happened when the function pointer got transmuted - my checks were fooled into ignoring the signature mismatch.

Then the function pointer got called using this instruction:

calli  valuetype 'core.result.Result.hb90864bdb6f91c56' (valuetype 'FatPtrg', valuetype 'build_std.fmt.Formatter.hbe47ff09aa56fc6f'*)

The first argument, FatPtrg, was passed where the runtime expected a pointer to the Formatter. Where the callee expected a pointer to the Formatter, the caller placed a part of FatPtrg. By coincidence, that part happened to contain a pointer to another part of the stack. So, when the callee tried to read the fields of the Formatter, it read garbage.

Project progress

I hope you enjoyed this article. In this closing section, I want to share a bit of miscellaneous progress.

I have managed to compile a simple Rust program, which used TCP to connect to example.com. It still crashes most of the time, but - it manages to grab the website 30-40% of the time. This is nothing groundbreaking, but it is the first network connection made by Rust running in .NET.
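
The program in question is nothing fancy - roughly on this level (a sketch, not the exact test program):

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Open a plain TCP connection and send a minimal HTTP/1.1 request.
    let mut stream = TcpStream::connect("example.com:80")?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")?;
    // Read the whole response and print it.
    let mut page = String::new();
    stream.read_to_string(&mut page)?;
    println!("{page}");
    Ok(())
}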

The unmodified “guessing game” from the Rust book also almost works. It successfully initializes a thread-local RNG (using the rand crate), but reading a whole line from standard input is still a bit buggy.

NOTE: The std currently only works on POSIX-compliant systems. Since a .NET-specific version of std does not exist yet, the project uses a “surrogate” version of std, which uses P/Invoke to call platform-specific APIs.

Want to contribute?

The project now has a few good-first-issues. If you are interested in contributing, please feel free to take a look. Alternatively, you may also help by providing minimized bug cases. Test a std API, find bugs, minimize and report them.

If you want to see daily updates about the progress, consider looking at its Zulip.

Small FAQ:

What is the purpose of this project?

Besides learning/being an experiment, this project also serves a real purpose. It aims to allow for easy creation of .NET wrappers around Rust crates.

Why use a Rust crate when I can use a .NET library?

Rust code uses the stack heavily, and does not use the GC. This project can help you reduce GC pauses and improve performance in memory-bound applications. It is not a silver bullet; it will not magically outperform C# in most cases. It is a tool designed for a very specific purpose.

What is the performance of the compiled code?

While reliable benchmarking is difficult, in simple tasks that are not memory-bound, the compiled code performs very close to C#, deviating by less than 10% in the worst case.

What about more detailed benchmarks?

I have performed more complex benchmarks, but the project is still far too buggy for me to be confident in my numbers. Being wrong fast is not impressive, and I don’t want to overpromise.

Why are you wasting your time working on an open-source .NET-related project, when .NET is owned by Microsoft?

.NET is open source, licensed under MIT - like Rust. It is not any more or less free than Rust is. While Microsoft is heavily involved in .NET, this does not mean that it is closed-source.