PortingCToRust

So because it seemed like a good idea at the time, I decided to port the minimp3 library from C to Rust. I want a pure Rust MP3 decoder crate to exist under a permissive license, I wanted to learn a few things about the MP3 file format, and it seemed small enough to do in a single weekend. (In reality it was largely done in about a week.) I’m quite good at Rust, and I’m okay at C (but rusty; hah!), and I know nothing at all about MP3 decoding. So, it was a fun learning experience. It was very interesting seeing how C and Rust’s different feature set changed how the programs were written. minimp3 turned out to be a good choice for this, since it is standalone, pretty well-written C as far as I can tell, does nothing that needs to be unsafe, and small but not trivial. This article is an attempt to organize my thoughts, notes and observations as I went about the project, in the hopes that it will be useful or at least interesting to someone else.

I started off by translating everything by hand, but after about a day and a half of doing this I had something like 400 lines of the 2000 line library translated. It turned out to be way harder than I ever expected to translate. This was not satisfactory, and was starting to get old, so I decided to try out an automatic translator instead and just clean up the resulting code.

On C

Observations on C, jotted down as they were observed:

  • If you want to twiddle bytes in a concise and low-level way, holy shit C has got you covered. That sort of low-level memory-mongling is what C was designed for, and it is good at it.
  • This is a great way to build low-level systems, but a shitty way to engineer software as a whole. Achieving abstraction is super hard when you can just reach down into some bytes and noodle around with them instead.
  • standard min and max functions/macros don’t exist anywhere? Or are so poor that everyone seems to define their own? WTF?
  • cripes the tooling is like the heckin’ dark ages. Where’s the doc strings? What do you mean the test runner is a hand-made program? …oh, there’s no dependencies to install? Hmmm…
  • jeez it’s all the little things that bite you in the ass. The lack of real bool. signed and unsigned being easily conflated.
  • Numeric types not having sizes attached to them by default and being mostly transparently translatable even though the sizes are super important makes it really easy to pretend the size of a number doesn’t matter. No matter what you are doing you can usually just say int and it will look like it works. Until it doesn’t.
  • Bloody hell you can’t tell whether a pointer points to a single object or an array just by looking at it… I’d forgotten that was a thing. Whoever bloody thought that was a good idea in the first place? CONVENTIONALLY you pass a pointer and a size to indicate an array, but in lots of places in minimp3 the array is a fixed length that the receiver function knows about, so the size gets omitted. So then when you go back later there’s no way to figure out from the function signature whether or not the function takes a pointer to a thing or a pointer to an array of things.
  • heckin’ ternary operators, just make your if statements not suck.
  • The pre and post increment operators are just the worst damn thing in the world. They never bothered me when I was using C regularly but now I just hate them with a passion, apparently.
  • There’s like… TWO places in one function where instead of just doing some_struct foo; it does some_struct foo[1]; and then proceeds to use foo as a pointer for the whole damn function. Is… is this a thing? It seems like you could just use &foo whenever you want that behavior. I mean, that’s what it’s THERE for. Is there a reason for this?

A lot of these complaints are just, well, it’s called Progress. If we designed the C in 2018 and made it so we actually wanted to use it, then it would look very different. Probably a lot like Nim or Swift. I think that’s still a niche that is currently unfilled; a modern, powerful language that doesn’t try to provide the same guarantees as Rust and so can be much more minimalist. Minimalism can be very useful; it’s easy to write and port compilers, easy to test them, etc. (Maybe C++ was invented because there were already really good C and Pascal compilers around, and compiler writers didn’t want to go out of a job? :-P) However it may also be we’re at the point where it’s okay that this niche is unfilled and heavyweight tools like Rust are so good that the disadvantages are easy to live with. I can certainly say with some confidence there are exactly zero situations where I would rather use C than Rust, there are just some situations where something as good as Rust but smaller and simpler might be desirable.

Translating from C to Rust was way, way harder than I expected. Some observations on why this might be:

  • The idioms are different. You can totally write Rust like C, and usually port algorithms in a very straightforward way, but holy hell it feels weird. I’ve never had so many let mut’s in one function. Let alone fn foo(mut a: *mut Whatever)… I’d forgotten that was even possible in Rust.
  • Part of this is ’cause of the nature of minimp3 and how it’s written. It does things that feel wildly perilous, but gets away with them ’cause it’s a single-threaded, single-header library that is tuned both for minimalism and performance, and does everything to fixed-size static buffers without allocating memory at runtime. I’ve never seen so many pointers to static local variables returned…
  • What is it about C that makes people think L3_imdct_gr() is a perfectly good function name? It’s some sort of deep math shuffling, so if I knew the mp3 spec inside out it’d probably be obvious what it was doing, but I’m doing this to learn about mp3, soooooo…
  • When I was first learning Rust I was really annoyed that there were no transparent numerical conversions, even the ones that could never fail like i8 -> i16. These days I am extremely happy to never have to think about what numeric conversion works, what doesn’t, and what works apart from some gotchas; if I wanna convert a number, I convert a number. I think the Real Programmers of old would argue that this sort of feature is C taking the burden of doing the conversion from the programmer, so the programmer can spend time typing the more important code; I now view it the opposite way, it’s forcing the programmer to spend their brainpower remembering rules that don’t need to be there. Tedious busy-work like keeping track of what numbers are where is what the computer is for. I just want to type my code, and be able to read it later, and have the computer actually do what I tell it to instead of trying to guess “wait, this gets converted to a float here, right? …right?” It hides a startling amount of complexity.
  • Come to think of it, there’s a grand total of 13 comments in minimp3.h that aren’t part of an #ifdef annotation. I suppose the code is supposed to speak for itself.

Average rate of translation now that I’m into the meat of things: About 100 lines per 2 hours? Sheesh.

Possible bugs in minimp3 (and resolutions):

Automated tools

I knew of a few automated tools for translating C to Rust, but had never tried them out, so this seemed like a great time to start.

  • corrode: https://github.com/jameysharp/corrode – kinda unmaintained now
  • c2rust: https://github.com/immunant/c2rust – where the action appears to be

Tried corrode first. Easy to build and install. Choked on all the constant expressions in various C array declarations though, like float foo[9 * 32]; There’s a LOT of those, so gave it a pass and continued on to c2rust. c2rust, not really plug-and-play. The documentation sssumes the project you’re translating has an actual cmake or such. I don’t, and I don’t want to set up something like that just for a single-header library, so.

So back to corrode. 20 minutes of fixups gets it to translate minimp3. The resulting code has 48 unsafe’s. It made a big handful of fairly trivial syntax errors, which is a pain and a little odd but not too awful to actually fix. Some more logic errors, some due to odd usages of C (there’s a few places where minimp3 declares variables as foo[1] instead of *foo which is just weird), some due to C’s terminally lazy initialization rules.

The loop translation is weak; it should be pretty easy to translate simple C for loops into Rust ones, but no dice. Instead of a C while loop becoming a Rust while loop, it becomes loop { if condition { break; } ...} Oh well. Also for some reason it seems to clobber struct names; we get Struct4 instead of L3GrInfoT which is what I’d expect.

LOTS of complaints about possibly-uninitialized variables, since corrode can’t make idiomatic code that creates constructors or such, and so instead does the C thing of declaring an uninitalized variable and passing a pointer to it to an init function. Could easily, if unsafely, be fixed by making corrode initialize such variables with std::mem::uninitialized() or even just std::mem::zero() for most of them.

It also doesn’t like glibc’s math.h, but there’s only one function minimp3 uses from that, so we can just replace that entirely.

Okay, as an example, corrode turns this:

static int mp3d_find_frame(const uint8_t *mp3, int mp3_bytes, int *free_format_bytes, int *ptr_frame_bytes)
{
    int i, k;
    for (i = 0; i < mp3_bytes - HDR_SIZE; i++, mp3++)
    {
        if (hdr_valid(mp3))
        {
            int frame_bytes = hdr_frame_bytes(mp3, *free_format_bytes);
            int frame_and_padding = frame_bytes + hdr_padding(mp3);

            for (k = HDR_SIZE; !frame_bytes && k < MAX_FREE_FORMAT_FRAME_SIZE && i + 2*k < mp3_bytes - HDR_SIZE; k++)
            {
                if (hdr_compare(mp3, mp3 + k))
                {
                    int fb = k - hdr_padding(mp3);
                    int nextfb = fb + hdr_padding(mp3 + k);
                    if (i + k + nextfb + HDR_SIZE > mp3_bytes || !hdr_compare(mp3, mp3 + k + nextfb))
                        continue;
                    frame_and_padding = k;
                    frame_bytes = fb;
                    *free_format_bytes = fb;
                }
            }
            if ((frame_bytes && i + frame_and_padding <= mp3_bytes &&
                mp3d_match_frame(mp3, mp3_bytes - i, frame_bytes)) ||
                (!i && frame_and_padding == mp3_bytes))
            {
                *ptr_frame_bytes = frame_and_padding;
                return i;
            }
            *free_format_bytes = 0;
        }
    }
    *ptr_frame_bytes = 0;
    return i;
}

Into this:

pub unsafe extern "C" fn mp3d_find_frame(
    mut mp3: *const u8,
    mut mp3_bytes: i32,
    mut free_format_bytes: *mut i32,
    mut ptr_frame_bytes: *mut i32,
) -> i32 {
    let current_block;
    let mut i: i32;
    let mut k: i32;
    i = 0i32;
    let mut frame_bytes: i32 = hdr_frame_bytes(mp3, *free_format_bytes);
    let mut frame_and_padding: i32 = frame_bytes + hdr_padding(mp3);
    'loop1: loop {
        if !(i < mp3_bytes - 4i32) {
            current_block = 2;
            break;
        }
        if hdr_valid(mp3) != 0 {
            k = 4i32;
            'loop5: loop {
                if !(frame_bytes == 0 && (k < 2304i32) && (i + 2i32 * k < mp3_bytes - 4i32)) {
                    break;
                }
                if hdr_compare(mp3, mp3.offset(k as (isize))) != 0 {
                    let fb: i32 = k - hdr_padding(mp3);
                    let nextfb: i32 = fb + hdr_padding(mp3.offset(k as (isize)));
                    if !(i + k + nextfb + 4i32 > mp3_bytes
                        || hdr_compare(mp3, mp3.offset(k as (isize)).offset(nextfb as (isize)))
                            == 0)
                    {
                        frame_and_padding = k;
                        frame_bytes = fb;
                        *free_format_bytes = fb;
                    }
                }
                k = k + 1;
            }
            if frame_bytes != 0
                && (i + frame_and_padding <= mp3_bytes)
                && (mp3d_match_frame(mp3, mp3_bytes - i, frame_bytes) != 0)
                || i == 0 && (frame_and_padding == mp3_bytes)
            {
                current_block = 9;
                break;
            }
            *free_format_bytes = 0i32;
        }
        i = i + 1;
        mp3 = mp3.offset(1isize);
    }
    if current_block == 2 {
        *ptr_frame_bytes = 0i32;
        i
    } else {
        *ptr_frame_bytes = frame_and_padding;
        i
    }
}

Yeesh. This is indeed a quite literal translation of the C code, but there’s a number of small things that could be rather nicer than they are. Fixing it by hand won’t be TOO bad but not good either; let’s take a harder look at c2rust and see if does any better.

Turns out that CMakefile’s and all the other stuff that c2rust talks about is just a red herring, it just wants a compile_commands.json. It just also assumes you know what the heck it’s talking about. Fortunately, this nonsense is actually quite simple and is documented here: https://clang.llvm.org/docs/JSONCompilationDatabase.html. It’s a clang thing, which explains why it seems relatively popular though I’ve never heard of it. Let’s whip one up and try it out on minimp3 to see if it generates anything nicer than corrode does…

[
  { "directory": "/home/icefox/tmp/minimp3-translation/minimp3",
    "command": "/usr/bin/gcc -lm -DMINIMP3_NO_SIMD minimp3_test.c",
    "file": "minimp3_test.c" }
]

All right, with that it’s a simple matter of running c2rust/scripts/transpile.py compile_commands.json and seeing what we get. It doesn’t find rustfmt and so exits for an error for some reason, but seems to succeed apart from that. For a one-shot thing like this, having to run rustfmt by hand is no problem. And for our example function we get…

unsafe extern "C" fn mp3d_find_frame(
    mut mp3: *const uint8_t,
    mut mp3_bytes: libc::c_int,
    mut free_format_bytes: *mut libc::c_int,
    mut ptr_frame_bytes: *mut libc::c_int,
) -> libc::c_int {
    let mut i: libc::c_int = 0;
    let mut k: libc::c_int = 0;
    i = 0i32;
    while i < mp3_bytes - 4i32 {
        if 0 != hdr_valid(mp3) {
            let mut frame_bytes: libc::c_int = hdr_frame_bytes(mp3, *free_format_bytes);
            let mut frame_and_padding: libc::c_int = frame_bytes + hdr_padding(mp3);
            k = 4i32;
            while 0 == frame_bytes && k < 2304i32 && i + 2i32 * k < mp3_bytes - 4i32 {
                if 0 != hdr_compare(mp3, mp3.offset(k as isize)) {
                    let mut fb: libc::c_int = k - hdr_padding(mp3);
                    let mut nextfb: libc::c_int = fb + hdr_padding(mp3.offset(k as isize));
                    if !(i + k + nextfb + 4i32 > mp3_bytes
                        || 0 == hdr_compare(mp3, mp3.offset(k as isize).offset(nextfb as isize)))
                    {
                        frame_and_padding = k;
                        frame_bytes = fb;
                        *free_format_bytes = fb
                    }
                }
                k += 1
            }
            if 0 != frame_bytes
                && i + frame_and_padding <= mp3_bytes
                && 0 != mp3d_match_frame(mp3, mp3_bytes - i, frame_bytes)
                || 0 == i && frame_and_padding == mp3_bytes
            {
                *ptr_frame_bytes = frame_and_padding;
                return i;
            } else {
                *free_format_bytes = 0i32
            }
        }
        i += 1;
        mp3 = mp3.offset(1isize)
    }
    *ptr_frame_bytes = 0i32;
    return i;
}

So… some things are better, like using while loops instead of just loop, and some things are technically better but actually more annoying, like using c_int instead of i32. It also generates nightly-only code and appears to import and translate giant chunks of libc header files whether it needs to or not, which… well, makes rustc say things like this when trying to compile the result:

error: aborting due to 1142 previous errors

I’d already made the corrode-based code compile cleanly, so alas I decided I was just going to go with that. Even if after compilation it instantly segfaults when trying to actually run.

Actually trying to make it work

So after fixing like ten million warnings I came up with a strategy. I went through one function at a time, starting with leaf functions and functions that look easiest, and replaced all pointers in the function signatures with safe Rust types; references or slices or vec’s as appropriate. I did this while changing as little else as possible, usually resulting in calling slice::as_ptr() on several slices right at the beginning of the function, and slice::from_raw_parts() in the caller to create a pointer into a slice to pass into it. Once a function compiled with that setup, I went through it and actually replaced all the unsafe code with the safe equivalents. At the end of it, I had a safe function, which did exactly what the C code appeared to do. Doing this from the leaf functions and moving my way up the call tree worked pretty well. Most of the segfaults became panics, and it was annoying and time consuming but ultimately entirely possible to hunt down and solve them piecemeal.

I really wished I had a way to write unit tests for each of the functions, so I could have something better than “does this panic upon trying to decode one of the test vectors”. It would be even better if I could compare against the corresponding C functions somehow. That didn’t seem hopeful though, ’cause the C functions are almost all static and thus their symbols don’t get exported, so writing interface code against them won’t work. Maybe test against the translated Rust versions that have the warnings fixed but nothing else changed? (That wouldn’t work either, turns out corrode introduced bugs.) It’s tricky to compare apples to apples because either way the data model is going to change QUITE A LOT; for instance, there’s one structure that’s some kind of header that’s just a [u8;4] and that got turned into a proper struct with methods and such. Making appropriate tests for those is going to be tricky. (Or so I thought; turns out that just keeping it a [u8;4] was way simpler.) And in fact it looks like the C code’s unit tests just try to decode the test vectors and make sure the right results come out; there’s no testing of individual components of the program. How convenient. Well, I guess that’s the best I can get then.

Translating the actual data structures was pretty straightforward, a double-borrow or two aside. A lot more of the pointers are to mysterious buffers of bytes or floats, which is a bit harder, ’cause you usually don’t know what they actually are or what the ownership semantics of them should be. However, corrode does a good job of labeling them as mut or const as appropriate, and minimp3’s coding style saves us some heartache ’cause they’re almost always pointing to fixed-sized buffers in some struct or another, and it’s honestly not a terribly large program and doesn’t have a lot of complexity in the data structure. When most functions are called in only one or two places, figuring out which buffers it’s being passed is usually pretty simple. All the naming is, as far as I can tell, 100% consistent, which also makes life a lot easier.

Again I am reminded that really, even working in unsafe Rust is pretty nice. Some of the pointer conversions get pretty verbose, but you often don’t need too many of them. Turning a pointer into a reference is as simple as &*foo, which feels appropriately Rustic. And it really turns out that in the process of converting unsafe method signatures into safe ones, they get simpler and easier to understand. It’s quite satisfying. There actually are NOT very many places where the C code’s behavior conflicts with Rust’s borrowing rules. This is both somewhat surprising, because there’s no way this code was written with Rust’s borrowing semantics in mind, and also entirely sensible, since Rust’s borrowing semantics are often quite close to how you actually want your code to behave anyway.

The hard part is dealing with raw data buffers, but once you turn a *const i16 into &[i16] in a method signature, then everything starts unraveling and you just follow compiler errors to convert the whole chain of dependencies to slices. Slices are honestly brilliant and there’s not really anything that good like them in any other language I know of. C does lots of annoying things with pointers like use them as offsets into arrays and then pass those offsets to functions, and no other language I’m aware of has a really good way of representing that sort of thing, but with slices it just works. ptr++ just becomes s = s[1..]. I did NOT expect things to be that simple.

…oh, it becomes less simple than that when the slice is mutable. Darn it. Well, for now we can just hack around it a little with an unsafe code, writing a function to defeat the borrow checker and let us assign a slice to a sub-set of itself, and re-visit the issue later. The lifetimes are all actually very simple in this program; it looks complicated but really just steps through one buffer in chunks and writes to another buffer in chunks, and doesn’t do any buffer management itself outside of some fixed-sized temporary structures and read-only static tables, so I’m not too worried about it.

It occurs to me that one of the very deep design principles of C is that it assumes that modifying a pointer directly, *x = foo;, is cheaper than doing it via index, x[n] = foo;. This assumption gets very deep in the coding styles people use. Does this actually matter anymore? Are compilers smart enough to turn indexed loads and stores into direct ones, or CPU’s smart enough to make indexed loads and stores no more expensive than direct ones? I really have no idea. Aren’t there CPU’s out there where through-an-indexed-pointer is the only addressing mode? (Only one I can find right now is MIPS 1.) Hmmmm.

Fixing bugs

This is just a semi-random collection of notes I made on fixing specific things, in the hopes that it demonstrates some or the sorts of things that happen when C and Rust don’t quite match.

mp3dec_decode_frame() calls mp3d_find_frame(). This calls hdr_frame_bytes(). This calls hdr_bitrate_kbps(). If this is called with mp3[1] being 0, this results in an array underflow in HDR_GET_LAYER(h) - 1. This makes hdr_bitrate_kbps() return garbage outside the realm of its array, but since that array is static that is probably always valid memory and you don’t get a segfault. Upon inspection the C version doesn’t call hdr_bitrate_kbps() unless hdr_valid() is true but the Rust version reordered things a little; not sure if that was my goof or corrode’s. I think maybe both; corrode put a variable in a scope that was too small and I lifted it out. Easily fixed by fixing the ordering.

There was also another bug in hdr_bitrate_kbps() where !!HDR_TEST_MPEG1(h) turns a “0 is false, anything else is true” boolean into a 0-or-1 array index by negating twice. corrode appears to have only put one ! in the resulting code, and ! on an integer in Rust is bitwise complement, not binary not, so it was producing horrible output.

With those bugs fixed, a good half of the test vectors… well, don’t panic and produce results that aren’t obvious nonsense, at least! Huzzah!

There’s a number of panics from corrode not translating << and >> into wrapping_shl() and wrapping_shr(); rust << and >> panic if any bits would be lost, I believe. Well, << sure does; I’ll probably find out for >>. That’s corrode being a little derpy again.

Array index out of bounds panics. …there appear to be a LOT of them associated with the usage of the float grbuf[2][576]; (grain buffer?) struct member, which is sometimes treated as two arrays of 576 floats, but quite also often treated as a single array of 1152 floats that happens to be divided into two halves, most egregiously in l3_midside_stereo(). Siiiiiigh. Well, for now we can just make two different versions of the functions for the two cases, and maybe manage to get rid of one later.

Conclusion

In the end, the results of my work are in the rinimp3 crate, because I suck at naming things. There’s STILL some panics to hunt down, one (large, scary-looking) function remains unsafe and effectively untranslated, and there’s a lot of helper-code that could be written to make the API nicer to use. Still, it seems to produce semi-sane test output on at least SOME of the test vectors, which is frankly a win at this point. Maybe sometime soon I’ll get another week or so of free time and be able to finish it off.