PortingCToRust
So because it seemed like a good idea at the time, I decided to port
the minimp3
library from C to Rust. I want a pure Rust MP3 decoder crate to exist
under a permissive license, I wanted to learn a few things about the MP3
file format, and it seemed small enough to do in a single weekend. (In
reality it was largely done in about a week.) I’m quite good at
Rust, and I’m okay at C (but rusty; hah!), and I know nothing
at all about MP3 decoding. So, it was a fun learning experience. It was
very interesting seeing how C and Rust’s different feature set changed
how the programs were written. minimp3
turned out to be a
good choice for this, since it is standalone, pretty well-written C as
far as I can tell, does nothing that needs to be unsafe, and
small but not trivial. This article is an attempt to organize my
thoughts, notes and observations as I went about the project, in the
hopes that it will be useful or at least interesting to someone
else.
I started off by translating everything by hand, but after about a day and a half of doing this I had something like 400 lines of the 2000 line library translated. It turned out to be way harder than I ever expected to translate. This was not satisfactory, and was starting to get old, so I decided to try out an automatic translator instead and just clean up the resulting code.
On C
Observations on C, jotted down as they were observed:
- If you want to twiddle bytes in a concise and low-level way, holy shit C has got you covered. That sort of low-level memory-mongling is what C was designed for, and it is good at it.
- This is a great way to build low-level systems, but a shitty way to engineer software as a whole. Achieving abstraction is super hard when you can just reach down into some bytes and noodle around with them instead.
- standard
min
andmax
functions/macros don’t exist anywhere? Or are so poor that everyone seems to define their own? WTF? - cripes the tooling is like the heckin’ dark ages. Where’s the doc strings? What do you mean the test runner is a hand-made program? …oh, there’s no dependencies to install? Hmmm…
- jeez it’s all the little things that bite you in the ass.
The lack of real
bool
. signed and unsigned being easily conflated. - Numeric types not having sizes attached to them by default and being
mostly transparently translatable even though the sizes are super
important makes it really easy to pretend the size of a number
doesn’t matter. No matter what you are doing you can usually just say
int
and it will look like it works. Until it doesn’t. - Bloody hell you can’t tell whether a pointer points to a single
object or an array just by looking at it… I’d forgotten that was a
thing. Whoever bloody thought that was a good idea in the first place?
CONVENTIONALLY you pass a pointer and a size to indicate an array, but
in lots of places in
minimp3
the array is a fixed length that the receiver function knows about, so the size gets omitted. So then when you go back later there’s no way to figure out from the function signature whether or not the function takes a pointer to a thing or a pointer to an array of things. - heckin’ ternary operators, just make your if statements not suck.
- The pre and post increment operators are just the worst damn thing in the world. They never bothered me when I was using C regularly but now I just hate them with a passion, apparently.
- There’s like… TWO places in one function where instead of just doing
some_struct foo;
it doessome_struct foo[1];
and then proceeds to usefoo
as a pointer for the whole damn function. Is… is this a thing? It seems like you could just use&foo
whenever you want that behavior. I mean, that’s what it’s THERE for. Is there a reason for this?
A lot of these complaints are just, well, it’s called Progress. If we designed the C in 2018 and made it so we actually wanted to use it, then it would look very different. Probably a lot like Nim or Swift. I think that’s still a niche that is currently unfilled; a modern, powerful language that doesn’t try to provide the same guarantees as Rust and so can be much more minimalist. Minimalism can be very useful; it’s easy to write and port compilers, easy to test them, etc. (Maybe C++ was invented because there were already really good C and Pascal compilers around, and compiler writers didn’t want to go out of a job? :-P) However it may also be we’re at the point where it’s okay that this niche is unfilled and heavyweight tools like Rust are so good that the disadvantages are easy to live with. I can certainly say with some confidence there are exactly zero situations where I would rather use C than Rust, there are just some situations where something as good as Rust but smaller and simpler might be desirable.
Translating from C to Rust was way, way harder than I expected. Some observations on why this might be:
- The idioms are different. You can totally write Rust like
C, and usually port algorithms in a very straightforward way, but holy
hell it feels weird. I’ve never had so many
let mut
’s in one function. Let alonefn foo(mut a: *mut Whatever)
… I’d forgotten that was even possible in Rust. - Part of this is ’cause of the nature of
minimp3
and how it’s written. It does things that feel wildly perilous, but gets away with them ’cause it’s a single-threaded, single-header library that is tuned both for minimalism and performance, and does everything to fixed-size static buffers without allocating memory at runtime. I’ve never seen so many pointers to static local variables returned… - What is it about C that makes people think
L3_imdct_gr()
is a perfectly good function name? It’s some sort of deep math shuffling, so if I knew the mp3 spec inside out it’d probably be obvious what it was doing, but I’m doing this to learn about mp3, soooooo… - When I was first learning Rust I was really annoyed that there were
no transparent numerical conversions, even the ones that could
never fail like
i8
->i16
. These days I am extremely happy to never have to think about what numeric conversion works, what doesn’t, and what works apart from some gotchas; if I wanna convert a number, I convert a number. I think the Real Programmers of old would argue that this sort of feature is C taking the burden of doing the conversion from the programmer, so the programmer can spend time typing the more important code; I now view it the opposite way, it’s forcing the programmer to spend their brainpower remembering rules that don’t need to be there. Tedious busy-work like keeping track of what numbers are where is what the computer is for. I just want to type my code, and be able to read it later, and have the computer actually do what I tell it to instead of trying to guess “wait, this gets converted to a float here, right? …right?” It hides a startling amount of complexity. - Come to think of it, there’s a grand total of 13 comments in
minimp3.h
that aren’t part of an#ifdef
annotation. I suppose the code is supposed to speak for itself.
Average rate of translation now that I’m into the meat of things: About 100 lines per 2 hours? Sheesh.
Possible bugs in minimp3 (and resolutions):
- https://github.com/lieff/minimp3/blob/master/minimp3.h#L266
Is there a reason this returns
unsigned
instead ofuint8_t
? (bitrates are always handled asunsigned
, but the array of bitrates is auint8_t
anyway, presumably to save 270 bytes in the program’s data segment.) - https://github.com/lieff/minimp3/blob/master/minimp3.h#L933
This stores an
int
into an array ofuint8_t
; does it clobber the following four bytes or not? (Proooooobably not, but…) - https://github.com/lieff/minimp3/blob/master/minimp3.h#L232 appears to end by doing a bit shift with a number that’s almost certainly negative, which is undefined behavior. (Nope, because by the time that bit-shift happens the number being shifted by is -1! Gotcha!)
Automated tools
I knew of a few automated tools for translating C to Rust, but had never tried them out, so this seemed like a great time to start.
- corrode: https://github.com/jameysharp/corrode – kinda unmaintained now
- c2rust: https://github.com/immunant/c2rust – where the action appears to be
Tried corrode first. Easy to build and install. Choked on all the
constant expressions in various C array declarations though, like
float foo[9 * 32];
There’s a LOT of those, so gave it a
pass and continued on to c2rust. c2rust, not really plug-and-play. The
documentation sssumes the project you’re translating has an actual cmake
or such. I don’t, and I don’t want to set up something like that just
for a single-header library, so.
So back to corrode. 20 minutes of fixups gets it to translate
minimp3
. The resulting code has 48 unsafe
’s.
It made a big handful of fairly trivial syntax errors, which is a pain
and a little odd but not too awful to actually fix. Some more logic
errors, some due to odd usages of C (there’s a few places where minimp3
declares variables as foo[1]
instead of *foo
which is just weird), some due to C’s terminally lazy initialization
rules.
The loop translation is weak; it should be pretty easy to translate
simple C for loops into Rust ones, but no dice. Instead of a C
while
loop becoming a Rust while
loop, it
becomes loop { if condition { break; } ...}
Oh well. Also
for some reason it seems to clobber struct names; we get
Struct4
instead of L3GrInfoT
which is what I’d
expect.
LOTS of complaints about possibly-uninitialized variables, since
corrode can’t make idiomatic code that creates constructors or such, and
so instead does the C thing of declaring an uninitalized variable and
passing a pointer to it to an init function. Could easily, if unsafely,
be fixed by making corrode initialize such variables with
std::mem::uninitialized()
or even just
std::mem::zero()
for most of them.
It also doesn’t like glibc’s math.h
, but there’s only
one function minimp3 uses from that, so we can just replace that
entirely.
Okay, as an example, corrode turns this:
static int mp3d_find_frame(const uint8_t *mp3, int mp3_bytes, int *free_format_bytes, int *ptr_frame_bytes)
{
int i, k;
for (i = 0; i < mp3_bytes - HDR_SIZE; i++, mp3++)
{
if (hdr_valid(mp3))
{
int frame_bytes = hdr_frame_bytes(mp3, *free_format_bytes);
int frame_and_padding = frame_bytes + hdr_padding(mp3);
for (k = HDR_SIZE; !frame_bytes && k < MAX_FREE_FORMAT_FRAME_SIZE && i + 2*k < mp3_bytes - HDR_SIZE; k++)
{
if (hdr_compare(mp3, mp3 + k))
{
int fb = k - hdr_padding(mp3);
int nextfb = fb + hdr_padding(mp3 + k);
if (i + k + nextfb + HDR_SIZE > mp3_bytes || !hdr_compare(mp3, mp3 + k + nextfb))
continue;
frame_and_padding = k;
frame_bytes = fb;
*free_format_bytes = fb;
}
}
if ((frame_bytes && i + frame_and_padding <= mp3_bytes &&
mp3d_match_frame(mp3, mp3_bytes - i, frame_bytes)) ||
(!i && frame_and_padding == mp3_bytes))
{
*ptr_frame_bytes = frame_and_padding;
return i;
}
*free_format_bytes = 0;
}
}
*ptr_frame_bytes = 0;
return i;
}
Into this:
pub unsafe extern "C" fn mp3d_find_frame(
mut mp3: *const u8,
mut mp3_bytes: i32,
mut free_format_bytes: *mut i32,
mut ptr_frame_bytes: *mut i32,
) -> i32 {
let current_block;
let mut i: i32;
let mut k: i32;
i = 0i32;
let mut frame_bytes: i32 = hdr_frame_bytes(mp3, *free_format_bytes);
let mut frame_and_padding: i32 = frame_bytes + hdr_padding(mp3);
'loop1: loop {
if !(i < mp3_bytes - 4i32) {
current_block = 2;
break;
}
if hdr_valid(mp3) != 0 {
k = 4i32;
'loop5: loop {
if !(frame_bytes == 0 && (k < 2304i32) && (i + 2i32 * k < mp3_bytes - 4i32)) {
break;
}
if hdr_compare(mp3, mp3.offset(k as (isize))) != 0 {
let fb: i32 = k - hdr_padding(mp3);
let nextfb: i32 = fb + hdr_padding(mp3.offset(k as (isize)));
if !(i + k + nextfb + 4i32 > mp3_bytes
|| hdr_compare(mp3, mp3.offset(k as (isize)).offset(nextfb as (isize)))
== 0)
{
frame_and_padding = k;
frame_bytes = fb;
*free_format_bytes = fb;
}
}
k = k + 1;
}
if frame_bytes != 0
&& (i + frame_and_padding <= mp3_bytes)
&& (mp3d_match_frame(mp3, mp3_bytes - i, frame_bytes) != 0)
|| i == 0 && (frame_and_padding == mp3_bytes)
{
current_block = 9;
break;
}
*free_format_bytes = 0i32;
}
i = i + 1;
mp3 = mp3.offset(1isize);
}
if current_block == 2 {
*ptr_frame_bytes = 0i32;
i
} else {
*ptr_frame_bytes = frame_and_padding;
i
}
}
Yeesh. This is indeed a quite literal translation of the C code, but there’s a number of small things that could be rather nicer than they are. Fixing it by hand won’t be TOO bad but not good either; let’s take a harder look at c2rust and see if does any better.
Turns out that CMakefile
’s and all the other stuff that
c2rust talks about is just a red herring, it just wants a
compile_commands.json
. It just also assumes you know what
the heck it’s talking about. Fortunately, this nonsense is actually
quite simple and is documented here: https://clang.llvm.org/docs/JSONCompilationDatabase.html.
It’s a clang thing, which explains why it seems relatively popular
though I’ve never heard of it. Let’s whip one up and try it out on
minimp3
to see if it generates anything nicer than corrode
does…
[
{ "directory": "/home/icefox/tmp/minimp3-translation/minimp3",
"command": "/usr/bin/gcc -lm -DMINIMP3_NO_SIMD minimp3_test.c",
"file": "minimp3_test.c" }
]
All right, with that it’s a simple matter of running
c2rust/scripts/transpile.py compile_commands.json
and
seeing what we get. It doesn’t find rustfmt
and so exits
for an error for some reason, but seems to succeed apart from that. For
a one-shot thing like this, having to run rustfmt
by hand
is no problem. And for our example function we get…
unsafe extern "C" fn mp3d_find_frame(
mut mp3: *const uint8_t,
mut mp3_bytes: libc::c_int,
mut free_format_bytes: *mut libc::c_int,
mut ptr_frame_bytes: *mut libc::c_int,
) -> libc::c_int {
let mut i: libc::c_int = 0;
let mut k: libc::c_int = 0;
i = 0i32;
while i < mp3_bytes - 4i32 {
if 0 != hdr_valid(mp3) {
let mut frame_bytes: libc::c_int = hdr_frame_bytes(mp3, *free_format_bytes);
let mut frame_and_padding: libc::c_int = frame_bytes + hdr_padding(mp3);
k = 4i32;
while 0 == frame_bytes && k < 2304i32 && i + 2i32 * k < mp3_bytes - 4i32 {
if 0 != hdr_compare(mp3, mp3.offset(k as isize)) {
let mut fb: libc::c_int = k - hdr_padding(mp3);
let mut nextfb: libc::c_int = fb + hdr_padding(mp3.offset(k as isize));
if !(i + k + nextfb + 4i32 > mp3_bytes
|| 0 == hdr_compare(mp3, mp3.offset(k as isize).offset(nextfb as isize)))
{
frame_and_padding = k;
frame_bytes = fb;
*free_format_bytes = fb
}
}
k += 1
}
if 0 != frame_bytes
&& i + frame_and_padding <= mp3_bytes
&& 0 != mp3d_match_frame(mp3, mp3_bytes - i, frame_bytes)
|| 0 == i && frame_and_padding == mp3_bytes
{
*ptr_frame_bytes = frame_and_padding;
return i;
} else {
*free_format_bytes = 0i32
}
}
i += 1;
mp3 = mp3.offset(1isize)
}
*ptr_frame_bytes = 0i32;
return i;
}
So… some things are better, like using while
loops
instead of just loop
, and some things are technically
better but actually more annoying, like using c_int
instead
of i32
. It also generates nightly-only code and appears to
import and translate giant chunks of libc header files whether it needs
to or not, which… well, makes rustc
say things like this
when trying to compile the result:
error: aborting due to 1142 previous errors
I’d already made the corrode-based code compile cleanly, so alas I decided I was just going to go with that. Even if after compilation it instantly segfaults when trying to actually run.
Actually trying to make it work
So after fixing like ten million warnings I came up with a
strategy. I went through one function at a time, starting with
leaf functions and functions that look easiest, and replaced all
pointers in the function signatures with safe Rust types; references or
slices or vec’s as appropriate. I did this while changing as little else
as possible, usually resulting in calling slice::as_ptr()
on several slices right at the beginning of the function, and
slice::from_raw_parts()
in the caller to create a pointer
into a slice to pass into it. Once a function compiled with that setup,
I went through it and actually replaced all the unsafe code with the
safe equivalents. At the end of it, I had a safe function, which did
exactly what the C code appeared to do. Doing this from the
leaf functions and moving my way up the call tree worked pretty well.
Most of the segfaults became panics, and it was annoying and time
consuming but ultimately entirely possible to hunt down and solve them
piecemeal.
I really wished I had a way to write unit tests for each of the
functions, so I could have something better than “does this panic upon
trying to decode one of the test vectors”. It would be even better if I
could compare against the corresponding C functions somehow. That didn’t
seem hopeful though, ’cause the C functions are almost all
static
and thus their symbols don’t get exported, so
writing interface code against them won’t work. Maybe test against the
translated Rust versions that have the warnings fixed but nothing else
changed? (That wouldn’t work either, turns out corrode introduced bugs.)
It’s tricky to compare apples to apples because either way the data
model is going to change QUITE A LOT; for instance, there’s one
structure that’s some kind of header that’s just a [u8;4]
and that got turned into a proper struct with methods and such. Making
appropriate tests for those is going to be tricky. (Or so I thought;
turns out that just keeping it a [u8;4]
was way simpler.)
And in fact it looks like the C code’s unit tests just try to decode the
test vectors and make sure the right results come out; there’s no
testing of individual components of the program. How convenient. Well, I
guess that’s the best I can get then.
Translating the actual data structures was pretty straightforward, a
double-borrow or two aside. A lot more of the pointers are to mysterious
buffers of bytes or floats, which is a bit harder, ’cause you usually
don’t know what they actually are or what the ownership semantics of
them should be. However, corrode
does a good job of
labeling them as mut
or const
as appropriate,
and minimp3
’s coding style saves us some heartache ’cause
they’re almost always pointing to fixed-sized buffers in some struct or
another, and it’s honestly not a terribly large program and doesn’t have
a lot of complexity in the data structure. When most functions are
called in only one or two places, figuring out which buffers it’s being
passed is usually pretty simple. All the naming is, as far as I
can tell, 100% consistent, which also makes life a lot easier.
Again I am reminded that really, even working in unsafe Rust is
pretty nice. Some of the pointer conversions get pretty verbose, but you
often don’t need too many of them. Turning a pointer into a reference is
as simple as &*foo
, which feels appropriately Rustic.
And it really turns out that in the process of converting unsafe method
signatures into safe ones, they get simpler and easier to understand.
It’s quite satisfying. There actually are NOT very many places where the
C code’s behavior conflicts with Rust’s borrowing rules. This is both
somewhat surprising, because there’s no way this code was written with
Rust’s borrowing semantics in mind, and also entirely sensible, since
Rust’s borrowing semantics are often quite close to how you actually
want your code to behave anyway.
The hard part is dealing with raw data buffers, but once you turn a
*const i16
into &[i16]
in a method
signature, then everything starts unraveling and you just follow
compiler errors to convert the whole chain of dependencies to slices.
Slices are honestly brilliant and there’s not really anything
that good like them in any other language I know of. C does
lots of annoying things with pointers like use them as offsets into
arrays and then pass those offsets to functions, and no other language
I’m aware of has a really good way of representing that sort of
thing, but with slices it just works. ptr++
just becomes
s = s[1..]
. I did NOT expect things to be that simple.
…oh, it becomes less simple than that when the slice is mutable. Darn it. Well, for now we can just hack around it a little with an unsafe code, writing a function to defeat the borrow checker and let us assign a slice to a sub-set of itself, and re-visit the issue later. The lifetimes are all actually very simple in this program; it looks complicated but really just steps through one buffer in chunks and writes to another buffer in chunks, and doesn’t do any buffer management itself outside of some fixed-sized temporary structures and read-only static tables, so I’m not too worried about it.
It occurs to me that one of the very deep design principles of C is
that it assumes that modifying a pointer directly,
*x = foo;
, is cheaper than doing it via index,
x[n] = foo;
. This assumption gets very deep in the coding
styles people use. Does this actually matter anymore? Are compilers
smart enough to turn indexed loads and stores into direct ones, or CPU’s
smart enough to make indexed loads and stores no more expensive than
direct ones? I really have no idea. Aren’t there CPU’s out there where
through-an-indexed-pointer is the only addressing mode? (Only
one I can find right now is MIPS 1.) Hmmmm.
Fixing bugs
This is just a semi-random collection of notes I made on fixing specific things, in the hopes that it demonstrates some or the sorts of things that happen when C and Rust don’t quite match.
mp3dec_decode_frame()
calls
mp3d_find_frame()
. This calls
hdr_frame_bytes()
. This calls
hdr_bitrate_kbps()
. If this is called with
mp3[1]
being 0, this results in an array underflow in
HDR_GET_LAYER(h) - 1
. This makes
hdr_bitrate_kbps()
return garbage outside the realm of its
array, but since that array is static that is probably always
valid memory and you don’t get a segfault. Upon inspection the C version
doesn’t call hdr_bitrate_kbps()
unless
hdr_valid()
is true but the Rust version reordered things a
little; not sure if that was my goof or corrode
’s. I think
maybe both; corrode
put a variable in a scope that was too
small and I lifted it out. Easily fixed by fixing the ordering.
There was also another bug in hdr_bitrate_kbps()
where
!!HDR_TEST_MPEG1(h)
turns a “0 is false, anything else is
true” boolean into a 0-or-1 array index by negating twice.
corrode
appears to have only put one !
in the
resulting code, and !
on an integer in Rust is bitwise
complement, not binary not, so it was producing horrible output.
With those bugs fixed, a good half of the test vectors… well, don’t panic and produce results that aren’t obvious nonsense, at least! Huzzah!
There’s a number of panics from corrode
not translating
<<
and >>
into
wrapping_shl()
and wrapping_shr()
; rust
<<
and >>
panic if any bits would
be lost, I believe. Well, <<
sure does; I’ll probably
find out for >>
. That’s corrode being a little derpy
again.
Array index out of bounds panics. …there appear to be a LOT of them
associated with the usage of the float grbuf[2][576];
(grain buffer?) struct member, which is sometimes treated as two arrays
of 576 floats, but quite also often treated as a single array of 1152
floats that happens to be divided into two halves, most egregiously in
l3_midside_stereo()
. Siiiiiigh. Well, for now we can just
make two different versions of the functions for the two cases, and
maybe manage to get rid of one later.
Conclusion
In the end, the results of my work are in the rinimp3
crate,
because I suck at naming things. There’s STILL some panics to hunt down,
one (large, scary-looking) function remains unsafe
and
effectively untranslated, and there’s a lot of helper-code that could be
written to make the API nicer to use. Still, it seems to produce
semi-sane test output on at least SOME of the test vectors, which is
frankly a win at this point. Maybe sometime soon I’ll get another week
or so of free time and be able to finish it off.