NxNotes

Rambling refactored out of NervesNotes.

Nx Notes

Nx’s chonk-of-numbers data structure is called a tensor, which is basically just your standard homogeneous, multi-dimensional typed array. Under the hood it’s represented as an Elixir binary() type, which makes sense, but since binaries are immutable I wonder a little about mem usage. I recall reading a paper about representing Nx tensors as a tree of nested tuples, with the leaves being moderate-sized (256-byte or something) matrices, which apparently worked decently and meant that you could mutate small chunks at random without having to recreate the entire tensor. This is also interesting ’cause if you try to represent gigantic sparse matrices efficiently you often end up sticking them into some kind of tree structure anyway, so there’s some convergent evolution there. It looks like Nx tensors actually have a “backend” option as well, which lets you choose different representations for them, so you can stick your tensors on the GPU using backends provided by EXLA or TorchX. …Ok that’s pretty damn slick, if it works as well as it looks like it should. Something to play around with later.
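For example (untested here, and assuming EXLA is already pulled in as a dependency), the backend can be chosen per-tensor or process-wide:

# :backend is an option on Nx.tensor/2
t = Nx.tensor([1.0, 2.0, 3.0], backend: EXLA.Backend)

# or make it the default for everything created afterwards
Nx.default_backend(EXLA.Backend)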

Let’s try to make and operate on some absurdly big tensors and see how it goes! …Hmmm, how to print out memory usage? NervesMOTD.print() prints out a splash banner with some memory info, where does it get that from? Ah, runtime_info/1 is an IEx helper (IEx.Helpers.runtime_info/1, auto-imported into the shell) that gives you some nice stats, and you can ask for a few different topics; it prints out the list of the ones available. For me the full list is runtime_info([:system, :memory, :limits, :applications, :allocators]), which is nice, but what I really want is just runtime_info([:memory, :allocators]), which prints out the following:

## Memory 

Atoms                1 MB
Binaries             1 MB
Code                 30 MB
ETS                  1 MB
Processes            19 MB
Total                65 MB

## Memory allocators 

                     Block size       Carrier size   Max carrier size
Temporary                  0 KB             256 KB             256 KB
Short-lived                1 KB             256 KB             256 KB
STD                      245 KB            1280 KB            1280 KB
Long-lived             15349 KB           19456 KB           19456 KB
Erlang heap            16896 KB           23948 KB           23948 KB
ETS                     1333 KB            3200 KB            3200 KB
Fix                      130 KB             256 KB             256 KB
Literal                 5148 KB            6880 KB            6880 KB
Binary                  1686 KB            2948 KB            2948 KB
Driver                    24 KB             256 KB             256 KB
Total                  40815 KB           58736 KB           58736 KB

Showing topics:      [:memory, :allocators]
Additional topics:   [:system, :limits, :applications]

To view a specific topic call runtime_info(topic)

I don’t know enough about BEAM’s internals to know what “carrier size” is off-hand (it turns out carriers are the chunks of memory the VM’s allocators request from the OS, and “blocks” are the individual allocations placed inside them), but getting the numbers in kilobytes is convenient. The virtual machine has 128 MB of memory total, so we’re using half of it on a barebones system, even though the Nerves docs really recommend 512 MB of memory for x86_64. …Oh, the help docs for runtime_info() also suggest recon for more detailed mem usage info, another thing to play around with later.
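For future-me, something like this (untested, going off recon’s docs) should show allocated-from-the-OS vs. actually-used memory across those allocators:

:recon_alloc.memory(:allocated)
:recon_alloc.memory(:used)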

Ok, let’s make a big tensor:

a = Nx.broadcast(3, {512, 512})
#Nx.Tensor<
  s64[512][512]
  [
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...],
    ...
  ]
>

Hm, is s64 an integer type? Can we make it a float instead?

a = Nx.broadcast(3.5, {512, 512})
#Nx.Tensor<
  f32[512][512]
  [
    [3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, ...],
    ...
  ]
>

Ah, yes. Hmm, how do we tell it to give us f64’s? broadcast() doesn’t seem to take a type option, does it always inherit the type of the input tensor?

a = Nx.broadcast(Nx.tensor(3, type: :f64), {512, 512})
#Nx.Tensor<
  f64[512][512]
  [
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...],
    ...
  ]
>

Yep, yep there we go. So each of these f32 512x512 tensors is 1 megabyte at least, and the f64 or s64 ones are 2 megs. Making a few of those and then calling runtime_info([:memory, :allocators]) does show the “binary” heap growing by pretty much that amount. But once I get more than 20 MB or so of binaries the system hangs for a while, then OOM’s and reboots itself. I guess 128 MB of memory is pretty much the lower limit of what you’d want to run Nerves on!
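The arithmetic is easy to sanity-check, since Nx.byte_size/1 reports the size of a tensor’s underlying data:

a = Nx.broadcast(Nx.tensor(3, type: :f64), {512, 512})
Nx.byte_size(a)
# 2097152, ie 512 * 512 * 8 bytes = 2 MiB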

Odd though, I’d expect that binaries could get GC’ed? I thought they were basically refcounted blobs. Or does their refcount only decrement when a process exits or something? …I gotta torment this system more. /bin/free reports 112 MB of memory total and 78 MB of it “used” before allocating another couple megs OOM’s it, but lists 26 MB of “shared” and 32 MB of “buff/cache”, which is apparently a lot less liquid than I expected it to be. Oh, “shared” isn’t shared libs at all, which I’d expect to be minimal anyway since it’s busybox built with musl; according to free’s man page it’s shmem, mostly tmpfs, and the system does have about 20 megs of tmpfs between /dev, /tmp and /run. Similarly buff/cache isn’t just filesystem cache, it also includes reclaimable kernel data structures (slab). So 78 MB is being used by userland (i.e., Erlang), up to 26 MB is allocated for tmpfs stuff, and most of the rest is used by the kernel for various stuff. Whew, the world still makes sense.

Still! Why no GC for binaries?! It’s something to do with IEx: I can run spawn(fn -> runtime_info([:memory, :allocators]); _b = Nx.broadcast(Nx.tensor(3, type: :f64), {1024, 512}); runtime_info([:memory, :allocators]) end) all day and it prints the same amount of memory used in the Binary heap before and after the Nx.broadcast(), so GC is getting triggered promptly once each spawned process finishes. …Oh yeah, if I quit my SSH session and log back in again, the amount of binary memory reported by runtime_info() goes back to the baseline. Whew, the world still makes sense.
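If the shell really is the culprit, my guess is that the IEx process’s bindings and eval history are what keep refs to the binaries alive. A thing to try (no promises, since the v/1 history may still pin them):

a = nil
:erlang.garbage_collect(self())
runtime_info([:memory])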

…yeah see this is why I’m bad at math. I’m like “let’s play with a math lib!” and then spend the entire time tormenting BEAM’s and Linux’s memory systems.

Right, back on topic a little. There’s like ten million tensor functions and constants, so hopefully that’s enough for you. Oh, there’s a complex number type too, c64, neat. Hmm, I wonder how dense Nx tensors really are compared to BEAM lists? Nx.to_list() can help us conveniently construct big chonky lists, though it’s a little weird ’cause apparently BEAM is better at GC’ing and dedup’ing lists than binaries. Each cons cell in a list is two words (16 bytes on a 64-bit BEAM), so against 4-byte elements that’s already 4x larger than a dense array, and floats get boxed on the heap for another couple words each. Nx.to_list on a 1 MB tensor does OOM the machine, so it might be worse than that, or it might just add enough GC pressure all at once that BEAM’s copying collector can’t cope. Bumping up the memory available in the VM, I can make an 8 MB f64 tensor, and turning it into a list consumes about 44 MB, give or take a little. Doing it again with a 4 MB s32 tensor produces a 27 MB list. So lists are more like 5-6x less dense.
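That density gap can be measured directly with :erts_debug.size/1, an undocumented-but-real helper that returns the flat heap size of a term in words:

t = Nx.iota({1024}, type: :f64)
Nx.byte_size(t)
# 8192 bytes

l = Nx.to_list(t)
:erts_debug.size(l) * 8
# flat size in words, times 8 for bytes on a 64-bit BEAM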

Back on topic. Nx also provides a defn macro to “define a numerical function”. It works just like def or defp to actually use, but translates the code you write with - and + and whatnot to use Nx functions… and in fact seems like it gives you a DSL that is a subset of Elixir and–

defn definitions allow for building computation graph of all the individual operations and using a just-in-time (JIT) compiler to emit highly specialized native code for the desired computation unit.

…JFC it’s heckin’ CUDA. Written as an Elixir macro. That’s either absolutely brilliant or absolutely insane, and I can’t even begin to imagine all the horrible things that might be wrong with it, but I gotta try this out:

defmodule HelloNerves do
  import Nx.Defn

  defn subtract_nx(a, b) do
    a - b
  end

  def subtract_list(a, b) do
    # Walk a and b in lockstep; Enum.reduce prepends to the accumulator,
    # so reverse the result to restore the original order.
    {res, []} = Enum.reduce(a, {[], b}, fn x, {acc, [hd | tl]} -> {[x - hd | acc], tl} end)
    Enum.reverse(res)
  end
end

Build the sucker, and:

iex> HelloNerves.subtract_list([1,2,3], [4,5,6])
[-3, -3, -3]

iex> HelloNerves.subtract_nx(Nx.tensor([1,2,3]), Nx.tensor([4,5,6]))
#Nx.Tensor<
  s64[3]
  [-3, -3, -3]
>

# Ok, it works.  Time it...
iex> :timer.tc(fn -> HelloNerves.subtract_list([1,2,3], [4,5,6]) end, :microsecond)
# 100ish microseconds

iex> :timer.tc(fn -> HelloNerves.subtract_nx(Nx.tensor([1,2,3]), Nx.tensor([4,5,6])) end, :microsecond)
# 3000ish microseconds

Ok that’s not a particularly good comparison ’cause it also takes time to create the tensors, but I’ll cut out the dross and just make a table:

List size      List subtract      Tensor subtract
512            ~2500 μs           ~6000 μs
4096           ~6000 μs           ~30,000 μs
16,384         ~18,000 μs         ~80,000 μs
256k           ~300 ms            ~1500 ms
1M             ~700 ms            ~6000 ms
1G             longer than I care to wait

This is a pretty bad benchmark, but that’s still somewhat disappointing. I expected a crossover point where Nx tensors became faster than lists, but lists started out faster and the gap just widened as the dataset got bigger. I tried with 2-dimensional lists/tensors too, with the same result. The main takeaway here honestly might be that BEAM is much better at optimizing math than I thought! Guess that JIT in there isn’t just for show. A few caveats though. First, lists take up way more memory than tensors and end up taking quite a bit longer to construct in the first place. Second, variance between invocations was a lot higher with lists than with tensors, presumably ’cause they generate a lot more GC work. Third, these are all single-dimensional vectors whose runtime is going to be dominated by iteration time rather than operation time, so things might look different with different operations or more complicated matrices. Try doing a transpose() and see which is faster, I guess. I wonder how its performance stacks up to numpy– no, no, no, I’m finished with this for now, let’s move on!

Not doing that

On Discord people say to use EXLA for big matrices, and also that the default backend is slower than EXLA for smol matrices but that may be due to the overhead of creating them. Apparently the default backend is not actually very smart, so even if you use defn or such it will create many intermediate matrices for its operations. Well shit. Let’s try out EXLA then.
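Before ripping the backend out wholesale, the Nx docs say (untested here) that you can also point just the defn compiler at EXLA, either per-call or globally:

# JIT-compile a defn’d function with EXLA; :compiler is a documented
# Nx.Defn option
jitted = Nx.Defn.jit(&HelloNerves.subtract_nx/2, compiler: EXLA)
jitted.(Nx.tensor([1, 2, 3]), Nx.tensor([4, 5, 6]))

# or globally, in config/config.exs:
# config :nx, :default_defn_options, compiler: EXLA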

According to the readme, EXLA is a binding atop Google’s XLA lib, “accelerated linear algebra”, which says it’s for compiling machine learning models from things like PyTorch and TensorFlow, but mostly appears to be a compiler for turning linear algebra operations into optimized machine/SIMD/GPU/NPU code. Again, I want to do robotics and game code, and there’s gonna be some design differences between crunching a 10,000x10,000 matrix of neural net weights or gene sequences, and a 100,000,000 item list of f32x4’s, which is what you want for simulating physics/graphics/vision. But you can generally turn one into the other with some work, so let’s give it a try.
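Per the EXLA readme, the wiring itself is small; roughly this, with version numbers being whatever was current at the time:

# mix.exs: add EXLA to deps/0 (the :nx dep should already be there)
defp deps do
  [
    {:nx, "~> 0.7"},
    {:exla, "~> 0.7"}
  ]
end

# config/config.exs: make EXLA the default tensor backend
import Config
config :nx, :default_backend, EXLA.Backend

The pain is all in the native build around it: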

Ok, add {:exla, "~> 0.7"} to my mix.exs, add config :nx, :default_backend, EXLA.Backend to my config/config.exs, set the XLA_TARGET_PLATFORM env var to uhhhhhhh x86_64-linux-gnu I suppose, run mix deps.get and mix firmware.burn -d hello_nerves.img and holy shit it appears to try to download and build the correct XLA binaries for me. It bombs out trying to find execinfo.h but that appears to be distributed in the libc6-dev apt package, let’s get that… aaaaand… no, nope, sorry, Nerves builds with musl, not glibc, so the target triple I need to give it is x86_64-linux-musl, and they don’t have prebuilt binaries for that. Oh, but it says I can build from source by setting the XLA_BUILD env var to true. Kinda wish these were command line flags or options in mix.exs or such; in my experience env vars are a great way to sneakily hide state so you can forget about it later. Okayyyyyy clean deps and re-get them… uh, install bazel… wait for Debian Testing to unfuck its bazel package so I can install it… uh, it still tries to install a pre-built package, let’s just mash mix clean and mix deps.clean --all until it stops that; are my env vars all correct? Yes they are. Ok it downloaded the source package and is trying to build it aaaaaaaaand… it fails with ERROR: --experimental_link_static_libraries_once=false :: Unrecognized option: --experimental_link_static_libraries_once=false.

Yeah see this is why I was really not thrilled to try out EXLA. “Let’s just bind to this big massive pile of random C++ code, what could go wrong? Well yes it’s written by a company that has approximately zero concern for anyone’s use case but its own, and not having to fuck with C++ build tools is literally the whole reason to use Nerves, but it’ll be fiiiiine! Honest!” Fucking hell grumbl grambl bitch moan okay what do I get when I google this error message? It looks like the version of bazel I have is not new enough(?) to build this. What version do I have anyway?

$ bazel --version
bazel no_version

…you know what, life’s too fucking short. I have better things to do.