NxNotes
Rambling refactored out of NervesNotes.
Nx’s chonk-of-numbers data structure is called a tensor, which is
basically just your standard homogeneous, multi-dimensional typed array.
Under the hood it’s represented as an Elixir binary() type, which makes
sense, but since binaries are immutable I wonder a little
about mem usage. I recall reading a paper about representing Nx tensors
as a tree of nested tuples, with the leaves being moderate-sized
(256-byte or something) matrices, which apparently worked
decently and meant that you could mutate small chunks at random
without having to recreate the entire tensor. This is also interesting
’cause if you try to represent gigantic sparse matrices
efficiently you often end up sticking them into some kind of tree
structure anyway, so there’s some convergent evolution there. It looks
like Nx tensors actually have a “backend” option as well, which lets you
choose different representations for them, so you can stick your tensors
on the GPU using backends provided by EXLA or TorchX. …Ok that’s
pretty damn slick, if it works as well as it looks like it should.
Something to play around with later.
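Just so I don’t forget the shape of it, the backend stuff looks roughly
like this — a sketch assuming EXLA is pulled in as a dep, I haven’t
actually tried it yet:
# Rough sketch of the backend option; untested, just noting the API shape.
Nx.tensor([1.0, 2.0, 3.0], backend: EXLA.Backend)

# Or flip the default for the whole process instead of per-tensor:
Nx.default_backend(EXLA.Backend)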
Let’s try to make and operate on some absurdly big tensors and see
how it goes! …Hmmm, how to print out memory usage?
NervesMOTD.print() prints out a splash banner with some memory info,
where does it get that from? Ah, runtime_info/1 is apparently an IEx
helper that gives you some nice stats, and you can ask for a few
different topics; it prints out the list of the ones available. For me
the full list is
runtime_info([:system, :memory, :limits, :applications, :allocators]),
which is nice, but what I really want is just
runtime_info([:memory, :allocators]), which prints out the
following:
## Memory
Atoms        1 MB
Binaries     1 MB
Code         30 MB
ETS          1 MB
Processes    19 MB
Total        65 MB
## Memory allocators
              Block size    Carrier size    Max carrier size
Temporary     0 KB          256 KB          256 KB
Short-lived   1 KB          256 KB          256 KB
STD           245 KB        1280 KB         1280 KB
Long-lived    15349 KB      19456 KB        19456 KB
Erlang heap   16896 KB      23948 KB        23948 KB
ETS           1333 KB       3200 KB         3200 KB
Fix           130 KB        256 KB          256 KB
Literal       5148 KB       6880 KB         6880 KB
Binary        1686 KB       2948 KB         2948 KB
Driver        24 KB         256 KB          256 KB
Total         40815 KB      58736 KB        58736 KB
Showing topics: [:memory, :allocators]
Additional topics: [:system, :limits, :applications]
To view a specific topic call runtime_info(topic)
I don’t know enough about BEAM’s internals to know what “carrier
size” and such are, probably GC stuff, but getting the numbers in
kilobytes is convenient. The VM has 128 MB of memory total, so we’re
using half of it on a barebones system, even though the Nerves docs
really recommend 512 MB of memory for x86_64. …Oh, the help docs for
runtime_info() also suggest recon for more detailed mem usage info,
another thing to play around with later.
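Noting a couple of recon calls to try when I get around to it (assuming
recon gets added as a dep, say {:recon, "~> 2.5"}):
# Top five processes by memory usage:
:recon.proc_count(:memory, 5)

# Allocator-level view of how much memory the VM has actually grabbed:
:recon_alloc.memory(:allocated)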
Ok, let’s make a big tensor:
a = Nx.broadcast(3, {512, 512})
#Nx.Tensor<
  s64[512][512]
  [
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...],
    ...
  ]
>
Hm, is s64 an integer type? Can we make it a float instead?
a = Nx.broadcast(3.5, {512, 512})
#Nx.Tensor<
  f32[512][512]
  [
    [3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, ...],
    ...
  ]
>
Ah, yes. Hmm, how do we tell it to give us f64’s? broadcast() doesn’t
seem to take a type option, does it always inherit the type of the
input tensor?
a = Nx.broadcast(Nx.tensor(3, type: :f64), {512, 512})
#Nx.Tensor<
  f64[512][512]
  [
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...],
    ...
  ]
>
Yep, yep there we go. So each of these f32 512x512 tensors is 1
megabyte at least, and the f64 or s64 ones are 2 megs. Making a few of
those and then calling runtime_info([:memory, :allocators]) does show
the “binary” heap growing by pretty much that amount. But once I get
more than 20 MB or so of binaries the system hangs for a while, then
OOM’s and reboots itself. I guess 128 MB of memory is pretty much the
lower limit of what you’d want to run Nerves on!
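Quick sanity check on that size arithmetic — I’m pretty sure
Nx.byte_size/1 is the right call for this:
# 512 * 512 * 4 bytes for f32 = 1 MB
Nx.broadcast(3.5, {512, 512}) |> Nx.byte_size()
# => 1048576

# 512 * 512 * 8 bytes for f64 = 2 MB
Nx.broadcast(Nx.tensor(3, type: :f64), {512, 512}) |> Nx.byte_size()
# => 2097152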
Odd though, I’d expect that binaries could get GC’ed? I thought they
were basically refcounted blobs. Or does their refcount only decrement
when a process exits or something? …I gotta torment this system more.
Looks like with /bin/free Linux reports 112 MB of memory total and 78
MB of it “used” before allocating another couple megs OOM’s it, but
lists 26 MB of “shared” and 32 MB of “buff/cache”, which is apparently
a lot less liquid than I expected it to be. Oh, “shared” is not just
shared libs, which I’d expect to be minimal since it’s busybox built
with musl. It’s used for tmpfs stuff too according to the man pages
for free, and the system does have about 20 megs of tmpfs between
/dev, /tmp and /run. Similarly buff/cache isn’t just filesystem cache,
it’s also kernel data structures. So 78 MB is being used by userland
(i.e., Erlang), up to 26 MB is allocated for tmpfs stuff, and most of
the rest is used by the kernel for various stuff. Whew, the world
still makes sense.
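(For next time: no need to drop to a shell for this, plain old
System.cmd/2 can grab the same numbers from the IEx prompt.)
# Ask Linux for its view of memory without leaving IEx.
{out, _status} = System.cmd("free", [])
IO.puts(out)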
Still! Why no GC for binaries?! It’s something to do with IEx, I can
run

spawn(fn ->
  runtime_info([:memory, :allocators])
  b = Nx.broadcast(Nx.tensor(3, type: :f64), {1024, 512})
  runtime_info([:memory, :allocators])
end)

all day and it prints the same amount of memory used in the Binary
heap before and after the Nx.broadcast(), so GC is getting triggered
immediately. …Oh yeah, if I quit my SSH session and log back in again,
the amount of binary memory reported by runtime_info() goes back to
the baseline. Whew, the world still makes sense.
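Something to try next time instead of logging out: force a GC on every
process and see whether the Binary number drops back to baseline.
# Brute-force: garbage-collect every process, then re-check binary memory.
Process.list() |> Enum.each(&:erlang.garbage_collect/1)
runtime_info([:memory])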
…yeah see this is why I’m bad at math. I’m like “let’s play with a math lib!” and then spend the entire time tormenting BEAM’s and Linux’s memory systems.
Right, back on topic a little. There’s like ten million tensor
functions and constants, so hopefully that’s enough for you. Oh, there’s
a complex number type too, c64, neat. Hmm, I wonder how dense Nx
tensors really are compared to BEAM lists? Nx.to_list() can help us
conveniently construct big chonky
lists, though it’s a little weird ’cause apparently BEAM is better at
GC’ing and dedup’ing lists than binaries. I’d expect each cons cell in
the list to be 12-16 bytes though, so a factor of 3-4x larger than a
dense array. Nx.to_list on a 1 MB tensor does OOM the
machine, so it might be worse than that, or it might just add enough GC
pressure all at once that it doesn’t have enough memory for BEAM’s
copying collector to cope. Bumping up the memory available in the VM, I
can make an 8 MB f64 tensor, and turning it into a list consumes about
44 MB, give or take a little. Doing it again with a 4 MB s32 tensor
produces a 27 MB list. So lists are more like 5-6x less dense.
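A less OOM-prone way to check that ratio next time might be
:erts_debug.flat_size/1, which reports a term’s size in machine words
(8 bytes each on a 64-bit BEAM):
# Compare a small tensor's byte size against its list form's heap footprint.
t = Nx.broadcast(3.5, {1, 1024})
list_bytes = :erts_debug.flat_size(Nx.to_list(t)) * 8
IO.puts("tensor: #{Nx.byte_size(t)} bytes, list: #{list_bytes} bytes")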
Back on topic. Nx also provides a defn macro to “define a numerical
function”. It works just like def or defp to actually use, but
translates the code you write with - and + and whatnot to use Nx
functions… and in fact seems like it gives you a DSL that is a subset
of Elixir and–
definitions allow for building computation graph of all the individual operations and using a just-in-time (JIT) compiler to emit highly specialized native code for the desired computation unit.
…JFC it’s heckin’ CUDA. Written as an Elixir macro. That’s either absolutely brilliant or absolutely insane, and I can’t even begin to imagine all the horrible things that might be wrong with it, but I gotta try this out:
defmodule HelloNerves do
  import Nx.Defn

  # Element-wise subtraction via Nx; defn rewrites the - operator into Nx ops.
  defn subtract_nx(a, b) do
    a - b
  end

  # Element-wise subtraction on plain lists; the reduce builds the result
  # backwards, so reverse it at the end to keep the input order.
  def subtract_list(a, b) do
    {res, []} = Enum.reduce(a, {[], b}, fn x, {acc, [hd | tl]} -> {[x - hd | acc], tl} end)
    Enum.reverse(res)
  end
end
Build the sucker, and:
iex> HelloNerves.subtract_list([1,2,3], [4,5,6])
[-3, -3, -3]
iex> HelloNerves.subtract_nx(Nx.tensor([1,2,3]), Nx.tensor([4,5,6]))
#Nx.Tensor<
  s64[3]
  [-3, -3, -3]
>
# Ok, it works. Time it...
iex> :timer.tc(fn -> HelloNerves.subtract_list([1,2,3], [4,5,6]) end, :microsecond)
# 100ish microseconds
iex> :timer.tc(fn -> HelloNerves.subtract_nx(Nx.tensor([1,2,3]), Nx.tensor([4,5,6])) end, :microsecond)
# 3000ish microseconds
Ok that’s not a particularly good comparison ’cause it also takes time to create the tensors, but I’ll cut out the dross and just make a table:
List size | List subtract | Tensor subtract |
---|---|---|
512 | ~2500 μs | ~6000 μs |
4096 | ~6000 μs | ~30,000 μs |
16,384 | ~18,000 μs | ~80,000 μs |
256k | ~300 ms | ~1500 ms |
1M | ~700 ms | ~6000 ms |
1G | longer than I | care to wait |
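For the record, a slightly fairer timing loop would pre-build the
inputs outside the timed fun, something like this sketch (sizes
arbitrary):
# Build the inputs once so list/tensor construction doesn't get counted.
n = 16_384
list_a = Enum.to_list(1..n)
list_b = Enum.to_list(1..n)
tensor_a = Nx.tensor(list_a)
tensor_b = Nx.tensor(list_b)

{list_us, _} = :timer.tc(fn -> HelloNerves.subtract_list(list_a, list_b) end)
{tensor_us, _} = :timer.tc(fn -> HelloNerves.subtract_nx(tensor_a, tensor_b) end)
IO.puts("lists: #{list_us} µs, tensors: #{tensor_us} µs")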
The table above comes from a pretty bad benchmark, but it’s still
somewhat disappointing. I expected a crossover point where Nx tensors
became faster than lists, but lists started out faster and the gap
only widened as the dataset got bigger. I tried with 2-dimensional
lists/tensors too, with the same result. The main takeaway here
honestly might be that BEAM is much better at optimizing math than I
thought! Guess that JIT in there isn’t just for show. A few caveats
though: first, lists take up way more memory than tensors and end up
taking quite a bit longer to construct in the first place. Second,
variance between invocations was a lot higher with lists than with
tensors, presumably ’cause it had to do a lot more GC in the process.
Third, these are all single-dimensional vectors whose timings are
going to be dominated by iteration rather than the actual arithmetic,
so things might look different with different operations or more
complicated matrices. Try doing a transpose() and see which is faster,
I guess. I wonder how its performance stacks up to numpy – no, no, no,
I’m finished with this for now, let’s move on!
Not doing that
On Discord people say to use EXLA for big matrices, and also that the
default backend is slower than EXLA for smol matrices but that may be
due to the overhead of creating them. Apparently the default backend is
not actually very smart, so even if you use defn
or such it
will create many intermediate matrices for its operations. Well shit.
Let’s try out EXLA then.
According to the readme, EXLA is a binding atop Google’s XLA lib, “accelerated linear algebra”, which says it’s for compiling machine learning models from things like PyTorch and TensorFlow, but mostly appears to be a compiler for turning linear algebra operations into optimized machine/SIMD/GPU/NPU code. Again, I want to do robotics and game code, and there’s gonna be some design differences between crunching a 10,000x10,000 matrix of neural net weights or gene sequences, and a 100,000,000 item list of f32x4’s, which is what you want for simulating physics/graphics/vision. But you can generally turn one into the other with some work, so let’s give it a try.
Ok, add {:exla, "~> 0.7"} to my mix.exs, add
config :nx, :default_backend, EXLA.Backend to my config/config.exs,
set the XLA_TARGET_PLATFORM env var to uhhhhhhh x86_64-linux-gnu I
suppose, run mix deps.get and mix firmware.burn -d hello_nerves.img
and holy shit it appears to try to download and build the correct XLA
binaries for me. It bombs out trying to find execinfo.h, but that
appears to be distributed in the libc6-dev apt package, let’s get
that… aaaaand… no, nope, sorry, Nerves builds with musl, not glibc, so
the target triple I need to give it is x86_64-linux-musl, and they
don’t have prebuilt binaries for that. Oh, but it says I can build
from source by setting the env var XLA_BUILD to true. Kinda wish these
were command line flags or options in mix.exs or such; in my
experience env vars are a great way to sneakily hide state so you can
forget about it later.
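For future-me’s reference, the config bits from that paragraph look
roughly like this (deps layout is the stock mix.exs one, env vars go
in whatever shell runs mix):
# mix.exs: add EXLA to the deps list
defp deps do
  [
    {:exla, "~> 0.7"}
    # ...the rest of the existing deps...
  ]
end

# config/config.exs: make EXLA the default Nx backend
config :nx, :default_backend, EXLA.Backend

# and in the shell, before mix deps.get / mix firmware:
#   export XLA_TARGET_PLATFORM=x86_64-linux-musl
#   export XLA_BUILD=true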
Okayyyyyy, clean deps and re-get them… uh, install bazel… wait for
Debian Testing to unfuck its bazel package so I can install it… uh, it
still tries to install a pre-built package, let’s just mash mix clean
and mix deps.clean --all until it stops that; are my env vars all
correct? Yes they are. Ok, it downloaded the source package and is
trying to build it aaaaaaaaand… It fails to build with
ERROR: --experimental_link_static_libraries_once=false :: Unrecognized option: --experimental_link_static_libraries_once=false.
Yeah see, this is why I was really not thrilled to try out EXLA. “Let’s just bind to this big massive pile of random C++ code, what could go wrong? Well yes, it’s written by a company that has approximately zero concern for anyone’s use case other than its own, and not having to fuck with C++ build tools is literally the whole reason to use Nerves, but it’ll be fiiiiine! Honest!” Fucking hell grumbl grambl bitch moan okay, what do I get when I google this error message? It looks like the version of bazel I have is not new enough(?) to build this. What version do I have anyway?
$ bazel --version
bazel no_version
…you know what, life’s too fucking short. I have better things to do.