Dllicious

Introduction

This is entirely the fault of a friend complaining about trying to move software between different versions of RHEL. I don’t know why I’m doing this, but it seemed like a good idea at the time. Why is this page named what it is? No idea. This was done in Jan 2022.

Problem

People have Opinions about DLL’s these days. (I know Linux calls them shared objects, but that’s a dumb name, so I’ll call them DLL’s.) DLL’s add a level of complexity to writing and using software, and newer languages like Rust and Go have eschewed them, while Alpine Linux and maybe some other distributions also just don’t bother using them. On the other hand, they exist for a reason, i.e. sharing compiled code with a common ABI between multiple programs. This has produced a fair amount of discourse over the last few years asking interesting questions: Are they necessary? Are they useful? Are they worth the trouble? Can we reinvent the linking process to make the whole system better? These conversations tend to have a lot of Opinion to them and not much actual data, so let’s start collecting data.

What data do we collect? Well, I am going to be looking at my everyday Linux system, an x86_64 desktop running Debian Bookworm. This is a quick-and-dirty survey: I want to do this analysis in like 90 minutes or so, and I will never do much with the results beyond going “hmm, that’s neat”. For the data I want it would probably be best to make a small sqlite database, import everything under the sun into it, and then have a set of SQL queries to do the actual analysis, but I don’t particularly like SQL and have to re-learn it every time I use it, so that would take Time. Next best bet would be to write a pipeline in Python or Julia or something and use either CSV or JSON files as intermediate products, but I’m bored of Python and don’t feel like learning Julia right now, so I’m not going to do that either. Hence if I can’t do something with shell scripting, I’m not going to do it.

Also note I have a bit of background in data science, but was never terribly good at it, so I’m just doing this for fun. Hence I will write this process down tutorial-style in the hopes of it being interesting to others, or in case someone dares try to reproduce the process. If you’re not interested in the process, just skip to the bottom of each section for the conclusions. The overall question we are trying to answer here is: “How useful are DLL’s?” This data will not answer that question, but may let us start measuring some pieces of it.

Count uses per DLL

Easy things first. How often is each DLL in the system actually used?

First, we find all executable files:

find / -executable -type f > files.txt

(Make sure you don’t have any remote filesystems mounted unless you want to wait, or give find the -mount option, though that won’t do what you want if you have, say, / and /home on separate drives.)
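If you would rather not depend on find, the same listing can be sketched in Python. This is only an approximation: the function name executable_files and its same_device flag are made up for illustration, and it checks for any execute bit rather than asking whether the current user may execute the file, which is what find -executable does.

```python
import os
import stat


def _device(path):
    """Return the device number of path, or None if it cannot be read."""
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None


def executable_files(root, same_device=True):
    """Yield regular files under root that have any execute bit set.

    With same_device=True the walk stays on root's filesystem, which is
    roughly what find's -mount option does.
    """
    root_dev = _device(root)
    for dirpath, dirnames, filenames in os.walk(root):
        if same_device:
            # Prune directories that live on another device (mount points).
            dirnames[:] = [d for d in dirnames
                           if _device(os.path.join(dirpath, d)) == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(mode) and mode & 0o111:
                yield path
```

Printing each path from executable_files("/") on its own line would give something comparable to files.txt, with the same caveat about separate drives.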

How many executable files do we have?

wc files.txt
  368435 ...

Ok. Do we have any duplicates?

sort files.txt | uniq | wc
 368435 ...

We don’t, good.

Run ldd on all files in parallel and save the results:

xargs -P 0 -a files.txt ldd > raw-dlls.txt

If the file is not an ELF executable then it will output not a dynamic executable to stderr, not stdout, so we’re gucci, that won’t go into our data file. If the file is an ELF executable that uses no DLL’s, it will output statically linked to stdout, so we can keep track of those too if we want.

The raw-dlls.txt file looks like this:

/usr/share/pnm2ppa/update-magicfilter:
/usr/share/pixmaps/com.visualstudio.code.png:
/usr/lib/wine/wineserver32:
/srv/chroot/debcargo-unstable-amd64-sbuild/var/lib/dpkg/info/bash.prerm:
/usr/share/flatpak/triggers/gtk-icon-cache.trigger:
        linux-gate.so.1 (0xf7f0c000)
        libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf7c57000)
        /lib/ld-linux.so.2 (0xf7f0e000)
...

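Snip off everything that isn’t an indented library line. (Strictly, this pattern keeps any line containing a space, not just lines that start with whitespace; grep '^[[:space:]]' would be the stricter version.)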

grep ' .*' raw-dlls.txt > processed-dlls0.txt

Now it looks like this:

        linux-gate.so.1 (0xf7f0c000)
        libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf7c57000)
        /lib/ld-linux.so.2 (0xf7f0e000)
        linux-vdso.so.1 (0x00007ffe879cb000)
        libcrypto.so.1.1 => /lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007f43f72fc000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f43f72df000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f43f7116000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f43f710f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f43f70ee000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f43f7698000)
        linux-vdso.so.1 (0x00007fff50fb5000)

We only care about the DLL name, so chop off everything after the first non-leading space:

awk -e '{print $1;}' processed-dlls0.txt > processed-dlls1.txt
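The awk one-liner only looks at field positions, which is where some of the odd entries further down come from. A slightly more careful parse of a single ldd line might look like the sketch below. The function name parse_ldd_line is hypothetical, and it assumes the three line shapes visible in the sample output plus ldd’s "not found" and "statically linked" messages.

```python
def parse_ldd_line(line):
    """Return (name, path) for one indented line of ldd output, or None.

    path is None when ldd printed no resolved file (the vDSO, "not found").
    """
    if not line[:1].isspace():
        return None  # header line such as "/usr/bin/foo:"
    text = line.strip()
    if text in ("", "statically linked", "not a dynamic executable"):
        return None
    if " => " in text:
        name, target = text.split(" => ", 1)
        if target.startswith("not found"):
            return name, None
        # Drop the trailing "(0x...)" load address.
        return name, target.rsplit(" (", 1)[0]
    name = text.rsplit(" (", 1)[0]
    # Without "=>" the line is either an absolute path (the loader) or the vDSO.
    return name, (name if name.startswith("/") else None)
```

Keeping the name and the resolved path together also means the full paths are already there if you need them later.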

Ok, how many lines are in there?

wc processed-dlls1.txt
 208544 ...

So all the DLL’s on the system put together are used 208,000 times. This is a sort of weird measurement, it’s “the sum of the count of the DLL’s used by each executable”. Let’s turn it into something more handy, a frequency count of how many times each DLL is used.

sort processed-dlls1.txt | uniq -c | sort -n > dll-counts.txt
wc dll-counts.txt
 1654 ...

So there are 1654 separate DLL’s used on this system. Eyeballing the dll-counts.txt file, the top of it looks like this:

      1 2
      1 5.so.3
      1 6
      1 bselinux.so.1
      1 /home/icefox/.local/share/flatpak/repo/objects/09/6fc1f3300c1252e2810e0f5c91c1543f484493bbd3c651adc39cec4a4cd335.file:
      1 /home/icefox/.local/share/flatpak/repo/objects/46/ba66c86aee21c094f1904485426db3720602307eb8084cef162ef321c8e341.file:
      1 /home/icefox/.local/share/flatpak/repo/objects/60/743c0e414503ba991f49a846bf1bf822c4f12867d4f9ee0aa7095a578560f3.file:
      1 /home/icefox/.local/share/flatpak/repo/objects/87/eea9d125b5b515c1b5a8e36583c98a5cb07abe8251526c78bf6a681a953e0e.file:
      1 /home/icefox/.local/share/flatpak/repo/objects/ee/60298a49f27ca7c49a2446e896abf7a91ab7595eecf4f0fd965343178893c6.file:
      1 /home/icefox/.local/share/flatpak/repo/objects/fc/1cd1247a60f732cbfc4da57b6c4b857cc358a82746d378e66830cc7a459785.file:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Platform/x86_64/master/54b532eafb6153eef6d192fa84fbd4b6138c1e0c9a886c3883a317e484f251a1/files/bin/dbus-daemon:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Platform/x86_64/master/54b532eafb6153eef6d192fa84fbd4b6138c1e0c9a886c3883a317e484f251a1/files/bin/dbus-launch:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Platform/x86_64/master/54b532eafb6153eef6d192fa84fbd4b6138c1e0c9a886c3883a317e484f251a1/files/bin/dbus-monitor:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Platform/x86_64/master/54b532eafb6153eef6d192fa84fbd4b6138c1e0c9a886c3883a317e484f251a1/files/bin/dbus-send:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Platform/x86_64/master/54b532eafb6153eef6d192fa84fbd4b6138c1e0c9a886c3883a317e484f251a1/files/bin/dbus-test-tool:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Platform/x86_64/master/54b532eafb6153eef6d192fa84fbd4b6138c1e0c9a886c3883a317e484f251a1/files/bin/dbus-uuidgen:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Sdk/x86_64/master/0f2f5e7e0c844db78be17aa3ac1c2f8b34de2415de242fa328da51ae5c37d4ed/files/bin/dbus-daemon:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Sdk/x86_64/master/0f2f5e7e0c844db78be17aa3ac1c2f8b34de2415de242fa328da51ae5c37d4ed/files/bin/dbus-launch:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Sdk/x86_64/master/0f2f5e7e0c844db78be17aa3ac1c2f8b34de2415de242fa328da51ae5c37d4ed/files/bin/dbus-monitor:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Sdk/x86_64/master/0f2f5e7e0c844db78be17aa3ac1c2f8b34de2415de242fa328da51ae5c37d4ed/files/bin/dbus-send:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Sdk/x86_64/master/0f2f5e7e0c844db78be17aa3ac1c2f8b34de2415de242fa328da51ae5c37d4ed/files/bin/dbus-test-tool:
      1 /home/icefox/.local/share/flatpak/runtime/org.gnome.Sdk/x86_64/master/0f2f5e7e0c844db78be17aa3ac1c2f8b34de2415de242fa328da51ae5c37d4ed/files/bin/dbus-uuidgen:
      1 ibxml2.so.2
      1 inux-gnu/libxcb-render.so.0
      1 libaa.so.1
      1 libaccountsservice.so.0
      1 libads.so.0
      1 libakonadi-filestore.so.5
      1 libanl.so.1
      1 libann.so.0
      1 libao.so.4
      1 libart.so.0
      1 libatasmart.so.4
      1 libatopology.so.2
      1 libaudcore.so.5
...

And the bottom of it looks like this:

...
    802 libXcursor.so.1
    809 libXrandr.so.2
    818 libdw.so.1
    837 libcairo.so.2
    848 libXi.so.6
    855 libxcb-render.so.0
    857 libxcb-shm.so.0
    873 libXfixes.so.3
    875 libelf.so.1
    877 libpixman-1.so.0
    928 libgdk_pixbuf-2.0.so.0
   1009 libharfbuzz.so.0
   1016 libgraphite2.so.3
   1027 libXrender.so.1
   1035 libdbus-1.so.3
   1060 libfontconfig.so.1
   1061 libbz2.so.1.0
   1113 libjpeg.so.62
   1158 libsystemd.so.0
   1191 libXext.so.6
   1244 libfreetype.so.6
   1256 libexpat.so.1
   1257 libuuid.so.1
   1298 libcap.so.2
   1310 liblz4.so.1
   1362 libbrotlidec.so.1
   1371 libbrotlicommon.so.1
   1443 libgcrypt.so.20
   1482 libgpg-error.so.0
   1507 libpng16.so.16
   1527 libzstd.so.1
   1569 libX11.so.6
   1595 libgio-2.0.so.0
   1693 libxcb.so.1
   1694 libXdmcp.so.6
   1702 libXau.so.6
   1803 libmount.so.1
   1871 libblkid.so.1
   1951 libbsd.so.0
   1952 libmd.so.0
   2035 libselinux.so.1
   2059 libpcre2-8.so.0
   2106 librt.so.1
   2217 libresolv.so.2
   2224 libstdc++.so.6
   2258 libgmodule-2.0.so.0
   2302 libgobject-2.0.so.0
   2506 liblzma.so.5
   2685 libglib-2.0.so.0
   2693 libpcre.so.3
   2953 libffi.so.8
   3851 libgcc_s.so.1
   3956 libz.so.1
   5764 libm.so.6
   7065 libdl.so.2
   7314 libpthread.so.0
  13320 /lib64/ld-linux-x86-64.so.2
  13435 libc.so.6
  13452 linux-vdso.so.1

So we now know how many times each DLL is used on the system. There are some artifacts in there. I’m not sure what happened to give us a DLL named 2 or 5.so.3; obviously something mangled some names, most likely the parallel ldd processes started by xargs -P all writing into the same file and interleaving their output mid-line. Taking 30 seconds to eyeball the file doesn’t turn up too many other implausible-looking things, so I don’t care. This is our quick-and-dirty pass after all.

Another anomaly worth keeping track of is statically linked executables. Our lame string-processing approach means that the statically linked output of ldd gets conveniently preserved and tallied up like any other DLL:

...
     79 statically
...

I didn’t expect many statically linked executables, but I did expect more than that.

Anyway, not terribly surprisingly, it appears that there’s a handful of DLL’s that are used by almost everything, a pretty steep decline, and then it tapers off into a long tail. I’d expect it to be something like an exponential dropoff, because most frequency counts are. We might dig a bit more into this data set later, but I tend to prefer to do a breadth-first search on these sorts of problems. Touch a lot of little things lightly, then go back and decide what to dig further into.
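For anyone reproducing this without the shell, the sort | uniq -c | sort -n idiom used above is just a frequency count. A minimal Python sketch of it, with made-up helper names count_names and ascending, could be:

```python
from collections import Counter


def count_names(lines):
    """Tally the first whitespace-separated field of each non-blank line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def ascending(counts):
    """Return (name, count) pairs, least used first, roughly like sort -n."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Possible usage, assuming processed-dlls0.txt from the steps above exists:
#     with open("processed-dlls0.txt") as f:
#         for name, n in ascending(count_names(f)):
#             print(n, name)
```

It has the same blind spots as the awk version, since it still only looks at the first field of each line.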

Count DLL’s per file

Ok, so we now know how many times each DLL is used. Let’s do the inverse as well, and find out how many DLL’s each executable uses. To do this we have to go allll the way back to files.txt. This is why, whenever you do data science, you either script the whole process at once, or you keep every intermediate data product and write down exactly how it’s produced like I’m doing right now.

We can’t feed a shell pipeline into xargs, so I can’t just do xargs ... 'ldd | wc' or something. Easy way around is to make a shell script that outputs the data we are interested in. So:

#!/bin/bash
# count-dlls.sh
COUNT=$(ldd $1 | wc)
echo "$1 $COUNT"

Then we just run

xargs -P 0 -a files.txt ./count-dlls.sh

and it bombs out early for some reason. A little digging through the output turns up the error message xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option, so unsurprisingly a single-quote in a file name somewhere is totally hosing our shell script. This is where a sane person would drop bash like the live grenade it is; a few minutes of trying to make the xargs -0 option cooperate results in it stubbornly saying xargs: argument line too long, which is just so helpful. (-0 expects NUL-separated names, so a newline-separated files.txt reads as one enormous argument; GNU xargs has -d '\n' for this case.) Eyeballing the data file however, we can find that most instances of single or double quotes are music or data files that are accidentally marked executable, so we can just get rid of them:

# Note, this is fish quoting; in bash use sed -e "s/.*'.*//" instead
sed -e 's/.*\'.*//' files.txt > files-unquoted.txt

Sorry ~/.steam/steamapps/common/Hatoful Boyfriend/Collector's Edition Extra Content/Yearbook/St PigeoNation Yearbook.pdf, we will not find out how many DLL’s you need linked into your program space when started.

Ok but my bash script is now breaking on files that have spaces in their names, which is quite a lot of them. OF COURSE. Most of those files are probably unimportant, but there’s enough of them in odd places that I don’t want to filter them out or spot check them or such. FINE, we’ll do it the STUPID and SLOW way, since I’m sick of Python in my life and I don’t feel like spending the afternoon learning Julia and parallel might or might not be able to fix the problem but this works dammit:

# Note, this is fish shell, not bash
for x in (cat files-unquoted.txt)
    ./count-dlls.sh $x >> file-counts.txt
end

Great, now the file looks like this:

/usr/share/pnm2ppa/update-magicfilter       0       0       0
/usr/share/pixmaps/com.visualstudio.code.png       0       0       0
/usr/share/flatpak/triggers/gtk-icon-cache.trigger       0       0       0
/usr/share/flatpak/triggers/desktop-database.trigger       0       0       0
/usr/share/flatpak/triggers/mime-database.trigger       0       0       0
/usr/share/man-db/chconfig       0       0       0
/usr/share/kio_info/kde-info2html       0       0       0
/usr/share/popularity-contest/popcon-upload       0       0       0
/usr/share/discord/postinst.sh       0       0       0
/usr/share/discord/Discord     104     412    8057
/usr/share/discord/chrome-sandbox       4      12     234
/usr/share/mono/MonoGetAssemblyName.exe       0       0       0
/usr/share/initramfs-tools/scripts/local-bottom/ntfs_3g       0       0       0
/usr/share/initramfs-tools/scripts/init-top/all_generic_ide       0       0       0
/usr/share/initramfs-tools/scripts/init-top/udev       0       0       0
/usr/share/initramfs-tools/scripts/init-top/blacklist       0       0       0

The first number is the number of lines ldd printed, which is roughly the number of DLL’s, and 0 means “not an ELF executable”. So we snip out everything ending with 0 0 0 and remove the empty lines that leaves behind:

sed -e 's/.*0 *0 *0//' file-counts.txt | tr -s '\n' > file-counts-nonzero.txt

Great, now we have this:

/home/icefox/games/starsector/jre_linux/lib/jexec       3       8     155
/home/icefox/games/SkyRogue/skyrogue.x86_64       9      32     591
/home/icefox/games/SkyRogue/skyrogue_Data/Plugins/x86/libsteam_api.so       5      16     248
/home/icefox/games/SkyRogue/skyrogue_Data/Plugins/x86/ScreenSelector.so      24      92    1496
/home/icefox/games/SkyRogue/skyrogue_Data/Plugins/x86_64/libsteam_api.so       5      16     303
/home/icefox/games/SkyRogue/skyrogue_Data/Plugins/x86_64/ScreenSelector.so      59     232    4541
/home/icefox/games/SkyRogue/skyrogue_Data/Mono/x86/libMonoPosixHelper.so       7      24     362
/home/icefox/games/SkyRogue/skyrogue_Data/Mono/x86/libmono.so       8      28     429
/home/icefox/games/SkyRogue/skyrogue_Data/Mono/x86_64/libMonoPosixHelper.so       7      24     437
/home/icefox/games/SkyRogue/skyrogue_Data/Mono/x86_64/libmono.so       7      24     439
...

Let’s use awk to just turn it into count filename so we can sort it sanely:

awk -e '{print $2 $1;}' file-counts-nonzero.txt

FAK that ALSO fucks up on filenames with spaces in them. This is why you shouldn’t use shell for anything fancy, folks. Okay, let’s change our count-dlls.sh script to do the filtering itself, and output the count first because we know that will just be a number with no heckin’ spaces or quotes or any other BS in it:

#!/bin/bash
# count-dlls.sh

COUNT=$(ldd "$1" | wc | awk -e '{print $1}')
echo "$COUNT $1"

GREAT now it works and we’ve also eliminated an extra step. Fucking hell. Let’s run our slow and terrible brute force loop again:

rm file-counts.txt
# Note, this is fish shell, not bash
for x in (cat files-unquoted.txt)
    ./count-dlls.sh $x >> file-counts.txt
end

Well that seems to be working, but pretty slowly. While it’s running I might as well try to make parallel handle the stupid thing to see if I can make it use all my cores. If I can figure out the correct parallel invocation before the for loop finishes, it wins.

parallel ./count-dlls.sh < files-unquoted.txt > file-counts2.txt

Hey, that was easier than expected. However, this is still the part where you go get a cup of tea, maybe a sandwich, and possibly do some push-ups. The first xargs -P 0 -a files.txt ldd > raw-dlls.txt run processed a similar amount of data in similar ways, but the pipeline and such in count-dlls.sh apparently adds enough overhead to make it go from a couple minutes to 15-20, even when using parallel on a 16-core machine. The bottleneck is never where you expect it.
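For comparison, here is roughly what the same per-file count could look like in a single Python process. Treat it as a sketch: count_lines and count_all are invented names, the worker count is arbitrary, and I have not timed it against parallel. Passing the file name as a list element means there is no shell and no quoting to trip over, and capturing each child’s output separately keeps the results from interleaving.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def count_lines(path, command=("ldd",)):
    """Run command on one path and return how many lines it wrote to stdout."""
    result = subprocess.run(
        [*command, path],            # argument list: spaces and quotes are safe
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,   # drops "not a dynamic executable"
        text=True,
        errors="replace",
    )
    return len(result.stdout.splitlines())


def count_all(paths, command=("ldd",), workers=8):
    """Return (count, path) pairs in input order, running jobs concurrently."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda p: count_lines(p, command), paths)
        return list(zip(counts, paths))
```

Threads are enough here because the real work happens in the child processes; the command parameter only exists so something other than ldd can be swapped in for testing.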

About 300,000 not a dynamic executable’s later, as well as some fleeting error messages from ldd about being unable to parse files correctly that may or may not be interesting to someone someday, we have some results. Unsurprisingly, of course, the parallel version finishes first and file-counts2.txt looks something like this:

0 /usr/share/pnm2ppa/update-magicfilter
0 /usr/share/pixmaps/com.visualstudio.code.png
0 /usr/share/flatpak/triggers/gtk-icon-cache.trigger
0 /usr/share/flatpak/triggers/desktop-database.trigger
0 /usr/share/flatpak/triggers/mime-database.trigger
0 /usr/share/man-db/chconfig
0 /usr/share/kio_info/kde-info2html
0 /usr/share/popularity-contest/popcon-upload
0 /usr/share/discord/postinst.sh
4 /usr/share/discord/chrome-sandbox
0 /usr/share/mono/MonoGetAssemblyName.exe
0 /usr/share/initramfs-tools/scripts/local-bottom/ntfs_3g
0 /usr/share/initramfs-tools/scripts/init-top/all_generic_ide
104 /usr/share/discord/Discord
0 /usr/share/initramfs-tools/scripts/init-top/udev
0 /usr/share/initramfs-tools/scripts/init-top/blacklist
0 /usr/share/initramfs-tools/scripts/init-top/keymap
0 /usr/share/initramfs-tools/scripts/local-block/lvm2
0 /usr/share/initramfs-tools/scripts/panic/plymouth
0 /usr/share/initramfs-tools/scripts/local-premount/resume
0 /usr/share/initramfs-tools/scripts/local-premount/ntfs_3g
...

Some spot-checking looks correct, so we can just sort it and have our DLL usage counts:

sort -n file-counts2.txt > dlls-per-file.txt

The bottom of it looks like this:

...
208 /var/lib/flatpak/repo/objects/d0/5974b48975868fa024999a070dccfa886a9d8a68d9d2e1c2af9716fc01e38d.file
208 /var/lib/flatpak/runtime/org.freedesktop.Platform.html5-codecs/x86_64/18.08/b7006caaf6a7705c4e899520794e1f58cbe5d62c5a70423195dde2f740743a8c/files/bin/ffmpeg
208 /var/lib/flatpak/runtime/org.freedesktop.Platform/x86_64/20.08/a4f37bb933cf4792d472db2055886427d55074cdaf5bd8e4397dc3e2bc0305e1/files/bin/ffmpeg
208 /var/lib/flatpak/runtime/org.freedesktop.Platform/x86_64/20.08/a4f37bb933cf4792d472db2055886427d55074cdaf5bd8e4397dc3e2bc0305e1/files/bin/ffplay
208 /var/lib/flatpak/runtime/org.freedesktop.Platform/x86_64/20.08/a4f37bb933cf4792d472db2055886427d55074cdaf5bd8e4397dc3e2bc0305e1/files/bin/ffprobe
208 /var/lib/flatpak/runtime/org.kde.Platform/x86_64/5.12/7d07b24fa47ad7f8d22df1e522f56ed659bf4330fa62e652945c0ec0593dddfd/files/bin/ffmpeg
219 /usr/bin/akonadi_tomboynotes_resource
232 /usr/bin/sieveeditor
240 /usr/bin/contactprintthemeeditor
241 /usr/bin/akonadi_ews_resource
255 /usr/bin/headerthemeeditor
287 /usr/bin/mboximporter
288 /usr/bin/pimdataexporterconsole
291 /usr/bin/akonadi_sendlater_agent
292 /usr/bin/akonadi_unifiedmailbox_agent
292 /usr/bin/pimdataexporter
294 /usr/bin/akonadi_archivemail_agent
294 /usr/bin/akonadi_mailfilter_agent
298 /usr/bin/kmail

So, the most DLL-hungry program on the system uses 298 DLL’s, and there appears to be another exponential-dropoff-ish frequency distribution to it. Great. Now, hmmm…

awk -e '{print $1;}' dlls-per-file.txt | sort -n | uniq -c > dlls-per-file-frequency.txt

This produces a file like this:

    131 1
    161 2
   5372 3
   5651 4
   1470 5
   4869 6
   1994 7
    720 8
    810 9
    949 10
    370 11
    262 12
    472 13
    251 14
    118 15
    133 16
    138 17
    479 18
    403 19
    227 20
    664 21
    336 22
    161 23
    195 24
    180 25
    255 26
    160 27
...

The first column is how many exe files use that number of DLL’s, and the second column is how many DLL’s it uses. So for example there are 160 executables that use exactly 27 DLL’s. Eeeeexcellent. Time for some graphs!
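The same count-of-counts can be sketched in Python. The function name dll_count_histogram is hypothetical, and skipping files with a count of 0 is my assumption about what you would want to plot, not something the pipeline above does.

```python
from collections import Counter


def dll_count_histogram(lines):
    """Map "number of DLL's" to "how many files use exactly that many"."""
    histogram = Counter()
    for line in lines:
        parts = line.split(None, 1)   # count first, then the untouched path
        if not parts or not parts[0].isdigit():
            continue                  # not a "count path" line
        count = int(parts[0])
        if count > 0:                 # 0 means ldd printed nothing for the file
            histogram[count] += 1
    return histogram
```

Feeding it the lines of dlls-per-file.txt and printing sorted(histogram.items()) gives the same two columns in the opposite order.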

Graphs!

First off, the number of DLL’s each exe uses. The X axis is just the exe, this is the data from the dlls-per-file.txt we just produced. As expected it looks vaguely exponential, though there’s a couple humps and bumps in there.

Fig 1: DLL’s used per file

Now let’s look at dlls-per-file-frequency.txt: how many files are using 1 DLL, how many are using 2, how many are using 3, etc. The Y axis got really crunched so I made it logarithmic. Because it was late at night, for some reason I used the natural logarithm. So as you can see there are about e^8.5 executables that use ~5ish DLL’s (about 5000), then a linear-ish-if-you-squint descent to ~3ish executables using ~150 DLL’s, and then a bit of a bumpy long tail after that.

Fig 2: DLL frequency
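If reading a natural-log axis is awkward, converting back and forth is a one-liner each way. The rows below are copied from dlls-per-file-frequency.txt above; the snippet is only an illustration of the conversion, not how the figure was made.

```python
import math

# A few rows copied from dlls-per-file-frequency.txt: (files, DLL's per file).
rows = [(5372, 3), (5651, 4), (1470, 5), (4869, 6), (160, 27)]

for files, dlls in rows:
    # The natural log of the file count is what the Y axis of Fig 2 shows.
    print(f"{dlls:3d} DLL's: {files:5d} files, ln = {math.log(files):.2f}")

# Going the other way: a Y value of 8.5 is a bit under 5000 files.
print(round(math.exp(8.5)))
```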

We’re sort of ascending through this data in reverse order, so the last thing to look at is how many times each individual DLL is used. This is sort of the flip side of Fig 1, and again there was a huge and weird spread of values so I made the Y axis a natural log. Very surprisingly though, it’s still upward-curving… it’s a super-exponential distribution. Don’t see those very often! So there’s a bunch of DLL’s that are used 1 time, of course, but the more popular DLL’s get more popular extremely quickly.

Fig 3: Number of times each DLL is used

It would have been nice to do a frequency graph of Fig 3, the same as I did with Fig 1 and Fig 2, but I forgot so you’ll just have to imagine it.

Analysis

So, we now know how many times each DLL is used on this system, and how many DLL’s each executable uses. Can we do anything actually useful with this data?

Disk space

We can measure how much hard disk space the DLL’s save vs. static linking. This will be an upper bound, since static linking doesn’t necessarily include unused code from a library into an executable, while the DLL doesn’t know what code will and will not be used.

This should be pretty easy, we go to our dll-counts.txt and just multiply our counts by the size of each DLL… except we didn’t store the full path for each DLL. Okayyyy, we need to go back to our processed-dlls0.txt and pull out the full paths instead of just the file names:

awk -e '{print $3;}' processed-dlls0.txt > processed-dlls1-fullpaths.txt

This is not actually accurate because our basic awk selection doesn’t actually parse the output completely correctly, but will hopefully be somewhere in the right range. So we have to rerun our count again and generate a file of counts for full paths:

sort processed-dlls1-fullpaths.txt | uniq -c | sort -n > dll-counts-fullpath.txt

This gets us a slightly screwy file that looks like this:

...
   1263 /lib/x86_64-linux-gnu/liblz4.so.1
   1364 /lib/x86_64-linux-gnu/libpng16.so.16
   1393 /lib/x86_64-linux-gnu/libgcrypt.so.20
   1408 /lib/x86_64-linux-gnu/libX11.so.6
   1431 /lib/x86_64-linux-gnu/libgio-2.0.so.0
   1432 /lib/x86_64-linux-gnu/libgpg-error.so.0
   1478 /lib/x86_64-linux-gnu/libzstd.so.1
   1534 /lib/x86_64-linux-gnu/libxcb.so.1
   1536 /lib/x86_64-linux-gnu/libXdmcp.so.6
   1543 /lib/x86_64-linux-gnu/libXau.so.6
   1640 /lib/x86_64-linux-gnu/libmount.so.1
   1709 /lib/x86_64-linux-gnu/libblkid.so.1
   1787 /lib/x86_64-linux-gnu/libmd.so.0
   1791 /lib/x86_64-linux-gnu/libbsd.so.0
   1870 /lib/x86_64-linux-gnu/libselinux.so.1
   1897 /lib/x86_64-linux-gnu/libpcre2-8.so.0
   1940 /lib/x86_64-linux-gnu/librt.so.1
   2039 /lib/x86_64-linux-gnu/libresolv.so.2
   2079 /lib/x86_64-linux-gnu/libgmodule-2.0.so.0
   2121 not
   2130 /lib/x86_64-linux-gnu/libgobject-2.0.so.0
   2156 /lib/x86_64-linux-gnu/libstdc++.so.6
   2440 /lib/x86_64-linux-gnu/liblzma.so.5
   2500 /lib/x86_64-linux-gnu/libglib-2.0.so.0
   2509 /lib/x86_64-linux-gnu/libpcre.so.3
   2775 /lib/x86_64-linux-gnu/libffi.so.8
   3743 /lib/x86_64-linux-gnu/libz.so.1
   3778 /lib/x86_64-linux-gnu/libgcc_s.so.1
   5485 /lib/x86_64-linux-gnu/libm.so.6
   6812 /lib/x86_64-linux-gnu/libdl.so.2
   7037 /lib/x86_64-linux-gnu/libpthread.so.0
  13134 /lib/x86_64-linux-gnu/libc.so.6
  27127 

As you can see, our awk call left some artifacts, and it doesn’t quiiiite match our previous dll-counts.txt for Various Reasons. You think that there’s only one libpthread.so.0 on your system? Pshaw, I have eight. If we were doing shit Right we would have this in a script already and edit it to take out those artifacts; certainly if I ever wanted to reproduce this data set that would be the way to go. Instead I am just gonna edit the file and remove things that aren’t absolute paths.

(Random sidenote: notice that linux-vdso.so.1 isn’t in this list. That’s because it’s not a real file, but rather a little chunk of code that the Linux kernel puts into every process to make life a little easier for libc or whatever to make certain system calls. See man vdso for more info.)

Ok, so NOW we can easily find the size of the files. Like fucking hell I’m gonna try that in bash, and it’s a bright shiny new day, so I finally am going to resort to Python:

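#!/usr/bin/env python3
# count-bytes.py: multiply each DLL's use count by its size on disk
import os

total = 0
with open('dll-counts-fullpath.txt', 'r') as f:
    for line in f:
        dat = line.split(None, 1)
        if len(dat) < 2:
            continue  # no path on this line (leftover awk artifact)
        count, path = int(dat[0]), dat[1].strip()
        try:
            bytes_saved = count * os.stat(path).st_size
        except FileNotFoundError:
            continue  # not a real file, e.g. the "not" from "not found"
        total += bytes_saved
        print(bytes_saved, path)
print(total)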

Mannnnn, what a cruel and terrible language, forcing us to care about crazy things like data types and files not existing. How dare it not cover up our mistakes for us.

Run that sucker and sort the output in descending order:

./count-bytes.py | sort -nr
155947508820
24102466080 /lib/x86_64-linux-gnu/libc.so.6
22413394416 /lib/x86_64-linux-gnu/libicudata.so.67
7225544080 /lib/x86_64-linux-gnu/libm.so.6
5606592736 /lib/x86_64-linux-gnu/libLLVM-12.so.1
5328199800 /lib/x86_64-linux-gnu/libz3.so.4
4985237096 /lib/x86_64-linux-gnu/libLLVM-11.so.1
4589986016 /lib/x86_64-linux-gnu/libstdc++.so.6
3809304720 /lib/x86_64-linux-gnu/libLLVM-9.so.1
3326152280 /lib/x86_64-linux-gnu/libgtk-3.so.0
3132200000 /lib/x86_64-linux-gnu/libglib-2.0.so.0
...

The first line with no file name is our sum total of bytes per DLL * number of times DLL is used, and it is 155947508820 bytes, or about 145 gigabytes. This seems high to me, but a) this is an upper bound, and b) the numbers don’t lie, right? Right? RIGHT??? And, about 78 GB of that, more than half the total, is saved by the top 10 in this list. That surprises me, though looking back at Fig 3 it makes sense.
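The arithmetic in that paragraph is easy to re-check. The numbers below are copied straight from the output above; note that the "gigabytes" here only work out if you divide by 2**30, so they are really GiB.

```python
GIB = 2**30

total_bytes = 155947508820
top10 = [
    24102466080,  # libc.so.6
    22413394416,  # libicudata.so.67
    7225544080,   # libm.so.6
    5606592736,   # libLLVM-12.so.1
    5328199800,   # libz3.so.4
    4985237096,   # libLLVM-11.so.1
    4589986016,   # libstdc++.so.6
    3809304720,   # libLLVM-9.so.1
    3326152280,   # libgtk-3.so.0
    3132200000,   # libglib-2.0.so.0
]
top10_bytes = sum(top10)

print(f"total:  {total_bytes / GIB:.1f} GiB")
print(f"top 10: {top10_bytes / GIB:.1f} GiB")
print(f"share:  {top10_bytes / total_bytes:.0%}")
```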

For reference, the total non-/home data on this computer’s root filesystem is about 33 GB. And it’s a terabyte hard drive that isn’t even half full, and I have fast internet to do software updates, so all this is mostly irrelevant to me in practice.

Potential bug: I wasn’t sure whether Python’s os.stat() follows symlinks, or just gives us the size of the symlink. (It does follow them; os.lstat() is the one that doesn’t.) If it didn’t, I’d expect the resulting sizes to be a lot smaller. Either way it’s easy enough to spot check:

ls -al /lib/x86_64-linux-gnu/libc.so.6
 lrwxrwxrwx 1 root root 12 Dec 12 06:04 /lib/x86_64-linux-gnu/libc.so.6 -> libc-2.33.so
ls -al /lib/x86_64-linux-gnu/libc-2.33.so
 -rwxr-xr-x 1 root root 1835120 Dec 12 06:04 /lib/x86_64-linux-gnu/libc-2.33.so

So libc-2.33.so is 1,835,120 and is used 13,134 times, multiply those together and we get 24,102,466,080, the same number of bytes our program reports for /lib/x86_64-linux-gnu/libc.so.6. So whew, we are fine.
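The same spot check can be done without touching the real libc. This sketch builds a throwaway file and a symlink to it (both names are invented) and compares os.stat(), which follows the link, against os.lstat(), which does not.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "libdemo-1.0.so")   # hypothetical file names
    link = os.path.join(d, "libdemo.so.1")
    with open(target, "wb") as f:
        f.write(b"x" * 4096)
    os.symlink("libdemo-1.0.so", link)

    followed = os.stat(link).st_size     # size of the file the link points to
    link_only = os.lstat(link).st_size   # size of the symlink itself
    print(followed, link_only)
```

The first number is the size of the target file and the second is much smaller, which matches the ls output above, where the link is 12 bytes and the real libc is 1,835,120.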

Memory

We can measure how much RAM the DLL’s save vs. static linking. This will also be an upper bound, since OS’s don’t necessarily page the entire DLL into memory at once; AFAIK they generally just page in sections of it lazily as they are actually used, and copy-on-write any data that is mutated. This needs a corpus of programs that are actually usually running though, which is more data to collect. On the flip side this is pretty realistic; I don’t know about anyone else, but I have my computers set up to start a fixed set of programs every time they boot, and I generally use those programs every day. Then again, I have 16 GB of RAM in my desktop and almost never use more than half of it.

So let’s get a count of how many instances of each process I’m running on my machine:

ps aux | awk -e '{ print $11; }' | grep -v '^\[' | sort -n | uniq -c > program-counts.txt

However, it is now clear that we need three different data tables for this analysis: the programs running, the DLL’s used per program, and the bytes used per DLL. Unfortunately this is really getting to the point where the best tool for the job is a relational database, and I promised myself I wouldn’t go that deep. So I’m going to leave this as an exercise to the reader.
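For anyone taking up the exercise, the three-table join is small enough to sketch with Python’s built-in sqlite3 module. Everything here is an assumption for illustration: the table layout, the function name shared_bytes_upper_bound, and above all the formula, which charges one full copy of each DLL per running process minus the one copy that has to exist anyway. That overestimates the savings, since instances of the same statically linked binary would share pages too, and nothing is paged in all at once.

```python
import sqlite3


def shared_bytes_upper_bound(running, uses, sizes):
    """Crude upper bound on RAM saved by sharing DLL's between processes.

    running: {program: number of running instances}
    uses:    iterable of (program, dll) pairs
    sizes:   {dll: size in bytes}
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE running (program TEXT PRIMARY KEY, instances INTEGER);
        CREATE TABLE uses    (program TEXT, dll TEXT);
        CREATE TABLE sizes   (dll TEXT PRIMARY KEY, bytes INTEGER);
    """)
    db.executemany("INSERT INTO running VALUES (?, ?)", running.items())
    db.executemany("INSERT INTO uses VALUES (?, ?)", uses)
    db.executemany("INSERT INTO sizes VALUES (?, ?)", sizes.items())
    # Per DLL: processes using it, minus the one copy that must exist anyway.
    row = db.execute("""
        SELECT COALESCE(SUM((procs - 1) * bytes), 0) FROM (
            SELECT u.dll AS dll, SUM(r.instances) AS procs, s.bytes AS bytes
            FROM running r
            JOIN uses u  ON u.program = r.program
            JOIN sizes s ON s.dll = u.dll
            GROUP BY u.dll, s.bytes
        )
    """).fetchone()
    db.close()
    return row[0]
```

The inputs would come from program-counts.txt, the per-program ldd output, and the file sizes gathered for the disk space estimate.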

Conclusions

Draw your own.

No? Here’s some to start with:

  • Don’t use shell scripting for data analysis.
  • The ~200 most common DLL’s are used by lots more things than the other 1400ish.
  • If you use Linux, glibc, and a big fat desktop machine with lots of games and dev tools and stuff installed, then in the worst case scenario static linking everything can make your programs take up to 5x more hard drive space. But if you’re rich and in the first world you probably won’t notice it, so who cares, right?
  • kmail uses 298 DLL’s, which is mayyyyyybe more than any single program should use. I mean dang, that’s like twice firefox+chrome put together. Looking at the list though, I am not even sure I can blame it. Most of those DLL’s seem to do something useful.

Future work

Hey, we actually got a result or two that were surprising! However, I think this sort of data set has a lot of potential for going deeper. Someone should do that, and make a proper database that they can pull queries out of and such. But it probably won’t be me, at least not any time soon. So here are some other things I think would be interesting for that hypothetical person to explore:

  • What if we distinguish “command line programs” and “graphical programs” and such? What patterns do we see?
  • What if we do the opposite, and use a clustering approach to group programs together that tend to use similar DLL’s? I strongly expect you’ll easily find “GTK programs” and “Qt programs” and such, but what other classes might there be?
  • Look at other systems! Does, say, FreeBSD or Fedora or Arch have significantly different kinds of stats?
  • This is harder to measure, but where do we draw the lines between “this DLL is worth it” and “this DLL is not worth it”? If we take all the DLL’s on the system that only have 1 program using them, and made them static libraries instead, would it really make our lives any different? What about 10 programs? 100?
  • Lots of DLL’s use other DLL’s. I haven’t considered this at all. What do these transitive dependency trees look like? Are they common or rare? How deep are they?

What interesting stuff can’t we investigate with this kind of approach?

  • We can’t measure runtime performance differences. DLL’s generally need an extra function pointer chased for every function call, and sometimes this can add up. To do this you need an apples-to-apples comparison of a system built with dynamic linking vs. one without, where both use the same code base. Comparing Debian to Alpine or BSD isn’t too useful since they use different libc’s, very different programs much of the time, etc.
  • We can’t measure load time performance differences. Program startup speed is largely a concern of the bad old days right now, since SSD’s are common and CPU’s and memory are fast. However, a program that heavily uses DLL’s may load more slowly since all those DLL’s have to be found and loaded, or it may load more quickly if they’re already loaded.
  • We can’t measure programmer convenience. Linking a program to a DLL is faster than to a static lib. In some programs (like large Rust things that use lots of crates) static linking is a significant amount of the total compilation time.
  • We can’t measure API convenience. DLL’s are quite often used for plugins or other “change this program without recompiling it” functionality that you would otherwise need a scripting language for. In this case sharing code is really not the goal, the goal is to offer a mechanism for modifying a program without recompiling it.
  • We can’t draw conclusions about anything other than Linux, or similar OS’s if you run this on FreeBSD or Alpine or whatever. The Windows ecosystem is drastically different and uses libraries in very different ways than Linux does. Android/iOS is drastically different again.

So huzzah, you now have some real data for your next Internet Argument, and you know how to (badly) collect more if you didn’t know that already. Go get to work.