ThoughtsOnPackageSystems

Introductory ramblings

I had some thoughts on crates.io that might be interesting to other people.

First, the decision to not include namespaces in Rust’s crates.io repository was 1000% correct. I’m learning Elm now, which has a package repo namespaced by username, and I have to sort through eight different packages named “elm-css” to find the one that doesn’t suck.

Second off, more-or-less-unmoderated package repos such as crates.io, npm and PyPI are here to stay: they are far, far too useful to get rid of. Raising the barrier to entry for publishing a crate would make it way harder for people to get started, and once someone has published a crate and someone else stumbles across it and starts using it, it is more likely to continue improving. This is important.

Such repositories have two problems though: the first is “how do I find packages that don’t suck” and the second (increasingly) is “how do I trust that the packages (and their dependencies) aren’t malicious”. These problems are not unique to crates.io, and anyone who cares about them should look at the whole ecosystem of similar package repositories before coming to any conclusions. Some traditional options are:

  • Moderation: Have people who vet packages and only allow “good” ones. Pros: Solves both problems, when done well. Can work very well; see Linux package repositories. Cons: Takes a lot of work; how many thousands of volunteers does Debian have? High barrier to entry. Curated packages often lag (far) behind state-of-the-art.
  • Curation: Basically, have lists of “good” packages, created by humans, for humans to search through. Pros: Like moderation, when done well it solves both problems. There have been some efforts at this in Rust, like stdx, awesome-rust and so on, but… Cons: Hard to keep up to date. Hard to keep momentum and interest for; they rot very quickly. Hard to search and organize!!! Often not vetted for security! Basically, if we (the Rust community) can’t even manage this, organizationally, then moderation is out of the question.

These two demonstrate a couple elements of the problem: First, to get the highest quality decisions, a human has to be in the loop to make the decision. But in reality, a machine-generated decision can still be useful as long as you know how trustworthy it is (or isn’t). And second, these do about the same thing whether or not they are part of the package system. They also both try to address both “good packages” and “safe packages”, but really those problems are separate. The first problem is a problem of searching, which is in theory easy but in practice very tricky to do well. The second problem is a problem of validation, which is in theory impossible but in practice usually not a huge deal for most systems!

Let’s inspect these problems separately.

Finding a package

When I look for a new crate to solve a problem, there are many things upon which I base my judgment: version number, release history, does it look actively developed, is there documentation, and so on. Crates.io only provides a small fraction of this information, and does not let you filter or search on most of it. There’s no way to say “find me crates that have had more than two releases”, “find me crates that are at version >= 0.2.0”, “find me crates that are at version >= 1.0”, and so on. All this information is there though! The problem is that crates.io is critical infrastructure, and so a change to crates.io affects the entire Rust ecosystem. This makes the developers of it rightfully conservative. If crates.io breaks, it’s a disaster. If someone makes a design decision that turns out to be a mistake, then it’s very difficult to undo without breaking stuff that relied on it. If they add new potential metadata in a crate specification, then every crate has to change if it’s going to take advantage of it.
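To make the kind of query I mean concrete, here is a minimal sketch of “more than N releases and at least version X” as a filter over an in-memory index. Everything in it is an assumption for illustration: the `CrateInfo` fields are not the crates.io schema, the crate names are made up, and a real service would read its data from wherever crates.io exposes it.

```rust
// Hypothetical record for one crate; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct CrateInfo {
    name: String,
    // (major, minor, patch) of the newest release.
    version: (u64, u64, u64),
    release_count: u32,
}

// Keep crates with MORE than `min_releases` releases whose newest version
// is at least `min_version`. Tuples compare lexicographically, which is the
// ordering we want for major/minor/patch.
fn search(
    index: &[CrateInfo],
    min_releases: u32,
    min_version: (u64, u64, u64),
) -> Vec<&CrateInfo> {
    index
        .iter()
        .filter(|c| c.release_count > min_releases && c.version >= min_version)
        .collect()
}

// Made-up sample data standing in for a real index.
fn sample_index() -> Vec<CrateInfo> {
    vec![
        CrateInfo { name: "alpha".to_string(), version: (0, 1, 3), release_count: 4 },
        CrateInfo { name: "beta".to_string(), version: (0, 2, 0), release_count: 3 },
        CrateInfo { name: "gamma".to_string(), version: (1, 0, 0), release_count: 1 },
    ]
}

fn main() {
    let index = sample_index();
    // "Find me crates that have had more than two releases and are at >= 0.2.0."
    for c in search(&index, 2, (0, 2, 0)) {
        println!("{} {}.{}.{}", c.name, c.version.0, c.version.1, c.version.2);
    }
}
```

The filter itself is trivial, which is sort of the point: the hard part is getting the data somewhere you are free to query it.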

crates.io is a bad place to experiment with stuff!

But, there is zero reason that you can’t have a crate index that lets you search on more interesting metadata outside of crates.io. Human feedback is also a very useful piece of information; on GitHub the number of stars, watchers and forks for a repo is human information that’s useful for making qualitative judgments, especially if you correct for how old and how active the repo is. Adding user feedback to crates.io is not easy ’cause a) it’s not high priority, b) suddenly you have to handle user accounts even more than you already do, which is not easy, and c) it must be moderated, can be manipulated to bias results, etc. Which is very bad for the authoritative crate source.

Proposed solution: Create an external search service that has more powerful search. Suck in all the data crates.io can give you. Have user feedback, reviews, more powerful categories, more informative graphs, crate recommendations. Make it test whether the project has examples, documentation, unit tests. Don’t worry about being biased or breaking things because nothing critical rests on it. Make the most useful search possible. And, by producing a badge or such that can be included in a crate’s readme, you can let people browsing crates.io also use some of that data.

(There’s no reason this can’t be multi-language, for that matter. It’s just a Small Matter Of Programming!)

Trusting a package

So the next problem is validation. How do I know that this project doesn’t contain malicious code? Again, you can provide an external service to do this validation. Again, there’s no reason this must be part of the core package repository, as long as you can make the information in it available for packages to advertise via a badge or such. Even better, a cargo plugin could be made that goes through all your dependencies and retrieves how well-trusted they are, or even does this automatically on building and gives you a warning if, for example, you use a package that is untrusted or known to be malicious.

In the end it’s a matter of trust, and the thing is, computers and computer security researchers tend to view trust as binary… but humans almost never do. Additionally, real trust requires a thorough security audit, which is a ton of very highly-skilled (aka expensive) work, and even then that will seldom catch everything. But catching the low-hanging fruit is not going to be hard, IS going to be very useful, and can be automated. Does this package have a build.rs? Yellow flag. Does this package overlap a common library or stdlib name, or is it a typo of such? Yellow flag. Does it use network code, unsafe, external (C) libraries, or run other programs? Yellow flags, all. Does this package depend on another package that’s been marked malicious with a high degree of confidence? Oops, red flag, and alert the package owner.
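The low-hanging-fruit checks above could be sketched roughly like this. It assumes some scanner has already extracted the facts about a package; the `PackageFacts` struct, the `Flag` enum and the `assess` function are all hypothetical names of mine, and actually detecting things like “uses network code” is the real work that this sketch skips. The name-typo check is left out here as well.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Flag {
    Green,
    Yellow,
    Red,
}

// Hypothetical facts a scanner might have extracted about one package.
#[derive(Debug, Default)]
struct PackageFacts {
    has_build_script: bool,
    uses_unsafe: bool,
    uses_network: bool,
    links_c_library: bool,
    spawns_processes: bool,
    depends_on_known_malicious: bool,
}

// Return the overall flag plus the human-readable reasons behind it, so a
// reviewer can see why the machine got nervous.
fn assess(facts: &PackageFacts) -> (Flag, Vec<&'static str>) {
    let checks = [
        (facts.has_build_script, "has a build.rs"),
        (facts.uses_unsafe, "uses unsafe"),
        (facts.uses_network, "uses network code"),
        (facts.links_c_library, "links an external (C) library"),
        (facts.spawns_processes, "runs other programs"),
    ];

    let mut reasons = Vec::new();
    for &(hit, reason) in checks.iter() {
        if hit {
            reasons.push(reason);
        }
    }

    let flag = if facts.depends_on_known_malicious {
        // A malicious dependency outranks every yellow flag.
        reasons.push("depends on a package marked malicious");
        Flag::Red
    } else if reasons.is_empty() {
        Flag::Green
    } else {
        Flag::Yellow
    };
    (flag, reasons)
}

fn main() {
    let facts = PackageFacts {
        has_build_script: true,
        uses_unsafe: true,
        ..Default::default()
    };
    let (flag, reasons) = assess(&facts);
    println!("{:?}: {}", flag, reasons.join(", "));
}
```

Returning the reasons along with the flag matters more than the flag itself, since the next step is a human looking at them.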

Then, people who are trusted but are not security experts (so, normal schmoe volunteers like me) can go through and make judgment calls on these packages that have been marked as suspicious. “Oh it uses a build.rs but it’s obviously not malicious.” Many of these judgment calls can be made by many different people, and aggregated, and then this information is presented to the user for them to trust or not. You can have levels of trust: “the machine says this is probably okay”, “some random dude says this is probably okay”, “a trusted but non-security-guru reviewer says this is probably okay”, “a security guru says this is probably okay”. Aggregate this information and present it conveniently.
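One possible shape for that aggregation, as a sketch only: rank the kinds of reviewer, collect their verdicts, and report counts plus the most trusted voice on each side rather than collapsing everything into a single yes/no. The type names and the choice of what to summarize are my assumptions, not a worked-out design.

```rust
// Reviewer kinds in increasing order of trust. The derived `Ord` follows
// declaration order, so SecurityExpert > TrustedVolunteer > RandomUser > Machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Reviewer {
    Machine,
    RandomUser,
    TrustedVolunteer,
    SecurityExpert,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    ProbablyOkay,
    Suspicious,
}

struct Review {
    reviewer: Reviewer,
    verdict: Verdict,
}

#[derive(Debug, PartialEq, Eq)]
struct Summary {
    okay_votes: usize,
    suspicious_votes: usize,
    // Most trusted reviewer who said "probably okay", if anyone did.
    best_endorsement: Option<Reviewer>,
    // Most trusted reviewer who raised a concern, if anyone did.
    strongest_concern: Option<Reviewer>,
}

// Collect every reviewer who gave the verdict `wanted`.
fn reviewers_with(reviews: &[Review], wanted: Verdict) -> Vec<Reviewer> {
    reviews
        .iter()
        .filter(|r| r.verdict == wanted)
        .map(|r| r.reviewer)
        .collect()
}

fn summarize(reviews: &[Review]) -> Summary {
    let okay = reviewers_with(reviews, Verdict::ProbablyOkay);
    let suspicious = reviewers_with(reviews, Verdict::Suspicious);
    Summary {
        okay_votes: okay.len(),
        suspicious_votes: suspicious.len(),
        best_endorsement: okay.iter().copied().max(),
        strongest_concern: suspicious.iter().copied().max(),
    }
}

fn main() {
    let reviews = vec![
        Review { reviewer: Reviewer::Machine, verdict: Verdict::Suspicious },
        Review { reviewer: Reviewer::RandomUser, verdict: Verdict::ProbablyOkay },
        Review { reviewer: Reviewer::TrustedVolunteer, verdict: Verdict::ProbablyOkay },
    ];
    let s = summarize(&reviews);
    println!("okay: {}, suspicious: {}", s.okay_votes, s.suspicious_votes);
    println!("best endorsement: {:?}", s.best_endorsement);
    println!("strongest concern: {:?}", s.strongest_concern);
}
```

Note that the summary deliberately doesn’t decide anything. It keeps “the machine was nervous but a trusted volunteer says it’s fine” visible, and leaves the trust decision to the user.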

Again, make a cargo plugin that checks this and complains if you are using something untrustworthy. This removes the burden of knowledge from the user because they only need to care about the extraordinary cases, while also making sure that they ARE alerted to the extraordinary cases. I am not a security researcher. I, most of the time, don’t really care if a package is peer-reviewed to death. The cid crate is small and obscure and has probably never had a security audit, but still is probably not going to burn my computer down. But I sure as hell don’t want to be trivially pwned by typing serdr instead of serde in my dependencies.
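Here is a rough sketch of the core check such a plugin might run, under heavy assumptions: the dependency names are passed in as a plain list instead of being read from a manifest, the trust database is a local `HashMap` instead of a remote service, and “possible typo” just means “one edit away from a name on a list of popular crates”. The `Trust` enum, `check_dependencies` and the `evil-crate` name are all made up for illustration; this is not how any existing cargo subcommand works.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Trust {
    Trusted,
    Unknown,
    Malicious,
}

// Levenshtein edit distance between two strings, counted in chars,
// computed one row at a time.
fn edit_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let substitute = prev[j] + if ca == *cb { 0 } else { 1 };
            let insert = cur[j] + 1;
            let delete = prev[j + 1] + 1;
            cur.push(substitute.min(insert).min(delete));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

// Warn only about the extraordinary cases: known-malicious packages, and
// unknown packages whose name is one edit away from a popular one.
// Unknown-but-boring packages produce no warning at all.
fn check_dependencies(
    deps: &[&str],
    trust_db: &HashMap<&str, Trust>,
    popular: &[&str],
) -> Vec<String> {
    let mut warnings = Vec::new();
    for &dep in deps {
        match trust_db.get(dep).copied().unwrap_or(Trust::Unknown) {
            Trust::Malicious => warnings.push(format!("{}: marked malicious", dep)),
            Trust::Unknown => {
                if let Some(target) = popular.iter().find(|p| edit_distance(dep, p) == 1) {
                    warnings.push(format!("{}: possible typo of {}", dep, target));
                }
            }
            Trust::Trusted => {}
        }
    }
    warnings
}

fn main() {
    let mut trust_db = HashMap::new();
    trust_db.insert("serde", Trust::Trusted);
    trust_db.insert("evil-crate", Trust::Malicious); // hypothetical name

    let deps = ["serde", "serdr", "cid", "evil-crate"];
    for warning in check_dependencies(&deps, &trust_db, &["serde"]) {
        println!("warning: {}", warning);
    }
}
```

In this sketch cid sails through silently because nobody has said anything bad about it and it doesn’t look like a typo, while serdr gets called out.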

Again, this can be multi-language. These problems and solutions are not specific to Rust!

Security is not an absolute state. Anyone who wants to devote enough time and energy to compromising a system will do it. But you can make a huge impact just by raising the bar to entry and making it easy (automatic!) to filter out script-kiddies. This is a lesson I’ve learned from Rust’s memory management: Just because a problem is impossible to solve perfectly does not mean an imperfect solution is useless. You can go very, very far with good engineering tradeoffs.

Conclusions

  • Making crates.io do everything is Hard, and also undesirable
  • crates.io doesn’t have to do everything, just collect and provide fundamental information to more flexible and experimental services
  • Search and validation are separate problems but have similar properties
  • Human knowledge is critical but difficult to get
  • Machine knowledge is lossy but very easy to get (once it’s implemented)
  • Putting these together incrementally can be very powerful
  • It is useful to present this information, along with how reliable this combination of machine+human knowledge is (or is not)
  • Aggregating this information and making it easy to integrate into crates.io again (via badges or other means) closes the loop and means even people who don’t use these other services directly will benefit