NPM is famous for generating absurdly huge `node_modules` directories. A bit of this is due to JS developers' great love of tiny packages, each of which carries some overhead. A bigger chunk is due to NPM's rather brute-force approach to dependency management: absent any instruction to use `npm dedup`, it stores one copy of a package each time the package is depended upon. Combined with node.js's default search path, which tries `node_modules`, `../node_modules`, `../../node_modules`, etc., this works well enough, for very inefficient values of “well”1.
Some of this is mitigated by `npm dedup`, but that can probably only work intelligently within a single project, so if you have lots of projects lying around (and worse, not all in a consistent place), you still have lots of redundancy. So there are a few requirements for a serious improvement:
- Each package should be stored as few times as possible, preferably once for everything that uses it.
- Side-by-side installations of multiple versions of the same package need to be supported—this is why NPM generates such bigness to begin with!
- It should be possible to write an accurate garbage collector.
- Operation of the new system should be transparent. Ideally,
- node could keep its import resolution algorithm2 intact, and
- users wouldn't normally have to worry where their packages are going.
As far as I can tell, a major impediment to this is that neither Windows nor Unix permits directory hardlinks. WebDAV does, but nobody uses WebDAV as the basis for an OS filesystem API3. In turn, there are two obvious limitations to adding directory hardlinks to POSIX (which technically allows them, but doesn't really say how they should work):
- It's unclear what `..` should mean once more than one parent directory entry exists.
- Only empty directories can be deleted.
The second item is easy to deal with: allow `unlink()` to work on directories with two or more links (open fds don't count for this). The first is only easy to deal with if we don't care about `..` producing useless results (which it already arguably does once symlinks get involved) or about breaking compatibility by ditching `..` altogether. But let us assume this is fine.
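For concreteness, a quick shell check (throwaway temp paths only) demonstrates both of the status-quo rules the proposal would relax:

```shell
#!/bin/sh
# Status quo on today's systems: hard links to directories are
# refused outright, and a directory can only be removed once empty.
tmp=$(mktemp -d)
mkdir "$tmp/dir"
touch "$tmp/dir/file"
# The missing feature: ln without -s refuses a directory target.
ln "$tmp/dir" "$tmp/link" 2>/dev/null || echo "no directory hardlinks"
# The rule the proposal keeps: rmdir refuses a non-empty directory.
rmdir "$tmp/dir" 2>/dev/null || echo "directory not empty"
```

Both commands fail, printing the fallback messages; under the proposal, the first would succeed and the second would become an `unlink()` that merely drops one of the directory's links.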
Once we have directory hardlinks, it becomes possible to deduplicate very efficiently across all projects on the same volume (well, hardlink scope). First, we establish a package cache someplace, say `$XDG_DATA_DIR/npm/packages/$package/$version`. Then, we make a rule that a package's `node_modules` directory contains hardlinks into the cache. Say a package depends on react-dom 17.0.1; `npm install` would execute the equivalent of `ln $XDG_DATA_DIR/npm/packages/react-dom/17.0.1 node_modules/react-dom`, noting the lack of `-s`: a hard link, not a symlink.
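Since no shipping filesystem will allow that `ln` yet, the bookkeeping can be sketched with a regular file standing in for the package directory; the cache layout mirrors the one proposed above, with react-dom 17.0.1 as the running example:

```shell
#!/bin/sh
# Install-as-hardlink, sketched with a regular file standing in for
# the package directory (the part that needs the proposed support).
XDG_DATA_DIR=$(mktemp -d)                      # stand-in data dir
proj=$(mktemp -d)                              # stand-in project
cache="$XDG_DATA_DIR/npm/packages/react-dom/17.0.1"
mkdir -p "${cache%/*}" "$proj/node_modules"
echo 'package contents' > "$cache"             # the cached copy
ln "$cache" "$proj/node_modules/react-dom"     # no -s: hard link
stat -c %h "$cache"                            # GNU stat: prints 2
```

The link count of 2 is exactly the refcount the scheme relies on: one for the cache's own name, one per project that installed the package.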
This way, each version of each package is stored exactly once, and we can easily find out which ones are unused: they have a link count of exactly one. To collect garbage, just keep deleting those until no entry has such a link count (deleting one package can orphan its own dependencies, so repeat until a pass removes nothing).
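The collector can't run against directories on today's kernels, but the link-count rule is the same for regular files, so it can be sketched (GNU `stat`; the cache layout here is flattened to one level for brevity):

```shell
#!/bin/sh
# Link-count GC sketch over a cache directory. An entry with a link
# count of exactly 1 has no project linking to it: garbage. Re-scan
# until a pass removes nothing, since in the real directory-based
# scheme deleting one package can drop another's count to 1.
gc() {
    cache=$1
    removed=1
    while [ "$removed" -ne 0 ]; do
        removed=0
        for entry in "$cache"/*; do
            [ -e "$entry" ] || continue
            if [ "$(stat -c %h "$entry")" -eq 1 ]; then
                rm -- "$entry"    # would be unlink() on a directory
                removed=1
            fi
        done
    done
}
```

A cache entry that some project still hardlinks survives; one that nothing links is swept.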
One wrinkle: this scheme makes it a bad idea to check in your `node_modules`, and people do want to do that. Git has no concept of hardlinks, so although the content-addressable nature of its storage layer ensures deduplication occurs in the repository, that deduplication becomes duplication again on checkout and we're back where we started. The obvious solution is to use symbolic links instead of hard ones and place the package cache inside the repo, as `node_modules_vendor` perhaps. That we lose refcounts by using symlinks isn't much of a problem, because nothing outside the repo is referring to it.
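Unlike the hardlink scheme, this variant works with today's tools, since symlinks to directories are allowed; `node_modules_vendor` is the name floated above, and a relative link target keeps the checkout relocatable:

```shell
#!/bin/sh
# In-repo vendoring via symlinks: the cache lives in the repository
# and node_modules entries point into it.
repo=$(mktemp -d)                               # stand-in repository
mkdir -p "$repo/node_modules" \
         "$repo/node_modules_vendor/react-dom/17.0.1"
# Relative target, resolved from inside node_modules/, so cloning the
# repo anywhere leaves the link valid.
ln -s ../node_modules_vendor/react-dom/17.0.1 \
      "$repo/node_modules/react-dom"
```

Git stores the symlink as a tiny blob containing the target path, so it round-trips through checkout intact, which is exactly what the hardlink version couldn't do.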
- node.js and V8 don't, as far as I know, deduplicate modules at load time at all, and it'd be rather weird if they deduplicated the instantiation of the module. At best they could reuse the AST, bytecode, and so forth, as can be done with individual functions.↩
- I don't like every detail of that algorithm, in particular that you have to say `import 'foo/index.mjs'` in ES6 mode, but that's another story.↩
- I half suspect this is because it's neither POSIX nor Windows-like in the fine detail. This is also another post or two on its own, possibly one of those “Falsehoods programmers believe about X” ones, and a polemic decrying Unix fans for being narrow-minded.↩