DEV Community

Pavel Zeman

Analyzing Storage Consumption in Sonatype Nexus npm Repositories

I've been using Sonatype Nexus Repository Community Edition (or just Nexus) for some time as an npm repository. Along the way, I noticed that it requires quite a lot of storage, so I decided to analyze the storage usage. This article presents my results.

Configuration

For this article, I've downloaded the latest version of Nexus and analyzed its storage requirements with the following configuration:

  • Version: 3.79.1-04
  • Database: Embedded H2
  • Blob store type: File

npm repository types

For npm, Nexus provides the following repository types, each serving a distinct purpose and having specific properties:

  • Hosted
  • Proxy
  • Group

Nexus architecture

Hosted repository

A hosted repository is designed to store npm packages that are published internally by an organization. This type of repository provides a private space for proprietary or custom packages, allowing users to upload, manage, and share their own npm packages securely within the organization.

Hosted repositories are writable, meaning users with appropriate permissions can publish new packages or update existing ones directly to Nexus. These repositories do not automatically fetch or cache external npm packages; their content is limited to what has been explicitly published to them.

Proxy repository

A proxy repository, on the other hand, acts as an intermediary between npm clients and a remote registry such as https://registry.npmjs.org. When a package is requested from a proxy repository, Nexus fetches it from the remote registry if it is not already cached locally. Once retrieved, the package is stored in the proxy repository, making subsequent requests for the same package faster and reducing external bandwidth usage.

Proxy repositories are read-only from the perspective of users, i.e. they cannot be directly published to. Instead, they serve as a transparent cache, improving reliability and performance for external dependencies, and ensuring that previously downloaded packages remain available even if the remote registry is temporarily unreachable.

Group repository

A group repository aggregates multiple repositories of any type. Typically, it combines one or more hosted and proxy repositories into a single unified endpoint. The group repository presents a consolidated view to npm clients so that users only need to configure a single registry URL in their npm settings. When a package is requested from a group repository, Nexus searches its member repositories in a defined order and serves the first match it finds. This configuration simplifies development and CI/CD workflows by allowing both internal and external npm packages to be accessed seamlessly through one URL.

In Nexus Community Edition, group repositories are read-only, i.e. they cannot be directly published to. All the packages must be published to a hosted repository, which can then be included in a group repository.
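
The lookup order described above can be illustrated with a small Python sketch. It is a simplified model, not Nexus code: the repository names, the package names and the `resolve` function are hypothetical, and each member is reduced to a static set of package names, whereas a real proxy repository fetches packages on demand.

```python
from typing import Optional


def resolve(package: str, members: list[tuple[str, set[str]]]) -> Optional[str]:
    """Return the name of the first member repository that holds `package`."""
    for repo_name, packages in members:
        if package in packages:
            return repo_name
    return None


# Hypothetical group: the hosted repository is listed before the proxy.
members = [
    ("npm-hosted", {"@acme/ui", "@acme/utils"}),
    ("npm-proxy", {"react", "@acme/utils"}),
]

print(resolve("react", members))        # npm-proxy
print(resolve("@acme/utils", members))  # npm-hosted (first match wins)
print(resolve("left-pad", members))     # None
```

In this model the member order decides which repository serves a package that exists in more than one member.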

Storage principles

Nexus stores all its data in two places: a relational database and a blob store.

The relational database (H2 in this article) stores metadata about so-called assets. In Nexus terminology, an asset in an npm repository is either a package root or a tarball. A package root is a single JSON document containing information about the package itself, i.e. its name, versions, license, etc. (see, for example, the package root of the react package). For each version, the package root also contains a link to its tarball, i.e. a compressed file containing the code of that specific package version. An example is the tarball for react version 19.0.0.
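
To make the relationship between a package root and its tarballs concrete, here is a minimal Python sketch that lists the tarball link of every version in a package root. The sample JSON is a heavily trimmed, hand-written document in the shape used by the npm registry (a `versions` object whose entries carry a `dist.tarball` URL); the `tarball_urls` helper is a hypothetical name introduced only for this illustration.

```python
import json

# Trimmed, hand-written sample in the shape of an npm package root.
# A real package root contains many more fields and versions.
SAMPLE_PACKAGE_ROOT = """
{
  "name": "react",
  "dist-tags": {"latest": "19.0.0"},
  "versions": {
    "18.3.1": {
      "name": "react",
      "version": "18.3.1",
      "license": "MIT",
      "dist": {"tarball": "https://registry.npmjs.org/react/-/react-18.3.1.tgz"}
    },
    "19.0.0": {
      "name": "react",
      "version": "19.0.0",
      "license": "MIT",
      "dist": {"tarball": "https://registry.npmjs.org/react/-/react-19.0.0.tgz"}
    }
  }
}
"""


def tarball_urls(package_root: dict) -> dict[str, str]:
    """Map each version in the package root to the URL of its tarball."""
    return {
        version: metadata["dist"]["tarball"]
        for version, metadata in package_root.get("versions", {}).items()
    }


package_root = json.loads(SAMPLE_PACKAGE_ROOT)
for version, url in tarball_urls(package_root).items():
    print(version, url)
```

Because every published version adds another entry under `versions`, the package root grows with the number of versions.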

The blob store contains the actual asset data, i.e. a single JSON file for a package root and a compressed tgz file for a tarball. The package root JSON file is not compressed and can be quite large because it contains metadata for all package versions. For example, the current package root of the react package is about 5.7 MB. The blob store content can be stored either directly in a filesystem or in cloud storage (Amazon S3 or Google Cloud Storage). In this article, we will focus on filesystem storage.

As the relational database stores only asset metadata, its size is expected to be much smaller than the size of the blob store. As a result, when calculating the Nexus storage requirements, it should be sufficient to consider only the blob store. For example, my proxy repository with about 2.5 thousand assets requires 4 MB of storage in the relational database and more than 400 MB of storage in the blob store, i.e. the blob store is more than 100 times larger than the relational database.

Storage analysis

The amount of data stored by Nexus differs between repository types.

For hosted repositories, the following rules apply:

  • Nexus creates the package root when the first version of a package is published.
  • Nexus updates the package root every time a version of the package is published or removed.
  • Nexus creates/removes a tarball when a package version is published/removed.

The total size of the assets for a single package can then be calculated as the size of the package root plus the sum of the sizes of the tarballs of all its versions. This size can be reduced by removing specific versions or the package itself. Another option is to define a cleanup policy, which automatically removes package versions (i.e. tarballs) based on defined criteria. However, I've found no way to automatically remove packages themselves.
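
The calculation above can be written as a short Python sketch. The function name and all byte counts are made up for illustration; they are not measurements from a real repository. The sketch also ignores that the package root itself shrinks slightly when a version is removed.

```python
def package_storage_bytes(package_root_bytes: int, tarball_bytes: list[int]) -> int:
    """Size of one package: its package root plus the tarballs of all versions."""
    return package_root_bytes + sum(tarball_bytes)


# Hypothetical package with a 12,000-byte package root and three versions.
root_size = 12_000
tarballs = [48_000, 51_500, 53_250]

print(package_storage_bytes(root_size, tarballs))      # 164750

# Removing the oldest version (for example via a cleanup policy)
# drops its tarball from the sum.
print(package_storage_bytes(root_size, tarballs[1:]))  # 116750
```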

For proxy repositories, the following rules apply:

  • Nexus creates the package root when package metadata or a package version is requested.
  • Nexus creates a tarball when a package version is requested.

Similarly to hosted repositories, cleanup policies can be defined to automatically remove package versions.

Group repositories are virtual by nature: they just group packages stored somewhere else. However, even group repositories require storage. Nexus does not directly reuse package roots from the other repositories included in the group, but creates a local copy of the package root instead. The tarballs, on the other hand, are reused, i.e. they are not stored in the group repository. As with proxy repositories, the package root is created when package metadata or a package version is requested for the first time. Unfortunately, I've found no way to remove obsolete package roots other than completely recreating the group repository. I've also found no way to check which package roots are physically stored in the group repository other than querying the metadata in the relational database.

A summary of the storage analysis is provided in the following table.

Repository type | Package root | Tarball
Hosted | Created when first package version is published | Created/removed when package version is published/removed
Proxy | Created when package metadata or first package version is requested | Created when package version is requested
Group | Created when package metadata or first package version is requested | Reused from other repositories included in the group, i.e. no extra storage required
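
The rules summarized in the table can also be expressed as a small Python model that tracks which assets each repository type ends up storing. This is only a sketch of the behavior described in this article, not of Nexus internals. The class, method and repository names are hypothetical, and it assumes that requests go through a group repository whose proxy member serves the requested external package.

```python
from collections import defaultdict


class NpmStorageModel:
    """Tracks which assets each repository stores under the rules above."""

    def __init__(self) -> None:
        # repository name -> set of (asset kind, asset name)
        self.assets: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def publish(self, package: str, version: str) -> None:
        """Publishing stores a package root and a tarball in the hosted repository."""
        self.assets["hosted"].add(("package_root", package))
        self.assets["hosted"].add(("tarball", f"{package}-{version}.tgz"))

    def request_metadata_via_group(self, package: str) -> None:
        """The proxy caches the package root and the group keeps its own copy."""
        self.assets["proxy"].add(("package_root", package))
        self.assets["group"].add(("package_root", package))

    def request_version_via_group(self, package: str, version: str) -> None:
        """Only the proxy stores the tarball; the group reuses it."""
        self.request_metadata_via_group(package)
        self.assets["proxy"].add(("tarball", f"{package}-{version}.tgz"))


model = NpmStorageModel()
model.publish("internal-lib", "1.0.0")
model.request_version_via_group("react", "19.0.0")

for repository in ("hosted", "proxy", "group"):
    print(repository, sorted(model.assets[repository]))
```

After these two operations, the group repository holds a single package root and no tarball, which matches the last row of the table.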

To analyze the storage of a specific repository, we can use the Nexus GUI, which provides only limited information, or we can query the relational database to get all the details. For example, the following SQL query provides a summary of all assets and their sizes grouped by repository and asset kind (package root or tarball).

-- Number of assets and their total blob size per repository and asset kind
SELECT
    r.name,
    a.kind,
    COUNT(*) AS asset_count,
    SUM(ab.blob_size) AS total_bytes
FROM
    npm_asset a JOIN
    npm_asset_blob ab ON a.asset_blob_id = ab.asset_blob_id JOIN
    npm_content_repository cr ON cr.repository_id = a.repository_id JOIN
    repository r ON r.id = cr.config_repository_id
GROUP BY
    r.name, a.kind
ORDER BY
    r.name, a.kind
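
The query returns raw byte counts, which are hard to read for larger repositories. The following Python sketch shows one way to post-process the result into per-repository totals with human-readable sizes. The rows are made-up sample data in the column order of the query (repository name, asset kind, asset count, total bytes); the repository names, the kind values and both helper functions are hypothetical.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.1f} {units[index]}"


def totals_per_repository(rows: list[tuple[str, str, int, int]]) -> dict[str, int]:
    """Sum the byte counts of all asset kinds for each repository."""
    totals: dict[str, int] = {}
    for repository, _kind, _count, size in rows:
        totals[repository] = totals.get(repository, 0) + size
    return totals


# Made-up rows: (repository name, asset kind, asset count, total bytes)
rows = [
    ("npm-group", "PACKAGE_ROOT", 10, 2 * 1024 * 1024),
    ("npm-proxy", "PACKAGE_ROOT", 10, 2 * 1024 * 1024),
    ("npm-proxy", "TARBALL", 40, 6 * 1024 * 1024),
]

for repository, total in totals_per_repository(rows).items():
    print(f"{repository}: {human_size(total)}")
```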

Key takeaways

  • Nexus supports 3 types of npm repositories (hosted, proxy and group), each serving a distinct purpose and behaving differently in terms of storage.
  • Group repositories are virtual by nature, but they still require storage for package roots.
  • Blob store size is typically the dominant factor in overall storage requirements.
  • Cleanup policies can be used to automatically remove package versions, but not packages themselves.