Explanation

This page provides background on the data that condastats queries and how the tool works internally.

The Anaconda public dataset

condastats is built on top of the Anaconda public package data, a collection of hourly-summarized download counts for conda packages. The dataset is published on S3.

Key facts

Coverage: download records since January 2017.

Channels: includes data from the default anaconda channel, conda-forge, and selected other channels.

Update frequency: the dataset is updated once a month with the previous month’s data.

Format: Apache Parquet files stored in s3://anaconda-package-data/conda/monthly/, organized by year and month.
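As a rough illustration of the year-and-month organization, the sketch below lists one file path per month in a date range. The base prefix comes from the text above; the exact file naming (`<YYYY>/<YYYY>-<MM>.parquet`) and the helper name `monthly_paths` are assumptions made for this example, not something this page specifies.

```python
from datetime import date

# Base prefix as given in the text; the per-file layout below is an assumption.
BASE = "s3://anaconda-package-data/conda/monthly"


def monthly_paths(start: date, end: date) -> list[str]:
    """Return one assumed Parquet path per month from start to end, inclusive."""
    if (start.year, start.month) > (end.year, end.month):
        raise ValueError("start must not be after end")
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Assumed layout: <base>/<YYYY>/<YYYY>-<MM>.parquet
        paths.append(f"{BASE}/{year}/{year}-{month:02d}.parquet")
        # Advance by one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


for path in monthly_paths(date(2023, 11, 1), date(2024, 2, 1)):
    print(path)
```

Because there is one file per month, a narrower date range translates directly into fewer files to fetch.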

How the data is organized

Each monthly Parquet file contains one row per download “bucket” with the following columns:

Column	Description
`pkg_name`	Package name (e.g., `pandas`, `numpy`).
`pkg_version`	Version string of the downloaded package.
`pkg_platform`	Target platform (e.g., `linux-64`, `osx-arm64`, `win-64`).
`pkg_python`	Python version the package was built for (e.g., `3.11`).
`data_source`	The channel or repository (e.g., `anaconda`, `conda-forge`).
`counts`	Number of downloads in this bucket.
`time`	Month of the download (`YYYY-MM`).
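To make the "bucket" idea concrete, here is a small sketch in plain Python. The rows are invented sample values that only mirror the columns in the table above, and `total_by` is a hypothetical helper, not part of condastats. It shows that a package's download total for any one dimension is the sum of `counts` over all matching buckets.

```python
from collections import Counter

# Invented sample buckets that follow the column layout above (not real data).
rows = [
    {"pkg_name": "pandas", "pkg_version": "2.1.0", "pkg_platform": "linux-64",
     "pkg_python": "3.11", "data_source": "conda-forge", "counts": 120, "time": "2024-01"},
    {"pkg_name": "pandas", "pkg_version": "2.1.0", "pkg_platform": "linux-64",
     "pkg_python": "3.10", "data_source": "anaconda", "counts": 30, "time": "2024-01"},
    {"pkg_name": "pandas", "pkg_version": "2.0.3", "pkg_platform": "osx-arm64",
     "pkg_python": "3.11", "data_source": "conda-forge", "counts": 40, "time": "2024-01"},
    {"pkg_name": "numpy", "pkg_version": "1.26.0", "pkg_platform": "win-64",
     "pkg_python": "3.11", "data_source": "anaconda", "counts": 75, "time": "2024-01"},
]


def total_by(rows, column, pkg_name):
    """Sum `counts` per distinct value of `column` for a single package."""
    totals = Counter()
    for row in rows:
        if row["pkg_name"] == pkg_name:
            totals[row[column]] += row["counts"]
    return dict(totals)


print(total_by(rows, "pkg_platform", "pandas"))  # {'linux-64': 150, 'osx-arm64': 40}
print(total_by(rows, "data_source", "pandas"))   # {'conda-forge': 160, 'anaconda': 30}
```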

How condastats queries work

When you run a query, condastats performs the following steps:

Performance considerations

Important

Network I/O is the bottleneck. Each query must fetch Parquet data from S3 over the internet, so a query over a narrow time range is significantly faster than one over all data since 2017.

Dask lazy evaluation. condastats uses Dask to construct a lazy computation graph and only materializes the result at the end with .compute(). This keeps memory usage low even for large time ranges.
Parquet column pruning. The groupby functions only read the columns they need (e.g., pkg_platform queries do not read pkg_version), which reduces the amount of data transferred from S3.
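The two points above can be sketched with Dask's public API. This is a minimal illustration rather than the condastats implementation: the function names are hypothetical, the file path is supplied by the caller, and anonymous S3 access through `storage_options={"anon": True}` (which needs `s3fs` installed) is an assumption about how the public bucket is read.

```python
def columns_needed(group_by: str) -> list[str]:
    """Columns a grouped count must read: the filter key, the group key, and counts."""
    # dict.fromkeys drops the duplicate when group_by is "pkg_name" itself.
    return list(dict.fromkeys(["pkg_name", group_by, "counts"]))


def downloads_by(path: str, package: str, group_by: str):
    """Build a lazy per-group download total for one package, then materialize it."""
    # Imported here so columns_needed() stays usable without Dask installed.
    import dask.dataframe as dd

    df = dd.read_parquet(
        path,
        columns=columns_needed(group_by),  # column pruning: skip unused columns
        storage_options={"anon": True},    # assumption: anonymous access to a public bucket
    )
    # Nothing is downloaded or summed yet; this only extends the task graph.
    lazy_totals = df[df["pkg_name"] == package].groupby(group_by)["counts"].sum()
    # .compute() triggers the actual reads and the aggregation.
    return lazy_totals.compute()


print(columns_needed("pkg_platform"))  # ['pkg_name', 'pkg_platform', 'counts']
```

Calling `downloads_by` requires Dask, `s3fs`, and network access; only the final `.compute()` call transfers the selected columns and returns a pandas object.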

Relationship to pypistats

condastats was inspired by pypistats, which provides similar download statistics for PyPI packages. While pypistats queries the PyPI BigQuery dataset (via the pypistats.org API), condastats reads directly from the Anaconda S3 dataset.
