Explanation
This page provides background on the data that condastats queries and how the tool works internally.
The Anaconda public dataset
condastats is built on top of the Anaconda public package data, a collection of hourly-summarized download counts for conda packages published on S3.
Coverage: download records since January 2017.
Channels: the default anaconda channel, conda-forge, and selected other channels.
Update frequency: the dataset is updated once a month with the previous month’s data.
Format: Apache Parquet files stored in s3://anaconda-package-data/conda/monthly/, organized by year and month.
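As a rough illustration of the "organized by year and month" layout, the sketch below expands a month range into one path per monthly file. The bucket prefix comes from the text above; the `<year>/<year>-<month>.parquet` file naming and the helper names `month_range` and `monthly_path` are assumptions made for this example, not something this page confirms.

```python
# Bucket prefix quoted in the text above.
BASE = "s3://anaconda-package-data/conda/monthly/"


def month_range(start: str, end: str) -> list[str]:
    """Return every 'YYYY-MM' string from start to end, inclusive."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    # Count months from year 0 so that year boundaries need no special case.
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    return [f"{i // 12}-{i % 12 + 1:02d}" for i in range(first, last + 1)]


def monthly_path(month: str) -> str:
    """Build the path for one month.

    ASSUMPTION: files are laid out as <year>/<year>-<month>.parquet.
    """
    year = month.split("-")[0]
    return f"{BASE}{year}/{month}.parquet"


# One path per month; a wider range means more files to fetch.
paths = [monthly_path(m) for m in month_range("2023-11", "2024-02")]
for path in paths:
    print(path)
```

A single month maps to a single path, which is why a narrow range keeps the number of files to download small.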
How the data is organized
Each monthly Parquet file contains one row per download “bucket” with the following columns:
| Column | Description |
|---|---|
| pkg_name | Package name. |
| pkg_version | Version string of the downloaded package. |
| pkg_platform | Target platform. |
| pkg_python | Python version the package was built for. |
| data_source | The channel or repository. |
| counts | Number of downloads in this bucket. |
| time | Month of the download. |
How condastats queries work
When you run a query, condastats performs the following steps:
1. Determine the time range
If you pass --month, a single Parquet file is read. If you pass
--start_month/--end_month, the corresponding range of files is
read. If neither is given, all available monthly files are read (this
can be slow).
2. Read from S3
condastats uses Dask to lazily read the Parquet files directly from the public S3 bucket. No authentication is needed – the data is publicly accessible.
3. Filter by package
Only rows matching the requested package name(s) are kept.
4. Apply additional filters
For the overall subcommand, optional filters (platform, data source,
version, Python version) further narrow the result.
5. Aggregate
The filtered rows are grouped and summed:
overall groups by pkg_name (and optionally by time).
pkg_platform, data_source, pkg_version, and pkg_python group by pkg_name and their respective column.
6. Return the result
The CLI prints the pandas result to stdout. The Python API returns the
pandas.Series or pandas.DataFrame directly.
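
The filter and aggregate steps above can be sketched with plain pandas on a small in-memory table. This is an illustration under stated assumptions rather than condastats' actual code: the rows are made up, the download-count column is assumed to be named `counts`, and nothing is read from S3.

```python
import pandas as pd

# Toy rows using the dataset's column names; all values are invented.
rows = pd.DataFrame(
    {
        "pkg_name": ["numpy", "numpy", "numpy", "pandas"],
        "pkg_platform": ["linux-64", "win-64", "linux-64", "linux-64"],
        "time": ["2024-01", "2024-01", "2024-02", "2024-01"],
        "counts": [100, 40, 60, 70],  # ASSUMPTION: column name is "counts"
    }
)

# Step 3: keep only the requested package(s).
selected = rows[rows["pkg_name"].isin(["numpy"])]

# Step 4 (optional): narrow further, here by platform.
linux_only = selected[selected["pkg_platform"] == "linux-64"]

# Step 5: group and sum the download counts.
overall = linux_only.groupby("pkg_name")["counts"].sum()
by_month = linux_only.groupby(["pkg_name", "time"])["counts"].sum()
by_platform = selected.groupby(["pkg_name", "pkg_platform"])["counts"].sum()

# Step 6: a CLI would print the result; an API would return it.
print(by_platform)
```

Grouping by `pkg_name` alone yields one total per package, while adding a second key such as `time` or `pkg_platform` yields one total per combination.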
Performance considerations
Important
Network I/O is the bottleneck. Each query must fetch Parquet data from S3 over the internet, so a query over a narrow time range is significantly faster than one over all data since 2017.
Dask lazy evaluation. condastats uses Dask to construct a lazy computation graph and only materializes the result at the end with .compute(). This keeps memory usage low even for large time ranges.
Parquet column pruning. The groupby functions only read the columns they need (e.g., pkg_platform queries do not read pkg_version), which reduces the amount of data transferred from S3.
Relationship to pypistats
condastats was inspired by pypistats, which provides similar download statistics for PyPI packages. While pypistats queries the PyPI BigQuery dataset (via the pypistats.org API), condastats reads directly from the Anaconda S3 dataset.