Kun-peng is a metagenomic classifier designed for large reference collections. It stores reference data as sharded hash tables and loads only the shards needed for the current reads, which keeps memory usage far lower than fully loaded workflows. Classification follows Kraken-style minimizer-based taxonomy assignment, while the storage layout and streaming are optimized for large pan-domain databases.
- Sharded hash layout keeps memory usage low by loading only the hash shards needed for the current reads.
- Minimizer-based indexing keeps disk and memory usage practical for large reference collections without relying on fully loaded workflows.
- The workflow is modular: database construction and classification can run end-to-end or as separate steps for debugging and benchmarking.
- Outputs are Kraken-compatible, including `kreport2`, so existing downstream tooling can usually be reused directly.
Database build:
- Prepare a reference library from NCBI downloads or your own FASTA files.
- Estimate capacity and split minimizers into chunk files.
- Build `hash_*.k2d`, `hash_config.k2d`, `taxo.k2d`, and `opts.k2d`.
Classification:
- `splitr` chunks the input reads.
- `annotate` loads only the required hash shards.
- `resolve` computes taxonomy assignments and reports results.
```
brew install eric9n/tap/kun_peng
```

Or download a release for Linux, macOS, or Windows from:
Then make sure `kun_peng` is on your PATH.
Requirements:
- Rust toolchain
Build:
```
cargo build --release
```

The binary will be available at:

```
./target/release/kun_peng
```
Verify installation:
```
kun_peng --version
```

Choose the path that matches your situation.
Use this if:
- `data/` already contains the required taxonomy and genome downloads
Before you start:
- `kun_peng` is installed or available at `./target/release/kun_peng`
- you want to build a new database in `test_database/`
- you have an example input such as `data/COVID_19.fa`
If you still need to prepare `data/`, you can use `ncbi_dl` to download taxonomy and genomes:
```
brew install eric9n/tap/ncbi_dl
ncbi_dl -d data tax
ncbi_dl -d data gen -g <group>
```

Choose `<group>` according to your use case, for example `viral`, `bacteria`, or `archaea`.
```
kun_peng build --download-dir data/ --db test_database --hash-capacity 1G
```
```
mkdir -p temp_chunk test_out
kun_peng classify \
    --db test_database \
    --chunk-dir temp_chunk \
    --output-dir test_out \
    data/COVID_19.fa
```

Success looks like:

- `test_database/` contains `hash_*.k2d`, `hash_config.k2d`, `taxo.k2d`, and `opts.k2d`
- `test_out/` contains `output_*.txt` and `*.kreport2`
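The success criteria above can be checked mechanically. Below is a small shell sketch; the `check_outputs` helper is not part of kun_peng, and the file checks are assumptions based on the expected files listed above.

```shell
# Hypothetical helper (not a kun_peng command): verify that a database directory
# and an output directory contain the files the walkthrough expects.
check_outputs() {
    db="$1"
    out="$2"
    for f in hash_config.k2d taxo.k2d opts.k2d; do
        [ -f "$db/$f" ] || { echo "missing $db/$f"; return 1; }
    done
    # At least one hash shard and one report must exist.
    ls "$db"/hash_*.k2d >/dev/null 2>&1 || { echo "no hash shards in $db"; return 1; }
    ls "$out"/*.kreport2 >/dev/null 2>&1 || { echo "no kreport2 in $out"; return 1; }
    echo "outputs look complete"
}
```

Usage: `check_outputs test_database test_out`.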
Detailed guide:
Use this if:
- you already have `library/*.fna` content to build from
- you want to extend a database with your own FASTA files
Before you start:
- the target database directory already exists or will be created as `test_database/`
- if you use `add-library`, the database must already contain the expected library and taxonomy structure
Prepare the library with downloaded genomes:
```
kun_peng merge-fna --download-dir data/ --db test_database
```

or add your own FASTA files:

```
kun_peng add-library --db test_database -i /path/to/fastas
```

Then build and classify:

```
kun_peng build-db --db test_database --hash-capacity 1G
mkdir -p temp_chunk test_out
kun_peng classify \
    --db test_database \
    --chunk-dir temp_chunk \
    --output-dir test_out \
    data/COVID_19.fa
```

Success looks like:

- `test_database/` contains rebuilt `hash_*.k2d` files
- `test_out/` contains `output_*.txt` and `*.kreport2`
Detailed guide:
Use this if:
- you already have a Kraken 2 database containing `hash.k2d`, `opts.k2d`, and `taxo.k2d`
Before you start:
- the Kraken 2 database is available at `/path/to/kraken_db`
- you want to convert it once and then use Kun-peng classification workflows
Convert the Kraken 2 database into Kun-peng's sharded format:
```
kun_peng hashshard --db /path/to/kraken_db --hash-capacity 1G
```

Then classify:

```
mkdir -p temp_chunk test_out
kun_peng classify \
    --db /path/to/kraken_db \
    --chunk-dir temp_chunk \
    --output-dir test_out \
    data/COVID_19.fa
```

If you have enough RAM to load all `hash_*.k2d` files at once:

```
bash cal_memory.sh /path/to/kraken_db
kun_peng direct --db /path/to/kraken_db data/COVID_19.fa
```

Success looks like:

- `/path/to/kraken_db` contains `hash_config.k2d` and `hash_*.k2d`
- `test_out/` contains `output_*.txt` and `*.kreport2` after `classify`
Detailed guide:
Kun-peng exposes both end-to-end and stepwise subcommands:
- `build`: full build pipeline from downloaded data
- `build-db`: build database artifacts from an existing library
- `merge-fna`: normalize downloaded genomes into library files
- `add-library`: add local FASTA files into a database library
- `hashshard`: convert a Kraken 2 database into sharded Kun-peng format
- `classify`: integrated chunk-based classification workflow
- `direct`: load all hash tables for high-memory, high-speed classification
- `splitr`, `annotate`, `resolve`: stepwise classification pipeline
For command details:
```
kun_peng --help
kun_peng <subcommand> --help
```

Inputs:
- FASTA / FASTQ
- Gzipped FASTA / FASTQ
- Multiple input files in one command
- A single `.txt` file listing input paths for `classify`
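The `.txt` list input can be generated with a small helper. A sketch, assuming plain one-path-per-line format; the `make_input_list` name and the extension patterns are examples, not part of kun_peng.

```shell
# Write one input path per line, suitable for saving as a .txt list for classify.
make_input_list() {
    dir="$1"
    find "$dir" \( -name '*.fa' -o -name '*.fasta' -o -name '*.fq.gz' -o -name '*.fastq.gz' \) | sort
}
```

For example, `make_input_list data > inputs.txt`, then pass `inputs.txt` to `kun_peng classify`.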
Main outputs from classify:
- `output_*.txt`: Kraken-style per-read classification output
- `*.kreport2`: hierarchical taxonomy summary
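Since `output_*.txt` follows the Kraken per-read convention (first column `C` for classified, `U` for unclassified), quick summaries work with standard tools. A sketch; the helper name is an example, not a kun_peng command.

```shell
# Fraction of classified reads in a Kraken-style per-read output file:
# count all lines, count those whose first column is "C", print the ratio.
classified_fraction() {
    awk '{ n++ } $1 == "C" { c++ } END { if (n > 0) printf "%.2f\n", c / n }' "$1"
}
```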
Example:
```
mkdir -p temp_chunk test_out
kun_peng classify --db test_database --chunk-dir temp_chunk --output-dir test_out data/COVID_19.fa
```

- `--hash-capacity` controls the number of slots per shard. As a rule of thumb, `1G` capacity produces about a 4 GiB shard file.
- Smaller shard sizes can reduce per-file memory pressure and improve I/O flexibility, at the cost of more files.
- `direct` mode requires RAM roughly equal to the total size of all `hash_*.k2d` files. `classify` uses much less memory because it loads shards on demand.
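The rule of thumb above (`1G` slots, roughly a 4 GiB file) implies about 4 bytes per slot. A back-of-the-envelope sketch; the helper and the 4-bytes-per-slot figure are derived from that rule of thumb, not from the kun_peng source.

```shell
# Estimate a shard file's size in bytes from its slot capacity (~4 bytes/slot).
shard_bytes() {
    slots="$1"            # e.g. 1073741824 for a 1G capacity
    echo $(( slots * 4 ))
}
```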
To estimate memory for `direct` mode:

```
bash cal_memory.sh test_database
```

- Use a clean `--chunk-dir` for `classify`. Leftover `sample_*.k2`, `sample_id*.map`, or `sample_*.bin` files will cause an error.
- After `add-library`, always rerun `build-db`. Old `hash_*.k2d` files will not match the updated library.
- `hashshard` stops if `hash_config.k2d` already exists in the target directory. Use a fresh directory or back up the old file first.
- If `direct` mode needs too much RAM, switch to `classify`.
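If `cal_memory.sh` is not at hand, the same estimate for `direct` mode is just the summed size of the hash shards. A portable sketch; the helper name is an example.

```shell
# Total bytes of all hash shards in a database directory; direct mode needs
# roughly this much RAM to load them all at once.
total_hash_bytes() {
    db="$1"
    total=0
    for f in "$db"/hash_*.k2d; do
        [ -f "$f" ] || continue
        total=$(( total + $(wc -c < "$f") ))
    done
    echo "$total"
}
```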
- `docs/cli-reference.md`: command reference by workflow
- `docs/build-db-demo.md`: build a database from downloads or local FASTA
- `docs/classify-demo.md`: integrated and direct classification workflows
- `docs/hashshard-demo.md`: convert a Kraken 2 database
- `examples/README.md`: runnable Rust examples
```bibtex
@article{Chen2026KunPeng,
  author    = {Chen, Qiong and Zhang, Boliang and Peng, Chen and Huang, Jiajun and Liu, Zhen and Shen, Xiaotao and Jiang, Chao},
  title     = {Kun-peng enables scalable and accurate pan-domain metagenomic classification},
  journal   = {Briefings in Bioinformatics},
  volume    = {27},
  number    = {2},
  year      = {2026},
  month     = mar,
  doi       = {10.1093/bib/bbag119},
  url       = {https://academic.oup.com/bib/article/27/2/bbag119/8525000},
  publisher = {Oxford University Press}
}
```
