
Sample dataset (in-repo)

This directory ships a single pre-built target, cam (Calmodulin), so users can run a smoke test of the training and evaluation pipeline without first downloading the full multi-target dataset (10 GB on Zenodo) or rebuilding from raw PDB files (30 min per target).

Contents

sample/
└── cam/
    ├── train.pkl   # 84 paired holo/apo complex graphs   (~24 MB)
    ├── val.pkl     # 12 validation graphs                (~1.3 MB)
    └── test.pkl    # 96 held-out evaluation graphs       (~25 MB)

Each pickle is a list of dicts produced by code/data/build_dataset.py. Splits follow the family-stratified scheme used in the paper: train.pkl and val.pkl are equivalent to the train and val splits in data/processed_familysplit/cam/, and test.pkl is equivalent to data/processed_familysplit_v5/cam/test.pkl in the source tree.
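To check a split before training, it can help to load the pickle and look at what it contains. The sketch below is a hypothetical helper, not part of the repository: summarize_split is an illustrative name, and it assumes only what is stated above (each file is a pickled list of dicts). The dict keys are not documented here, so the helper reports whatever keys it finds instead of assuming a schema. Unpickling may also require the project's dependencies to be installed if the graphs hold library-specific objects.

```python
import pickle
from collections import Counter
from pathlib import Path


def summarize_split(path):
    """Load one split pickle and report its size and the dict keys it contains.

    Hypothetical helper: assumes the file is a pickled list of dicts.
    """
    # Only unpickle files you trust: pickle can execute code on load.
    with Path(path).open("rb") as fh:
        records = pickle.load(fh)

    # Count how often each key appears, without assuming a particular schema.
    key_counts = Counter()
    for record in records:
        if isinstance(record, dict):
            key_counts.update(record.keys())

    return {"n_records": len(records), "key_counts": dict(key_counts)}
```

For example, summarize_split("data/sample/cam/val.pkl") should report 12 records if the file matches the listing above; a key that appears in fewer records than n_records points to inconsistent entries.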

Smoke test (1-epoch end-to-end)

# Train both phases for 1 epoch
python code/scripts/train.py \
    --target cam \
    --phase both \
    --data_dir data/sample \
    --checkpoint_dir checkpoints_smoke \
    --epochs 1 \
    --no_wandb

# Evaluate
python code/scripts/evaluate.py \
    --target cam \
    --checkpoint checkpoints_smoke/best_phase2.pt \
    --data_dir data/sample \
    --outdir eval_smoke

Expected runtime: ~1 minute on a single GPU.
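The two steps above can also be driven from one Python script, which is convenient in CI. This is a minimal sketch, not a script shipped with the repository: the function names are hypothetical, while the script paths, flags, and the best_phase2.pt checkpoint name are taken from the commands above. It assumes it is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_train_cmd(target="cam", data_dir="data/sample", ckpt_dir="checkpoints_smoke"):
    """Return the 1-epoch training command as an argument list."""
    return [
        sys.executable, "code/scripts/train.py",
        "--target", target,
        "--phase", "both",
        "--data_dir", data_dir,
        "--checkpoint_dir", ckpt_dir,
        "--epochs", "1",
        "--no_wandb",
    ]


def build_eval_cmd(target="cam", data_dir="data/sample",
                   ckpt_dir="checkpoints_smoke", outdir="eval_smoke"):
    """Return the evaluation command as an argument list."""
    checkpoint = (Path(ckpt_dir) / "best_phase2.pt").as_posix()
    return [
        sys.executable, "code/scripts/evaluate.py",
        "--target", target,
        "--checkpoint", checkpoint,
        "--data_dir", data_dir,
        "--outdir", outdir,
    ]


def run_smoke_test(ckpt_dir="checkpoints_smoke"):
    """Train, confirm the phase-2 checkpoint was written, then evaluate."""
    # check=True raises CalledProcessError if a step exits with a non-zero status.
    subprocess.run(build_train_cmd(ckpt_dir=ckpt_dir), check=True)

    checkpoint = Path(ckpt_dir) / "best_phase2.pt"
    if not checkpoint.is_file():
        raise FileNotFoundError(f"training finished but {checkpoint} is missing")

    subprocess.run(build_eval_cmd(ckpt_dir=ckpt_dir), check=True)
```

Calling run_smoke_test() stops at the first failing step, so a missing checkpoint is reported before evaluation starts.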

Want more data?

  • All 12 paper targets, pre-built: see data/DOWNLOAD.md for the Zenodo link.
  • Build from raw PDBs locally: scripts/build_data.sh paper12.
  • Per-target PDB lists and chain mappings: data/target_lists/*.txt (68 targets).