# Sample dataset (in-repo)

This directory ships a single pre-built target, **`cam` (Calmodulin)**, so users can run a smoke test of the training and evaluation pipeline without first downloading the full multi-target dataset (~10 GB on Zenodo) or rebuilding from raw PDB files (~30 min per target).

## Contents

```
sample/
└── cam/
    ├── train.pkl   # 84 paired holo/apo complex graphs (~24 MB)
    ├── val.pkl     # 12 validation graphs (~1.3 MB)
    └── test.pkl    # 96 held-out evaluation graphs (~25 MB)
```

Each pickle is a list of dicts produced by `code/data/build_dataset.py`. Splits follow the family-stratified scheme used in the paper (equivalent to `data/processed_familysplit/cam/` train+val and `data/processed_familysplit_v5/cam/test.pkl` in the source tree).

## Smoke test (1-epoch end-to-end)

```bash
# Train both phases for 1 epoch
python code/scripts/train.py \
    --target cam \
    --phase both \
    --data_dir data/sample \
    --checkpoint_dir checkpoints_smoke \
    --epochs 1 \
    --no_wandb

# Evaluate
python code/scripts/evaluate.py \
    --target cam \
    --checkpoint checkpoints_smoke/best_phase2.pt \
    --data_dir data/sample \
    --outdir eval_smoke
```

Expected runtime: ~1 minute on a single GPU.

## Want more data?

- All 12 paper targets, pre-built: see `data/DOWNLOAD.md` for the Zenodo link.
- Build from raw PDBs locally: `scripts/build_data.sh paper12`.
- Per-target PDB lists and chain mappings: `data/target_lists/*.txt` (68 targets).
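
## Inspecting the sample files

Since each pickle is described above as a list of dicts, a quick check before running the smoke test is to load every split and compare the graph counts with the Contents listing (84 / 12 / 96). The following is a minimal sketch, not part of the repository: the helper name `summarize_split` is hypothetical, the path `data/sample/cam` is assumed from the `--data_dir data/sample` flag, and the dict keys are simply printed because this README does not document them. If the dicts hold objects from third-party packages, those packages probably need to be importable for unpickling to succeed.

```python
import pickle
from pathlib import Path


def summarize_split(path):
    """Load one split and return (number of graphs, keys of the first dict).

    Hypothetical helper for illustration; it only assumes the file is a
    pickled list of dicts, as stated in this README.
    """
    with open(path, "rb") as f:
        graphs = pickle.load(f)  # only unpickle files from a trusted source
    if not isinstance(graphs, list) or not all(isinstance(g, dict) for g in graphs):
        raise TypeError(f"{path}: expected a list of dicts")
    # Keys are converted to str so that sorting works for any key type.
    keys = sorted(str(k) for k in graphs[0]) if graphs else []
    return len(graphs), keys


# Assumed location of the sample target, relative to the repository root.
sample_dir = Path("data/sample/cam")
for split in ("train", "val", "test"):
    path = sample_dir / f"{split}.pkl"
    if path.exists():
        n, keys = summarize_split(path)
        print(f"{split}: {n} graphs, keys = {keys}")
    else:
        print(f"{split}: {path} not found")
```

Run it from the repository root. A count that differs from the listing, or a `TypeError`, suggests the file was truncated or built with a different version of the dataset script.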