# Sample dataset (in-repo)

This directory ships a single pre-built target, **`cam` (Calmodulin)**, so users
can run a smoke test of the training and evaluation pipeline without first
downloading the full multi-target dataset (~10 GB on Zenodo) or rebuilding
from raw PDB files (~30 min per target).

## Contents

```
sample/
└── cam/
    ├── train.pkl   # 84 paired holo/apo complex graphs (~24 MB)
    ├── val.pkl     # 12 validation graphs (~1.3 MB)
    └── test.pkl    # 96 held-out evaluation graphs (~25 MB)
```

Each pickle is a list of dicts produced by `code/data/build_dataset.py`.
Splits follow the family-stratified scheme used in the paper: `train.pkl` and
`val.pkl` are equivalent to the train and val splits in
`data/processed_familysplit/cam/`, and `test.pkl` is equivalent to
`data/processed_familysplit_v5/cam/test.pkl` in the source tree.

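To confirm that the sample files are intact before training, a short inspection script along the following lines may help. It is a minimal sketch, not part of the repository: the helper name `summarize_split` is hypothetical, the path `data/sample/cam` is inferred from the `--data_dir data/sample` flag used in the smoke test, and no particular dict keys are assumed, so the script only reports whatever keys it finds. Unpickling may also require the project's dependencies to be importable, depending on what the graphs contain.

```python
import pickle
from pathlib import Path


def summarize_split(path):
    """Return (number of records, sorted key names of the first record)."""
    with open(path, "rb") as fh:
        records = pickle.load(fh)  # only unpickle files from a trusted source
    if not isinstance(records, list):
        raise TypeError(f"expected a list of dicts, got {type(records).__name__}")
    # Key names are not documented here, so just report what is present.
    first_keys = sorted(map(str, records[0])) if records else []
    return len(records), first_keys


if __name__ == "__main__":
    # Assumed location, inferred from `--data_dir data/sample` and target `cam`.
    sample_dir = Path("data/sample/cam")
    for split in ("train", "val", "test"):
        path = sample_dir / f"{split}.pkl"
        if not path.exists():
            print(f"{path}: missing")
            continue
        count, keys = summarize_split(path)
        print(f"{split}: {count} records, keys of first record: {keys}")
```

If the files are intact, the reported record counts should match the tree above (84, 12, and 96).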
## Smoke test (1-epoch end-to-end)

```bash
# Train both phases for 1 epoch
python code/scripts/train.py \
    --target cam \
    --phase both \
    --data_dir data/sample \
    --checkpoint_dir checkpoints_smoke \
    --epochs 1 \
    --no_wandb

# Evaluate
python code/scripts/evaluate.py \
    --target cam \
    --checkpoint checkpoints_smoke/best_phase2.pt \
    --data_dir data/sample \
    --outdir eval_smoke
```

Expected runtime: ~1 minute on a single GPU.

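After both commands finish, a quick check that they left output behind can catch silent failures. The sketch below is an assumption-laden illustration rather than a repository script: `check_smoke_outputs` is a hypothetical helper, the checkpoint path is taken from the `--checkpoint` argument above, and it assumes `evaluate.py` writes at least one file into `--outdir` (the file names are not specified here, so none are checked).

```python
from pathlib import Path


def check_smoke_outputs(checkpoint="checkpoints_smoke/best_phase2.pt",
                        outdir="eval_smoke"):
    """Return a list of problems; an empty list means both steps left output."""
    problems = []
    ckpt = Path(checkpoint)
    if not ckpt.is_file():
        problems.append(f"missing checkpoint: {ckpt}")
    elif ckpt.stat().st_size == 0:
        problems.append(f"empty checkpoint: {ckpt}")
    out = Path(outdir)
    if not out.is_dir():
        problems.append(f"missing output directory: {out}")
    elif not any(out.iterdir()):
        # Assumption: evaluation writes at least one file into --outdir.
        problems.append(f"output directory is empty: {out}")
    return problems


if __name__ == "__main__":
    issues = check_smoke_outputs()
    print("smoke test outputs are present" if not issues else "\n".join(issues))
```

This only verifies that files exist; it does not validate metrics, which are not expected to be meaningful after a single epoch.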
## Want more data?

- All 12 paper targets, pre-built: see `data/DOWNLOAD.md` for the Zenodo link.
- Build from raw PDBs locally: `scripts/build_data.sh paper12`.
- Per-target PDB lists and chain mappings: `data/target_lists/*.txt` (68 targets).