# Sample dataset (in-repo)
This directory ships a single pre-built target, **`cam` (Calmodulin)**, so users
can run a smoke test of the training and evaluation pipeline without first
downloading the full multi-target dataset (~10 GB on Zenodo) or rebuilding
from raw PDB files (~30 min per target).
## Contents
```
sample/
└── cam/
    ├── train.pkl   # 84 paired holo/apo complex graphs (~24 MB)
    ├── val.pkl     # 12 validation graphs (~1.3 MB)
    └── test.pkl    # 96 held-out evaluation graphs (~25 MB)
```
Each pickle is a list of dicts produced by `code/data/build_dataset.py`.
Splits follow the family-stratified scheme used in the paper
(equivalent to `data/processed_familysplit/cam/` train+val and
`data/processed_familysplit_v5/cam/test.pkl` in the source tree).
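
To sanity-check a split before training, the pickles can be opened directly. The snippet below is a minimal sketch, not part of the repository: `summarize_split` is a hypothetical helper, the `data/sample/cam` path is assumed from the `--data_dir` used in the smoke test, and unpickling may require the project's own dependencies to be importable if the records hold library-specific objects.

```python
import pickle
from pathlib import Path


def summarize_split(path):
    """Load one split and report its size and the keys of the first record.

    Hypothetical helper for illustration; assumes the file holds a list of dicts.
    """
    with open(path, "rb") as fh:
        records = pickle.load(fh)  # only unpickle files from a source you trust
    if not isinstance(records, list):
        raise TypeError(f"expected a list, got {type(records).__name__}")
    first_keys = list(records[0].keys()) if records else []
    return {"n_records": len(records), "first_keys": first_keys}


if __name__ == "__main__":
    # Assumed location of the sample target; adjust to your checkout.
    split_dir = Path("data/sample/cam")
    for name in ("train", "val", "test"):
        path = split_dir / f"{name}.pkl"
        if path.exists():
            print(name, summarize_split(path))
        else:
            print(name, "not found at", path)
```

If the files match the listing above, the reported record counts should be 84, 12, and 96 for train, val, and test respectively.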
## Smoke test (1-epoch end-to-end)
```bash
# Train both phases for 1 epoch
python code/scripts/train.py \
    --target cam \
    --phase both \
    --data_dir data/sample \
    --checkpoint_dir checkpoints_smoke \
    --epochs 1 \
    --no_wandb

# Evaluate
python code/scripts/evaluate.py \
    --target cam \
    --checkpoint checkpoints_smoke/best_phase2.pt \
    --data_dir data/sample \
    --outdir eval_smoke
```
Expected runtime: ~1 minute on a single GPU.
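
A quick way to confirm the smoke test finished is to check for the artifacts named in the commands above. This is a rough sketch: `check_smoke_outputs` is a hypothetical helper, and it assumes that training writes `best_phase2.pt` into `checkpoints_smoke/` (as the evaluate command implies) and that evaluation writes at least one file into `eval_smoke/`.

```python
from pathlib import Path


def check_smoke_outputs(root="."):
    """Report whether the smoke-test artifacts exist under `root`.

    Hypothetical helper; the paths mirror the flags used in the commands above.
    """
    root = Path(root)
    checkpoint = root / "checkpoints_smoke" / "best_phase2.pt"
    eval_dir = root / "eval_smoke"
    return {
        # Checkpoint that the evaluate command expects to load
        "checkpoint": checkpoint.is_file(),
        # Assumption: evaluation leaves at least one file in its --outdir
        "eval_outputs": eval_dir.is_dir() and any(eval_dir.iterdir()),
    }


if __name__ == "__main__":
    for name, ok in check_smoke_outputs().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Run it from the repository root after both commands complete; a `missing` entry points to the step that did not produce its output.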
## Want more data?
- All 12 paper targets, pre-built: see `data/DOWNLOAD.md` for the Zenodo link.
- Build from raw PDBs locally: `scripts/build_data.sh paper12`.
- Per-target PDB lists and chain mappings: `data/target_lists/*.txt` (68 targets).