	# Sample dataset (in-repo)

	This directory ships a single pre-built target, `cam` (calmodulin), so that
	users can smoke-test the training and evaluation pipeline without first
	downloading the full multi-target dataset (~10 GB on Zenodo) or rebuilding
	from raw PDB files (~30 min per target).

	## Contents

	```
	sample/
	└── cam/
	    ├── train.pkl   # 84 paired holo/apo complex graphs (~24 MB)
	    ├── val.pkl     # 12 validation graphs (~1.3 MB)
	    └── test.pkl    # 96 held-out evaluation graphs (~25 MB)
	```

	Each pickle is a list of dicts produced by `code/data/build_dataset.py`.
	Splits follow the family-stratified scheme used in the paper: `train.pkl` and
	`val.pkl` correspond to `data/processed_familysplit/cam/` in the source tree,
	and `test.pkl` corresponds to `data/processed_familysplit_v5/cam/test.pkl`.
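
Before training, it can help to confirm that the three splits load and hold the expected number of records. The following is a minimal sketch, not part of the repository: `summarize_split` is a hypothetical helper, the `data/sample/cam` path assumes you run it from the repository root, and the record counts are copied from the listing above. The keys inside each dict are not documented here, so the sketch only prints them. Unpickling may also require the project's dependencies to be installed, depending on what the dicts contain.

```python
import pickle
from pathlib import Path

# Record counts copied from the directory listing above.
EXPECTED_COUNTS = {"train": 84, "val": 12, "test": 96}


def summarize_split(path):
    """Load one split pickle; return its record count and the first record's keys."""
    with open(path, "rb") as fh:
        records = pickle.load(fh)  # only unpickle files from a source you trust
    if not isinstance(records, list):
        raise TypeError(f"expected a list of dicts, got {type(records).__name__}")
    first_keys = sorted(str(k) for k in records[0]) if records else []
    return {"n_records": len(records), "first_keys": first_keys}


if __name__ == "__main__":
    sample_dir = Path("data/sample/cam")  # assumed location relative to the repo root
    for split, expected in EXPECTED_COUNTS.items():
        split_path = sample_dir / f"{split}.pkl"
        if not split_path.is_file():
            print(f"{split}: {split_path} not found")
            continue
        info = summarize_split(split_path)
        status = "ok" if info["n_records"] == expected else f"expected {expected}"
        print(f"{split}: {info['n_records']} records ({status}), keys={info['first_keys']}")
```

A count mismatch most likely points to a partial checkout or a truncated download rather than a problem in the training code.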

	## Smoke test (1-epoch end-to-end)

	```bash
	# Train both phases for 1 epoch
	python code/scripts/train.py \
	    --target cam \
	    --phase both \
	    --data_dir data/sample \
	    --checkpoint_dir checkpoints_smoke \
	    --epochs 1 \
	    --no_wandb

	# Evaluate
	python code/scripts/evaluate.py \
	    --target cam \
	    --checkpoint checkpoints_smoke/best_phase2.pt \
	    --data_dir data/sample \
	    --outdir eval_smoke
	```

	Expected runtime: ~1 minute on a single GPU.
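
To run the smoke test from a script (for example in CI), the two commands above can be chained so that evaluation starts only after training succeeds and the checkpoint exists. This is a sketch under assumptions: `smoke_commands` and `run_smoke_test` are hypothetical helpers, not part of the repository. They use only the script paths and flags shown above and assume the working directory is the repository root.

```python
import subprocess
import sys
from pathlib import Path


def smoke_commands(target="cam", data_dir="data/sample",
                   ckpt_dir="checkpoints_smoke", outdir="eval_smoke"):
    """Build the train and evaluate argument lists used in the smoke test above."""
    checkpoint = (Path(ckpt_dir) / "best_phase2.pt").as_posix()
    train = [sys.executable, "code/scripts/train.py",
             "--target", target, "--phase", "both",
             "--data_dir", data_dir, "--checkpoint_dir", ckpt_dir,
             "--epochs", "1", "--no_wandb"]
    evaluate = [sys.executable, "code/scripts/evaluate.py",
                "--target", target, "--checkpoint", checkpoint,
                "--data_dir", data_dir, "--outdir", outdir]
    return train, evaluate


def run_smoke_test():
    """Train for one epoch, verify the checkpoint was written, then evaluate."""
    train, evaluate = smoke_commands()
    subprocess.run(train, check=True)  # raises CalledProcessError on failure
    checkpoint = Path(evaluate[evaluate.index("--checkpoint") + 1])
    if not checkpoint.is_file():
        raise FileNotFoundError(f"training finished but {checkpoint} is missing")
    subprocess.run(evaluate, check=True)


if __name__ == "__main__":
    if Path("code/scripts/train.py").is_file():
        run_smoke_test()
    else:
        print("code/scripts/train.py not found; run this from the repository root")
```

Using `check=True` makes a failed training run stop the script immediately instead of letting evaluation fail later on a missing checkpoint.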

	## Want more data?

	- All 12 paper targets, pre-built: see `data/DOWNLOAD.md` for the Zenodo link.
	- Build from raw PDBs locally: `scripts/build_data.sh paper12`.
	- Per-target PDB lists and chain mappings: `data/target_lists/*.txt` (68 targets).