LLM auto-annotation for the HICO-DET dataset (Pose from Halpe, Part State from HAKE).
Environment
The code is developed with Python 3.11.11 on Ubuntu 21.xx, using torch==2.6.0+cu124 and transformers==4.57.3 (for the Qwen3 series).
Annotating HICO-Det
A. Installation
- Install the required packages and dependencies.
- Clone this repo; we'll refer to the cloned directory as ${ROOT}.
- Create the necessary directories:
mkdir outputs
mkdir model_weights
- Download the LLM weights into model_weights from Hugging Face.
B. Prepare Dataset
- Install the COCO API:
pip install pycocotools
- Download the dataset.
- Organize the dataset; your directory tree should look like this (there may be extra files):
{DATA_ROOT}
|-- Annotation
|   |-- hico-det-instance-level
|   |   |-- hico-det-training-set-instance-level.json
|   `-- hico-fullbody-pose
|       |-- halpe_train_v1.json
|       `-- halpe_val_v1.json
|-- Configs
|   |-- hico_hoi_list.txt
|   `-- Part_State_76.txt
|-- Images
|   `-- images
|       |-- test2015
|       |   |-- HICO_test2015_00000001.jpg
|       |   |-- HICO_test2015_00000002.jpg
|       |   ...
|       `-- train2015
|           |-- HICO_train2015_00000001.jpg
|           |-- HICO_train2015_00000002.jpg
|           ...
`-- Logic_Rules
    |-- gather_rule.pkl
    `-- read_rules.py
C. Start annotation
Modify data_path, model_path, and output_dir='outputs' according to your configuration in "{ROOT}/scripts/annotate_hico.sh".
IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./
data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs
if [ -d ${output_dir} ];then
echo "dir already exists"
else
mkdir ${output_dir}
fi
CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
tools/annotate_hico.py \
--model-path ${model_path} \
--data-path ${data_path} \
--output-dir ${output_dir}
Start auto-annotation
bash scripts/annotate_hico.sh
D. Multi-stage HICO pipeline
The repository now supports a 3-stage HICO workflow:
- Long description generation
- Description refinement
- Description examination / checking
Each stage writes per-rank JSON files first, then merges them into one JSON file for the next stage.
Stage 1. Generate long descriptions
This is the original HICO annotation stage. It uses Conversation in data/convsersation.py.
Run:
bash scripts/annotate_hico.sh
This creates per-rank files such as:
outputs/labels_0.json
outputs/labels_1.json
Merge them with:
python3 tools/merge_json_outputs.py \
--input-dir outputs \
--pattern "labels_*.json" \
--output-path outputs/merged_labels.json
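The merge step simply concatenates the per-rank JSON lists into one file. A minimal stand-in for tools/merge_json_outputs.py (assuming each per-rank file holds a single JSON list; the real script may differ) could look like:

```python
import glob
import json
import os


def merge_json_outputs(input_dir, pattern, output_path):
    """Concatenate per-rank JSON lists into a single JSON file."""
    merged = []
    # Sort so rank 0, 1, 2, ... are concatenated in a stable order.
    for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
        with open(path) as f:
            merged.extend(json.load(f))
    with open(output_path, "w") as f:
        json.dump(merged, f)
    return merged
```

The same helper covers the Stage 2 and Stage 3 merges; only the input directory and filename pattern change.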
Stage 2. Refine generated descriptions
This stage reads a merged JSON from Stage 1 and adds a refined_description field. It uses Conversation_For_Clean_Descrption in data/convsersation.py.
Modify data_path, model_path, annotation_path, and output_dir in scripts/refine_hico.sh, then run:
bash scripts/refine_hico.sh
This creates files such as:
outputs/refine/refine_labels_0.json
Merge them with:
python3 tools/merge_json_outputs.py \
--input-dir outputs/refine \
--pattern "refine_labels_*.json" \
--output-path outputs/merged_refine.json
Stage 3. Examine / check generated descriptions
This stage reads a merged JSON from Stage 2 and adds an examiner_result field. It uses Conversation_examiner in data/convsersation.py.
Modify data_path, model_path, annotation_path, and output_dir in scripts/examine_hico.sh, then run:
bash scripts/examine_hico.sh
This creates files such as:
outputs/examiner/examiner_labels_0.json
Merge them with:
python3 tools/merge_json_outputs.py \
--input-dir outputs/examiner \
--pattern "examiner_labels_*.json" \
--output-path outputs/merged_examine.json
One-shot pipeline
If you want to run all 3 stages end-to-end, use:
bash scripts/pipeline_hico.sh
Before running it, edit the following variables in scripts/pipeline_hico.sh:
- DATA_PATH
- LONG_MODEL_PATH
- REFINE_MODEL_PATH
- EXAMINE_MODEL_PATH
- LONG_GPU_IDS
- REFINE_GPU_IDS
- EXAMINE_GPU_IDS
- LONG_NPROC
- REFINE_NPROC
- EXAMINE_NPROC
The pipeline will produce:
outputs/pipeline/merged_long.json
outputs/pipeline/merged_refine.json
outputs/pipeline/merged_examine.json
E. Using different VLM backends
The HICO scripts are no longer hardcoded to Qwen only. The model loading logic is centralized in tools/vlm_backend.py, so you can use different VLM families for long-description generation, refinement, and examination.
The following scripts support backend selection:
tools/annotate_hico.py
tools/refine_hico.py
tools/examine_hico.py
tools/clean_initial_annotation.py
Each of them accepts:
--model-path
--model-backend
--torch-dtype
Examples:
torchrun --nnodes=1 --nproc_per_node=1 tools/annotate_hico.py \
--model-path /path/to/model \
--model-backend auto \
--torch-dtype bfloat16 \
--data-path ../datasets/HICO-Det \
--output-dir outputs/test \
--max-samples 5
You may also force a backend explicitly, for example:
--model-backend qwen3_vl
--model-backend qwen3_vl_moe
--model-backend llava
--model-backend deepseek_vl
--model-backend hf_vision2seq
--model-backend hf_causal_vlm
Where to customize for a new model
If you want to adapt the repository to a new model family, the main file to edit is:
tools/vlm_backend.py
This file controls:
- backend detection: infer_model_backend(...)
- model/processor loading: load_model_and_processor(...)
- prompt/image packaging: build_batch_tensors(...)
- output decoding: decode_generated_text(...)
In most cases, you do not need to change the HICO task scripts themselves.
How to add a new model backend
There are three common situations.
1. The model already works with the Hugging Face AutoProcessor and AutoModelForVision2Seq or AutoModelForCausalLM. In that case, you may only need to run with --model-backend auto, or explicitly with --model-backend hf_vision2seq or --model-backend hf_causal_vlm.
2. The model needs custom backend detection. Add a rule inside infer_model_backend(...) in tools/vlm_backend.py.
3. The model needs a custom class or a custom multimodal input format. Add a new branch inside load_model_and_processor(...), build_batch_tensors(...), and, if needed, decode_generated_text(...).
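As a rough sketch of what a backend-detection rule might look like (the actual infer_model_backend(...) in tools/vlm_backend.py may differ; this version just matches on the model path):

```python
def infer_model_backend(model_path: str) -> str:
    """Guess the backend family from the model directory name."""
    name = model_path.lower()
    # Check the more specific MoE variant before the plain Qwen3-VL rule.
    if "qwen3-vl" in name and "moe" in name:
        return "qwen3_vl_moe"
    if "qwen3-vl" in name:
        return "qwen3_vl"
    if "llava" in name:
        return "llava"
    if "deepseek-vl" in name:
        return "deepseek_vl"
    # Fall back to generic Hugging Face Vision2Seq loading.
    return "hf_vision2seq"
```

Adding a new family is then a matter of inserting one more substring rule before the fallback, plus the matching loading branch.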
Rule of thumb
- If you want to change task behavior or prompting, edit data/convsersation.py.
- If you want to support a new model family, edit tools/vlm_backend.py.
- If you want to add a new stage, add a new script under tools/.
F. Annotation format
A list of dicts, each containing the following keys:
{
'file_name': 'HICO_train2015_00009511.jpg',
'image_id': 0,
'keypoints': a 51-element list (17 keypoints x 3 values: x, y, v),
'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
'instance_id':0,
'action_labels': [{'human_part': part_id, 'partstate': state_id}, ...],
'height': 640,
'width': 480,
'human_bbox': [126, 258, 150, 305],
'object_bbox': [128, 276, 144, 313],
'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
After refinement and examination, extra fields may appear in the JSON:
{
'refined_description': "A refined 2-3 sentence version aligned with the target HOI label.",
'examiner_result': "Verdict: PASS or FAIL ..."
}
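For downstream filtering, the examiner verdict can be parsed from this field. A minimal sketch, assuming the verdict string always begins with "Verdict: PASS" or "Verdict: FAIL" (keep_passing is a hypothetical helper, not part of the repo):

```python
def keep_passing(records):
    """Keep only annotation records whose examiner verdict is PASS.

    Records without an examiner_result field (e.g. unexamined
    Stage 1 output) are dropped as well.
    """
    return [
        r for r in records
        if r.get("examiner_result", "").startswith("Verdict: PASS")
    ]
```

Applied to the merged Stage 3 JSON, this yields the subset of descriptions that passed examination.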
Annotate COCO
- Download COCO dataset.
- Organize the dataset; your directory tree should look like this (the files inside Configs are copied from HICO-Det):
{DATA_ROOT}
|-- annotations
|   |-- person_keypoints_train2017.json
|   `-- person_keypoints_val2017.json
|-- Configs
|   |-- hico_hoi_list.txt
|   `-- Part_State_76.txt
|-- train2017
|   |-- 000000000009.jpg
|   |-- 000000000025.jpg
|   ...
`-- val2017
    |-- 000000000139.jpg
    |-- 000000000285.jpg
    ...
Start annotation
Modify data_path, model_path, and output_dir='outputs' according to your configuration in "{ROOT}/scripts/annotate_coco.sh".
IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./
data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs
if [ -d ${output_dir} ];then
echo "dir already exists"
else
mkdir ${output_dir}
fi
CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
tools/annotate_coco.py \
--model-path ${model_path} \
--data-path ${data_path} \
--output-dir ${output_dir}
Start auto-annotation
bash scripts/annotate_coco.sh
By default, the annotation script only annotates the COCO train2017 set. To annotate val2017, find the following two lines (Line 167 and Line 168 of tools/annotate_coco.py) and replace 'train2017' with 'val2017'.
dataset = PoseCOCODataset(
data_path=os.path.join(args.data_path, 'annotations', 'person_keypoints_train2017.json'), # <- Line 167
multimodal_cfg=dict(image_folder=os.path.join(args.data_path, 'train2017'), # <- Line 168
data_augmentation=False,
image_size=336,),)
Annotation format
A list of dicts, each containing the following keys:
{
'file_name': '000000000009.jpg',
'image_id': 9,
'keypoints': a 51-element list (17 keypoints x 3 values: x, y, v),
'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
'height': 640,
'width': 480,
'human_bbox': [126, 258, 150, 305],
'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
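In both formats, the flat keypoints list can be unpacked into per-joint triples for inspection. A small sketch (unpack_keypoints is a hypothetical helper, assuming the standard 17-joint COCO keypoint order):

```python
def unpack_keypoints(flat):
    """Reshape the 51-element keypoint list into 17 (x, y, v) triples.

    v is the COCO visibility flag: 0 = not labeled,
    1 = labeled but occluded, 2 = labeled and visible.
    """
    assert len(flat) == 51, "expected 17 keypoints x 3 values"
    return [tuple(flat[i:i + 3]) for i in range(0, 51, 3)]
```

This makes it easy to, for example, count how many joints of an annotated person are labeled as visible.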