
LLM-based auto annotation for the HICO-DET dataset (pose from Halpe, part states from HAKE).

Environment

The code is developed with Python 3.11.11 on Ubuntu 21.xx, using torch==2.6.0+cu124 and transformers==4.57.3 (which supports the Qwen3 series).

Annotating HICO-Det

A. Installation

  1. Install the required packages and dependencies.
  2. Clone this repo; we'll refer to the cloned directory as ${ROOT}.
  3. Create the necessary directories:
    mkdir outputs
    mkdir model_weights
    
  4. Download the LLM weights from Hugging Face into model_weights.

B. Prepare Dataset

  1. Install COCO API:
    pip install pycocotools
    
  2. Download dataset.
  3. Organize the dataset; your directory tree should look like this (there may be extra files):
    {DATA_ROOT}
    |-- Annotation
    |   |--hico-det-instance-level
    |   |    |--hico-det-training-set-instance-level.json
    |   `--hico-fullbody-pose
    |        |--halpe_train_v1.json
    |        `--halpe_val_v1.json
    |-- Configs
    |   |--hico_hoi_list.txt
    |   `--Part_State_76.txt
    |-- Images
    |   |--images
    |       |--test2015
    |       |   |--HICO_test2015_00000001.jpg
    |       |   |--HICO_test2015_00000002.jpg
    |       |   ...
    |       `--train2015
    |           |--HICO_train2015_00000001.jpg
    |           |--HICO_train2015_00000002.jpg
    |           ...
    `-- Logic_Rules
         |--gather_rule.pkl
         `--read_rules.py
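Before launching annotation, the tree above can be sanity-checked programmatically. The sketch below is illustrative only (the `check_layout` helper and the exact path list are not part of the repo):

```python
import os

# Relative paths the HICO-Det annotation code is expected to read under {DATA_ROOT}.
EXPECTED_PATHS = [
    "Annotation/hico-det-instance-level/hico-det-training-set-instance-level.json",
    "Configs/hico_hoi_list.txt",
    "Configs/Part_State_76.txt",
    "Images/images/train2015",
    "Images/images/test2015",
    "Logic_Rules/gather_rule.pkl",
]

def check_layout(data_root):
    """Return the expected paths that are missing under data_root."""
    return [p for p in EXPECTED_PATHS
            if not os.path.exists(os.path.join(data_root, p))]
```

Running `check_layout({DATA_ROOT})` and fixing any reported paths is cheaper than discovering a missing file mid-run on multiple GPUs.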
    

C. Start annotation

Set data_path, model_path, and output_dir (default 'outputs') to match your configuration in "{ROOT}/scripts/annotate_hico.sh".

IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./

data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs

if [ -d ${output_dir} ];then
    echo "dir already exists"
else
    mkdir ${output_dir}
fi

CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
    tools/annotate_hico.py \
    --model-path ${model_path} \
    --data-path ${data_path} \
    --output-dir ${output_dir}

Start auto-annotation

bash scripts/annotate_hico.sh

D. Multi-stage HICO pipeline

The repository now supports a 3-stage HICO workflow:

  1. Long description generation
  2. Description refinement
  3. Description examination / checking

Each stage writes per-rank JSON files first, then merges them into one JSON file for the next stage.
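Conceptually, the merge step just concatenates the per-rank JSON lists in rank order. A minimal sketch of what tools/merge_json_outputs.py does (the `merge_rank_files` helper is illustrative, not the repo's actual function):

```python
import glob
import json
import os

def merge_rank_files(input_dir, pattern, output_path):
    """Concatenate per-rank JSON lists (e.g. labels_0.json, labels_1.json)
    into a single JSON list for the next stage."""
    merged = []
    # sorted() keeps rank order: labels_0.json before labels_1.json, etc.
    for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
        with open(path) as f:
            merged.extend(json.load(f))
    with open(output_path, "w") as f:
        json.dump(merged, f)
    return merged
```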

Stage 1. Generate long descriptions

This is the original HICO annotation stage. It uses Conversation in data/convsersation.py.

Run:

bash scripts/annotate_hico.sh

This creates per-rank files such as:

outputs/labels_0.json
outputs/labels_1.json

Merge them with:

python3 tools/merge_json_outputs.py \
    --input-dir outputs \
    --pattern "labels_*.json" \
    --output-path outputs/merged_labels.json

Stage 2. Refine generated descriptions

This stage reads a merged JSON from Stage 1 and adds a refined_description field. It uses Conversation_For_Clean_Descrption in data/convsersation.py.

Modify data_path, model_path, annotation_path, and output_dir in scripts/refine_hico.sh, then run:

bash scripts/refine_hico.sh

This creates files such as:

outputs/refine/refine_labels_0.json

Merge them with:

python3 tools/merge_json_outputs.py \
    --input-dir outputs/refine \
    --pattern "refine_labels_*.json" \
    --output-path outputs/merged_refine.json

Stage 3. Examine / check generated descriptions

This stage reads a merged JSON from Stage 2 and adds an examiner_result field. It uses Conversation_examiner in data/convsersation.py.

Modify data_path, model_path, annotation_path, and output_dir in scripts/examine_hico.sh, then run:

bash scripts/examine_hico.sh

This creates files such as:

outputs/examiner/examiner_labels_0.json

Merge them with:

python3 tools/merge_json_outputs.py \
    --input-dir outputs/examiner \
    --pattern "examiner_labels_*.json" \
    --output-path outputs/merged_examine.json

One-shot pipeline

If you want to run all 3 stages end-to-end, use:

bash scripts/pipeline_hico.sh

Before running it, edit the following variables in scripts/pipeline_hico.sh:

  • DATA_PATH
  • LONG_MODEL_PATH
  • REFINE_MODEL_PATH
  • EXAMINE_MODEL_PATH
  • LONG_GPU_IDS
  • REFINE_GPU_IDS
  • EXAMINE_GPU_IDS
  • LONG_NPROC
  • REFINE_NPROC
  • EXAMINE_NPROC

The pipeline will produce:

  • outputs/pipeline/merged_long.json
  • outputs/pipeline/merged_refine.json
  • outputs/pipeline/merged_examine.json

E. Using different VLM backends

The HICO scripts are no longer hardcoded to Qwen only. The model loading logic is centralized in tools/vlm_backend.py, so you can use different VLM families for long-description generation, refinement, and examination.

The following scripts support backend selection:

  • tools/annotate_hico.py
  • tools/refine_hico.py
  • tools/examine_hico.py
  • tools/clean_initial_annotation.py

Each of them accepts:

  • --model-path
  • --model-backend
  • --torch-dtype

Examples:

torchrun --nnodes=1 --nproc_per_node=1 tools/annotate_hico.py \
  --model-path /path/to/model \
  --model-backend auto \
  --torch-dtype bfloat16 \
  --data-path ../datasets/HICO-Det \
  --output-dir outputs/test \
  --max-samples 5

You may also force a backend explicitly, for example:

--model-backend qwen3_vl
--model-backend qwen3_vl_moe
--model-backend llava
--model-backend deepseek_vl
--model-backend hf_vision2seq
--model-backend hf_causal_vlm

Where to customize for a new model

If you want to adapt the repository to a new model family, the main file to edit is:

  • tools/vlm_backend.py

This file controls:

  • backend detection: infer_model_backend(...)
  • model/processor loading: load_model_and_processor(...)
  • prompt/image packaging: build_batch_tensors(...)
  • output decoding: decode_generated_text(...)

In most cases, you do not need to change the HICO task scripts themselves.
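For intuition, backend detection can be as simple as substring matching on the model path. The following is a hypothetical sketch only; the real infer_model_backend(...) in tools/vlm_backend.py may inspect the model's config.json instead:

```python
# Hypothetical name-based backend detection. Order matters: the more
# specific "qwen3-vl-moe" hint must be checked before plain "qwen3-vl".
_BACKEND_HINTS = [
    ("qwen3-vl-moe", "qwen3_vl_moe"),
    ("qwen3-vl", "qwen3_vl"),
    ("llava", "llava"),
    ("deepseek-vl", "deepseek_vl"),
]

def infer_backend_from_name(model_path):
    """Guess a backend name from the model directory name."""
    name = model_path.lower()
    for hint, backend in _BACKEND_HINTS:
        if hint in name:
            return backend
    # Fall back to a generic Hugging Face vision2seq backend.
    return "hf_vision2seq"
```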

How to add a new model backend

There are three common situations.

  1. The model already works with Hugging Face AutoProcessor and AutoModelForVision2Seq or AutoModelForCausalLM. In that case, you may only need to run with:

    --model-backend auto
    

    or explicitly:

    --model-backend hf_vision2seq
    

    or:

    --model-backend hf_causal_vlm
    
  2. The model needs custom backend detection. Add a rule inside infer_model_backend(...) in tools/vlm_backend.py.

  3. The model needs a custom class or custom multimodal input format. Add a new branch inside:

    • load_model_and_processor(...)
    • build_batch_tensors(...)
    • decode_generated_text(...) if needed

Rule of thumb

  • If you want to change task behavior or prompting, edit data/convsersation.py.
  • If you want to support a new model family, edit tools/vlm_backend.py.
  • If you want to add a new stage, add a new script under tools/.

F. Annotation format

A list of dicts, each containing the following keys:

{
    'file_name': 'HICO_train2015_00009511.jpg',
    'image_id': 0,
    'keypoints': a 51-element list (17x3 keypoints with x, y, v),
    'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
    'instance_id':0,
    'action_labels': [{'human_part': part_id, 'partstate': state_id}, ...],
    'height': 640,
    'width': 480,
    'human_bbox': [126, 258, 150, 305],
    'object_bbox': [128, 276, 144, 313],
    'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
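When consuming the merged JSON, it can help to validate each record against the format above before training. A minimal sketch (the `is_valid_record` helper is illustrative; it checks only the core fields shared by both datasets):

```python
def is_valid_record(rec):
    """Check one annotation dict against the documented format."""
    required = {"file_name", "image_id", "keypoints", "vis",
                "height", "width", "human_bbox", "description"}
    if not required.issubset(rec):
        return False
    # 17 keypoints x (x, y, v) = 51 values, and one visibility entry each.
    return len(rec["keypoints"]) == 51 and len(rec["vis"]) == 51
```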

After refinement and examination, extra fields may appear in the JSON:

{
    'refined_description': "A refined 2-3 sentence version aligned with the target HOI label.",
    'examiner_result': "Verdict: PASS or FAIL ..."
}
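If examiner_result begins with a verdict line as shown, the checked records can be filtered accordingly. A small sketch (`passes_examiner` and the exact "Verdict: PASS" wording are assumptions based on the example above):

```python
def passes_examiner(rec):
    """Treat a record as passing if its examiner verdict starts with PASS.
    The 'Verdict: PASS ...' format is an assumption from the example output."""
    verdict = rec.get("examiner_result", "")
    return verdict.upper().startswith("VERDICT: PASS")
```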

Annotate COCO

  1. Download COCO dataset.
  2. Organize the dataset; your directory tree should look like this (the files inside Configs are copied from HICO-Det):
    {DATA_ROOT}
    |-- annotations
    |   |--person_keypoints_train2017.json
    |   `--person_keypoints_val2017.json
    |-- Configs
    |   |--hico_hoi_list.txt
    |   `--Part_State_76.txt
    |-- train2017
    |   |--000000000009.jpg
    |   |--000000000025.jpg
    |   ...
    `-- val2017
        |--000000000139.jpg
        |--000000000285.jpg
        ...
    

Start annotation

Set data_path, model_path, and output_dir (default 'outputs') to match your configuration in "{ROOT}/scripts/annotate_coco.sh".

IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./

data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs

if [ -d ${output_dir} ];then
    echo "dir already exists"
else
    mkdir ${output_dir}
fi

CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
    tools/annotate_coco.py \
    --model-path ${model_path} \
    --data-path ${data_path} \
    --output-dir ${output_dir}

Start auto-annotation

bash scripts/annotate_coco.sh

By default, the annotation script only annotates the COCO train2017 set. To annotate val2017, find the following two lines (lines 167-168) in tools/annotate_coco.py and replace 'train2017' with 'val2017'.

dataset = PoseCOCODataset(
                data_path=os.path.join(args.data_path, 'annotations', 'person_keypoints_train2017.json'), # <- Line 167
                multimodal_cfg=dict(image_folder=os.path.join(args.data_path, 'train2017'), # <- Line 168
                        data_augmentation=False,
                        image_size=336,),)
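Instead of editing lines 167-168 by hand, the split could be derived from a single parameter. A hypothetical sketch (a `--split` flag does not currently exist in tools/annotate_coco.py; `split_paths` is illustrative):

```python
import os

def split_paths(data_path, split="train2017"):
    """Return (annotation_file, image_folder) for a COCO split
    ('train2017' or 'val2017'), so no code edit is needed."""
    ann_file = os.path.join(data_path, "annotations",
                            f"person_keypoints_{split}.json")
    return ann_file, os.path.join(data_path, split)
```

Both the annotation JSON path and the image folder would then follow from one choice of split, avoiding the two-line manual edit.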

Annotation format

A list of dicts, each containing the following keys:

{
    'file_name': '000000000009.jpg',
    'image_id': 9,
    'keypoints': a 51-element list (17x3 keypoints with x, y, v),
    'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
    'height': 640,
    'width': 480,
    'human_bbox': [126, 258, 150, 305],
    'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}