This is the official repository of the NeurIPS 2025 paper Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization.
If you use this work for research, please consider citing our paper as follows:
@inproceedings{
guo2025learning,
title={Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization},
author={Jian-Ting Guo and Yu-Cheng Chen and Ping-Chun Hsieh and Kuo-Hao Ho and Po-Wei Huang and Ti-Rong Wu and I-Chen Wu},
booktitle={Thirty-ninth Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=1A4Nlibwl5}
}
We propose a human‑likeness‑aware framework called Macro Action Quantization (MAQ), which consists of two components: (1) Human behavior distillation and (2) Reinforcement learning with Macro Actions.
The following instructions are prepared for reproducing the main experiments in the paper.
Human-like reinforcement learning remains underexplored in the RL community. Most research focuses on designing reward-driven agents; only a few studies investigate human-like RL that seeks both human-like behavior and optimal performance. However, most of these methods rely on pre‑defined behavior constraints or rule‑based penalties, requiring substantial effort for hand‑crafted design.
| MAQ+RLPD (Ours) | RLPD |
|---|---|
| ![]() | ![]() |
| **Human-like RL**: Opens the door naturally. | **Reward-driven RL**: Uses an unnatural "backhand" strategy to open the door, maximizing reward but sacrificing naturalness. |
The program requires a Linux platform with at least one NVIDIA GPU to operate. For training RLPD, the CUDA version must be newer than 12.0.
Clone this repository with the required submodules:
git clone --recursive https://github.com/rlglab/maq.git
cd maq

Enter the container to continue with the instructions:
# start the container
./scripts/start-container.sh

Note
All of the following instructions must be executed inside the container.
Note
This section is optional: review it only if you are using a customized dataset. If you are using the default dataset, you can skip it.
The datasets used for training and testing must be saved as .pkl files in the offline_data/ folder and must follow the format below:
[
    { # Trajectory 1
        'observations': np.array([...]),       # Shape: (T, obs_dim)
        'actions': np.array([...]),            # Shape: (T, action_dim)
        'rewards': np.array([...]),            # Shape: (T, )
        'next_observations': np.array([...]),  # Shape: (T, obs_dim)
        'terminals': np.array([...])           # Shape: (T, )
    },
    ... # Trajectory 2
]
# T is the trajectory length, obs_dim is the observation dimension, and action_dim is the action dimension.

For verification, you can run:
python3 offline_data/check_dataset_format.py --dataset_path "your_dataset.pkl"

If the dataset is valid, it will print "is legal"; otherwise, it will print "illegal: error_message".
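For reference, here is a minimal sketch of writing a custom dataset in this format. The array sizes and the my_dataset.pkl file name are placeholders; only the structure is meant to match the listing above, and dtype conventions may differ in your pipeline.

```python
# Minimal sketch: build a dataset in the expected format and save it under offline_data/.
# Dimensions and the file name are illustrative placeholders.
import pickle
import numpy as np

T, obs_dim, action_dim = 100, 39, 28  # example sizes only

trajectory = {
    'observations': np.zeros((T, obs_dim), dtype=np.float32),
    'actions': np.zeros((T, action_dim), dtype=np.float32),
    'rewards': np.zeros((T,), dtype=np.float32),
    'next_observations': np.zeros((T, obs_dim), dtype=np.float32),
    'terminals': np.zeros((T,), dtype=np.float32),
}
dataset = [trajectory]  # one dict per trajectory

with open('offline_data/my_dataset.pkl', 'wb') as f:
    pickle.dump(dataset, f)
```

You can then validate the file with the check_dataset_format.py command shown above.
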
To reproduce the results in the paper, please run:
# For MAQ based methods
# Section 5.2.2: MAQ+RLPD in door task (with macro action length=9 and codebook size=16)
./scripts/train.sh --method "MAQ+RLPD" --sequence_length 9 --codebook_size 16 --environment "door-human-v1" --seed 1
# Section 5.2.2: MAQ+IQL in door task (with macro action length=9 and codebook size=16)
./scripts/train.sh --method "MAQ+IQL" --sequence_length 9 --codebook_size 16 --environment "door-human-v1" --seed 1
# Section 5.2.2: MAQ+DSAC in door task (with macro action length=8 and codebook size=8)
./scripts/train.sh --method "MAQ+DSAC" --sequence_length 8 --codebook_size 8 --environment "door-human-v1" --seed 1
# For baseline methods
# Section 5.2.2: SAC in door task
./scripts/train.sh --method "SAC" --environment "door-human-v1" --seed 1
# Section 5.2.2: RLPD in door task
./scripts/train.sh --method "RLPD" --environment "door-human-v1" --seed 1
# Section 5.2.2: IQL in door task
./scripts/train.sh --method "IQL" --environment "door-human-v1" --seed 1

For detailed parameters, please refer to the following table:
| Parameter | Flag | Description | Default |
|---|---|---|---|
| Method | `--method` | Training method (`MAQ+RLPD`, `MAQ+IQL`, `MAQ+DSAC`, `SAC`, `RLPD`, `IQL`) | `MAQ+RLPD` |
| Environment | `-env`, `--environment` | D4RL environment name (e.g., `door-human-v1`, `hammer-human-v1`, `pen-human-v1`, `relocate-human-v1`) | `door-human-v1` |
| Sequence Length | `-seqlen`, `--sequence_length` | Macro action length | `9` |
| Codebook Size | `-cbsz`, `--codebook_size` | VQ-VAE codebook size | `16` |
| Seed | `-s`, `--seed` | Random seed | `1` |
| Training Source | `-trs`, `--training_source` | Training dataset path (relative to `offline_data/`) | `""` (defaults to environment dataset provided by D4RL) |
| Testing Source | `-tes`, `--testing_source` | Testing dataset path (relative to `offline_data/`) | `""` (defaults to environment dataset provided by D4RL) |
| Tag | `-t`, `--tag` | Tag for the experiment (e.g., date) | `""` |
| Auto Evaluate | `--auto_evaluate` | Automatically evaluate the trained model after training | `False` |
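As a usage note for the dataset flags, the sketch below launches training on a customized dataset from Python (equivalently, pass the same flags to train.sh on the command line). The my_train.pkl and my_test.pkl file names are hypothetical placeholders.

```python
# Sketch: train MAQ+RLPD on a customized dataset by invoking train.sh with the flags
# documented in the table above. The dataset file names are hypothetical placeholders.
import subprocess

subprocess.run(
    ["./scripts/train.sh",
     "--method", "MAQ+RLPD",
     "--environment", "door-human-v1",
     "--sequence_length", "9",
     "--codebook_size", "16",
     "--seed", "1",
     "--training_source", "my_train.pkl",  # path relative to offline_data/
     "--testing_source", "my_test.pkl"],   # path relative to offline_data/
    check=True,
)
```
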
This section introduces the files generated by different training methods.
For example, the following command demonstrates training MAQ+RLPD with macro action length=9 and codebook size=16 on the door task:
./scripts/train.sh --method "MAQ+RLPD" --sequence_length 9 --codebook_size 16 --environment "door-human-v1" --seed 1
You will obtain a folder:
RLPD_MAQ
└── log/
    └── exp_seed1_sq9_k16/
        └── door-human-v1/
            └── RLPDMAQ/
                ├── config.json
                ├── events.out.tfevents.*
                ├── experiment_data.csv
                └── model_<episode>_<normalized reward>.pth
If you also train MAQ+DSAC with the same environment, seed, sequence length and codebook size:
./scripts/train.sh --method "MAQ+DSAC" --sequence_length 9 --codebook_size 16 --environment "door-human-v1" --seed 1
You will obtain another folder in the same parent path:
DSAC_MAQ
└── log/
    └── exp_seed1_sq9_k16/
        └── door-human-v1/
            ├── RLPDMAQ/
            └── DSACMAQ/
                ├── config.json
                ├── events.out.tfevents.*
                ├── experiment_data.csv
                └── model_<episode>_<normalized reward>.pth
We implemented IQL and MAQ+IQL based on IQL-PyTorch. If you run MAQ+IQL, you will obtain a folder:
IQL
└── log/
    └── door-human-v1/
        └── MAQ_iqlseed1_sq9_k16_seed1/
            ├── checkpoint_<episode>.pt
            ├── config.yaml
            └── events.out.tfevents.*
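The events.out.tfevents.* files are standard TensorBoard event logs, so besides running tensorboard --logdir, you can read them programmatically. A minimal sketch using TensorBoard's EventAccumulator is shown below; the log path follows the tree above, and scalar tag names vary by method, so list them first.

```python
# Sketch: read scalar curves from a TensorBoard event file with EventAccumulator.
# The log directory follows the IQL tree above; adjust it to your own run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "IQL/log/door-human-v1/MAQ_iqlseed1_sq9_k16_seed1"

acc = EventAccumulator(log_dir)
acc.Reload()

print("Available scalar tags:", acc.Tags()["scalars"])
for tag in acc.Tags()["scalars"]:
    events = acc.Scalars(tag)
    print(f"{tag}: {len(events)} points, last value = {events[-1].value}")
```
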
We implemented SAC based on stable-baselines3. For example, the following command demonstrates training SAC on the door task with seed 1:
./scripts/train.sh --method "SAC" --environment "door-human-v1" --seed 1
You will obtain a log folder under SAC/:
SAC
└── log/
    └── door-human-v1/
        └── SAC_seed1/
            ├── checkpoints.csv
            ├── SAC_<exp_number>/
            │   └── events.out.tfevents.*
            └── SAC.csv
We implemented RLPD based on ikostrikov's RLPD repository.
We also made small changes to the dataset loading to make it compatible with customized datasets.
Before training RLPD, you need to set up its training environment. You can refer to the original instructions or run the following commands in our container:
# in workspace/
pip install -r rlpd/requirements.txt
pip install jax==0.4.14 jaxlib==0.4.14+cuda11.cudnn86 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install optax
pip install flax
pip install git+https://github.com/ikostrikov/dmcgym.git
pip uninstall mujoco-py
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/.mujoco/mjpro150/bin
pip install git+https://github.com/ikostrikov/dmcgym.git
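After installing, a quick sanity check (not part of the provided scripts) confirms that JAX can see the GPU:

```python
# Sanity check: confirm the JAX/CUDA installation detects the GPU.
import jax
print(jax.devices())  # should list GPU devices, not only CpuDevice
```
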
For training RLPD with your datasets, you can use the following command:

XLA_PYTHON_CLIENT_PREALLOCATE=false python train_finetuning.py --env_name=<your env> \
--utd_ratio=20 \
--start_training 10000 \
--max_steps 1000000 \
--config=configs/rlpd_config.py \
--project_name=<project name store in the wandb> \
--seed=<your seed> \
--checkpoint_model=true \
--log_dir=/workspace/rlpd/<your log> \
--training_dataset=<your training dataset stored in ./offline_data/>

We also provide a simple command to run the experiments with specific combinations of macro action length, codebook size, and method. You can use the following command:
./scripts/lazy_rerun.sh

Note
This command runs all combinations of macro action lengths, codebook sizes, and methods to reproduce the experiments in Section 5.2.3. It may take several hours to finish all combinations (108 experiments in total).
You can also customize the combinations of environments, methods, macro action lengths, codebook sizes, and seeds in lazy_rerun.sh by modifying the following variables:
ENVS=("door-human-v1" "pen-human-v1" "hammer-human-v1" "relocate-human-v1") # "door-human-v1" "pen-human-v1" "hammer-human-v1" "relocate-human-v1"
METHODS=("MAQ+RLPD") # "MAQ+DSAC" "MAQ+IQL"
SEQS=(1 2 3 4 5 6 7 8 9)
KS=(16) # 8 16 32
SEEDS=(1 10 100) # 1 10 100
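If you prefer to drive such a sweep from Python instead of editing the script, a minimal sketch (assuming scripts/train.sh is invoked from the repository root with the flags documented above) could look like this:

```python
# Sketch: run a sweep over environments, methods, macro action lengths, codebook sizes,
# and seeds by calling scripts/train.sh directly (mirrors the variables above).
import itertools
import subprocess

ENVS = ["door-human-v1", "pen-human-v1", "hammer-human-v1", "relocate-human-v1"]
METHODS = ["MAQ+RLPD"]
SEQS = [1, 2, 3, 4, 5, 6, 7, 8, 9]
KS = [16]
SEEDS = [1, 10, 100]

for env, method, seq, k, seed in itertools.product(ENVS, METHODS, SEQS, KS, SEEDS):
    subprocess.run(
        ["./scripts/train.sh",
         "--method", method,
         "--environment", env,
         "--sequence_length", str(seq),
         "--codebook_size", str(k),
         "--seed", str(seed)],
        check=True,
    )
```
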
After training the agents, you can evaluate the methods by using the following commands:

# For MAQ based methods
# Section 5.2.2: MAQ+RLPD in door task (with macro action length=9 and codebook size=16)
./scripts/evaluate.sh --method "MAQ+RLPD" --sequence_length 9 --codebook_size 16 --environment "door-human-v1" --seed 1
# Section 5.2.2: MAQ+IQL in door task (with macro action length=9 and codebook size=16)
./scripts/evaluate.sh --method "MAQ+IQL" --sequence_length 9 --codebook_size 16 --environment "door-human-v1" --seed 1
# Section 5.2.2: MAQ+DSAC in door task (with macro action length=8 and codebook size=8)
./scripts/evaluate.sh --method "MAQ+DSAC" --sequence_length 8 --codebook_size 8 --environment "door-human-v1" --seed 1
# For baseline methods
# Section 5.2.2: SAC in door task
./scripts/evaluate.sh --method "SAC" --environment "door-human-v1" --seed 1
# Section 5.2.2: IQL in door task
./scripts/evaluate.sh --method "IQL" --environment "door-human-v1" --seed 1
# Section 5.2.2: RLPD in door task
./scripts/evaluate.sh --method "RLPD" --environment "door-human-v1" --seed 1 --model_path "<your model path>"

For detailed parameters, please refer to the following table:
| Parameter | Flag | Description | Default |
|---|---|---|---|
| Method | `--method` | Evaluation method (`MAQ+RLPD`, `MAQ+IQL`, `MAQ+DSAC`, `SAC`, `RLPD`, `IQL`) | `MAQ+RLPD` |
| Environment | `-env`, `--environment` | D4RL environment name (e.g., `door-human-v1`, `hammer-human-v1`, `pen-human-v1`, `relocate-human-v1`) | `door-human-v1` |
| Sequence Length | `-seqlen`, `--sequence_length` | Macro action length | `9` |
| Codebook Size | `-cbsz`, `--codebook_size` | VQ-VAE codebook size | `16` |
| Seed | `-s`, `--seed` | Random seed | `1` |
| Training Source | `-trs`, `--training_source` | Training dataset path (relative to `offline_data/`) | `""` (defaults to environment dataset provided by D4RL) |
| Testing Source | `-tes`, `--testing_source` | Testing dataset path (relative to `offline_data/`) | `""` (defaults to environment dataset provided by D4RL) |
| Tag | `-t`, `--tag` | Tag for the experiment (e.g., date) | `""` |
| Model Path | `-path`, `--model_path` | The path of the model to be evaluated | `""` |
Note
To evaluate RLPD, you must specify the path of the RLPD model to be evaluated. The other baseline methods locate the model path automatically.
After evaluating the methods, you can find the evaluation results in the evaluation_results/ folder.
For example, after evaluating the agent on the door task, you will obtain:
evaluation_results/
└── agent_performance_door-human-v1.csv
The file contains a comprehensive list of metrics. Here is a detailed breakdown of the headers:
| Category | Header Keyword | Description |
|---|---|---|
| Basic Info | `agent_type` | The type of the agent (e.g., MAQ+RLPD, SAC). |
| | `env_id` | The name of the environment (e.g., door-human-v1). |
| | `eval_episodes` | Total number of episodes used for this evaluation run. |
| | `base_seed` | The starting seed used for evaluation. |
| | `model_path` | The file path of the model being evaluated. |
| | `horizon` / `default_horizon` | The maximum steps allowed per episode (custom vs. default). |
| Performance | `normalized_score` | The D4RL normalized score (0-100); higher is better. |
| | `raw_reward` | The raw environment reward. |
| | `success_rate` | The fraction of episodes where the task was successfully completed (0.0 to 1.0). |
| | `early_termination_rate` | Fraction of episodes terminated early due to success. |
| | `length` | The duration (number of steps) of the episode. |
| Human Similarity | `dtw_state_dtw_mean` | Dynamic Time Warping (State): Average distance between agent states and human states. Lower is more human-like. |
| | `dtw_action_dtw_mean` | Dynamic Time Warping (Action): Average distance between agent actions and human actions. Lower is more human-like. |
| | `wasserstein_state_w2_dist` | Wasserstein Distance (State): Distance between the distribution of agent states and human states. Lower is more human-like. |
| | `wasserstein_action_w2_dist` | Wasserstein Distance (Action): Distance between the distribution of agent actions and human actions. Lower is more human-like. |
| | `min_dtw_...` | Minimum DTW distance found among the human reference trajectories. |
The headers follow a structured naming convention to give granular insights:
`[Condition]_[Metric Name]_[Statistic]`

1. Conditions (Prefixes)
   - (No Prefix): Aggregated over all episodes.
   - `success_`: Aggregated only over successful episodes.
   - `non_success_`: Aggregated only over failed episodes.
2. Statistics (Suffixes)
   - `_mean` / `_agg_mean`: The average value.
   - `_std` / `_agg_std`: The standard deviation.
   - `_max`: The maximum value observed.
   - `_min`: The minimum value observed.
   - `_exceed_default_pct`: Percentage of episodes longer than the default horizon.

Examples:
- `success_length_mean`: Average length of successful episodes.
- `non_success_dtw_action_dtw_mean_max`: The maximum action DTW distance observed in any failed episode.
- `normalized_score_mean`: Average normalized score across all episodes.
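For quick inspection of these results, here is a short sketch using pandas. The column names are assumed to match the header keywords described above; print df.columns first to confirm.

```python
# Sketch: load the evaluation CSV and print a few of the documented metrics.
# Column names are assumed to match the header keywords described above.
import pandas as pd

df = pd.read_csv("evaluation_results/agent_performance_door-human-v1.csv")
print(df.columns.tolist())  # inspect the exact header names first

cols = [c for c in ["agent_type", "normalized_score_mean", "success_rate",
                    "dtw_state_dtw_mean", "wasserstein_action_w2_dist"] if c in df.columns]
print(df[cols])
```
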



