The system plans complex tasks, acts under user supervision, and supports multimodal inputs and outputs such as text, images, video, ...