From Extraction to Generation: A Filming Controller for Description-to-Trajectory Translation

– Introduction

This project develops an automated filming controller capable of translating natural language descriptions into camera trajectories, bridging the gap between human perception and autonomous cinematography. The project progresses through three phases: refining 3D Human Pose Estimation (HPE) for robotic applications, improving global camera and human trajectory estimation, and developing a pipeline for dataset creation and generative model training. Key challenges addressed include the computational inefficiency and lack of robustness (especially to occlusion and partial views) of existing HPE methods in robotic contexts, and inaccuracies in global trajectory estimation, particularly concerning metric scale and rapid camera motion.

Methodologies employed include optimizing the detection backbone in HPE pipelines to significantly enhance processing speed (>13x improvement demonstrated) and integrating temporal processing modules (inspired by VIMO) to improve robustness against occlusion. Global camera trajectory accuracy was enhanced by incorporating metric depth priors into the estimation process, effectively addressing scale ambiguity issues observed in baseline methods like SLAHMR. A systematic pipeline was developed to curate a multimodal dataset by filtering existing video-caption datasets (MSR-VTT, MSVD, VATEX, LSMDC) based on image composition assessment, shot transition detection, and Large Language Model (LLM)-based action sequence alignment. Finally, the design for a conditional diffusion model using a Transformer architecture is proposed to generate camera trajectories from the curated action captions.
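
Since the conditional diffusion model is only proposed at this stage, the following is a minimal sketch of what such a Transformer-based denoiser could look like. The 7-value per-frame trajectory parameterisation (3D position plus quaternion), the module sizes, and the use of a precomputed caption embedding are illustrative assumptions, not the project's finalised architecture.

```python
# Minimal sketch of a Transformer-based conditional diffusion denoiser for
# description-to-trajectory generation. Dimensions, the per-frame trajectory
# parameterisation, and the caption-embedding interface are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajectoryDenoiser(nn.Module):
    """Predicts the noise added to a camera trajectory, conditioned on a caption
    embedding and a diffusion timestep (standard noise-prediction objective)."""

    def __init__(self, traj_dim=7, d_model=256, n_heads=4, n_layers=6,
                 text_dim=512, max_len=256):
        super().__init__()
        self.traj_in = nn.Linear(traj_dim, d_model)          # embed noisy trajectory frames
        self.text_in = nn.Linear(text_dim, d_model)          # project caption embedding
        self.time_in = nn.Sequential(                        # embed scalar diffusion timestep
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.traj_out = nn.Linear(d_model, traj_dim)          # per-frame noise prediction

    def forward(self, noisy_traj, t, text_emb):
        # noisy_traj: (B, T, traj_dim), t: (B,), text_emb: (B, text_dim)
        cond = self.text_in(text_emb) + self.time_in(t.float().unsqueeze(-1))
        x = torch.cat([cond.unsqueeze(1), self.traj_in(noisy_traj)], dim=1)
        x = x + self.pos[:, : x.shape[1]]
        return self.traj_out(self.encoder(x)[:, 1:])          # drop the condition token


def training_step(model, traj, text_emb, alphas_cumprod):
    """One DDPM-style step: add noise at a random timestep and regress it."""
    b = traj.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=traj.device)
    noise = torch.randn_like(traj)
    a = alphas_cumprod[t].view(b, 1, 1)
    noisy = a.sqrt() * traj + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, t, text_emb), noise)
```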

Experimental results validate the effectiveness of the HPE optimizations and the metric-depth-enhanced trajectory estimation through qualitative assessments. The dataset curation pipeline components were successfully implemented and demonstrated. The project lays the groundwork for intelligent, language-driven filming systems by improving foundational perception tasks and establishing a methodology for training description-to-trajectory models. Future work includes integrating camera intrinsic estimation, training the proposed generative model, and conducting comprehensive quantitative evaluations.

– Poster

– Video

– Conclusion

In conclusion, this project demonstrates the potential of natural-language-driven automated filming controllers, contributing to computer vision, robotics, and cinematography through advances in 3D human pose estimation (HPE), global trajectory estimation, dataset curation, and generative model design.

Significant computational bottlenecks were identified in state-of-the-art human mesh recovery (HMR) models such as HMR 2.0 and TokenHMR when applied to robotics scenarios. By optimizing the detection backbone within the processing pipeline, a substantial (>13x) speedup was achieved, raising the system's frame rate from approximately 0.13 FPS to 1.78 FPS and making it more viable for near real-time applications. Furthermore, the integration of temporal processing modules inspired by VIMO improved the robustness of 3D HPE, particularly in handling the partial occlusions common in human-robot interaction tasks, and led to more plausible and temporally consistent pose estimates, as demonstrated qualitatively.
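
As an illustration of how the restructured pipeline fits together, the sketch below assumes a single tracked person per frame and uses hypothetical detector and HPE callables (not the actual HMR 2.0 / TokenHMR interfaces); a simple exponential moving average stands in for the VIMO-inspired learned temporal module, only to convey the idea of temporal consistency.

```python
# Illustrative per-frame pipeline: a lightweight person detector produces a crop
# that is passed to the mesh-recovery network, followed by temporal smoothing.
# `detector` and `hpe_model` are hypothetical placeholders.
import numpy as np


class PoseSmoother:
    """Exponential moving average over pose parameters; a crude stand-in for the
    learned temporal module, used here only to illustrate temporal consistency."""

    def __init__(self, alpha=0.7):
        self.alpha = alpha
        self.prev = None

    def update(self, pose):
        pose = np.asarray(pose, dtype=float)
        self.prev = pose if self.prev is None else self.alpha * pose + (1 - self.alpha) * self.prev
        return self.prev


def run_pipeline(frames, detector, hpe_model):
    """detector(frame) -> (x0, y0, x1, y1) or None; hpe_model(crop) -> pose params."""
    smoother = PoseSmoother()
    poses = []
    for frame in frames:
        box = detector(frame)              # cheap detector: the main source of speedup
        if box is None:
            poses.append(smoother.prev)    # hold the last estimate under full occlusion
            continue
        x0, y0, x1, y1 = box
        poses.append(smoother.update(hpe_model(frame[y0:y1, x0:x1])))
    return poses
```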

Recognizing inaccuracies in baseline global trajectory estimation methods like SLAHMR, especially during rapid camera rotations, metric depth priors were successfully integrated into the pipeline. Qualitative results demonstrated that this approach yields camera trajectories more consistent with visual ground truth, improving the accuracy of world-coordinate mapping.
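
The core idea of the metric depth integration can be illustrated with a simple least-squares scale alignment between the reconstruction's up-to-scale depths and a metric depth prior. This is a sketch of the principle under that assumption, not the exact modification made to the SLAHMR pipeline.

```python
# Resolving scale ambiguity with a metric depth prior: fit a single scale factor
# by least squares, then apply it to the up-to-scale camera trajectory.
import numpy as np


def fit_metric_scale(relative_depth, metric_depth, valid_mask=None):
    """Least-squares scale s minimising || s * relative_depth - metric_depth ||^2
    over valid pixels; closed form s = (r . m) / (r . r)."""
    if valid_mask is None:
        valid_mask = np.isfinite(relative_depth) & np.isfinite(metric_depth)
    r = relative_depth[valid_mask].ravel()
    m = metric_depth[valid_mask].ravel()
    return float(np.dot(r, m) / max(np.dot(r, r), 1e-8))


def rescale_trajectory(cam_positions, scale):
    """Apply the recovered metric scale to up-to-scale camera positions (T, 3)."""
    return np.asarray(cam_positions) * scale
```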

Additionally, a systematic pipeline was developed and validated for creating a specialized dataset linking action captions to 3D motion data. This involved implementing automated filters for image composition assessment (using SAMP-Net) and shot transition detection (using TransNetV2), along with leveraging an LLM to align and synthesize action sequences from diverse source captions (MSR-VTT, MSVD, VATEX, LSMDC). This pipeline provides a foundation for generating the data needed to train description-to-trajectory models.
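
A schematic of the curation logic is sketched below; detect_shot_transitions, score_composition, and align_captions_llm are hypothetical wrappers around TransNetV2, SAMP-Net, and the LLM prompt respectively, and the composition threshold is illustrative rather than the value used in the project.

```python
# Skeleton of the dataset curation pipeline: reject clips with cuts or poor
# composition, then merge the source captions into one ordered action sequence.
def curate_clip(video_path, captions,
                detect_shot_transitions,   # video -> list of cut timestamps (hypothetical)
                score_composition,         # video -> composition score (hypothetical)
                align_captions_llm,        # captions -> action sequence or None (hypothetical)
                min_composition=0.5):
    # 1. Require a single continuous shot so one caption matches one camera trajectory.
    if detect_shot_transitions(video_path):
        return None
    # 2. Reject clips with poor image composition.
    if score_composition(video_path) < min_composition:
        return None
    # 3. Ask the LLM to align and merge the source captions into an ordered
    #    action sequence; drop the clip if the captions cannot be reconciled.
    action_sequence = align_captions_llm(captions)
    if action_sequence is None:
        return None
    return {"video": video_path, "actions": action_sequence}
```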

In summary, this project demonstrated effective strategies for improving the efficiency and robustness of 3D human perception, enhancing the accuracy of global camera tracking, and establishing a methodology for curating multimodal datasets tailored for generating camera trajectories from language. These advancements lay the groundwork for more intelligent and intuitive automated filming systems.