Chaojie's Homepage

Profile Photo

Mao Chaojie

I am currently a research scientist at Alibaba Tongyi Lab, where I focus on the research and application of foundational models, with a particular emphasis on multimodal generative models. My research interests include multimodal content understanding, multimodal editing, and multimodal generation. Relevant works from my team have been accepted at conferences such as CVPR, ICLR, NeurIPS, and AAAI.
From 2018 to 2022, I worked at Alibaba's DAMO Academy, focusing on the research and development of multimedia content understanding technologies. My work centered on the application of multimodal understanding algorithms in the media asset industry. The related capabilities are integrated into Alibaba Cloud's Multimedia AI product and have been widely applied in scenarios such as TV media asset management and internet search and recommendation systems.
Since 2023, I have been working at Alibaba's Tongyi Lab on research and development related to foundational generative models. My research directions include fine-tuning frameworks for base models, lightweight fine-tuning and controllable generation for generative models, and unified frameworks for multimodal generation (both diffusion-based and LLM-based). This has resulted in a series of works including ResTuning, SCEdit, ACE, ACE++, and VACE. The related capabilities and models have been open-sourced and deployed.

Project List

  • 2025.03: We propose VACE, an all-in-one model designed for video creation and editing. It encompasses various tasks, including reference-to-video generation (R2V), video-to-video editing (V2V), and masked video-to-video editing (MV2V), and allows users to compose these tasks freely. This enables users to explore diverse possibilities and streamline their workflows, and it offers a range of capabilities such as Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, Animate-Anything, and more. The code and paper are available at VACE.
  • 2025.02: Wan2.1 is an advanced and powerful visual generation model developed by Tongyi Lab of Alibaba Group. It can generate videos from text, images, and other control signals. The Wan2.1 series of models is now fully open-source.
  • 2025.01: We report ACE++, an instruction-based diffusion framework that tackles various image generation and editing tasks. The code and paper are available at ACE++.
  • 2024.10: We propose ACE, an All-round Creator and Editor, which achieves performance comparable to expert models across a wide range of visual generation tasks. This work has been accepted by ICLR 2025.
  • 2024.04: We propose a unified style editing method supporting text-based, exemplar-based, and compositional style editing, named StyleBooth.
  • 2024.03: We propose a unified image inpainting framework that supports text-guided, subject-guided, and text-subject-guided inpainting simultaneously, named LAR-Gen.
  • 2023.12: We propose an efficient and controllable generation framework, SCEdit, which has been accepted by CVPR 2024. We have also open-sourced the code and a series of models.
  • 2023.12: We release the 🪄SCEPTER library, a code framework for fine-tuning and controllable generation of generative models.
  • 2023.10: The foundational model fine-tuning framework ResTuning was accepted by NeurIPS 2023. The project homepage is ResTuning.
  • 2018-2022: We released the MultimediaAI services for multimedia understanding. MultimediaAI is an AI product that recognizes key structured information in multimedia content (including video, audio, image, and text). The extracted information covers video categories, celebrity recognition, keyword recognition, Optical Character Recognition (OCR), and scene and object tagging and detection. To use these capabilities, please refer to MultiMediaAI.

Publications

  • Jiang Z, Han Z, Mao C, et al. VACE: All-in-One Video Creation and Editing[J]. arXiv preprint arXiv:2503.07598, 2025.
  • Wang A, Ai B, Wen B, et al. Wan: Open and Advanced Large-Scale Video Generative Models[J]. arXiv preprint arXiv:2503.20314, 2025.
  • Mao C, Zhang J, Pan Y, et al. ACE++: Instruction-based image creation and editing via context-aware content filling[J]. arXiv preprint arXiv:2501.02487, 2025.
  • Han Z, Jiang Z, Pan Y, et al. ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer[J]. arXiv preprint arXiv:2410.00086, 2024.
  • Yang Z, Feng R, Yan K, et al. BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations[J]. arXiv preprint arXiv:2407.03314, 2024.
  • Han Z, Mao C, Jiang Z, et al. StyleBooth: Image style editing with multimodal instruction[J]. arXiv preprint arXiv:2404.12154, 2024.
  • Pan Y, Mao C, Jiang Z, et al. Locate, assign, refine: Taming customized image inpainting with text-subject guidance[J]. arXiv preprint arXiv:2403.19534, 2024.
  • Jiang Z, Mao C, Pan Y, et al. SCEdit: Efficient and controllable image diffusion generation via skip connection editing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 8995-9004.
  • Mao C, Jiang Z. Res-Attn: An Enhanced Res-Tuning Approach with Lightweight Attention Mechanism[J]. arXiv preprint arXiv:2312.16916, 2023.
  • Jiang Z, Mao C, Huang Z, et al. Res-tuning: A flexible and efficient tuning paradigm via unbinding tuner from backbone[J]. Advances in Neural Information Processing Systems, 2023, 36: 42689-42716.
  • Jiang Z, Mao C, Huang Z, et al. Rethinking efficient tuning methods from a unified perspective[J]. arXiv preprint arXiv:2303.00690, 2023.
  • Wu Z F, Wei T, Jiang J, et al. NGC: A unified framework for learning with open-world noisy data[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 62-71.
  • Mao C, Li Y, Zhang Y, et al. Multi-channel pyramid person matching network for person re-identification[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
  • Mao C, Li Y, Zhang Z, et al. Pyramid person matching network for person re-identification[C]//Asian Conference on Machine Learning. PMLR, 2017: 487-497.