Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual …

Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention …

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Ángel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a …

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates

Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward …

Ying Shen, Lifu Huang

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, …

Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind

Learning by Asking for Embodied Visual Navigation and Task Completion

The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing …

Ying Shen, Daniel Bis, Cynthia Lu, Ismini Lourentzou

Many-to-many Image Generation with Auto-regressive Diffusion Models

Recent advancements in image generation have made significant progress, yet existing models present limitations in perceiving and …

Ying Shen, Yizhe Zhang, Shuangfei Zhai, Lifu Huang, Joshua M. Susskind, Jiatao Gu

InternalInspector I²: Robust Confidence Estimation in LLMs through Internal States

Despite their vast capabilities, Large Language Models (LLMs) often struggle with generating reliable outputs, frequently producing …

Mohammad Beigi, Ying Shen, Runing Yang, Zihao Lin, Qifan Wang, Ankith Mohan, Jianfeng He, Ming Jin, Chang-Tien Lu, Lifu Huang

Multimodal Instruction Tuning with Conditional Mixture of LoRA

Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in diverse tasks across different domains, with an …

Ying Shen, Zhiyang Xu, Qifan Wang, Yu Cheng, Wenpeng Yin, Lifu Huang

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist …

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang
