Multimedia

Authors and titles for recent submissions

  • Mon, 6 Oct 2025
  • Fri, 3 Oct 2025
  • Thu, 2 Oct 2025
  • Wed, 1 Oct 2025
  • Tue, 30 Sep 2025

Total of 45 entries

Mon, 6 Oct 2025 (showing 2 of 2 entries)

[1] arXiv:2510.02746 [pdf, other]
Title: Detecting Notational Errors in Digital Music Scores
Géré Léo (Cnam, CEDRIC - VERTIGO), Nicolas Audebert (LaSTIG, IGN, CEDRIC - VERTIGO), Florent Jacquemard (CEDRIC - VERTIGO)
Journal-ref: International Conference on Technologies for Music Notation and Representation (TENOR) 2025, Oct 2025, Beijing, China
Subjects: Multimedia (cs.MM)
[2] arXiv:2510.02790 (cross-list from cs.CV) [pdf, html, other]
Title: MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding
Jingyuan Deng, Yujiu Yang
Comments: Accepted to EMNLP 2025 Findings
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Fri, 3 Oct 2025 (showing 4 of 4 entries)

[3] arXiv:2510.02161 [pdf, html, other]
Title: Comparing Contrastive and Triplet Loss in Audio-Visual Embedding: Intra-Class Variance and Greediness Analysis
Donghuo Zeng
Comments: 8 pages, 4 tables, 3 figures
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[4] arXiv:2510.01284 [pdf, html, other]
Title: Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Chetwin Low, Weimin Wang, Calder Katyal
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[5] arXiv:2510.01698 (cross-list from cs.IR) [pdf, html, other]
Title: TalkPlay-Tools: Conversational Music Recommendation with LLM Tool Calling
Seungheon Doh, Keunwoo Choi, Juhan Nam
Comments: Accepted for publication at The Workshop on AI for Music, Neural Information Processing Systems (NeurIPS-AI4Music)
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[6] arXiv:2510.01361 (cross-list from eess.IV) [pdf, other]
Title: An Efficient Quality Metric for Video Frame Interpolation Based on Motion-Field Divergence
Conall Daly, Darren Ramsook, Anil Kokaram
Comments: IEEE 17th International Conference on Quality of Multimedia Experience 2025 accepted manuscript, 7 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Thu, 2 Oct 2025 (showing 8 of 8 entries)

[7] arXiv:2510.00050 [pdf, html, other]
Title: Object-AVEdit: An Object-level Audio-Visual Editing Model
Youquan Fu, Ruiyang Si, Hongfa Wang, Dongzhan Zhou, Jiacheng Sun, Ping Luo, Di Hu, Hongyuan Zhang, Xuelong Li
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[8] arXiv:2510.01174 (cross-list from cs.CV) [pdf, html, other]
Title: Code2Video: A Code-centric Paradigm for Educational Video Generation
Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[9] arXiv:2510.01009 (cross-list from cs.CV) [pdf, html, other]
Title: POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[10] arXiv:2510.00990 (cross-list from cs.CY) [pdf, html, other]
Title: Disc-Cover Complexity Trends in Music Illustrations from Sinatra to Swift
Nicolas Fracaro, Stefano Cecconello, Mauro Conti, Niccolò Di Marco, Alessandro Galeazzi
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[11] arXiv:2510.00481 (cross-list from cs.NI) [pdf, html, other]
Title: Make a Video Call with LLM: A Measurement Campaign over Five Mainstream Apps
Jiayang Xu, Xiangjie Huang, Zijie Li, Zili Meng
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Performance (cs.PF)
[12] arXiv:2510.00261 (cross-list from cs.CL) [pdf, html, other]
Title: Retrieval-Augmented Generation for Electrocardiogram-Language Models
Xiaoyu Song, William Han, Tony Chen, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Comments: 5 pages, 2 figures; Submitted to ICASSP 2026
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[13] arXiv:2510.00058 (cross-list from eess.IV) [pdf, html, other]
Title: Variable Rate Image Compression via N-Gram Context based Swin-transformer
Priyanka Mudgal, Feng Liu
Comments: Accepted at ISVC 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[14] arXiv:2510.00006 (cross-list from cs.SD) [pdf, other]
Title: Unpacking Musical Symbolism in Online Communities: Content-Based and Network-Centric Approaches
Kajwan Ziaoddini
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Wed, 1 Oct 2025 (showing 7 of 7 entries)

[15] arXiv:2509.26625 (cross-list from cs.LG) [pdf, html, other]
Title: Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
Comments: Project page: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[16] arXiv:2509.26542 (cross-list from eess.AS) [pdf, html, other]
Title: Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen
Comments: Code and data available at this https URL
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[17] arXiv:2509.25745 (cross-list from cs.CV) [pdf, html, other]
Title: FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
Siddhant Sukhani, Yash Bhardwaj, Riya Bhadani, Veer Kejriwal, Michael Galarnyk, Sudheer Chava
Comments: ICCV Short Video Understanding Workshop Paper
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
[18] arXiv:2509.25668 (cross-list from eess.IV) [pdf, html, other]
Title: Enhanced Template-based Intra Mode Derivation with Adaptive Block Vector Replacement
Jiaqi Zhang, Jiaye Fu, Chuanmin Jia, Siwei Ma, Karam Naser, Thierry Dumas, Saurabh Puri, Milos Radosavljevic
Subjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
[19] arXiv:2509.25652 (cross-list from cs.AI) [pdf, html, other]
Title: Iterative Residual Cross-Attention Mechanism: An Integrated Approach for Audio-Visual Navigation Tasks
Hailong Zhang, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng
Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2025
Subjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
[20] arXiv:2509.25558 (cross-list from cs.AI) [pdf, html, other]
Title: A(I)nimism: Re-enchanting the World Through AI-Mediated Object Interaction
Diana Mykhaylychenko, Maisha Thasin, Dunya Baradari, Charmelle Mhungu
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Multimedia (cs.MM)
[21] arXiv:2509.25348 (cross-list from cs.CV) [pdf, html, other]
Title: Editing Physiological Signals in Videos Using Latent Representations
Tianwen Zhou, Akshay Paruchuri, Josef Spjut, Kaan Akşit
Comments: 12 pages, 8 figures, 7 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)

Tue, 30 Sep 2025 (showing 24 of 24 entries)

[22] arXiv:2509.24546 [pdf, html, other]
Title: Nagare Media Engine: A System for Cloud- and Edge-Native Network-based Multimedia Workflows
Matthias Neugebauer
Subjects: Multimedia (cs.MM)
[23] arXiv:2509.24331 [pdf, html, other]
Title: OnomatoGen: Onomatopoeia Generation with the Alpha-Channel in Manga
Takara Taniguchi, Wataru Shimoda, Kota Yamaguchi, Hideki Nakayama
Comments: ICCVW COMIQ Oral
Subjects: Multimedia (cs.MM)
[24] arXiv:2509.23251 [pdf, html, other]
Title: XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System
Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Zicheng Zhang, Jinliang Han, Guangtao Zhai
Subjects: Multimedia (cs.MM); Sound (cs.SD)
[25] arXiv:2509.25139 (cross-list from cs.AI) [pdf, html, other]
Title: Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
Yue Zhang, Tianyi Ma, Zun Wang, Yanyuan Qiao, Parisa Kordjamshidi
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[26] arXiv:2509.25131 (cross-list from cs.SD) [pdf, other]
Title: MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia
Comments: Code is available at this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[27] arXiv:2509.24921 (cross-list from cs.RO) [pdf, html, other]
Title: CineWild: Balancing Art and Robotics for Ethical Wildlife Documentary Filmmaking
Pablo Pueyo, Fernando Caballero, Ana Cristina Murillo, Eduardo Montijano
Subjects: Robotics (cs.RO); Multimedia (cs.MM)
[28] arXiv:2509.24783 (cross-list from cs.CV) [pdf, other]
Title: SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment
Hongyang Zhang, Yinhao Liu, Zhenyu Kuang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[29] arXiv:2509.24369 (cross-list from cs.CV) [pdf, html, other]
Title: From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis
Khawlah Bajbaa, Abbas Anwar, Muhammad Saqib, Hafeez Anwar, Nabin Sharma, Muhammad Usman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[30] arXiv:2509.24325 (cross-list from eess.IV) [pdf, html, other]
Title: ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes
Jiaye Fu, Qiankun Gao, Chengxiang Wen, Yanmin Wu, Siwei Ma, Jiaqi Zhang, Jian Zhang
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[31] arXiv:2509.24298 (cross-list from cs.HC) [pdf, html, other]
Title: Bridging the behavior-neural gap: A multimodal AI reveals the brain's geometry of emotion more accurately than human self-reports
Changde Du, Yizhuo Lu, Zhongyu Huang, Yi Sun, Zisen Zhou, Shaozheng Qin, Huiguang He
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM)
[32] arXiv:2509.24215 (cross-list from cs.SE) [pdf, html, other]
Title: Metamorphic Testing for Audio Content Moderation Software
Wenxuan Wang, Yongjiang Wu, Junyuan Zhang, Shuqing Li, Yun Peng, Wenting Chen, Shuai Wang, Michael R. Lyu
Comments: Accepted by ASE 2025
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
[33] arXiv:2509.23879 (cross-list from cs.CV) [pdf, html, other]
Title: PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
Hitesh Laxmichand Patel, Amit Agarwal, Srikant Panda, Hansa Meghwani, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth
Comments: Accepted in EMNLP 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
[34] arXiv:2509.23878 (cross-list from cs.SD) [pdf, html, other]
Title: Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription
Wei Zeng, Junchuan Zhao, Ye Wang
Comments: 30 pages, 13 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[35] arXiv:2509.23852 (cross-list from cs.GR) [pdf, html, other]
Title: SIG-Chat: Spatial Intent-Guided Conversational Gesture Generation Involving How, When and Where
Yiheng Huang, Junran Peng, Silei Shen, Jingwei Yang, ZeJi Wei, ChenCheng Bai, Yonghao He, Wei Sui, Muyi Sun, Yan Liu, Xu-Cheng Yin, Man Zhang, Zhaoxiang Zhang, Chuanchen Luo
Subjects: Graphics (cs.GR); Multimedia (cs.MM); Robotics (cs.RO)
[36] arXiv:2509.23833 (cross-list from eess.AS) [pdf, html, other]
Title: AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
Cancan Li, Fei Su, Juan Liu, Hui Bu, Yulong Wan, Hongbin Suo, Ming Li
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
[37] arXiv:2509.23796 (cross-list from cs.AI) [pdf, html, other]
Title: From Frustration to Fun: An Adaptive Problem-Solving Puzzle Game Powered by Genetic Algorithm
Matthew McConnell, Richard Zhao
Comments: Accepted at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE-25)
Subjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE)
[38] arXiv:2509.23673 (cross-list from cs.CV) [pdf, html, other]
Title: RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Amit Agarwal, Hitesh Laxmichand Patel, Srikant Panda, Hansa Meghwani, Jyotika Singh, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth
Comments: Accepted in EMNLP 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
[39] arXiv:2509.23435 (cross-list from cs.SD) [pdf, html, other]
Title: AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[40] arXiv:2509.23200 (cross-list from eess.IV) [pdf, html, other]
Title: Enhanced Quality Aware-Scalable Underwater Image Compression
Linwei Zhu, Junhao Zhu, Xu Zhang, Huan Zhang, Ye Li, Runmin Cong, Sam Kwong
Comments: 19 pages, 14 figures; submitted to ACM Transactions on Multimedia Computing, Communications, and Applications
Subjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
[41] arXiv:2509.22744 (cross-list from eess.AS) [pdf, html, other]
Title: Index-MSR: A high-efficiency multimodal fusion framework for speech recognition
Jinming Chen, Lu Wang, Zheshu Song, Wei Deng
Comments: Submitted to ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
[42] arXiv:2509.22740 (cross-list from eess.AS) [pdf, html, other]
Title: Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee, Kwanghoon Sohn
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
[43] arXiv:2509.22728 (cross-list from cs.SD) [pdf, html, other]
Title: Prompt-aware classifier free guidance for diffusion models
Xuanhao Zhang, Chang Li
Comments: 5 pages, 3 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[44] arXiv:2509.22718 (cross-list from eess.AS) [pdf, html, other]
Title: PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos
Ke Gu, Zhicong Wu, Peng Bai, Sitong Qiao, Zhiqi Jiang, Junchen Lu, Xiaodong Shi, Xinyuan Qian
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[45] arXiv:2509.19812 (cross-list from cs.SD) [pdf, html, other]
Title: Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
Yang Cui, Peter Pan, Lei He, Sheng Zhao
Comments: 6 pages of main text, 1 page of references, 2 figures, 2 tables, accepted at ASRU 2025
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)