Video Grounding and Its Generalization: From I.D. and Task-Specific Models to O.O.D. and Large Foundation Models

Wang, Xin, Lan, Xiaohan, Zhu, Wenwu

  • Publisher: Springer
  • Publication date: 2026-01-03
  • Price: $6,950
  • VIP price: $6,603 (5% off)
  • Language: English
  • Pages: 209
  • Binding: Hardcover - also called cloth, retail trade, or trade
  • ISBN: 303194836X
  • ISBN-13: 9783031948367
  • Related categories: Computer Vision
  • Overseas special-order book (requires separate checkout)

Product Description

This book consists of two parts: Part I, Methodologies for Video Grounding, and Part II, Generalized Video Grounding and Trending Directions. To make the book self-contained and current, Part I covers basic and advanced methodologies for Video Grounding and compares the task with several representative Vision-Language learning tasks, including multimodal understanding and generation. Part II presents our insights on Generalized Video Grounding and on the development of Video Grounding in the era of large foundation models, and discusses future directions, such as Out-of-Distribution settings, that deserve further investigation.

The discussion covers both the task of Video Grounding and other Vision-Language tasks, as well as the relations between them. The basics and advances address Video Grounding from models to benchmarks, from supervised learning to unsupervised pre-training, from single-video grounding to video corpus grounding, and from in-distribution settings to out-of-distribution settings. For Generalized Video Grounding, we discuss cross-modal grounding, event grounding for multi-modal tasks, various distribution shifts in out-of-distribution settings, explainable Video Grounding, and large foundation models for Video Grounding.

We sincerely hope this book will benefit interested readers from both academia and industry, from junior researchers just starting out to senior practitioners in IT companies.

About the Authors

Xin Wang is currently an Associate Professor in the Department of Computer Science and Technology, Tsinghua University. He received both his Ph.D. and B.E. degrees in Computer Science and Technology from Zhejiang University, China. He also holds a Ph.D. degree in Computing Science from Simon Fraser University, Canada. His research interests include multimedia intelligence and machine learning and its applications. He has published over 200 high-quality research papers in ICML, NeurIPS, IEEE TPAMI, IEEE TKDE, ACM KDD, WWW, ACM SIGIR, ACM Multimedia, etc., and has won three best paper awards, including one at ACM Multimedia Asia. He is a recipient of the ACM China Rising Star Award, the IEEE TCMC Rising Star Award, and the DAMO Academy Young Fellow award.

Xiaohan Lan obtained her M.S. degree from Shenzhen International Graduate School, Tsinghua University. She received her B.E. degree from the Department of Computer Science and Technology of Beijing Normal University in 2020. Her main research interests include multimedia computation, vision and language understanding, and deep learning.

Wenwu Zhu is currently a Professor in the Department of Computer Science and Technology at Tsinghua University. He also serves as the Vice Dean of the Beijing National Research Center for Information Science and Technology. Prior to his current post, he was a Senior Researcher and Research Manager at Microsoft Research Asia. He was the Chief Scientist and Director at Intel Research China from 2004 to 2008. He worked at Bell Labs, New Jersey, as a Member of Technical Staff from 1996 to 1999. He received his Ph.D. degree from New York University in 1996. His research interests include graph machine learning, curriculum learning, data-driven multimedia, and big data. He has published over 400 refereed papers and is the inventor of over 100 patents. He has received ten Best Paper Awards, including awards from ACM Multimedia 2012 and from IEEE Transactions on Circuits and Systems for Video Technology in 2001 and 2019. He serves as the Editor-in-Chief (EiC) of IEEE Transactions on Circuits and Systems for Video Technology, and served as the EiC of IEEE Transactions on Multimedia (2017-2019) and as Chair of the steering committee for IEEE Transactions on Multimedia (2020-2022). He served as General Co-Chair for ACM Multimedia 2018 and ACM CIKM 2019. He is an AAAS Fellow, IEEE Fellow, ACM Fellow, SPIE Fellow, and a member of Academia Europaea.
