Large Vision-Language Models: Pre-Training, Prompting, and Applications
Tentative Chinese title: 大型視覺語言模型:預訓練、提示與應用

Kaiyang Zhou, Ziwei Liu, Peng Gao

  • Publisher: Springer
  • Publication date: 2025-08-31
  • Price: $7,720
  • VIP price: $7,334 (5% off)
  • Language: English
  • Pages: 429
  • Binding: Hardcover (also called cloth, retail trade, or trade)
  • ISBN: 3031949684
  • ISBN-13: 9783031949685
  • Related category: Computer Vision
  • Overseas special-order book (requires separate checkout)

Product Description

Rapid progress in large multimodal foundation models, especially vision-language models, has dramatically transformed the landscape of machine learning, computer vision, and natural language processing. These powerful models, trained on vast amounts of multimodal data combining images and text, have demonstrated remarkable capabilities in tasks ranging from image classification and object detection to visual content generation and question answering. This book provides a comprehensive and up-to-date exploration of large vision-language models, covering the key aspects of their pre-training, prompting techniques, and diverse real-world computer vision applications. It is an essential resource for researchers, practitioners, and students in the fields of computer vision, natural language processing, and artificial intelligence.

Large Vision-Language Models begins by exploring the fundamentals of large vision-language models, covering architectural designs, training techniques, and dataset construction methods. It then examines prompting strategies and other adaptation methods, demonstrating how these models can be effectively fine-tuned to address a wide range of downstream tasks. The final section focuses on the application of vision-language models across various domains, including open-vocabulary object detection, 3D point cloud processing, and text-driven visual content generation and manipulation.

Beyond the technical foundations, the book explores the wide-ranging applications of vision-language models (VLMs), from enhancing image recognition systems to enabling sophisticated visual content generation and facilitating more natural human-machine interactions. It also addresses key challenges in the field, such as feature alignment, scalability, data requirements, and evaluation metrics. By providing a comprehensive roadmap for both newcomers and experts, this book serves as a valuable resource for understanding the current landscape, limitations, and future directions of VLMs, ultimately contributing to the advancement of artificial intelligence.

About the Authors

Kaiyang Zhou is an Assistant Professor in the Department of Computer Science, Hong Kong Baptist University, working on computer vision and machine learning. He has published more than 30 technical papers in top-tier journals and conferences in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AAAI, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), and International Journal of Computer Vision (IJCV), with over 10,000 citations in total. He is an Associate Editor of IJCV, the flagship journal in computer vision, and regularly serves as an area chair and senior program committee member for top-tier computer vision and machine learning conferences, such as NeurIPS, CVPR, ECCV, and AAAI.

Ziwei Liu is an Associate Professor at Nanyang Technological University, Singapore. His research interests include computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, IEEE Transactions on Pattern Analysis and Machine Intelligence, ACM Transactions on Graphics, and Nature Machine Intelligence. He is a recipient of the ICCV Young Researcher Award, the HKSTP Best Paper Award, a CVPR Best Paper Award Candidate nomination, the ICBS Frontiers of Science Award, and the MIT Technology Review Innovators Under 35 Asia Pacific award. He serves as an area chair for CVPR, ICCV, ECCV, NeurIPS, and ICLR, as well as an associate editor of International Journal of Computer Vision.

Peng Gao is a research scientist at Shanghai Artificial Intelligence Laboratory, working on large language models and vision-language models. His research interests include vision-language models, large language models, and diffusion models for content creation. He has published more than 40 papers in top-tier journals and conferences, including International Journal of Computer Vision (IJCV), ICML, ICLR, NeurIPS, CVPR, ICCV, and ECCV, receiving more than 10,000 citations. He has led several influential open-source projects, including LLaMA-Adapter and the Lumina series, which have received more than 7,000 and 2,000 stars, respectively.
