Statistical Policy Search Reinforcement Learning Methods and Applications (統計策略搜索強化學習方法及應用)

Zhao Tingting (趙婷婷)

Description

The victory of the agent AlphaGo over top human Go players reshaped how people think about artificial intelligence and drew broad academic attention to its core technology, reinforcement learning. Against this background, this book presents the author's years of research on reinforcement learning theory and applications together with recent developments in the field in China and abroad, making it one of the few specialized monographs on reinforcement learning. The book focuses on reinforcement learning methods based on direct policy search, drawing on a range of techniques from statistical learning to analyze, improve, and apply them.

The book describes policy search reinforcement learning algorithms from a fresh, modern perspective. Starting from different reinforcement learning settings, it discusses the many difficulties reinforcement learning faces in practical applications; for each setting it presents a concrete policy search algorithm, analyzes the statistical properties of the algorithm's estimators and learned parameters, and demonstrates and quantitatively compares the algorithms on application examples. In particular, combining frontier reinforcement learning techniques, it applies policy search algorithms to robot control and digital-art rendering, with refreshing results. Finally, drawing on the author's long research experience, it briefly reviews and summarizes trends in the development of reinforcement learning. The material is classical and comprehensive, the concepts clear, and the derivations rigorous, aiming to form a complete body of knowledge that integrates basic theory, algorithms, and applications.
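To make "policy search with parameter-based exploration" concrete, the sketch below shows one PGPE-style update of the kind the book builds on from Chapter 3 onward: policy parameters are drawn from a Gaussian hyper-distribution, episodes are rolled out, and the hyper-parameters are moved along a baseline-corrected gradient estimate. This is a minimal illustration, not code from the book; the `rollout(theta)` function is a hypothetical stand-in for running one episode with fixed parameters and returning its total reward.

```python
import numpy as np

def pgpe_update(rollout, mu, sigma, n_samples=20, lr=0.1, baseline=0.0):
    """One PGPE-style update (illustrative sketch, not the book's code).

    Policy parameters theta are sampled from a diagonal Gaussian
    N(mu, sigma^2); the gradient of the expected return with respect to
    (mu, sigma) is estimated via the log-likelihood trick and followed
    by gradient ascent. `rollout(theta)` is an assumed interface that
    runs one episode with fixed parameters theta and returns its return.
    """
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)
    returns = np.array([rollout(theta) for theta in thetas])
    # Subtracting a baseline reduces the variance of the gradient
    # estimate; Chapter 3 of the book analyzes the variance-optimal choice.
    advantages = returns - baseline
    diff = thetas - mu
    grad_mu = (advantages[:, None] * diff / sigma**2).mean(axis=0)
    grad_sigma = (advantages[:, None] * (diff**2 - sigma**2) / sigma**3).mean(axis=0)
    mu = mu + lr * grad_mu
    sigma = np.maximum(sigma + lr * grad_sigma, 1e-6)  # keep exploration noise positive
    return mu, sigma, returns.mean()
```

The later chapters refine exactly these estimators: Chapter 4's IW-PGPE adds importance weights so that samples drawn under old hyper-parameters can be reused, Chapter 5 regularizes the variance of the estimate, and Chapter 6 replaces the plain Gaussian draws with symmetric sampling schemes.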

About the Author

Zhao Tingting is an associate professor at the College of Artificial Intelligence, Tianjin University of Science and Technology; her main research interests are artificial intelligence and machine learning.
She is a member of the China Computer Federation (CCF) and of YOCSEF, a member of the Chinese Association for Artificial Intelligence (CAAI), and a member of the CAAI Pattern Recognition Technical Committee. In 2017 she was selected as a second-tier candidate of the Tianjin "131" Innovative Talent Training Project.

Table of Contents

Chapter 1 Overview of Reinforcement Learning
1.1 Reinforcement Learning in Machine Learning
1.2 Reinforcement Learning in Intelligent Control
1.3 Branches of Reinforcement Learning
1.4 Contributions of This Book
1.5 Structure of This Book
References

Chapter 2 Related Work and Background
2.1 Markov Decision Processes
2.2 Value-Function-Based Policy Learning Algorithms
2.2.1 Value Functions
2.2.2 Policy Iteration and Value Iteration
2.2.3 Q-learning
2.2.4 Least-Squares Policy Iteration
2.2.5 Value-Function-Based Deep Reinforcement Learning Methods
2.3 Policy Search Algorithms
2.3.1 Modeling Policy Search Algorithms
2.3.2 Classical Policy Gradient Algorithm (REINFORCE)
2.3.3 Natural Policy Gradient Method
2.3.4 Expectation-Maximization Policy Search Methods
2.3.5 Policy-Based Deep Reinforcement Learning Methods
2.4 Chapter Summary
References

Chapter 3 Analysis and Improvement of Policy Gradient Estimation
3.1 Background
3.2 Policy Gradients with Parameter-Based Exploration (PGPE)
3.3 Variance Analysis of Gradient Estimates
3.4 Improvement and Analysis Based on the Optimal Baseline
3.4.1 The Idea of the Optimal Baseline
3.4.2 The Optimal Baseline for PGPE
3.5 Experiments
3.5.1 Illustrative Examples
3.5.2 Inverted-Pendulum Balancing Task
3.6 Summary and Discussion
References

Chapter 4 Importance-Sampling-Based Policy Gradients with Parameter-Based Exploration
4.1 Background
4.2 PGPE in the Off-Policy Setting
4.2.1 Importance-Weighted PGPE (IW-PGPE)
4.2.2 Variance Reduction in IW-PGPE via Baseline Subtraction
4.3 Experimental Results
4.3.1 Illustrative Examples
4.3.2 Mountain-Car Task
4.3.3 Simulated Robot Control Tasks
4.4 Summary and Discussion
References

Chapter 5 Variance-Regularized Policy Gradient Algorithm
5.1 Background
5.2 Regularized Policy Gradient Algorithm
5.2.1 Objective Function
5.2.2 Gradient Computation
5.3 Experimental Results
5.3.1 Numerical Examples
5.3.2 Mountain-Car Task
5.4 Summary and Discussion
References

Chapter 6 Sampling Techniques for Policy Gradients with Parameter-Based Exploration
6.1 Background
6.2 Sampling Techniques in PGPE
6.2.1 Baseline Sampling
6.2.2 Optimal-Baseline Sampling
6.2.3 Symmetric Sampling
6.2.4 Super-Symmetric Sampling
6.2.5 Multimodal Super-Symmetric Sampling
6.2.6 Reward Normalization for SupSymPGPE
6.3 Numerical Experiments
6.3.1 Quadratic Function
6.3.2 Rastrigin Function
6.4 Chapter Summary
References

Chapter 7 Motor Skill Learning for Humanoid Robots Based on Efficient Sample Reuse
7.1 Background: Motor Skill Learning in Real Environments
7.2 Motor Skill Learning Framework
7.2.1 Robot Motion Trajectories and Returns
7.2.2 Policy Model
7.2.3 PGPE-Based Policy Learning
7.3 Efficient Reuse of Past Experience
7.3.1 Importance-Weighted Policy Gradients with Parameter-Based Exploration (IW-PGPE)
7.3.2 Motor Skill Learning with IW-PGPE
7.3.3 Recursive IW-PGPE
7.4 Cart-Pole Swing-Up Task in a Virtual Environment
7.5 Basketball Shooting Task
7.6 Discussion and Conclusions
References

Chapter 8 Inverse-Reinforcement-Learning-Based Artistic Style Learning and Ink-Wash Painting Rendering
8.1 Background
8.1.1 Computer Graphics Background
8.1.2 Artificial Intelligence Background
8.1.3 Rendering Systems for Artistic Stylization
8.2 Modeling the Brush Agent with Reinforcement Learning
8.2.1 Action Design
8.2.2 State Design
8.3 Offline Artistic Style Learning Stage
8.3.1 Data Collection
8.3.2 Reward-Function Learning via Inverse Reinforcement Learning
8.3.3 Rendering-Policy Learning with R-PGPE
8.4 The A4 System User Interface
8.5 Experiments and Results
8.5.1 Rendering-Policy Learning Results
8.5.2 Rendering Results of IRL-Based Stroke Drawing
8.6 Chapter Summary
References