Training a Neural Network with Reinforcement learning

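I know the basics of feedforward neural networks and how to train them using the backpropagation algorithm, but I'm looking for an algorithm that I can use to train an ANN online with reinforcement learning.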

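For example, the cart pole swing up (http://www.google.com/search?q=cart%20pole%20swing%20up) problem is one I'd like to solve with an ANN. In that case, I don't know what should be done to control the pendulum; I only know how close I am to the ideal position. I need the ANN to learn based on reward and punishment, so supervised learning isn't an option.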

Another situation would be something like the snake game (org/wiki/Snake28view_game%29), where feedback is delayed and limited to goals and anti-goals rather than rewards.

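I can think of some algorithms for the first situation, like hill climbing or genetic algorithms, but I'm guessing they would both be slow. They might also be applicable to the second scenario, but they would be very slow and not well suited to online learning.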
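The hill-climbing approach from the question can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not a recommended method: the one-neuron network, the hypothetical reward function episode_reward (which hides the ideal outputs and returns only a score), the Gaussian step size and the step count are all invented for the example.

```python
import math
import random

def forward(weights, x):
    """Tiny one-neuron network: tanh(w . x + bias); the last weight is the bias."""
    *w, bias = weights
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def episode_reward(weights):
    """Hypothetical reward: 0 is perfect, more negative is worse.

    The ideal outputs are hidden inside this function; the learner only ever
    sees the returned score, i.e. "how close it is to the ideal position"."""
    samples = [((1.0, 0.0), 0.5), ((0.0, 1.0), -0.5), ((1.0, 1.0), 0.0)]
    return -sum((forward(weights, x) - ideal) ** 2 for x, ideal in samples)

def hill_climb(reward_fn, n_weights, steps=2000, step_size=0.1, seed=0):
    """Stochastic hill climbing over the whole weight vector."""
    rng = random.Random(seed)
    best = [rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
    best_reward = reward_fn(best)
    for _ in range(steps):
        # Perturb every weight with Gaussian noise.
        candidate = [w + rng.gauss(0.0, step_size) for w in best]
        r = reward_fn(candidate)
        if r > best_reward:  # keep the perturbation only if the reward improved
            best, best_reward = candidate, r
    return best, best_reward

weights, reward = hill_climb(episode_reward, n_weights=3)
print(weights, reward)
```

Every candidate has to be scored by a complete evaluation of the reward, which is one reason this kind of search is slow and a poor fit for online learning.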

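My question is simple: is there a simple algorithm for training an artificial neural network with reinforcement learning? I'm mainly interested in real-time reward situations, but if an algorithm for goal-based situations is available, even better.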

Best answer

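There are some research papers on the topic:

And some code:

These are just some of the top Google search results on the topic. The first couple of papers look pretty good, although I haven't read them myself. I think you'll find even more information on neural networks with reinforcement learning if you do a quick search on Google Scholar.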

Answer

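If the output that led to a reward r is backpropagated into the network r times, you will reinforce the network in proportion to the reward. This is not directly applicable to negative rewards, but I can think of two solutions that produce different effects: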

(1) If you have a set of rewards in a range rmin-rmax, rescale them to 0-(rmax-rmin) so that they are all non-negative. The bigger the reward, the stronger the reinforcement created.

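(2) For a negative reward -r, backpropagate a random output r times, as long as it differs from the one that led to the negative reward. This will not only reinforce desirable outputs but also diffuse or avoid bad outputs.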
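Both options in this answer can be sketched in pure Python. This is a minimal sketch, not the answerer's code: the single-layer softmax network SoftmaxPolicy, the helper names, the hypothetical 3-output environment bandit_reward, the learning rate, and rounding r to an integer repeat count are all assumptions. "Backpropagating an output" is interpreted here as one supervised cross-entropy step that treats that output as the correct target, and option (2) draws one random alternative output per punished step.

```python
import math
import random

class SoftmaxPolicy:
    """Single-layer network: logits = W x, output probabilities = softmax(logits)."""

    def __init__(self, n_inputs, n_actions, lr=0.1, seed=0):
        self.rng = random.Random(seed)
        self.lr = lr
        self.w = [[self.rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                  for _ in range(n_actions)]

    def probs(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, x):
        """Sample an output (action) from the network's probabilities."""
        return self.rng.choices(range(len(self.w)), weights=self.probs(x))[0]

    def backprop_toward(self, x, target):
        """One backprop step with cross-entropy loss, treating `target` as correct."""
        p = self.probs(x)
        for a, row in enumerate(self.w):
            grad = p[a] - (1.0 if a == target else 0.0)  # dLoss/dlogit_a
            for i, xi in enumerate(x):
                row[i] -= self.lr * grad * xi


def rescale(reward, r_min):
    """Option (1): map rewards in [r_min, r_max] to [0, r_max - r_min]."""
    return reward - r_min


def reinforce_nonnegative(policy, x, action, reward, r_min):
    """Option (1): backpropagate the chosen output round(reward - r_min) times."""
    for _ in range(round(rescale(reward, r_min))):
        policy.backprop_toward(x, action)


def reinforce_signed(policy, x, action, reward):
    """Option (2): reinforce `action` r times for a reward r >= 0; for a reward -r,
    reinforce one randomly chosen *different* output r times instead.
    Needs at least two outputs when the reward is negative."""
    if reward >= 0:
        target = action
    else:
        others = [a for a in range(len(policy.w)) if a != action]
        target = policy.rng.choice(others)
    for _ in range(round(abs(reward))):
        policy.backprop_toward(x, target)


def bandit_reward(action):
    """Hypothetical 3-output environment: only output 2 is rewarded."""
    return 2 if action == 2 else -1


policy = SoftmaxPolicy(n_inputs=1, n_actions=3)
x = [1.0]  # a constant input, so the network only learns a bias per output
for _ in range(1000):
    a = policy.act(x)
    reinforce_signed(policy, x, a, bandit_reward(a))
print(policy.probs(x))  # probability mass should end up concentrated on output 2
```

With option (1), every reward above rmin still gives the chosen output some positive push, and only the worst reward gives none. Option (2) actively moves probability away from the punished output, which is why the two schemes behave differently.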
