I am currently using Q-Learning to try to teach a bot how to move in a room filled with walls/obstacles. It must start anywhere in the room and get to the goal state (this might be the tile that has the door, for example). Currently, when it wants to move to another tile it goes straight to that tile, but I was thinking that in the future I might add a random chance of ending up on a different tile instead. It can only move up, down, left and right. Reaching the goal state yields +100 and every other action yields 0.
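To make the setup concrete, here is roughly how I picture it in code (the names and numbers are illustrative only, not my actual implementation):

```python
# Rough sketch of my setup (names and numbers are illustrative only).
ACTIONS = ["up", "down", "left", "right"]
GOAL = (9, 4)                      # e.g. the tile with the door
WALLS = {(3, 2), (3, 3), (3, 4)}   # some obstacle tiles

def step(state, action):
    """Deterministic move for now; later I might add a random chance
    of ending up on a different tile."""
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[action]
    nxt = (state[0] + dx, state[1] + dy)
    return state if nxt in WALLS else nxt

def reward(state):
    return 100 if state == GOAL else 0
```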
I am using the algorithm found here, which can be seen in the image below.
Now, regarding this, I have some questions:
- When using Q-Learning, a bit like with Neural Networks, must I distinguish between a learning phase and a using phase? I mean, it seems that what is shown in the first picture is a learning one and what is shown in the second picture is a using one.
- I read somewhere that it would take an infinite number of steps to reach the optimal Q-values table. Is that true? I'd say it isn't, but I must be missing something here.
I've also heard about TD (Temporal Difference) learning, which seems to be represented by the following expression:
Q(a, s) = Q(a, s) + alpha * [R(a, s) + gamma * max_a' Q(a', s') - Q(a, s)]
which for alpha = 1 seems to be just the one shown first in the picture. What difference does gamma make here?
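In code, I read that update as something like the following (Q here is just a dictionary keyed by (action, state), and the names are mine, so correct me if I have misread the formula):

```python
def td_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Q(a, s) <- Q(a, s) + alpha * [R(a, s) + gamma * max_a' Q(a', s') - Q(a, s)]
    best_next = max(Q.get((a2, s_next), 0.0) for a2 in actions)
    old = Q.get((a, s), 0.0)
    Q[(a, s)] = old + alpha * (r + gamma * best_next - old)
    # with alpha = 1 this collapses to Q(a, s) = r + gamma * max_a' Q(a', s'),
    # which is how I read the rule in the first picture
```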
- I have run into some complications when I try a very big room (300x200 pixels, for example). As the bot essentially moves randomly at first, if the room is very big it will take a lot of time to get from the initial state to the goal state. What methods can I use to speed this up? I thought of keeping a table of booleans recording whether I have already visited each state in the current episode. If I have already been to the target tile, I'd discard it and go elsewhere; if I have already been to all the neighbouring tiles, I'd pick a random one. This way it would behave just like it does now, except that I'd repeat states less often than I currently do (see the first sketch after this list).
- I'd like to try something other than my lookup table for Q-values, so I was thinking of using Neural Networks with back-propagation for this. I will probably try having one Neural Network per action (up, down, left, right), as that seems to be what yields the best results (see the second sketch after this list). Are there any other methods (besides SVMs, which seem way too hard to implement myself) that I could use and implement that would give me a good Q-value function approximation?
- Do you think Genetic Algorithms would yield good results in this situation, using the Q-values matrix as the basis for them? How could I test my fitness function? GAs give me the impression of being used for things that are far more random/complex. If we look carefully we notice that the Q-values follow a clear trend: higher Q-values near the goal and lower ones the farther away you are from it. Trying to reach that conclusion with a GA would probably take way too long?
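To clarify the "visited table" idea from the big-room question above, this is roughly what I have in mind (step is the transition sketch from earlier; untested):

```python
import random

def choose_action(Q, state, visited, actions):
    """Sketch of my 'visited table' idea: prefer moves that lead to
    tiles I have not yet been to in this episode; if every neighbour
    has already been visited, fall back to a completely random move."""
    unvisited = [a for a in actions if step(state, a) not in visited]
    if unvisited:
        return random.choice(unvisited)
    return random.choice(actions)
```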
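And regarding the neural-network question, this is the kind of thing I was imagining for "one network per action", written with numpy just to show the structure (only a sketch of the idea, not something I have running):

```python
import numpy as np

class TinyQNet:
    """One small network per action: maps a state (x, y) to a predicted
    Q-value. Single hidden layer, trained by plain gradient descent on
    the squared error between the prediction and the TD target."""
    def __init__(self, n_in=2, n_hidden=8, lr=0.01):
        rng = np.random.default_rng(0)
        self.w1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(scale=0.1, size=n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, s):
        self._h = np.tanh(np.asarray(s, dtype=float) @ self.w1 + self.b1)
        return self._h @ self.w2 + self.b2

    def train(self, s, target):
        # back-propagate the error (prediction - target) through both layers
        err = self.predict(s) - target
        grad_h = err * self.w2 * (1.0 - self._h ** 2)
        self.w2 -= self.lr * err * self._h
        self.b2 -= self.lr * err
        self.w1 -= self.lr * np.outer(np.asarray(s, dtype=float), grad_h)
        self.b1 -= self.lr * grad_h

nets = {a: TinyQNet() for a in ["up", "down", "left", "right"]}
# the training target for (s, a) would then be:
#   r + gamma * max over a' of nets[a'].predict(s_next)
```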