For me, online learning can be implemented in two ways. In the first, we adapt after each turn/run; in the second, we adapt after each step. The second version is far more "online", but I keep asking myself: is it really better?
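
To make the two variants concrete, here is a minimal sketch. It assumes a TD(0)-style value update, which is my choice for illustration only; the function names, `ALPHA`, `GAMMA` and the toy run are all hypothetical.

```python
ALPHA = 0.1  # learning rate (assumed)
GAMMA = 0.9  # discount factor (assumed)


def learn_per_step(values, run):
    """Adapt after each step: later updates already see earlier ones."""
    values = dict(values)
    for state, reward, next_state in run:
        target = reward + GAMMA * values.get(next_state, 0.0)
        old = values.get(state, 0.0)
        values[state] = old + ALPHA * (target - old)
    return values


def learn_per_run(values, run):
    """Adapt after the run: every update is computed from the values
    as they were when the run started, then applied together."""
    frozen = dict(values)
    updated = dict(values)
    for state, reward, next_state in run:
        target = reward + GAMMA * frozen.get(next_state, 0.0)
        old = frozen.get(state, 0.0)
        updated[state] = updated.get(state, 0.0) + ALPHA * (target - old)
    return updated


# A toy run of (state, reward, next_state) steps that passes through A twice.
run = [("A", 0.0, "B"), ("B", 0.0, "A"), ("A", 0.0, "C")]
start = {"A": 0.0, "B": 1.0, "C": 0.0}

per_step = learn_per_step(start, run)
per_run = learn_per_run(start, run)
print("per step:", per_step)
print("per run: ", per_run)
```

The two schedules end up with different values for the same run, because in the per-step version the update of B already sees the new value of A.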

Consider a simple example: in a single run you arrive at point/state A. You make another move (to B), update A's value and move on. That changes almost nothing in this run, but if the "world" has changed, you may have just slowed down your adaptation. Now, why "almost"? The change matters when you get back to A. If you are running with activity trails, the situation is even messier.
What has happened? You have updated A's value twice. That's not too bad in itself, but if you increased it, you have just made a mistake: you made a cycle in your run and lost time, energy, whatever. If you increased A's value the first time you left it, then the second time you should have treated it differently. A cycle is a sign that something went _not good_, so you can leave the value as it was or decrease it, but not increase it. By increasing it you assume that cycles are good, and that following a path that leads you nowhere is all right with you.
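
One way to act on this is sketched below. It is only an illustration under my own assumptions: a TD(0)-style update with no rewards, a hypothetical `guard_cycles` flag, and made-up starting values. The guard lets a state's value rise the first time you leave it in a run, but on any later departure it may only stay the same or fall.

```python
def update_along_path(values, path, guard_cycles, alpha=0.1, gamma=0.9):
    """Per-step value update along one run.

    With guard_cycles=True, a state that was already left once in this
    run (so we came back to it via a cycle) is never increased again.
    """
    values = dict(values)
    already_left = set()
    for state, next_state in zip(path, path[1:]):
        old = values.get(state, 0.0)
        delta = alpha * (gamma * values.get(next_state, 0.0) - old)
        if guard_cycles and state in already_left and delta > 0:
            delta = 0.0  # a cycle is not a success: leave the value as it was
        values[state] = old + delta
        already_left.add(state)
    return values


path = ["A", "B", "A", "C"]  # the run returns to A before moving on
start = {"A": 0.0, "B": 1.0, "C": 1.0}

plain = update_along_path(start, path, guard_cycles=False)
guarded = update_along_path(start, path, guard_cycles=True)
print("plain:  ", plain["A"])
print("guarded:", guarded["A"])
```

In the plain version A is increased twice, once per departure; in the guarded version only the first increase is kept. Decreases still pass through the guard, which matches the "leave it or decrease it" rule above.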
Reconsider.