Future of threading in Ogre

by **Sqeaky** » Sun Dec 02, 2012 2:49 am

This thread is an attempt at a continuation From viewtopic.php?f=25&t=75459&start=50#p478280

I am including this link from the last thread as it seems important

dark_sylinc wrote:OK!! I added the tasks needed for Ogre 2.x
They're in the WIKI PAGE.

From Page 17 of the slides in the post: viewtopic.php?f=25&t=75459&start=50#p478276

dark_sylinc wrote:A thing about tasks (aka. “Short-lived threads”):

Different people have differing defintions of Tasks/Blocks/Workunits/microthreading/whatever this can make this discussion difficult. There are many task scheduler based threading solutions that launch very few threads by running multiple blocks/WorkUnits/Whatever in the thread before its destruction. The active research in this area makes defining things difficult. Having few good definitions makes discussing difficult.

In my experience on modern systems much of the cost of creating a thread comes from it being a system call and at some point needing to allocate memory. A system that is spawning hundreds of threads on modern commodity hardware is too broken to be considered by any performance instensive application, but only because it is creating a large amount of needless objects. I think there are some situations in which it may be faster to create/destroy threads rather than use other synchronization primitives, but in general I agree fewer creations of short lived objects on the heap is a good thing.

From the page 14 slides in the post: viewtopic.php?f=25&t=75459&start=50#p478276

dark_sylinc wrote:There is no guarantee parents are updated in the same thread as their children.
Doing so would be more trouble than it's worth. (But if real world code proves it isn't
hard to guarantee this order → no syncs would be needed until the end)

How much work is in each of those boxes (on page 14)? If there is a significant amount of work, then it may be possible to put them each into Task/WorkUnit in a system that respects data dependencies and does not create unneeded threads. If it is just a few calculations I may have an idea. Is it possible to easily group the nodes in some way where they can be worked on and the dependency is respected because the grouping implies thread safety (Please forgive my ignorance of Ogre). For example, if the nodes are in a hierarchy where each node has one parent and N children there is an implied guarantee. This would let us logically split the heirarchy near its base into X groups we could keep each child with its parents(and grandparents and so on until the split) and each group is worked by a single thread. This requires no synchonization until the work ends and naturally there will be no race condtions.

Towards the end of your slides you mention that entities with animations must be handled in a different fasion. With these complications and the culling issues you in mentioned your previous response it seems that some things to be rendered follow very different codepaths, if that is the case then a more nuanced view of threading that we have discussed might be required. Is there a document that describes these processes in detail (moreso than your slides) so I could make more intelligent and specific observations? Or should I be looking at the source code, and helping to populate the wiki page with my findings?

From: viewtopic.php?f=25&t=75459&start=50#p478279

Xavyiy wrote:What I am trying to say? That if UpdateAllTransforms() & UpdateAllAnimation() are the most relevant candidates for parallelizing, I really suggest doing it in a simple, clear and efficient way: the barrier system. Easy to debug, easy to modify and, at the end, more efficient than using a task-based lib. Less overhead. Granularity could be a problem, but nothing stop us from doing some real-time profiling and adapt the range of objects each thread updates. Also, a great thing here is that the time needed to update each node is constant (it'll not be the same for animations, but it's not going to be a huge difference), so that helps a lot! (In a task-based system each task may have very different execution times, you can mix little tasks with big tasks, so granularity/profiling becomes very important to avoid wasting of CPU resources).

Real time profiling like you describe seems to imply that a change in the amount of work or number of threads requires fine tuning. If that is the case then the ability to perform this tuning must be exposed. Can you think of a way to adjust the location of the thread barrier with a heuristic of some kind? Also, a Workunit based system that respects data dependencies would presumably only have periods of waiting once all the work was complete, this compensates for cases when one thread take longer then expected just before the thread barrier.

***Shameless plug*** In a post in the previous thread I linked to a library I am working on. I think it is close to an ideal fit for most game related tasks (maybe Ogre) : viewtopic.php?f=25&t=75459&start=25#p477907

Edit - Fixed quotation, and again

bydark_sylinc » Sun Dec 02, 2012 4:38 am

Sqeaky wrote:

dark_sylinc wrote:There is no guarantee parents are updated in the same thread as their children.
Doing so would be more trouble than it's worth. (But if real world code proves it isn't
hard to guarantee this order → no syncs would be needed until the end)

How much work is in each of those boxes (on page 14)? If there is a significant amount of work, then it may be possible to put them each into Task/WorkUnit in a system that respects data dependencies and does not create unneeded threads.

Look the code I posted in my post (about SoA_Vector3 multiplied with SoA_Matrix4 and co.). That's how it should look like.
It's constant execution for each node (each node takes the same time). It's just basic math (additions, multiplication, and conditional moves).
The thing is, if you have 10.000 objects, you'll probably have 10.000 nodes. Therefore, you can split those 10.000 into multiple worker threads. This currently one bottleneck due to poor inefficient updates. The other two hot spots are updating the skeletons, and culling.

Sqeaky wrote:If it is just a few calculations I may have an idea. Is it possible to easily group the nodes in some way where they can be worked on and the dependency is respected because the grouping implies thread safety (Please forgive my ignorance of Ogre). For example, if the nodes are in a hierarchy where each node has one parent and N children there is an implied guarantee.

The hierarchy depends on each user. One game may have the Root scene node, and 10.000 children of that root node.
Another game may have have 3 levels on the hierarchy, while another one may have 7 (note that in Ogre 1.x if you do that last one, performance will be **horrible**)
Also Node "1" may have 3 leaf children, while Node "2" may have one child, which happens to have it's own child too. There is no real guarantee. It's also possible that node "20" (level 3) was child of "8" (level 2), and now becomes child of "1". This last example doesn't happen often though. It's rare.

Typically, 50% of the games have no children at all (they all live in level 1 because level 0 is reserved for root scene node), except for a few nodes which may have multiple children (and children's children) which happens to be for the player (to attach weapons, to use a complex camera setup, to use multiple items following the player, particle effects).

Another 45% of games have 2 levels of depth (+1 for the root node); which are very balanced because the 1st level is for the objects (i.e. enemies, npcs, houses, trees) and the 2nd level is for their attachments (weapons, particle fxs, etc) where most of objects tend to have attachments. Again, if there are more levels it's usually because it's for the local player that needs a more complex setup.

This is why I added the tip of letting the last levels to be handled in single threading (because usually, there are very few nodes)

The remaining 5% just uses the scene hierarchy in unpredictable ways. Sometimes because of their very specific needs, sometimes out of plain ignorance or novice.

Personally you're overthinking it too much. We need to sort to guarante data locality, we need to split data across render queues, split data across each voxel, guarantee SIMD order. There's a lot of sorting we have to do. There is a compromise, we can't do everything because the overhead starts becoming too big (and code complexity to handle all cases becomes unmanageable). Sorting per thread brings a lot of headaches (because we don't just have to move a few pointers, we have to hard copy multiple Vector3 and one Matrix4 in their SoA versions) unless we split per thread, but that would fragment memory even more.
Simpler = better.

This is something that will be easier to see once we get there. Otherwise we'll be discussing about vaporware.

Sqeaky wrote:Towards the end of your slides you mention that entities with animations must be handled in a different fasion. With these complications and the culling issues you in mentioned your previous response it seems that some things to be rendered follow very different codepaths, if that is the case then a more nuanced view of threading that we have discussed might be required. Is there a document that describes these processes in detail (moreso than your slides) so I could make more intelligent and specific observations? Or should I be looking at the source code, and helping to populate the wiki page with my findings?

You'll had to dig how Skeletal animation works. There's a thousand ways to do it. I wrote the slides with Ogre's way of doing it in mind, so you may need to watch the src.
It boils down to reading two "keyframes" which are position, rotation & scale at a specific point in time, then interpolating between two keyframes and setting the result to each bone (a bone is similar to a node). Then expand this result to it's children.
Of course, multiple animations may happen at the same time, which they have to be interpolated (or added) a process called animation blending.
More bones => More calculations. More animations active at the same time => More calculations.
The good news is that each animation update is very isolated between each object, so no need to think about data dependencies or race conditions.

Sqeaky wrote:

dark_sylinc wrote:What I am trying to say? That if UpdateAllTransforms() & UpdateAllAnimation() are the most relevant candidates for parallelizing, I really suggest doing it in a simple, clear and efficient way: the barrier system. Easy to debug, easy to modify and, at the end, more efficient than using a task-based lib. Less overhead. Granularity could be a problem, but nothing stop us from doing some real-time profiling and adapt the range of objects each thread updates. Also, a great thing here is that the time needed to update each node is constant (it'll not be the same for animations, but it's not going to be a huge difference), so that helps a lot! (In a task-based system each task may have very different execution times, you can mix little tasks with big tasks, so granularity/profiling becomes very important to avoid wasting of CPU resources).

Real time profiling like you describe seems to imply that a change in the amount of work or number of threads requires fine tuning. If that is the case then the ability to perform this tuning must be exposed. Can you think of a way to adjust the location of the thread barrier with a heuristic of some kind? Also, a Workunit based system that respects data dependencies would presumably only have periods of waiting once all the work was complete, this compensates for cases when one thread take longer then expected just before the thread barrier.

***Shameless plug*** In a post in the previous thread I linked to a library I am working on. I think it is close to an ideal fit for most game related tasks (maybe Ogre) : viewtopic.php?f=25&t=75459&start=25#p477907

Sparkprime Edit: Xavyiy said that, not me. But I agree on what he said. I don't really know what's the fuss from you on this one. Threading UpdateAllTransforms & UpdateAllAnimations using primitive sync barries like sparkprime xavyiy says (which is what I'm trying to fight for) is so simple that it would take me 15 minutes to write, each (total 30 min). No joke. In multithreading, keeping things simple is very important. That's why I like it.

Also, a Workunit based system that respects data dependencies would presumably only have periods of waiting once all the work was complete, this compensates for cases when one thread take longer then expected just before the thread barrier.

That doesn't make sense. What sparkprime xavyiy said is that UpdateAllTransforms has constant time for all nodes which makes it impossible for a thread to get earlier to the barrier (assuming there is no oversubscription of course).
It is posible for UpdateAllAnimations that to happen, but very unlikely and not worth the trouble.

Real time profiling like you describe seems to imply that a change in the amount of work or number of threads requires fine tuning

I don't know what you mean by that. When did sparkprime xavyiy talk about real time profiling?

Cheers

byXavyiy » Sun Dec 02, 2012 5:09 am

Well, looks like you both are referring something I wrote, not Sparkprime.

But I agree on what he said. I don't really know what's the fuss from you on this one. Threading UpdateAllTransforms & UpdateAllAnimations using primitive sync barries like sparkprime says (which is what I'm trying to fight for) is so simple that it would take me 15 minutes to write, each (total 30 min). No joke. In multithreading, keeping things simple is very important. That's why I like it.

That's the point. In this case, a simple barrier sync system is enough and likely more efficient than a task-based scheduler. When I was writting the Paradise Threading lib I did a lot of profiling and the conclusion was that for well-defined/controlled problems(like a FFT, or in this case update nodes/anims) it was faster to use a simple barrier between worker threads than executing it "in tasks", just because of the little overhead that the scheduler adds. The difference was very small though. And in this case: the simplier, the better.

Real time profiling like you describe seems to imply that a change in the amount of work or number of threads requires fine tuning

About "real-time" profiling: that's something cool when you're doing "parallel fors" on some data since this way you're able to get the optimal granularity after some iterations. But in this case, each node will take the same CPU cycles in being updated, so this is not very important here: it's a very controlled scenario. And yes, animations will differ a little from one to each other, but as dark_sylinc pointed making the threading system more complex due to it is not worth at all.

Personally you're overthinking it too much. We need to sort to guarante data locality, we need to split data across render queues, split data across each voxel, guarantee SIMD order. There's a lot of sorting we have to do. There is a compromise, we can't do everything because the overhead starts becoming too big (and code complexity to handle all cases becomes unmanageable). Sorting per thread brings a lot of headaches (because we don't just have to move a few pointers, we have to hard copy multiple Vector3 and one Matrix4 in their SoA versions) unless we split per thread, but that would fragment memory even more.
Simpler = better.

+1.

Some references: some years ago, I found an Intel demo (Destroy the Casttle) which splitted the physics part over some threads(==logical cores). The result: if multithreading was enabled it was really smooth, if not... quite bad. When I looked the sources I was surprised about how simple the code was, but in fact it was doing the job pretty well. The same must be applied here: if it can be done in a simple and efficient way, let's do it that way! http://www.newtondynamics.com/forum/vie ... =14&t=4578

bydark_sylinc » Sun Dec 02, 2012 5:56 am

Xavyiy wrote:Well, looks like you both are referring something I wrote, not Sparkprime.

Sorry... I hope you don't have an identity crisis now :lol:

Fixed my post.

Edit: Off topic, Your user name is unpronounceable in spanish. My mind just... it just reads "xaviyi"

bySqeaky » Sun Dec 02, 2012 6:27 am

Thanks for all the details, I think I will need to digest them for a bit before I can really understand it. I fixed the quotation, twice

.

The thread barrier idea is simple, and simple is good, I agree. If you think it is ideal you should press for it/code it, particularly if it is only 1 man-hour of work. I don't think I know enough about Ogre at this point do anything like that so fast. Superficially, it seemed that doing one stage after the other using barriers was too simple to match what I have been reading about the order things must occur in, but again I don't know enough yet to speak on this with confidence. If it can be done this simply it sounds close to ideal.

dark_sylinc wrote:There is no real guarantee. It's also possible that node "20" (level 3) was child of "8" (level 2), and now becomes child of "1". This last example doesn't happen often though. It's rare.(sic) The remaining 5% just uses the scene hierarchy in unpredictable ways. Sometimes because of their very specific needs, sometimes out of plain ignorance or novice.

When can they change parents? Anytime? If it can change when the game dev moves them then the game devs need to not do that while this is processing regardless of threading strategy.

dark_sylinc wrote:Personally you're overthinking it too much. We need to sort to guarante data locality, we need to split data across render queues, split data across each voxel, guarantee SIMD order.

I am honestly trying to do as little work as possible and keep it simple. I have a system that does some (maybe all) of the things I have suggested are possible and I am genuinely trying to determine if Ogre can benefit from it. If it can be used it could save Ogre team developer time, and potentially get me bugfixes and peer review of my code. Even if not used I will still learn much about Ogre, which I need to do anyway. Much of the complex work is already done and I am trying to prevent others from having to do heavy lifting again.

As for the cache misses, sometimes just having a pointer to an object that fits in a cache line helps. Modern cache prefetching uses time hints and locality hints. If the nodes are small enough to fit into a small number of cache lines the amount of cache misses might drop just because of the amount of cache that the extra cores brings into play. I think this digression belongs in another thread though.

(sic)Granularity could be a problem, but nothing stop us from doing some real-time profiling and adapt the range of objects each thread updates.(sic)

Right in the middle of his paragraph talking about how the thread barriers could be moved earlier or backwards in time putting different amounts of time in front of or behind them based on empiricly gathered data. The way I understood his statement, he meant we could test and come up with simple conditions for where this should go. Then I read into it to understand that would mean that for ideal performance we would need to expose an API to allow game devs to change these conditions, and I pre-emptively suggested replacing that with a smart heuristic. ie I probably overthought it. Please disregard my earlier comments on this.

Xavyiy wrote:(sic)in this case, each node will take the same CPU cycles in being updated(sic)

As for math in multiple threads always taking about the same time, that is not always the case. There are many factors that could adjust it: inability to divide work evenly, OS pre-emption, cache misses, NUMA factors, non-identically clocked CPU cores, and other interesting things. Despite these possible sources of timing vagueness I agree that in most cases the end of all the work is likely to occur within an acceptably short period and little waiting is likely to occur.

Xavyiy wrote:When I was writting the Paradise Threading lib I did a lot of profiling and the conclusion was that for well-defined/controlled problems(like a FFT, or in this case update nodes/anims) it was faster to use a simple barrier between worker threads than executing it "in tasks", just because of the little overhead that the scheduler adds.

Can you tell me about the algorithms you chose for the scheduler, I could not find the website? I have done a ton of research on this recently for similar, but less obviously parallelizable, use cases then what has been described now, and I never stumbled accross this.

Off topic, Your username is unpronounceable in english, my mind just sees "that guy with the paradise engine" :wink:

bydark_sylinc » Sun Dec 02, 2012 7:56 am

Sqeaky wrote:

dark_sylinc wrote:There is no real guarantee. It's also possible that node "20" (level 3) was child of "8" (level 2), and now becomes child of "1". This last example doesn't happen often though. It's rare.(sic) The remaining 5% just uses the scene hierarchy in unpredictable ways. Sometimes because of their very specific needs, sometimes out of plain ignorance or novice.

When can they change parents? Anytime? If it can change when the game dev moves them then the game devs need to not do that while this is processing regardless of threading strategy.

Hi, the threading for updateAllTransforms & UpdateAllAnimation is largely sync, not async (mostly to divide work, rather than doing work in the background; and there are lots of dependencies on their results).

Hierarchy changes happen outside rendering the frame (that is, after or before the engine is rendering the frame, not while the engine is inside "renderOneFrame()"); which means you don't have to worry about the hierarchy suddenly changing from another thread or between internal calls.

The Ogre user may want to put renderOneFrame() and update the logic & physics in another thread (ie. Distant Sould does this). But that is done at a higher level than that of Ogre and is out of our scope. The user would be responsible for sync'ing correctly (he shouldn't change positions, orientation, or hierarchy while the Graphics thread is inside renderOneFrame) and he can do that in any way he choses to.

Personally in Distant Souls, my position & orientation updates aren't done safely (because that rarely results in stuff out of place; and when it does, the glitch will be corrected in the next frame) however I lock when I need to change hierarchy because otherwise Ogre crashes. I don't want to go into details, just giving an example so it can be better understood.
I do plan to write a demo showing how to do this kind of threading when we get there (around version +2.5) so that users know how to do it (I implemented this technique by myself, and so far I am only aware of another Ogre user being able to independently discover & replicate that implementation, he wrote a blog post explaining how he does it)

byTheSHEEEP » Sun Dec 02, 2012 11:14 am

Just wanting to drop by and say that I am reading all of this with a great interest and really like the ideas that come up here.

Unfortunately, I really don't know enough about the Ogre internals to really help here.
Other then, yes, simple is good, indeed! Especially when it comes to threading.

I've been working with an engine that had its threading completely broken down into a multitude of different classes that all work somehow together through a job scheduling and messaging system. Every single class was relatively easy to understand, but grasping the whole system was a nightmare as it was just so incredibly complex. Bad case of over-the-top object orientation.

So, really, keeping it simple is vital. Being rather easy to get into is one of the things that make Ogre so good. And if we could somehow manage to create a threading system that even newbies could understand with a bit of thought, it would be awesome.

byXavyiy » Sun Dec 02, 2012 2:02 pm

Sorry... I hope you don't have an identity crisis now

Hehe, it has been close

As for math in multiple threads always taking about the same time, that is not always the case. There are many factors that could adjust it: inability to divide work evenly, OS pre-emption, cache misses, NUMA factors, non-identically clocked CPU cores, and other interesting things. Despite these possible sources of timing vagueness I agree that in most cases the end of all the work is likely to occur within an acceptably short period and little waiting is likely to occur.

Yes, you're completely right. There is a ton of low-level factors which will make the execution time not constant, but as you've said it's going to be an acceptably short time so, in average, it should work pretty well (I'm sure the the majority of times it takes much more time to execute the scheduler algorithms(which involves signaling and a logical part) than the time threads are waiting in the barrier before all nodes has been updated due to different execution times between threads. Of course, if you're updating 5k nodes splitted in 2 threads this time in the barrier may become more relevant, but taking account that nowadays CPU standards are 4/8 cores(so 4/8 threads) and we'll not have more than 1k nodes in a typical scene, we can assume the barrier system will work well enough.

Can you tell me about the algorithms you chose for the scheduler, I could not find the website? I have done a ton of research on this recently for similar, but less obviously parallelizable, use cases then what has been described now, and I never stumbled accross this.

Well, I haven't used any "standard algorithm" for the scheduler.

The TaskManager(scheduler) is able to perform a serie of operations to a TaskQueue: execute, parallel_for(2D/3D), etc. Each task has some params where you(the scheduler) can define it's range(for parallel_for), etc.

Also, you can define data dependencies between tasks in form of strings(multiple strings per task, since inside a task you're able to notify the scheduler that a dependency has been already calculated, before finishing the task), and the TaskManager will only execute each task when all its data dependencies has been executed. For example (parallel initialization of all systems in the paradise engine)

Code: Select all

      // Parallel systems creation
      PThreading::TaskQueue tq;
      PThreading::FPtrTask* task;

      for(PUtils::uint32 k = 0; k < mSystems.size(); k++)
      {
         task = new PThreading::FPtrTask(
            PThreading::MakeFPtrWrapper(&Core::_createSystem, this), 
            PThreading::TaskParameters(static_cast<void*>(mSystems[k])));

         // Set ouput data
         task->getOutputData()->add(mSystems[k]->getSystemName());

         tq.queue(task);
      }

      // Graphics system needs scene system for SceneSystem::Listener registering in MaterialManager
      tq.peek(System::ST_GRAPHICS)->getInputData()->add(mSystems[System::ST_SCENE]->getSystemName());

      if (!PThreading::TaskManager::getSingleton().execute(&tq))
      {
         PENGINE_LOG_CRITICAL("PEngine::Core error while creating systems. Check PThreading log");
         tq.clear(true);

         for (PUtils::uint32 k = 0; k < mListeners.size(); k++)
         {
            mListeners.at(k)->coreCreated(false);
         }

         return false;
      }

                tq.clear(true);

And this is how a function(Core::_createSystem in this case) which can be executed as a task looks like:

Code: Select all

PThreading::TaskParameters Core::_createSystem(const PThreading::TaskParameters& tparams)
   {
      System* s = static_cast<System*>(tparams.getCustomData());

      bool result = s->create();

      if (!result)
      {
         PENGINE_LOG_CRITICAL("PEngine::Core error while creating " + s->getSystemName() + " system");
      }

      if (s->getSystemType() == System::ST_GRAPHICS)
      {
         // Since we've created Ogre in a working thread, we've to save the current opengl context and reset it,
         // then we've to restore the opengl context in the new thread
         static_cast<GraphicsSystem*>(s)->_saveAndClearOGLContext();
      }

      for (PUtils::uint32 k = 0; k < mListeners.size(); k++)
      {
         mListeners.at(k)->systemCreated(s->getSystemType(), result);
      }

      return PThreading::TaskParameters();
   }

So, here you can see an example of data dependencies: The graphics system creation task will be always executed after the scene manager, see this line:

Code: Select all

tq.peek(System::ST_GRAPHICS)->getInputData()->add(mSystems[System::ST_SCENE]->getSystemName());

Then the task manager "just"(it's quite complex at really... doing/debugging threading in this scenario is not an easy task

) deals with the internals for making the "magic" happens. So more than using a well-know algorithm I've do it the way it was more efficient regarding my tests: it was not worth to create a dependency tree structure of tasks before executing them since the majority of the times there are only a few data depencies. So instead of going complex I just look for the next task in the queue which can be executed(all its dependency data is already execute) and in error case(for example recursive dependencies) notify the user.

Some more info at:
http://www.ogre3d.org/forums/viewtopic. ... 90#p420819
http://www.ogre3d.org/forums/viewtopic. ... 90#p421669
http://www.paradise-studios.net/?p=29

Hope this is useful for you!

Edit: Off topic, Your user name is unpronounceable in spanish. My mind just... it just reads "xaviyi"

Off topic, Your username is unpronounceable in english, my mind just sees "that guy with the paradise engine"

Hehe, yeah! My username is quite... particular

I always have problems when giving my email address to people, specially with the last 3 characters!

To be honest I didn't remember when I've used it for the first time, but I guess it was when I was 11 or 12

So don't try to find any special meaning of my nick... it's just a kid nickname =)!

bySqeaky » Tue Dec 04, 2012 8:16 am

Xavyiy wrote:Of course, if you're updating 5k nodes splitted in 2 threads this time in the barrier may become more relevant, but taking account that nowadays CPU standards are 4/8 cores(so 4/8 threads) and we'll not have more than 1k nodes in a typical scene, we can assume the barrier system will work well enough

I would worry more about the future and more cores. In the two thread example each thread would get about 2.5k nodes and if a thread took an extra 3 microseconds, then the faster thread waits for an just 3 microseconds. If the same 3 microsecond delay happens on a modern 8 core(625/thread) or a near future 16 core(312.5/thread) machine then 21 or 45 cpu microseconds are squandered. I am not sure how likely any given kind of delay is, but any delay of fixed length costs the same time across all threads on the barrier. Since the simplicity of coding this has been asserted I think it would be worth measuring. Likely I am looking at too hard, certainly even with large fluctuations, any multithreaded design on a multicore system will outperform the single threaded version.

dark_sylinc wrote:Personally in Distant Souls, my position & orientation updates aren't done safely (because that rarely results in stuff out of place; and when it does, the glitch will be corrected in the next frame) however I lock when I need to change hierarchy because otherwise Ogre crashes.

I wouldn't make the kinds of assumptions this implies. As stated this would have... interesting performance characteristics on a single threaded machine. I will assume that in a single threaded situation you just run physics first. This also doesn't sound like it scales well beyond 3 three cores(physics/Ogre/other). If run on different hardware you could get wildly different behavior from these race conditions.

Which physics library are you using? Most physics libraries have threading options or an least an implementable thread provider class. Like this, most physics libraries can be configured to monopolize the available cpu resources for a brief period. This can be done at the beginning of a frame. Done this way it would implicitly preclude the need to synchronize and would scale as well as the physics library allows. Then while Ogre is running other game/engine tasks could be performed in other threads.

Xavyiy wrote:Then the task manager "just"(it's quite complex at really... doing/debugging threading in this scenario is not an easy task ) deals with the internals for making the "magic" happens. So more than using a well-know algorithm I've do it the way it was more efficient regarding my tests: it was not worth to create a dependency tree structure of tasks before executing them since the majority of the times there are only a few data depencies. So instead of going complex I just look for the next task in the queue which can be executed(all its dependency data is already execute) and in error case(for example recursive dependencies) notify the user.

The complexity of this kind of code is exactly why I chose to completely encapsulate the functionality in a separate library. If I find, or even suspect a bug, I create a test in a small isolate test environment attempting to duplicate it. It seems to be the only way to troubleshoot complex multithreaded isses without advanced multi-threaded debuggers (of which I am unaware of any). It is more work though, but there are other advantages that make the work cost worthwhile.

I may be inferring too much again, but I agree creating a dependency tree for the purpose of rigidly scheduling tasks is not generally a good idea. However, I am a huge fan of keeping data required to do so in convenient locations to allow for the creation of simple algorithms with sophisticated behavior. Anything created before the actual execution takes place is less likely to help because executions times always change and dependencies always seem to be completed out of predicted order whenever inconvenient. If you keep the simple pieces of data around, then the required value can be calculated which can be faster than a potential cache miss.

Something odd I noticed right of away from the numbers on the page you linked, was that the 'Micro-threaded DFT' has a speedup of 1.86x despite being highly parrelizable. I would have expected 1.99x or 2x speedup from most implementations. Was this micro threading implementation creating a thread per task or doing something similar? If it wasn't do you have any explanations or theories for the difference in performance from optimal? I think I should grab a copy of your math and try a benchmark myself.

Thank you for the links I find all the information quite helpful. I will read as much as I can.

byZonder » Thu Dec 13, 2012 3:03 pm

Something I came across that I think should be kept in mind is lock less queues. Resource loading springs to mind instantly

http://www.drdobbs.com/parallel/lock-fr ... /208801974

bydark_sylinc » Thu Dec 13, 2012 4:52 pm

That Dr. Dobbs link is old. There is an academic article that analyzes it, and reveals the proposed implementation is fully flawed with tiny but nonetheless important race conditions (more than just one).
The authors concluded that in theory the lock-less tail-based implementation could work, but at least the implementation they provided as reference was flawed and shouldn't be considered reliable. I can't remember the academic article. Google may find it again.

Edit: Oh wait, no need for that. Dr. Dobbs proves Dr. Dobbs was wrong. One has to be very careful when going the "lock-less" route.

bybstone » Thu Dec 13, 2012 5:20 pm

There's one old implementation - M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. It has been verified multiple times by various researches and proved to be correct. But here what Wikipedia says:

Wikipedia wrote:However, in 2011 Kogan and Petrank[6] presented a wait-free queue building on the CAS primitive, generally available on common hardware. Their construction expands the lock-free queue of Michael and Scott,[7] which is an efficient queue often used in practice. A follow-up paper by Kogan and Petrank[8] provided a methodology for making wait-free algorithms fast and used this methodology to make the wait-free queue practically as fast as its lock-free counterpart.

And the wait-free guarantee is even stronger than the lock-free. So if anybody wants a lock-less queue than I'd recommend finding and studying that work by Kogan and Petrank. But yes, the article from Dr. Dobbs doesn't look credible.

bysaejox » Fri Dec 14, 2012 6:18 pm

Did anybody try boost 1.53?
I am very excited about that release, it has all the features i need.
I already integrated boost in core of my code. Boost::Function, Boost::Bind, Boost::Serialization, Boost::Filesystem, Boost::Thread, Boost::Signals2, Boost::asio ...
now it has both atomic<T> and lock-free queues.
They keep on making my life easier

http://www.boost.org/users/history/version_1_53_0.html

Does Ogre really needs to rid of boost?

byTheSHEEEP » Fri Dec 14, 2012 6:44 pm

I also use boost within any project I work on, but I prefer libraries that use as few dependencies as possible.
And as great as boost is, it blows up anything that uses it (in size, I mean, do not quote that out of context

), which is basically not such a good thing for a library like Ogre (which btw. already has more than enough dependencies).
And you certainly don't need boost for lock-free queues and atomic.

byFaranwath » Fri Dec 14, 2012 6:58 pm

Build times could benefit from getting rid of Boost. Just using whatever it takes to get things done is the way to go, I think.

byKlaim » Fri Dec 14, 2012 7:26 pm

Oh by the way, just for pure information related to boost/lockfree: they are finally doing the split of repositories, so the plan is to have different boost libraries in github around February (officially, but it's already working apparently). See: https://svn.boost.org/trac/boost/wiki/W ... dularBoost

I agree with the direction that Ogre should avoid Boost anyway. Even if boost is splitted, most used boost libraries still are interdependant. Some are not but are more specific.

bysaejox » Fri Dec 14, 2012 8:52 pm

So even with atomic<T> , lock-free, future and promise Boost does not get passing grade.

Maybe, C++11 ?

Ogre 2.0 wont arrive tomorrow, maybe C++11 its enough for all threading needs.

bySqeaky » Wed Dec 19, 2012 6:40 am

I read the PDF at the top of a google search: https://www.google.com/search?client=ub ... 8&oe=utf-8 - http://www.google.com/url?sa=t&rct=j&q= ... 4169,d.b2I
Clearly these guys know about the failings of FIFO work queues, they point out like a million failed designs. Assuming I found the right paper, this cannot be used in most real world systems, and even though my knowledge of Ogre internals is still limited I think that includes Ogre. The algorithm employed seems to be more of a container than a workqueue, though they make mention of using it as a workqueue. They make no real note of any common real world synchronization problems, such as multiple workunits needs exclusive access to the same data.

I must have the wrong paper, could I get a link to the paper being referenced.

There is no need to wait for C++11 to be ubiquitous, wrapping native threading APIs is simple, I already did it once. There are a ton of zlib/mit licensed projects that do exactly this, with no additional threading framework.

For what it is worth, I dislike boost. It rarely seems to have quite what I need, and when it does it is easier to re-create it than to get boost working on my system and learn the API for that one class. Even their 'header only' libraries are usually also a mess, never just one or even a few headers, more like dozens, and even then there are countless steps I would need to complete importing into my project. I do like their lexographical cast though, that is very hard to duplicate efficiently. Despite any misgivings I have, it is everywhere and very documented so it may be worthwhile if other like it more than I do.

bybstone » Wed Dec 19, 2012 11:02 am

Here: PDF: Wait-Free Queues With Multiple Enqueuers and Dequeuers.

Edit: Their work is built on top of a garbage collected memory management but they claim that extending that to direct memory management is not an issue.

byKnotanalt » Wed Jan 02, 2013 7:21 am

Sqeaky wrote:I read the PDF at the top of a google search: https://www.google.com/search?client=ub ... 8&oe=utf-8 - http://www.google.com/url?sa=t&rct=j&q= ... 4169,d.b2I
Clearly these guys know about the failings of FIFO work queues, they point out like a million failed designs. Assuming I found the right paper, this cannot be used in most real world systems, and even though my knowledge of Ogre internals is still limited I think that includes Ogre. The algorithm employed seems to be more of a container than a workqueue, though they make mention of using it as a workqueue. They make no real note of any common real world synchronization problems, such as multiple workunits needs exclusive access to the same data.

I must have the wrong paper, could I get a link to the paper being referenced.

There is no need to wait for C++11 to be ubiquitous, wrapping native threading APIs is simple, I already did it once. There are a ton of zlib/mit licensed projects that do exactly this, with no additional threading framework.

For what it is worth, I dislike boost. It rarely seems to have quite what I need, and when it does it is easier to re-create it than to get boost working on my system and learn the API for that one class. Even their 'header only' libraries are usually also a mess, never just one or even a few headers, more like dozens, and even then there are countless steps I would need to complete importing into my project. I do like their lexographical cast though, that is very hard to duplicate efficiently. Despite any misgivings I have, it is everywhere and very documented so it may be worthwhile if other like it more than I do.

Last time when I was doing this I found white papers to be surprisingly bad in general. You don't need "all that" anyway, definitely not what all boost tries to do. Just to wrap some thread functions in a way to work across all platforms would be plenty but it's not a one man job. Worker queue etc. I'm sure is if you have that, I can just past in my own code from there if no one else steps forward.

bydrwbns » Thu Apr 25, 2013 5:31 pm

Can you guys explain how if this is going to be any different than Ogre's current workQueue processing system?

byTransporter » Fri Apr 26, 2013 6:22 pm

TheSHEEEP wrote:I also use boost within any project I work on, but I prefer libraries that use as few dependencies as possible.
And as great as boost is, it blows up anything that uses it (in size, I mean, do not quote that out of context ), which is basically not such a good thing for a library like Ogre (which btw. already has more than enough dependencies).
And you certainly don't need boost for lock-free queues and atomic.

I agree! Now, I use boost in other Projects too, because I've learned (thx to ogre) how easy handling multiple threads is with boost. Before ogre, I've only used Windows api. Anyway, there are a few discussions about moving different parts like math into an own ogre components. What about creating an own ogre threading component?

bySqeaky » Sun Apr 28, 2013 12:18 am

Knotanalt wrote:Last time when I was doing this I found white papers to be surprisingly bad in general.

If by whitepapers you mean those things that try to pass themselves of as academic research papers but are really advertisements for busy corporate executive, then I agree with you. If you are bashing academic research papers, I have to completely disagree. There is a range of quality of course, but the ones that get cited in a positive way are pretty consistenly good and are responsible for the initial distribution of many of the algorithms we simply take for granted now.

I will read the paper provided, but not right now.

As for the thread implementation, I will again offer to allow Ogre to relicense as they see fit any part of the DAGFrameScheduler, and point out that I have a minimalistic cross platform (windows, Linux, Mac OSX) thread class that might be u-sable. If it is missing too many feature it clearly would not be.

bysparkprime » Thu May 02, 2013 7:01 am

About the PPoPP paper linked earlier. The benefit of wait-free is not usually worth the extra overhead -- all Ogre needs is lock-free.

Also PPoPP is one of, if not the most prodigious conference for parallel programming. Pretty much every paper that gets published to PPoPP will be outstanding merit. I am intending to submit some work there this year myself

bysparkprime » Thu May 02, 2013 7:04 am

I don't know what all the fuss is about boost and std::atomic -- ogre has had atomic operations for 5 years now already...