Monday, September 18, 2006

Back to work (and more on ORTE)

After coming back from holidays on the 6th of September, I've been able to work on the project with a refreshed mind again!

I expected to have finished coding by now, but after putting all the components together, I ran into an occasional 'segmentation fault'. I had been looking for the problem for a while, but only with the help of one of my colleagues was I able to track down the causes. There turned out to be two.

For the first cause, I should explain a little bit about how ORTE handles issues:
As I explained before, ORTE works with 'publishers' and 'subscribers'. ORTE periodically invokes a callback function on the publisher side, which is meant to prepare the data for sending in a memory buffer. When the callback function finishes, ORTE reads the buffer and copies it to a memory buffer on the subscriber side. It then invokes the subscriber callback function which is meant to process the incoming data.
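
To make that flow concrete, here is a minimal single-threaded sketch in plain C. It does not use ORTE at all: the callback types, the buffer size and every function name are made up for illustration, and the real middleware does the copy over the network rather than with memcpy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ISSUE_SIZE 32

/* Hypothetical callback types; these are not the ORTE signatures. */
typedef void (*send_callback)(char *buffer, size_t size);
typedef void (*recv_callback)(const char *buffer, size_t size);

static char received_text[ISSUE_SIZE];

/* Publisher side: prepare the data to send in the given buffer. */
static void prepare_issue(char *buffer, size_t size)
{
    snprintf(buffer, size, "data source %d", 7);
}

/* Subscriber side: process the incoming data (here: just keep a copy). */
static void process_issue(const char *buffer, size_t size)
{
    (void)size;
    snprintf(received_text, sizeof received_text, "%s", buffer);
}

/* Stand-in for the middleware: one publish/deliver cycle. */
static const char *deliver_one_issue(send_callback on_send, recv_callback on_recv)
{
    char publisher_buffer[ISSUE_SIZE] = {0};
    char subscriber_buffer[ISSUE_SIZE] = {0};

    assert(on_send != NULL && on_recv != NULL);
    on_send(publisher_buffer, sizeof publisher_buffer);   /* 1. publisher callback fills its buffer */
    memcpy(subscriber_buffer, publisher_buffer,
           sizeof subscriber_buffer);                     /* 2. buffer is copied to the subscriber */
    on_recv(subscriber_buffer, sizeof subscriber_buffer); /* 3. subscriber callback processes it */
    return received_text;
}

/* Runs one cycle with the two example callbacks above. */
const char *demo_issue_round_trip(void)
{
    return deliver_one_issue(prepare_issue, process_issue);
}
```

Calling demo_issue_round_trip() from a main function runs one full cycle and returns the text as the subscriber saw it.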

So what went wrong in my implementation?
My request handler acts as a subscriber for requests. Whenever a request comes in, it behaves as a publisher to notify the requesting client of the data source ID. (You can find this scenario in the diagram added to my post of the 3rd of August.)
I used the request handler's subscriber callback function to immediately create a publication of the data source ID. And that was the problem! ORTE only allows publications to be created in the main thread. The subscriber callback function runs in another thread, so creating the publication there fails.
I have already fixed this problem by placing the creation of the publication in the main thread, controlling it with the use of semaphores.
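
The idea behind that fix can be sketched with POSIX threads and semaphores. This is only an outline under my own assumptions, not the actual project code: create_publication, subscriber_callback and handle_one_request are hypothetical stand-ins rather than ORTE calls, and the callback thread is started by hand here instead of by the middleware.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* All names below are hypothetical stand-ins; none of them are ORTE API. */
static sem_t request_pending;          /* posted by the callback, waited on by the creator */
static int pending_source_id;          /* request data handed over between the threads */
static pthread_t creator_thread;       /* the thread that is allowed to create publications */
static int created_by_creator_thread;  /* 1 if the creation ran on the allowed thread */

/* Stand-in for the real publication-creation call. */
static void create_publication(int source_id)
{
    created_by_creator_thread = pthread_equal(pthread_self(), creator_thread) != 0;
    printf("creating publication for data source %d\n", source_id);
}

/* Stand-in for the subscriber callback, which runs on another thread.
 * It only records the request and signals; it creates nothing itself. */
static void *subscriber_callback(void *arg)
{
    pending_source_id = *(const int *)arg;
    sem_post(&request_pending);
    return NULL;
}

/* Call this from the main thread. Returns 1 if the publication was
 * created on the calling thread, or -1 if the setup failed. */
int handle_one_request(int source_id)
{
    pthread_t receiver;

    creator_thread = pthread_self();
    created_by_creator_thread = 0;
    if (sem_init(&request_pending, 0, 0) != 0)
        return -1;
    if (pthread_create(&receiver, NULL, subscriber_callback, &source_id) != 0) {
        sem_destroy(&request_pending);
        return -1;
    }

    /* Block until the callback reports a request; retry if interrupted. */
    while (sem_wait(&request_pending) == -1 && errno == EINTR)
        ;
    assert(pending_source_id == source_id);
    create_publication(pending_source_id);  /* runs on the calling thread */

    pthread_join(receiver, NULL);
    sem_destroy(&request_pending);
    return created_by_creator_thread;
}
```

The point of the design is that the callback does as little as possible: it stores what was requested and posts the semaphore, and everything that must happen in the main thread happens after sem_wait returns there.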

What about the second cause?
The second cause of the segmentation faults was my enthusiasm to free memory as soon as possible.
After making the request handler send the data source ID, I immediately freed the memory used by this publication. The C code looks like this:

//p is the publication handle
ORTEPublicationSend(p); //sending the issue
ORTEPublicationDestroy(p); //destroy the publication handle

I expected the ORTEPublicationSend function to block the calling thread until the publication had been sent completely, but that turned out to be a false assumption. With the code above, I was destroying the publication handle before the send had finished, causing a segmentation fault.

This issue has been fixed with a workaround for now.
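
The general rule behind this is that a handle must stay alive until an asynchronous send has finished using it. The sketch below illustrates that rule only; it is not the workaround mentioned above and it is not ORTE code. The publication type and the publication_* functions are hypothetical, and the "send" is simulated with a worker thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical publication handle; this is not the ORTE type. */
typedef struct {
    int payload;       /* the data to send */
    pthread_t sender;  /* worker that performs the send in the background */
} publication;

static int last_delivered;  /* what the simulated receiver got */

/* Background send: still reads the handle after publication_send returned. */
static void *send_worker(void *arg)
{
    publication *p = arg;
    last_delivered = p->payload;
    return NULL;
}

publication *publication_create(int payload)
{
    publication *p = malloc(sizeof *p);
    if (p != NULL)
        p->payload = payload;
    return p;
}

/* Non-blocking send: returns as soon as the worker has been started. */
int publication_send(publication *p)
{
    return pthread_create(&p->sender, NULL, send_worker, p);
}

/* Safe teardown: wait until the worker is done with the handle, then free it. */
void publication_destroy_when_sent(publication *p)
{
    pthread_join(p->sender, NULL);
    free(p);
}

/* Sends one value and releases the handle; returns what was delivered, or -1 on failure. */
int send_and_release(int payload)
{
    publication *p = publication_create(payload);
    if (p == NULL)
        return -1;
    if (publication_send(p) != 0) {
        free(p);  /* no worker was started, so freeing right away is safe */
        return -1;
    }
    publication_destroy_when_sent(p);
    return last_delivered;
}
```

Freeing p directly after publication_send, as in my original two lines, would let send_worker read freed memory, which is the same kind of bug as the one described above.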

What coding is still to be done?
For my first measurements, I have to set up all the data sources in different threads. Using the semaphores to fix one issue has introduced a new one: it prevents the data sources from sending data.

After this, I will review my code myself, and I will invite a colleague to review it critically.
