- Type: Story
- Status: Resolved
- Priority: Medium
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: VOLTHA v2.10
- Component/s: openolt-adapter, openonu-adapter, rw-core
- Labels: None
- Story Points: 8
Girish GC
I wanted to let you know that I pushed a change last Friday raising the default core timeout from 10s to 30s. I observed that during reconcile the 10s timeout wasn't enough if a flow add arrived immediately after the adapter restart. https://gerrit.opencord.org/c/voltha-helm-charts/+/28508
As @teone pointed out on another channel, a better approach could be a streaming gRPC connection that detects connection loss immediately, rather than the current polling mechanism. But I guess that is a bigger change right now and probably counterproductive given all the changes we already have.
With the increased timeout we have more stable runs now. There are still some other failures and we are looking into that.
khenaidoo
Increasing the timeouts is ok for now. I have some thoughts about connection failure detection. I need to try a few things… Will let you know if I find something this week ….
@Girish GC @teone While I can optimize the connection failure detection, it will still take time for the containers to communicate with each other (~5s-15s). The core sends the reconcile to the onu adapter once it can connect to it. The onu adapter then initiates the process by sending numerous OMCI messages to the olt adapter to proxy. If the olt adapter cannot deliver the responses to the onu adapter, there will be a failure. By reducing the connection failure detection time, all we are doing is shrinking the gap between the onu adapter receiving a reconcile request and the olt adapter establishing a connection to the onu adapter. There will still be a short interval where the failure can happen. Here are some options I am considering for this situation:
- The ONU adapter sends OMCI messages to an OLT adapter only when the latter can reply back to it. This can be done via a streaming connection check. Note that the ONU adapter will need to have a retry mechanism for this as there will be failures after a restart.
- Move the OMCI proxy messaging entirely to gRPC streaming. This may be a better solution since "all" the OLT adapter is doing is taking a common OMCI message type from one channel, pushing it to the OLT app on the OLT device, and sending the response back. I was contemplating doing this during the gRPC migration, but there is logic tied to the ordering of messages in the ONU adapter that it was not right to change during that migration. This is something we can consider, but it would need quite some work in the adapters, especially if we want to support multiple adapter instances. Also, the retry mechanism on restarts will still be needed, in addition to establishing a streaming OMCI connection on start/restart. Given that we are a few weeks away from feature complete, this may be best done in a future release.
- Leave things as they are (maybe keep the timer at 30s) and optimize in a later release.
Girish GC
I am ok with option 3. Regarding option 2, yes, that could work, but please note that there are also async OMCI indications, such as alarm notifications and attribute value change notifications (so it is not just req/resp).