- Type: Bug
- Status: To Do
- Priority: Medium
- Resolution: Unresolved
- Affects Version/s: VOLTHA v2.10
- Fix Version/s: Future
- Component/s: ofagent-go, rw-core, Testing
- Labels: None
- Story Points: 5
- Epic Link:
Purpose of the test/pod/hardware
To test openonu rolling update.
Current issue
The job https://jenkins.opencord.org/view/VOLTHA-2.X-Tests/job/periodic-software-upgrade-test-bbsim fails on the VOLTHA component minor upgrade and rolling update tests.
Root cause
When the openonu pod is changed, upgraded, or downgraded, rw-core correctly detects that the image has been updated. It waits for the connection with the old adapter to be severed completely and then starts a connection with the new adapter. Meanwhile, the openonu adapter sets the pod ready state to “True” once the kafka and etcd probes pass. These two events, rw-core establishing a connection with the new adapter and the new adapter setting its pod ready state to True, happen independently of each other.
As soon as rw-core establishes a connection with the new adapter, it sends a reconcile request for all the devices managed by that adapter. The openonu adapter does not accept a reconcile request until it is sure that the openolt adapter is able to communicate with it. The openolt adapter, in turn, does not establish a connection with the openonu adapter until it receives a flow (which consequently triggers a TP request to the openonu adapter) or some other external trigger (such as an OMCI response).
During this time, any reconcile request issued to the openonu adapter will keep failing.
Meanwhile, the test issues a volt-add-subscriber-access call, which should help establish the connection between the openolt and openonu adapters. However, the subscriber was already provisioned in a previous component upgrade test (openolt adapter), so the call to provision the subscriber is silently ignored. Everything looks good externally: the flows are installed and packet in/out works, so the test proceeds to the next component upgrade, which is rw-core. When rw-core is restarted, things seem to start breaking because the device was not reconciled during the previous component upgrade (for the reasons already described).
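The sequence described above can be illustrated with a toy model. This is only a sketch of the reasoning in this ticket, not VOLTHA code: the classes OnuAdapter, OltAdapter, and SubscriberService, their methods, the retry count, and the IDs "onu-1" and "sub-1" are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OnuAdapter:
    """Toy stand-in for the openonu adapter (hypothetical, not the real API)."""
    olt_connected: bool = False

    def reconcile(self, device_id: str) -> bool:
        # Reconcile is rejected until the openolt adapter has connected.
        return self.olt_connected


@dataclass
class OltAdapter:
    """Toy stand-in for the openolt adapter (hypothetical, not the real API)."""
    onu: OnuAdapter

    def on_flow(self) -> None:
        # A flow triggers a TP request, which opens the connection to openonu.
        self.onu.olt_connected = True


@dataclass
class SubscriberService:
    """Toy stand-in for whatever handles volt-add-subscriber-access."""
    olt: OltAdapter
    provisioned: set = field(default_factory=set)

    def add_subscriber(self, sub: str) -> bool:
        if sub in self.provisioned:
            return False  # already provisioned: silently ignored, no flow sent
        self.provisioned.add(sub)
        self.olt.on_flow()
        return True


def reconcile_with_retries(onu: OnuAdapter, device_id: str, attempts: int = 5) -> bool:
    # rw-core keeps retrying; nothing it does here changes the adapter state.
    return any(onu.reconcile(device_id) for _ in range(attempts))


def simulate_openonu_upgrade(already_provisioned: bool) -> bool:
    """Return True if the device ends up reconciled after the upgrade."""
    new_onu = OnuAdapter()  # fresh pod after the rolling update
    subscribers = SubscriberService(OltAdapter(new_onu))
    if already_provisioned:
        subscribers.provisioned.add("sub-1")  # left over from the openolt test
    reconciled = reconcile_with_retries(new_onu, "onu-1")
    subscribers.add_subscriber("sub-1")  # the test's provisioning call
    return reconciled or reconcile_with_retries(new_onu, "onu-1")
```

In this model, simulate_openonu_upgrade(already_provisioned=True) never reconciles, because the provisioning call is a no-op and nothing else triggers the openolt adapter. With already_provisioned=False, the second round of retries succeeds.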
Next Steps
There seem to be architectural issues here: the openonu adapter depends on the openolt adapter communicating with it first before it can process a reconcile, the openolt adapter depends on some other external trigger for this to happen, and rw-core just keeps retrying a reconcile that never goes through. A Jira ticket is needed to look into this.
Meanwhile, we should try changing the test to unprovision the subscriber at the start of each component upgrade test, to see if that breaks this deadlock. Hope this makes some sense.
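A rough sketch of the proposed test change follows. It assumes that re-adding a subscriber after removing it results in a fresh flow being pushed; whether that actually unblocks the reconcile in the real system is exactly what the change is meant to find out. FakeSubscriberApi, run_upgrade_tests, and the subscriber ID are hypothetical names, and the actual component upgrade step is left as a comment.

```python
class FakeSubscriberApi:
    """Hypothetical stand-in for the subscriber provisioning calls used by the test."""

    def __init__(self) -> None:
        self.provisioned = set()
        self.flows_pushed = 0

    def add(self, sub: str) -> bool:
        if sub in self.provisioned:
            return False  # silently ignored, as described in the root cause
        self.provisioned.add(sub)
        self.flows_pushed += 1  # assumption: a new provision pushes a flow
        return True

    def remove(self, sub: str) -> None:
        self.provisioned.discard(sub)


def run_upgrade_tests(api, components, sub, unprovision_first):
    """Return, per component, how many flows the provisioning call pushed."""
    pushed = {}
    for component in components:
        if unprovision_first:
            api.remove(sub)  # proposed change: start from a clean state
        before = api.flows_pushed
        # ... the component upgrade or rolling update would run here ...
        api.add(sub)
        pushed[component] = api.flows_pushed - before
    return pushed


components = ["openolt", "openonu", "rw-core"]
current = run_upgrade_tests(FakeSubscriberApi(), components, "sub-1", False)
proposed = run_upgrade_tests(FakeSubscriberApi(), components, "sub-1", True)
```

With the current behaviour, only the first component test pushes a flow (current is {"openolt": 1, "openonu": 0, "rw-core": 0}); with the unprovision step, every component test does (proposed has 1 for each component).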