Type: Bug
Status: Resolved
Priority: High
Resolution: Fixed
Affects Version/s: None
Fix Version/s: VOLTHA v2.9, VOLTHA v2.10
Component/s: openolt-agent
Labels:
Story Points: 5
Epic Link:
When using the new Broadcom BAL release 3.10.2.2 on the XGSPON OLTs at ONF, the dev_mgmt_daemon process throws a large number of errors at random and becomes unusable afterwards.
I have attached the dev_mgmt_daemon logs for reference. Interestingly, the issue occurs randomly and not on all the OLTs at ONF. Per Edgecore, the workaround is to hard or soft reboot the OLT until the issue goes away.
We also see another flavor of this issue on some of the XGSPON OLTs used at ONF.
The openolt-agent fails to reset the PON MAC devices on the OLT, and the errors below are seen at the agent.
[59115: I OPENOLT ] core_api_handler.cc 398| Enabling PON 8 Devices ...
[59115: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 0
[60117: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(0) failed, err = Operation timed out
[61124: E OPENOLT ] core_api_handler.cc 451| Enable PON device 0 failed, err = Operation timed out
[61324: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 1
[62331: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(1) failed, err = Operation timed out
[63337: E OPENOLT ] core_api_handler.cc 451| Enable PON device 1 failed, err = Operation timed out
[63537: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 2
[64544: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(2) failed, err = Operation timed out
[65550: E OPENOLT ] core_api_handler.cc 451| Enable PON device 2 failed, err = Operation timed out
[65750: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 3
[66757: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(3) failed, err = Operation timed out
[67763: E OPENOLT ] core_api_handler.cc 451| Enable PON device 3 failed, err = Operation timed out
[67964: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 4
[68970: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(4) failed, err = Operation timed out
[69976: E OPENOLT ] core_api_handler.cc 451| Enable PON device 4 failed, err = Operation timed out
[70176: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 5
[71183: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(5) failed, err = Operation timed out
[72189: E OPENOLT ] core_api_handler.cc 451| Enable PON device 5 failed, err = Operation timed out
[72389: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 6
[73396: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(6) failed, err = Operation timed out
[74402: E OPENOLT ] core_api_handler.cc 451| Enable PON device 6 failed, err = Operation timed out
[74602: I OPENOLT ] core_api_handler.cc 309| Reset PON device: 7
[75609: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(7) failed, err = Operation timed out
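For triage across several OLTs, the agent log above can be summarized per PON device. The following is a minimal sketch, assuming the log lines keep the `[timestamp: level module ] file line| message` layout shown above; the function name `summarize_failures` and the regular expressions are illustrative and not part of openolt-agent.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the excerpt above:
# [60117: E OPENOLT ] core_api_handler.cc 315| Failed to reset PON device(0) failed, err = ...
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+):\s*(?P<level>[A-Z])\s+(?P<module>\S+)\s*\]\s*"
    r"(?P<file>\S+)\s+(?P<line>\d+)\|\s*(?P<msg>.*)"
)
RESET_FAIL_RE = re.compile(r"Failed to reset PON device\((\d+)\)")
ENABLE_FAIL_RE = re.compile(r"Enable PON device (\d+) failed")


def summarize_failures(log_text):
    """Return {device_id: sorted list of failed steps} for error-level lines."""
    failures = defaultdict(set)
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match or match.group("level") != "E":
            continue  # skip info lines and anything that does not parse
        msg = match.group("msg")
        reset = RESET_FAIL_RE.search(msg)
        if reset:
            failures[int(reset.group(1))].add("reset")
        enable = ENABLE_FAIL_RE.search(msg)
        if enable:
            failures[int(enable.group(1))].add("enable")
    return {dev: sorted(steps) for dev, steps in sorted(failures.items())}
```

Feeding the saved agent log into `summarize_failures` gives a quick view of which PON devices failed the reset step, the enable step, or both, which makes it easier to compare affected and unaffected OLTs.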
On the dev_mgmt_daemon side, I see the logs below, which seem relevant to the issue.
[59889: I dev_agent_4 ] da_fsm.c 447| Connect API request received
[59889: I dev_agent_4 ] da_fsm.c 463| Internal connect request received
[59889: I dev_agent_4 ] da_bh.c 255| CONNECT DEVICE
[59889: I dev_agent_4 ] da_bh_pcie.c 821| bcmuser_device_is_ready_for_reconnect returned success: is_ready_for_reconnect=no
[59889: I dev_agent_4 ] da_bh_pcie.c 739| Connection to PON MAC device from reset
[59999: I dev_agent_4 ] da_bh_pcie.c 198| FLD_INFO: ddr_length=0x1000000 sram_base=0x7f8e103fa000 soc_ddr_base=0x7f8dac800000 soc_regs_base=0x7f8dac000000
[60006: I dev_agent_4 ] da_bh_pcie.c 713| Failed to retrieve Embedded Software Error(s): -19
[60006: I dev_agent_4 ] da_bh_pcie.c 625| Write bootloader to SRAM
[60006: I dev_agent_4 ] da_bh_pcie.c 273| Loading boot loader
[60006: I dev_agent_4 ] da_bh_pcie.c 307| Writing boot loader (40712)
[60008: I dev_agent_4 ] da_bh_pcie.c 317| Reading boot loader for checking (40712)
[60024: I dev_agent_4 ] da_bh_pcie.c 325| Comparing boot loader (40712)/(0)
[60024: I dev_agent_4 ] da_bh_pcie.c 367| Transferred 40712 bytes
I have seen this issue happen occasionally, and it seems to be a platform-specific problem. When it happened, I could not SSH to the OLT, and when I logged in via the serial console I saw the logs below appearing continuously on the console:
[ 2394.588500] ixgbe 0000:05:00.0 eth2: Received ECC Err, initiating reset
[ 2394.588500] ixgbe 0000:05:00.0 eth2: Received ECC Err, initiating reset
[ 2394.588500] ixgbe 0000:05:00.0 eth2: Received ECC Err, initiating reset
[ 2394.588500] ixgbe 0000:05:00.0 eth2: Received ECC Err, initiating reset
[ 2394.588500] ixgbe 0000:05:00.0 eth2: Received ECC Err, initiating reset
Per Edgecore, we need to either ship the hardware to their office for further debugging or hard reboot the OLT until the issue goes away.
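To spot an OLT in this state from a captured serial console log, the repeated ixgbe messages can be counted per interface. This is a minimal sketch, assuming the console lines match the format quoted above; the helper name `count_ecc_errors` is hypothetical.

```python
import re
from collections import Counter

# Assumed console line format, taken from the excerpt above:
# [ 2394.588500] ixgbe 0000:05:00.0 eth2: Received ECC Err, initiating reset
ECC_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ixgbe\s+(?P<pci>\S+)\s+(?P<iface>\S+):\s+Received ECC Err"
)


def count_ecc_errors(console_text):
    """Count 'Received ECC Err' messages per (PCI address, interface)."""
    counts = Counter()
    for line in console_text.splitlines():
        match = ECC_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("iface"))] += 1
    return dict(counts)
```

A large count for a single interface in a short capture suggests the OLT is stuck in the reset loop described above and needs the hard reboot recommended by Edgecore.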