DDTSes From ES+/C7600 Troubleshooting Guide - Cisco



--------------------------------------------------------------------------------------------
Identifier    RNE OK        Headline
--------------------------------------------------------------------------------------------
CSCsv05515    N (internal)  x40g: Improve the message wordings for recoverable tcam errors
CSCsw31515    N (internal)  ES+: %DEV_SELENE-DFCx-3-SRAM_ECC: Selene SRAM ECC Errors
CSCtb76621    Y             ES+ ROMMON: MPC8548 DDR20 errata fix for Multi-bit ECC errors
CSCtb78538    N (internal)  ES+ ROMMON: controller setting changes to prevent Multi-bit ECC errors
CSCtc17311    Y             ES+: TCAM_MGR_HW_ERR: TCAM device had corrupted data errors
CSCtd66014*   Y             ES+: ECC_DOUBLE: Double-bit ECC error detected on NP - High T, Normal V
CSCtd99244*   Y             ES+: ECC_SINGLE or ECC_DOUBLE error detected on NP
CSCtd99248*   Y             ES+: ECC_DOUBLE: Double-bit ECC error detected on NP
CSCte14535    Y             Invalid LinkFPGA or LINKFPGA Bus Error
CSCtg31984    Y             DBUS-HDR error in ES/ES+ Modules
CSCth11714*   Y             ES+ ECC_DOUBLE: Double-bit ECC error or reset due to eznp_ecc_err_isr
CSCth15790*   Y             Low-queue ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 16
CSCth20868    Y             Link FPGA Update Failures with Different signatures
CSCth25959    Y             IOS changes for updating the new temperature thresholds for ES+
CSCti80887    Y             Temperature incorrect when sensor is Not_Operational
CSCtn41667    Y             IOS fix for handling the Power calcuation issues with ES+ Combo cards
CSCtn68668    Y             Fix LC inlet temp issue (ES+XC) and Alarm handling issues (All ES+)
CSCtn95122    Y             ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 17
CSCto55567    Y             ES+: LC failed to recover due to Metropolis lockup
CSCtq07626    Y             ES+: DEV_SELENE XAUI_LEN, FIFO_FULL, XAUI_GNT and XAUI_MIN errors
CSCtr37182    Y             XAUI code error reporting needs to be changed
CSCtr74529    Y             ES+: LONGBUSYREAD: C2W Interface busy for long time reading temp sensor
CSCtr74953    Y             ES+: Watchdog resets fail to write crashinfo, causing Keep Alive failure
CSCts25729    Y             ES+: PCI read hang causes Keep Alive failure, fails to write crashinfo
CSCtt13344    Y             ES+: Ingress traffic will not pass with > 7091 bytes packet size
--------------------------------------------------------------------------------------------

*NOTE: There were multiple issues related to ECC parity errors on ES+ linecard.
All of the known issues are fixed in the latest release. Customers with ES+ deployments are advised to upgrade to 12.2(33)SRE5, 15.0(1)S5, or a later release. Customers who have deployed a 12.2(33)SRD release and cannot upgrade to 12.2(33)SRE5 should upgrade to the latest rebuild, 12.2(33)SRD6.

========================================
CSCsv05515
x40g: Improve the message wordings for recoverable tcam errors
----------------------------------------
If this error message is encountered, please contact Cisco TAC for further support.

========================================
CSCsw31515
ES+: %DEV_SELENE-DFCx-3-SRAM_ECC: Selene SRAM ECC Errors
----------------------------------------
If this error message is encountered, please contact Cisco TAC for further support.

========================================
CSCtb76621
ES+ ROMMON: MPC8548 DDR20 errata fix for Multi-bit ECC errors
----------------------------------------
Symptom:
%C6K_MEM_ECC-DFCx-2-MBE: Multiple bit error detected at ...
%C6K_MEM_ECC-DFCx-3-SYNDROME_MBE: 8-bit Syndrome for the detected Multi-bit error: ...
%C7600_MEM_ECC-DFCx-2-MBE: Multiple bit error detected at ...
%C7600_MEM_ECC-DFCx-3-SYNDROME_MBE: 8-bit Syndrome for the detected Multi-bit error: ...

Conditions:
Observed on the ES+ line card of Cisco 7600 Series Routers.

Workaround:
There is no workaround.

Further Problem Description:
This fix is integrated in the 12.2(33r)SRD7 ROMMON image for the ES+ card. The SRD7 ROMMON image is bundled into the IOS package for Cisco 7600 Series Routers starting from 15.0(1)S. Cisco 7600 Series Routers running a 12.2(33)SRD or 12.2(33)SRE image may also run the SRD7 ROMMON. If affected by this issue, contact Cisco TAC and request the 12.2(33r)SRD7 image.
Please refer to this link for the ROMMON upgrade procedure:

========================================
CSCtb78538
ES+ ROMMON: controller setting changes to prevent Multi-bit ECC errors
----------------------------------------
If this error message is encountered, please contact Cisco TAC for further support.

========================================
CSCtc17311
ES+: TCAM_MGR_HW_ERR: TCAM device had corrupted data errors
----------------------------------------
Symptoms:
TCAM device is reporting corrupted data:
%X40G-DFC2-3-TCAM_MGR_HW_ERR: GTM HW ERROR: TCAM device had corrupted data, the error is corrected for channel ...

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers, reported by a background TCAM consistency checker.

Workaround:
There is no workaround.

Further Problem Description:
These messages can safely be ignored, as the entries are already corrected.

========================================
CSCtd66014
ES+: ECC_DOUBLE: Double-bit ECC error detected on NP - High T, Normal V
----------------------------------------
Symptoms:
The ES+ line card crashes at powerup of a Cisco 7600 router running a Cisco IOS 12.2SRE image if either the Traffic Manager or Frame memories in the ES+ network processors report a double-bit ECC error. The ES+ line card crashinfo will contain the following string:
%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, Mem 19, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:
Router reloads, OIR of ES+ cards, and system environment temperatures that slowly vary around an ambient temperature of about 30 degrees C. This happens at system powerup. Double-bit ECC problems have also been reported after a few hours of traffic if the ambient temperature varies around 30 degrees C.

Workaround:
No configuration workaround is available.
The line card will reset itself and will be operational after the second reload.

========================================
CSCtd99244
ES+: ECC_SINGLE or ECC_DOUBLE error detected on NP
----------------------------------------
Symptoms:
A 7600 series router with an ES+ line card crashes, reporting a single-bit or double-bit ECC error.
%NP_DEV-DFC2-3-ECC_SINGLE: Single-bit ECC error detected on NP 0, Mem 18, SubMem 0x1,SingleErr 1, DoubleErr 0 Count 1 Total 1
%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, Mem 19, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:
Symptom observed on the ES+ linecard of C7600 series routers, usually in the initial phases of line card bootup, but it has also been reported after a few hours of traffic through the ES+ line card ports.

Workaround:
There is no workaround.

Further Problem Description:
Software fix is available in:
12.2(33)SRD5 or higher
12.2(33)SRE2 or higher
15.0(1)S or higher
If the symptom persists after the IOS upgrade, please contact Cisco TAC.

========================================
CSCtd99248
ES+: ECC_DOUBLE: Double-bit ECC error detected on NP
----------------------------------------
Symptoms:
On 7600 series routers with ES+ line cards, occasional double-bit ECC errors can be reported by the network processor on the ES+ line card for the traffic manager and other metadata memories.
Example error message:
%NP_DEV-DFC9-3-ECC_DOUBLE: Double-bit ECC error detected on NP 3, Mem 18, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:
This symptom is observed on router reloads, OIR of ES+ cards, and system environment temperatures that slowly vary around an ambient temperature of about 30 degrees C. This happens at system power up. Double-bit ECC errors have also been reported after a few hours of traffic.

Workaround:
No configuration workaround is available.
The line card will reset itself and will be operational after the second reload.

Further Problem Description:
Software fix is available in:
12.2(33)SRD5 or higher
12.2(33)SRE2 or higher
15.0(1)S or higher
If the symptom persists after the IOS upgrade, please contact Cisco TAC.

========================================
CSCte14535
Invalid LinkFPGA or LINKFPGA Bus Error
----------------------------------------
Symptom:
Possible symptoms are:
%FPD_MGMT-3-INVALID_IMG_VER: Invalid ... LinkFPGA .. image version detected for ... card in slot-dc ...
%FPD_MGMT-6-UPGRADE_PASSED: ... LinkFPGA ... image in the ... card in slot-dc 7-2 has been successfully updated from version ?.? to version ...
%C7600_ES-2-IOFPGA_IO_BUS_ERROR: C7600-ES Line Card IOFPGA IO LINKFPGA Bus Error

Conditions:
Observed during boot/reload of the ES+ line card in Cisco 7600 Series Routers. Rare in normally working ES+ cards.

Workaround:
This fix is an enhancement which adds an additional recovery cycle for reading the LinkFPGA.

Further Problem Description:
The link FPGA should recover in the next recovery reload of the ES+. If it does not recover after 3 consecutive attempts, a persistent hardware fault may be the reason. Contact TAC for RMA procedures.

========================================
CSCtg31984
DBUS-HDR error in ES/ES+ Modules
----------------------------------------
Symptom:
A 7600 with an ES/ES+ module may report the error EARL_L2_ASIC-DFC2-4-DBUS_HDR_ERR after boot up. There is no functional impact to the switch due to this error.

Conditions:
7600 with ES/ES+ modules present. The problem can happen up to a few hours after boot up.

Workaround:
No workaround.
Problem has been resolved in 12.2(33)SRD5 and 12.2(33)SRE2.

========================================
CSCth11714
ES+ ECC_DOUBLE: Double-bit ECC error or reset due to eznp_ecc_err_isr
----------------------------------------
Symptom:
A 7600 Series router with an ES+ line card crashes, reporting the error:
%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, Mem 19,SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1
Another possible symptom is:
%PM_SCP-SP-1-LCP_FW_ERR: System resetting module 1 to recover from error: eznp_ecc_err_isr: ECC intr handler for NP: 1 failed

Conditions:
Symptom observed on the ES+ linecard of C7600 series routers.

Workaround:
None.

Further Problem Description:
Software fix is available in:
12.2(33)SRD5 or higher
12.2(33)SRE2 or higher
15.0(1)S or higher
If the symptom persists after the IOS upgrade, please contact Cisco TAC.

========================================
CSCth15790
Low-queue ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 16
----------------------------------------
Symptoms:
%NP_DEV-DFC9-3-ECC_DOUBLE: Double-bit ECC error detected on NP 3, Mem 16, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:
Symptom observed on Low-queue ES+ line cards (ES+T) of C7600 series routers, in NP Mem 16.

Workaround:
There is no workaround.

Further Problem Description:
If the symptom persists after the IOS upgrade, please contact Cisco TAC.

========================================
CSCth20868
Link FPGA Update Failures with Different signatures
----------------------------------------
Symptom:
The ES+ card crashes with different failure messages during production. In most cases, the initial message for the reload is an FPD upgrade failure after multiple attempts. The crash messages differ between bootup attempts. These messages can be System Exception, FPD upgrade failure, or IOFPGA bus error.
Message examples are:

Initial symptom would be:
%FPD_MGMT-3-INVALID_IMG_VER: Invalid 20x1G LinkFPGA (FPD ID=7) image version detected for 7600-ES+20G card in slot-dc 7-2.

IOFPGA bus error symptom:
%C7600_ES-DFC7-2-IOFPGA_IO_BUS_ERROR: C7600-ES Line Card IOFPGA IO LINKFPGA Bus Error:

and other system exceptions.

Conditions:
Symptom observed during boot-up of 7600-ES+ linecards.

Workaround:
None.

========================================
CSCth25959
ENV-4-MINORTEMPALARM - updating the new temperature thresholds for ES+
----------------------------------------
Symptom:
A temperature alarm (ENV-4-MINORTEMPALARM) is reported, with an AMBER LED on the line card faceplate.

Conditions:
7600 series router with any model of the ES+ line card.

Workaround:
No workaround.

Further Problem Description:
Temperature thresholds were set too low before this bug-fix. Correct settings are:
--------------------------------------------
Sensor        Minor       Major
ID            Threshold   Threshold
--------------------------------------------
BB Outlet 0   65          80
BB Outlet 1   70          85
--------------------------------------------
It is recommended to also evaluate the related bug CSCtn68668.

========================================
CSCti80887
Temperature 128 degC reported when sensor is Not_Operational
----------------------------------------
Symptom:
The faceplate LED on the linecard is red. The temperature sensor is reporting 128 degC.
In addition, the following I2C error may be reported by the linecard, confirming that the temperature sensor cannot be read:
I2C Read Error READ bus=0x1 addr=0x4D port_sel=0x0 flags = 0x0 cmd=0x0 size=2

Conditions:
Faulty sensor on an ES+ linecard of a C7600 Series Router.

Workaround:
None.

Further Problem Description:
This SW fix corrects the reporting of an invalid sensor.
Under the same circumstances, 'NO' (Not Operational) will be reported instead of 128 degC.

========================================
CSCtn41667
IOS fix for handling the Power calculation issues with ES+ Combo cards
----------------------------------------
Symptom:
The following ES+ PIDs consume more power than the expected values:
76-ES+XC-20G3C
76-ES+XC-20G3CXL
76-ES+XC-40G3C
76-ES+XC-40G3CXL
This might lead to other modules getting powered down due to "power deny".

Conditions:
Specific to ES+XC variants (Combo cards) of Cisco 7600 Series Routers.

Workaround:
Configure power redundancy-mode combined until the IOS is upgraded to a release with correct power settings.

========================================
CSCtn68668
Fix LC inlet temp issue (ES+XC) and Alarm handling issues (All ES+)
----------------------------------------
Symptoms:
The following symptoms are observed:
1. The STATUS LED on the line card faceplate is amber.
2. The remote command module module show platform hardware environment temperature command reports high line card inlet temperature:

Router#remote command mod 1 show plat hard env temp
----------------------------------------------------------
Temperature and Threshold Table
----------------------------------------------------------
Sensor        Minor       Major       Current
ID            Threshold   Threshold   Temperature
----------------------------------------------------------
BB Outlet 0   60          75          47
BB Inlet 0    50          65          27
BB Outlet 1   75          85          54
BB Inlet 1    50          65          32
PE Outlet     60          75          53
PE Inlet      50          65          34
LC Outlet     60          75          49
LC Inlet      50          65          50   <<<<<<<<

Conditions:
This issue is specific to the following Cisco 7600 ES+ combo cards:
76-ES+XC-20G3C
76-ES+XC-20G3CXL
76-ES+XC-40G3C
76-ES+XC-40G3CXL
The line card inlet sensor is positioned in a place where temperatures are higher than at the inlet point.

Workaround:
There is no workaround.

Further Problem Description:
There are no problems with the functioning of the board. Only the external communication is affected. "BB Inlet 1" shows the actual inlet temperature.
It can be used for reliable measurement of line card inlet temperature.

========================================
CSCtn95122
ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 17
----------------------------------------
Symptoms:
The ECC double-bit error is reported in syslog, followed by a linecard crash:
%NP_DEV-DFC5-3-ECC_DOUBLE: Double-bit ECC error detected on NP ... Mem 17

Conditions:
Observed on ES+ linecards of C7600 Series Routers when heavy configuration changes are applied to the linecard. In addition, there are other unknown race conditions that can cause this. This bug-fix is specific to Double-bit errors on Mem 17.

Workaround:
There is no workaround.

========================================
CSCto55567
ES+: FABRICCRCERRS after SSO due to Metropolis lockup
----------------------------------------
Symptoms:
The line card reports fabric errors:
%FABRIC_INTF_ASIC-DFC9-4-FABRICCRCERRS: Fabric ASIC 0: 322 Fabric CRC error events in 100ms period
Also, the TestMacNotification and TestFabricCh0Health diagnostic tests are failing.

Conditions:
Symptom is observed on ES+ line cards of C7600 Series Routers after SSO with multicast traffic flowing through the line card.

Workaround:
Soft reload the line card using the hw-module module module reset exec command.

========================================
CSCtq07626
ES+: DEV_SELENE XAUI_LEN, FIFO_FULL, XAUI_GNT and XAUI_MIN errors
----------------------------------------
Symptom:
Errors detected by the Selene ASIC:
%DEV_SELENE-DFC1-3-XAUI_LEN
%DEV_SELENE-DFC1-3-FIFO_FULL
%DEV_SELENE-DFC1-3-XAUI_GNT
%DEV_SELENE-DFC1-3-XAUI_MIN

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers.

Workaround:
None.

Further Problem Description:
Listed error types are not HW failures.
Instead of being reported through error messages, occurrence of these errors can be tracked through the CLI: remote command module module show platform hardware drops.

========================================
CSCtr37182
ES+: single occurrence of DEV_SELENE XAUI_CODE error
----------------------------------------
Symptoms:
Single occurrence of XAUI_CODE and XAUI_RX_RDY messages in the syslog:
%DEV_SELENE-DFC1-3-XAUI_CODE: Selene 1 XAUI 1 Coding Error
%DEV_SELENE-DFC1-3-XAUI_RX_RDY: Selene 1 XAUI 1 Rx Rdy changed state

Conditions:
This symptom is observed on ES+ linecards of Cisco 7600 series routers.

Workaround:
There is no workaround.

Further Problem Description:
A single occurrence of this error can safely be ignored.

========================================
CSCtr74529
ES+: LONGBUSYREAD: C2W Interface busy for long time reading temp sensor
----------------------------------------
Symptoms:
%ENVM-4-LONGBUSYREAD: C2W Interface busy for long time reading temperature sensor

Conditions:
Observed on the ES+ linecard of Cisco 7600 Series Routers.

Workaround:
There is no workaround.

========================================
CSCtr74953
ES+: Watchdog resets fail to write crashinfo, causing Keep Alive failure
----------------------------------------
Symptom:
%OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled off (Module not responding to Keep Alive polling)
%C7600_PWR-SP-4-DISABLED: power to module in slot 1 set off (Module not responding to Keep Alive polling)
There is no crashinfo file created.

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers. This bug is specific to a condition where no other explanations exist for the failure of Keep Alive polling.

Workaround:
There is no workaround.

Further Problem Description:
This fix does not prevent the line card crash, but it prevents the silent crash. This fix ensures that a crashinfo will be written on the ES+ line card flash disk.
It also ensures that the line card is reset as soon as the error condition is detected, as opposed to waiting for a Keep Alive failure.

========================================
CSCts25729
ES+: PCI read hang causes Keep Alive failure, fails to write crashinfo
----------------------------------------
Symptom:
%OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled off (Module not responding to Keep Alive polling)
%C7600_PWR-SP-4-DISABLED: power to module in slot 1 set off (Module not responding to Keep Alive polling)
There is no crashinfo file created.

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers. This bug is specific to a condition where no other explanations exist for the failure of Keep Alive polling.

Workaround:
There is no workaround.

Further Problem Description:
This fix does not prevent the line card crash, but it prevents the silent crash. This fix ensures that a crashinfo and mini crashinfo will be written on the ES+ line card flash disk. It also ensures that the line card is reset as soon as the error condition is detected, as opposed to waiting for a Keep Alive failure.

========================================
CSCtt13344
ES+: Ingress traffic will not pass with > 7091 bytes packet size
----------------------------------------
Symptom:
Traffic will not pass with greater than 7091 byte packet size.

Conditions:
When MTU is set greater than 7091, sending packet size with > 7092 bytes may hit the issue.
There is no specific trigger for this. But when the issue is hit, the last byte of the ifdma_status register reads "C0". From the ES+, run the command below to read the ifdma_status register:

show platform hardware npc 1 register all | i ifdma
1 ifdma_config 297 0x34010000
1 ifdma_counter_base 298 0x00000001
1 ifdma_frame_length 299 0x3DC0003C
1 ifdma_buffer_recycle 300 0x00F300CE
1 ifdma_enable 301 0xFFFFFFFF
1 ifdma_status 302 0x000200C0 <<<<

Where npc "1" is the NP number.

Workaround:
Fix is to disable the buffer recovery mechanism.

========================================
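The ifdma_status check described for CSCtt13344 can be sketched in Python as follows. This is a minimal illustration, not a Cisco tool: the function names are hypothetical, and it assumes the CLI output has already been captured as text, one register per line, in the layout shown in the example above (name, register index, 32-bit hex value).

```python
import re
from typing import List

# Assumed line layout, taken from the CSCtt13344 example output:
#   "1 ifdma_status 302 0x000200C0 <<<<"
_STATUS_RE = re.compile(r"\bifdma_status\s+\d+\s+0x([0-9A-Fa-f]{8})")


def ifdma_status_values(cli_output: str) -> List[int]:
    """Return every ifdma_status register value found in the captured text."""
    return [int(m.group(1), 16) for m in _STATUS_RE.finditer(cli_output)]


def last_byte_is_c0(cli_output: str) -> bool:
    """True if any ifdma_status value has 0xC0 in its least significant byte."""
    return any((value & 0xFF) == 0xC0 for value in ifdma_status_values(cli_output))


# Sample text copied from the example output in the bug description.
SAMPLE = """\
1 ifdma_config 297 0x34010000
1 ifdma_counter_base 298 0x00000001
1 ifdma_frame_length 299 0x3DC0003C
1 ifdma_buffer_recycle 300 0x00F300CE
1 ifdma_enable 301 0xFFFFFFFF
1 ifdma_status 302 0x000200C0 <<<<
"""

if __name__ == "__main__":
    print(last_byte_is_c0(SAMPLE))
```

Only the low byte is compared, since the bug description refers to the last byte of the register; the meaning of the remaining bits is not stated in this guide and is not interpreted here.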