Oracle T3-1 SP fault

Moderator: cah

cah
General of the Army / Fleet Admiral / General of the Air Force
Posts: 1342
Joined: Sun Aug 17, 2008 5:05 am

Oracle T3-1 SP fault

Post by cah »

We got an SP (service processor) fault on a T3-1 server.
The vendor, NCE, suggested replacing the motherboard (MB).

After the first MB replacement, the server couldn't power up at all; the voltage was out of the normal range. This was on Thursday (05/19/2016).
The second MB replacement occurred on Monday (05/23/2016). The server could power up but couldn't see the disks.
The third attempt was yesterday (05/24/2016), with the same result. Finally, NCE support figured out that the volume needs to be activated before probe-scsi-all can find the device. Then the boot device needs to point to the right path and auto-boot? needs to be set to true.

First, select the SCSI controller.

Code: Select all

{0} ok select /pci@400/pci@1/pci@0/pci@4/scsi@0
Then, run the show-volumes command to find the inactive volume:

Code: Select all

{0} ok show-volumes
Volume 0 Target 389  Type RAID1 (Mirroring)
  Name root_volume  WWID 0c7892682c764144
  Optimal  Enabled  Inactive
  2 Members                                         583983104 Blocks, 298 GB
  Disk 1
    Primary  Optimal
    Target 9      HITACHI  H103030SCSUN300G A2A8   PhyNum 0
  Disk 0
    Secondary  Optimal
    Target a      HITACHI  H103030SCSUN300G A2A8   PhyNum 2
Activate the volume using the volume number:

Code: Select all

ok 0 activate-volume
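When the show-volumes output is captured from the console log, the inactive volume can also be picked out with a small script. This is only a rough sketch: the function names are made up for illustration, and it assumes the output has the same layout as the sample above (a "Volume N Target ..." line, then a Name line, then a status line that contains the word Inactive).

```python
import re

# Sample text copied from the show-volumes output in this post.
SHOW_VOLUMES = """\
Volume 0 Target 389  Type RAID1 (Mirroring)
  Name root_volume  WWID 0c7892682c764144
  Optimal  Enabled  Inactive
  2 Members                                         583983104 Blocks, 298 GB
"""


def inactive_volumes(text):
    """Return (volume_number, name) for every volume marked Inactive."""
    found = []
    number = name = None
    for line in text.splitlines():
        head = re.match(r"Volume (\d+) Target", line)
        if head:
            # A new volume section starts here.
            number, name = int(head.group(1)), None
            continue
        named = re.match(r"\s+Name (\S+)\s+WWID", line)
        if named:
            name = named.group(1)
            continue
        if number is not None and "Inactive" in line.split():
            found.append((number, name))
            number = None  # do not count the same volume twice
    return found


def activation_commands(text):
    """Build the OBP commands that would activate each inactive volume."""
    return [f"{number} activate-volume" for number, _ in inactive_volumes(text)]


print(activation_commands(SHOW_VOLUMES))  # ['0 activate-volume']
```

The script only prints the commands; they still have to be typed at the ok prompt after selecting the controller.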
Probe all SCSI devices

Code: Select all

{0} ok probe-scsi-all
/pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/hub@3/storage@2
  Unit 0   Removable Read Only device    AMI     Virtual CDROM   1.00

/pci@400/pci@2/pci@0/pci@4/scsi@0

FCode Version 1.00.62, MPT Version 2.00, Firmware Version 5.00.17.00

Target a
  Unit 0   Removable Read Only device   TSSTcorp CDDVDW TS-T633A  SR00
  SATA device  PhyNum 6

/pci@400/pci@1/pci@0/pci@4/scsi@0

FCode Version 1.00.62, MPT Version 2.00, Firmware Version 5.00.17.00

Target 389 Volume 0
  Unit 0   Disk   LSI      Logical Volume   3000    583983104 Blocks, 298 GB
  VolumeDeviceName 3c7892682c764144  VolumeWWID 0c7892682c764144
Set up boot disk path

Code: Select all

ok nvramrc=devalias root-volume /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@389 (old OBP syntax)
{0} ok nvalias root-volume /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@389 (new OBP syntax)
ok setenv boot-device root-volume
Set up auto-boot

Code: Select all

ok setenv auto-boot? true
auto-boot? =            true
Boot up the server

Code: Select all

{0} ok boot
or 
{0} ok boot root-volume
The server finally booted up with the right disk.
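For next time, the boot settings can be derived from the probe-scsi-all output instead of being typed from memory. The sketch below is a hypothetical helper, not a vendor tool: it assumes the output format shown above, where a controller path line is followed by a "Target <t> Volume <n>" line, and it simply prints the same nvalias and setenv commands used in this post.

```python
import re

# Shortened sample taken from the probe-scsi-all output in this post.
PROBE = """\
/pci@400/pci@2/pci@0/pci@4/scsi@0

Target a
  Unit 0   Removable Read Only device   TSSTcorp CDDVDW TS-T633A  SR00

/pci@400/pci@1/pci@0/pci@4/scsi@0

Target 389 Volume 0
  Unit 0   Disk   LSI      Logical Volume   3000    583983104 Blocks, 298 GB
"""


def volume_devices(probe_output):
    """Yield (controller_path, target) for each RAID volume that was found."""
    controller = None
    for raw in probe_output.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            controller = line  # remember the most recent controller path
            continue
        match = re.match(r"Target (\w+) Volume \d+", line)
        if match and controller:
            yield controller, match.group(1)


def boot_setup_commands(controller, target, alias="root-volume"):
    """Return the OBP commands from this post for one volume."""
    device = f"{controller}/disk@{target}"
    return [
        f"nvalias {alias} {device}",
        f"setenv boot-device {alias}",
        "setenv auto-boot? true",
    ]


for controller, target in volume_devices(PROBE):
    for command in boot_setup_commands(controller, target):
        print(command)
```

With the sample above, the first printed line is the same nvalias command I used, pointing at disk@389 on the pci@1 controller.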

However, fmadm faulty shows the old fault message again today!

Code: Select all

appzone01:/%fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 26 2013     7e85742b-44d6-616a-ef3e-90c3471ee4d3  SUN4V-8002-US  Critical  

Host        : appzone01
Platform    : sun4v     Chassis_id  : 1047BDR269
Product_sn  : 1047BDR269

Fault class : fault.sp.failed
Problem in  : "/SYS/MB/SP" (hc://:product-id=sun4v:product-sn=1047BDR269:server-id=appzone01:chassis-id=1047BDR269/chassis=0/sp=0)
                  faulted but still in service
FRU         : "/SYS/MB/SP" (hc://:product-id=sun4v:product-sn=1047BDR269:server-id=appzone01:chassis-id=1047BDR269/chassis=0/sp=0)
                  faulty

Description : The Service Processor failed.

Response    : No automated response.

Impact      : Some services such as Fault Diagnosis may be degraded as a
              result.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://sun.com/msg/SUN4V-8002-US for the latest service
              procedures and policies regarding this diagnosis.
CAH, The Great
cah
General of the Army / Fleet Admiral / General of the Air Force
Posts: 1342
Joined: Sun Aug 17, 2008 5:05 am

Repairing ILOM faults

Post by cah »

We see an amber light on the T3-1 and T3-2, and we can find the faults by running the “fmadm faulty” command.

I ran the “fmadm repaired” command shown below to “repair” the fault, and that cleared the amber light too.

Code: Select all

orazone01:/%fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 06 20:25:24 9366375a-cc70-609c-8e5f-c7d4fa3d26a1  SUN4V-8002-US  Critical  

Host        : orazone01
Platform    : ORCL,SPARC-T3-2   Chassis_id  : 
Product_sn  : 

Fault class : fault.sp.failed
Problem in  : "/SYS/MB/SP" (hc://:product-id=ORCL,SPARC-T3-2:product-sn=1047BDR246:server-id=orazone01:chassis-id=1047BDR246/chassis=0/sp=0)
                  faulted but still in service
FRU         : "/SYS/MB/SP" (hc://:product-id=ORCL,SPARC-T3-2:product-sn=1047BDR246:server-id=orazone01:chassis-id=1047BDR246/chassis=0/sp=0)
                  faulty

Description : The Service Processor failed.
              Refer to http://sun.com/msg/SUN4V-8002-US for more information.

Response    : No automated response.

Impact      : Some services such as Fault Diagnosis may be degraded as a
              result.

Action      : Schedule a repair procedure for the Service Processor, or contact
              Sun for support.

orazone01:/%fmadm repaired /SYS/MB/SP
fmadm: recorded repair to of /SYS/MB/SP
orazone01:/%fmadm faulty             
orazone01:/% 
The fault is gone (maybe only temporarily) and the amber light went off.

I checked after one day and the faults didn't come back. Let's see how long it can last.
I checked again on 05/31/2016 (5 days later), still no faults.
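To avoid checking by hand every few days, the summary lines of "fmadm faulty" can be parsed from saved output. The following is a minimal sketch and makes some assumptions: the function name is hypothetical, and it relies on the summary line layout seen above (time, EVENT-ID in UUID form, MSG-ID, severity). An empty result means no faults were listed.

```python
import re

# Summary line layout assumed from the fmadm faulty output in this thread.
UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
SUMMARY = re.compile(
    rf"^(?P<time>.+?)\s+(?P<event_id>{UUID})\s+(?P<msg_id>\S+)\s+(?P<severity>\S+)\s*$"
)

# Sample copied from the orazone01 output above.
FMADM_OUTPUT = """\
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 06 20:25:24 9366375a-cc70-609c-8e5f-c7d4fa3d26a1  SUN4V-8002-US  Critical  

Fault class : fault.sp.failed
"""


def parse_faults(output):
    """Return one dict per fault summary line; an empty list means no faults."""
    faults = []
    for line in output.splitlines():
        match = SUMMARY.match(line)
        if match:
            faults.append(match.groupdict())
    return faults


for fault in parse_faults(FMADM_OUTPUT):
    print(fault["time"], fault["msg_id"], fault["severity"])
```

Feeding it the empty output from after the repair gives an empty list, which is the "still no faults" case.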
CAH, The Great
cah
General of the Army / Fleet Admiral / General of the Air Force
Posts: 1342
Joined: Sun Aug 17, 2008 5:05 am

Another MB replacement and issues for appzone01 (T3-1)

Post by cah »

After the MB replacement on 07/02/2019, the firmware was not recent enough for the chassis to work. I had to get patch 152738.02, which contains the firmware (Sun_System_Firmware-8_3_40_a-SPARC_T3-1.pkg).

The ILOM web GUI had Enter Update Mode greyed out, so I was unable to update the firmware through the web GUI.
Therefore, I set up a TFTP server (SolarWinds) on my computer, placed Sun_System_Firmware-8_3_40_a-SPARC_T3-1.pkg in its root (C:\TFTP-Root), and started the TFTP server. Then I was able to load the firmware from the TFTP server.

Code: Select all

-> load -source tftp://10.125.81.24/Sun_System_Firmware-8_3_40_a-SPARC_T3-1.pkg

NOTE: An upgrade takes several minutes to complete. ILOM
      will enter a special mode to load new firmware. No
      other tasks can be performed in ILOM until the
      firmware upgrade is complete and ILOM is reset.

Are you sure you want to load the specified file (y/n)? y
Preserve existing configuration (y/n)? y
.........................................................................................................................

Firmware update is complete.
ILOM will now be restarted with the new firmware.


Unrecognized Chassis: This module is installed in an unknown or unsupported
chassis. You must upgrade the firmware to a newer version that supports
this chassis.

-> /sbin/reboot
After the reboot, I was able to access ILOM via either CLI or web GUI! Yah!

However, the server couldn't boot because the boot-device setting was gone after the MB replacement. I had to dig out the old procedure (this BBS thread) and add a few steps to get it set up.

Force it to the ok prompt from ILOM:

Code: Select all

-> set /HOST send_break_action=break

Press Enter.

Then type:

-> start /SP/console

{0} ok
'devalias' and 'show-disks' didn't help much.

Code: Select all

{0} ok devalias
screen                   /pci@400/pci@2/pci@0/pci@0/pci@0/display@0
mouse                    /pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/device@4/mouse@1
rcdrom                   /pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/hub@3/storage@2/disk@0
rkeyboard                /pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/device@4/keyboard@0
rscreen                  /pci@400/pci@2/pci@0/pci@0/pci@0/display@0:r1280x1024x60
net3                     /pci@400/pci@2/pci@0/pci@7/network@0,1
net2                     /pci@400/pci@2/pci@0/pci@7/network@0
net1                     /pci@400/pci@2/pci@0/pci@6/network@0,1
net0                     /pci@400/pci@2/pci@0/pci@6/network@0
net                      /pci@400/pci@2/pci@0/pci@6/network@0
disk7                    /pci@400/pci@2/pci@0/pci@4/scsi@0/disk@p3
disk6                    /pci@400/pci@2/pci@0/pci@4/scsi@0/disk@p2
disk5                    /pci@400/pci@2/pci@0/pci@4/scsi@0/disk@p1
disk4                    /pci@400/pci@2/pci@0/pci@4/scsi@0/disk@p0
cdrom                    /pci@400/pci@2/pci@0/pci@4/scsi@0/disk@p6
scsi1                    /pci@400/pci@2/pci@0/pci@4/scsi@0
disk3                    /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@p3
disk2                    /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@p2
disk1                    /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@p1
disk0                    /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@p0
disk                     /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@p0
scsi0                    /pci@400/pci@1/pci@0/pci@4/scsi@0
scsi                     /pci@400/pci@1/pci@0/pci@4/scsi@0
virtual-console          /virtual-devices@100/console@1
name                     aliases


{0} ok show-disks
a) /pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/hub@3/storage@2/disk
b) /pci@400/pci@2/pci@0/pci@4/scsi@0/disk
c) /pci@400/pci@1/pci@0/pci@4/scsi@0/disk
d) /iscsi-hba/disk
q) NO SELECTION
Enter Selection, q to quit: q
I had to select the SCSI controller first before I could follow the old procedure.

Code: Select all

{0} ok select /pci@400/pci@1/pci@0/pci@4/scsi@0
{0} ok show-volumes
Volume 0 Target 389  Type RAID1 (Mirroring)
  Name root_volume  WWID 0c7892682c764144
  Optimal  Enabled  Inactive
  2 Members                                         583983104 Blocks, 298 GB
  Disk 1
    Primary  Optimal
    Target 9      HITACHI  H103030SCSUN300G A2A8   PhyNum 0
  Disk 0
    Secondary  Optimal
    Target a      HITACHI  H103030SCSUN300G A2A8   PhyNum 2
{0} ok 0 activate-volume
Volume 0 is now activated
{0} ok probe-scsi-all
/pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/hub@3/storage@2
  Unit 0   Removable Read Only device    AMI     Virtual CDROM   1.00

/pci@400/pci@2/pci@0/pci@4/scsi@0

FCode Version 1.00.62, MPT Version 2.00, Firmware Version 5.00.17.00

Target a
  Unit 0   Removable Read Only device   TSSTcorp CDDVDW TS-T633A  SR00
  SATA device  PhyNum 6

/pci@400/pci@1/pci@0/pci@4/scsi@0

FCode Version 1.00.62, MPT Version 2.00, Firmware Version 5.00.17.00

Target 389 Volume 0
  Unit 0   Disk   LSI      Logical Volume   3000    583983104 Blocks, 298 GB
  VolumeDeviceName 3c7892682c764144  VolumeWWID 0c7892682c764144

{0} ok nvalias root-volume /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@389
{0} ok setenv boot-device root-volume
boot-device =           root-volume
{0} ok setenv auto-boot? true
auto-boot? =            true
Then, booting from root-volume succeeded.

Code: Select all

{0} ok boot
or
{0} ok boot root-volume
After booting it up, the following message showed up on the console:

Code: Select all

appzone01 console login:
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: 20
PLATFORM: ORCL,SPARC-T3-1, CSN: -, HOSTNAME: appzone01
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 6237cda6-1eaf-4ddd-8ea9-aea55c991d35
DESC: A ZFS device failed.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' for more information. Please refer to the associated reference document at http://sun.com/msg/ZFS-8000-D3 for the latest service procedures and policies regarding this diagnosis.

SUNW-MSG-ID: ZFS-8000-CS, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: 20
PLATFORM: ORCL,SPARC-T3-1, CSN: -, HOSTNAME: appzone01
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: ac40b742-21fe-420e-ce2f-a40e6341105b
DESC: A ZFS pool failed to open.
AUTO-RESPONSE: No automated response will occur.
IMPACT: The pool data is unavailable
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -x' and attach any missing devices, follow any provided recovery instructions or restore from backup. Please refer to the associated reference document at http://sun.com/msg/ZFS-8000-CS for the latest service procedures and policies regarding this diagnosis.
I logged in and ran the suggested command and found:

Code: Select all

appzone01:/export/home/hsiaoc1%zpool status -x
  pool: zonepool
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        zonepool                 UNAVAIL      0     0     0  insufficient replicas
          c2t38DA0EA08017CB7Ad0  UNAVAIL      0     0     0  cannot open

appzone01:/export/home/hsiaoc1%zpool list           
NAME      SIZE  ALLOC  FREE  CAP   HEALTH  ALTROOT
rpool     276G  23.8G  252G   8%   ONLINE  -
zonepool     -      -     -    -  FAULTED  -
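The zpool status -x output above can be reduced to just the entries that are not ONLINE. This is a small sketch with a hypothetical function name; it assumes the config section looks like the sample (NAME, STATE, three counters, then an optional reason) and that section labels such as "pool:" and "config:" end with a colon.

```python
# Sample copied from the zpool status -x output above.
ZPOOL_STATUS = """\
  pool: zonepool
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        zonepool                 UNAVAIL      0     0     0  insufficient replicas
          c2t38DA0EA08017CB7Ad0  UNAVAIL      0     0     0  cannot open
"""


def unhealthy_entries(status_output):
    """Return (name, state, reason) for config rows that are not ONLINE."""
    entries = []
    in_config = False
    for line in status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].endswith(":"):
            # Section labels (pool:, state:, config:, errors:, ...).
            in_config = fields[0] == "config:"
            continue
        if not in_config or fields[0] == "NAME" or len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        if state != "ONLINE":
            entries.append((name, state, " ".join(fields[5:])))
    return entries


for name, state, reason in unhealthy_entries(ZPOOL_STATUS):
    print(name, state, reason)
```

For the sample, this lists the pool itself and the single device that could not be opened.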
The device "c2t38DA0EA08017CB7Ad0" became unavailable. That's the device for /zonepool. Why did it disappear? I don't know yet. Maybe the tech didn't reconnect all the SCSI cables to the controller when he replaced the MB. Next Monday (07/08/2019) I will need to look at the server in the data center and check whether the lights on the disks are on or off.

On July 8, 2019, I opened up the server and reseated all the cables, but still no luck.
So, I pulled disks #4 and #6 out and booted up again.
I then put these 2 disks into slots #1 and #3 and ran probe-scsi-all again. This time, I selected the SCSI controller by its full path, /pci@400/pci@1/pci@0/pci@4/scsi@0:

Code: Select all

{0} ok select /pci@400/pci@1/pci@0/pci@4/scsi@0
{0} ok show-children

FCode Version 1.00.62, MPT Version 2.00, Firmware Version 5.00.17.00

Target 389 Volume 1
  Unit 0   Disk   LSI      Logical Volume   3000    583983104 Blocks, 298 GB
  VolumeDeviceName 3c7892682c764144  VolumeWWID 0c7892682c764144
Still shows one child.

Code: Select all

{0} ok show-volumes
Volume 0 Target 388  Type RAID1 (Mirroring)
  Name app_volume  WWID 08da0ea08017cb7a
  Optimal  Enabled  Inactive
  2 Members                                         583983104 Blocks, 298 GB
  Disk 3
    Primary  Optimal
    Target c      HITACHI  H103030SCSUN300G A2A8   PhyNum 1
  Disk 2
    Secondary  Optimal
    Target a      HITACHI  H103030SCSUN300G A2A8   PhyNum 3
Volume 1 Target 389  Type RAID1 (Mirroring)
  Name root_volume  WWID 0c7892682c764144
  Optimal  Enabled
  2 Members                                         583983104 Blocks, 298 GB
  Disk 1
    Primary  Optimal
    Target 9      HITACHI  H103030SCSUN300G A2A8   PhyNum 0
  Disk 0
    Secondary  Optimal
    Target b      HITACHI  H103030SCSUN300G A2A8   PhyNum 2
It shows 2 volumes!!!
The boot volume became volume 1. The 2 disks I put into #1 and #3 became volume 0, which showed as Inactive, so I then activated volume 0. PhyNum is the slot number on the chassis.
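One way to confirm that app_volume is the missing zonepool device is to compare its WWID with the Solaris device name. In the probe-scsi-all output earlier, the root volume has VolumeWWID 0c7892682c764144 and VolumeDeviceName 3c7892682c764144, so the two differ only in the first hex digit. The sketch below assumes that same 0-to-3 pattern holds for other volumes, which is only an observation from this one server, and the function names are made up for illustration.

```python
import re


def wwid_to_device_name(wwid):
    """Assumed mapping, based on the root volume: replace the first digit with 3."""
    return "3" + wwid[1:].lower()


def ctd_target(ctd):
    """Extract the target part of a cNtXdN device name, in lowercase."""
    match = re.fullmatch(r"c\d+t([0-9A-Fa-f]+)d\d+(?:s\d+)?", ctd)
    return match.group(1).lower() if match else None


def find_volume(ctd, volumes):
    """Return the name of the volume whose derived device name matches ctd."""
    target = ctd_target(ctd)
    for name, wwid in volumes.items():
        if wwid_to_device_name(wwid) == target:
            return name
    return None


# Names and WWIDs copied from the show-volumes output above.
VOLUMES = {
    "app_volume": "08da0ea08017cb7a",
    "root_volume": "0c7892682c764144",
}

print(find_volume("c2t38DA0EA08017CB7Ad0", VOLUMES))  # app_volume
```

Under that assumption, the device that zpool could not open lines up with app_volume, which fits with it being the inactive volume here.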

Code: Select all

{0} ok 0 activate-volume
Volume 0 is now activated
{0} ok show-volumes
Volume 0 Target 388  Type RAID1 (Mirroring)
  Name app_volume  WWID 08da0ea08017cb7a
  Degraded  Enabled  Resync In Progress
  2 Members                                         583983104 Blocks, 298 GB
  Disk 3
    Primary  Optimal
    Target c      HITACHI  H103030SCSUN300G A2A8   PhyNum 1
  Disk 2
    Secondary  Rebuilding  Out Of Sync
    Target a      HITACHI  H103030SCSUN300G A2A8   PhyNum 3
Volume 1 Target 389  Type RAID1 (Mirroring)
  Name root_volume  WWID 0c7892682c764144
  Optimal  Enabled
  2 Members                                         583983104 Blocks, 298 GB
  Disk 1
    Primary  Optimal
    Target 9      HITACHI  H103030SCSUN300G A2A8   PhyNum 0
  Disk 0
    Secondary  Optimal
    Target b      HITACHI  H103030SCSUN300G A2A8   PhyNum 2
Boot into OS

Code: Select all

{0} ok boot
Boot device: root-volume  File and args:
SunOS Release 5.10 Version Generic_150400-30 64-bit
Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.
Failed to configure IPv4 interface(s): igb1
Hostname: appzone01

appzone01 console login:
SUNW-MSG-ID: FMD-8000-4M, TYPE: Repair, VER: 1, SEVERITY: Minor
EVENT-TIME: 20
PLATFORM: ORCL,SPARC-T3-1, CSN: -, HOSTNAME: appzone01
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 6237cda6-1eaf-4ddd-8ea9-aea55c991d35
DESC: All faults associated with an event id have been addressed.
AUTO-RESPONSE: Some system components offlined because of the original fault may have been brought back online.
IMPACT: Performance degradation of the system due to the original fault may have been recovered.
REC-ACTION: No action is required.

SUNW-MSG-ID: FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
EVENT-TIME: 20
PLATFORM: ORCL,SPARC-T3-1, CSN: -, HOSTNAME: appzone01
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 6237cda6-1eaf-4ddd-8ea9-aea55c991d35
DESC: All faults associated with an event id have been addressed.
AUTO-RESPONSE: All system components offlined because of the original fault have been brought back online.
IMPACT: Performance degradation of the system due to the original fault has been recovered.
REC-ACTION: No action is required.
CAH, The Great