
T3-2 memory upgrade failure and solution

Posted: Tue Jun 04, 2013 3:00 pm
by cah
I was helping the CareLink team upgrade memory on both T3-1 and T3-2 this morning at 7.

T3-1 came back fine with 128 GB as expected. However, T3-2 came back with just 64 GB, even after an additional 64 GB had been installed.

I couldn't figure out the reason, so I asked Mats to double-check the memory DIMMs. He said everything was seated properly and there was no memory fault. So I reset the machine, hoping to find clues in the POST output.

Here is the output from POST:

Code: Select all

Serial console started.  To stop, type #.
[CPU 0:0:0] NOTICE:  Checking Flash File System
[CPU 0:0:0] NOTICE:  Initializing TOD: 2013/06/04 16:03:01
[CPU 0:0:0] NOTICE:  Loaded ASR status DB data. Ver. 3.
[CPU 0:0:0] NOTICE:  Initializing TPM with:
                        tpm_enable = false
                        tpm_activate = false
                        tpm_forceclear = false
[CPU 0:0:0] NOTICE:  TPM found: Ver 1.2, Rev 1.2, SpecLevel 2, errataRev 0, VendorId 'IFX'
[CPU 0:0:0] NOTICE:  TPM initialized successfully. Current state is: disabled
[CPU 0:0:0] NOTICE:  Version:   003e002821030607
[CPU 1:0:0] NOTICE:  Version:   003e002821030607
[CPU 0:0:0] NOTICE:  Serial#:   0000000000000000.00090280281bd76a
[CPU 1:0:0] NOTICE:  Serial#:   0000000000000000.000a0280281bd6a6
[CPU 1:0:0] NOTICE:  /SYS/MB/CMP1/MCU1 is disabled
[CPU 0:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 1:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 0:0:0] NOTICE:  MCU1: Memory Capacity is 32GB
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/MCU1: Unusable
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T1: Not configured
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T3: Not configured
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T5: Not configured
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T7: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T1: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T3: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T5: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T7: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T0: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T2: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T4: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T6: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T0: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T2: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T4: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T6: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/MCU0: Not configured
[CPU 0:0:0] ERROR:   8 Cores cannot be configured due to degraded L2T configuration
[CPU 0:0:0] NOTICE:  Usable strands:    0000000000000000.ffffffffffffffff
[CPU 0:0:0] NOTICE:  System memory capacity is 64GB
[CPU 1:0:0] ERROR:   8 Cores cannot be configured due to degraded L2T configuration
[CPU 1:0:0] NOTICE:  Usable strands:    0000000000000000.ffffffffffffffff
[CPU 0:0:0] NOTICE:  Clocks: CMP: 1649 MHz DRAM: 533 MHz (6.4 Gbps) CL: 1466 MHz (8.8 Gbps)
[CPU 1:0:0] NOTICE:  Clocks: CMP: 1649 MHz DRAM: 533 MHz (6.4 Gbps) CL: 1466 MHz (8.8 Gbps)
[CPU 1:0:0] NOTICE:  Initializing TSR Hoovers
[CPU 0:0:0] NOTICE:  Initializing TSR Hoovers
[CPU 1:0:0] NOTICE:  Initializing FSR Hoovers
[CPU 0:0:0] NOTICE:  Initializing FSR Hoovers
[CPU 1:0:0] NOTICE:  Initializing LFU serdes 0
[CPU 0:0:0] NOTICE:  Initializing LFU serdes 1
[CPU 1:0:0] NOTICE:  Initializing LFU serdes 2
[CPU 0:0:0] NOTICE:  Initializing LFU serdes 3
[CPU 1:0:0] NOTICE:  Initializing MCU 0 serdes
[CPU 0:0:0] NOTICE:  Initializing MCU 1 serdes
[CPU 0:0:0] NOTICE:  Updating Config Information for Guest Manager
[CPU 1:0:0] NOTICE:  Starting MBIST
[CPU 0:0:0] NOTICE:  Starting MBIST
[CPU 0:0:0] NOTICE:  Issuing Host warm Reset
[CPU 1:0:0] NOTICE:  Starting MBISI
[CPU 0:0:0] NOTICE:  Starting MBISI
[CPU 0:0:0] NOTICE:  Issuing Host warm Reset
[CPU 1:0:0] WARNING: Partial cache mode, Running configuration code from ROM
[CPU 0:0:0] WARNING: Partial cache mode, Running configuration code from ROM
[CPU 1:0:0] NOTICE:  Initializing COU Regs
[CPU 0:0:0] NOTICE:  Initializing COU Regs
[CPU 1:0:0] NOTICE:  Initializing MCU 0
[CPU 0:0:0] NOTICE:  Initializing MCU 1
[CPU 1:0:0] NOTICE:  SMI Channel 0, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 0, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] NOTICE:  SMI Channel 0, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 0, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] NOTICE:  SMI Channel 1, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 1, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] NOTICE:  SMI Channel 1, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 1, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] WARNING: Partial cache mode, Disabling DDR3 Clock Delay Training
[CPU 0:0:0] WARNING: Partial cache mode, Disabling DDR3 Clock Delay Training
[CPU 0:0:0] WARNING: Partial cache mode, Disabling DDR3 Read/Write DQ-DQS cleanup
[CPU 1:0:0] WARNING: Partial cache mode, Disabling DDR3 Read/Write DQ-DQS cleanup
[CPU 1:0:0] NOTICE:  Initializing LFU Links
[CPU 0:0:0] NOTICE:  Initializing LFU Links
0:0:0>
0:0:0>POST 4.34.2 2012/10/04 12:58  
0:0:0>          
0:0:0>Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
0:0:0>POST enabling CMP 0 threads: 00000000.00000000.ffffffff.ffffffff 
0:0:0>POST enabling CMP 1 threads: 00000000.00000000.ffffffff.ffffffff 
0:0:0>Diag mode      : 1 [Normal] 
0:0:0>Diag level     : 1 [Max]    
0:0:0>Diag verbosity : 2 [Normal]
0:0:0>Test Memory....Done
0:0:0>Setup POST Mailbox ....Done
0:0:0>Master CPU Tests Basic....Done
0:0:0>Init MMU.....
0:0:0>Setup POST Mailbox ....Done
1:0:0>NODE 1 present
0:0:0>Extended CPU Tests....Done
0:0:0>L2 Tests....Done
0:0:0>Scrub Memory....Done
0:0:0>Functional CPU Tests....Done
0:0:0>Extended Memory Tests....Done
0:0:0>SPU CWQ Tests...Done
0:0:0>MAU Tests...Done
0:0:0>IOS register tests....Done
0:0:0>Network Interface Unit Port 0 Tests ..Done
0:0:0>Network Interface Unit Port 1 Tests ..Done
2013-06-04 16:22:05.810 0:0:0>INFO:
2013-06-04 16:22:05.823 0:0:0>  POST Passed all devices.
2013-06-04 16:22:05.842 0:0:0>POST:     Return to Host Config.
[CPU 0:0:0] NOTICE:  Reconfiguring System
[CPU 1:0:0] NOTICE:  Reconfiguring System
[CPU 0:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 1:0:0] NOTICE:  /SYS/MB/CMP1/MCU1 is disabled
[CPU 1:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 0:0:0] NOTICE:  MCU1: Memory Capacity is 32GB
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/MCU1: Unusable
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T1: Not configured
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T3: Not configured
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T5: Not configured
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T7: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T1: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T3: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T5: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP1/L2T7: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T0: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T2: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T4: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T6: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T0: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T2: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T4: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/L2T6: Unusable
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/MCU0: Not configured
[CPU 0:0:0] ERROR:   8 Cores cannot be configured due to degraded L2T configuration
[CPU 0:0:0] NOTICE:  Usable strands:    0000000000000000.ffffffffffffffff
[CPU 0:0:0] NOTICE:  System memory capacity is 64GB
[CPU 1:0:0] ERROR:   8 Cores cannot be configured due to degraded L2T configuration
[CPU 1:0:0] NOTICE:  Usable strands:    0000000000000000.ffffffffffffffff
[CPU 1:0:0] NOTICE:  Starting MBISI
[CPU 0:0:0] NOTICE:  Starting MBISI
[CPU 0:0:0] NOTICE:  Issuing Host warm Reset
[CPU 1:0:0] WARNING: Partial cache mode, Running configuration code from ROM
[CPU 0:0:0] WARNING: Partial cache mode, Running configuration code from ROM
[CPU 1:0:0] NOTICE:  Initializing COU Regs
[CPU 0:0:0] NOTICE:  Initializing COU Regs
[CPU 1:0:0] NOTICE:  Initializing MCU 0
[CPU 0:0:0] NOTICE:  Initializing MCU 1
[CPU 1:0:0] NOTICE:  SMI Channel 0, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 0, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] NOTICE:  SMI Channel 0, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 0, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] NOTICE:  SMI Channel 1, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 1, SB Mapping 0 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] NOTICE:  SMI Channel 1, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 0:0:0] NOTICE:  SMI Channel 1, SB Mapping 1 -- ERRCNT:     0x0     LNERR: 0x0
[CPU 1:0:0] WARNING: Partial cache mode, Disabling DDR3 Clock Delay Training
[CPU 0:0:0] WARNING: Partial cache mode, Disabling DDR3 Clock Delay Training
[CPU 0:0:0] WARNING: Partial cache mode, Disabling DDR3 Read/Write DQ-DQS cleanup
[CPU 1:0:0] WARNING: Partial cache mode, Disabling DDR3 Read/Write DQ-DQS cleanup
[CPU 1:0:0] NOTICE:  Initializing LFU Links
[CPU 0:0:0] NOTICE:  Initializing LFU Links
[CPU 0:0:0] NOTICE:  Copying code to Memory
[CPU 0:0:0] NOTICE:  Running from Memory
[CPU 1:0:0] NOTICE:  Running from Memory
[CPU 0:0:0] NOTICE:  Active strands:    0000000000000000.ffffffffffffffff
[CPU 1:0:0] NOTICE:  Active strands:    0000000000000000.ffffffffffffffff
[CPU 0:0:0] NOTICE:  Configuring MDs
[CPU 0:0:0] NOTICE:  Loading PRI template
[CPU 0:0:0] NOTICE:  Configuring PRI
[CPU 0:0:0] NOTICE:  Product serial number: 1047BDR246
[CPU 0:0:0] NOTICE:  Product part number: 4729237-6
[CPU 0:0:0] NOTICE:  Storing PRI to memory
[CPU 0:0:0] NOTICE:  Booting config = factory-default
[CPU 0:0:0] NOTICE:  Configuring Guest MD 
[CPU 0:0:0] NOTICE:  Storing Guest MD to Memory
[CPU 0:0:0] NOTICE:  Configuring HV MD 
[CPU 0:0:0] NOTICE:  Storing HV MD to Memory
[CPU 0:0:0] NOTICE:  Storing PRI to data flash ("factory-default")
[CPU 0:0:0] NOTICE:  Storing Guest MD to data flash ("factory-default")
[CPU 0:0:0] NOTICE:  Storing HV MD to data flash ("factory-default")
[CPU 0:0:0] NOTICE:  Storing mini MD to data flash ("factory-default")
[CPU 0:0:0] NOTICE:  Updating Config Information for Guest Manager
[CPU 0:0:0] NOTICE:  Jumping to hypervisor
Hypervisor version: @(#)Hypervisor 1.11.2.b 2012/11/02 17:29



SPARC T3-2, No Keyboard
Copyright (c) 1998, 2012, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.34.2.a, 65024 MB memory available, Serial #95468888.
Ethernet address 0:21:28:b0:bd:58, Host ID: 85b0bd58.



ERROR: One or more resources have been retired, please check the SP logs.
The following lines caught my eye:

Code: Select all

[CPU 1:0:0] NOTICE:  /SYS/MB/CMP1/MCU1 is disabled
[CPU 0:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 1:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 0:0:0] NOTICE:  MCU1: Memory Capacity is 32GB
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/MCU1: Unusable
The following also looked strange:

Code: Select all

[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/MCU0: Not configured
[CPU 0:0:0] ERROR:   8 Cores cannot be configured due to degraded L2T configuration
[CPU 0:0:0] NOTICE:  Usable strands:    0000000000000000.ffffffffffffffff
[CPU 0:0:0] NOTICE:  System memory capacity is 64GB
[CPU 1:0:0] ERROR:   8 Cores cannot be configured due to degraded L2T configuration
[CPU 1:0:0] NOTICE:  Usable strands:    0000000000000000.ffffffffffffffff
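The POST capture is long, so for anyone who wants to pull the flagged components out of a similar log, here is a minimal Python sketch. The function names and the regular expression are my own assumptions, written against the line formats shown in the excerpts above; other firmware versions may word these messages differently.

```python
import re

# Matches lines such as:
#   [CPU 1:0:0] ERROR:   /SYS/MB/CMP1/MCU1: Unusable
#   [CPU 1:0:0] NOTICE:  /SYS/MB/CMP1/MCU1 is disabled
# (pattern is an assumption based on the excerpts in this post)
FLAGGED = re.compile(
    r"\]\s+(?:ERROR|NOTICE):\s+(/SYS/\S+?)"
    r"(?::\s+| is )(Unusable|Not configured|disabled)\s*$"
)

def flagged_components(post_text):
    """Map each component path to the sorted list of states reported for it."""
    found = {}
    for line in post_text.splitlines():
        m = FLAGGED.search(line)
        if m:
            found.setdefault(m.group(1), set()).add(m.group(2))
    return {path: sorted(states) for path, states in found.items()}

def memory_controllers(flagged):
    """Keep only the memory controller (MCU) entries."""
    return {p: s for p, s in flagged.items() if re.search(r"/MCU\d+$", p)}

sample = """\
[CPU 1:0:0] NOTICE:  /SYS/MB/CMP1/MCU1 is disabled
[CPU 0:0:0] NOTICE:  MCU0: Memory Capacity is 32GB
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/MCU1: Unusable
[CPU 1:0:0] ERROR:   /SYS/MB/CMP1/L2T1: Not configured
[CPU 0:0:0] ERROR:   /SYS/MB/CMP0/MCU0: Not configured
"""

if __name__ == "__main__":
    for path, states in memory_controllers(flagged_components(sample)).items():
        print(path, ", ".join(states))
```

Filtering down to the MCU paths makes the disabled controller stand out from the many L2T lines.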
/SYS/MB/CMP1/MCU1 appeared to be the problem, so I logged into orazone01-ilom and checked it:

Code: Select all

-> show /SYS/MB/CMP1/MCU1

 /SYS/MB/CMP1/MCU1
    Targets:

    Properties:
        type = Memory Controller
        component_state = Disabled

    Commands:
        cd
        show
By contrast, /SYS/MB/CMP0/MCU0, /SYS/MB/CMP0/MCU1 and /SYS/MB/CMP1/MCU0 all showed "Enabled" as the component_state.

Note that the Commands list in the output above does not include "set".

I found the following link that gave me a hint:

http://docs.oracle.com/cd/E19332-01/E24 ... #scrolltoc

I checked the faulty components first:

Code: Select all

-> show faulty
Target              | Property               | Value                           
--------------------+------------------------+---------------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB                         
/SP/faultmgmt/0/    | class                  | fault.component.disabled        
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | sunw-msg-id            | SPT-8000-HR                     
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | component              | /SYS/MB/CMP1/MCU1               
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | uuid                   | 5e0c6b90-43b1-412d-8747-dc559459
 faults/0           |                        | c703                            
/SP/faultmgmt/0/    | timestamp              | 2013-06-04/07:46:50             
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | detector               | /SYS/MB/CMP1/MCU1               
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | fru_part_number        | 541-4295                        
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | fru_serial_number      | 1005LCB-1047TB00V1              
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | product_serial_number  | 1047BDR246                      
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | chassis_serial_number  | 1047BDR246                      
 faults/0           |                        |                                 
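The table wraps long targets and values onto continuation rows (the uuid above is split across two lines), which makes it awkward to grep. Below is a rough Python sketch that stitches the wrapped rows back together. The function name is hypothetical, and it assumes the three-column layout shown above and that wrapped cells rejoin without a space, which holds for the uuid here but might not for every value.

```python
def parse_show_faulty(text):
    """Rebuild wrapped rows of a 'show faulty' table into {target: {property: value}}."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip the echoed command, the separator row, and the header row.
        if len(cells) != 3 or cells == ["Target", "Property", "Value"]:
            continue
        target, prop, value = cells
        if prop:
            rows.append([target, prop, value])
        elif rows:
            # Continuation row: append the wrapped pieces to the previous row.
            rows[-1][0] += target
            rows[-1][2] += value
    faults = {}
    for target, prop, value in rows:
        faults.setdefault(target, {})[prop] = value
    return faults

sample = """\
-> show faulty
Target              | Property               | Value
--------------------+------------------------+---------------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB
/SP/faultmgmt/0/    | class                  | fault.component.disabled
 faults/0           |                        |
/SP/faultmgmt/0/    | component              | /SYS/MB/CMP1/MCU1
 faults/0           |                        |
/SP/faultmgmt/0/    | uuid                   | 5e0c6b90-43b1-412d-8747-dc559459
 faults/0           |                        | c703
"""

if __name__ == "__main__":
    for target, props in parse_show_faulty(sample).items():
        print(target, props)
```

With the rows rejoined, the fault class and the affected component can be read from a single record.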
Then, I ran the following command and set the component_state to Enabled:

Code: Select all

-> set /SYS/MB/CMP1/MCU1 component_state=Enabled
Set 'component_state' to 'Enabled'
I checked the status again, and it had changed from Disabled to Enabled:

Code: Select all

-> show /SYS/MB/CMP1/MCU1

 /SYS/MB/CMP1/MCU1
    Targets:

    Properties:
        type = Memory Controller
        component_state = Enabled

    Commands:
        cd
        show

Code: Select all

-> show -level all -o table component_state     
Target              | Property               | Value                           
--------------------+------------------------+---------------------------------
/SYS/MB/CMP1/MCU1   | component_state        | Enabled                
After a reboot the change took effect, and T3-2 recognized all 128 GB of memory.

Code: Select all

orazone01:/export/home/hsiaoc1%prtdiag | grep "Memory size"
Memory size: 130560 Megabytes
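As a quick sanity check on the numbers, the sketch below converts the reported megabytes to gigabytes and compares them with the installed capacity. It assumes 32 GB behind each memory controller, as POST reported, and my reading that only two controllers contributed before the fix and all four after it; the function names are made up for illustration.

```python
MB_PER_GB = 1024

def installed_gb(enabled_mcus, gb_per_mcu=32):
    """Installed capacity, assuming every enabled MCU carries gb_per_mcu."""
    return enabled_mcus * gb_per_mcu

def reported_gb(megabytes):
    """Convert a megabyte figure from OpenBoot or prtdiag to gigabytes."""
    return megabytes / MB_PER_GB

before = reported_gb(65024)    # OpenBoot banner before the fix
after = reported_gb(130560)    # prtdiag "Memory size" after the fix

if __name__ == "__main__":
    print(f"before: {before} GB of {installed_gb(2)} GB")
    print(f"after:  {after} GB of {installed_gb(4)} GB")
```

Both figures come out 0.5 GB below the installed total (63.5 of 64 GB before, 127.5 of 128 GB after), so the 130560 MB reading is consistent with all four controllers being back in use.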
After about an hour, top showed the following:

Code: Select all

load averages:  0.96,  1.13, 26.43                                     11:59:05
717 processes: 713 sleeping, 4 on cpu
CPU states: 99.7% idle,  0.1% user,  0.1% kernel,  0.0% iowait,  0.0% swap
Memory: 128G real, 71G free, 41G swap in use, 82G swap free
There is still plenty of free memory, yet 41 GB of swap is in use. Maybe using this much swap is by design on Oracle's part?