December 2006 Newsletter

Spotlight On: Solaris Volume Manager (SVM) - The State Database
by Bill Calkins

Last month I introduced you to the Solaris Volume Manager (SVM). This month I'll continue the topic by describing the role of the SVM state database.

The SVM state database keeps track of configuration and status information for SVM volumes (metadevices), hot spares, and error conditions. In SVM, a state database must be created and initialized before volumes can be configured. As volumes are created or modified, SVM stores the configuration information in the state database. The state database is therefore critical to SVM, and Sun recommends that you create a minimum of two additional copies of the state database, called replicas, for a total of three. You'll distribute these replicas across your drives to protect against a drive failure. If all of the replicas are stored on the same drive and that drive fails, every copy of the state database is lost, and your SVM configuration is lost with it. Without your SVM configuration information, all data stored on the SVM volumes becomes inaccessible.

Multiple replicas also protect the state database against corruption that could result from a system crash. When the state database is updated, replicas are modified one at a time, so if a crash occurs while the database is being updated, only one of the replicas is corrupted. If a state database replica ever becomes corrupt, that replica is ignored and the remaining copies stay accessible.

At boot time, the SVM software reads all the replicas on the system. If a majority of the state database replicas are in agreement, then that configuration is used to start the Solaris Volume Manager software. The Solaris Volume Manager software determines which databases are correct by using a majority consensus algorithm. This algorithm requires that a majority (half + 1) of the state database replicas are available and are in agreement before any of them are considered valid. If your system loses a state database replica, the Solaris Volume Manager software must determine which state database replicas still contain valid data using the consensus algorithm. It is because of the majority consensus algorithm that you must create at least three state database replicas when you set up your disk configuration. A consensus can be reached as long as at least two of the three state database replicas are available.

It's important to plan the placement of your state database replicas. On each disk, you'll leave a small partition free to hold a replica. You cannot place a replica in a partition that already contains data. I like to create a 20MB partition to hold my database replicas; although a single replica is approximately 4MB, I like to have room for more than one replica, plus extra space in case a replica ever increases in size. Here are a few points worth noting about storing your state database replicas:

  • If possible, create state database replicas on a dedicated slice that is at least 4 Mbytes in size for each planned replica.
  • You cannot create state database replicas on slices containing existing file systems or data, such as the root (/) or /usr file systems, or the swap partition.
  • If possible, place state database replicas on slices that are on separate disk drives and connected through different host bus adapters.
  • Distribute your state databases as follows:
    • Create three replicas on one slice for a system with a single drive.
    • Create two replicas on each drive for a system with two to four drives.
    • Create one replica on each drive for a system with five or more drives.
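The distribution rules above can be sketched as a small helper. This is a hypothetical function for illustration only (not part of SVM), shown in Python:

```python
def replicas_per_drive(num_drives):
    """Suggest replica placement per Sun's distribution guidelines.

    Returns (replicas_per_drive, total_replicas):
      1 drive       -> 3 replicas on its dedicated slice
      2 to 4 drives -> 2 replicas on each drive
      5 or more     -> 1 replica on each drive
    """
    if num_drives < 1:
        raise ValueError("need at least one drive")
    if num_drives == 1:
        per_drive = 3
    elif num_drives <= 4:
        per_drive = 2
    else:
        per_drive = 1
    return per_drive, per_drive * num_drives
```

A two-drive system therefore gets two replicas per drive, four in total, which matches the metadb -a -c2 example later in this article.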

The Solaris Volume Manager software cannot function unless half of all state database replicas are available. Here's what happens when some of the state database replicas are not available and the majority consensus is not met:

  • The system continues to run if at least half of all state database replicas are available.
  • The system panics if fewer than half of the state database replicas are available.
  • The system cannot reboot into multiuser mode unless a majority (half +1) of the total number of state database replicas are available.
Note: There are also performance issues to consider when creating state databases, but that's another article.
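The three availability rules above can be expressed as a short sketch. This is a hypothetical helper for illustration only; the real checks happen inside the SVM software:

```python
def svm_state(available, total):
    """Classify system behavior from available vs. total replicas.

    - fewer than half available      -> the system panics
    - at least a majority (half + 1) -> runs and can boot into multiuser mode
    - exactly half (even totals)     -> keeps running, multiuser boot blocked
    """
    if available * 2 < total:
        return "panic"
    if available >= total // 2 + 1:
        return "runs; can boot multiuser"
    return "runs; cannot boot multiuser"
```

With four replicas, losing two leaves the system running but unable to reboot into multiuser mode; losing three causes a panic.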

The state database and its replicas are managed using the metadb command. In the following example, I have reserved a slice (slice 4) on each of two disks to hold the copies of the state database, and I'll create two copies in each reserved disk slice, giving a total of four state database replicas. In this scenario, the failure of one disk drive results in the loss of half of the state database replicas, but the system will continue to function because at least half are still available. The system panics only when fewer than half of the replicas remain. For example, if I had created only three database replicas and the drive containing two of them failed, the system would panic. To create the state database and its replicas, using the reserved disk slices, enter the following command:
# metadb -a -f -c2 c0t0d0s4 c0t1d0s4

In the previous example, -a indicates a new database is being added, -f forces the creation of the initial database, -c2 indicates that two copies of the database are to be created, and the two cxtxdxsx entries describe where the state databases are to be physically located. The system returns the prompt; there is no confirmation that the database has been created.

The following example demonstrates how to remove the state database replicas from two disk slices, namely c0t0d0s4 and c0t1d0s4:
# metadb -d c0t0d0s4 c0t1d0s4

The system only detects problems with a replica during a configuration change or at boot time, so the system won't fail immediately when replicas become unavailable. It's up to you to monitor the state databases occasionally by running metadb -i and checking the status flags. A normal status is u (active and up-to-date). In the flags field, uppercase letters indicate a problem and lowercase letters are informational only. Here's an example of a normal state database status:

# metadb -i
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s4
     a    p  luo        16              8192            /dev/dsk/c0t1d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s4
 r - replica does not have device relocation information
 o - replica active prior to last mddb configuration change
 u - replica is up to date
 l - locator for this replica was read successfully
 c - replica's location was in /etc/lvm/
 p - replica's location was patched in kernel
 m - replica is master, this is replica selected as input
 W - replica has device write errors
 a - replica is active, commits are occurring to this replica
 M - replica had problem with master blocks
 D - replica had problem with data blocks
 F - replica had format problems
 S - replica is too small to hold current data base
 R - replica had device read errors

Each line of output is divided into the following fields:

  • flags - This field will contain one or more state database status letters. A normal status is a "u" and indicates that the database is up-to-date and active. Uppercase status letters indicate a problem and lowercase letters are informational only.
  • first blk - The starting block number of the state database replica in its partition. Multiple state database replicas in the same partition will show different starting blocks.
  • block count - The size of the replica in disk blocks. The default length is 8192 blocks (4MB), but the size could be increased if you anticipate creating more than 128 metadevices, in which case you would need to increase the size of all state databases.

The last field in each state database listing is the path to the location of the state database replica. As the output shows, there is one master replica; all four replicas are active and up to date and have been read successfully.
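Checking for uppercase flags in metadb -i output can be scripted. The sketch below is a hypothetical parser, assuming the field layout of the sample output above (flag characters, then first blk, block count, and device path):

```python
def replica_problems(metadb_lines):
    """Return (device, bad_flags) pairs for replicas with uppercase flags."""
    problems = []
    for line in metadb_lines:
        fields = line.split()
        # data lines end in a device path; skip the header and flag legend
        if not fields or not fields[-1].startswith("/dev/"):
            continue
        device = fields[-1]
        # everything before the last three fields is the flags column
        flags = "".join(fields[:-3])
        bad = [c for c in flags if c.isupper()]
        if bad:
            problems.append((device, bad))
    return problems
```

Feeding it the healthy listing above returns an empty list; a replica showing an uppercase flag such as M or W would be reported along with its device path.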

Here's what you could see when a replica has failed:

# metadb -i
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s4
     M    p             16              unknown         /dev/dsk/c0t1d0s4
     M    p             8208            unknown         /dev/dsk/c0t1d0s4

In this example, a disk failure or corruption on disk c0t1d0 has rendered the two replicas unusable. The metadb -i output shows the uppercase M flag, indicating that the two replicas on c0t1d0s4 had problems with their master blocks.

When the system is rebooted, the following messages appear:
Insufficient metadevice database replicas located.
Use metadb to delete databases which are broken.
Ignore any Read-only file system error messages.
Reboot the system when finished to reload the metadevice database.
After reboot, repair any broken database replicas which were deleted.

The majority consensus requirement is not met because only half of the state database replicas are available. The system cannot reboot into multiuser mode unless a majority (half + 1) of the total number of state database replicas is available. To repair the situation, you will need to be in single-user mode, so boot the system with the -s option and then remove the failed state database replicas on c0t1d0s4:
# metadb -d c0t1d0s4

Now reboot the system again. It will boot with no problems; although you now have fewer state database replicas (only two), the majority consensus requirement is met. This enables you to repair the failed disk and re-create the metadevice state database replicas on c0t1d0.

If you have questions or comments regarding this article or would like to submit a question or topic for future discussion, please email me at