Wednesday, March 18, 2009

Synchronization issues in a SAN

GFS & DLM:

>>>>>>>>

GFS is the file system that runs on each of the nodes in the cluster. Like all file systems, it is basically a kernel module that runs on top of the VFS (virtual file system) layer of the kernel, and it controls how and where the data is stored on a block device or logical volume. To make a cluster of computers ("nodes") cooperatively share the data on a SAN, GFS coordinates with a cluster locking protocol. One such protocol is DLM, the distributed lock manager, which is also a kernel module. Its job is to ensure that the nodes sharing the data on the SAN don't corrupt each other's data.

Many other file systems, such as ext3, are not cluster-aware, so data kept on a volume shared between multiple computers would quickly become corrupt.
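
To make the pieces concrete, here is a minimal C sketch of what mounting GFS with DLM locking amounts to. The device path, mount point, and cluster:fsname lock table are placeholders, and in practice you would use the mount command rather than calling mount(2) yourself:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Hypothetical names: /dev/vg0/gfs_lv is the shared logical volume,
         * and "mycluster:shared1" is the cluster:fsname lock table the
         * file system was created with. Must run as root on a cluster node. */
        if (mount("/dev/vg0/gfs_lv", "/mnt/shared", "gfs", 0,
                  "lockproto=lock_dlm,locktable=mycluster:shared1") != 0) {
            perror("mount");
            return 1;
        }
        printf("GFS mounted, lock_dlm coordinating access\n");
        return 0;
    }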



>>>>>>>>>

With the DLM, the first node that opens any given file becomes the "lock master" for that file. Once a lock master has been established, any other node that opens or locks that same file must coordinate the locking with the lock master. That means talking over the network, which is much slower than talking to the disk drives. But when the lock master accesses a file it has mastered, it may not have to ask anyone's permission, so its file access is much faster. You may therefore be able to design your environment or application so that the lock master does the majority of accesses to its own files.
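
One way to exploit that locality, sketched below in plain POSIX C, is to have each node write to files named after itself, so the node that creates (and therefore masters) a file is also the one doing nearly all the I/O to it. The /mnt/shared/logs layout is hypothetical:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char host[64], path[128];

        /* Per-node file: the first opener becomes lock master, and since
         * only this node writes here, lock requests stay local. */
        if (gethostname(host, sizeof(host)) != 0) {
            perror("gethostname");
            return 1;
        }
        snprintf(path, sizeof(path), "/mnt/shared/logs/%s.log", host);

        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *msg = "local write, little cross-node lock traffic\n";
        write(fd, msg, strlen(msg));
        close(fd);
        return 0;
    }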



>>>>>>>>>>

If GFS doesn't protect my files from multiple data writers, then why use it? What good is it?

If you have shared storage that multiple nodes need to mount read/write, then you still need it. Perhaps it's best to explain why with an example.

Suppose you had a fibre-channel SAN storage device attached to two computers, and suppose they were running in a cluster but using ext3 instead of GFS to access the data. Immediately after mounting, both systems would be able to see the data on the SAN, and everything would be fine as long as the file system stayed read-only. But without GFS, as soon as one node wrote data, the other node's file system would have no idea what had happened.

Suppose node A creates a file, assigns inode number 4351 to it, and writes 16K of data to it in blocks 3120 and 2240. As far as node B is concerned, there is no inode 4351, and blocks 3120 and 2240 are free. So node B is free to create its own inode 4351 and write data to block 2240, still believing block 3120 is free. The file system's maps of which data areas are used and unused would soon overlap, as would the inode numbers. It wouldn't take long before the whole file system was hopelessly corrupt, along with the files inside it.
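
A toy simulation makes the failure mode concrete. Below, each "node" works from its own private copy of the free-block bitmap, just as two uncoordinated ext3 mounts would. The block numbers echo the example above, and the second allocation of block 2240 is exactly the overlap that corrupts the disk:

    #include <stdio.h>

    #define NBLOCKS 4096

    /* Without cluster locking, each node trusts its own cached copy of
     * the free-block bitmap and never sees the other node's updates. */
    static unsigned char bitmap_node_a[NBLOCKS];
    static unsigned char bitmap_node_b[NBLOCKS];

    static int alloc_block(unsigned char *bitmap, int hint)
    {
        for (int b = hint; b < NBLOCKS; b++) {
            if (!bitmap[b]) {
                bitmap[b] = 1;      /* marked used in this copy only */
                return b;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* Node A allocates blocks 3120 and 2240 for inode 4351... */
        printf("node A got block %d\n", alloc_block(bitmap_node_a, 3120));
        printf("node A got block %d\n", alloc_block(bitmap_node_a, 2240));

        /* ...but node B's stale bitmap still shows 2240 as free, so it
         * hands the same block out again: on-disk corruption. */
        printf("node B got block %d\n", alloc_block(bitmap_node_b, 2240));
        return 0;
    }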

With GFS, when node A assigns inode 4351, node B automatically knows about the change, and the on-disk data stays consistent. When one data area is allocated, every node in the cluster is aware of the allocation, so they don't bump into one another. If node B needs to create another inode, it won't choose 4351, and the file system is not corrupted.
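
To see how shared, locked metadata changes the outcome, here is the same toy allocator with a single authoritative bitmap guarded by a lock. The pthread mutex stands in for a DLM lock, which is a simplification, but the effect is the same: the second allocator is forced past block 2240 to the next free block instead of reusing it.

    /* build: cc -pthread coordinated.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NBLOCKS 4096

    /* One authoritative bitmap, as if all nodes saw the same on-disk
     * state; the mutex stands in for a DLM lock on the bitmap. */
    static unsigned char bitmap[NBLOCKS];
    static pthread_mutex_t bitmap_lock = PTHREAD_MUTEX_INITIALIZER;

    static int alloc_block(int hint)
    {
        int got = -1;
        pthread_mutex_lock(&bitmap_lock);
        for (int b = hint; b < NBLOCKS; b++) {
            if (!bitmap[b]) { bitmap[b] = 1; got = b; break; }
        }
        pthread_mutex_unlock(&bitmap_lock);
        return got;
    }

    static void *node(void *name)
    {
        /* Both "nodes" start from the same hint, but the lock serializes
         * them, so the loser is pushed to the next free block. */
        printf("%s got block %d\n", (char *)name, alloc_block(2240));
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, node, "node A");
        pthread_create(&b, NULL, node, "node B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }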

However, even with GFS, if nodes A and B both decide to operate on the same file X, they will agree on where its data is located, but they can still overwrite each other's data within the file unless the programs doing the writing use some kind of locking scheme to prevent it.
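
The usual fix is ordinary POSIX advisory locking, which GFS (via lock_dlm) honors across the whole cluster, so an fcntl() lock taken on node A is visible to node B. A minimal sketch, assuming a shared file at the hypothetical path /mnt/shared/fileX:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical shared file on the GFS mount. */
        int fd = open("/mnt/shared/fileX", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = {
            .l_type   = F_WRLCK,    /* exclusive write lock    */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,          /* 0 means the whole file  */
        };

        /* F_SETLKW blocks until the lock is granted -- on GFS the
         * grant is coordinated with every other node in the cluster. */
        if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }

        const char *msg = "node A writing safely\n";
        write(fd, msg, strlen(msg));

        fl.l_type = F_UNLCK;        /* release the lock */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }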


>>>>>>>>
It is possible to create a GFS file system on an MD device as long as you are only using MD for multipath. Software RAID is not cluster-aware and is therefore not supported with GFS. The preferred solution in these configurations is to use device-mapper (DM) multipathing rather than md.
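
As a rough sketch of that preferred setup, a minimal /etc/multipath.conf might look like the following. The WWID and alias are placeholders; check your distribution's multipath documentation for the exact syntax it supports:

    defaults {
        user_friendly_names yes
    }

    multipaths {
        multipath {
            # placeholder WWID -- substitute the one "multipath -ll" reports
            wwid  3600508b4000156d70001200000b0000
            alias gfs_storage
        }
    }

The GFS file system would then be created on the resulting /dev/mapper/gfs_storage device, so every node addresses the storage through one multipathed device instead of an md device.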