Cluster File Systems Discussions

RAID and File Systems Discussions

Written by Jeff Layton 
Tuesday, 13 June 2006   

Some aid for those who use RAID


The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article we turn our attention to other mailing lists that can also provide useful information. I review some postings from the Rocks-Discuss and LVM mailing lists, reporting on RAID and file system preferences.

ROCKS: RAID
Most of the time the mailing lists for specific cluster applications or cluster distributions are devoted to specific questions about the application or distribution. However, sometimes you will see general questions and very good responses from knowledgeable people on these lists. Rocks is a popular cluster distribution. On January 6, 2004, a simple question to the Rocks mailing list gave rise to some good recommendations. Purushotham Komaravolu asked for recommendations for a RAID configuration for about 200 GB of data (recall that RAID stands for Redundant Array of Inexpensive Disks).

Greg Bruno provided the first answer. He said that for pure capacity (not necessarily throughput) you should use a 3ware 8006-2LP serial ATA controller with two 200 GB serial ATA drives configured for mirroring (RAID-1). He said that this should give about 80 Megabytes/sec (MB/s) in read performance and about 40 MB/s in write performance. For more performance, Greg recommended using a 3ware 8506-4LP serial ATA controller and four 100 GB ATA drives configured as RAID-10 (two mirrored pairs which are then striped together). Greg estimated performance at 160 MB/s for read IO and 80 MB/s for write IO, if you use decent disks.
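To see where ballpark numbers like these come from, a simple back-of-the-envelope model is enough: with mirroring (RAID-1), reads can be spread across both copies while every write goes to both disks, and striping (RAID-0 or the striped layer of RAID-10) scales both reads and writes with the number of stripes. The sketch below works through that arithmetic, assuming ideal scaling and a hypothetical 40 MB/s per-drive streaming rate chosen only to match the figures quoted above; real controllers and disks will fall short of these numbers.

    # Back-of-the-envelope RAID throughput estimates (ideal scaling assumed).
    # The 40 MB/s per-disk streaming rate is a hypothetical figure chosen to
    # match the ballpark numbers quoted in the discussion above.

    DISK_MBPS = 40  # assumed streaming rate of a single drive, MB/s

    def raid1(n_mirrors, disk=DISK_MBPS):
        """RAID-1: reads can be spread across mirrors, writes go to every mirror."""
        return {"read": n_mirrors * disk, "write": disk}

    def raid0(n_stripes, disk=DISK_MBPS):
        """RAID-0: both reads and writes scale with the number of stripes."""
        return {"read": n_stripes * disk, "write": n_stripes * disk}

    def raid10(n_pairs, disk=DISK_MBPS):
        """RAID-10: stripe across mirrored pairs."""
        pair = raid1(2, disk)
        return {"read": n_pairs * pair["read"], "write": n_pairs * pair["write"]}

    if __name__ == "__main__":
        print("RAID-0,  2 disks:", raid0(2))    # ~80 MB/s read, ~80 MB/s write
        print("RAID-1,  2 disks:", raid1(2))    # ~80 MB/s read, ~40 MB/s write
        print("RAID-10, 4 disks:", raid10(2))   # ~160 MB/s read, ~80 MB/s write

Plugging in two mirrored drives gives roughly 80 MB/s reads and 40 MB/s writes, and striping two such pairs doubles both, which is exactly the kind of estimate Greg was quoting.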

Jon Forrest joined the discussion, saying that he had a difficult time getting the Promise and Iwill RAID cards (RAID-0 or RAID-1) working with Linux. Greg Bruno responded that they had good luck with 3ware controllers and bad luck with the controllers that Jon mentioned. However, Tim Carlson chimed in that he was not impressed with the RAID-5 performance of the 3ware controllers, even using serial ATA (SATA) drives. Tim said that he had never gotten more than 50 MB/s using RAID-5 and SATA. He recommended going with SCSI drives and a SCSI RAID controller along with software RAID. Tim finally suggested using a box of IDE (ATA) disks with a back-end controller that converts things over to SCSI or FC (Fibre Channel). He said that in his experience this solution scales nicely to tens of TB (terabytes).

Joe Landman jumped in to say that using RAID-5 for high performance is not a good idea; rather, one should use something like RAID-0 (striping) for increased performance. Joe also took issue with the idea of using SCSI disks. Joe said that in his experience ATA drives were very good but suffered from an interrupt problem that leads to increased CPU load, to the point that you could swamp a CPU by writing many, many small blocks at the same time (think of a cluster head node or NFS file server). SCSI controllers hide this behind a controller interface. Joe went on to note that current CPUs have much more power than the controller on a RAID card. However, combining software RAID with a cheap hardware controller is asking for trouble, particularly under large loads. Joe ended by agreeing with Tim's recommendation of using IDE disks with a back-end controller that converts to SCSI or FC.

A little later Joe said that the important question was what file system people were running on their RAID disks. Joe said that XFS was the best and should be incorporated into ROCKS (note that XFS is now part of the standard 2.4 and 2.6 kernels from kernel.org). Joe Kaiser chimed in that he thought XFS was great and that they had had very good luck with it. Tim Carlson jumped back in to say that he had good luck with ext3. Joe Kaiser responded that they had seen some data corruption with ext3 on large arrays when the disk had been filled all the way. Joe and Tim then discussed several aspects of design, including the importance of understanding your data needs and your data layout.

This discussion points out that there are several important considerations when designing a file server for a cluster. Considerations such as your data layout, the host machine (CPU power), disk types, RAID controllers, monitoring capabilities, and file system choice can all have a great effect on the resulting IO performance.

ROCKS: Using Other File Systems
A couple of months after the previous discussion about RAID, a discussion about alternative file systems began on the Rocks-Discuss mailing list. On April 16, 2004, Yaron Minsky asked about using something other than ext2 on the master node of his ROCKS cluster, particularly ReiserFS or XFS. Phillip Papadopoulos replied that this was a bug in ROCKS 3.1.0 that forced you to use ext2 and would be fixed in the next release. However, he did say that you could convert the ext2 filesystem to ext3 by adding a journal (with tune2fs, for example).
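The conversion really is just a matter of adding a journal to the existing filesystem in place. As a small, hedged illustration of the idea, the sketch below shells out to tune2fs (from e2fsprogs) to see whether a device already carries a journal; the device name is only an example, and you need enough privilege to read the device's superblock.

    # Minimal sketch: check whether an ext2/ext3 device already has a journal.
    # Assumes tune2fs (from e2fsprogs) is installed and the script has enough
    # privilege to read the device's superblock. The device name is an example.
    import subprocess
    import sys

    def has_journal(device):
        """Return True if tune2fs reports the has_journal feature for device."""
        out = subprocess.run(["tune2fs", "-l", device],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if line.startswith("Filesystem features:"):
                return "has_journal" in line
        return False

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda1"  # example device
        if has_journal(dev):
            print(f"{dev} already has a journal (ext3).")
        else:
            print(f"{dev} is plain ext2; 'tune2fs -j {dev}' would add a journal.")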

Laurence Liew responded that he thought ext2, ext3, ReiserFS, and XFS all had their strengths and weaknesses. He suggested using ext2 for a while to understand the application usage pattern. He also said that in some cases, modifying the layout of the cluster would have a bigger impact than changing file systems. Yaron replied that he thought ext3 fared worse than XFS or JFS in benchmarks. Laurence replied that he remembered some SNAP benchmark results that showed ext3 winning in certain cases.

There was some discussion about whether Red Hat included ReiserFS and/or XFS in the version of RHEL (Red Hat Enterprise Linux) that ROCKS uses. It was finally determined that XFS was not included, while ReiserFS was included, but only as an unsupported RPM. Later on, Josh Brandt mentioned that he thought ReiserFS would do better on lots of small files compared to other file systems; for large files, however, ReiserFS performed worse than other file systems. Yaron, the original poster, posted his basic usage pattern (size of files, number of files, number of directories, etc.), and Josh thought he should give ReiserFS a try.

While this discussion is brief, it does show that file system preferences, and the performance people actually see from each file system, vary from one group to the next.

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this column I report on using semi-public PCs for grid-type applications and how we can handle large numbers of files. We also turn to the ganglia-developers mailing list to report on how one can add a "disk alive" metric to ganglia. You can consult the Beowulf archives, the bioclusters archives, and the ganglia archives for the actual conversations.

Using Semi-Public PCs
There was an interesting discussion a few months back on the bioclusters mailing list about using semi-public PCs for heavy computational jobs. On Feb. 15, 2004, Arnon Klein asked about running his jobs on semi-public machines running various flavors of Windows. Arnon asked this question because he was doing his graduate research and needed computational power. He had already exhausted the machines easily available to him, so he was looking for suggestions about what to do next.

The first response came from Chris Dwan. Chris responded that he was in a similar boat but had managed to put together some systems from various campuses into something like a grid. He also provided a very useful ranking of systems in terms of access difficulty. For example, systems that he maintained were easiest to get into, followed by systems running Linux or OS X (which Chris also runs). The lowest two ranked systems were Windows machines that either could be rebooted at night or could not be rebooted at all. Chris went on to talk about some schedulers that can steal cycles from idle workstations (e.g., SGE, Torque, LSF), although he said that integrating disparate schedulers can be very difficult. He did mention Condor from the University of Wisconsin as a possible solution. He also mentioned the grid software from United Devices, which runs on Windows machines but will use compute cycles from other machines.

Farud Ghazali mentioned that he was also looking for a solution to this type of problem. He pointed out that there were many practical difficulties, including authentication across disparate resources. Chris Dwan jumped in to explain how he had hacked up something to handle authentication for him.

Ron Chen joined the conversation to mention that SGE (Sun Grid Engine) version 6.0 would integrate with JXTA, which in turn offers Jgrid, providing P2P (peer-to-peer) workload management in a fashion similar to SETI@home. However, he did say that SGE 6.0 wouldn't be out until May of 2004 (and it might slip slightly from then). Until then, Ron recommended using BOINC. This package starts jobs and transmits data using port 80, which makes it easier to get in and out of a firewall than other approaches. It also has versions for Windows, Linux, Solaris, and OS X. John van Workum also mentioned GreenTea, which offers a Java P2P client that gives grid capabilities for running jobs. Bruce Moxon also mentioned that the Cornell Theory Center has some tools that might help with Windows machines.

While this discussion was short, it did offer some ideas that could help people in similar situations. There are many people and groups thinking about the same things that Arnon mentioned in his first posting.

Disk Alive Metric
I'm sure many readers are aware of ganglia. It is a scalable distributed monitoring system for high performance computing systems such as clusters and grids. It is open source and in use on over 500 clusters throughout the world. On December 22, 2003, on the Ganglia Developers mailing list, Federico Sacerdoti asked about a metric that ganglia could watch that would report whether a disk was alive or not. It seems that Federico had been talking to a Purdue (my alma mater) system administrator about a cluster that is put together from old PCs. The disks in the machines keep failing, but ganglia fails to report the disks as down, since the ganglia daemon will still report a heartbeat even if the node is basically down. Federico posted a possible solution that he had worked out with the administrator but had not yet tried.

Brooks Davis replied that he didn't think it would work, at least in FreeBSD, because of the way Unix and Unix-like systems work. He did offer another solution that read random blocks from a file system to make sure the drive was still functioning.

Robert Walsh responded that he had been trying to get information from the SMART (Self-Monitoring, Analysis and Reporting Technology) data in most hard drives into ganglia. Brooks Davis mentioned that he thought integrating smartmontools with ganglia might offer a solution. smartmontools is a package that allows you to control and monitor the SMART data contained in virtually all modern hard drives.
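As a rough illustration of the smartmontools idea (not the code Robert or Brooks actually wrote), the sketch below shells out to smartctl for the drive's overall health verdict and publishes it to ganglia with gmetric. It assumes smartmontools and ganglia's gmetric are installed, that it runs as root (smartctl needs raw device access), and the device list is only an example.

    # Hedged sketch: publish a per-disk SMART health flag to ganglia via gmetric.
    # Assumes smartmontools (smartctl) and ganglia's gmetric are installed and
    # that the script runs with root privileges. The device list is an example.
    import subprocess

    DEVICES = ["/dev/sda"]  # example; enumerate the real devices on your system

    def smart_healthy(device):
        """Return 1 if smartctl's overall health self-assessment passes, else 0."""
        result = subprocess.run(["smartctl", "-H", device],
                                capture_output=True, text=True)
        return 1 if "PASSED" in result.stdout else 0

    def publish(device, value):
        """Push the health flag into ganglia as a uint8 metric."""
        name = "smart_health_" + device.replace("/", "_")
        subprocess.run(["gmetric", "--name", name,
                        "--value", str(value), "--type", "uint8"], check=True)

    if __name__ == "__main__":
        for dev in DEVICES:
            publish(dev, smart_healthy(dev))

Run from cron on each node, something like this gives ganglia a per-drive health metric to graph and alarm on, which is essentially what the thread was after.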

The discussion spilled over into January of 2004, when Sander van Vliet announced that he had a preliminary working version of a gmetric code that would test whether the drives were alive. The code walks the /proc/mounts file looking for drives that are mounted and then attempts to write 4 bytes to the end of each currently used file system to determine whether the disk is alive. If there were no errors along the way, then the disk is alive. Sander then posted that he had a version of his code working that used the SMART data, but the job had to be run as root. This problem was sorted out fairly quickly, though. Throughout the conversation, there was an effort to make the code work under Linux and the various BSD flavors, especially FreeBSD. At this point the thread died out, but it appears as though the code was working correctly for Linux and FreeBSD.
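Sander's code was a gmetric module, but the basic test is easy to reproduce. The sketch below is a loose Python rendering of the same idea, not his actual code: walk /proc/mounts, skip the obvious pseudo-filesystems, and try to write a few bytes to each real filesystem; a failed write marks the disk as suspect.

    # Loose sketch of the "disk alive" test described above: for each real,
    # writable filesystem in /proc/mounts, try to write a few bytes and sync.
    # This approximates the idea; it is not Sander's original gmetric code.
    import os
    import tempfile

    # Pseudo-filesystems to skip; the list is illustrative, not exhaustive.
    SKIP_FSTYPES = {"proc", "sysfs", "devpts", "tmpfs", "devtmpfs"}

    def disk_alive(mount_point):
        """Return True if a small write to the filesystem succeeds."""
        try:
            fd, path = tempfile.mkstemp(dir=mount_point, prefix=".diskalive-")
            try:
                os.write(fd, b"ping")   # the 4-byte probe described above
                os.fsync(fd)            # force it out to the device
            finally:
                os.close(fd)
                os.unlink(path)
            return True
        except OSError:
            # Note: a read-only mount also lands here; the original discussion
            # was concerned with read-write data disks.
            return False

    def check_all():
        results = {}
        with open("/proc/mounts") as mounts:
            for line in mounts:
                device, mount_point, fstype = line.split()[:3]
                if fstype in SKIP_FSTYPES or not device.startswith("/dev/"):
                    continue
                results[mount_point] = disk_alive(mount_point)
        return results

    if __name__ == "__main__":
        for mount_point, alive in check_all().items():
            print(f"{mount_point}: {'alive' if alive else 'NOT RESPONDING'}")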

