Introduction
I created this document based on my experience setting up a small test cluster for the Ceph distributed file system. I used VMware Server for this setup as it was quick and easy to get going with, and I do not have any spare machines lying around that I could use. Plus, the fact that VMware Server is free sure doesn't hurt! This document mainly serves as personal notes for myself, as I tend to forget things like this rather quickly.
Machine Setup and Configuration
My simple cluster has 3 nodes in total, so I will need to create 3 virtual machines. If you have never used VMware before, here is a simple guide for creating a Linux virtual machine using the CentOS distribution. If you already know how to create Linux virtual machines, just create 3 of them. Also, create an additional hard disk for each virtual machine that will act as a storage node. This will be used for the btrfs file system and can be relatively small; in my case, I made it 1 GB.
After we have our 3 Linux nodes up and running, I like to modify the /etc/hosts file so I don't have to remember IP addresses all the time. My /etc/hosts file looks as follows.
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
192.168.221.137 ceph0
192.168.221.138 ceph1
192.168.221.139 ceph2
Checking out Ceph Source Code
Now we are ready to check out Ceph. We will build it at a later stage when we are ready. If git is not present on your machine, you can follow these instructions to install it.
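If git is missing, a quick way to get it on CentOS is via yum, assuming a repository that carries the git package (such as EPEL) is configured on the machine; the instructions linked above cover other options:

# yum install git

With git in place, clone the source tree into /usr/src as follows.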
# cd /usr/src
# git clone git://ceph.newdream.net/ceph.git
Initialize ceph/.git
Initialized empty Git repository in /usr/src/ceph/.git/
remote: Generating pack...
remote: Counting objects: 498
remote: Done counting 37941 objects.
remote: Deltifying 37941 objects...
remote: 100% (37941/37941) done
remote: Total 37941 (delta 30117), reused 34536 (delta 27139)
Receiving objects: 100% (37941/37941), 8.46 MiB | 568 KiB/s, done.
Resolving deltas: 100% (30117/30117), done.
#
We need to export the ceph directory so that each node in the cluster can access the binaries and subdirectories which are needed. We will use NFS for this.
On the ceph0 host (or whichever host has the Ceph source code), edit the /etc/exports file. It should have a line similar to the following (assuming you modified the /etc/hosts file as I described; otherwise, you will need to enter IP addresses in this file):
/usr/src/ceph ceph1(rw,async,no_subtree_check) ceph2(rw,async,no_subtree_check)
This export entry is extremely simple. I am not taking security concerns into account here as this is a very simple test cluster we are setting up. Next, restart (or start if it was never started) the NFS service as follows.
# service nfs restart
Shutting down NFS mountd:    [ OK ]
Shutting down NFS daemon:    [ OK ]
Shutting down NFS quotas:    [ OK ]
Shutting down NFS services:  [ OK ]
Starting NFS services:       [ OK ]
Starting NFS quotas:         [ OK ]
Starting NFS daemon:         [ OK ]
Starting NFS mountd:         [ OK ]
#
Now mount this directory on the other nodes in the cluster.
ceph1# mount -t nfs -o rw ceph0:/usr/src/ceph /usr/src/ceph
ceph2# mount -t nfs -o rw ceph0:/usr/src/ceph /usr/src/ceph
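If you want these NFS mounts to come back automatically after a reboot, a line like the following in /etc/fstab on ceph1 and ceph2 should do the trick (a minimal sketch; tune the mount options to your liking):

ceph0:/usr/src/ceph  /usr/src/ceph  nfs  rw  0 0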
Building the Kernel Client
Before we build Ceph, we want to get the kernel client up and running. For this guide, I am going to build the client into the kernel. We will do this on every node. First, we need to download the latest kernel source code using git.
# cd /usr/src/kernels
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
Initialize linux-2.6/.git
Initialized empty Git repository in /usr/src/kernels/linux-2.6/.git/
remote: Counting objects: 882421, done.
remote: Compressing objects: 100% (155150/155150), done.
remote: Total 882421 (delta 736090), reused 872174 (delta 725984)
Receiving objects: 100% (882421/882421), 209.92 MiB | 336 KiB/s, done.
Resolving deltas: 100% (736090/736090), done.
Checking out files: 100% (24247/24247), done.
# cd /usr/src/kernels/linux-2.6
# patch -p1 < /usr/src/ceph/src/kernel/kconfig.patch
patching file fs/Kconfig
Hunk #1 succeeded at 1557 with fuzz 2 (offset 38 lines).
patching file fs/Makefile
Hunk #1 succeeded at 122 (offset 4 lines).
# ln -s /usr/src/ceph/src/kernel fs/ceph
# ln -s /usr/src/ceph/src/include/ceph_fs.h fs/ceph/ceph_fs.h
# cd /usr/src/kernels/linux-2.6
# make mrproper
# make menuconfig
A lot of configuration options will be presented to you in this menu. Rather than go through them here, I'll point you to a much more in-depth guide here which discusses the configuration options in more detail.
Ensure that you enable Ceph. It should be the first item under File Systems -> Network File Systems.
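Once you have saved your configuration, it does no harm to confirm that the option was actually recorded. This assumes the config symbol added by the kconfig patch is named CONFIG_CEPH_FS; check the patch itself if the grep below comes back empty:

# grep CEPH /usr/src/kernels/linux-2.6/.config
CONFIG_CEPH_FS=y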
Now we are ready to build the kernel. This is only a few commands but will take quite some time depending on your machine.
# cd /usr/src/kernels/linux-2.6
# make bzImage
# make modules
# make modules_install
# mkinitrd /boot/initrd-2.6.26.img 2.6.26
# cp /usr/src/kernels/linux-2.6/arch/i386/boot/bzImage /boot/bzImage-2.6.26
# cp /usr/src/kernels/linux-2.6/System.map /boot/System.map-2.6.26
# ln -s /boot/System.map-2.6.26 /boot/System.map
Finally, we need to configure the GRUB bootloader so that it can boot the new kernel. The GRUB configuration is located in /boot/grub/menu.lst. Once you have finished editing it on a fresh CentOS installation, the file should look as follows:
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/VolGroup00/LogVol00
#          initrd /initrd-version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-53.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-53.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-53.el5.img
title LatestKernel (2.6.26)
        root (hd0,0)
        kernel /bzImage-2.6.26 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.26.img
Now reboot and select the kernel you just built.
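Once the machine is back up, a quick check confirms that the new kernel is the one running (the exact version string depends on how you configured EXTRAVERSION, but it should report 2.6.26):

# uname -r
2.6.26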
Installing btrfs and Creating btrfs File System
In this guide, I am using btrfs instead of ebofs for each OSD. I am only performing these steps on the storage nodes, ceph1 and ceph2. I will only show the steps for one node, but you should obviously repeat them on both. Mercurial is the SCM used by the btrfs developers; some easy-to-follow instructions for installing this tool are provided by Mercurial here.
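On CentOS, Mercurial can usually be installed with yum, assuming a repository that provides the mercurial package is configured; this is just a sketch, and the Mercurial instructions linked above cover other ways of installing it:

# yum install mercurial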
First, we obtain the latest sources:
# mkdir -p /usr/src/btrfs
# cd /usr/src/btrfs
# hg clone http://www.kernel.org/hg/btrfs/progs-unstable
destination directory: progs-unstable
requesting all changes
adding changesets
adding manifests
adding file changes
added 247 changesets with 888 changes to 58 files
updating working directory
53 files updated, 0 files merged, 0 files removed, 0 files unresolved
# hg clone http://www.kernel.org/hg/btrfs/kernel-unstable
destination directory: kernel-unstable
requesting all changes
adding changesets
adding manifests
adding file changes
added 650 changesets with 2137 changes to 64 files (+1 heads)
updating working directory
54 files updated, 0 files merged, 0 files removed, 0 files unresolved
#
After obtaining the latest sources, a patch by Sage Weil needs to be applied in order for btrfs to work correctly with Ceph. The original email with the patch from Sage is available here, but I keep a local copy of the patch which can be fetched with wget as follows.
# cd /usr/src/btrfs/kernel-unstable
# wget http://www.ece.umd.edu/~posulliv/ceph/sage_btrfs.patch
# patch < sage_btrfs.patch
patching file ctree.h
patching file ioctl.c
patching file transaction.c
patching file transaction.h
#
Now we are ready to build and install everything btrfs related.
# cd /usr/src/btrfs/kernel-unstable
# make
bash version.sh
make -C /lib/modules/`uname -r`/build M=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/linux-2.6'
  CC [M]  /usr/src/btrfs/kernel-unstable/super.o
  CC [M]  /usr/src/btrfs/kernel-unstable/ctree.o
  CC [M]  /usr/src/btrfs/kernel-unstable/extent-tree.o
  CC [M]  /usr/src/btrfs/kernel-unstable/print-tree.o
  CC [M]  /usr/src/btrfs/kernel-unstable/root-tree.o
  CC [M]  /usr/src/btrfs/kernel-unstable/dir-item.o
  CC [M]  /usr/src/btrfs/kernel-unstable/hash.o
  CC [M]  /usr/src/btrfs/kernel-unstable/file-item.o
  CC [M]  /usr/src/btrfs/kernel-unstable/inode-item.o
  CC [M]  /usr/src/btrfs/kernel-unstable/inode-map.o
  CC [M]  /usr/src/btrfs/kernel-unstable/disk-io.o
  CC [M]  /usr/src/btrfs/kernel-unstable/transaction.o
  CC [M]  /usr/src/btrfs/kernel-unstable/bit-radix.o
  CC [M]  /usr/src/btrfs/kernel-unstable/inode.o
  CC [M]  /usr/src/btrfs/kernel-unstable/file.o
  CC [M]  /usr/src/btrfs/kernel-unstable/tree-defrag.o
  CC [M]  /usr/src/btrfs/kernel-unstable/extent_map.o
  CC [M]  /usr/src/btrfs/kernel-unstable/sysfs.o
  CC [M]  /usr/src/btrfs/kernel-unstable/struct-funcs.o
  CC [M]  /usr/src/btrfs/kernel-unstable/xattr.o
  CC [M]  /usr/src/btrfs/kernel-unstable/ordered-data.o
  CC [M]  /usr/src/btrfs/kernel-unstable/extent_io.o
  CC [M]  /usr/src/btrfs/kernel-unstable/volumes.o
  CC [M]  /usr/src/btrfs/kernel-unstable/async-thread.o
  CC [M]  /usr/src/btrfs/kernel-unstable/ioctl.o
  CC [M]  /usr/src/btrfs/kernel-unstable/locking.o
  CC [M]  /usr/src/btrfs/kernel-unstable/orphan.o
  CC [M]  /usr/src/btrfs/kernel-unstable/ref-cache.o
  CC [M]  /usr/src/btrfs/kernel-unstable/acl.o
  LD [M]  /usr/src/btrfs/kernel-unstable/btrfs.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /usr/src/btrfs/kernel-unstable/btrfs.mod.o
  LD [M]  /usr/src/btrfs/kernel-unstable/btrfs.ko
make[1]: Leaving directory `/usr/src/kernels/linux-2.6'
# insmod /usr/src/btrfs/kernel-unstable/btrfs.ko
# cd /usr/src/btrfs/progs-unstable
# make
# make install
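Note that insmod only loads the module for the current boot; after a reboot, btrfs will not be available until the module is loaded again. One crude but simple approach, assuming you keep the module at the path shown above, is to append the insmod command to /etc/rc.local:

# echo "insmod /usr/src/btrfs/kernel-unstable/btrfs.ko" >> /etc/rc.local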
Next, we need to prepare the disk on which we will create the btrfs file system. We will be using /dev/sdb. If you do not have a second hard drive, it is quite easy to shut down your virtual machine and add one. Now, we will use fdisk to create a partition. This has to be done on all storage nodes.
# fdisk /dev/sdb

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-130, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-130, default 130):
Using default value 130

Command (m for help): p

Disk /dev/sdb: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         130     1044193+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
#
Now we are ready to create our btrfs file system. The steps to follow for this are:
# mkdir -p /mnt/btrfs
# mkfs.btrfs /dev/sdb1
# mount -t btrfs /dev/sdb1 /mnt/btrfs
# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  6.7G  5.7G  662M  90% /
/dev/sda1                         99M   19M   76M  20% /boot
tmpfs                            192M     0  192M   0% /dev/shm
/dev/sdb1                       1020M   40K 1020M   1% /mnt/btrfs
ceph0:/usr/src/ceph              6.7G  5.4G 1003M  85% /usr/src/ceph
#
Building Ceph
We are now ready to perform the build. We will compile with debugging symbols since we will be interested in debugging later on. This only needs to be done on one node since the directory is exported via NFS.
# cd /usr/src/ceph
# ./autogen.sh
# CXXFLAGS="-g" ./configure
# make
# cd src
# mkdir out log
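Assuming the build completed without errors, the daemons and utilities used in the rest of this guide should now be sitting in the src directory. A quick sanity check:

# cd /usr/src/ceph/src
# ls cmon cmds cosd csyn cmonctl mkmonfs monmaptool osdmaptool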
Setting up a small cluster
Now we can start setting up our cluster. The first step is to set up the monitor.
ceph0# cd /usr/src/ceph/src
ceph0# ./monmaptool --create --clobber --add 192.168.221.137:12345 --print .ceph_monmap
ceph0# ./mkmonfs --clobber mondata/mon0 --mon 0 --monmap .ceph_monmap
Now, we start up the monitor for the first time. We will enable extensive logging which will be produced in the /usr/src/ceph/src/log and /usr/src/ceph/src/out directories.
ceph0# ./cmon mondata/mon0 -d --debug_mon 10 --debug_ms 1
Next, we build the OSD cluster map which is defined as a compact, hierarchical description of the devices comprising the storage cluster. For this simple setup, I have 2 storage nodes - ceph1 and ceph2. After creating the cluster map, we must inform the monitor of the map.
ceph0# ./osdmaptool --clobber --createsimple .ceph_monmap 2 --print .ceph_osdmap
ceph0# ./cmonctl osd setmap -i .ceph_osdmap
Now we move to the storage nodes. On each storage node, we first initialize the individual object stores.
ceph1# mkdir -p /mnt/btrfs/osd0
ceph2# mkdir -p /mnt/btrfs/osd1
ceph1# cd /usr/src/ceph/src
ceph1# ./cosd --mkfs_for_osd 0 /mnt/btrfs/osd0
ceph2# cd /usr/src/ceph/src
ceph2# ./cosd --mkfs_for_osd 1 /mnt/btrfs/osd1
Next the OSD daemons are started up on each storage node. Again, we will enable extensive logging so we can troubleshoot any issues that arise. Log files will be placed in the same directories as mentioned previously.
ceph1# cd /usr/src/ceph/src
ceph1# ./cosd /mnt/btrfs/osd0 /mnt/btrfs/osd0 -d --debug_osd 10
ceph2# cd /usr/src/ceph/src
ceph2# ./cosd /mnt/btrfs/osd1 /mnt/btrfs/osd1 -d --debug_osd 10
Finally, we start the metadata server on ceph0.
ceph0# cd /usr/src/ceph/src
ceph0# ./cmds --debug_ms 1 --debug_mds 10 -d
Verification of the Cluster
Now, we want to verify the file system is up and working.
ceph0# cd /usr/src/ceph/src
ceph0# ./cmonctl osd stat
mon0 <- 'osd stat'
mon0 -> 'e4: 2 osds: 2 up, 2 in' (0)
ceph0# ./cmonctl pg stat
mon0 <- 'pg stat'
mon0 -> 'v27: 1152 pgs: 1152 active+clean; 4 MB used, 2035 MB / 2039 MB free' (0)
ceph0# ./cmonctl mds stat
mon0 <- 'mds stat'
mon0 -> 'e3: 1 nodes: 1 up:active' (0)
ceph0# ./csyn --syn makedirs 1 1 1 --syn walk
starting csyn at 0.0.0.0:57466/13601/0
mounting and starting 1 syn client(s)
waiting for client(s) to finish
10000000000 drwxr-xr-x 1 0 0 0 Sat Aug 2 04:24:15 2008 /syn.0.0
10000000002 drwxr-xr-x 1 0 0 0 Sat Aug 2 04:24:15 2008 /syn.0.0/dir.0
10000000001 -rw-r--r-- 1 0 0 0 Sat Aug 2 04:24:15 2008 /syn.0.0/file.0
10000000003 -rw-r--r-- 1 0 0 0 Sat Aug 2 04:24:15 2008 /syn.0.0/dir.0/file.0
ceph0#
If you do not see output similar to that shown above, then it's time to start troubleshooting!
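A good place to start is to make sure the daemons are actually running and then dig into the logging we enabled earlier; the paths below are the ones set up above, and the process names are cmon, cosd and cmds:

# ps aux | grep -E 'cmon|cosd|cmds'
# ls -l /usr/src/ceph/src/log /usr/src/ceph/src/out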
Using the Kernel Client
Since we went to all that effort of building the kernel client, we may as well utilize it. We will mount the Ceph file system on the 2 storage nodes. This is quite simple to do.
# mkdir -p /mnt/ceph
# mount -t ceph 192.168.221.137:/ /mnt/ceph/
# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  6.7G  5.3G  1.1G  84% /
/dev/sda1                         99M   19M   76M  20% /boot
tmpfs                            192M     0  192M   0% /dev/shm
/dev/sdb1                       1020M  3.9M 1016M   1% /mnt/btrfs
192.168.221.137:/usr/src/ceph    6.7G  5.4G  984M  85% /usr/src/ceph
192.168.221.137:/                2.0G  7.0M  2.0G   1% /mnt/ceph
# ls -l /mnt/ceph
total 1
drwxr-xr-x 1 root root 0 Aug  2 05:01 syn.0.0
# ls -l /mnt/ceph/syn.0.0
total 0
drwxr-xr-x 1 root root 0 Aug  2 04:24 dir.0
-rw-r--r-- 1 root root 0 Aug  2 04:24 file.0
#
Testing the File System
iozone is a file system benchmarking tool. Download it and play around with it. It's a nice tool for stressing a file system. I have not done too much with it in the context of a distributed file system, so I'm still messing around with it. Installing it is quite simple though. In the output below, I was compiling on an AMD64 platform.
# cd /usr/src/
# wget http://www.iozone.org/src/current/iozone3_308.tar
# tar xvf iozone3_308.tar
# cd /usr/src/iozone3_308/src/current
# make linux-AMD64
# ./iozone
Usage: For usage information type iozone -h
# ./iozone -g 1024M -f /mnt/ceph/iozone-file.tmp
        Iozone: Performance Test of File I/O
                Version $Revision: 3.308 $
                Compiled for 64 bit mode.
                Build: linux-AMD64

        Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins
                      Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                      Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                      Randy Dunlap, Mark Montague, Dan Million,
                      Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg,
                      Benny Halevy, Erik Habbinga, Kris Strecker,
                      Walter Wong, Joshua Root.

        Run began: Sat Aug 2 05:27:50 2008

        Using maximum file size of 1048576 kilobytes.
        Command line used: ./iozone -g 1024M -f /mnt/ceph/iozone-file.tmp
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.

                                                    random  random    bkwd  record  stride
        KB  reclen   write rewrite    read    reread    read   write    read rewrite    read  fwrite frewrite   fread  freread
       512       4  135952  585814 1854787  3481620 2600470 1672748  595397 2007358 1712772   95665  1815584 1968713  3325278

iozone test complete.
#
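The run above only exercises a single file and record size (a 512 KB file with 4 KB records, as the output shows). iozone's automatic mode walks through a range of file and record sizes and gives a much better picture of how the file system behaves; a sketch of such a run, which will take considerably longer, is:

# ./iozone -a -g 1024M -f /mnt/ceph/iozone-file.tmp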