I feel it's safe to say that by now we have a fair idea of what GlusterFS is, and that we are comfortable installing GlusterFS and creating a volume.
Let's create a volume with two local directories as two bricks.
# gluster volume create test-vol Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2
volume create: test-vol: success: please start the volume to access data
# gluster volume start test-vol
volume start: test-vol: success
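The create command we just ran is one instance of a more general form. Here is a sketch of the syntax (run gluster volume help on your own install for the authoritative usage); we'll come back to what replica and stripe mean shortly:
# gluster volume create <VOLNAME> [replica <COUNT>] [stripe <COUNT>] [transport tcp|rdma] <BRICK> ...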
Let's mount this volume, and create a file in that volume.
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# touch /mnt/test-vol-mnt/file1
# cd /mnt/test-vol-mnt/
# ls -lrt
total 1
-rw-r--r--. 1 root root 0 Jan 10 14:40 file1
Now where does this file really get created in the backend? Let's have a look at the two directories we used as bricks (subvolumes):
# cd /home/asengupt/node1
# ls -lrt
total 0
# cd ../node2/
# ls -lrt
total 1
-rw-r--r--. 1 root root 0 Jan 10 14:40 file1
So the file we created at the mount point (/mnt/test-vol-mnt) got created in one of the bricks. But why in this particular brick, and not the other one? The answer to that question lies in the volume information.
# gluster volume info
Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
It gives us a lot of information. While creating a volume we have the liberty of providing a number of options, like the transport type and the volume type, which eventually decide the behaviour of the volume. But at this moment what interests us most is the type. It says that our volume "test-vol" is a distributed volume. What does that mean?
The type of a volume decides how exactly the volume stores the data in the bricks. A volume can be of the following types:
- Distribute : A distribute volume is one in which all the data of the volume is distributed among the bricks. Based on an algorithm that hashes the file name (and takes into account the space available in each brick), each file is stored in any one of the available bricks. Distribute is the default volume type, which is why "test-vol" is a distribute volume, and why "file1" was created in node2.
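To see the distribution for ourselves, we can create a handful of files on the mounted volume and look at both bricks. A quick sketch reusing the test-vol mount from above (which brick each file lands on depends on the hash of its name, so your spread will differ):
# touch /mnt/test-vol-mnt/file{1..10}
# ls /home/asengupt/node1/
# ls /home/asengupt/node2/
Each file shows up in exactly one of the two bricks, with the set roughly split between them.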
- Replicate : In a replicate volume, the data is replicated (duplicated) across the bricks. The number of bricks must be a multiple of the replica count. So when "file1" is created in a replicate volume having two bricks, it is stored in brick1 and then replicated to brick2, leaving the file present in both the bricks. Let's create one and see for ourselves.
# gluster volume create test-vol replica 2 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
volume create: test-vol: success: please start the volume to access data
# gluster volume start test-vol
volume start: test-vol: success
# gluster volume info
Volume Name: test-vol
Type: Replicate
Volume ID: bfb685e9-d30d-484c-beaf-e5fd3b6e66c7
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# touch /mnt/test-vol-mnt/file1
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 10 14:58 file1
# ls -lrt /home/asengupt/node1/
total 0
-rw-r--r--. 2 root root 0 Jan 10 14:58 file1
# ls -lrt /home/asengupt/node2/
total 0
-rw-r--r--. 2 root root 0 Jan 10 14:58 file1
- Stripe : A stripe volume is one in which the data being stored in the backend is striped into units of a particular size among the bricks. The default unit size is 128KB, but it's configurable. If we create a striped volume with a stripe count of 3, and then create a 300KB file at the mount point, the first 128KB is stored in the first sub-volume (a brick in our case), the next 128KB in the second, and the remaining 44KB in the third. The number of bricks must be a multiple of the stripe count.
# gluster volume create test-vol stripe 3 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2 Gotham:/home/asengupt/node3
volume create: test-vol: success: please start the volume to access data
# gluster volume start test-vol
volume start: test-vol: success
# gluster volume info
Volume Name: test-vol
Type: Stripe
Volume ID: c5aa1590-2f6e-464d-a783-cd9bc222db30
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
Brick3: Gotham:/home/asengupt/node3
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
# cp /home/asengupt/300KB_File .
# ls -lrt
total 308
-rwxr-xr-x. 1 root root 307200 Jan 11 12:46 300KB_File
# ls -lrt /home/asengupt/node1/
total 132
-rwxr-xr-x. 2 root root 131072 Jan 11 12:46 300KB_File
# ls -lrt /home/asengupt/node2/
total 132
-rwxr-xr-x. 2 root root 262144 Jan 11 12:46 300KB_File
# ls -lrt /home/asengupt/node3/
total 48
-rwxr-xr-x. 2 root root 307200 Jan 11 12:46 300KB_File
Notice that node2 and node3 report file sizes (262144 and 307200 bytes) much larger than the stripe units they actually hold. That's because of holes: the filesystem pretends that at a particular place in the file there are just zero bytes, but no actual disk sectors are reserved for that place in the file. To prove this, let's check the disk usage.
# cd /home/asengupt
# du | grep node.$
136 ./node2
136 ./node1
52 ./node3
0 ./node4
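Another way to confirm the holes is to compare each brick file's apparent size with the blocks actually allocated for it. A sketch using GNU stat (%s is the apparent size in bytes, %b the number of allocated blocks):
# stat -c '%n : size=%s bytes, blocks=%b' /home/asengupt/node*/300KB_File
On node3, for example, the size reads as the full 307200 bytes, while the allocated blocks only cover the final ~44KB chunk that actually lives there.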
Apart from these three vanilla types of volume, we can also create a volume which is a mix of these types. We will go through these and the respective volume files in the next post.
Comments:
Can we create and start multiple volumes at same server?
Yes Shyam, we can create multiple volumes on the same server. All the volumes should have unique names and brick paths, as they will be part of the same namespace and the same peer cluster.
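For example, something like this should work (a sketch with hypothetical volume names and brick paths):
# gluster volume create vol-a Gotham:/home/asengupt/brick-a
# gluster volume create vol-b Gotham:/home/asengupt/brick-b
# gluster volume start vol-a
# gluster volume start vol-b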
Dumb question: What if you have a replica number of 2 with 6 nodes? Does that mean 3 of the servers will be identical to the other 3?
Hi!
Great instructions. I'm wondering if you have been able to successfully create a volume from a brick (or bricks) that are CIFS-mounted ZFS datasets? For instance, say you have ZFS-server:/dataset/subset mounted to /localhost/mnt, and you then want to create a volume using "/localhost/mnt" as the brick.
Have you been able to successfully create a volume over a mounted directory?
Thanks!
Thanks Eric! I have never tried CIFS-mounted ZFS datasets, but we have tried other mounts, like LVMs and AWS instances, as bricks, and let me assure you it works pretty seamlessly. In fact, the requirement for a gluster volume to support snapshots is that the underlying brick should be an LVM mount.
The replica count of any volume has nothing to do with nodes, and everything to do with bricks. We recommend one brick per node, but that's not a hard requirement. So replica 2 with 6 bricks means that the bricks will form replica groups, where each group has 2 bricks (the replica count). Since you have 6 bricks, you will have 3 such groups. So if you now look at that volume, it's a distribute-replicate volume: there are 3 distribute sub-volumes, amongst which the data is divided, and each distribute sub-volume is actually a replica group consisting of 2 bricks, amongst which the data is replicated.
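In command form, that layout would look something like this (hypothetical server names; bricks are grouped into replica pairs in the order they are listed):
# gluster volume create big-vol replica 2 server1:/brick server2:/brick server3:/brick server4:/brick server5:/brick server6:/brick
Here server1 and server2 form the first replica pair, server3 and server4 the second, server5 and server6 the third, and files are distributed across the three pairs.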
@Avra Sengupta
That explanation of volume replicas should be at the top of the FAQ for GlusterFS! It made perfect sense and was straightforward.
However, I am still having no success creating the volume. I have shifted my test parameters from using a direct CIFS mount as the brick to a directory under the CIFS mount, which also ended in an error.
I have the ZFS dataset "dump" mounted over CIFS to /zfs/dump, and want to use the directory "gluster-test" within the mounted dataset.
This is the gluster version I am running:
root@gfs1# gluster --version
glusterfs 3.5.3 built on Nov 14 2014 11:23:37
This is the command I used:
gluster volume create gdump replica 2 transport tcp gfs1:/zfs/dump/gluster-test gfs2:/zfs/dump/gluster-test
This is the volume create error:
volume create: gdump: failed: Glusterfs is not supported on brick: gfs2:/zfs/dump/gluster-test.
And this was the reason for the error:
Setting extended attributes failed, reason: Operation not supported.
Below is the console output:
root@gfs1# gluster volume create gdump replica 2 transport tcp gfs1:/zfs/dump/gluster-test gfs2:/zfs/dump/gluster-test
volume create: gdump: failed: Glusterfs is not supported on brick: gfs2:/zfs/dump/gluster-test.
Setting extended attributes failed, reason: Operation not supported.
Under the ZFS dataset properties "xattr" is enabled. Below is the xattr property for the dataset.
dump xattr on default
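One way to narrow this down is to test extended-attribute support directly on the brick path, since setting xattrs is exactly the step that fails. A sketch using the standard attr tools (note that GlusterFS uses the trusted.* xattr namespace, which requires root and genuine filesystem support; CIFS mounts often don't pass these through even when the backing ZFS dataset has xattr=on):
# setfattr -n trusted.glusterfs.test -v works /zfs/dump/gluster-test
# getfattr -n trusted.glusterfs.test /zfs/dump/gluster-test
If the setfattr itself fails with "Operation not supported", the problem is the CIFS mount rather than GlusterFS.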
Hi Avra,
Nice articles. I am trying to understand the source code for GlusterFS. Unfortunately, there are not enough comments in the source code to understand it, and I couldn't find any online documentation for the code. Do you know of any documents or online posts which give an overview of the source code, at least?
Sorry if this comment is not entirely related to your post.
Thanks
Toms
Thanks a lot! You saved me a ton of time researching this topic.
Hey Tom,
I agree that the source code lacks many of the comments it ideally should have, but we are working towards improving it with every patch we merge. It will take a bit of time, but we will get there.
Regarding source code overview, you can have a look at the developer-guide, which is present in ./doc/developer-guide in the source code. Other gluster documentation can be found at http://gluster.readthedocs.org/en/latest/
Regards,
Avra