Friday 11 January 2013

Volumes

I think it's safe to say that we now have a fair idea of what GlusterFS is, and that we are pretty comfortable installing it and creating a volume.
Let's create a volume with two local directories as two bricks.
# gluster volume create test-vol Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2
volume create: test-vol: success: please start the volume to access data
# gluster volume start test-vol
volume start: test-vol: success
Let's mount this volume, and create a file in that volume.
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# touch /mnt/test-vol-mnt/file1
# cd /mnt/test-vol-mnt/
# ls -lrt
total 1
-rw-r--r--. 1 root root   0 Jan 10 14:40 file1
Now, where does this file really get created in the backend? Let's have a look at the two directories we used as bricks (subvolumes):
# cd /home/asengupt/node1
# ls -lrt
total 0
# cd ../node2/
# ls -lrt
total 1
-rw-r--r--. 1 root root   0 Jan 10 14:40 file1
So the file we created at the mount point (/mnt/test-vol-mnt) got created in one of the bricks. But why in this particular brick, and not the other one? The answer to that question lies in the volume information.
# gluster volume info
 
Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
It gives us a lot of info. While creating a volume we have the liberty of providing a number of options, like the transport type and the volume type, which eventually decide the behaviour of the volume. But at this moment what interests us most is the type. It says that our volume "test-vol" is a distributed volume. What does that mean?

The type of a volume decides how exactly the volume stores its data in the bricks. A volume can be of the following types:
  • Distribute : In a distribute volume, the files are spread across the bricks, and each file is stored whole on exactly one brick. The brick is chosen by hashing the file name: every brick is responsible for a range of hash values, and the file goes to the brick whose range contains its hash. Free space plays only a secondary role (options such as cluster.min-free-disk can steer new files away from a nearly full brick). As our "test-vol" volume is a distributed volume, "file1" hashed to node2 and was created there. Distribute is the default volume type, which is why test-vol is a distribute volume.
  • Replicate : In a replicate volume, the data is replicated (duplicated) across the bricks, according to the replica count. The number of bricks must be a multiple of the replica count. So when "file1" is created in a replicate volume with two bricks and a replica count of 2, it is stored in both brick1 and brick2, and the file is present in both bricks. Let's create one and see for ourselves.
    # gluster volume create test-vol replica 2 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2
    Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
    Do you still want to continue creating the volume?  (y/n) y
    volume create: test-vol: success: please start the volume to access data
    # gluster volume start test-vol
    volume start: test-vol: success
    # gluster volume info
     
    Volume Name: test-vol
    Type: Replicate
    Volume ID: bfb685e9-d30d-484c-beaf-e5fd3b6e66c7
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: Gotham:/home/asengupt/node1
    Brick2: Gotham:/home/asengupt/node2
    # mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
    # touch /mnt/test-vol-mnt/file1
    # cd /mnt/test-vol-mnt/
    # ls -lrt
    total 0
    -rw-r--r--. 1 root root 0 Jan 10 14:58 file1
    # ls -lrt /home/asengupt/node1/
    total 0
    -rw-r--r--. 2 root root 0 Jan 10 14:58 file1
    # ls -lrt /home/asengupt/node2/
    total 0
    -rw-r--r--. 2 root root 0 Jan 10 14:58 file1
  • Stripe : In a stripe volume, the data being stored in the backend is striped across the bricks in units of a particular size. The default unit size is 128KB, but it's configurable. If we create a striped volume with a stripe count of 3, and then create a 300KB file at the mount point, the first 128KB will be stored in the first sub-volume (a brick in our case), the next 128KB in the second, and the remaining 44KB in the third. The number of bricks should be a multiple of the stripe count.
    # gluster volume create test-vol stripe 3 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2 Gotham:/home/asengupt/node3
    volume create: test-vol: success: please start the volume to access data
    # gluster volume start test-vol
    volume start: test-vol: success
    # gluster volume info
     
    Volume Name: test-vol
    Type: Stripe
    Volume ID: c5aa1590-2f6e-464d-a783-cd9bc222db30
    Status: Started
    Number of Bricks: 1 x 3 = 3
    Transport-type: tcp
    Bricks:
    Brick1: Gotham:/home/asengupt/node1
    Brick2: Gotham:/home/asengupt/node2
    Brick3: Gotham:/home/asengupt/node3
    # mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
    # cd /mnt/test-vol-mnt/
    # ls -lrt
    total 0
    # cp /home/asengupt/300KB_File .
    # ls -lrt
    total 308
    -rwxr-xr-x. 1 root root 307200 Jan 11 12:46 300KB_File
    # ls -lrt /home/asengupt/node1/
    total 132
    -rwxr-xr-x. 2 root root 131072 Jan 11 12:46 300KB_File
    # ls -lrt /home/asengupt/node2/
    total 132
    -rwxr-xr-x. 2 root root 262144 Jan 11 12:46 300KB_File
    # ls -lrt /home/asengupt/node3/
    total 48
    -rwxr-xr-x. 2 root root 307200 Jan 11 12:46 300KB_File
    Why do we see that the first sub-volume indeed has 128KB of data, but the files in the second and third sub-volumes are reported as 256KB and 300KB respectively?
    That's because of holes: the files on the bricks are sparse. The filesystem just pretends that at a particular place in the file there are only zero bytes, but no actual disk sectors are reserved for that place in the file. To prove this, let's check the disk usage.
    # cd /home/asengupt
    # du | grep node.$
    136    ./node2
    136    ./node1
    52    ./node3
    0    ./node4
    Here we observe that node1 and node2 indeed hold 128KB of data each, while node3 holds 44KB. The additional 8KB present in each of these directories is taken up by glusterfs system files.
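
The numbers above can be reproduced with a little arithmetic. The following is only a sketch, assuming that stripe units are handed out to the bricks round-robin and that each unit is written at its original file offset on its brick (which is what the listings above suggest). The function name stripe_layout is made up for this illustration and is not part of GlusterFS.

```python
def stripe_layout(file_size, stripe_count, block_size=128 * 1024):
    """Return one (data_bytes, apparent_size) pair per brick.

    Assumptions (not taken from the GlusterFS source): stripe units go to
    the bricks round-robin, and every unit keeps its original file offset
    on the brick, so the space before it is a hole.
    """
    data = [0] * stripe_count       # bytes actually written to each brick
    apparent = [0] * stripe_count   # file size that ls reports on each brick
    offset = 0
    while offset < file_size:
        chunk = min(block_size, file_size - offset)
        brick = (offset // block_size) % stripe_count
        data[brick] += chunk
        apparent[brick] = offset + chunk  # the file ends where its last unit ends
        offset += chunk
    return list(zip(data, apparent))


if __name__ == "__main__":
    # The 300KB file on the 3-brick stripe volume from the example above.
    layout = stripe_layout(300 * 1024, 3)
    for number, (data_bytes, size) in enumerate(layout, start=1):
        print(f"brick{number}: {data_bytes // 1024}KB of data, apparent size {size} bytes")
```

For the 300KB file this gives 128KB, 128KB and 44KB of real data, with apparent sizes of 131072, 262144 and 307200 bytes, matching the ls and du output above.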

Apart from these three vanilla types of volume, we can also create a volume which is a mix of these types. We will go through these and the respective volume files in the next post.
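
Of the three vanilla types, distribute is the one whose placement rule is hardest to see from directory listings alone, so here is a rough sketch of the idea: hash the file name, and give each brick an equal slice of the hash space. This is an illustration only. zlib.crc32 is a stand-in for the hash that GlusterFS really uses, the even split is an assumption, and pick_brick is a made-up name, so the brick chosen here will not necessarily match what a real volume does.

```python
import zlib


def pick_brick(filename, bricks):
    """Pick one brick for a file name by splitting a 32-bit hash space evenly.

    zlib.crc32 is only a stand-in hash for this sketch; it is NOT the hash
    function GlusterFS uses, and real hash ranges need not be equal.
    """
    file_hash = zlib.crc32(filename.encode("utf-8"))  # value in [0, 2**32)
    range_size = 2**32 // len(bricks)
    # Clamp so that hashes in the rounding remainder fall into the last brick.
    index = min(file_hash // range_size, len(bricks) - 1)
    return bricks[index]


if __name__ == "__main__":
    bricks = ["Gotham:/home/asengupt/node1", "Gotham:/home/asengupt/node2"]
    for name in ("file1", "file2", "file3"):
        print(name, "->", pick_brick(name, bricks))
```

The point to take away is that the same name always lands on the same brick, and that different names are spread over the bricks without any central lookup table.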

10 comments:

Unknown said...

Can we create and start multiple volumes at same server?

Avra Sengupta said...

Yes Shyam, we can create multiple volumes on the same server. All the volumes should have unique names and brick paths, as they will be part of the same namespace and the same peer cluster.

Eric said...

Dumb question: What if you have a replica number of 2 with 6 nodes? Does that mean 3 of the servers will be identical to the other 3?

Eric said...

Hi!

Great instructions. I'm wondering if you have been able to successfully create a volume from bricks that are CIFS-mounted ZFS datasets? For instance, say you have ZFS-server:/dataset/subset mounted to /localhost/mnt, you then want to create a volume using "/localhost/mnt" as the brick.

Have you been able to successfully create a volume over a mounted directory?

Thanks!

Avra Sengupta said...

Thanks Eric! I have never tried CIFS-mounted ZFS datasets, but we have tried other mounts, like LVM volumes and AWS instances, as bricks, and lemme assure you it works pretty seamlessly. In fact, the requirement for a gluster volume to support snapshots is that the underlying brick should be an LVM mount.

The replica count of any volume has nothing to do with nodes, but everything to do with bricks. We recommend one brick per node, but that's not a hard requirement. So a replica 2 with 6 "bricks" means that the bricks will form replica groups, where each group has 2 bricks (the replica count). Since you have 6 bricks, you will have 3 such groups. So now if you look at that volume, it's a distribute-replicate volume, where there are 3 distribute sub-volumes, amongst which data will be divided, and each distribute sub-volume is actually a replica group consisting of 2 bricks, amongst which the data will be replicated.
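
To make the grouping concrete, here is a small sketch in Python. It assumes that consecutive bricks, in the order they are given on the command line, form a replica group. The server names and brick paths are hypothetical, and replica_groups is just a name made up for this illustration.

```python
def replica_groups(bricks, replica_count):
    """Split the brick list into consecutive groups of replica_count bricks.

    Assumption for this sketch: bricks are grouped in the order in which
    they are listed when the volume is created.
    """
    if len(bricks) % replica_count != 0:
        raise ValueError("number of bricks must be a multiple of the replica count")
    return [bricks[i:i + replica_count] for i in range(0, len(bricks), replica_count)]


if __name__ == "__main__":
    # Hypothetical bricks: six servers with one brick each.
    bricks = [f"server{n}:/bricks/b1" for n in range(1, 7)]
    for number, group in enumerate(replica_groups(bricks, 2), start=1):
        print(f"replica group {number}: {group}")
```

With 6 bricks and a replica count of 2 this prints 3 groups of 2 bricks. Files are distributed amongst the 3 groups, and every file is replicated on both bricks of its group.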

Eric said...

@Avra Sengupta

That explanation of volume replica should be at the top of the FAQ for GlusterFS! That made perfect sense and was straightforward.

However, I am still having no success with creating the volume. I have shifted my test parameters from using a direct CIFS mount as the brick to a directory under the CIFS mount, which ended in an error.

I have the ZFS dataset "dump" mounted over CIFS to /zfs/dump, and want to use the directory "gluster-test" within the mounted dataset.


This is the gluster version I am running:
root@gfs1# gluster --version
glusterfs 3.5.3 built on Nov 14 2014 11:23:37


This is the command I used:
gluster volume create gdump replica 2 transport tcp gfs1:/zfs/dump/gluster-test gfs2:/zfs/dump/gluster-test


This is the volume create error:
volume create: gdump: failed: Glusterfs is not supported on brick: gfs2:/zfs/dump/gluster-test.


And this was the reason for the error:
Setting extended attributes failed, reason: Operation not supported.


Below is the console output:
root@gfs1# gluster volume create gdump replica 2 transport tcp gfs1:/zfs/dump/gluster-test gfs2:/zfs/dump/gluster-test
volume create: gdump: failed: Glusterfs is not supported on brick: gfs2:/zfs/dump/gluster-test.
Setting extended attributes failed, reason: Operation not supported.


Under the ZFS dataset properties "xattr" is enabled. Below is the xattr property for the dataset.

dump xattr on default


Edit: Typos

Unknown said...

Hi Avra,

Nice articles. I am trying to understand the source code for GlusterFS. Unfortunately, there are not enough comments in the source code to understand it, and I couldn't find any online documentation for the code. Do you know of any such documents or online posts which give at least an overview of the source code?
Sorry if this comment is not entirely related to your post.

Thanks
Toms

Unknown said...

Thanks a lot .. You saved a ton of time to research on this topic..

Avra Sengupta said...

Hey Tom,

I agree that the source code doesn't have as many comments as it ideally should, but we are working towards improving it with every patch we merge. It will take a bit of time, but we will get there.

Regarding source code overview, you can have a look at the developer-guide, which is present in ./doc/developer-guide in the source code. Other gluster documentation can be found at http://gluster.readthedocs.org/en/latest/

Regards,
Avra
