Thursday 31 January 2013

Volume Files and A Sneak Peek At Translators

In my last post, we went through the three vanilla types of volumes: Distribute, Replicate and Stripe, and now we have a basic understanding of what each of them does. I also mentioned that we would be creating volumes which are a mix of these three types. But before we do so, let's have a look at volume files.

Whenever a volume is created, the corresponding volume files are also created. By default, volume files are located in a directory (bearing the same name as the volume) inside /var/lib/glusterd/vols/. Let's have a look at the volume files for the distribute volume (test-vol) we created last time.
# gluster volume info

Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2

# cd /var/lib/glusterd/vols/
# ls -lrt
total 4
drwxr-xr-x. 4 root root 4096 Jan 31 14:20 test-vol
# cd test-vol/
# ls -lrt
total 32
-rw-------. 1 root root 1406 Jan 31 14:19 test-vol.Gotham.home-asengupt-node1.vol
-rw-------. 1 root root 1406 Jan 31 14:19 test-vol.Gotham.home-asengupt-node2.vol
-rw-------. 1 root root 1349 Jan 31 14:19 trusted-test-vol-fuse.vol
-rw-------. 1 root root 1121 Jan 31 14:20 test-vol-fuse.vol
drwxr-xr-x. 2 root root   80 Jan 31 14:20 run
-rw-------. 1 root root  305 Jan 31 14:20 info
drwxr-xr-x. 2 root root   74 Jan 31 14:20 bricks
-rw-------. 1 root root   12 Jan 31 14:20 rbstate
-rw-------. 1 root root   34 Jan 31 14:20 node_state.info
-rw-------. 1 root root   16 Jan 31 14:20 cksum
#
As I have the bricks on the same machine as the mount, we are seeing all the volume files here. test-vol.Gotham.home-asengupt-node1.vol and test-vol.Gotham.home-asengupt-node2.vol are the volume files for Brick1 and Brick2 respectively. The volume file for the test-vol volume itself is trusted-test-vol-fuse.vol. Let's have a look inside:
# cat trusted-test-vol-fuse.vol
volume test-vol-client-0
    type protocol/client
    option password 010f5d80-9d99-4b7c-a39e-1f964764213e
    option username 6969e53a-438a-4b92-a113-de5e5b7b5464
    option transport-type tcp
    option remote-subvolume /home/asengupt/node1
    option remote-host Gotham
end-volume

volume test-vol-client-1
    type protocol/client
    option password 010f5d80-9d99-4b7c-a39e-1f964764213e
    option username 6969e53a-438a-4b92-a113-de5e5b7b5464
    option transport-type tcp
    option remote-subvolume /home/asengupt/node2
    option remote-host Gotham
end-volume

volume test-vol-dht
    type cluster/distribute
    subvolumes test-vol-client-0 test-vol-client-1
end-volume

volume test-vol-write-behind
    type performance/write-behind
    subvolumes test-vol-dht
end-volume

volume test-vol-read-ahead
    type performance/read-ahead
    subvolumes test-vol-write-behind
end-volume

volume test-vol-io-cache
    type performance/io-cache
    subvolumes test-vol-read-ahead
end-volume

volume test-vol-quick-read
    type performance/quick-read
    subvolumes test-vol-io-cache
end-volume

volume test-vol-md-cache
    type performance/md-cache
    subvolumes test-vol-quick-read
end-volume

volume test-vol
    type debug/io-stats
    option count-fop-hits off
    option latency-measurement off
    subvolumes test-vol-md-cache
end-volume
#
This is what a volume file looks like. It is essentially an inverted graph of the path that data from the mount point is supposed to follow. Savvy? No? OK, let's have a closer look at it. It is made up of a number of sections, each of which begins with "volume test-vol-xxxxx" and ends with "end-volume". Each section stores the information (type, options, subvolumes, etc.) for the corresponding translator (we will come back to what translators are in a minute).

For example, let's say a read fop (file operation) is attempted at the mount point. The request information (type of fop: read, the name of the file the user tried to read, etc.) is passed on from one translator to another, starting from the io-stats translator at the bottom of the file (which is actually the entry point of the graph) down to one of the client translators at the top. Similarly, the response is transferred back from the client translator all the way to io-stats, and finally to the user.
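In fact, this file is exactly what the native client loads when we mount the volume. As a hedged aside (the exact option name can differ between releases, so check glusterfs --help on your build), the FUSE client can even be handed a volume file directly, bypassing glusterd:
# glusterfs --volfile=/var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol /mnt/test-vol-mnt
which is roughly what mount -t glusterfs arranges for us behind the scenes, after fetching the volume file from glusterd.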

So what are these translators? A translator is a module with a very specific purpose: it receives data, performs the necessary operations, and passes the data on to the next translator. That's about it, in a nutshell. For example, let's look at the dht translator in the above volume file.
volume test-vol-dht
    type cluster/distribute
    subvolumes test-vol-client-0 test-vol-client-1
end-volume
"test-vol" is a distribute type volume, and hence has a dht translator. DHT a cluster translator, as is visible in it's "type". We know that in a distribute volume, a hashing algorithm, decides in which of the "subvolumes" is the data actually present. DHT translator is the one who does that for us.

The dht translator receives our read fop, along with the filename. Based on the filename, the hashing algorithm finds the correct subvolume (one of "test-vol-client-0" and "test-vol-client-1"), and passes the read fop on to that translator. Every other translator in the graph works the same way: receive the data, perform its part of the processing, and pass the data on to the next translator. As is quite visible here, the concept of translators provides us with a lot of modularity.
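Incidentally, once the volume is mounted (as we do below), there is a way to ask the mount point itself where DHT placed a file, instead of peeking into the bricks: the trusted.glusterfs.pathinfo virtual extended attribute. A hedged sketch, assuming the getfattr utility is installed and that your GlusterFS release exposes this attribute:
# getfattr -n trusted.glusterfs.pathinfo /mnt/test-vol-mnt/file1
The value it prints names the brick (or bricks) where the file physically resides.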

The volume files are created by the volume-create command; based on the type of the volume and its options, a graph is built with the appropriate translators. But we can also edit an existing volume file (add, remove, or modify a translator or two), and the volume will change its behaviour accordingly. Let's try that. Currently "test-vol" is a distribute volume, so any file that is created will be present in exactly one of the bricks.
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
# touch file1
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 31 15:57 file1
# ls -lrt /home/asengupt/node1/
total 0
# ls -lrt /home/asengupt/node2/
total 0
-rw-r--r--. 2 root root 0 Jan 31 15:57 file1
#
The dht translator created the file in node2 only. Let's edit the volume file for test-vol (/var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol). But before that we need to stop the volume and unmount it.
# gluster volume stop test-vol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: test-vol: success
# umount /mnt/test-vol-mnt 
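Since we are about to hand-edit a file that glusterd generated, it does no harm to keep a copy of the original around first (just a precaution, not something GlusterFS requires):
# cp /var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol /var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol.orig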
Then let's edit the volume file, and replace the dht-translator with a replicate translator.
# vi /var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol 
**********Necessary Changes in the volume file************
volume test-vol-afr
    type cluster/replicate
    subvolumes test-vol-client-0 test-vol-client-1
end-volume

volume test-vol-write-behind
    type performance/write-behind
    subvolumes test-vol-afr
end-volume
**********************************************************
Now start the volume, and mount it again.
# gluster volume start test-vol
volume start: test-vol: success
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
Let's create a file at this mount-point and check the behaviour of "test-vol".
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 31 15:57 file1
# touch file2
# ls -lrt /home/asengupt/node1/
total 4
-rw-r--r--. 2 root root 0 Jan 31 16:06 file1
-rw-r--r--. 2 root root 0 Jan 31 16:06 file2
# ls -lrt /home/asengupt/node2/
total 4
-rw-r--r--. 2 root root 0 Jan 31 16:06 file1
-rw-r--r--. 2 root root 0 Jan 31 16:06 file2
#
Now we have the same set of files in both the bricks, which is the behaviour of a replicate volume. We also observe that not only did it create the new file (file2) in both bricks, but when we restarted the volume after changing the volume file, it also created a copy of the existing file (file1) in both bricks. Let's check the volume info.
# gluster volume info 
  
Volume Name: test-vol 
Type: Distribute 
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65 
Status: Started 
Number of Bricks: 2 
Transport-type: tcp 
Bricks: 
Brick1: Gotham:/home/asengupt/node1 
Brick2: Gotham:/home/asengupt/node2
It should be noted that, though the changes we made in the volume file change the behaviour of the volume, the volume info still reflects the original details of the volume.
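Also worth noting: glusterd knows nothing about our hand edits, and any operation that makes it regenerate the volume files can silently undo them. The supported way to tweak a volume's translator options is the gluster volume set command, which rewrites the volume files for us. For example, something like the following should turn the write-behind translator off for test-vol (option names can vary between releases, so treat this as a sketch):
# gluster volume set test-vol performance.write-behind off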
Now, as promised, we will create a mix of the vanilla volume types, i.e. a distributed-replicate volume.
# gluster volume create mix-vol replica 2 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2 Gotham:/home/asengupt/node3 Gotham:/home/asengupt/node4
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume?  (y/n) y
volume create: mix-vol: success: please start the volume to access data
# gluster volume start mix-vol;
volume start: mix-vol: success
# gluster volume info

Volume Name: mix-vol
Type: Distributed-Replicate
Volume ID: 2fc6f11e-254e-444a-8179-43da62cc56e9
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
Brick3: Gotham:/home/asengupt/node3
Brick4: Gotham:/home/asengupt/node4
#
As it is a distributed-replicate volume, the distribute translator should have two subvolumes, each of which is in turn a replicate translator with two bricks as its own subvolumes. A quick look at the volume file (/var/lib/glusterd/vols/mix-vol/trusted-mix-vol-fuse.vol) will give us more clarity. This is what the cluster translators in the volume file look like:
volume mix-vol-replicate-0
    type cluster/replicate
    subvolumes mix-vol-client-0 mix-vol-client-1
end-volume

volume mix-vol-replicate-1
    type cluster/replicate
    subvolumes mix-vol-client-2 mix-vol-client-3
end-volume

volume mix-vol-dht
    type cluster/distribute
    subvolumes mix-vol-replicate-0 mix-vol-replicate-1
end-volume
As we can see, the dht translator has two replicate translators as its subvolumes: "mix-vol-replicate-0" and "mix-vol-replicate-1". So every file created at the mount point will be sent by the dht translator to one of the two replicate subvolumes. Each replicate subvolume in turn has two bricks as its own subvolumes. Once the write fop reaches the appropriate replicate subvolume, the replicate translator creates a copy of the file in each of the bricks listed as its subvolumes. Let's check this behaviour:
# mount -t glusterfs Gotham:/mix-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# touch file1
# ls -lrt /home/asengupt/node1/
total 0
# ls -lrt /home/asengupt/node2/
total 0
# ls -lrt /home/asengupt/node3/
total 0
-rw-r--r--. 2 root root 0 Jan 31 16:31 file1
# ls -lrt /home/asengupt/node4/
total 0
-rw-r--r--. 2 root root 0 Jan 31 16:31 file1
#
Similarly, a distributed-stripe or a replicated-stripe volume can also be created. Jeff Darcy's blog has an awesome set of articles on translators. It's a great read.
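For the distributed-stripe case, the create command has the same shape as the distributed-replicate one above, just with the stripe keyword instead of replica. A hedged sketch, where the volume name and the brick directories node5 to node8 are purely hypothetical:
# gluster volume create str-vol stripe 2 Gotham:/home/asengupt/node5 Gotham:/home/asengupt/node6 Gotham:/home/asengupt/node7 Gotham:/home/asengupt/node8
With four bricks and a stripe count of 2, gluster should report it as a Distributed-Stripe volume with 2 x 2 = 4 bricks.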
EDIT : The links to Jeff Darcy's blog are broken, but the same information is provided in glusterfs/doc/developer-guide/translator-development.md in the source code. Thanks to Jo for pointing out the same.

2 comments:

Saravanakumar said...

Thanks !
The links mentioned above are broken, but the same lessons are included in the glusterfs source at this location: glusterfs/doc/hacker-guide/en-US/markdown/translator-development.md

Please consider updating this info.

Avra Sengupta said...

Updated the same. Thanks Jo
