In my last post, we went through the
three vanilla types of volumes: Distribute, Replicate, and Stripe,
and now we have a basic understanding of what each of them does. I
also mentioned that we would be creating volumes that are a mix of these
three types. But before we do so, let's have a look at volume files.
Whenever a volume is created,
the corresponding volume files are also created. These volume files
live in a directory (bearing the same name as the volume)
inside /var/lib/glusterd/vols/. Let's have a look at the volume
files for the distribute volume (test-vol) we created last time.
# gluster volume info
Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
# cd /var/lib/glusterd/vols/
# ls -lrt
total 4
drwxr-xr-x. 4 root root 4096 Jan 31 14:20 test-vol
# cd test-vol/
# ls -lrt
total 32
-rw-------. 1 root root 1406 Jan 31 14:19 test-vol.Gotham.home-asengupt-node1.vol
-rw-------. 1 root root 1406 Jan 31 14:19 test-vol.Gotham.home-asengupt-node2.vol
-rw-------. 1 root root 1349 Jan 31 14:19 trusted-test-vol-fuse.vol
-rw-------. 1 root root 1121 Jan 31 14:20 test-vol-fuse.vol
drwxr-xr-x. 2 root root 80 Jan 31 14:20 run
-rw-------. 1 root root 305 Jan 31 14:20 info
drwxr-xr-x. 2 root root 74 Jan 31 14:20 bricks
-rw-------. 1 root root 12 Jan 31 14:20 rbstate
-rw-------. 1 root root 34 Jan 31 14:20 node_state.info
-rw-------. 1 root root 16 Jan 31 14:20 cksum
#
As the bricks are on the same
machine as the mount, we see all the volume files here.
test-vol.Gotham.home-asengupt-node1.vol and
test-vol.Gotham.home-asengupt-node2.vol are the volume files for
Brick1 and Brick2 respectively. The volume file for the test-vol volume
itself is trusted-test-vol-fuse.vol. Let's have a look inside:
# cat trusted-test-vol-fuse.vol
volume test-vol-client-0
type protocol/client
option password 010f5d80-9d99-4b7c-a39e-1f964764213e
option username 6969e53a-438a-4b92-a113-de5e5b7b5464
option transport-type tcp
option remote-subvolume /home/asengupt/node1
option remote-host Gotham
end-volume
volume test-vol-client-1
type protocol/client
option password 010f5d80-9d99-4b7c-a39e-1f964764213e
option username 6969e53a-438a-4b92-a113-de5e5b7b5464
option transport-type tcp
option remote-subvolume /home/asengupt/node2
option remote-host Gotham
end-volume
volume test-vol-dht
type cluster/distribute
subvolumes test-vol-client-0 test-vol-client-1
end-volume
volume test-vol-write-behind
type performance/write-behind
subvolumes test-vol-dht
end-volume
volume test-vol-read-ahead
type performance/read-ahead
subvolumes test-vol-write-behind
end-volume
volume test-vol-io-cache
type performance/io-cache
subvolumes test-vol-read-ahead
end-volume
volume test-vol-quick-read
type performance/quick-read
subvolumes test-vol-io-cache
end-volume
volume test-vol-md-cache
type performance/md-cache
subvolumes test-vol-quick-read
end-volume
volume test-vol
type debug/io-stats
option count-fop-hits off
option latency-measurement off
subvolumes test-vol-md-cache
end-volume
#
This is what a volume file looks like.
It is actually an inverted graph of the path that data from the
mount point is supposed to follow. Savvy? No? Okay, let's have a closer
look at it. It is made up of a number of sections, each of which
begins with "volume test-vol-xxxxx" and ends with
"end-volume". Each section stores the information (type,
options, subvolumes, etc.) for the respective translator (we will come
back to what translators are in a minute).
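Every section follows the same simple grammar. Sketched generically (the angle-bracket names below are placeholders, not part of any real volume file), a section looks like this:
volume <section-name>
type <category>/<translator-name>
option <key> <value>
subvolumes <child-section-1> <child-section-2>
end-volume
The option lines are optional, and the sections closest to the bricks (the protocol/client ones above) have no subvolumes line at all; instead, they name the remote host and brick directory they talk to.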
For example, let's say a read fop (file operation) was attempted at the mount point. The request information (type of fop: read, filename: the file the user tried to read, etc.) is passed on from one translator to another, starting from the io-stats translator at the bottom of the file and ending at one of the client translators at the top. Similarly, the response is transferred back from the client translator all the way to io-stats, and finally to the user.
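Putting it together, the data path described by the volume file above looks roughly like this, reading from the mount point down to the bricks:
mount point (FUSE)
  test-vol (debug/io-stats)
  test-vol-md-cache
  test-vol-quick-read
  test-vol-io-cache
  test-vol-read-ahead
  test-vol-write-behind
  test-vol-dht (cluster/distribute)
    test-vol-client-0  ->  Brick1 (Gotham:/home/asengupt/node1)
    test-vol-client-1  ->  Brick2 (Gotham:/home/asengupt/node2)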
So what are these translators? A
translator is a module with a very specific purpose: to receive data,
perform the necessary operations on it, and pass the data
on to the next translator. That's about it, in a nutshell. For
example, let's look at the dht translator in the above volume file.
volume test-vol-dht
type cluster/distribute
subvolumes test-vol-client-0 test-vol-client-1
end-volume
"test-vol" is a distribute
type volume, and hence has a dht translator. DHT a cluster
translator, as is visible in it's "type". We know that in a
distribute volume, a hashing algorithm, decides in which of the
"subvolumes" is the data actually present. DHT translator
is the one who does that for us.
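On disk, DHT records the hash range assigned to each brick as an extended attribute on the brick directories. If getfattr is available on your system, you can peek at it; take this as a rough sketch, since the exact output varies with the GlusterFS version:
# getfattr -e hex -n trusted.glusterfs.dht /home/asengupt/node1
# getfattr -e hex -n trusted.glusterfs.dht /home/asengupt/node2
Each brick directory carries a trusted.glusterfs.dht value describing the slice of the hash space it owns; a filename hashes into exactly one of those slices, which is how DHT picks the subvolume.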
The dht translator will receive our read fop along with the filename. Based on the filename, the hashing algorithm finds the correct subvolume (one of "test-vol-client-0" and "test-vol-client-1"), and the read fop is passed on to it. Every other translator in the graph works the same way (receive the data, perform its part of the processing, and pass the data on to the next translator). As is quite visible here, the concept of translators gives us a lot of modularity.
The volume files are created by the volume-create command; based on the type of the volume and its options, a graph is built with the appropriate translators. But we can also edit an existing volume file (add, remove, or modify a couple of translators), and the volume will change its behaviour accordingly. Let's try that. Currently "test-vol" is a distribute volume, so any file that is created will be present in exactly one of the bricks.
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
# touch file1
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 31 15:57 file1
# ls -lrt /home/asengupt/node1/
total 0
# ls -lrt /home/asengupt/node2/
total 0
-rw-r--r--. 2 root root 0 Jan 31 15:57 file1
#
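As an aside, instead of listing every brick we can ask the mount itself where a file physically lives, using the trusted.glusterfs.pathinfo virtual extended attribute (this assumes getfattr is installed; the output format may differ between GlusterFS versions):
# getfattr -n trusted.glusterfs.pathinfo /mnt/test-vol-mnt/file1
This prints the backend brick path(s) that hold the file.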
The dht translator created the file in
node2 only. Let's edit the volume file for
test-vol (/var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol).
But before that we need to stop the volume and unmount it.
# gluster volume stop test-vol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: test-vol: success
# umount /mnt/test-vol-mnt
Then let's edit the volume file and
replace the dht translator with a replicate translator.
# vi /var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol
**********Necessary Changes in the volume file************
volume test-vol-afr
type cluster/replicate
subvolumes test-vol-client-0 test-vol-client-1
end-volume
volume test-vol-write-behind
type performance/write-behind
subvolumes test-vol-afr
end-volume
**********************************************************
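Note that only two things changed here: the cluster/distribute section was replaced by a cluster/replicate one (named test-vol-afr), and the subvolumes line of test-vol-write-behind was pointed at that new section. The client translators and the rest of the performance translators stay exactly as they were.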
Now start the volume, and mount it
again.
# gluster volume start test-vol
volume start: test-vol: success
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
Let's create a file at this mount-point
and check the behaviour of "test-vol".
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 31 15:57 file1
# touch file2
# ls -lrt /home/asengupt/node1/
total 4
-rw-r--r--. 2 root root 0 Jan 31 16:06 file1
-rw-r--r--. 2 root root 0 Jan 31 16:06 file2
# ls -lrt /home/asengupt/node2/
total 4
-rw-r--r--. 2 root root 0 Jan 31 16:06 file1
-rw-r--r--. 2 root root 0 Jan 31 16:06 file2
#
Now we have the same set of files in
all the bricks, which is the behaviour of a replicate volume. We also
observe that not only was the new file (file2) created in both
bricks, but when we restarted the volume after changing the volume
file, a copy of the existing file (file1) also appeared in both
bricks, bringing the two bricks in sync. Let's check the volume info.
# gluster volume info
Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
It should be noted that, although the
changes we made in the volume file alter the behaviour of the
volume, the volume info still reflects the original details
of the volume.
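A word of caution, based on how glusterd manages these files rather than on anything shown above: hand-edited volume files do not survive management operations. Setting a volume option, adding a brick, and similar commands regenerate the volume files from the stored volume info, so an edit like ours would be silently overwritten by something as simple as:
# gluster volume set test-vol performance.write-behind off
Editing volume files by hand is therefore great for experimenting and for understanding the graph, but not something to rely on in production.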
As promised, we will now create a mix of the
vanilla volume types, i.e. a distributed-replicate volume.
# gluster volume create mix-vol replica 2 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2 Gotham:/home/asengupt/node3 Gotham:/home/asengupt/node4
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
volume create: mix-vol: success: please start the volume to access data
# gluster volume start mix-vol;
volume start: mix-vol: success
# gluster volume info
Volume Name: mix-vol
Type: Distributed-Replicate
Volume ID: 2fc6f11e-254e-444a-8179-43da62cc56e9
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
Brick3: Gotham:/home/asengupt/node3
Brick4: Gotham:/home/asengupt/node4
#
As it is a distributed-replicate
volume, the distribute translator should have two sub-volumes, each
of which is in turn a replicate translator with two bricks as
its own sub-volumes. A quick look at the volume
file (/var/lib/glusterd/vols/mix-vol/trusted-mix-vol-fuse.vol) will
give us more clarity. This is what the cluster translators in the
volume file look like:
volume mix-vol-replicate-0
type cluster/replicate
subvolumes mix-vol-client-0 mix-vol-client-1
end-volume
volume mix-vol-replicate-1
type cluster/replicate
subvolumes mix-vol-client-2 mix-vol-client-3
end-volume
volume mix-vol-dht
type cluster/distribute
subvolumes mix-vol-replicate-0 mix-vol-replicate-1
end-volume
As we can see, the dht translator has
two replicate translators as its sub-volumes: "mix-vol-replicate-0"
and "mix-vol-replicate-1". So every file created at the
mount point is sent by the dht translator to one of the replicate
sub-volumes. Each replicate sub-volume has two bricks as
its own sub-volumes. Once the write fop reaches the appropriate
replicate sub-volume, the replicate translator creates a copy
in each of the bricks listed as its sub-volumes. Let's check this
behaviour:
# mount -t glusterfs Gotham:/mix-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# touch file1
# ls -lrt /home/asengupt/node1/
total 0
# ls -lrt /home/asengupt/node2/
total 0
# ls -lrt /home/asengupt/node3/
total 0
-rw-r--r--. 2 root root 0 Jan 31 16:31 file1
# ls -lrt /home/asengupt/node4/
total 0
-rw-r--r--. 2 root root 0 Jan 31 16:31 file1
#
Similarly, a distributed-stripe or a
replicated-stripe volume can also be created. Jeff Darcy's blog has an awesome set of articles on translators. It's a great read:
- Translator 101 Lesson 1: Setting the Stage
- Translator 101 Lesson 2: init, fini, and private context
- Translator 101 Lesson 3: This Time For Real
- Translator 101 Lesson 4: Debugging a Translator