In my last post, we went through the three vanilla types of volumes: Distribute, Replicate and Stripe, and now we have a basic understanding of what each of them does. I also mentioned we would be creating volumes that are a mix of these three types. But before we do so, let's have a look at volume files.
Whenever a volume is created, the corresponding volume files are also created. Volume files are located in a directory (bearing the same name as the volume) inside /var/lib/glusterd/vols/. Let's have a look at the volume files for the distribute volume (test-vol) we created last time.
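For reference, a plain two-brick distribute volume like test-vol is created with something along these lines (the exact commands are in the last post):
# gluster volume create test-vol Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2
# gluster volume start test-vol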
# gluster volume info
Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
# cd /var/lib/glusterd/vols/
# ls -lrt
total 4
drwxr-xr-x. 4 root root 4096 Jan 31 14:20 test-vol
# cd test-vol/
# ls -lrt
total 32
-rw-------. 1 root root 1406 Jan 31 14:19 test-vol.Gotham.home-asengupt-node1.vol
-rw-------. 1 root root 1406 Jan 31 14:19 test-vol.Gotham.home-asengupt-node2.vol
-rw-------. 1 root root 1349 Jan 31 14:19 trusted-test-vol-fuse.vol
-rw-------. 1 root root 1121 Jan 31 14:20 test-vol-fuse.vol
drwxr-xr-x. 2 root root 80 Jan 31 14:20 run
-rw-------. 1 root root 305 Jan 31 14:20 info
drwxr-xr-x. 2 root root 74 Jan 31 14:20 bricks
-rw-------. 1 root root 12 Jan 31 14:20 rbstate
-rw-------. 1 root root 34 Jan 31 14:20 node_state.info
-rw-------. 1 root root 16 Jan 31 14:20 cksum
#
As the bricks are on the same machine as the mount, we see all the volume files here. test-vol.Gotham.home-asengupt-node1.vol and test-vol.Gotham.home-asengupt-node2.vol are the volume files for Brick1 and Brick2 respectively. The volume file for the test-vol volume itself is trusted-test-vol-fuse.vol. Let's have a look inside:
# cat trusted-test-vol-fuse.vol
volume test-vol-client-0
type protocol/client
option password 010f5d80-9d99-4b7c-a39e-1f964764213e
option username 6969e53a-438a-4b92-a113-de5e5b7b5464
option transport-type tcp
option remote-subvolume /home/asengupt/node1
option remote-host Gotham
end-volume
volume test-vol-client-1
type protocol/client
option password 010f5d80-9d99-4b7c-a39e-1f964764213e
option username 6969e53a-438a-4b92-a113-de5e5b7b5464
option transport-type tcp
option remote-subvolume /home/asengupt/node2
option remote-host Gotham
end-volume
volume test-vol-dht
type cluster/distribute
subvolumes test-vol-client-0 test-vol-client-1
end-volume
volume test-vol-write-behind
type performance/write-behind
subvolumes test-vol-dht
end-volume
volume test-vol-read-ahead
type performance/read-ahead
subvolumes test-vol-write-behind
end-volume
volume test-vol-io-cache
type performance/io-cache
subvolumes test-vol-read-ahead
end-volume
volume test-vol-quick-read
type performance/quick-read
subvolumes test-vol-io-cache
end-volume
volume test-vol-md-cache
type performance/md-cache
subvolumes test-vol-quick-read
end-volume
volume test-vol
type debug/io-stats
option count-fop-hits off
option latency-measurement off
subvolumes test-vol-md-cache
end-volume
#
This is what a volume file looks like. It's actually an inverted graph of the path that data from the mount point is supposed to follow. Savvy? No? Ok, let's have a closer look at it. It is made up of a number of sections, each of which begins with "volume test-vol-xxxxx" and ends with "end-volume". Each section stores the information (type, options, subvolumes etc.) for a particular translator (we will come back to what those are in a minute).
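Stripped of the specifics, every section follows the same skeleton:
volume <translator-instance-name>
    type <category>/<translator>
    option <key> <value>
    subvolumes <child-section-name> ...
end-volume
The "subvolumes" line is what links the sections together into a graph.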
For example, let's say a read fop (file operation) is attempted at the mount point. The request information (type of fop: read, filename: the file the user tried to read, etc.) is passed on from one translator to another, starting from the io-stats translator at the bottom of the file, all the way up to one of the client translators at the top. Similarly, the response is transferred back from the client translator all the way down to io-stats, and finally to the user.
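Laid out in the order our read request travels, the graph above looks like this:
io-stats (test-vol)      <- the fop enters here from the mount
 -> md-cache
 -> quick-read
 -> io-cache
 -> read-ahead
 -> write-behind
 -> dht (test-vol-dht)   <- picks one subvolume based on the filename
 -> client-0 / client-1  <- ships the fop over the network to a brick
The response retraces the same path in reverse.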
So what are these translators? A translator is a module with a very specific purpose: receive data, perform the necessary operations, and pass the data on to the next translator. That's about it, in a nutshell. For example, let's look at the dht-translator in the above volume file.
volume test-vol-dht
type cluster/distribute
subvolumes test-vol-client-0 test-vol-client-1
end-volume
"test-vol" is a distribute
type volume, and hence has a dht translator. DHT a cluster
translator, as is visible in it's "type". We know that in a
distribute volume, a hashing algorithm, decides in which of the
"subvolumes" is the data actually present. DHT translator
is the one who does that for us.
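As an aside, we can actually see the hash ranges DHT hands out: each brick directory carries them in an extended attribute. A quick peek (this assumes the getfattr utility from the attr package is installed; the value shown is only illustrative, yours will differ):
# getfattr -n trusted.glusterfs.dht -e hex /home/asengupt/node1
# file: home/asengupt/node1
trusted.glusterfs.dht=0x0000000100000000000000007fffffff
The trailing two 32-bit words encode the start and end of the hash range owned by this brick; a filename whose hash falls in that range lives here.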
The dht-translator will receive our read-fop, along with the filename. Based on the filename, the hashing algorithm finds the correct subvolume (between the two, "test-vol-client-0" and "test-vol-client-1"), and passes the read-fop on to it. That is how every other translator in the graph works as well (receive the data, perform its part of the processing, and pass the data on to the next translator). As is quite visible here, the concept of translators gives us a lot of modularity.
The volume files are created by the volume-create command; based on the type of the volume and its options, a graph is built with the appropriate translators. But we can also edit an existing volume file (add, remove, or modify a translator or two), and the volume will change its behaviour accordingly. Let's try that. Currently "test-vol" is a distribute volume, so any file that is created will be present in exactly one of the bricks.
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
# touch file1
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 31 15:57 file1
# ls -lrt /home/asengupt/node1/
total 0
# ls -lrt /home/asengupt/node2/
total 0
-rw-r--r--. 2 root root 0 Jan 31 15:57 file1
#
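We don't even have to peek into the bricks to find the file; the mount itself can tell us where DHT put it, via a virtual extended attribute (again assuming getfattr is available; the output shape is illustrative):
# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/test-vol-mnt/file1
# file: mnt/test-vol-mnt/file1
trusted.glusterfs.pathinfo="(<DISTRIBUTE:test-vol-dht> <POSIX(/home/asengupt/node2):Gotham:/home/asengupt/node2/file1>)"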
The dht-translator created the file in node2 only. Let's edit the volume file for test-vol (/var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol). But before that, we need to stop the volume and unmount it.
# gluster volume stop test-vol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: test-vol: success
# umount /mnt/test-vol-mnt
Then let's edit the volume file, and
replace the dht-translator with a replicate translator.
# vi /var/lib/glusterd/vols/test-vol/trusted-test-vol-fuse.vol
**********Necessary Changes in the volume file************
volume test-vol-afr
type cluster/replicate
subvolumes test-vol-client-0 test-vol-client-1
end-volume
volume test-vol-write-behind
type performance/write-behind
subvolumes test-vol-afr
end-volume
**********************************************************
Now start the volume, and mount it
again.
# gluster volume start test-vol
volume start: test-vol: success
# mount -t glusterfs Gotham:/test-vol /mnt/test-vol-mnt/
Let's create a file at this mount-point
and check the behaviour of "test-vol".
# cd /mnt/test-vol-mnt/
# ls -lrt
total 0
-rw-r--r--. 1 root root 0 Jan 31 15:57 file1
# touch file2
# ls -lrt /home/asengupt/node1/
total 4
-rw-r--r--. 2 root root 0 Jan 31 16:06 file1
-rw-r--r--. 2 root root 0 Jan 31 16:06 file2
# ls -lrt /home/asengupt/node2/
total 4
-rw-r--r--. 2 root root 0 Jan 31 16:06 file1
-rw-r--r--. 2 root root 0 Jan 31 16:06 file2
#
Now we have the same set of files in all the bricks, as is the behaviour of a replicate volume. We also observe that not only did it create the new file (file2) in both bricks, but when we restarted the volume after changing the volume file, it also created a copy of the existing file (file1) in both bricks. Let's check the volume info.
# gluster volume info
Volume Name: test-vol
Type: Distribute
Volume ID: 5d28ca28-9363-4b79-b922-5f28d0c0db65
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
It should be noted that although the behaviour of the volume changed because of the edits we made in the volume file, the volume info still reflects the original details of the volume. That is because glusterd's own record of the volume (the info file we saw earlier) was never touched; as far as I know, glusterd will also regenerate the volume files from that record the next time it needs to (for instance when a volume option is changed), silently undoing hand-edits like ours. So treat this as an experiment, not a way to configure a volume.
As promised, we will now create a mix of the vanilla volume types, i.e. a distributed-replicate volume.
# gluster volume create mix-vol replica 2 Gotham:/home/asengupt/node1 Gotham:/home/asengupt/node2 Gotham:/home/asengupt/node3 Gotham:/home/asengupt/node4
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
volume create: mix-vol: success: please start the volume to access data
# gluster volume start mix-vol
volume start: mix-vol: success
# gluster volume info
Volume Name: mix-vol
Type: Distributed-Replicate
Volume ID: 2fc6f11e-254e-444a-8179-43da62cc56e9
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: Gotham:/home/asengupt/node1
Brick2: Gotham:/home/asengupt/node2
Brick3: Gotham:/home/asengupt/node3
Brick4: Gotham:/home/asengupt/node4
#
As it is a distributed-replicate volume, the distribute translator should have two subvolumes, each of which is in turn a replicate translator with two bricks as its own subvolumes. A quick look at the volume file (/var/lib/glusterd/vols/mix-vol/trusted-mix-vol-fuse.vol) will give us more clarity. This is what the cluster translators in the volume file look like:
volume mix-vol-replicate-0
type cluster/replicate
subvolumes mix-vol-client-0 mix-vol-client-1
end-volume
volume mix-vol-replicate-1
type cluster/replicate
subvolumes mix-vol-client-2 mix-vol-client-3
end-volume
volume mix-vol-dht
type cluster/distribute
subvolumes mix-vol-replicate-0 mix-vol-replicate-1
end-volume
As we can see, the dht-translator has two replicate translators as its subvolumes: "mix-vol-replicate-0" and "mix-vol-replicate-1". So every file created at the mount point will be sent by the dht-translator to one of the two replicate subvolumes, each of which in turn has two bricks as its own subvolumes. Once the write-fop reaches the appropriate replicate subvolume, the replicate translator creates a copy in each of the bricks listed as its subvolumes. Let's check this behaviour:
# mount -t glusterfs Gotham:/mix-vol /mnt/test-vol-mnt/
# cd /mnt/test-vol-mnt/
# touch file1
# ls -lrt /home/asengupt/node1/
total 0
# ls -lrt /home/asengupt/node2/
total 0
# ls -lrt /home/asengupt/node3/
total 0
-rw-r--r--. 2 root root 0 Jan 31 16:31 file1
# ls -lrt /home/asengupt/node4/
total 0
-rw-r--r--. 2 root root 0 Jan 31 16:31 file1
#
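The pathinfo attribute tells the same story from the mount point: dht routed the fop to one replicate subvolume, and that subvolume wrote to both of its bricks (output shape illustrative, as before):
# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/test-vol-mnt/file1
# file: mnt/test-vol-mnt/file1
trusted.glusterfs.pathinfo="(<DISTRIBUTE:mix-vol-dht> (<REPLICATE:mix-vol-replicate-1> <POSIX(/home/asengupt/node3):Gotham:/home/asengupt/node3/file1> <POSIX(/home/asengupt/node4):Gotham:/home/asengupt/node4/file1>))"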
Similarly, a distributed-stripe or a replicated-stripe volume can also be created. Jeff Darcy's blog has an awesome set of articles on translators. It's a great read:
- Translator 101 Lesson 1: Setting the Stage
- Translator 101 Lesson 2: init, fini, and private context
- Translator 101 Lesson 3: This Time For Real
- Translator 101 Lesson 4: Debugging a Translator