Redis Cluster - Alternative 1

28 Apr 2010: Ver 1.0 - initial version

Overview
========

The motivations and design goals of Redis Cluster are already outlined in the
first design document of Redis Cluster. This document is just an attempt to
provide a completely alternative approach in order to explore more ideas.

In this document the alternative explored is a cluster where communication is
performed directly from client to the target node, without an intermediate
layer.

An intermediate layer can still be used, in the form of a proxy, in order to
provide the same functionality to clients not able to directly use the
cluster protocol. So in a first stage clients can use a proxy to implement
the hash ring, but later these clients can switch to a native implementation,
following a specification that the Redis project will provide.

In this new design fault tolerance is achieved by replicating every data node
M-1 times instead of storing the same key M times across nodes.

From the point of view of CAP our biggest sacrifice is about "P", that is,
resistance to partitioning: only M-1 nodes can go down for the cluster to
still be functional. Also, when possible, "A" is somewhat sacrificed for "L",
that is, Latency. Latency is not really part of the CAP equation but is a
very important parameter.

Network layout
==============

In this alternative design the network layout is simple, as there are only
clients talking directly to the N data nodes. So we can imagine having:

- K Redis clients, directly talking to the data nodes.
- N Redis data nodes, that is, normal Redis instances.

Data nodes are replicated M-1 times (so there are a total of M copies of
every node). If M is one, the system is not fault tolerant. If M is 2, one
data node can go offline without affecting operations. And so forth.

Hash slots
==========

The key space is divided into 1024 slots.

Given a key, the SHA1 function is applied to it.
The first 10 bits of the SHA1 digest are interpreted as an unsigned integer
from 0 to 1023. This is the hash slot of the key.

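As an illustration, here is a minimal sketch of this mapping in Python. The
draft does not say exactly which 10 bits of the digest are used, so taking
them from the first two bytes is an assumption:

    import hashlib

    def hash_slot(key: bytes) -> int:
        # SHA1 the key and keep 10 bits of the digest: an integer in 0..1023.
        digest = hashlib.sha1(key).digest()
        # Assumption: the 10 bits come from the first two digest bytes; the
        # exact bit selection would be fixed by the final specification.
        return ((digest[0] << 8) | digest[1]) & 0x3FF

    print(hash_slot(b"foo"))  # same key -> same slot, always
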
Data nodes
==========

Data nodes are normal Redis instances, but a few additional commands are
provided:

HASHRING ADD ... list of hash slots ...
HASHRING DEL ... list of hash slots ...
HASHRING REHASHING slot
HASHRING SLOTS => returns the list of configured slots
HASHRING KEYS ... list of hash slots ...

By default Redis instances are configured to accept operations about all
the hash slots. With these commands it's possible to configure a Redis
instance to accept only a subset of the key space.

If an operation is performed against a key hashing to a slot that the
instance is not configured to accept, the Redis instance will reply with:

"-ERR wrong hash slot"

More details on the HASHRING command and subcommands will be shown later
in this document.

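For illustration, a provisioning script could issue these draft commands
through a generic client; redis-py's execute_command is used here only as a
way to send arbitrary commands, and the HASHRING verbs exist only in this
proposal:

    import redis

    node = redis.Redis(host="192.168.1.100", port=6379)

    # Configure this instance to accept only hash slots 0..511
    # (HASHRING is the draft command described above, not a real command).
    node.execute_command("HASHRING", "ADD", *range(512))

    # Ask the node which slots it is currently serving.
    print(node.execute_command("HASHRING", "SLOTS"))
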
Additionally three other commands are added:

DUMP key
RESTORE key <dump data>
MIGRATE key host port

DUMP is used to output a very compact binary representation of the data
stored at key.

RESTORE re-creates a value (storing it at key) starting from the output
produced by DUMP.

MIGRATE is like a server-side DUMP+RESTORE command. This atomic command moves
one key from the connected instance to another instance, returning the status
code of the operation (+OK or an error).

The protocol described in this draft only uses the MIGRATE command, but this
in turn will use RESTORE internally when connecting to another server, and
DUMP is provided for symmetry.

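As a sketch, moving a single key with the draft signature MIGRATE key host
port might look like this from a generic client (the command shape here is
this draft's, not necessarily the final one):

    import redis

    src = redis.Redis(host="192.168.1.100", port=6379)

    # Atomically move one key to another instance; on success the key
    # no longer exists on src and +OK is returned.
    reply = src.execute_command("MIGRATE", "user:1000", "192.168.1.103", 6380)
    print(reply)
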
Querying the cluster
====================

1) Reading the cluster config
-----------------------------

Clients of the cluster are required to have the cluster configuration loaded
into memory. The cluster configuration is the sum of the following info:

- Number of data nodes in the cluster, for instance, 10
- A map between hash slots and nodes, so for instance:
  hash slot 1 -> node 0
  hash slot 2 -> node 5
  hash slot 3 -> node 3
  ... and so forth ...
- Physical address of nodes, and their replicas:
  node 0 addr -> 192.168.1.100
  node 0 replicas -> 192.168.1.101, 192.168.1.105
- Configuration version: the SHA1 of the whole configuration

The configuration is stored in every single data node of the cluster.

A client without the configuration in memory is required, as a first step, to
read the config. In order to do so the client needs a list of IPs that are,
with good probability, data nodes of the cluster.

The client will try to get the config from all these nodes. If no responding
node is found, an error is reported to the user.

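A sketch of this first step follows. The draft does not fix how the
configuration is laid out inside each node, so the key names cluster:config
and cluster:version below are purely illustrative:

    import redis

    SEED_IPS = ["192.168.1.100", "192.168.1.101", "192.168.1.105"]

    def fetch_config():
        # Try every candidate node; any data node stores the full config.
        for ip in SEED_IPS:
            try:
                node = redis.Redis(host=ip, port=6379, socket_timeout=1)
                raw = node.get("cluster:config")        # hypothetical key
                version = node.get("cluster:version")   # hypothetical key
                if raw is not None:
                    return raw, version
            except redis.RedisError:
                continue  # node down or unreachable: try the next one
        raise RuntimeError("no data node responded: cannot load the config")
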
2) Caching and refreshing the configuration
-------------------------------------------

A client is allowed to cache the configuration in memory or in some other way
(for instance storing it into a file), but every client is required to check
the configuration version at least once every 10 seconds, asking for the
configuration version key with a single GET call and verifying that it
matches the version loaded in memory.

A client is also required to refresh the configuration every time a node
replies with:

"-ERR wrong hash slot"

as this means that hash slots were reassigned in some way.

Checking the configuration every 10 seconds is not required in theory, but it
is a good protection against errors and failures that may happen in real
world environments. It is also very cheap to perform, as a GET operation from
time to time is going to have no impact on the overall performance.

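A sketch of such a periodic check, reusing the hypothetical cluster:version
key from the previous sketch:

    import time

    def maybe_refresh(node, cached_version, last_check, period=10):
        # Re-check the configuration version at least once every `period`
        # seconds; between checks the cached configuration is trusted.
        if time.monotonic() - last_check < period:
            return cached_version, last_check
        current = node.get("cluster:version")    # hypothetical key
        if current != cached_version:
            # Version changed: reload, e.g. by calling fetch_config() again.
            cached_version = current
        return cached_version, time.monotonic()
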
3) Read query
-------------

To perform a read query the client hashes the key argument of the command
(in the initial version of Redis Cluster only single-key commands are
allowed). Using the in-memory configuration it maps the hash slot to the
node ID.

If the client is configured to support read-after-write consistency, then
the "master" node for this hash slot is queried.

Otherwise the client picks a random node among the master and the available
replicas.

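A sketch of both routing rules, assuming the in-memory shapes below (a
slot-to-node map and per-node address lists, which the draft does not
mandate); hash_slot is the function from the earlier sketch, and the write
rule from the next section is included for completeness:

    import random

    # Illustrative shapes: slot_to_node = {slot: node_id},
    # nodes = {node_id: [master_addr, replica_addr, ...]}
    def pick_read_node(key, slot_to_node, nodes, read_after_write=False):
        addrs = nodes[slot_to_node[hash_slot(key)]]
        if read_after_write:
            return addrs[0]              # always query the master
        return random.choice(addrs)      # master or any replica

    def pick_write_node(key, slot_to_node, nodes):
        # Writes always target the master (see "4) Write query" below).
        return nodes[slot_to_node[hash_slot(key)]][0]
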
4) Write query
--------------

A write query is exactly like a read query, with the difference that the
write always targets the master node instead of the replicas.

Creating a cluster
==================

In order to create a new cluster, the redis-cluster command line utility is
used. It gets a list of available nodes and replicas, and writes the initial
configuration to all the nodes.

At this point the cluster is usable by clients.

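For instance, the initial assignment could simply spread the 1024 slots
evenly across the data nodes; a sketch (the actual policy of the
redis-cluster utility is not specified in this draft):

    def initial_slot_map(node_ids, num_slots=1024):
        # Round-robin the hash slots over the available data nodes.
        return {slot: node_ids[slot % len(node_ids)]
                for slot in range(num_slots)}

    # With 3 nodes, slots 0, 3, 6, ... go to node0, slots 1, 4, 7, ...
    # go to node1, and so forth.
    print(initial_slot_map(["node0", "node1", "node2"])[4])  # -> node1
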
Adding nodes to the cluster
===========================

The command line utility redis-cluster is used in order to add a node to the
cluster:

1) The cluster configuration is loaded.
2) A fair number of hash slots are assigned to the new data node.
3) Hash slots moved to the new node are marked as "REHASHING" in the old
nodes, using the HASHRING command:

HASHRING SETREHASHING 1 192.168.1.103 6380

The above command sets hash slot "1" in rehashing state, with the
"forwarding address" 192.168.1.103:6380. As a result, if this node receives
a query about a key that hashes to hash slot 1 and *is not present* in the
current data set, it replies with:

"-MIGRATED 192.168.1.103:6380"

The client can then reissue the query against the new node.

If instead the hash slot is marked as rehashing but the requested key is
still there, the query is processed. This allows for non-blocking rehashing.

Note that no additional memory is used by Redis in order to provide such a
feature.

4) While the hash slot is marked as "REHASHING", redis-cluster asks this node
for the list of all the keys matching the specified hash slot. Then all the
keys are moved to the new node using the MIGRATE command.
5) Once all the keys are migrated, the hash slot is deleted from the old
node's configuration with "HASHRING DEL 1", and the configuration is updated.

Using this algorithm all the hash slots are migrated one after the other to
the new node. In a practical implementation, before starting the migration,
the redis-cluster utility should write a log into the configuration so that
in case of a crash or any other problem it is able to recover from where it
left off. A sketch of this migration loop is shown below.

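Here is how the redis-cluster utility might run steps 3-5 for one slot. The
HASHRING verbs and the MIGRATE shape are taken from this draft; the client
object and the reply format of HASHRING KEYS are assumptions:

    def migrate_slot(old_node, slot, new_host, new_port):
        # Step 3: mark the slot as rehashing, forwarding to the new node.
        old_node.execute_command("HASHRING", "SETREHASHING", slot,
                                 new_host, new_port)
        # Step 4: list the keys in this slot and MIGRATE them one by one.
        for key in old_node.execute_command("HASHRING", "KEYS", slot):
            old_node.execute_command("MIGRATE", key, new_host, new_port)
        # Step 5: drop the slot from the old node's configuration.
        old_node.execute_command("HASHRING", "DEL", slot)
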
Fault tolerance
===============

Fault tolerance is reached by replicating every data node M-1 times, so that
we have one master and M-1 replicas for a total of M nodes holding the same
hash slots. Up to M-1 nodes can go down without affecting the cluster.

The tricky part about fault tolerance is detecting when a node is failing and
signaling it to all the other clients.

When a master node is failing in a permanent way, promoting the first slave
is easy:

1) At some point a client will notice there are problems accessing a given
node. It will try to refresh the config, but will notice that the config is
already up to date.
2) In order to make sure the problem is not about the client's own
connectivity, it will try to reach other nodes as well. If more than M-1
nodes appear to be down, it's either a client networking problem or the
cluster can't be fixed anyway, as too many nodes are down. So no action is
taken, but an error is reported.
3) If instead only 1 to at most M-1 nodes appear to be down, the client
promotes a slave to master and writes the new configuration to all the data
nodes.

All the other clients will see the data node not working, and as a first step
will try to refresh the configuration. They will successfully refresh the
configuration and the cluster will work again.

Every time a slave is promoted, the information is written in a log (actually
a Redis list) in all the data nodes, so that system administration tools can
detect what happened and send notifications to the admin. A sketch of the
client-side logic in steps 1-3 is shown below.

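A hedged sketch of that decision; reachable() and promote_first_slave() are
hypothetical helpers standing in for the probing and the configuration
rewrite:

    def check_and_promote(all_nodes, m):
        # all_nodes: every data node address; m: copies of each hash slot.
        down = [n for n in all_nodes if not reachable(n)]   # hypothetical
        if len(down) > m - 1:
            # Either our own network is broken, or too many nodes are down
            # for the cluster to be fixable: report, but take no action.
            raise RuntimeError("too many nodes unreachable: %r" % down)
        for node in down:
            # Promote a slave and write the new config to all data nodes.
            promote_first_slave(node)                       # hypothetical
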
Intermittent problems
---------------------

In the above scenario a master was failing in a permanent way. Now let's
instead consider a case where a network cable is not working well, so a node
appears up for a few seconds and down for a few seconds.

When this happens, recovering can be much harder, as a client may notice the
problem and promote a slave to master as a result, but then the host will be
up again and the other clients will not see the problem, writing to the old
master for at most 10 seconds (after 10 seconds all the clients are required
to perform a few GETs to check the configuration version of the cluster and
update it if needed).

One way to fix this problem is to delegate the failover mechanism to a
failover agent. When clients notice problems they will not take any active
action, but will just log the problem into a Redis list in all the reachable
nodes, wait, check for a configuration change, and retry.

The failover agent constantly monitors these logs: if some client is
reporting a failing node, it can take appropriate action, checking whether
the failure is permanent or not. If it is not, it can send a SHUTDOWN command
to the failing master if possible. The failover agent can also assess the
problem more carefully, checking whether the failing node is reported by all
the clients or just a single one, and can verify by itself that there is a
real problem before proceeding with the failover.

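A sketch of both halves of this reporting scheme, assuming the failure
reports live in a Redis list under a hypothetical cluster:failures key:

    import json, time

    FAILURE_LOG = "cluster:failures"   # hypothetical list name

    def report_failure(reachable_nodes, failing_addr, client_id):
        # Client side: append a report to the log on every reachable node,
        # then wait, re-check the configuration version, and retry.
        entry = json.dumps({"node": failing_addr, "by": client_id,
                            "at": time.time()})
        for node in reachable_nodes:
            node.lpush(FAILURE_LOG, entry)

    def drain_reports(node):
        # Agent side: consume the pending reports from one node's log.
        while True:
            raw = node.rpop(FAILURE_LOG)
            if raw is None:
                return
            yield json.loads(raw)
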
Redis proxy
===========

In order to make the switch to the clustered version of Redis simpler, and
because the client-side protocol is non-trivial to implement compared to the
usual Redis client lib protocol (where a minimal lib can be as small as
100 lines of code), the cluster protocol will also be made available in the
form of a proxy.

Every client will talk to a redis-proxy node that is responsible for using
the new protocol and forwarding back the replies.

In the long run the aim is to switch all the major client libraries to the
new protocol in a native way.

Supported commands
==================

Because with this design we talk directly to data nodes and there is a single
"master" version of every value (that's the big gain of dropping "P" from
CAP!), almost all the Redis commands can be supported by the clustered
version, including MULTI/EXEC and multi-key commands, as long as all the keys
hash to the same hash slot. In order to guarantee this, key tags can be used:
when a specific pattern is present in the key name, only that part is hashed
in order to obtain the hash slot. A sketch of such tag extraction is shown
below.

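The draft does not fix the tag pattern, so the "{...}" form (the convention
Redis adopted later) is assumed here; hash_slot is the function from the
"Hash slots" sketch:

    def tag_slot(key: bytes) -> int:
        # If the key contains a {...} tag, hash only the tag, so that all
        # keys sharing the tag land in the same hash slot.
        start = key.find(b"{")
        end = key.find(b"}", start + 1)
        if start != -1 and end > start + 1:
            key = key[start + 1:end]
        return hash_slot(key)

    # Both keys map to the same slot, so MULTI/EXEC across them is allowed.
    assert tag_slot(b"{user:1000}.following") == \
           tag_slot(b"{user:1000}.followers")
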
EOF