Awesome YFS
In this project, we implement Yet Another File System (YFS), following the lab coursework at http://pdos.csail.mit.edu/6.824/labs/index.html.
It consists of:
- extent server (accessed from extent client) to store file system data.
- lock server (accessed from lock client) to manage locks
We cache the extent server and cache/replicate the lock servers. The sections below give some implementation details.
Caching Locks
Design
Lock Client Cache
clientcacheMap: stores the state of each lock (per the struct below) in a map.
struct lock_cache_value {
    l_cache_state lock_cache_state;     // state of the cached lock
    pthread_mutex_t client_lock_mutex;  // protects this structure
    pthread_cond_t client_lock_cv;      // CV signaled on state changes
    pthread_cond_t client_revoke_cv;    // CV signaled when the lock is revoked
};
client_cache_mutex: To protect clientcacheMap
The client cache supports acquire and release, and listens for revoke and retry RPCs from the server.
Releaser, Retryer Threads
- On init, we spawn two threads, releaser and retryer (each with a dedicated mutex and CV for synchronization).
- The releaser listens on a list; it releases a lock immediately if it is in the FREE state, or sets the state to RELEASING and waits on client_revoke_cv until the next release() call from the client signals it. Once the client has released the lock, the releaser sends a release request to the server (see the sketch below).
- The retryer listens on a list and sends an acquire request to the server for each lockid. Upon a successful response, it notifies the threads waiting on client_lock_cv.
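A minimal sketch of the releaser loop, assuming an illustrative revoke_list guarded by its own releaser_mutex/releaser_cv and a get_lock_obj lookup helper (these names are assumptions, not the exact identifiers in the code):

void lock_client_cache::releaser() {
    while (true) {
        pthread_mutex_lock(&releaser_mutex);
        while (revoke_list.empty())
            pthread_cond_wait(&releaser_cv, &releaser_mutex);
        lock_protocol::lockid_t lid = revoke_list.front();
        revoke_list.pop_front();
        pthread_mutex_unlock(&releaser_mutex);

        lock_cache_value *entry = get_lock_obj(lid);   // assumed per-lock lookup helper
        pthread_mutex_lock(&entry->client_lock_mutex);
        if (entry->lock_cache_state != FREE) {
            // Still held locally: wait for the owner's release() to signal us.
            entry->lock_cache_state = RELEASING;
            pthread_cond_wait(&entry->client_revoke_cv, &entry->client_lock_mutex);
        }
        entry->lock_cache_state = NONE;
        pthread_mutex_unlock(&entry->client_lock_mutex);

        // The dorelease/flush hook (see Caching Extent below) runs before the lock goes back.
        int r;
        cl->call(lock_protocol::release, lid, id, r);  // hand the lock back to the server
        pthread_cond_broadcast(&entry->client_lock_cv);
    }
}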
Acquire
The course of action depends on the state of the lockid (a condensed sketch of the resulting loop follows the list):
- NONE: we send acquire to the server and set the state to ACQUIRING. On an OK response, the state is set to LOCKED. On a RETRY response, we wait on client_lock_cv; upon wakeup, if the state is not FREE, we loop back so that the next course of action depends on the current state.
- FREE: the state is set to LOCKED.
- Otherwise: we wait on client_lock_cv; upon wakeup, if the state is not FREE, we loop back so that the next course of action depends on the current state.
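A condensed sketch of this acquire loop; get_lock_obj is an assumed helper, and the state names follow the description above:

lock_protocol::status lock_client_cache::acquire(lock_protocol::lockid_t lid) {
    lock_cache_value *entry = get_lock_obj(lid);   // assumed lookup/insert helper
    pthread_mutex_lock(&entry->client_lock_mutex);
    while (true) {
        if (entry->lock_cache_state == NONE) {
            entry->lock_cache_state = ACQUIRING;
            pthread_mutex_unlock(&entry->client_lock_mutex);
            int r;
            lock_protocol::status ret = cl->call(lock_protocol::acquire, lid, id, r);
            pthread_mutex_lock(&entry->client_lock_mutex);
            if (ret == lock_protocol::OK) {
                entry->lock_cache_state = LOCKED;
                break;
            }
            // RETRY: wait until the retryer has reacquired the lock, then re-evaluate.
            pthread_cond_wait(&entry->client_lock_cv, &entry->client_lock_mutex);
        } else if (entry->lock_cache_state == FREE) {
            entry->lock_cache_state = LOCKED;
            break;
        } else {
            // ACQUIRING / LOCKED / RELEASING: wait for a state change and loop back.
            pthread_cond_wait(&entry->client_lock_cv, &entry->client_lock_mutex);
        }
    }
    pthread_mutex_unlock(&entry->client_lock_mutex);
    return lock_protocol::OK;
}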
Release
If the state of the lock is RELEASING, we signal client_revoke_cv (waking the releaser). If it is LOCKED, we signal client_lock_cv.
Lock Server Cache
The server is designed to be non-blocking; care is taken not to make RPC calls from within the acquire/release handlers.
tLockMap: stores the state of each lock (per the struct below) in a map:
struct lock_cache_value {
    l_state lock_state;
    std::string owner_clientid;                // used to send revoke
    std::string retrying_clientid;             // used to match with the incoming acquire request
    std::list<std::string> waiting_clientids;  // clients that need to be sent retry
};
lmap_mutex: global lock used to synchronize access to this state.
The server supports acquire and release.
Releaser, Retryer Threads
- On init, we spawn two threads, releaser and retryer (each with a dedicated mutex and CV for synchronization).
- The releaser listens on a list and sends a revoke RPC to the client that owns the lock.
- The retryer listens on a list and sends a retry RPC to the client that is scheduled to own the lock next (see the sketch below).
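A sketch of the server-side retryer; the retry_list structure and the get_rpcc helper that returns an rpcc bound to a given client id are assumptions:

void lock_server_cache::retryer() {
    while (true) {
        pthread_mutex_lock(&retryer_mutex);
        while (retry_list.empty())
            pthread_cond_wait(&retryer_cv, &retryer_mutex);
        retry_entry e = retry_list.front();            // assumed {lockid, clientid} pair
        retry_list.pop_front();
        pthread_mutex_unlock(&retryer_mutex);

        rpcc *cl = get_rpcc(e.clientid);               // assumed helper: rpcc for that client
        int r;
        if (cl)
            cl->call(rlock_protocol::retry, e.lockid, r);   // tell the client to retry its acquire
    }
}

The releaser thread is symmetric, sending rlock_protocol::revoke to owner_clientid instead.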
Acquire
The course of action depends on the current state (a sketch of the handler follows the list):
- LOCKFREE, or RETRYING with the incoming client id matching retrying_clientid: the lock is granted, so it goes to the appropriate client. waiting_clientids is also probed, and if any client is waiting, we schedule a revoke request to be sent by the releaser thread, to avoid starvation.
- LOCKED: we schedule a revoke request to be sent by the releaser thread, and the client is added to the waiting_clientids list.
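A sketch of the acquire handler under those rules; the schedule_revoke helper (which hands work to the releaser thread) is an assumption:

int lock_server_cache::acquire(lock_protocol::lockid_t lid, std::string clientid, int &r) {
    lock_protocol::status ret;
    pthread_mutex_lock(&lmap_mutex);
    lock_cache_value &entry = tLockMap[lid];           // new entries start out LOCKFREE
    if (entry.lock_state == LOCKFREE ||
        (entry.lock_state == RETRYING && clientid == entry.retrying_clientid)) {
        // Grant the lock to this client.
        entry.lock_state = LOCKED;
        entry.owner_clientid = clientid;
        ret = lock_protocol::OK;
        if (!entry.waiting_clientids.empty())
            schedule_revoke(lid, clientid);            // others are waiting: ask for it back soon
    } else {
        // Lock is busy: remember the caller and revoke it from the current owner.
        entry.waiting_clientids.push_back(clientid);
        if (entry.lock_state == LOCKED)
            schedule_revoke(lid, entry.owner_clientid);
        ret = lock_protocol::RETRY;
    }
    pthread_mutex_unlock(&lmap_mutex);
    return ret;
}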
Release
The state is set to LOCKFREE, and if there are clients waiting for this lock, we schedule a retry request to be sent to the next scheduled owner.
Changes to rpc/rpc.cc
This fixes server-side at-most-once delivery of RPCs: essentially, we respond with INPROGRESS if cb_present is false.
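Roughly, the idea is the sketch below; this is an illustration of the duplicate-request path rather than a verbatim patch, and the surrounding reply-window bookkeeping is assumed:

// Inside the server's duplicate check (sketch): this xid has been seen before.
// If its reply has not been recorded yet (cb_present is false), report INPROGRESS
// so the dispatcher drops the duplicate instead of resending an empty reply.
if (it->xid == xid) {
    if (!it->cb_present)
        return INPROGRESS;
    *b = it->buf;      // reply already stored: resend it
    *sz = it->sz;
    return DONE;
}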
Validation
It passes all tests from the lab3/lab4 scripts, with and without RPC_LOSSY.
Caching Extent
Caching at Extent Client
In this exercise, we cache extent data and attributes at the client to avoid round trips to the extent server. We use the lock server/client to acquire locks before caching on the extent client side.
get()
If the inode is not present in the hash map, we invoke load, which fetches the attributes and data from the extent server and populates (see the sketch below):
struct extent_value {
    bool dirty;                        // to track puts
    std::string data;
    extent_protocol::attr ext_attr;
};
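A sketch of the cache-miss path in get(); the extent_cache map name is an assumption, and locking is omitted since a lock is already held via the lock service:

extent_protocol::status extent_client::get(extent_protocol::extentid_t eid, std::string &buf) {
    if (extent_cache.find(eid) == extent_cache.end()) {
        // Miss: fetch data and attributes from the extent server and cache them.
        extent_protocol::status ret = load(eid);
        if (ret != extent_protocol::OK)
            return ret;
    }
    buf = extent_cache[eid].data;
    return extent_protocol::OK;
}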
put()
If the inode is not present on the extent server, we simply cache it at the client; if the inode is still around when the lock is released, we write it back to the server in the flush() call.
remove()
We remove the inode from the hash map.
flush()
When the releaser thread in lock_client_cache releases a lock, the dorelease hook invokes flush() in the extent client. If the data is dirty, we "put" it to the server; if not, we "remove" it from the server, and the entry is deleted from the client cache if needed (see the sketch below).
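A sketch of flush(); it assumes remove() records removed inodes (e.g. in a removed set), which is illustrative bookkeeping rather than the write-up's exact scheme:

void extent_client::flush(extent_protocol::extentid_t eid) {
    int r;
    if (extent_cache.count(eid)) {
        if (extent_cache[eid].dirty)
            cl->call(extent_protocol::put, eid, extent_cache[eid].data, r);  // write back dirty data
        extent_cache.erase(eid);
    }
    if (removed.count(eid)) {                       // assumed set filled in by remove()
        cl->call(extent_protocol::remove, eid, r);  // propagate the removal to the server
        removed.erase(eid);
    }
}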
BUGS FIXED
- getattr on the extent server was returning OK in all scenarios, even if the inode was not in the system.
- Out-of-order revoke/retry messages: on the lock client, dorelease is called only "if" the lock has been acquired at least "once". The doflush flag is set to false in the NONE case, when the lock state becomes ACQUIRING, and is set to true once the state changes to LOCKED. This ensures that the extent_client has in fact acquired the lock, so it is valid to call doflush().
test.sh
This script simply runs a regression test using all the test scripts, with RPC_LOSSY set to 0 and 5.
Paxos
Design
In this exercise, we implement the Paxos algorithm. The stack consists of the RSM, config and Paxos modules. The config layer runs a heartbeat thread to monitor other nodes and invokes remove if necessary. One important convention: a higher layer may hold its mutex while calling down into a lower layer, but not vice versa.
Paxos Module
This consists of Proposer and Acceptor.
Proposer
The proposer implements the algorithm below (a sketch of proposal-number generation follows it).
n(prop_t): proposal number
v: value to be agreed on (list of alive nodes)
instance#: unique instance number
n_h: highest prepare seen (set during prepare and accept)
instance_h: highest instance accepted (set during decide)
n_a, v_a: highest accept seen
proposer run(instance, v):
    choose n, unique and higher than any n seen so far
    send prepare(instance, n) to all servers including self
    if oldinstance(instance, instance_value) from any node:
        commit to the instance_value locally
    else if prepare_ok(n_a, v_a) from majority:
        v' = v_a with highest n_a; choose own v otherwise
        send accept(instance, n, v') to all
        if accept_ok(n) from majority:
            send decided(instance, v') to all
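Choosing "n, unique and higher than any n seen so far" can be done by pairing a counter with the node id; a sketch, assuming prop_t carries a number n and a node id m as in the handed-out protocol header, with the helper name being illustrative:

// Sketch: build a proposal number strictly higher than anything seen so far.
// Including the node id makes it globally unique; ties on n are broken by id.
prop_t next_proposal(const prop_t &n_h, const std::string &me, prop_t &my_n) {
    prop_t n;
    n.n = (n_h.n > my_n.n ? n_h.n : my_n.n) + 1;
    n.m = me;          // this node's id
    my_n = n;          // remember it so our own numbers keep increasing
    return n;
}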
Acceptor
acceptor prepare(instance, n) handler:
    if instance <= instance_h
        reply oldinstance(instance, instance_value)
    else if n > n_h
        n_h = n
        reply prepare_ok(n_a, v_a)
    else
        reply prepare_reject

acceptor accept(instance, n, v) handler:
    if n >= n_h
        n_a = n
        v_a = v
        reply accept_ok(n)
    else
        reply accept_reject

acceptor decide(instance, v) handler:
    paxos_commit(instance, v)
RSM
Design
Replicated State Machine (RSM)
In this exercise, we implement RSM version of lock service.
1) Replicable caching lock server
In this step, we ported the previous caching implementation. To guarantee that the RSM layer cannot introduce deadlock, we do not hold locks across RPC calls; instead, the background retryer/releaser threads update server-side state that requires a response from a client.
2) RSM without failures
- The RSM client sends its request to the master by calling the invoke RPC (handling the cases where the RSM is in a view change running Paxos, or where the contacted node is no longer the primary).
- The master assigns the client request the next viewstamp in sequence and sends the invoke RPC to each slave.
- If the request has the expected viewstamp, the slave executes the request locally and replies OK to the master.
- The master executes the request locally and replies back to the client (a sketch of the primary's invoke path follows this list).
Only the primary lock_server_cache_rsm will communicate directly to clients.
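A simplified sketch of the primary's invoke path; locking is omitted, and accessors such as cfg->get_view are assumed to follow the lab skeleton but may differ in detail:

rsm_client_protocol::status rsm::client_invoke(int procno, std::string req, std::string &r) {
    if (inviewchange)
        return rsm_client_protocol::BUSY;           // a Paxos view change is under way
    if (!amiprimary())
        return rsm_client_protocol::NOTPRIMARY;     // client should refresh its view

    viewstamp vs = myvs;                            // next viewstamp in sequence
    myvs.seqno++;

    // Forward the request to every backup in the current view before executing locally.
    std::vector<std::string> members = cfg->get_view(vs.vid);    // assumed accessor
    for (size_t i = 0; i < members.size(); i++) {
        if (members[i] == primary)
            continue;
        handle h(members[i]);
        int dummy;
        if (!h.safebind() ||
            h.safebind()->call(rsm_protocol::invoke, procno, vs, req, dummy) != rsm_protocol::OK)
            return rsm_client_protocol::BUSY;
    }
    execute(procno, req, r);                        // finally run the request on the primary
    return rsm_client_protocol::OK;
}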
3) Cope with Backup Failures and Implement state transfer
Upon detecting a failure (via the heartbeat thread in the config layer) or the addition of a new node (via the recovery thread in the RSM layer), Paxos is kicked off, and once it settles, each replica gets the state from the new master. The master holds still until all backups have synced (implemented with a condition variable).
- We serialize the state on the master by implementing the marshal_state and unmarshal_state methods, which are used on both the master and replica ends (see the sketch after this list).
- sync_with_backups, in the recovery thread on the primary, waits for all replicas to sync up.
- sync_with_primary, in the recovery thread on the backups, syncs separately with the master.
- statetransferdone is called from a backup to convey that its sync is done.
- transferdonereq is the handler for the above on the primary.
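A sketch of the marshalling side, assuming the lock table ported from the caching lock server (tLockMap guarded by lmap_mutex) and showing only a couple of fields; the remaining fields would be (un)marshalled the same way:

std::string lock_server_cache_rsm::marshal_state() {
    ScopedLock ml(&lmap_mutex);
    marshall rep;
    rep << (unsigned int) tLockMap.size();
    std::map<lock_protocol::lockid_t, lock_cache_value>::iterator it;
    for (it = tLockMap.begin(); it != tLockMap.end(); it++) {
        rep << it->first;                     // lock id
        rep << it->second.owner_clientid;     // ... plus state, waiting list, etc.
    }
    return rep.str();
}

void lock_server_cache_rsm::unmarshal_state(std::string state) {
    ScopedLock ml(&lmap_mutex);
    unmarshall rep(state);
    unsigned int size;
    rep >> size;
    for (unsigned int i = 0; i < size; i++) {
        lock_protocol::lockid_t lid;
        lock_cache_value entry;
        rep >> lid;
        rep >> entry.owner_clientid;
        tLockMap[lid] = entry;
    }
}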
4) Cope with Primary Failures
- Clients call init_members() if the primary returns a NOTPRIMARY response; this way we reinitialize the view.
- If there is no response from the primary at all, we assume the next member of the view is the primary and retry until a call succeeds (see the sketch below).
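A sketch of the client-side retry loop; the signature is simplified, and the next_member() helper (cycling through the last known view) is an assumption:

int rsm_client::invoke(int proc, std::string req, std::string &rep) {
    while (true) {
        handle h(primary);
        rpcc *cl = h.safebind();
        int ret = rsm_client_protocol::BUSY;
        if (cl)
            ret = cl->call(rsm_client_protocol::invoke, proc, req, rep, rpcc::to(5000));
        if (ret == rsm_client_protocol::OK)
            return ret;
        if (ret == rsm_client_protocol::NOTPRIMARY)
            init_members();              // primary changed: refresh the view
        else
            primary = next_member();     // no response: try the next member of the view
        sleep(1);                        // back off before retrying
    }
}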
5) Complicated Failures
- The master fails after one slave has received the request while the second slave is still in the process of updating its state.
- One of the slaves fails before responding.
Conclusion
With this, we conclude this series of exercises. Happy Coding!
Best, srned