PingFS

Source
Behram Mistree (bmistree@gmail.com)
1 Intro/Motivation
A few years ago, Keith Winstein told me about a cool idea of his. According to RFC 792, which describes the ping protocol, when you send an echo request (ping) to another host with a data payload, that host must send the same data payload back. From this, he reasoned, you could make an entire filesystem that used pings to store and retrieve data.
This post describes PingFS, a filesystem whose storage media are literally ping’s echo request and echo response packets. PingFS stores no data on disks or solid state drives and only keeps a few bytes of filesystem metadata in memory. Even with these restrictions, PingFS has many of the features of a modern filesystem, e.g., hierarchical directories, file-level permissions, and large file storage. In addition, because of its design, it also supports unusual higher-level features such as version control/checkpointing and the ability to run in a distributed mode.
I come from an academic background, and a good bit of this post ended up structured like an academic paper. In an academic paper, this is usually the place where the author explains how the technology they are introducing is faster, more secure, or otherwise better because of some new and brilliant feature.
PingFS is none of these things. Designing a filesystem on top of echo request and response packets has several severe limitations — most importantly, performance and durability on network disconnect. We describe these limitations as well as countermeasures to ameliorate their effects. However, even with these countermeasures, from a purely practical standpoint PingFS is uniformly worse than many decades-old technologies.
2 PingFS structure
PingFS is composed of three primary components depicted in Figure 1.
- The block manager is responsible for putting data blocks on the wire and pulling them off the wire. It provides a simple interface, allowing the block mapper to register new blocks and retrieve or deregister already-registered blocks.
- The block mapper queries the block manager for data blocks. Using the metadata contained in these blocks, it assembles a virtual tree-like data structure representing the filesystem. This structure and the operations permitted on it are further described in Section 4.
- Finally, the FUSE wrapper receives calls from the operating system to perform operations on the filesystem. It translates these calls into operations on the tree structure that the block mapper maintains. We call this component the FUSE wrapper because it is built on top of the Filesystem in Userspace (FUSE) Linux interface, which allows us to add a filesystem without directly editing kernel code.
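To make the division of labor concrete, below is a minimal Python sketch of the interface the block mapper might see from the block manager. The class and method names here are illustrative assumptions, not taken from the actual PingFS source.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A single filesystem block, destined for an echo request payload."""
    block_id: int
    data: bytes


class BlockManager:
    """Hypothetical interface: blocks live on the wire, not in this object."""

    def register(self, block: Block) -> None:
        """Start circulating a new block as echo requests/responses."""
        raise NotImplementedError

    def retrieve(self, block_id: int) -> Block:
        """Block until the next echo response carrying block_id arrives."""
        raise NotImplementedError

    def deregister(self, block_id: int) -> None:
        """Stop retransmitting a block; its data effectively disappear."""
        raise NotImplementedError
```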

3 Great. I get how you structured your code, but what did you say the storage medium was again???
RFC 792, which describes the ping protocol, requires that
"The data [payload] received in the echo request message must be returned in the echo reply message."
PingFS uses this guarantee to store filesystem data and metadata in echo request and response packets actively traversing the network, roughly mapping a single filesystem block to a single packet. Except for metadata for the filesystem’s root block, no filesystem data or metadata are stored on disk or persistently held in memory. The only times these data reside in memory are when PingFS generates echo requests, reads echo responses, or collects data to respond to a specific user request (e.g., when a user opens a text editor and reads a file, PingFS loads the contents of the file from received echo responses before returning it to the text editor program).
Packets as storage media present two obvious problems:
- Ephemerality A request/response packet pair exists only for an RTT.
- Performance After “writing” a block — sending a ping request — that block cannot be read for a full RTT.
PingFS’ block manager addresses the ephemerality of packets simply by retransmitting received packets. As shown in Figure 2, when the block manager receives an echo reply, after performing basic integrity and liveness checks on that packet’s data, it generates a new echo request containing the same data and retransmits it. In this way, assuming that no packets are dropped or corrupted, PingFS never loses data.
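As a rough illustration, here is a minimal sketch of that recirculation loop using scapy. The target address, the `liveness_ok` placeholder, and the trailing-CRC integrity check are assumptions made for this example, not details of the actual PingFS implementation (which, like this sketch, needs raw-socket privileges to send and sniff ICMP packets).

```python
import zlib

from scapy.all import ICMP, IP, Raw, send, sniff

TARGET = "127.0.0.1"  # host we bounce blocks off of (illustrative)


def liveness_ok(payload: bytes) -> bool:
    """Placeholder: return False for blocks the mapper has deregistered."""
    return True


def integrity_ok(payload: bytes) -> bool:
    """Check a trailing CRC32 appended when the block was first sent (assumed scheme)."""
    body, crc = payload[:-4], payload[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == crc


def recirculate(pkt) -> None:
    if not (pkt.haslayer(ICMP) and pkt[ICMP].type == 0 and pkt.haslayer(Raw)):
        return  # only echo replies with a data payload carry blocks
    payload = bytes(pkt[Raw].load)
    if integrity_ok(payload) and liveness_ok(payload):
        # Re-issue the same data as a fresh echo request.
        send(IP(dst=TARGET) / ICMP(type=8) / Raw(load=payload), verbose=False)


if __name__ == "__main__":
    sniff(filter="icmp", prn=recirculate, store=False)
```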
Figure 2: PingFS’s block manager stores filesystem block data in echo request data payloads and retransmits requests and responses to ensure data are never lost.
Section 5 discusses extensions to handle drops and data corruption and Section 4 discusses liveness checks (e.g., not retransmitting data corresponding to a file that was deleted by a user).
Section 6 discusses PingFS performance and optimizations. However, even with these optimizations, PingFS greatly underperforms decades-old technologies.
4 How do we turn an unordered stream of blocks into a filesystem?
As it services echo replies, the block manager produces an unordered stream of blocks. The block mapper consumes this stream. It is the job of the block mapper to:
- Organize this stream into a data structure that supports typical filesystem read operations (e.g., read a file, list the contents of a directory)
- Register new blocks with the block manager in response to file creations and modifications; these newly-registered blocks will later repeatedly appear in the stream produced by the block manager
- Deregister blocks with the block manager that are no longer in use (e.g., those blocks associated with a deleted file).
The block mapper assigns each block a unique id before registering it with the block manager. Inspired by ZFS, the block mapper conceptually uses these ids to map blocks into a tree structure. The block mapper walks and modifies this structure to perform filesystem operations, creating new blocks and garbage collecting old ones as it goes.
Importantly, the block mapper never keeps all blocks in the entire file system in memory, and instead relies on the block manager’s stream to eventually provide a block required for a file system operation.
4.1 Tree data structure
The block mapper distinguishes three types of blocks to build its tree data structure:
- Head block: A head block signifies the beginning of a file or directory. It contains the associated file’s/directory’s name[1] relative to its parent directory, as well as other file/directory metadata (permissions, size, etc.). Finally, it contains the ids of children blocks. In the case that a head block represents a directory, these children blocks may be other head blocks (representing sub files and directories) or link blocks (described below). In the case that a head block represents a file, these children blocks may be either link blocks or contents blocks.
- Contents block: A contents block contains the actual bytes users write to files. A contents block can only ever be a leaf of the tree, and therefore has no children. A single file may be composed of several contents blocks if it is so large that it would exceed the size of a single block (~1000B, to ensure that a block does not exceed a standard maximum segment size for a TCP packet). To read a file composed of multiple contents blocks, find all leaf descendants of the file’s associated head block and read them in left-to-right order.
- Link block: A link block contains pointers to additional children blocks. It contains no other data and can be neither the root nor a leaf of the tree. Link blocks remove file system scaling restrictions by effectively serving as child block multipliers. For example, assuming 32B block ids and a maximum block size of ~1,000B, without link blocks a file system user could only have 31 (1000 / 32) files in a directory: there would not be room within a head block to store the ids of any other children. Similarly, assuming the same block sizes as above, without link blocks each file head block could point to at most 31 1,000B contents blocks, and file system users would be unable to store files larger than 31,000B.
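To make the structure concrete, here is a minimal sketch of the three block types as Python dataclasses. The field names are illustrative assumptions; the real on-the-wire encoding and metadata fields may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeadBlock:
    """Beginning of a file or directory."""
    block_id: int
    name: str                   # name relative to the parent directory
    is_dir: bool
    mode: int                   # permissions and other metadata
    size: int
    children_ids: List[int] = field(default_factory=list)


@dataclass
class ContentsBlock:
    """Raw file bytes; always a leaf."""
    block_id: int
    data: bytes


@dataclass
class LinkBlock:
    """Pure fan-out: pointers to more children, no other data."""
    block_id: int
    children_ids: List[int] = field(default_factory=list)
```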
Figure 3 shows the overall tree that the block mapper maintains for an example layout. Importantly, this is a conceptual representation; at no point does the block mapper store all blocks in memory at once.

4.2 Example tree operations
This section describes how to read, delete, and modify files using PingFS’s tree data structure for the filesystem logically-depicted in Figure 3. Again, these are conceptual and at no point does the block mapper store all blocks in memory at once.
4.2.1 Reading a file
To read the contents of /etc/tmp.txt, PingFS starts at the root head block (block 1) and reads both its children (blocks 5 and 7). Because block 7’s name is “home,” it does not continue reading block 7’s children. PingFS continues reading block 5’s children to find the head block of the file (block 8). After this, it reads all the contents blocks under block 8 (just block 12) to assemble the full contents of the file and returns them.
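In code, the walk looks roughly like the sketch below, which reuses the block dataclasses sketched in Section 4.1. `get_block` stands in for “wait for the block manager’s stream to deliver this block id”; both names are illustrative, not from the actual source.

```python
def lookup(root, path, get_block):
    """Walk head blocks from the root down to the head block for `path`."""
    node = root
    for part in path.strip("/").split("/"):
        for child_id in node.children_ids:
            child = get_block(child_id)
            # Only head blocks carry names; contents/link blocks do not.
            if getattr(child, "name", None) == part:
                node = child
                break
        else:
            raise FileNotFoundError(path)
    return node


def read_file(root, path, get_block):
    head = lookup(root, path, get_block)
    # Concatenate contents blocks left to right (link blocks omitted for brevity).
    return b"".join(get_block(cid).data for cid in head.children_ids)
```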
4.2.2 Deleting a file
If PingFS blocks were only stored in memory, file deletion would be a simple operation: PingFS would modify the children field of the head block associated with the file’s parent directory, removing the entry for the file to be deleted.
This approach does not work for a filesystem whose blocks are network packets that can be reordered, delayed, dropped, or duplicated. As an example, consider Figure 4A, in which a single file, f.txt, is represented by two filesystem blocks (1 and 2), which PingFS issues echo requests for. If a user updates f.txt (Figure 4B) and PingFS mutates block 2, updating its data, PingFS issues a new echo request for a block with id 2. However, if the network duplicated an echo response associated with the previous block 2, then PingFS will receive two echo responses with different data and the same block id. Without any way to distinguish these two responses, PingFS is in an inconsistent state: it continually recycles blocks for both echo responses. Reads to the same file without any intervening writes could produce different results, depending on whether PingFS reads the echo response associated with the original block or the modified block.

PingFS prevents this inconsistency by making all blocks immutable. Figure 5 illustrates an example. When deleting a file, PingFS creates a new block (a block with a never-before issued id) and copies the contents of the original parent directory’s block into the new block, with one change: it updates the children_id field of the new block to exclude the file that was deleted.
It then performs the same update for all ancestor blocks up to the root block. Instead of modifying any existing blocks, this approach creates a copy of an entire branch of PingFS’s filesystem tree.
Figure 5: Steps to delete /etc/tmp.txt in an example PingFS filesystem.
It’s important to note that with this approach, PingFS never needs to copy any file data. However, compared to the first approach, this approach is less efficient: instead of modifying a single block in place, PingFS must create multiple new blocks for a single delete operation. Similarly, because any modification/deletion ultimately affects the root block, PingFS supports only serial modifications/deletes.
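A minimal sketch of this copy-on-write delete is below, again reusing the block dataclasses from Section 4.1. `new_id`, `register`, and `deregister` stand in for block-manager calls, and deregistering the deleted file’s own blocks is omitted for brevity; none of these names come from the actual PingFS source.

```python
import dataclasses


def delete_file(path_blocks, file_id, new_id, register, deregister):
    """path_blocks: head blocks from the root down to the file's parent directory."""
    child_old, child_new = file_id, None  # the deleted file gets no replacement
    # Walk from the parent directory back up to the root, copying each block.
    for block in reversed(path_blocks):
        replacement = dataclasses.replace(
            block,
            block_id=new_id(),  # never reuse an id: blocks are immutable
            children_ids=[cid for cid in block.children_ids if cid != child_old]
            + ([child_new] if child_new is not None else []),
        )
        register(replacement)
        deregister(block.block_id)
        child_old, child_new = block.block_id, replacement.block_id
    return child_new  # id of the new root head block
```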
Section 7 discusses some of the benefits that immutability allows. (Versioned file system; performance improvements by sending out duplicates.)
4.2.3 Modifying a file
Figure 6 shows how PingFS supports file modifications while maintaining immutability. Similarly to deleting a file, PingFS creates new blocks with the updated data and then updates all ancestor blocks.
Figure 6: Steps to modify /etc/tmp.txt.
4.3 Example tree operations from stream
The previous section showed example filesystem operations on a fully-formed tree data structure. However, this tree is a simplification for exposition. At no point does PingFS actually store an entire tree in memory at once. Instead, it relies on a stream of blocks from the block manager to perform operations.
Figure 7 shows a more-complete example of deleting a file from a filesystem.
Figure 7: Example of deleting file /etc/tmp.txt for the filesystem conceptually shown in Figure 3.
5 Drops and corruption
The algorithms described in Section 4 fail spectacularly in the face of drops and corruption. For instance, if an intermediate router drops an echo response associated with the head block of a directory, no subdirectory or file within that directory will ever be readable again by the user.
Most of the experiments presented in Section 6 bounce echo requests off of loopback, which does not drop or corrupt packets. However, in testing, issuing repeated echo requests between two hosts, even on a wired LAN, resulted in a dropped packet within a few seconds.
To address these issues, we could take one of two approaches:
- Mitigate the effects of a drop by replicating data (e.g., issue two pings for each block or use fountain coding to encode blocks across packets)
- Try to prevent the packet drop in the first place
In practice, issuing duplicate pings resulted in lost blocks within ~ten seconds of starting PingFS on the wired LAN described above.
We assume that these drops were due to a queue on the intermediate switch filling up and the switch dropping subsequently sent echo requests or responses. From this, we reason that if we rate limit PingFS so that its echo requests never saturate the intermediate switch’s queue, we will no longer see any drops.
After adding this rate-limiting logic to the block manager, we did not notice any drops on PingFS running either on a wired LAN or when bouncing off of remote hosts (google.com). The current rate-limiter is static: the user must specify the maximum rate at which to send echo requests. Presumably, a future implementation of PingFS’s rate limiter could automatically and dynamically determine stable rates to issue pings while avoiding drops.
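The sketch below shows one simple way such a static limiter could look: a token bucket that the block manager consults before each echo request. This is an illustrative assumption about the implementation, not the actual PingFS code.

```python
import time


class RateLimiter:
    """Token bucket: allow at most `max_requests_per_sec` echo requests."""

    def __init__(self, max_requests_per_sec: float):
        self.rate = max_requests_per_sec
        self.tokens = max_requests_per_sec
        self.last = time.monotonic()

    def wait_for_slot(self) -> None:
        """Block the caller until the next echo request may be sent."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)
```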
Rate-limiting obviously impacts filesystem bandwidth. Additionally, if the user selects an incorrect rate limit or a packet is dropped for other reasons, PingFS fails non-gracefully.
6 Evaluation
The performance of PingFS is limited by three factors:
- The round-trip time of packets used to store filesystem blocks
- CPU processing time to create, validate, and re-issue packets
- Filesystem structure and contention
We briefly analyze performance characteristics of the PingFS implementation and then highlight each of these factors separately through microbenchmarks. Finally, we conclude our evaluation with a few more-realistic benchmarks.
Unless otherwise noted, the experiments described in the following section were run on an i7-4770 CPU @ 3.90GHz running Ubuntu 16.04 (kernel 4.4.0-104-generic), and all pings were sent to the loopback interface without any of the rate limiting mentioned in Section 5 enabled.
6.1 Basic PingFS performance analysis
| Mean | Median | Std Dev | 5 %-ile | 95 %-ile | n |
|---|---|---|---|---|---|
| 0.0257ms | 0.0280ms | 0.00523ms | 0.0100ms | 0.0300ms | 995 |
Table 1: RTT statistics for direct pings to loopback, as reported by call to ping utility.
Opening and reading the contents of a small file in the root directory should require three sequential block reads:
- An initial read to the file’s head block in response to a stat call; this returns the file’s permissions and other metadata
- A subsequent read to the file’s head block to respond to a read call; PingFS uses the head block to identify the file’s content blocks to load
- A final read to the file’s content block. Because the file is small, no intervening link blocks need to be read to service the read request.
As shown in Table 1, a ping to loopback takes ~0.028ms on our test setup. Assuming negligible CPU overhead, a lower bound for the time to read a small file should therefore be ~0.084ms (three sequential RTTs).
However, as shown in the first row of Table 2, we actually see a median read latency of 0.265ms for small files, more than three times our lower bound.
| Added delay in each direction | Mean | Median | Std Dev | 5 %-ile | 95 %-ile | n |
|---|---|---|---|---|---|---|
| 0 ms | 0.260 ms | 0.265 ms | 0.0474 ms | 0.204 ms | 0.319 ms | 1000 |
| 1 ms | 4.99 ms | 4.23 ms | 1.04 ms | 4.11 ms | 6.39 ms | 1000 |
| 2 ms | 10.1 ms | 8.54 ms | 2.11 ms | 8.13 ms | 12.8 ms | 1000 |
| 4 ms | 20.3 ms | 16.6 ms | 4.28 ms | 16.2 ms | 24.9 ms | 1000 |
Table 2: Time running with artificial loopback RTT delay. Note that added delay impacts both echo requests and echo responses, i.e., an added delay of 1ms should increase the RTT by 2ms (delaying a request by 1ms and a response by an additional 1ms).
There are three potential causes for this increase:
- Additional overhead of FUSE
- Cost of additional file system operations (serializing/deserializing blocks, etc.)
- Poor implementation/library overhead compared to the raw ping command
To distinguish these, we replace the block manager with an in-memory version. The in-memory version of PingFS retains all FUSE calls and additional file system operations, but removes any costs from a poor implementation or library overhead in the block manager.
In-memory PingFS runs the same repeated 100B file read with a median latency of 0.0510ms. Even after discounting the unavoidable loopback RTTs (0.084ms) from the 0.265ms median latency in the first row of Table 2, the in-memory version is substantially faster, suggesting that a future implementation could be tuned for better performance.
However, as discussed in the next section, potential performance benefits from block manager improvements are likely negligible when compared to latency arising from echo request/response RTTs.
6.2 Increased RTT
Single operations can require reading multiple blocks, with delays up to almost a full RTT for each one.
As a result, PingFS performance is very sensitive to RTT. Table 2 demonstrates this dependency, repeating the same small-file read experiment described above while artificially imposing a delay on loopback traffic using Linux’s tc command.
As expected, for each doubling of artificial delay in rows 2-4 of Table 2, we see a corresponding doubling in median latency.
The latency change from no artificial delay (Row 1) to 1ms of artificial delay (Row 2) also matches expectations. Remember from Section 6.1 that each run of this experiment requires three block reads. We expect to wait anywhere from 0-2ms for the first head block read. Because our next read is also for the head block and follows almost immediately after the previous one, we expect to wait a full 2ms (one RTT) for that head block to be recirculated to the block manager. Finally, we expect to wait anywhere from 0-2ms to read the content block, finishing a run of the experiment.
In contrast, with no artificial delay these waits are almost instantaneous. On average, then, we’d expect each run to take approximately 4ms longer with 1ms of artificial delay than with none, and we see this pattern when comparing the medians of the two rows (0.265ms vs. 4.23ms). However, we do not see this exact separation when comparing the means of each row; we suspect the reason is that we ran our experiments sequentially rather than truly independently.
The random wait for the initial head block and content block likely also explains the increase in standard deviation we see across rows as RTT grows. (Remember this for Section 7.3.1, Optimization: duplicate pings!)
6.3 Effects of CPU
PingFS is not only IO bound, but also CPU bound: PingFS uses the CPU to generate and read echo requests and responses; serialize and deserialize filesystem blocks; and generally respond to filesystem requests through the FUSE API.
To demonstrate this dependence, we run an experiment, pinning a PingFS process to a single core and scaling that core’s frequency. Like in the previous experiment, we issue pings to loopback and measure the time to repeatedly read a small file in the root directory.
Table 3 presents the results of this experiment. PingFS performance scales non-linearly with CPU frequency. In particular, we see two distinct regimes. In the first (800MHz-2000MHz), increased clock frequencies lead to strong latency improvements (e.g., between 800MHz and 1200MHz, mean latency decreases by 1.10ms). In the second, latency decreases more slowly with increases in CPU frequency (e.g., mean latency only decreases by 0.067ms between 3200MHz and 3600MHz). Presumably, in this regime RTT begins to dominate overall processing time, reducing the effects of CPU improvements.
| MHz | Mean | Median | Std Dev | 5 %-ile | 95 %-ile | n |
|---|---|---|---|---|---|---|
| 800MHz | 2.85ms | 1.46ms | 4.07ms | 1.15ms | 9.33ms | 1000 |
| 1200MHz | 1.75ms | 0.947ms | 2.84ms | 0.764ms | 7.26ms | 1000 |
| 1600MHz | 1.30ms | 0.710ms | 2.17ms | 0.578ms | 5.86ms | 1000 |
| 2000MHz | 0.970ms | 0.563ms | 2.05ms | 0.456ms | 4.15ms | 1000 |
| 2400MHz | 0.885ms | 0.473ms | 2.22ms | 0.382ms | 4.01ms | 1000 |
| 2800MHz | 0.688ms | 0.402ms | 1.66ms | 0.325ms | 0.446ms | 1000 |
| 3200MHz | 0.634ms | 0.358ms | 1.84ms | 0.337ms | 0.392ms | 1000 |
| 3600MHz | 0.567ms | 0.313ms | 1.77ms | 0.299ms | 0.340ms | 1000 |
Table 3: Effect of running PingFS at different CPU frequencies. The PingFS process gets pinned to a single CPU, and we set the allowed frequencies for that CPU.
6.4 Filesystem structure and contention
PingFS uses immutable blocks. Because of this, as shown in Figure 6, updating even a single byte in a single file can require a cascade of block updates, from the block storing that file’s contents all the way to the root directory’s head block. Similarly, because of PingFS’ tree structure, any read of a file requires sequentially walking all blocks from the root’s head block to that file’s content blocks.
This means that the performance of an operation on a file in PingFS depends not just on the operation, but also the file’s location in the filesystem: if it is close to the root directory, operations on it will proceed faster than if it is farther away.
To demonstrate this effect, we run an experiment in which we read a small file located in deeper and deeper subdirectories of a PingFS filesystem, issuing requests to localhost with a random delay of 3-5ms per packet.
We introduce this delay for two reasons. First, each additional subdirectory requires sending and servicing additional pings. Adding a packet delay ensures that CPU utilization stays relatively low even as we add more directories. This separates the performance impact of file location (the effect we’re trying to measure) from the additional CPU work necessary to service additional directories. Second, randomizing delay reduces the effects of packet synchronization[2].
| Nesting depth | Mean | Median | Std Dev | 5 %-ile | 95 %-ile | n |
|---|---|---|---|---|---|---|
| 0 | 17.2ms | 16.6ms | 2.51ms | 14.7ms | 24.3ms | 1000 |
| 1 | 21.7ms | 23.0ms | 5.40ms | 15.2ms | 31.9ms | 1000 |
| 2 | 29.1ms | 30.0ms | 7.98ms | 15.8ms | 41.8ms | 1000 |
| 3 | 38.5ms | 39.0ms | 9.65ms | 24.2ms | 54.9ms | 1000 |
| 4 | 47.5ms | 47.0ms | 11.6ms | 32.0ms | 67.0ms | 1000 |
| 5 | 58.3ms | 56.6ms | 14.0ms | 39.7ms | 83.1ms | 1000 |
| 6 | 69.0ms | 65.7ms | 17.1ms | 47.3ms | 98.8ms | 1000 |
| 7 | 81.2ms | 76.6ms | 21.4ms | 55.4ms | 121ms | 1000 |
| 8 | 94.6ms | 87.7ms | 29.5ms | 63.3ms | 146ms | 1000 |
| 9 | 109ms | 98.1ms | 36.4ms | 71.9ms | 179ms | 1000 |
Table 4: Effect of location on time to read contents of a file. Added RTT of 4ms +/- 1ms.
Table 4 presents the results of this experiment. As expected, latency increases approximately linearly with nesting.
7 Interesting extensions
PingFS is a filesystem whose immutable blocks are stored in the payloads of echo request and response packets traversing the network. This design allows for a series of interesting extensions, which we explain in three subsections:
- Checkpointed filesystem
- Distributed filesystem
- Unimplemented, half-baked additional thoughts
7.1 Checkpointed filesystem
As described in Section 4, PingFS blocks are immutable. When an operation updates the filesystem, PingFS creates new blocks and updates the pointer to its root block to point at the new structure. It then directs its block manager to stop recirculating echo requests and responses for blocks that are no longer reachable from the root block.
Assume, however, that the block manager instead continues recirculating these blocks and that PingFS saves the block id of the old root block. In this case, no data are lost: both the contents of files before the update operation and the structure of the filesystem are preserved. By replacing the post-update root block id with the pre-update root block id, we can restore the filesystem to the point before the edit.
We have added this checkpoint/restore functionality in a crude way to PingFS: if a user creates a file whose name is prefixed with “__checkpoint_”, PingFS stops removing blocks and saves the current filesystem’s root block id. If the user subsequently deletes the file, PingFS reverts the filesystem back to its state when that file was created.
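The hook amounts to roughly the following sketch; the variable and function names are illustrative assumptions, not the actual PingFS code.

```python
checkpoints = {}          # checkpoint file name -> root block id at creation
current_root_id = 0       # updated by normal filesystem operations


def on_create(name: str) -> None:
    if name.startswith("__checkpoint_"):
        # Remember today's root; garbage collection must not deregister any
        # block still reachable from a remembered root id.
        checkpoints[name] = current_root_id


def on_delete(name: str) -> None:
    global current_root_id
    if name in checkpoints:
        current_root_id = checkpoints.pop(name)  # revert to the saved tree
```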
There are two things worth highlighting in PingFS’s checkpointing:
- More than anything else, checkpointing leverages the immutability of blocks. If PingFS instead used mutable blocks, it is unclear how we would have ensured the consistency of checkpointed versions in the face of multiple file operations. Underscoring this point, given the basic PingFS implementation with immutable blocks, adding checkpointing literally took ~150 lines of additional code.
- Preserving references to immutable blocks no longer in use and preventing their garbage collection is not some new algorithm that we discovered. This approach to checkpointing is very similar to how ZFS “snapshots” work.
7.2 Distributed filesystem
A simple extension to PingFS allows it to run as a distributed filesystem where multiple machines mount a common filesystem, performing operations on shared files and directories. When building this distributed filesystem extension we had to answer two questions:
- How do all n machines that mount the distributed filesystem access blocks?
- What form of consistency does the filesystem provide and how does it provide it?
We deal with each separately in describing the distributed PingFS extension.
7.2.1 How do all n machines access blocks?
RFC 792 requires machines that receive echo requests to craft an echo response destined for the IP address in the request’s IP source field. This requirement creates a simple way to extend PingFS so that a single PingFS filesystem can be concurrently mounted on n distributed machines.
In single-node PingFS, when a new filesystem block is created, the block manager sends a single echo request. In distributed mode, we change this logic so that when a new filesystem block is created, the block manager sends n echo requests. The first is sourced from the IP address of the sending machine. The other n-1 are sent with spoofed IP addresses corresponding to the n-1 other machines the PingFS filesystem is mounted on.
This way, we can guarantee that the blocks created on any one of the n machines will always be sent to every other machine (assuming no corruption or drops). These machines can then recirculate the blocks using their standard logic.
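A rough sketch of this fan-out using scapy is below. Spoofing the source address requires raw-socket privileges and a network path that does not filter spoofed packets; the peer addresses, bounce host, and function name here are assumptions for the example, not values from the actual implementation.

```python
from scapy.all import ICMP, IP, Raw, send

PEERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # every machine mounting the FS,
                                               # including this machine's own IP
BOUNCE_HOST = "10.0.0.42"                      # host the blocks bounce off of


def broadcast_block(payload: bytes) -> None:
    for peer in PEERS:
        # The echo reply will be addressed to `peer`, not necessarily to us.
        send(IP(src=peer, dst=BOUNCE_HOST) / ICMP(type=8) / Raw(load=payload),
             verbose=False)
```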
7.2.2 Consistency
Distributed PingFS uses the same block immutability and update mechanics described in Section 4, with one key difference.
Single-node PingFS only updates the root block pointer as a result of a local filesystem operation. In contrast, distributed PingFS may need to update its root block pointer in response to a filesystem change initiated on a remote node.
The mechanism that we choose for this update is important: designed poorly, the filesystems on two nodes could be forever out of sync; if one filesystem always showed different content than the other, it would defeat the purpose of having a distributed filesystem in the first place. Similarly, if we chose an algorithm naively (e.g., only accepting root block changes from one node), one machine’s operations could unfairly clobber all others’.
Our solution is simple. Each machine maintains a monotonically-increasing counter. A machine increments this counter by one and assigns its value to the high-order bits of the id of any new block it creates. If the machine receives a block from another machine with an id whose high-order bits are greater than its local counter, it updates the counter to the newer value. To ensure unique block ids, we assign each machine attached to a distributed PingFS filesystem a unique, integral id; the low-order bits of the id of any block created on a machine are set to this machine id.
When a machine receives an echo response containing a root block, it examines the response’s block id. If that id is greater than the machine’s current root block id, the machine updates its root block id to match the new root block.
Because block ids are totally ordered, this ensures eventual consistency: all machines will eventually update to the same root block (assuming no drops/corruption). Because each machine updates its id-generating counter when it sees larger block ids, assuming identical network conditions and workloads, during contention all nodes have a practically equal[3] chance of their root-block updates’ winning. Finally, because ids correlate with freshness, this algorithm provides the desirable property that more recent filesystem changes tend to be preserved.
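The id scheme and root-update rule amount to only a few lines; the bit width and names below are illustrative assumptions.

```python
MACHINE_BITS = 8          # low-order bits reserved for the machine id
machine_id = 3            # unique, fixed id assigned to this machine
counter = 0               # monotonically-increasing high-order bits of new ids


def new_block_id() -> int:
    global counter
    counter += 1
    return (counter << MACHINE_BITS) | machine_id


def observe_block_id(block_id: int) -> None:
    """Advance our counter when another machine's ids have moved further ahead."""
    global counter
    counter = max(counter, block_id >> MACHINE_BITS)


def next_root_id(current_root_id: int, received_root_id: int) -> int:
    # Larger id wins; the machine id in the low bits breaks exact-counter ties.
    return max(current_root_id, received_root_id)
```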
7.2.3 One more thing
Block immutability and the update algorithm described in Section 4 ensure that any operation that mutates a PingFS filesystem is only made visible to other machines after it has fully completed. Partial states are never visible to other machines (atomicity), guaranteeing the integrity of the filesystem; even if a node crashes in the midst of a filesystem change (e.g., deleting a directory, adding a file), all other nodes are still able to continue operating on a well-formed version of the filesystem.
7.2.4 Unresolved questions
The above implementation ignores the question of how to deregister blocks that are no longer in use. Probably the most reasonable approach is for a machine to send control messages to the other machines (potentially via pings) indicating that they should delete a given block. Following this, the machine would additionally send messages to delete the children of that block, their children, etc.
7.3 Unimplemented, half-baked additional thoughts
PingFS was a fun side project. It gave me a chance to play with some cool code and APIs. There are some neat extensions that I unfortunately did not and likely will never get to. This section discusses a few of them.
7.3.1 Optimization: duplicate pings
Table 2 shows that increases in RTT tend to increase latency for filesystem operations. This is because PingFS issues a single echo request per block. As a result, a block cannot be read for a full RTT after it is created or recirculated. However, there is no fundamental reason that PingFS should only generate one echo request per block.
If PingFS generates two echo requests per block, one when the block is initially created and a second half an RTT later, then instead of waiting up to a full RTT to access a block, PingFS waits at most half an RTT, speeding up filesystem operations. Practically speaking, a basic approach for issuing multiple requests per block requires few algorithmic changes: when a block is initially created, multi-request PingFS issues an echo request, waits a user-specified amount of time, and then issues an additional, identical echo request. All other logic is the same. Because blocks are immutable, multi-request PingFS does not require a complex algorithm to version or update existing blocks.
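A minimal sketch of the staggered second request is below; `send_echo_request` stands in for the block manager’s normal transmit path, and the stagger value is a user-supplied assumption.

```python
import threading


def register_block(payload: bytes, send_echo_request, stagger_s: float = 0.05) -> None:
    send_echo_request(payload)                 # first copy, sent immediately
    # Second, identical copy roughly half an RTT later.
    timer = threading.Timer(stagger_s, send_echo_request, args=(payload,))
    timer.daemon = True
    timer.start()
```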
7.3.2 Partial-caching
All PingFS data other than the root block are stored in echo request and response packets. However, as alluded to in Section 6.1, the core data structures of PingFS work even if we store all blocks in memory instead of in packets. Reading a small file from the root directory of an in-memory version of PingFS has a median latency of 0.0510ms, compared to a median of 0.265ms for reading the same small file from a PingFS filesystem that issues echo requests to localhost. As the file gets larger or more deeply nested, or if PingFS issues requests to a more-distant host, we expect the performance discrepancy between the in-memory and echo request/response versions of PingFS to grow.
Taking an alternate, hybrid approach in which some blocks are stored in packets in transit and some are stored in memory could improve PingFS performance while preserving some of the curiosity of a ping-based filesystem. I did not implement this optimization, and therefore do not know its precise performance contours. The exact benefit likely depends heavily on the strategy for deciding which blocks to store in memory versus in packets: if we only store infrequently-accessed data in memory, while frequently-accessed data must be read from packets, we would expect no benefit from caching.
7.3.3 Multiple servers
The previous section discussed caching frequently-used blocks in-memory. We could extend this caching concept beyond a single box to think about a “network cache.” If we know that a block is frequently needed, we can issue an echo request to a “nearby” host in the network. The corresponding echo response returns sooner than that of more distant hosts and the block can be accessed quickly. In contrast, blocks that are likely to be infrequently accessed can be sent to distant hosts. This approach would improve CPU efficiency: because infrequently-used blocks have a longer round-trip, they are more rarely processed on the local CPU, leaving additional cycles for more-frequently used blocks.
[1] Tying filenames directly to blocks was, in retrospect, a bad decision. It ended up meaning that I couldn’t symlink files and do a few other things that I might have liked to.
[2] In retrospect, I realize that I could probably also have just added unrelated files to balance out the additional work. That would probably have been a better comparison; too bad I didn’t think of it earlier!
[3] I realize that using fixed lower bits to break ties does not guarantee true fairness. However, this unfairness only matters as a tiebreaker. If it made any practical difference, we could introduce a consensus algorithm between nodes that periodically updated each node’s lower bits, giving each a period during which it could win ties.