
Sunday, December 21, 2003

Filesystem Brain Damage for FoxFS


Copy-on-Write File Size Reduction


So, I was thinking about filesystems, and compression, and fork() implementations, and I suddenly came up with the idea of Copy-on-Write File Size Reduction (CWrFSzR, "CWeer Seizure"). The basic concept is that of copy-on-write: two independent pieces of data just happen to be identical, and thus exist in the same space. When one changes independently of the other, the other must remain the same. The solution is to copy the data before changing it, thus creating two copies.
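
To make the rule concrete, here's a minimal sketch in C of copy-on-write over a refcounted in-memory segment. All names here are mine for illustration, not part of the FoxFS spec, and error handling is omitted:

#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory model of a shared data segment. */
struct segment {
    int    refcount; /* how many owners point at this data */
    size_t len;
    char  *data;
};

/* Before writing, break the share: if anyone else still points at
   this segment, copy the data and write to the private copy. */
struct segment *cow_prepare_write(struct segment *seg)
{
    struct segment *copy;

    if (seg->refcount == 1)
        return seg; /* sole owner: safe to write in place */
    copy = malloc(sizeof *copy);
    copy->refcount = 1;
    copy->len = seg->len;
    copy->data = malloc(seg->len);
    memcpy(copy->data, seg->data, seg->len);
    seg->refcount--; /* the other owners keep the old data */
    return copy;
}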

The concept could only be efficiently applied to a filesystem with an active background scanner that runs a heuristic to identify identical segments of files, fragments the files around them, then combines the two fragments into one and frees the other. CWr fragments would be marked for copy-on-write, so encountering a copy-on-write fragment during a write to a file (delete, overwrite, truncate) would be handled without any driver enhancements beyond those that let the driver recognize and act on the marking. Exact duplicates of files (those made with cp) would effectively (not literally) be hard links (a hard link is a directory entry that points at an inode) to the original data (not the original inode) until one of the copies was altered.
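
The scanner might find candidates along these lines (a sketch under my own assumptions; the segment size and hash choice are placeholders, not anything from a spec). Matching hashes only nominate a pair; a byte-for-byte compare must confirm before the merge:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 4096 /* assumed segment granularity */

/* Cheap FNV-1a hash used only to find *candidate* duplicate segments. */
static uint64_t seg_hash(const unsigned char *seg)
{
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    size_t i;

    for (i = 0; i < SEG_SIZE; i++) {
        h ^= seg[i];
        h *= 1099511628211ULL; /* FNV prime */
    }
    return h;
}

/* Hash collisions happen; only merge when the bytes really match. */
static int segs_identical(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, SEG_SIZE) == 0;
}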

I'll mull over how to do this safely later. It'll sit beneath compression, though, and act on the original data. I think I'll inherit the compression rules by compressing the segment if it would normally be compressed by any of the inodes it belongs to. The count of CWr links pointing at a segment will need to be part of the file data itself if CWr acts directly on the data (the fast way, but with an expensive worst case). If CWr instead points at another inode, the best/worst-case scenarios get pretty bad (it'd be like seeking to multiple places in many files, rather than in one), so I think I'll try to find an efficient way to keep the CWr data in the actual file data, and then write it into the specs.
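
If the link count lives with the data itself (the fast way above), the on-disk layout might look something like this; the field names and widths are guesses for illustration, not the FoxFS format:

#include <stdint.h>

/* Hypothetical header stored in front of a CWr-marked segment. */
struct cwr_seg_header {
    uint32_t cwr_links; /* how many inodes reference this segment */
    uint32_t flags;     /* e.g. compressed, per the inherited rule above */
    uint64_t seg_len;   /* payload length in bytes */
    /* segment data follows */
};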

What I may do is turn the filesystem into an object-oriented filesystem, which will give me much more flexibility at the cost of a little overhead. Of course, at sizes of 1 TB and up, we're talking gigabytes of potential overhead (with 4 billion files, each byte of per-file overhead costs 4 gigabytes of space; and I'm going for >32-bit inode numbers natively), but I hope I can cancel it out.

Tuesday, December 16, 2003

File Translation Lookaside Buffer


Pseudocode


The flowchart from this morning has been expanded, and translates to the following pseudocode:


char *fs_seek(pid_t process, long fileHandle, uint64_t spos)
{
    uint64_t fPageIdx;

    if (isReadOpen && spos > fileLen) {
        sendEOF(process, fileHandle); /* seeking past EOF on a read handle */
        return NULL;
    }
    fPageIdx = spos / fPageSz; /* get what page we are seeking into */
    if (fPageIdx == currentPage) /* already on this page: no switch needed */
        return (spos - (fPageIdx * fPageSz)) + ptrToPage(fileHandle, fPageIdx);
    if (findPgEntry(fileHandle, fPageIdx)) { /* page is already in the fTLB */
        switchFpage(fileHandle, fPageIdx);
        return (spos - (fPageIdx * fPageSz)) + ptrToPage(fileHandle, fPageIdx);
    }
    if (openForWrite && fPageIdx > fileLen / fPageSz) /* writing past EOF page */
        createBlankPage(fileHandle, fPageIdx);
    else
        getPageFromSvr(fileHandle, fPageIdx);
    switchFpage(fileHandle, fPageIdx);
    return (spos - (fPageIdx * fPageSz)) + ptrToPage(fileHandle, fPageIdx);
}

void switchFpage(long fileHandle, uint64_t fPageIdx)
{
    uint64_t oldPage = currentPage;

    setCurrentPage(fileHandle, fPageIdx);
    if (fork() == 0) { /* child writes back the old page in the background */
        if (pageAltered(fileHandle, oldPage))
            flushPageToSvr(fileHandle, oldPage);
        exit(0); /* the child always exits, altered or not */
    }
}

void getPageFromSvr(long fileHandle, uint64_t fPageIdx)
{
    struct fpage fpageent;

    fpageent = getPageRemote(Server, fileHandle, fPageIdx);
    if (isFullFTLB(fileHandle))
        killFirstUsedFTLB(fileHandle); /* evict the least recently used fpage */
    addToFTLB(fileHandle, fpageent);
}


You'll notice the flowchart explains the logic of the buffering fairly well, while the pseudocode explains the logic of the implementation (code) in more useful detail than the flowchart. It also *forces* me to take into account how I must structure functions. Anyway, point is, that's my pseudocode. I prefer pseudocode for programming, and flowcharts for processes. Why do I say this here? BECAUSE IN PROGRAMMING CLASS THEY FORCED US TO DIAGRAM OUR PROGRAMS WITH FLOWCHARTS. Think about it. A program of 10-15 lines, in 10-15 flowchart icons, sometimes more icons than lines of code (Program Start, Program End). Flowcharts should be used to figure out WHAT you're going to do, not HOW you're going to do it.

File Translation Lookaside Buffer


Preliminary Flowcharts


I'm working on the fTLB flowcharts to show how this is going to work. Keep in mind that this is a general logical flow, and that this will be implemented with a mesh of functions. Right now, I've got the request for data (SEEK operation in read/write mode) done, and it is as below.


Monday, December 15, 2003

ANL NFS


I'm still debating what to name my network filesystem. The two names it came down to were "Advanced Network As Local FS" and "Advanced Network Utilization and Security FS." So, ANALFS/ANLFS or ANUSFS. I'm still debating whether I should go with ANAL/ANL or ANUS. Both describe my design goals -- an advanced network filesystem with full equivalence to the server-side security, and performance that can almost match a local hard drive when the network link has an equivalent transfer rate. Thus, it's a toss-up.

Well, I'll be in #linux and #research_group on irc.freenode.net if anyone wants to comment. And yes, I know the docs talk about mounting anus right now.

--Bluefox Phoenix Lucid

Sunday, December 14, 2003

I got a blog


I now have a blog. Great. I will blog here for now. This Blogger thing is nice but it looks like crap. Fair level of complexity though, and it appears to work fairly well, so I have to give 'em credit.

Network Filesystems


First order of business: NFS. NFS is the living incarnation of brain damage. Let me show you something.

---FIRENET TOPOLOGY---
[FlameRouter]
|
|
[FlameSwitch]
| | | | | | | | | |-[InfernoSvr]
| | | | | | | | |-[FlameBox]
| | | | | | | |-[FlameChip]
| | | | | | |-[FlameSlab]
| | | | | |
| | | | |
| | | |
| | |
| |
|

Now, let's say the InfernoSvr shares /home over NFS. Users are firefox, flamedrake, and burner.

Watch this.

---FIRENET TOPOLOGY---
[FlameRouter]
|
|
[FlameSwitch]
| | | | | | | | | |-[InfernoSvr]
| | | | | | | | |-[FlameBox]
| | | | | | | |-[FlameChip]
| | | | | | |-[FlameSlab]
| | | | | |
| | | | |
| | | |
| | |
| |
|-[Icechip]

I've snuck Icechip, my laptop, into FIRENET. Now watch this. Let's say FlameSlab is a laptop, and we disconnect it.


---FIRENET TOPOLOGY---
[FlameRouter]
|
|
[FlameSwitch]
| | | | | | | | | |-[InfernoSvr]
| | | | | | | | |-[FlameBox]
| | | | | | | |-[FlameChip]
| | | | | | | [FlameSlab]
| | | | | |
| | | | |
| | | |
| | |
| |
|-[Icechip]+FlameSlab{ip,mac,hostname}

I've used ifconfig to change my IP and MAC address. Harly went home with her laptop. Now, useradd a few users *taptaptap* local authentication *clickity click* and . . . bingo. `mount InfernoSvr:/home /mnt/home -t nfs`. Click. Bingo. Now, I can steal Harly's stuff *clickity click* `su firefox -c "cp -pR /mnt/home/firefox /home/firefox" && chown -R bluefox.users /home/firefox && mv /home/firefox /home/bluefox`. Bye guys. Try tracing that call.

Okay, so you could trace where I was based on switch port, but what about with wireless? Thought so. Besides, there are drops EVERYWHERE; I can get in. Harly won't notice because HEY! GUESS WHAT! I didn't log into anything but my laptop as firefox, so the PAM immediate auditing data will look just as she expects.

NFS is useless in all production environments because it provides absolutely no security: the server trusts whatever hostname, IP, and UID the client claims to have. I've heard complaints that NFS is slow as well, and I KNOW it doesn't go over secure socket layer.

I have a copy of the NFS specs. I plan to duplicate its functionality, but enhance its security and hopefully its speed. The authentication handshakes are already being planned and plotted, and impose minimal overhead. There will be client-side security evaluation acceleration, which costs some CPU overhead but saves network overhead when access is denied, since there is no need to query the server. However, the server will still process all requests through all server-side security systems, even those not supported on the client and those already checked on the client, to ensure the exact same security benefits as local access to the share. The system's current name is Tailwave. As it stands, all actions are required to be performed over secure socket layer; the final revision will be more flexible, and will allow server-evaluated security measures over any encryption pipe, requiring only the assurance of at least one common connection method. It may even allow IPsec to be used if it can detect that the connection is occurring over IPsec.
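
To sketch the client-side evaluation idea (all names here are hypothetical; the wire protocol isn't designed yet): the client may deny early using permissions it already knows, saving a round trip, but it can never grant on its own. A locally-granted request still goes to the server, which re-runs every check.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical access request. */
struct tw_request {
    uint32_t uid;
    uint32_t mode_wanted; /* rwx bits requested, as in st_mode */
};

/* Client-side pre-check against cached Unix permission bits.
   Group class omitted to keep the sketch short. */
bool tw_client_precheck(const struct tw_request *req,
                        uint32_t owner_uid, uint32_t mode_bits)
{
    uint32_t granted = (req->uid == owner_uid)
                         ? (mode_bits >> 6) & 7 /* owner class */
                         : mode_bits & 7;       /* other class */
    return (req->mode_wanted & ~granted) == 0; /* false: deny locally */
}

/* Usage: if the pre-check fails, deny without touching the network;
   if it passes, send the request and let the server decide for real. */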

As for performance, I'm working on it. General buffering techniques should suffice, but I'm thinking of imitating the TLB on i386 CPUs and CPU caching; i.e., instead of buffering X bytes ahead, have 4k or 8k "pages", and when you "fault" across them, send down the whole block. The advantage here is that I could combine standard "stay X bytes ahead" buffering with the fault concept, and simply make sure I always have the next page buffered on the client already. I know imitating a completely different system designed for a completely different application is in itself considered brain damaged, but the idea is nifty and may work.

Here's my thinking. On the CPU, you have a Translation Lookaside Buffer, which stores the virtual memory mappings for VM pages. When the executing code faults across these pages, the CPU looks up the entry in the TLB where the new address exists. It then pulls the entire 4k page into CPU cache, and continues execution. When the entry isn't in the TLB, the OS is queried, the page table entry is sent to the CPU, and if the TLB is full, the least recently used entry (the one that hasn't been accessed for the longest time) is discarded.

I could do something similar for Tailwave. On the client, you have a File Translation Lookaside Buffer for each file, which stores the pointer mappings from buffered pages of the file to the file's offsets; call these fpages. The server has a copy of this as well. When an fpage is altered on the server, the client is alerted immediately. The server then holds all writes to that fpage in a buffer for a given time period, and then flushes the fpage to the client, UNLESS the client accesses the fpage, in which case the client will request an immediate sync. When the fpage is altered on the client, the same process occurs in reverse, with the client notifying the server that the fpage has changed, except that the fpage will be flushed to the server after either a given time period of inactivity, or after an fpagefault (file pagefault) out of the page occurs.
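
Sketched as client-side write-back logic (hypothetical names; the timing policy is a placeholder): a dirty fpage is flushed after a period of inactivity, or immediately when the process faults out of it.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define FLUSH_IDLE_SECS 5 /* assumed write-back delay */

struct fpage_state {
    uint64_t idx;        /* which fpage of the file */
    bool     dirty;      /* altered locally since the last sync */
    time_t   last_write; /* when it was last altered */
};

/* Called periodically, and on every fpagefault out of a page. */
bool fpage_should_flush(const struct fpage_state *p,
                        bool faulted_out, time_t now)
{
    if (!p->dirty)
        return false;
    if (faulted_out) /* leaving the page: flush to the server now */
        return true;
    return (now - p->last_write) >= FLUSH_IDLE_SECS; /* idle timeout */
}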

Also, when an fpagefault occurs (a process seek()s to a different page via fread() or seek() or some other method), the fTLB on the client scans through to see if that fpage is buffered, and if so, feeds it to the process as the fpage, without retrieving it from the server. If the fpage is NOT buffered, the client requests it from the server and makes an fTLB entry. If the fTLB fills up, the least recently used fpage is flushed and freed, the server is notified that the page no longer needs to be synced to the client, and the entry is removed.
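
The fTLB itself could be a small fixed-size table with least-recently-used eviction, something like this (a sketch; the entry count and fields are my assumptions):

#include <stdint.h>
#include <stdlib.h>

#define FTLB_ENTRIES 64 /* assumed table size per file */

struct ftlb_entry {
    uint64_t fpage_idx; /* which fpage this slot buffers */
    char    *data;      /* the buffered fpage contents */
    uint64_t last_used; /* monotonic counter for LRU eviction */
    int      valid;
};

struct ftlb {
    struct ftlb_entry e[FTLB_ENTRIES];
    uint64_t          clock; /* bumped on every hit */
};

/* Return the buffered fpage, or NULL so the caller can fetch it
   from the server, evict the LRU slot if full, and insert it. */
static char *ftlb_lookup(struct ftlb *t, uint64_t fpage_idx)
{
    int i;

    for (i = 0; i < FTLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].fpage_idx == fpage_idx) {
            t->e[i].last_used = ++t->clock; /* mark most recently used */
            return t->e[i].data;
        }
    }
    return NULL; /* miss: getPageFromSvr(), then add an entry */
}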

The added trick here is the File Translation Lookaside Buffer Lookahead Buffering (fTLBLB) scheme. When an fpage is faulted to, the client tries to ensure that a given number of pages ahead of it are also entered in the fTLB. So, if the fpage at 16k (16k-20k) is faulted to, and the fTLBLB count is 2, then the fpages at 20k and 24k (20k-24k and 24k-28k) are faulted across as well, into the fTLB, if they aren't already buffered. This causes the entire 16k-28k range to be buffered. Upon faulting into the 20k range, the 28k fpage (28k-32k) will be faulted in. The purpose of this is that if the network lags during something such as dd, it should be able to catch up relatively quickly (THEY'RE ONLY 4k!).
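
The lookahead step is then a small loop on top of the fault path, building on the fTLB sketch above and the pseudocode's getPageFromSvr() (again, just an illustration):

/* After faulting to fpage idx, make sure the next `count` fpages
   are buffered too; this is the fTLBLB scheme. */
void ftlblb_readahead(struct ftlb *t, long fileHandle,
                      uint64_t idx, int count)
{
    int n;

    for (n = 1; n <= count; n++) {
        if (ftlb_lookup(t, idx + n) == NULL)     /* not buffered yet */
            getPageFromSvr(fileHandle, idx + n); /* fetch and insert */
    }
}

With a count of 2 and 4k fpages, a fault to the fpage at 16k pulls the 20k and 24k fpages in behind it, which is exactly the 16k-28k range described above.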

fTLB could possibly support my goal of pushing all data over a single network pipe (one socket connection), including all authentication, control, and file data. I do not intend to hold to the single network pipe goal if it becomes a performance issue.

The work on this Networking Filesystem will continue at http://foxfs.sf.net/anlfs/anlfs.txt as a subproject of FoxFS.

--Bluefox Phoenix Lucid
