<$BlogRSDUrl$>

Saturday, January 31, 2004

OOFS


Filesystems Revisited



I still haven't reworked my disk based FS into an object oriented FS. Object Relational Location Filesystem (ORLFS) will work as follows though:

Header Area
|--Superblock
|----Address of Object 0
|------Block containing Object 0
|------Byte Offset in block
|----Object for Beginning of Inode List
|--Journal List
|----Objects that function as Concatonated Journals, in order
Data Area
|--Objects

The idea is to allow objects to freely move around. Objects are referenced by their object id, which is used to locate the block and byte offset in the block at which the object resides. The object itself has its object ID embedded in its header, as well as the object ID of the previous and next physical objects. The access methods required to use this costs disk reads and seeks, but it provides a large amount of flexibility. Filesystems can easily be fscked, resized, defragmented, cyphered, encrypted, and compressed.

The entire disk is used by objects. Freespace is free-marked objects. Directories are directory-marked objects. Files are file-marked objects. Objects are split and redone as necessary, usually this would be free space objects losing pieces and/or becoming multiple objects. Objects do not have to be in order on the disk.

Here is an example:

{0000: Object list}{001A: Free space.................................}{0CF4: / inode directory entry}{0001: /boot/bzImage-2.6.1..................................}{002A: journal chunk............................}{19F7: Free space.............................................................................}


Now we can add a small file:

{0000: Object list}{001A: Free space}{0024: /a.out}{0025: free space}{0CF4: / directory{0001: /boot/bzImage-2.6.1..................................}{002A: journal chunk............................}{19F7: Free space.............................................................................}

Next, we can defrag:

{0000: Object list}{0024: /a.out}{0CF4: / directory{0001: /boot/bzImage-2.6.1..................................}{002A: journal chunk............................}{19F7: Free space.......................................................................................................................................}

A few bytes in the object link change in each of these operations, as well as forward/backward pointers to the next/previous objects. Of course, to move a journal chunk, you need to start a second journal chunk, flush everything from the first, and then free the first.

This will make on-the-fly shrinkage/growth and defragmentation easy, and will leave objects fairly managable. Going to any object is a matter of jumping to the Object List Index of that object (1 seek/read for a good driver); and reading that object off the disk (1 more seek/read for a good driver). This means O(2) to read any object based on knowing its object ID; and O(2d) to read a directory entry, assuming none of the directroy objects along the way are fragmented. So, /usr/src/linux/Makefile (first fragment only) would be read off disk after 10 seek/read pairs (/, /usr, /usr/src, /usr/src/linux, /usr/src/linux/Makefile; 2 for each) 190-210 mS for a 5400 RPM drive (19-21 mS seek time).

Sound plausable?

Thursday, January 15, 2004

P2P


Make it better


Here's an idea. I'm going to try to draw out a sort of peer to peer backbone network protocol for generic use, something that can perform well and self-exist, with no master server, nat transparent, and with swarming (like bittorrent). The idea is to get the highest bw/lowest ping near the top of the mesh-tree, while not loading all of the load on one node. The major concept points are outlined below.

- Bandwidth devoted is in terms of connections. Each connection is assumed to take up roughly 1/2 kb/s, directly. This 1/2 is seen as 1/4 coming up and 1/4 passing through. The bandwidth of a node is only what they can devote, and is only checked at eval time. If a nodes bandwidth changes, this is reflected at eval time. A node should NOT devote more than 50% of its real bw to the network as connections.
- Data passing through a node is for location of wanted content only. Major passings of data should not be pushed through the tree, or else the design collapses.
- Low ping high bw nodes go closer to the root. The root node should have a low ping, but should also have the highest bandwidth possible. Generally, whatever node with a ping lower than 100 to its nearest child and with the highest bandwidth should be the root node. Note that if this appears near the bottom, it will take a while to get moved to the top; but all networks start with two nodes, and so the shifting around should remain balanced.
- Nodes are inserted into place by bandwidth. When a node is first inserted, it does not effectively appear in the network until it is properly placed. First, it exchanges ping timings with its nearest bandwidth matches. These ping timings are used to instant eval where the node should move. If it has too low of a ping (say, it has 50 connection bw, but a ping of 780mS), it is moved down a level, and exchanges ping timings with the lower bandwidth nodes.
- Bandwidth can be retracted on the fly, but it must be done in a gentle fashion for the design to work. When a node wishes to retract bandwidth, it lowers the amoun of bandwidth it is supplying, but continues to allow connections through. It must wait for the eval packet to reach it for its bw restrictions to be activated.
- Client bandwidth can also be retracted. Clients using the node for nat traversal reflection can be notified that they should find another node. Their bandwidth can be constricted, but should not be dropped. Warnings can be issued if the connection will be severed, but the warning must have a duration of 2 minutes before severence occurs.
- Clients behind nats can signal eachother via the network, and then choose a node that is available as a reflection node via the network. Nodes must mark themselves as reflection nodes for this, and the reflection uses part of their non-node-connection bw.

The trick here is that a regular tree won't do it:

o
/ \
o o
/|\ |\
o o o o o


Notice that the massive load of about half the network goes onto the root node (at the top; this is an upside down tree). What I want to do is more of a concentric ring tree, or "Cast Tree." A "CAST TREE" is instigated like a cast system; the lower nodes all form a gorup, then the upper ones form their own group, and so on.

|--------o-------|
| | |
| |-----o----| |
| | | | |
o | o | o
| | | | |
| |-----o----| |
| | |
o--------o-------|

We call the center cast 0, and each cast going outward goes up one. So, there are 3 casts here: 0 (1 member), 1 (2 members), and 2 (5 members).

The highest in each cast talks to the second highest in the next cast, and so on down. An attempt to maintain as many connections as posible to the next cast (ring) is put into place. Communication inside one's own cast is done by simple relay (it's a gigantic ring). A packet need only go up to the next cast one time, and then is marked as relayed. When a packet reaches a cast, it splits in both directions. When it is sent up (or down) to the next cast, the copy at the node uploading is marked. If the other one is sent upward/downward, it too is marked. One of two things happen with this then: either the node in that cast has not received the packet, and splits and continues the process; or the node has gotten it, and kills it. Packets are all ID'd, so it is actually advantageous for the packet to hit the upper cast in two points and thus cover it in twice the speed (assuming they are at opposite ends of the cast, and the cast is perfectly balanced).

A packet is marked with direction it must go as well. If it came from down, it is marked to go up; if it came from up, it is marked to go down; if it came from the current cast, it is marked to go both ways. It is unmarked after delivery in said direction. So, if the lower node in cast 1 of the above diagram originated a packet, it would go up to cast 0, down to cast 2, and be marked as not needing to go up or down. Then it would go directly to the upper node in cast 1, and stop circulating cast 1. If a packet originates at cast 2 in the bottom left node, then it will be marked for up and down, go right and up, and circulate around until it hits the nodes that both have a copy and is killed.

All packets need a time to live (about 5 minutes should suffice), a creation time, and an ID. The time to live is how long the nodes will hold record of relaying a packet once they get it, so that they can kill any other copies coming to them. The TTL and Creation Time both designate at what point the packet will be killed regardless. The packet ID is generated using the originating node's MAC address and the date and time that the packet is sent, to be universally unique.

This kind of design, once refined down, should create a network scalable enough to handle almost anything. Heirarchial casting will be my design of choice, however; casts can be made of subcasts so that they scale better. This will be done by only allowing a cast ring to get so large, and then stacking the newer nodes on top of the old ring (not at a higher and lower level, but extruding "out" of the screen in the above diagram), thus creating a second ring at the same cast level, which follows the same interface rules with its above and below subcasts as the entire cast follows with the lower and higher casts. The entire cast itself would decide which cast members attatch to the higher and lower cast; but the higher and lower casts will see these as single casts.

What do you think? Refinable basic idea? It should provide high scalability even with many slow nodes on the outer casts (i.e. dial-up), because those casts will be huge, but will not be one gargantuant ring. Passing around a ring of 10 stacked 10 high takes less time than passing around a ring of 100 (9 relays vertical, 4-5 horizontal per ring, so after 14 relays, all 100 would have the data; as opposed to 50 relays in a single cast ring).

Friday, January 02, 2004

Security and ACLs


Just a random thought


Well, I was thinking about LSM and RSBAC and grsecurity ACLs and such, and I came up with a question. How exactly does it work? So I sat down for 10 minutes with Dia and came up with how I would do a modular ACL system if I had come up with it. Here's my brain-dump.



Keep in mind that this isn't meant to be based on anything out there, nor is it meant to be an actual idea for a new one; it's just something I randomly spit out.

The basic idea is this. There is a single exported function sec_register_module(secMdFctnCheck *smfctn) that registers the decision making function of the security module and returns the moduleId of the module. When a module is loaded, it registers itself via this function. The module is not placed at any level and is never consulted because it has yet to be placed. The placement is not done with a parameter to sec_register_modules() because the module may by some twist be malicious (think about loading modules via /dev/kmem and such; even though these may be secured via ACL, this is not a certainty).

Then the security administrator runs a process which calls the sec_set_module_level(moduleId, moduleLevel) syscall and moves the module to an appropriate level. Now the module is ready to run.

Now to explain the process relavent to the above flow-chart.

The lowest (i.e. most embedded) level, 0, takes top priority. From there, priority steps down to whatever the highest (i.e. most removed) level is. Thus, an allow or deny at level 0 is effective regardless of whatever is at any other level; as is an allow or deny at level 5 regardless of anything at levels 6 and above.

Each access to a file runs through all modules in no particular order at each level, from the lowest level to the highest. Each module is given the User ID and Group ID of the user; the Filesystem that the object is on and its Inode (to positively identify the object); and the Operation the user wishes to perform (read, write, execute, delete, append). After processing, each module returns one of three states. The three states any module may return are Explicite Allow, Explicite Deny, and No Allow.

Explicite Allow means that this module affirms that the user has the right to access the data in this manner. An Explicite Allow is subject only to modules at its own level or lower levels, and thus may only be retracted by an Explicite Deny. Explicite Allows are effective if and only if they occur on a level less than any and all Explicite Denies. In the basic sense, if there are no explicite denies at lower levels, and if there are no denies at the current level, then the operation is GRANTED (allowed).

Explicite Deny means that the module affirms that the user does NOT have the right to access the data in this manner. An Explicite Deny immediately stops the processing of security modules and exits the security check loop, denying access. The operation is DENIED. Explicite Denies are not effective at levels higher than any existing Explicite Allows; Explicite Denies are effective if and only if they occur at a level greater than than or equal to that of any and all Explicite Allows.

No Allow means that the module does not grant access to the data for the given operation to the user. It also means that the module itself does not deny access. At any level, if all modules return No Allow, that level is a No Allow level. No Allow has no effect on Explicite Allow and Explicite Deny states in other modules; however, if after all modules have been processed the final state is No Allow (no modules return Allow or Deny), then the operation is DENIED.

At any level, the state that is reached at the end of all relavent processing of modules is that Level State. For example, if level 3 reaches an Explicite Allow from any modules and all other modules return a No Allow state, that level has an Explicite Allow Level State. Processing proceeds to the next level if and only if the Level State is No Allow. When processing ends, whatever the Level State of the last processed level is becomes the Final State.

Under normal circumstances, the highest level would contain a module to anylize normal POSIX permissions (those set by chmod). This way a final state of No Allow will never be reached; if no modules have any decision on an object, the DAC system will make the decision. However, this module could optionally not be used, thus requiring a MAC system to be in place.

The basic aim here is to find the central connection point of each filesystem operation and insert hooks there that simply ask for any of these three states. Hopefully, there'd be no constraints put in placeby something like this; each module would bring all of its own rules, configuration syscalls, decision making functions, etc. The only thing that would need to be supplied would be a kernel-level filesystem object access function that evaded the security checks, so that the modules could read their data from and write it to the hard drive.

The operations would have to include things such as actually displaying and accessing certain filesystem objects (i.e. the /rsbac directory made by RSBAC, which RSBAC likes to make sure ls doesn't display), but the whole thing could just be expanded as needed.

Once again, this is just something I spit out randomly, and has nothing to do with any existing systems, nor does it pretend to wish to compete with anything currently in place or in development.

This page is powered by Blogger. Isn't yours?