Technische Universität München
Fakultät für Informatik
Lehrstuhl für Rechnertechnik und Rechnerorganisation

Diplomarbeit (Diploma Thesis)

Treatment of Arbitrary-Size Data in Autonomous Disks

Michael Borgwardt

Themensteller (supervisor): Prof. Dr. Arndt Bode
Betreuer (advisors): PD Dr. Thomas Ludwig, Prof. Haruo Yokota
Abgabetermin (submission date): October 15, 2001
Erklärung (Declaration)

I declare that I have written this diploma thesis independently and have used only the cited sources and aids.

München, October 15, 2001

_____________________________________
(signature of the candidate)
Zusammenfassung (Abstract)

This diploma thesis deals with an extension of an existing project, the Autonomous Disks: "intelligent" hard disks that, connected by a network, form a cluster which autonomously and automatically provides sophisticated data storage functions, including transparent fault tolerance and load balancing. A central role is played by the Fat-Btree, a highly scalable parallel index structure.

Before this project began, however, the implementation of the concept was not able to handle data objects (called "streams") larger than one disk page. The goals were, first, to overcome this restriction so that streams of arbitrary size can be stored, and second, to extend the system with the ability to distribute individual streams over several disks and to combine the transfer rates of those disks when reading them back, so-called declustering. The other capabilities of the Autonomous Disks were, of course, to be fully retained.

These goals were achieved; along the way, several auxiliary functions had to be implemented that are useful in their own right: first, a complete allocation subsystem for managing disk space had to be built, and then the transport protocol was extended with the ability to transfer large streams in smaller parts and to access any point within a stream without having to transfer the entire stream.

Finally, declustering itself was implemented, in a general way that makes it possible to choose between several different distribution strategies and to add new strategies easily. Experimental results confirm that this succeeded and that the expected performance gains can indeed be achieved.
Acknowledgments

This diploma thesis would not have been possible without the contribution and support of the following people and organizations:

- Professor Haruo Yokota from the Tokyo Institute of Technology, who provided the foundation on which the practical part of this project was built and helped me during my work with vital advice and criticism.

- The Japanese education ministry and the Tokyo Institute of Technology, who made it possible for me to stay in Japan and work on this project.

- Dr. Thomas Ludwig, who provided advice and support for the writing of this thesis.
Contents

1 Introduction
2 Prior Environment
  2.1 The Autonomous Disks Concept
    2.1.1 Clusters
    2.1.2 Internal Structure of an Autonomous Disk
    2.1.3 ECA Rules
    2.1.4 Interface
    2.1.5 Features
    2.1.6 Comparison to Other Concepts
    2.1.7 Prototype
  2.2 The Fat-Btree Parallel Index
    2.2.1 Data Distribution
    2.2.2 Index Structure
    2.2.3 Read Performance
    2.2.4 Write Performance
    2.2.5 Comparison to Other Parallel Indexes
3 Goals
  3.1 Treatment of Arbitrary-Size Data
  3.2 Allocation and Deallocation
  3.3 Declustering
  3.4 Consistency with Autonomous Disks Properties
4 Related Work
  4.1 Allocation
    4.1.1 Allocation Strategies
    4.1.2 Traditional File Systems
    4.1.3 Buddy Systems
    4.1.4 XFS
  4.2 File Systems
    4.2.1 BPFS
    4.2.2 Stacked Filesystems
    4.2.3 Galley
  4.3 Declustering
    4.3.1 One-Dimensional Data
    4.3.2 Two-Dimensional Data
    4.3.3 Higher Dimensional Data
    4.3.4 Similarity Graphs
5 Design
  5.1 Allocation and Deallocation
    5.1.1 Main Data Structures
    5.1.2 Internal Allocation
    5.1.3 Deferred Coalescing
    5.1.4 Clustering Allocation
  5.2 Arbitrary-Length Streams, Random Access
  5.3 Management of Declustered Streams
  5.4 Management of Substreams
    5.4.1 Placement of Substreams
    5.4.2 Finding Substreams
    5.4.3 Flexibility
  5.5 Alternatives
    5.5.1 Allocation Mechanisms
    5.5.2 Storage Outside the Index
    5.5.3 StreamManager
6 Implementation and Interfaces
  6.1 Allocation
    6.1.1 Package Structure
    6.1.2 Hardware Interface
    6.1.3 System Interface
  6.2 Random Access Within Streams
    6.2.1 Downwards-compatible Interface
    6.2.2 Division of Requests
  6.3 Declustering Functionality
  6.4 Integration of Declustering Methods
    6.4.1 Name Generation
    6.4.2 Interface
7 Declustering Methods
  7.1 Uniform Pseudo-Random Distribution (UPRD)
  7.2 Round-Robin Distribution (RRD)
  7.3 Multi-dimensional Data
8 Performance
  8.1 Transfer Rate
  8.2 Jitter
  8.3 Response Time
  8.4 Influence of Substream Size
9 Conclusion
10 Future Work
  10.1 Integration
  10.2 More Declustering Methods
  10.3 Automated Declustering
  10.4 Improved Streaming
1 Introduction

This document describes a project which aimed at adding the ability to efficiently handle large data items to an existing project called "Autonomous Disks". Autonomous Disks form a data repository consisting of a cluster of "intelligent" high-function disks that are attached to a network. They cooperate to offer parallel indexing, fault tolerance and skew handling to clients in a completely transparent fashion.

However, data items were previously restricted to one disk page in size. Removing this restriction required a number of modifications and extensions of the system, first of all a fully featured allocation and deallocation subsystem. Furthermore, declustering, the distribution of single data items over several disks, was also a desired functionality. The effectiveness of this technique depends greatly on how exactly the parts are distributed, and so it is beneficial to have a choice of different declustering methods.

The structure of this paper is as follows: section 2 describes the prior environment of the Autonomous Disks, their design and capabilities, as well as those of the Fat-Btree, the index structure used by the Autonomous Disks. Section 3 explains the motivation and various goals of this project. In section 4, a survey of related work is provided, showing similar efforts that influenced my work, and others that were of interest but not directly applicable.

Then, in section 5, the structural design of my contribution is shown, while section 6 gives insight into the actual implementation and the internal and external interfaces. Section 7 provides information about the concrete declustering methods implemented (or designed) in the project. The performance of the implementation is shown and analyzed in section 8. Section 9 concludes the description of the project, and section 10 suggests possible future work.
2 Prior Environment

The project described here did not require or allow a completely independent new design; the work was based on an existing design, the Autonomous Disks described in [1] (of which the Fat-Btree [2] is a vital part), and an existing implementation of that design. In order to provide a proper background for the rest of the paper, this chapter describes that environment.

2.1 The Autonomous Disks Concept

Autonomous Disks [1] are high-function disks that realize indexed shared-nothing parallel data storage, offering functionality that borrows from both file system and database roots, and featuring fault tolerance as well as the handling of data distribution and access skews, transparent to the client systems.

2.1.1 Clusters

The Autonomous Disks act as a cluster of cooperating nodes without any permanent "master" or "coordinator" node (although it may be necessary for one of the disks to temporarily assume such a role for certain tasks to be completed) and coordinate the above-mentioned tasks among each other, so that client systems don't need to be aware of the number, state and location of individual disks at all; the client can simply address the cluster as a single abstract entity. In practice, each disk accepts all possible requests and redirects them as necessary. Furthermore, dynamic reconfiguration of the cluster is possible, either deliberately (by adding disks or increasing data redundancy) or when a disk fails.

2.1.2 Internal Structure of an Autonomous Disk

Figure 2.2 shows the logical internal structure of an Autonomous Disk; each box corresponds to a Java object on the runtime level. The EIS commands (see section 2.1.4) that make up the communication between clients and disks as well as among disks are received by a CommunicationManager, which also dispatches outgoing commands. They are passed through FIFO Queues to either a RuleManager or a RuleManagerLog, depending on their type. The latter is used only on log disks (see section 2.1.5 and figure 2.1).
Figure 2.1: Autonomous Disk Cluster with double data redundancy and data path for an update operation. a: original update request - b: synchronous logging - c: asynchronous submission to backup
Figure 2.2: Internal logical and object structure of an Autonomous Disk
Both types of rule managers create instances of ECA Rules (see section 2.1.3) based on the type of and data in the EIS commands, and execute them. The rules are given the rule manager object as a context and access the rest of the system by calling methods provided by the rule manager classes.

These methods (which constitute the internal stream commands; see section 2.1.4) in turn do part of their work by using the LogManager class (only on log disks) or the DirectoryManager (access to the main index), DataManager (high-level read and insert methods that combine the actual data and index manipulations) and PageIO (low-level disk access) classes. The latter three are tightly interconnected and call each other's methods to some degree.

2.1.3 ECA Rules

All operations in the Autonomous Disks are controlled by rules following the event-condition-action paradigm described in [4]. This means that there are certain events defined that, when they occur, cause the system to start processing all rules which declare that event as their trigger. Each rule also defines conditions, and stops being processed if they are not fulfilled. If they are, then the action defined by the rule is executed. New rules can be added to adjust or extend the functionality of the system.

An example for an ECA rule:

Event: Arrival of an insert command
Condition: StreamID is not present on this disk according to the index
Action: Forward command to the disk that contains the StreamID

Generally, the events triggering the rules in the Autonomous Disks are arrivals of some sort of message over the network from a client or another disk.
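To make this structure concrete, the following minimal Java sketch shows how such a rule could look; all names (the Rule and RuleContext types, the EISCommand stand-in and its fields) are assumptions chosen for illustration and are not taken from the prototype's actual source code.

    // Hypothetical, self-contained sketch of the event-condition-action structure of a rule.
    public class EcaRuleSketch {

        static final int INSERT = 1;                      // hypothetical command type constant

        // stand-ins for the real EIS command and execution context
        static class EISCommand { int type; String streamID; }
        interface RuleContext {
            boolean isLocal(String streamID);             // does the local index cover this stream?
            String responsibleDisk(String streamID);      // partial index lookup
            void send(String disk, EISCommand cmd);       // IS command: send to another disk
        }

        // the three parts of an ECA rule
        interface Rule {
            int triggerEvent();                                   // event
            boolean condition(EISCommand cmd, RuleContext ctx);   // condition
            void action(EISCommand cmd, RuleContext ctx);         // action
        }

        // the example from the text: forward an insert whose StreamID is not stored locally
        static class ForwardInsertRule implements Rule {
            public int triggerEvent() { return INSERT; }
            public boolean condition(EISCommand cmd, RuleContext ctx) {
                return !ctx.isLocal(cmd.streamID);
            }
            public void action(EISCommand cmd, RuleContext ctx) {
                ctx.send(ctx.responsibleDisk(cmd.streamID), cmd);
            }
        }
    }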
2.1.4 Interface

Data items are abstracted as untyped data streams, and the Autonomous Disks are intended to provide general storage services for all kinds of data, such as XML documents, video data or sensor readings. Similar to OSD proposals [3] of an extended SCSI command set, such object-oriented treatment of data represents a more advanced interface than the block-oriented commands normally used to control disks.

EIS Commands

In the Autonomous Disks, communication between clients and disks as well as among disks happens through external interface stream (EIS) commands. These all have the same standardized format. Here is an excerpt from the source code (slightly modified for aesthetic purposes), showing the essential member variables of the EISCommand class:
    int type;
    SiteID SourceID;
    String StreamID;
    byte[] Stream;

type defines the type of command (such as insert, retrieve, result); its interpretation is fixed by constant class members of the EISCommand class, the names of which are used instead of numeric values in the code.

SiteID specifies the identity of the node the command was issued by, though this can be indirect. More concretely, it is the node to which results will be directed. Usually, SiteID will represent an IP number.

StreamID is the name of the stream which is affected by the command, as used in the main index; in some commands, this may be unnecessary and thus have a null value. In my work, this field was extended to carry additional information for declustering, see section 6.2.1.

Stream is the actual data being transferred. Again, this can be null in some commands.

As explained in section 2.1.3, EIS commands trigger the execution of rules, so the abstract client interface actually consists of both the EIS commands, the installed rules, and the possibility to add new rules or modify old ones.

IS Commands

The rules are given the fields of the triggering EIS command as data and an execution context (see section 2.1.2) which provides a number of internal stream (IS) commands that the rules can use to check their conditions and execute their actions. Thus, the IS commands provide the interface of the Autonomous Disks system towards the (possibly user-provided or -modified) rules. Originally, the following IS commands were implemented:

send: Send an EIS command to another disk or a client.

traverse: Scan the index for a stream ID.

insert_local: Insert a stream into the index.

read_local: Read data from the physical disk using the index. (For the three index-related commands there are actually two almost identical variants each: one that operates on the primary copy and another that uses the backup.)

catchup: Used on log disks to start committing the asynchronous backups (see section 2.1.5).
waitAck: Wait for an acknowledgement message after sending an EIS command.

Two more commands, compose and decompose (which are needed for declustering), were mentioned in [1] but not implemented. The implementation of these commands (see section 6.3) was at the core of the practical part of this diploma thesis.

2.1.5 Features

Fault Tolerance

Fault tolerance is implemented by asynchronous backups: each update operation is first synchronously written to a dedicated log disk which then periodically submits the operations to one or more backup copies of the database. These copies reside on the same disks as the primary working copy, but interleaved in a way that guarantees that the entire database can still be accessed even when one disk (or more, depending on the number of copies) fails (see figure 2.1).

The disadvantage of this approach is that the synchronous write to the log disk adds twice the network latency to the response time of an insert request. However, in a multithreaded implementation, this time could be used to service other requests.

Theoretically, it would be possible to have a disk perform both logging and storage, but a dedicated log disk has the advantage of operating sequentially, and is therefore able to log operations from several normal disks. Experiments have shown a single log disk to become the bottleneck only when logging more than five disks; beyond that point, it becomes necessary to add more log disks. However, this depends on the nature of the operations being done; should the normal disks somehow also be able to operate sequentially, the log disk would become a bandwidth bottleneck earlier.

Transparent skew handling

Skew handling is provided by the Fat-Btree parallel index described in section 2.2, which serves as the main index for the streams stored in the Autonomous Disks. Disks should collect usage information, both in terms of space and access load, and pass it around. (A circular token-passing scheme is envisioned for this; the information doesn't need to be completely up to date since it is only used for a possibly inaccurate prediction of future access frequency anyway.) Based on this information, access and data distribution skews can be reduced by moving the index ranges that the individual disks cover, and moving data accordingly.

Unfortunately, since the ranges must be continuous in order to enable the index to operate efficiently (see section 2.2.1), there will sometimes be an imperfect tradeoff between the reduction of data distribution and access skews:

- If an index range contains a small amount of data that is accessed very frequently, it will have to occupy a disk exclusively in order to avoid an access bottleneck, even though that could leave the disk mostly empty. This effect will be reduced by caching.
- If a range contains a very large amount of data that is never accessed, it would also need to have an entire disk dedicated to it for space reasons, wasting that disk's ability to perform actual work (except for doing the partial index lookups for some requests and forwarding them to the right disk).

2.1.6 Comparison to Other Concepts

There are a number of networked storage concepts which the Autonomous Disks can be compared to:

SAN

Storage Area Networks provide centralized storage space which is accessed over a network. However, this network is dedicated completely to that task and separate from the general communications network. Additionally, the client interface usually is at a very low level, e.g. SCSI commands. Obviously, there is little resemblance between SANs and the Autonomous Disks, save the fact that both allow the storage to be removed from individual machines and managed and reconfigured centrally.

Databases

Database servers such as the Oracle or IBM DB2 products, or more recently developed relational database systems, offer far more powerful functionality than the Autonomous Disks. However, features such as fault tolerance and skew handling, as well as declustering, are usually add-ons that need to be purchased separately and require additional configuration and administration overhead. Generally, the broad functionality offered by databases also comes at the cost of very high complexity in usage and administration. Autonomous Disks, on the other hand, offer some of the core functionality and some additional features at comparable or even increased (due to the Fat-Btree) performance while needing very little administration.

NAS

Network Attached Storage is a rather broad category which generally means storage servers that are accessed through high-level interfaces (usually on a filesystem level) across a general-purpose network. Examples are Sun's NFS and the SMB protocol used by Microsoft's "file sharing" services. Here, too, fault tolerance, skew handling and declustering are usually expensive and complicated add-ons. Still, this category accommodates the Autonomous Disks best.

All in all, the Autonomous Disks are probably best described as a networked hybrid of a database and a parallel filesystem that offers a useful subset of the normal features and adds some important features in an easy-to-use way.
Figure 2.3: Structure of the Fat-Btree index; line styles show which nodes are placed on which disk, including duplicates

2.1.7 Prototype

There is a working, partial prototype implementation using the Java programming language (IBM Java2 SDK 1.3 on Linux) and advanced commodity hardware: most of the development and some of the performance testing took place on PCs with 700 MHz Intel Celeron processors equipped with 256 MB RAM and connected through switched 100 MBit Ethernet. Some of the benchmarks could be run on a configuration with faster CPUs and connected through Gigabit Ethernet. Data migration and dynamic reconfiguration, as well as the Fat-Btree index, are not yet fully implemented. The work described in this paper consisted of an extension of this prototype.

2.2 The Fat-Btree Parallel Index

A central role in the Autonomous Disks is played by the Fat-Btree [2], which is a B+ tree based, parallel directory structure that avoids bottlenecks for both reading and writing.

2.2.1 Data Distribution

Data items are distributed across the processing elements in a simple range partitioning scheme, but the B-tree index allows migration of data items to adjust for data distribution and access skews as mentioned in section 2.1.5. The range partitioning is a necessary result of the index structure (see section 2.2.2); were data items scattered over disks freely and the Fat-Btree rules still applied, all internal nodes of the tree would have to be copied on most disks, resulting in an index structure (and performance) much like the Copy-Whole-Btree (see section 2.2.5).
Additionally, this keeps the number of request forwards to a minimum as described in section 2.2.3. In particular, a request that arrives at the disk that actually holds the data item in question will never need to be forwarded.

2.2.2 Index Structure

The index structure of the Fat-Btree is displayed in figure 2.3. Basically, the policy is to have each processing element (PE) keep a copy of all tree nodes that must be traversed in order to reach any of the data items stored on that PE. Effectively,

- Leaf nodes will usually not be duplicated.
- The root node is duplicated on all PEs.
- Nodes in in-between levels are duplicated on some PEs, the more the closer they are to the root.

This policy has the advantage of allowing both read and write accesses with high performance, unlike the alternative designs (see section 2.2.5), which perform well in only one of these and poorly in the other.

2.2.3 Read Performance

The Fat-Btree offers good read performance, as index lookup work can be balanced between all PEs. When an index traversal arrives at a tree node which is not on the current PE, it always means that the data item affected by the request is also not on the current PE. The request will be forwarded to one of the disks which hold a copy of the node. The average number of forwards (which of course directly affects response time in a negative way) will be low, but increasing (very slowly) with the number of disks and the size of the tree.

2.2.4 Write Performance

Write performance is also good. Of course, having copies of index nodes means that there will be some synchronization overhead when one of them is changed as the result of an insert or delete operation, reducing performance. However, the impact of this is much less than might be expected, because the number of copies decreases exponentially towards the lower (i.e. close to the leaf nodes) levels of the tree, while the probability of an insert or delete request resulting in changes to a node decreases exponentially in the opposite direction in a B-tree.

In other words: the nodes that are most likely to have many copies (which would result in the most overhead when changed) are also exactly the nodes which will change most infrequently, while the nodes which are changed most often, the leaf nodes, are those with the lowest probability that there is a copy at all. In the end, write performance is nearly as good as if there were no synchronization overhead, because due to the nature of B-trees, the vast majority of write operations affect only one leaf node.
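The lookup-and-forward behaviour described in sections 2.2.3 and 2.2.4 can be summarized in a short sketch. The following Java fragment is purely illustrative; the node layout and method names are simplifying assumptions and do not reproduce the actual Fat-Btree implementation.

    // Illustrative sketch of a Fat-Btree lookup on one PE: descend as long as the
    // needed child node is copied locally, otherwise forward the request to a PE
    // that holds a copy of that child. All names are assumptions for this sketch.
    import java.util.List;

    public class FatBtreeLookupSketch {
        static class Node {
            boolean leaf;
            String[] separatorKeys;       // range boundaries for choosing a child
            Node[] localChildren;         // null entry = child not copied on this PE
            List<Integer>[] childOwners;  // PEs that hold a copy of each child
        }

        interface Cluster {
            void forward(int pe, String streamID);  // hand the request to another PE
            void readLocally(String streamID);      // the leaf (and the data) are local
        }

        static void lookup(Node root, String streamID, Cluster cluster) {
            Node node = root;                        // the root is copied on every PE
            while (!node.leaf) {
                int i = childIndex(node, streamID);  // which subtree covers the key?
                if (node.localChildren[i] == null) {
                    // the subtree is not on this PE, so the data cannot be either
                    cluster.forward(node.childOwners[i].get(0), streamID);
                    return;
                }
                node = node.localChildren[i];
            }
            cluster.readLocally(streamID);
        }

        static int childIndex(Node node, String key) {
            int i = 0;
            while (i < node.separatorKeys.length && key.compareTo(node.separatorKeys[i]) >= 0) i++;
            return i;
        }
    }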
2.2.5 Comparison to Other Parallel Indexes

There are two widely used ways to build a parallel system with an index, both of which have severe restrictions in their performance; restrictions that the Fat-Btree overcomes.

Single-Index-Node: In this system, the index is kept on a single PE, while the data is split across all the others. The advantage is that there are no duplicates, so also no synchronization overhead: write performance is good. Additionally, there will always be exactly one forward, because the index lookup is always completed on the index PE. However, the index PE can very quickly become a bottleneck for both reading and writing, which means that this approach is not usable for large databases. The Fat-Btree, on the other hand, exhibits no bottlenecks.

Copy-Whole Index: Here, the entire index is duplicated on all PEs. Obviously, this means that lookup work can be perfectly balanced between all PEs, and that there will be at most one request forward, sometimes none. The problem is that each operation that changes the index requires maximal synchronization overhead, since all PEs are involved. Thus, the design is not suitable for any database with frequent changes. Again, compare the Fat-Btree, which sometimes has synchronization overhead, but infrequently and usually small.
3 Goals

The project had at its beginning the single, simple goal of enabling the Autonomous Disks to efficiently handle large data items. However, this required some preliminary work and a number of separate modifications, and eventually a set of somewhat independent goals formed, which are described in this chapter.

3.1 Treatment of Arbitrary-Size Data

Obviously, a generalized data repository should be able to handle data items of arbitrary size, as applications that manipulate massive amounts of data, such as multimedia streams, large XML documents, sensor data from manufacturing processes, stock market data or data transmitted by astronomical observation satellites, become more common or begin to employ more sophisticated, database-like storage systems for their superior data access and manipulation capabilities.

This demand was also honored in the SQL99 extension of the SQL language, which introduced the so-called BLOB, or "binary large object", data types into an environment which had previously been concerned only with the storage and manipulation of relatively small data items.

Thus, the Autonomous Disks should also be able to handle streams of arbitrary size. At the beginning of the project, the prototype implementation (see section 2.1.7) was not capable of storing streams larger than one disk page. Of course, clients could theoretically break up their data into disk-page sized chunks and submit these with generated names, and one of the parallel efforts did this, but it is obviously an unsatisfying solution because it makes the Autonomous Disks harder to use. Also, since clients have no knowledge of the internal structure of an Autonomous Disks cluster, they cannot perform optimal declustering (see section 3.3).

3.2 Allocation and Deallocation

Another feature that the prototype lacked was the ability to allocate and deallocate space freely. Space was allocated at the end of a file, one page at a time, and deallocation was not supported at all. In order to improve the overall capability, I decided to first implement a full allocation subsystem instead of just extending the existing system to also allocate larger extents.

Figure 3.1: Partially allocated storage. A request for more than two allocation units cannot be satisfied, though enough free space is available: external fragmentation.

An allocation subsystem must mainly perform two tasks:
- Accept requests for the allocation of a given amount of space, find for each request a large enough amount of free space to satisfy it, mark the space as used and make it available to the originator of the request.

- Accept deallocation requests for space that is no longer needed by an application and add it to the free space pool so that it can be reused.

Additionally, there are two important constraints on how allocators should perform these tasks:

Speed: There should be little delay, i.e. the allocator should not need to perform many CPU- or IO-intensive operations.

Efficiency: The allocator should waste as little space as possible.

The latter requirement is the more problematic one, as it has been shown in [8] that it cannot be satisfied for all cases: for any given allocation mechanism, there is always some sort of allocation pattern that will result in high fragmentation.

Fragmentation is the term generally used for space "wasted" by allocators. There are two distinct types of fragmentation, and some allocators use a tradeoff between the two types. See figure 3.1 for an explanatory image.

Internal Fragmentation is space that is allocated but not used, due to constraints on allocation sizes or object alignment.

External Fragmentation is space that is free, but cannot be used to satisfy a request, usually because the free space is divided into several separate chunks, while the request needs one continuous chunk. Another source of external fragmentation are alignment restrictions in some allocator mechanisms (namely in buddy systems, see section 4.1.3) that prevent consecutive free space from being allocated as a whole in some cases.

I wanted to provide the ability to allocate continuous extents of disk space as required, as well as deallocate unneeded space and automatically coalesce adjacent free extents, and to do it in a fashion that would result in little overhead and low fragmentation. Later, I decided to add the ability to allocate space "near" a given block address, a technique that might be used to reduce seek times by keeping related data structures close together.
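To make these requirements concrete, the following minimal Java sketch shows the kind of interface such an allocation subsystem could expose; the type and method names are hypothetical and are not the interfaces actually described in chapter 6.

    // Hypothetical allocator interface, for illustration only: extents of consecutive
    // disk blocks can be allocated, allocated near a hint address, and deallocated.
    public interface BlockAllocatorSketch {

        // a continuous run of disk blocks
        final class Extent {
            public final long startBlock;
            public final long length;
            public Extent(long startBlock, long length) {
                this.startBlock = startBlock;
                this.length = length;
            }
        }

        // find, mark as used and return an extent of the requested length;
        // returns null if no single continuous extent is large enough
        Extent allocate(long blocks);

        // like allocate, but prefer free space close to the given block address,
        // e.g. to keep related data structures together and reduce seek times
        Extent allocateNear(long hintBlock, long blocks);

        // return an extent to the free space pool; adjacent free extents
        // should eventually be coalesced so that larger requests can be satisfied
        void deallocate(Extent extent);
    }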
Figure 3.2: A: monolithic streams make skew handling impossible - B: declustered streams enable skew handling

3.3 Declustering

As streams become larger, it is more and more desirable to distribute (or "decluster") them across several disks, since that makes it possible to combine the transfer rates of the disks to achieve a higher total throughput and lower response time. Furthermore, skew handling is much easier when there are many small streams instead of few large ones, because skew handling is done by migrating streams across disks, and the streams' size therefore acts as a "grain" on skew handling and limits its applicability, as shown in figure 3.2. In fact, declustering already represents a simple method of avoiding, or at least reducing, data distribution and access skews.

Finally, since the implemented allocation subsystem (see section 5.1) avoids the indirection-related speed penalty of indexed allocation, it has to allocate space requests continuously. If the available free space is fragmented (which cannot be avoided in some situations), a large allocation request may fail even though enough space is available. However, several smaller requests are much more likely to be satisfied even when fragmentation is severe. Of course, this phenomenon is basically the same as indexed storage, imposing a similar speed penalty, but yielding the additional benefits mentioned above.

Data declustering across disks has been the subject of many previous research efforts, such as [17], [14] and [24], and different methods with various requirements and strengths have been developed. Obviously, it is desirable to allow the choice between several declustering methods, since some methods may be especially suited for a particular kind of environment or data. Ideally, it should be possible to easily add new declustering methods.
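As an illustration of what such pluggability could look like, here is a small Java sketch of a declustering-method interface; the names and the division of responsibilities are hypothetical simplifications, not the integration interface actually described in section 6.4.

    // Hypothetical plug-in interface for declustering methods: a method decides how
    // many substreams a stream is split into and on which disk each part is placed.
    public interface DeclusteringMethodSketch {

        // a short identifier, e.g. "RRD" or "UPRD", usable in generated substream names
        String name();

        // number of substreams a stream of the given size should be split into,
        // given the number of disks currently in the cluster
        int substreamCount(long streamLength, int diskCount);

        // index of the disk that should store the given substream
        int diskFor(String streamID, int substreamIndex, int diskCount);
    }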
3.4 Consistency with Autonomous Disks Properties

Of course, the implementation of the above goals should also be done in a way that interferes as little as possible with the rest of the system, retains the other features of the Autonomous Disks and creates as little additional complexity as possible.

This sounds obvious, but as we will see, it's not easy to do and therefore important to keep in mind as a major separate goal.
4 Related Work

Any serious scientific work must take into consideration previous research efforts on the same or similar subjects, to avoid repeating work that has already been done, especially in cases where other, superior techniques have been found. Therefore, this chapter provides a survey of related efforts.

4.1 Allocation

As described in section 5.1, the first step in my modifications of the Autonomous Disks design was the addition of a subsystem for disk space allocation and deallocation. There are a number of approaches to and theoretical considerations of this classical problem, and I took several of those into account.

4.1.1 Allocation Strategies

In [6], Wilson et al. provided a survey of dynamic storage allocation techniques to avoid fragmentation. In [7], two of the same authors elaborated upon their conclusions by benchmarking the performance of a number of allocation techniques for allocation traces obtained from various real-world applications. Their work concentrates on allocation of semiconductor memory, not hard disks, but I believe that their conclusions are still applicable to some degree.

In [6], the authors argue that there are three distinct aspects of solutions to the allocation problem:

Strategy: a set of overall long-term decisions which aim to provide good allocation that exploits regularities in program behaviour. A major part of a strategy may be the definition of what exactly constitutes "good" allocation.

Policy: actual rules that specify which of the sufficiently large free blocks should be used to satisfy a given request.

Mechanism: the algorithms, data structures and other details of a real implementation of a given policy.

According to the authors, too much attention has been paid to detailed mechanisms, too little to policies and virtually none to strategies. They also argue that the randomized synthetic traces upon which most previous allocator benchmarks had been based are not realistic, because real-world applications exhibit phase behaviour in both request sizes and object lifetimes, and there is often a correlation between the two. The lack of these characteristics in randomized traces can affect fragmentation both positively and negatively.

However, in [7], the authors found that a number of well-known allocation techniques (namely, certain variants of "best fit" and "first fit" as well as a size-class-based allocator developed by Doug Lea) produced extremely low fragmentation when processing their traces. They concluded that the phase behaviour of real-world applications favors these allocators, while randomized traces penalize them unnecessarily, causing programmers to avoid using them, especially since some of them are (falsely) thought to be slow.

In regard to the question of allocation strategies, the authors concede (though not openly) that it is hard to find a policy that implements a given desired strategy, and to understand exactly what strategy a given policy implements. They focus mostly on policies, but claim that the well-performing allocators mostly do so because they all implement two successful strategies: placing objects allocated at the same time together because they tend to be deallocated at the same time, and allocating recently deallocated space preferentially in order to give other free space time to be coalesced.

When estimating the applicability of these results to the allocation of hard disk space, there are several issues to consider. First of all, the experiments in [7] used only short-time runs of programs; after the end of the process, all address space allocated to it is freed, and so fragmentation is a considerably smaller problem than with hard disks, where the lifetime of an area with allocation activity, usually an entire filesystem, is much longer, often years. (At first glance, the fragmentation problem may also seem to apply to the entirety of physical RAM while the computer is running, but due to the virtual memory architectures of modern operating systems, it is not an issue.)

On the other hand, allocation and deallocation occur much more slowly on disks, and there is no reason that the fragmentation situation would become dramatically worse over time when the experiments showed very little fragmentation during a limited-time program run that already displayed some phase behaviour.

A more important difference is the parallel usage of disk space by many applications, even many users. In [6] and [7], a basic assumption was that an allocator needs to deal with allocation and deallocation requests of only one application, but this is hardly ever true with hard disks. Thus, the experimental results cannot be expected to apply completely.

However, it seems extremely unlikely that parallel usage would result in any kind of malicious regularities; at worst, the allocation patterns should be similar to the random traces that Wilson et al. deemed unfit for allocation benchmarks. More likely, some of the beneficial phase behaviour that led to the very low fragmentation observed in [7] would be preserved, and the fragmentation would be somewhere between that and the one observed with random traces.

Since studies cited in [6] showed the same allocation techniques that performed so well in [7] to perform at least adequately for random traces (under strong assumptions, especially of randomness, they follow the "fifty percent rule" developed by Knuth in [5], which says that the number of unused blocks, including those unusable due to fragmentation, tends to be half the number of blocks in use), it seems warranted to use one of those techniques and expect fragmentation to be tolerable. My implementation (see section 5.1) thus adopted "address-ordered best fit", a policy that showed the second-lowest fragmentation in [7] and has quicker implementations than the even better "address-ordered first fit" policy. The actual implementation is highly efficient and scalable, inspired by the XFS file system (see section 4.1.4).

Another aspect of allocation that was analyzed in [6], and which deserves brief mention, is the concept of deferred coalescing. This manifests as a simple list of fixed size in which recently freed extents are stored, with no attempt to coalesce them with adjacent free space. When an allocation request is served, this list is searched sequentially before the normal mechanism is applied. The reason to do this is that after an extent is freed, it will often be immediately reused to serve an allocation request, in which case the coalescing would have been unnecessary work. The technique affects both speed and fragmentation, but the effects are not well researched. It was adopted for the Autonomous Disks' allocation subsystem, not because of speed gains but because of a design necessity explained in section 5.1.2.
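The idea of deferred coalescing can be illustrated with a short sketch. The following Java fragment is a simplified assumption of how such a quick list could sit in front of a main allocator; it is not the design actually used in section 5.1.3, and for brevity it only reuses extents of exactly matching size.

    // Illustrative sketch of deferred coalescing: recently freed extents are kept in a
    // small fixed-size list and handed out again before the main allocator is consulted.
    // The Extent type and the MainAllocator interface are hypothetical stand-ins.
    import java.util.ArrayList;
    import java.util.List;

    public class DeferredCoalescingSketch {
        static final class Extent {
            final long start, length;
            Extent(long start, long length) { this.start = start; this.length = length; }
        }

        interface MainAllocator {
            Extent allocate(long length);
            void coalesceAndFree(Extent e);   // the expensive path: merge with neighbours
        }

        private static final int QUICK_LIST_CAPACITY = 32;   // fixed size of the deferred list
        private final List<Extent> quickList = new ArrayList<>();
        private final MainAllocator main;

        public DeferredCoalescingSketch(MainAllocator main) { this.main = main; }

        public Extent allocate(long length) {
            // first look for a recently freed extent of exactly the right size
            for (int i = 0; i < quickList.size(); i++) {
                if (quickList.get(i).length == length) {
                    return quickList.remove(i);
                }
            }
            return main.allocate(length);                    // fall back to the normal mechanism
        }

        public void free(Extent e) {
            if (quickList.size() < QUICK_LIST_CAPACITY) {
                quickList.add(e);                            // defer coalescing: it may be reused soon
            } else {
                main.coalesceAndFree(e);                     // list is full: coalesce immediately
            }
        }
    }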
4.1.2 Traditional File Systems

Traditionally, space management in file systems has been done through bitmaps, where the state of each block - used or unused - is stored as one bit in the bitmap. This is simple to implement, but has the obvious disadvantages that finding enough free space to satisfy a request requires a sequential scan through the bitmap, and the necessary work for allocating or deallocating a block also increases linearly with its size. However, there are some factors that make this approach viable regardless:

- File systems are normally used to store many comparatively small files, so allocation requests are usually easy to satisfy.

- Most file systems use some kind of "indexed allocation", in which a file can consist of a (possibly large) number of non-continuous chunks, so that large allocation requests can be satisfied by using several smaller free extents, avoiding exhaustive searches.

- The allocation unit (the block size) is comparatively large, making the bitmap, and thus the constant factor of the linear cost, relatively small.

For these reasons, most file systems choose the bitmap approach despite its speed penalty. Examples are traditional Unix file systems like the BSD FFS or Linux ext2, Windows NT's NTFS [9], the Macintosh HFS [10] and the BeOS file system [11].

Since the Autonomous Disks are a hybrid of file system and database (see section 2.1.6), I wanted to avoid the speed penalty of indexed allocation for most data items. This also made the bitmap approach infeasible, so I needed to find a more scalable solution.
Figure 4.1: Binary buddy system, free and allocated. *: free blocks cannot be allocated together due to alignment.

4.1.3 Buddy Systems

One group of storage management methods that avoid sequential scanning are buddy systems, as described in [5]. In a buddy system, space is divided into fixed-size blocks that can be hierarchically subdivided. Each block has a buddy, with which it can be combined to form a block of the next higher hierarchy level. Blocks are allocated as a whole, so requests have to be rounded up to the next higher block size. If no block of that size is available, a block of the next higher size is split and one of the two buddies used. The most common case is the binary buddy system, in which block sizes are powers of two. See figure 4.1 for an explanatory image.

The advantage of this system is that finding a free block of the appropriate size can be done very easily and quickly by keeping a list of the free blocks in each size class. Finding a block's buddy is a matter of address calculation, and so coalescing is also simple and speedy if usage information is kept in a data structure, such as a hierarchical bitmap, that performs these operations quickly.

Unfortunately, the system also has a severe disadvantage: the size classes result in very high internal fragmentation (25% on average for the binary buddy system), and alignment restrictions (see figure 4.1 for an example) add to external fragmentation. Consequently, buddy systems exhibited prohibitively high fragmentation in [7]. For this reason, the technique is not very popular, and it is usually not considered at all for hard disk allocation, possibly also because the well-known implementation method described in [5] is not appropriate for use on hard disks.
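The address calculation mentioned above is what makes coalescing cheap in a binary buddy system: the buddy of a block is found by flipping a single bit of its offset. A minimal illustration (not taken from any of the cited implementations):

    // In a binary buddy system, a block of size 2^k that starts at offset "start"
    // (a multiple of 2^k from the start of the managed area) has its buddy at the
    // offset obtained by flipping bit k, i.e. by XORing the offset with the size.
    public class BuddyAddressSketch {
        static long buddyOf(long start, long size) {
            return start ^ size;        // size must be a power of two
        }

        public static void main(String[] args) {
            // example: the 4 KB blocks at offsets 12288 and 8192 are buddies;
            // if both are free they can be coalesced into an 8 KB block at offset 8192
            System.out.println(buddyOf(12288, 4096));   // prints 8192
            System.out.println(buddyOf(8192, 4096));    // prints 12288
        }
    }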
However, Koch described in [12] how a binary buddy system was successfully used for disk space allocation on the Dartmouth Time-Sharing System. To my knowledge, this is the only buddy-system-based file system implemented and described in the literature.
The key to its success was again the (extent-based) indexed allocation, which made it possible to drastically reduce fragmentation by allocating a file on a small set of non-continuous extents (only around 1.3 on average) instead of a single one. For example, a file of 19K+1 bytes in size would be allocated two extents of 16K and 4K, instead of one extent of 32K, as would be necessary in a "pure" binary buddy system, thus reducing internal fragmentation from 40.6% to only 5%. This optimization was performed mostly by a periodically running utility that reallocated files with high internal fragmentation, since allocation has to be done when necessary during write operations, and cannot assume knowledge of the file size after the final close operation.

This unique approach seems surprisingly effective and somewhat unfairly ignored by file system designers, though the need for the reallocation utility limits its usability to systems in which file creation, growing and shrinking are relatively infrequent, and that have off-peak times during which the relatively resource-hungry utility can run without impeding productive operation.

In regard to the Autonomous Disks, however, it is not applicable, due to the lack of indexed allocation. Additionally, Autonomous Disks might well be used in database-like environments where the overhead resulting from the reallocation utility would not be tolerable.

4.1.4 XFS

In [13], Sweeney describes the architecture of XFS, the default file system used in Silicon Graphics' IRIX operating system since 1996. It is not really a parallel file system because it does not directly employ or support any kind of parallelism, but it is designed not to obstruct parallel operation when the underlying storage is some sort of disk array.

More interesting from my viewpoint was the highly scalable design of its internal data structures, including those used in the management of free space: XFS uses B-trees to efficiently implement a "best fit" allocation policy. Free space is managed in extents of consecutive free blocks. The extents are kept in two B-trees, one indexed on extent size, the other on the extents' starting addresses. The first is used to find the smallest extent that can satisfy an allocation request (this constitutes the "best fit" allocation policy), while the address-indexed tree is used to coalesce free extents and to find free space "near" a given address, which allows related data to be kept close together and reduces seek times.

The efficiency and scalability of this approach, even when not using indexed allocation, made it ideal for the allocation subsystem in the Autonomous Disks, and I therefore adopted this design in my implementation described in section 5.1.
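To illustrate the idea of the two indexes (this is a simplified sketch, not the XFS code and not the implementation from section 5.1), a best-fit allocator over free extents can be expressed in Java with two ordered maps, one keyed by extent length and one by start address:

    // Illustrative best-fit allocator: free extents are kept in two ordered maps,
    // by length for best-fit lookup and by start address for coalescing on free.
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.TreeSet;

    public class BestFitSketch {
        static final class Extent {
            final long start, length;
            Extent(long start, long length) { this.start = start; this.length = length; }
        }

        // free extents ordered by start address (used for coalescing)
        private final TreeMap<Long, Extent> byAddress = new TreeMap<>();
        // start addresses of free extents, grouped and ordered by extent length (used for best fit)
        private final TreeMap<Long, TreeSet<Long>> byLength = new TreeMap<>();

        // best fit: the smallest free extent at least as large as the request; returns the
        // start address of the allocated area, or null if no continuous extent suffices
        public Long allocate(long length) {
            Map.Entry<Long, TreeSet<Long>> entry = byLength.ceilingEntry(length);
            if (entry == null) return null;
            Extent e = byAddress.get(entry.getValue().first());
            removeFree(e);
            if (e.length > length) {
                insertFree(new Extent(e.start + length, e.length - length));  // keep the tail free
            }
            return e.start;
        }

        // return an extent to the pool, merging it with free neighbours
        public void free(long start, long length) {
            Extent e = new Extent(start, length);
            Map.Entry<Long, Extent> prev = byAddress.floorEntry(start);
            if (prev != null && prev.getValue().start + prev.getValue().length == start) {
                removeFree(prev.getValue());
                e = new Extent(prev.getValue().start, prev.getValue().length + e.length);
            }
            Extent next = byAddress.get(e.start + e.length);
            if (next != null) {
                removeFree(next);
                e = new Extent(e.start, e.length + next.length);
            }
            insertFree(e);
        }

        private void insertFree(Extent e) {
            byAddress.put(e.start, e);
            byLength.computeIfAbsent(e.length, k -> new TreeSet<>()).add(e.start);
        }

        private void removeFree(Extent e) {
            byAddress.remove(e.start);
            TreeSet<Long> starts = byLength.get(e.length);
            starts.remove(e.start);
            if (starts.isEmpty()) byLength.remove(e.length);
        }
    }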
4.2 File Systems

As mentioned in section 2.1.6, the Autonomous Disks are basically something intermediate between a parallel file system and a (very simple) parallel database with the added capabilities of skew handling and fault tolerance. Even though the file system aspect was not considered much in [1], it is in my opinion the dominant one; therefore I considered other research efforts on (especially parallel) file systems prominently.

4.2.1 BPFS

In [28], Russell describes the architecture of BPFS, a Basic Parallel File System. This work was not considered when designing my architecture, but in retrospect, my declustering subsystem ended up being remarkably similar to the BPFS architecture.

BPFS is designed as a modular, distributed parallel filesystem for use on clusters of workstations. Its core is an architecture and a set of protocols for communication between the components. The implementation described in [28] is relatively low-level, providing no filesystem structure and using an underlying native filesystem to provide the storage and indexing. The BPFS functionality is divided into four main components:

Clients collect the data for a user or application. They are expected to be realized as libraries that encapsulate the BPFS functionality and provide an API that may be similar or identical to existing ones (one of the implemented client libraries mimics the standard C IO library). The number of client processes on each node depends on the implementation.

Servers control the actual data, and each server process is solely responsible for all access to its assigned storage; this makes sense because the physical disk forces a sequential execution of requests anyway.

Managers handle the metadata of files, the declustering, creation and deletion. One manager process is solely responsible for all metadata access to a set of files. It needs some storage in which to keep the metadata.

Agents act as proxies for client requests and perform caching as well as hiding the declustering from the client. There is one agent process for each open file, and it performs all operations on that file.

These components are distributed over three types of "logical nodes"; there can be multiple instances of each type, distributed among a network of (possibly) heterogeneous physical nodes. A physical node can contain instances of more than one type of logical node. The three types of logical nodes are:

Client Nodes consist of a user application process that accesses BPFS via a client component.

Server Nodes consist of a server component, a large amount of storage which it controls, and a number of agent components.

Manager Nodes consist of a manager component and a (relatively small) amount of storage for metadata.
Figure 4.2: The architecture of BPFS

Declustering is performed by mapping locations within a file onto names of (and offsets within) "data components" which are distributed across the disks. The mapping function can be supplied by the client, and the mapping data is stored as file metadata in the file's manager component. (A sketch of such a mapping follows at the end of this subsection.)

It's easy to see that this is basically the same approach taken in my declustering design for the Autonomous Disks. There are, however, also some differences:

- The BPFS handling of mapping functions appears more flexible, especially the fact that a client is described as being free to supply a completely new mapping function. However, it is not described how this function is stored and computed by the manager. This feature might add considerable complexity to the implementation.

- My design doesn't have the division into separate manager, server and agent processes; each Autonomous Disk performs all of these tasks. Especially notable is the lack of a separate agent component; the Autonomous Disks design is stateless and does not require the opening of files. This avoids resource bottlenecks (the agents are analogous to the file descriptors in traditional Unix file systems) and setup delays, though having a separate agent process might be beneficial for large operations and especially for stream-oriented applications that require a constant data rate (see section 10.4), and prevent superstream index entries from becoming bottlenecks (see section 5.3).

- The BPFS design does not address issues such as fault tolerance (a severe drawback, since the probability of a hardware failure approaches one exponentially as the number of involved components increases), indexing and load balancing, while in the Autonomous Disks design, these features are seamlessly provided by the rest of the system. It is conceivable that a different BPFS implementation might provide these on a separate layer, but the result could be less efficient because the overall design does not take them into account.

This is probably a very general tradeoff between modularity and performance: clearly separated modules facilitate design and maintenance on a high level and allow parts of the system to be replaced separately from the rest, but if the APIs between the modules are too narrow, they might not allow some forms of interaction that might be beneficial, especially for performance. On the other hand, if they are too big, they become difficult to use, and the probability increases that there are hidden dependencies that effectively destroy the modularity.

Still, the similarity between the two approaches is strong and can be seen as an indication that the design is fundamentally sound.
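As an illustration of such a mapping function, the sketch announced above shows a simple striped layout that maps a byte offset within a file onto a data component name and an offset within that component; it is a hypothetical example, not the BPFS implementation.

    // Hypothetical striped mapping from a file offset to (data component, local offset).
    // With a stripe size S and N components, bytes [0,S) go to component 0, [S,2S) to
    // component 1, and so on, wrapping around after component N-1.
    public class StripeMappingSketch {
        static final class Location {
            final String component;   // name under which this piece of the file is stored
            final long offset;        // offset within that component
            Location(String component, long offset) { this.component = component; this.offset = offset; }
        }

        static Location map(String fileName, long fileOffset, long stripeSize, int componentCount) {
            long stripeNumber = fileOffset / stripeSize;            // which stripe the byte falls into
            int component = (int) (stripeNumber % componentCount);  // round-robin over components
            long localStripe = stripeNumber / componentCount;       // stripes this component already holds
            long localOffset = localStripe * stripeSize + fileOffset % stripeSize;
            return new Location(fileName + "#" + component, localOffset);
        }

        public static void main(String[] args) {
            // byte 300000 of "video.mpg" with 64 KB stripes over 4 components
            Location loc = map("video.mpg", 300000, 65536, 4);
            System.out.println(loc.component + " @ " + loc.offset);  // prints video.mpg#0 @ 103392
        }
    }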
ount.This is probably a very general tradeo� between modularity and performane:learly separated modules failiate design and maintenane on a high level andallow parts of the system to be replaed separately from the rest, but if the APIsbetween the modules are too narrow, they might not allow some forms of inter-ation that might be bene�ial, espeially for performane. On the other hand, ifthey are too big, they beome di�ult to use and the probability inreases thatthere are hidden dependenies that e�etively destroy the modularity.Still, the similarity between the two approahes is strong and an be seen as an indiationthat the design is fundamentally sound.4.2.2 Staked FilesystemsIn [29℄, Heidemann developed what he alls the �stakable� design of �le systems. Thisbasiallymeans a layered approah in whih eah layer a
epts any kind of operation, and7one of the implemented lient libraries mimis the standard C IO library8the agents are analogous to the �le desriptors in traditional Unix �le systems9a severe drawbak, sine the probability of a hardware failure exponentially approahes one as thenumber of involved omponents inreases34
This basically means a layered approach in which each layer accepts any kind of operation, and either executes it (possibly calling one or more operations of the layer beneath), rejects it (creating an error) or, if it isn't concerned with the operation (the most common case), passes it through unchanged.

Figure 4.3: A stacked file system and a part of its operations vector, showing how operations (res_path, create, unlink, open, read, write, alloc, free, rd_block, wr_block) are handled through the stack of layers (Read-Only, Directory, Inode/File, Allocation, Cache, Block, Default): each layer rejects, passes through, or implements/modifies each operation. Note that an implemented operation in one layer may result in this or other operations of the layer below being called.

Each layer should provide a separate, small part of the filesystem functionality. Examples are encryption layers, cache layers or network transparency layers. The functionality might also be extremely small, such as compatibility adjustments or making a file system read-only. This is called a "featherweight layer" by Heidemann and given special attention. Additionally, he develops a cache coherence framework to make caching on different layers possible while keeping data consistent.

The reason for the development of the stackable design is stated as the need to make file systems easier to design and implement, and (more importantly) to allow binary compatibility between file system modules from different parties (i.e. software vendors) without requiring access to the source code.

Heidemann himself mentions that there are strong parallels between his stacking techniques and object-oriented design (or languages, such as our implementation language Java). Each layer might be implemented as a class that has an object of the next lower layer as an instance member, and featherweight layering is nearly equivalent to the creation of a subclass that overrides one or a few methods.

However, the central feature of the stacking concept (though it is not stressed in the paper) is the dynamic configuration of a stack of layers that were written independently from each other, in which a global operations vector is created that consists of function pointers to the implementations of all operations offered by the stack. This would be very inelegant or difficult to emulate in Java, requiring either the use of "generic" methods that take e.g. an array of byte arrays as their arguments and parse those, or some sort of on-the-fly rewriting of classes by a customized ClassLoader to make them contain pass-through methods for operations offered by lower layers that they don't modify.
Since the Autonomous Disks prototype implementation (see section 2.1) on which my work was based did not use a layered approach for its design, and since the object-orientation of the implementation language already provides the benefits (in regard to design ease) of the stacking technique while the others (binary compatibility) are of questionable relevance, especially for a research prototype system, I chose not to try applying the stacking technique to my design.

4.2.3 Galley

In [30], Nieuwejaar and Kotz present Galley, a parallel file system that is intended mainly for use on massively parallel supercomputers. They cite results showing that the actual access patterns of scientific applications developed for such systems are fundamentally different from those in uniprocessor or vector supercomputers and therefore not adequately handled by traditional parallel file systems. In particular, such applications tend to make large numbers of regular, but small and non-consecutive file accesses, which traditional file systems don't parallelize effectively.

The main improvement in Galley is therefore the introduction of a powerful (and very complex) interface that application programs can use to specify their access patterns, in particular strided patterns, which are very common in scientific applications. A strided access pattern is one that consists of multiple accesses of identical size and with identical intervals between them. Galley allows the application to specify both simple and nested strided access patterns, giving Galley much more information than other file systems, which it can use to execute requests in parallel, coalesce them if possible, and perform intelligent disk scheduling.

Declustering is provided by splitting files into a fixed number of subfiles (chosen when creating a file), where each disk contains at most one subfile of any given file. Additionally, each subfile can contain one or more (completely separate) forks of data. Subfiles and forks must be explicitly addressed by the application (or a library it uses). This gives the developer maximum flexibility for the development of IO-optimal parallel algorithms.

Galley obviously provides very useful functionality that could result in large performance gains in its intended target environment: scientific applications that do "number crunching" on massively parallel systems. However, this environment differs a lot from the one Autonomous Disks are intended for. Application programmers who want to make use of Galley's performance improvements need significant knowledge of the system details and must write their applications specifically for Galley, while a key feature of Autonomous Disks is easy, transparent use of their features. Additionally, Galley is designed to cater optimally to a single parallel application that uses a few massive data structures, and it might not perform very well in a multiuser environment with many different data items, for which Autonomous Disks are intended.

For these reasons, and because Galley does not address issues such as skew handling and fault tolerance, which are key features of Autonomous Disks, Galley is largely incompatible with my project and was not considered as a source of ideas during its design and implementation.
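To give an impression of how much information a nested strided request carries compared with a plain byte-range read, the following Java sketch enumerates the chunks described by such a pattern; the classes and their layout are illustrative assumptions, not Galley's actual interface.

    // Illustrative description of a nested strided access pattern and the file
    // offsets it expands to. This is a conceptual sketch, not Galley's real API.
    import java.util.ArrayList;
    import java.util.List;

    public class StridedPatternSketch {
        // "count" accesses of "size" bytes, their starts separated by "stride" bytes
        static final class Stride {
            final long size, stride, count;
            Stride(long size, long stride, long count) { this.size = size; this.stride = stride; this.count = count; }
        }

        // expand an outer repetition (e.g. "every row of a matrix") that contains an
        // inner strided pattern (e.g. "within each row, every fourth element")
        static List<long[]> expand(long base, long outerCount, long outerStride, Stride inner) {
            List<long[]> chunks = new ArrayList<>();   // each entry is {offset, length}
            for (long i = 0; i < outerCount; i++) {
                long groupStart = base + i * outerStride;
                for (long j = 0; j < inner.count; j++) {
                    chunks.add(new long[] { groupStart + j * inner.stride, inner.size });
                }
            }
            return chunks;
        }

        public static void main(String[] args) {
            // 3 groups 1000 bytes apart; in each group, 4 pieces of 8 bytes, 32 bytes apart
            for (long[] c : expand(0, 3, 1000, new Stride(8, 32, 4))) {
                System.out.println(c[0] + " +" + c[1]);
            }
        }
    }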
[Figure: stacked components of the total disk busy time (seek, transfer) and of the total response time (per-disk response times, data distribution).]
Figure 4.4: Influence of striping on response time and disk busy time

4.3 Declustering

Enabling the Autonomous Disks to provide declustering was a central part of my work; therefore other research efforts in this field were of course also relevant. Papers on declustering usually describe only certain strategies of distributing the data, not ways to manage the distributed data; these tend to be too integrated into the overall system design to be generalizable and are therefore found in papers that describe such designs (like the one in section 4.2.1).

However, even though I only implemented two unsophisticated declustering strategies that are not based on any particular paper, taking into account available declustering strategies was important, since looking at them provides requirements that a declustering framework such as the one described in sections 5.3 and 5.4 has to meet in order to support a variety of declustering strategies.

4.3.1 One-Dimensional Data

Stripe Size

In [15], Scheuermann, Weikum and Zabback present a model for data partitioning across disks that takes into account response time, throughput and load balancing. I did not consider the load balancing aspect, because the Autonomous Disks already provide load balancing. However, their work clearly shows that the stripe size, the unit into which data is divided when declustering it, is an important tuning parameter. There are two main considerations that form a tradeoff:

- The stripe size should be small enough that most requests involve all disks, to benefit from the aggregated transfer rate of the disks, which is the main advantage of declustering. Thus, the average request time should be taken into account.

- On the other hand, a smaller stripe size leads to increased overall disk busy time per request, due to disk seeks. This reduces the throughput when there are multiple requests being served at the same time.
[Figure: records 0-29 assigned to disks 0-3 by each scheme, with a small and a large example query marked.]
Figure 4.5: Declustering of record-based data: Round-Robin, Hash, Range and Hybrid-Range declustering

This tradeoff is illustrated in figure 4.4. The importance of the right stripe size was taken into account when designing the Autonomous Disks declustering in such a way that the stripe size for each stream can be chosen separately, as described in section 5.4.

Hybrid-Range Partitioning

In [14], Ghandeharizadeh and DeWitt proposed their Hybrid-Range Partitioning Strategy for declustering in multiprocessor databases. They observed that for record-based data, hash-based and round-robin declustering has the result that even small range queries have to be processed on all PEs, incurring a relatively large parallelism overhead. On the other hand, simple range partitioning has the disadvantage of confining even very large range queries to just one or two PEs, so that no intra-query parallelism is used and the response time is long.

They therefore proposed a declustering strategy in which the data is split into relatively small fragments, each containing records with a small range of the index attribute. These fragments are then distributed across all the disks using a round-robin scheme. The result (see figure 4.5) is that small range queries are confined to one or a few PEs, minimizing overhead, while larger ones use all disks to improve the response time. The size of the fragments can be optimized, taking into consideration the average resource requirements and size of queries, to achieve the distribution that matches the workload best.

The Autonomous Disks (or rather, the Fat-Btree) use a range partitioning with flexible ranges for the main index attribute, as described in section 2.2. However, individual streams may have an internal database-like structure, such as the files used by Berkeley DB style databases. Even non-database streams may frequently experience "range queries", i.e. accesses to a certain limited range within the stream (for example displaying a short excerpt of a video stream).

Therefore, it is beneficial to use a hybrid-range partitioning for the declustering of individual streams in the Autonomous Disks, and the approach taken by me (see section 5.3) is quite similar. Of course, record-based hash or round-robin declustering is not possible at all with generalized, unstructured data streams, so I chose hybrid-range declustering over a simple range partitioning scheme. In addition to the possible performance-related advantages, this was also chosen because it allows the skew handling to be more fine-grained (see section 3.3 and figure 3.2).
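To make the idea concrete, the following fragment is a minimal sketch (not the HRPS algorithm from [14] nor the actual Autonomous Disks code) of how a hybrid-range scheme can map a record key to a disk: keys are first grouped into small contiguous fragments, and the fragments are assigned round-robin. The fragment size and disk count are assumed tuning parameters.

    // Hypothetical illustration of hybrid-range placement; fragmentSize and numDisks
    // are assumed parameters, not values taken from the thesis.
    class HybridRangePlacement {
        private final long fragmentSize; // width of the key range covered by one fragment
        private final int numDisks;

        HybridRangePlacement(long fragmentSize, int numDisks) {
            this.fragmentSize = fragmentSize;
            this.numDisks = numDisks;
        }

        /** Returns the disk that stores the record with the given key. */
        int diskFor(long key) {
            long fragment = key / fragmentSize;   // contiguous key range -> fragment number
            return (int) (fragment % numDisks);   // fragments are distributed round-robin
        }
    }

A small range query then touches only the one or two fragments (and disks) covering its key range, while a query spanning many fragments is spread over all disks.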
[Figure: data blocks 0-11 and their mirror copies laid out across Disk 0 to Disk 3.]
Figure 4.6: The RAID-x Orthogonal Striping and Mirroring scheme

RAID-x

In [16], Hwang et al. presented RAID-x, a new distributed RAID architecture. The distinguishing features of their system are:

- A block distribution method for achieving both fault tolerance and high throughput for reading and writing, called Orthogonal Striping and Mirroring (OSM). Data blocks are striped across all disks, but backup copies are grouped together and written sequentially, as shown in figure 4.6.

- Cooperative Disk Drivers (CCD), an architecture for establishing a transparent single I/O space across the nodes of the parallel RAID cluster.

- A disk configuration in which each node of the cluster has several disks, forming a two-dimensional disk array that allows both parallel and pipelined access.

It is not clear how much the integration of these features is necessary, or whether they could be used independently from each other. The result is aggregated read transfer rates comparable to those of a RAID-5 configuration, and just like RAID-5, RAID-x can tolerate a single disk failure. However, because there is no parity calculation, write performance is much better than RAID-5. The drawback is that RAID-x requires 50% data redundancy, but tolerates only a single disk failure.

In regard to the Autonomous Disks, the OSM distribution might be beneficial, but [16] does not compare its performance to RAID-0/1 (combined mirroring and striping), which is a common configuration and very similar to the way the Autonomous Disks declustering operates. In general, the RAID-x design is intended for block-level access, and it would not be easy to apply it to an object-oriented architecture, especially with regard to the skew-handling-induced data migration of the Autonomous Disks.

The Cooperative Disk Drivers, on the other hand, are quite similar to the way the Autonomous Disks operate: a serverless cluster of nodes forming a single data repository towards clients. However, since they provide only block-level access and neither skew handling nor dynamic reconfiguration, their job is much simpler.

4.3.2 Two-Dimensional Data

In the 1990s, several research efforts produced a continuous evolution of declustering schemes that achieve good response times for range queries on (mainly) two-dimensional data, such as the retrieval of all available information on an area of a map selected by a user. The problem here is to distribute the information across the disks in such a manner that, as far as possible, each possible range query results in all disks being used equally.

Unfortunately, Abdel-Ghaffar and El Abbadi showed in [18] that such a strictly optimal scheme exists only under very narrow conditions (this limitation holds, however, only for the chosen partition of the index space into rectangular tiles, a choice that seems obvious but has severe drawbacks, as discussed in section 4.3.3), namely if either M, the number of disks, is 1, 2, 3 or 5, or the number of tiles in all but one of the dimensions is only 1 or 2. However, the aforementioned evolution has yielded methods that come closer and closer to a strictly optimal declustering.

Disk Modulo (figure 4.7)

The first such scheme was the Disk Modulo scheme introduced in [19] and extended in [20], in which the data is divided into tiles and tile (x, y) is assigned to disk number (x + y) mod M. This is applicable for any M, but the performance is lower than that of all the other methods.

Field-XOR (figure 4.10)

Another method is the Fieldwise Exclusive-or scheme, in which tiles are assigned to disk number (x xor y) mod M (the binary representations of the tile numbers are combined through a bitwise exclusive-or operation). This achieves better performance, but it requires M to be a power of 2 and elaborate correction techniques if the number of tiles is smaller than M in one of the dimensions.
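As a minimal sketch written directly from the two formulas above (not taken from the prototype), the tile-to-disk mappings of these schemes are one-liners; m is the number of disks and, for Field-XOR, assumed to be a power of 2:

    // Tile-to-disk mappings for the Disk Modulo and Fieldwise-XOR schemes.
    // x and y are the tile coordinates, m the number of disks.
    class TwoDimensionalDeclustering {
        /** Disk Modulo: tile (x, y) goes to disk (x + y) mod m; works for any m. */
        static int diskModulo(int x, int y, int m) {
            return (x + y) % m;
        }

        /** Fieldwise exclusive-or: disk (x xor y) mod m; m must be a power of 2. */
        static int fieldXor(int x, int y, int m) {
            return (x ^ y) % m;
        }
    }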
[Figures 4.7-4.12: assignments of the tiles of a two-dimensional index space (axes X and Y) to four disks under the schemes named below.]
Figure 4.7: Disk Modulo declustering
Figure 4.8: ECC declustering
Figure 4.9: Cyclic Declustering (skip value: 2)
Figure 4.10: Field-XOR declustering
Figure 4.11: HCAM declustering. The convoluted line is the space-filling Hilbert curve.
Figure 4.12: Golden Ratio Sequences declustering
Method                   Tiles per dimension            Number of Disks
Disk Modulo              Any                            Any
Field-XOR                Complications if < # disks     Power of 2
Error-Correcting Codes   Power of 2                     Power of 2
Hilbert Curve            Any                            Any
Cyclic Declustering      Any                            Any
Golden Ratio Sequences   Any                            Any

Figure 4.13: Limitations imposed by the various two-dimensional declustering methods

Error-Correcting Codes (figure 4.8)

Both of these methods are outperformed significantly (as shown in [23]) by the ECC scheme presented in [22], which uses error-correcting Gray codes. The binary representations of the tile coordinates are grouped on the disks in such a way that those grouped on each disk form a Gray code with large Hamming distances between them (i.e. they differ in many bit positions), which should result in good declustering. The disadvantage of this method is that it needs both M and the number of tiles in both dimensions to be a power of 2.

Space-Filling Curves (figure 4.11)

HCAM, a declustering method that performs equally well as or better than ECC (but only for square or near-square queries) while not putting any restrictions on M or the number of tiles, was introduced in [23]. In HCAM, the tiles are linearized along a space-filling Hilbert curve and assigned to the disks in a round-robin fashion along this linearization.

Cyclic Declustering (figure 4.9)

Cyclic Declustering, a generalization of the Disk Modulo method, was presented in [24]. It places tile (x, y) on disk (x + H * y) mod M, where H is a skip value that has to be chosen by some algorithm or heuristic to yield good results. Three such methods were presented: the first (RPHM) is based on a simple algorithm to choose a skip value that is relatively prime to M. This performs better than all previous methods for most queries, but it strongly favors even values of M. A second algorithm is GFIB, which finds a better H relatively prime to M based on Fibonacci numbers. The performance is even better in nearly all cases. A final method relies on an exhaustive search of all possible values and simulation of their performance to find the value of H with the best result, but due to the computational overhead, this quickly becomes infeasible for larger values of M and larger maps. All of these methods put no restrictions on M or the number of tiles.
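A corresponding sketch for Cyclic Declustering (again only an illustration of the formula; the choice of the skip value, which is the actual contribution of RPHM, GFIB and the exhaustive search, is left to the caller):

    // Cyclic Declustering: tile (x, y) is placed on disk (x + h * y) mod m,
    // where h should be chosen relatively prime to m (e.g. by RPHM or GFIB).
    class CyclicDeclustering {
        static int diskFor(int x, int y, int h, int m) {
            return (x + h * y) % m;
        }
    }

With h = 1 this degenerates to the Disk Modulo scheme, which shows in what sense Cyclic Declustering generalizes it.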
Golden Ratio Sequences (figure 4.12)

The best declustering scheme so far for two-dimensional data was introduced in [25], involving somewhat nonintuitive but not really complex computations based on Golden Ratio sequences (and computationally feasible, even for large M and large maps). It outperforms even the (practically infeasible) exhaustive-search cyclic declustering in most cases, and imposes no restrictions. Moreover, it stands out as the only declustering method whose performance could be mathematically analyzed. The analysis suggests that the GRS scheme has a worst-case response time within a factor of 3 of the (often not existing) strictly optimal scheme for any query, and within a factor of 1.5 for large queries, while the average performance is expected to differ from the optimum by only a factor of 1.14.

4.3.3 Higher Dimensional Data

All of the two-dimensional declustering schemes mentioned in section 4.3.2 can be generalized to a higher number of dimensions. However, as the number of dimensions increases, a silent assumption that all of these schemes make becomes untrue: the assumption that dividing the index space into square tiles yields good performance for declustering. There are several interrelated problems:

- The number of tiles increases exponentially with the number of dimensions this way. This means that it may become impractical to divide the space into more than two or three tiles in each dimension (three tiles per dimension already result in almost 60,000 tiles overall in 10 dimensions), because the large number of tiles incurs an increasing overhead in data management.

- Additionally, the number of tiles adjacent to any given tile also increases (though only linearly), which makes it more difficult or impossible to find an optimal declustering scheme for small queries. The non-existence of an optimal scheme mentioned in section 4.3.2 is proven only for rectangular tiling.

- Another problem is data distribution: the nature of the Euclidean distance metric has the effect that in higher-dimensional spaces, most data will be situated close to the surface of the index hypercube (a 5-dimensional hypercube with a base length of 0.5 contains less than 4% of the volume of a unit hypercube). Equidistant tiling is not well fit to deal with such a distribution.

- Even worse, not only the overall number of tiles increases exponentially with the dimension, but also the number of tiles adjacent to an intersection, which will have to be loaded for any query covering that intersection. In the worst case, a high dimensionality may require using only two tiles per dimension, with the result that even very small queries have a high probability of encompassing the single central intersection and thus requiring all tiles to be loaded: for all hypercubes with a base length over 0.5 (which may still be very small in terms of relative volume), this is actually a certainty.

Figure 4.14: Hypertrapezoid partitioning scheme for 2 and 3 dimensions, one hypertrapezoid tile highlighted

After establishing in this way that equidistant rectangular tiling is not suitable for high-dimensional data, Ferhatosmanoglu et al. suggested in [27] an alternative partitioning scheme (their work is heavily based on an earlier paper [26] by Berchtold et al.). They divide the space not into rectangular tiles, but into concentric hypercubes, with a possible subdivision into hypertrapezoids by intersecting the concentric hypercubes with hyperpyramids formed by the center of the index space as the pyramids' tip and its surfaces as their bases. This solves all the problems mentioned above:

- Both the total number of tiles and the number of tiles adjacent to an intersection increase linearly instead of exponentially with the number of dimensions: a hypercube of dimension n has 2n surfaces, and there is one hyperpyramid for each surface.

- The number of hypertrapezoids into which each hyperpyramid is divided can be chosen freely and increases the total number of tiles by only a constant, not exponential, factor.

- The concentric hypercubes (and thus the hypertrapezoids) can be spaced non-equidistantly to achieve good data distribution: the outermost ones should be very close together, since they will contain the most data.

- While the hypertrapezoid partitioning does not alleviate the problem of the increase of adjacent tiles making optimal declustering difficult, pure concentric-hypercube based partitioning makes it trivial to find an optimal declustering scheme (for example, a simple round-robin distribution advancing from the innermost hypercube outwards, as sketched below).
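The following is a minimal sketch of the pure concentric-hypercube variant mentioned in the last point (the shell radii and disk count are assumed parameters, and the hyperpyramid subdivision is omitted): a point's shell is determined by its distance from the center of the unit index space in the maximum norm, and shells are assigned to disks round-robin from the inside out.

    // Illustrative sketch only: concentric-hypercube partitioning of the unit
    // hypercube [0,1]^d, with shells assigned to disks round-robin.
    class ConcentricHypercubePlacement {
        private final double[] shellRadii;  // increasing "radii" of the concentric hypercubes
        private final int numDisks;

        ConcentricHypercubePlacement(double[] shellRadii, int numDisks) {
            this.shellRadii = shellRadii;   // e.g. spaced closer together near the surface
            this.numDisks = numDisks;
        }

        int diskFor(double[] point) {
            // distance from the center of the index space in the maximum norm
            double r = 0.0;
            for (double x : point) r = Math.max(r, Math.abs(x - 0.5));
            int shell = 0;
            while (shell < shellRadii.length && r > shellRadii[shell]) shell++;
            return shell % numDisks;        // round-robin over the shells
        }
    }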
However, the hypertrapezoid partitioning also has at least one disadvantage, which is not mentioned in [27]: even small queries close to a corner of the index space would tend to require almost half of the total data to be loaded.

Anyway, this method, as well as the two-dimensional declustering methods of section 4.3.2, is not directly applicable to the Autonomous Disks, since they use only a one-dimensional index. However, it would be feasible to offer limited support for an internal two- or multidimensional structure of streams, and to offer special declustering methods for them. This was not implemented, but in section 7.3, a way to integrate this capability into the current code with small effort is outlined.

Since Autonomous Disks are intended as a general data repository, the ability to efficiently handle multi-dimensional data is a valuable addition to the feature set, and the relative ease with which this declustering method could be added shows the flexibility of the declustering framework.

4.3.4 Similarity Graphs

In [17], Liu and Shekhar present a very generalized method for finding a good declustering layout for a particular set of data and queries. Their idea is based on viewing the data items as nodes of a graph; the graph's edges represent the likelihood of two nodes being accessed together and are weighted accordingly. A heuristic method is used to partition the graph over the available disks in such a way that data items that are frequently accessed together are placed on different disks as much as possible, so that they can be retrieved in parallel. An incremental placement scheme for the gradual accumulation of data items is also suggested.

The obvious problem is that building the similarity graph in the first place requires a lot of knowledge about access probabilities that may not be available at all or may be costly to gather or maintain. An example of a case where this is not a problem is the range-query-oriented two-dimensional data partitioning discussed in section 4.3.2. Assuming that all queries are equiprobable or that their probability distribution is known, the edge weights can be computed relatively easily. However, in general this question must be solved separately for any actual application of the method, so in reality it simply replaces the task of finding a good declustering scheme with that of finding meaningful probabilities of data items being accessed together. The advantage is that the latter task seems to be more straightforward to solve in most cases.

Another disadvantage is that graph partitioning is a costly (in terms of CPU usage) operation, though this is partially alleviated by modern, powerful CPUs and by the incremental placement method, which uses only a subset of the graph to find a locally optimal placement.

In the case of the Autonomous Disks, I chose not to implement this method because the disks themselves do not generally have the means to provide the edge weights, and clients should be able to use the Autonomous Disks as easily as possible. Additionally, the similarity graph partitioning does not yield a function-based partitioning, so the location of each data item has to be recorded as metadata, resulting in a possibly large amount of metadata, something I deliberately wanted to avoid (see section 5.4).
5 Design

Design changes and extensions that allowed the treatment of large data items, as well as some supporting functionality, were the core contribution of my project. In this chapter, these changes and the considerations that motivated them are explained.

5.1 Allocation and Deallocation

The first extensive programming work done in this project was the implementation of a fully featured allocation subsystem to manage the disk space on each Autonomous Disk. Figure 5.1 describes the "paths" of space in the resulting system.

5.1.1 Main Data Structures

To attain the goals described in section 3.2 (fast allocation, low fragmentation), the "address-ordered best fit" allocation policy, which performed so well in the fragmentation benchmarks mentioned in section 4.1.1, was chosen. For an efficient implementation, the approach used in XFS (see section 4.1.4) was adopted.

The main data structures are two B+ trees (unrelated to the Fat-Btree, the Autonomous Disk's main index, which has a much less efficient implementation, partially because it is distributed across several disks), both of which contain all extents of contiguous free space.

- One tree is indexed on the size of the extents, with the starting address as secondary key (to make each entry unique). This is used directly to implement the "best fit" policy, as it makes it easy to find a free extent that satisfies a request.

- The second tree is indexed on the starting address. It is used for coalescing free extents with their neighbours if possible, and also for clustering allocation (see section 5.1.4).

Generally, an allocation request is served by looking through the size-indexed tree for the smallest free extent that is large enough to satisfy the request, and removing it from the trees. If the extent does not fit the request exactly (i.e. it is larger than necessary), it is split and the excess space is reinserted into the trees as a new free extent.
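As a simplified in-memory sketch of this behaviour (the prototype keeps these structures in on-disk B+ trees; ordinary TreeMaps stand in for them here, and the coalescing of adjacent free extents on freeing, described after figure 5.1, is included as well):

    // Simplified in-memory sketch of the address-ordered best-fit allocator described
    // above; not the on-disk B+ tree implementation. Extents are (start, size) pairs
    // measured in pages.
    import java.util.Map;
    import java.util.TreeMap;

    class ExtentAllocator {
        // size-indexed "tree": size -> (start -> size); the start acts as secondary key
        private final TreeMap<Long, TreeMap<Long, Long>> bySize = new TreeMap<>();
        // address-indexed "tree": start -> size, used for coalescing and clustering
        private final TreeMap<Long, Long> byAddress = new TreeMap<>();

        ExtentAllocator(long totalPages) {
            insertFree(0, totalPages);
        }

        /** Best fit: smallest sufficient extent, lowest address first; split off the rest. */
        long allocate(long pages) {
            Map.Entry<Long, TreeMap<Long, Long>> fit = bySize.ceilingEntry(pages);
            if (fit == null) throw new IllegalStateException("out of space");
            long size = fit.getKey();
            long start = fit.getValue().firstKey();
            removeFree(start, size);
            if (size > pages) insertFree(start + pages, size - pages); // re-insert excess
            return start;
        }

        /** Frees an extent and coalesces it with directly adjacent free extents. */
        void free(long start, long pages) {
            Map.Entry<Long, Long> prev = byAddress.floorEntry(start);
            if (prev != null && prev.getKey() + prev.getValue() == start) {
                removeFree(prev.getKey(), prev.getValue());
                start = prev.getKey();
                pages += prev.getValue();
            }
            Long nextSize = byAddress.get(start + pages);
            if (nextSize != null) {
                removeFree(start + pages, nextSize);
                pages += nextSize;
            }
            insertFree(start, pages);
        }

        private void insertFree(long start, long size) {
            byAddress.put(start, size);
            bySize.computeIfAbsent(size, s -> new TreeMap<>()).put(start, size);
        }

        private void removeFree(long start, long size) {
            byAddress.remove(start);
            TreeMap<Long, Long> sameSize = bySize.get(size);
            sameSize.remove(start);
            if (sameSize.isEmpty()) bySize.remove(size);
        }
    }

The size-indexed map realizes the best-fit search, while the address-indexed map is what makes merging with directly adjacent free extents cheap.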
[Figure: flow of disk space in the allocation subsystem between application data, the size-indexed and address-indexed B-trees, the deferred buffer and the page source: allocations, remainders of splits, freeing, overflow, and new/discarded tree nodes.]
Figure 5.1: "Space paths" in the allocation subsystem

When space is freed, the address-indexed tree is used to find out whether there are free extents directly adjacent. If one is found, it is removed from the trees and the two or three extents (three if there were free extents on both sides of the one being freed) are coalesced into one, which is then inserted into the trees.

Thanks to the B-trees, this is a highly scalable implementation that makes it possible to find the best fit even on a very large and highly fragmented disk with hundreds of thousands of free extents. However, there are some additional complications.

5.1.2 Internal Allocation

The algorithm outlined above fails to take into account one important factor: the B-trees in which the free extents are stored also take up space and must be stored on the disk. Since their size is highly variable, they cannot be put in preallocated space. Instead, the tree nodes must be dynamically allocated and deallocated just like the application data.

This is a classical chicken-and-egg problem: the tree nodes cannot be allocated in the same way as application data, because when a tree node is allocated or deallocated, the tree is in the process of being modified, therefore possibly in an inconsistent state, and thus cannot be used for another allocation or deallocation. This is solved by using two additional data structures: a page source and a page sink.

Page Source:

In the abstract, an object which yields free pages for use as tree nodes when desired. It is implemented as a small buffer that holds enough pages (or their addresses, to be exact) to satisfy the need for new tree pages during a single deallocation operation (the trees will grow only during deallocation, when a new free extent is inserted). The maximum number n of pages that might be needed can be estimated as follows:

    n = 2 \left\lceil \log_b \left( \frac{p}{2} \right) \right\rceil

where p is the number of pages on the disk, divided by 2 to get the number of free extents in the worst case of maximal fragmentation. We take the logarithm of this to base b, b being the minimal branching degree of the trees (determined by the node size and the fact that a node is half full in the worst case), to get the maximal height of a tree. This number is rounded up to give the maximal number of new pages necessary when a node split in a tree of maximal size propagates up to the root, and multiplied by 2 to account for the presence of two trees. To give a practical example: on a 100 GB disk using 4 KB pages, p is about 26,000,000 and b is 127. In this case, n is 8.

When the reserve becomes too small, it is refilled. In order to reduce the overhead, the buffer actually holds up to 2n pages, and is refilled by requesting n consecutive new pages only when it is half-full. Since the actually required n is quite low in normal cases (as seen above), it might be advisable to use a larger number to reduce overhead further, decrease fragmentation and increase locality by keeping the tree nodes in somewhat larger clusters.

Page Sink:

Does the opposite job of the page source by accepting pages from discarded tree nodes and inserting them into the trees not immediately but later, when the allocation or coalescing request is finished (nodes will be discarded during allocation and coalescing, when extents are deleted from the trees, in the latter case only temporarily). Page sink and page source could theoretically be the same object, one which reuses discarded node pages, but it would still have to use the real allocation subsystem for both allocation and deallocation, since the trees' overall size might grow or shrink drastically in some situations.
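A minimal sketch of such a page source, assuming the ExtentAllocator sketched in section 5.1.1 and a worst-case demand of n pages derived from the formula above (again an illustration, not the prototype's code):

    // Sketch of a page source: a small reserve of free pages used for new tree nodes,
    // refilled from the regular allocator once it runs low. Assumes the ExtentAllocator
    // sketch from section 5.1.1; n is the worst-case page demand estimated above.
    import java.util.ArrayDeque;

    class PageSource {
        private final ExtentAllocator allocator;
        private final int n;                          // pages needed in the worst case
        private final ArrayDeque<Long> reserve = new ArrayDeque<>();

        PageSource(ExtentAllocator allocator, int n) {
            this.allocator = allocator;
            this.n = n;
            refill();
        }

        /** Hands out a reserved page for use as a new tree node. */
        long takePage() {
            long page = reserve.removeFirst();
            if (reserve.size() < n) refill();         // top up once half of the buffer is used
            return page;
        }

        private void refill() {
            long start = allocator.allocate(n);       // n consecutive pages in one request
            for (long i = 0; i < n; i++) reserve.addLast(start + i);
        }
    }

A page sink could be sketched analogously as a small FIFO of discarded node pages that are handed back to the allocator only after the current operation has finished.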
5.1.3 Deferred Coalescing

Deferred coalescing is a technique mentioned in section 4.1.1 that seeks to speed up the allocation process by keeping a "deferred buffer" with a fixed maximum size, containing recently freed extents in memory without coalescing them. This list is searched before resorting to the normal allocation mechanism.

Since the "page sink" described in section 5.1.2 already implements this by necessity for pages used by the allocation subsystem's B-trees, it was an easy decision to extend it to do general deferred coalescing. Thus, all freed extents are put into this "deferred buffer" first, which acts as a FIFO queue, inserting the oldest extents in it into the trees when its designated size is exceeded.

The size of this buffer directly determines how effective the speed gains of deferred coalescing will be; if it is very small, there will be no effect, but if it is too large, the sequential search might also have a negative effect, and it can increase fragmentation if it keeps too much free space uncoalesced.

Another question is what to do if a search of the buffer yields a free extent larger than the requirement of the allocation request. One possibility would be to use this extent immediately and not look through the trees. Obviously, this is the fastest option. However, it would mean that the allocation policy would partially turn into "LIFO first fit" (i.e. using the first free extent that is large enough, looking at ...), a policy that exhibited an intolerably high fragmentation in [7]. Therefore I instead chose to look through both the deferred buffer and the size-ordered B-tree (except when the buffer yields an exact fit) in order to implement a "best fit" policy.

5.1.4 Clustering Allocation

In [13], the capability of the B-tree based allocation mechanism to provide "clustering", i.e. to use the address-indexed B-tree to find free space near a given address, is stressed. The benefit is that data that is likely to be accessed together can be kept close together on disk to reduce seek times. This capability was not really necessary for the Autonomous Disks and is unlikely to be of much use except maybe for the Fat-Btree itself, but adding it to the allocation subsystem was a relatively simple task, since the data structures support it inherently.

5.2 Arbitrary-Length Streams, Random Access

The prototype had to be modified to enable it to handle streams of any size, which was achieved by adding a length field to the index entries and allocating the necessary number of disk pages.

But this was not sufficient. For some of the possible applications, especially multimedia data, stream sizes could easily reach into the gigabyte range. As described in section 2.1.4, the prototype implementation relies on transferring each stream as a whole before starting to process it, and during processing, it actually needs to be copied several times. For large streams, this would require an impossibly high amount of main memory. Additionally, all other requests would have to wait for the large stream to be processed, resulting in intolerably high response times.

To solve this problem, it is necessary to transfer large streams in smaller pieces, small enough to comfortably process several of them in main memory and cause no undue delays for other requests. This could have been achieved with a sequential model, in which the pieces of a stream are always transferred in order, starting at the stream's beginning; this option was considered and partially implemented, as mentioned in section 5.5.3. However, at the price of an only slightly more complex interface, it was possible to allow random access to any point within a stream, without needing to transfer the entire stream.
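To illustrate what such an interface amounts to from the client side, here is a hypothetical sketch (the StreamStore type, the method name readRange and the 64 KB piece size are assumptions for the example, not the prototype's actual interface): an arbitrary range of a large stream is fetched in pieces small enough to buffer comfortably.

    // Hypothetical client-side view of piecewise, random-access stream reads.
    import java.io.IOException;
    import java.io.OutputStream;

    interface StreamStore {
        /** Reads up to 'length' bytes starting at 'offset' within the stream. */
        byte[] readRange(String streamKey, long offset, int length);
    }

    class PiecewiseReader {
        static void copyRange(StreamStore store, String key, long offset, long length,
                              OutputStream out) throws IOException {
            final int pieceSize = 64 * 1024;      // assumed transfer unit
            long end = offset + length;
            long pos = offset;
            while (pos < end) {
                int want = (int) Math.min(pieceSize, end - pos);
                byte[] piece = store.readRange(key, pos, want);
                if (piece.length == 0) break;     // end of stream reached early
                out.write(piece);
                pos += piece.length;
            }
        }
    }

Requests for other streams can be interleaved between the individual pieces, which avoids the long blocking that whole-stream transfers would cause.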