From be7cbb4ef2cb02aa9ac48c02dc1ee585a8e49043 Mon Sep 17 00:00:00 2001 From: rsc Date: Tue, 12 Jul 2005 15:24:18 +0000 Subject: venti, now with documentation! --- man/man7/venti.7 | 439 ++++++++++++++++++++++++++++++++++++++++++++++++++ man/man7/venti.conf.7 | 360 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 799 insertions(+) create mode 100644 man/man7/venti.7 create mode 100644 man/man7/venti.conf.7 (limited to 'man/man7') diff --git a/man/man7/venti.7 b/man/man7/venti.7 new file mode 100644 index 00000000..efab4e99 --- /dev/null +++ b/man/man7/venti.7 @@ -0,0 +1,439 @@ +.TH VENTI 7 +.SH NAME +venti \- archival storage server +.SH DESCRIPTION +Venti is a block storage server intended for archival data. +In a Venti server, the SHA1 hash of a block's contents acts +as the block identifier for read and write operations. +This approach enforces a write-once policy, preventing +accidental or malicious destruction of data. In addition, +duplicate copies of a block are coalesced, reducing the +consumption of storage and simplifying the implementation +of clients. +.PP +This manual page documents the basic concepts of +block storage using Venti as well as the Venti network protocol. +.PP +.IR Venti (1) +documents some simple clients. +.IR Vac (1), +.IR vbackup (1), +.IR vacfs (4), +and +.IR vnfs (4) +are more complex clients. +.PP +.IR Venti (3) +describes a C library interface for accessing +Venti servers and manipulating Venti data structures. +.PP +.IR Venti.conf (7) +describes the Venti server configuration file. +.PP +.IR Venti (8) +describes the programs used to run a Venti server. +.PP +.SS "Scores +The SHA1 hash that identifies a block is called its +.IR score . +The score of the zero-length block is called the +.IR "zero score" . +.PP +Scores may have an optional +.IB label : +prefix, typically used to +describe the format of the data. 
+For example, +.IR vac (1) +uses a +.B vac: +prefix, while +.IR vbackup (1) +uses prefixes corresponding to the file system +types: +.BR ext2: , +.BR ffs: , +and so on. +.SS "Files and Directories +Venti accepts blocks up to 56 kilobytes in size. +By convention, Venti clients use hash trees of blocks to +represent arbitrary-size data +.IR files . +The data to be stored is split into fixed-size +blocks and written to the server, producing a list +of scores. +The resulting list of scores is split into fixed-size pointer +blocks (using only an integral number of scores per block) +and written to the server, producing a smaller list +of scores. +The process continues, eventually ending with the +score for the hash tree's top-most block. +Each file stored this way is summarized by +a +.B VtEntry +structure recording the top-most score, the depth +of the tree, the data block size, and the pointer block size. +One or more +.B VtEntry +structures can be concatenated +and stored as a special file called a +.IR directory . +In this +manner, arbitrary trees of files can be constructed +and stored. +.PP +Scores passed between programs conventionally refer +to +.B VtRoot +blocks, which contain descriptive information +as well as the score of a block containing a small number +of +.B VtEntries . +.SS "Block Types +To allow programs to traverse these structures without +needing to understand their higher-level meanings, +Venti tags each block with a type. The types are: +.PP +.nf +.ft L + VtDataType 000 \f1data\fL + VtDataType+1 001 \fRscores of \fPVtDataType\fR blocks\fL + VtDataType+2 002 \fRscores of \fPVtDataType+1\fR blocks\fL + \fR\&...\fL + VtDirType 010 VtEntry\fR structures\fL + VtDirType+1 011 \fRscores of \fLVtDirType\fR blocks\fL + VtDirType+2 012 \fRscores of \fLVtDirType+1\fR blocks\fL + \fR\&...\fL + VtRootType 020 VtRoot\fR structure\fL +.fi +.PP +The octal numbers listed are the type numbers used +by the commands below. 
+(For historical reasons, the type numbers used on +disk and on the wire are different from the above. +They do not distinguish +.BI VtDataType+ n +blocks from +.BI VtDirType+ n +blocks.) +.SS "Zero Truncation +To avoid storing the same short data blocks padded with +differing numbers of zeros, Venti clients working with fixed-size +blocks conventionally +`zero truncate' the blocks before writing them to the server. +For example, if a 1024-byte data block contains the +11-byte string +.RB ` hello " " world ' +followed by 1013 zero bytes, +a client would store only the 11-byte block. +When the client later read the block from the server, +it would append zeros to the end as necessary to +reach the expected size. +.PP +When truncating pointer blocks +.RB ( VtDataType+ \fIn +and +.BI VtDirType+ n +blocks), +trailing zero scores are removed +instead of trailing zero bytes. +.PP +Because of the truncation convention, +any file consisting entirely of zero bytes, +no matter what the length, will be represented by the zero score: +the data blocks contain all zeros and are thus truncated +to the empty block, and the pointer blocks contain all zero scores +and are thus also truncated to the empty block, +and so on up the hash tree. +.SS NETWORK PROTOCOL +A Venti session begins when a +.I client +connects to the network address served by a Venti +.IR server ; +the conventional address is +.BI tcp! server !venti +(the +.B venti +port is 17034). +Both client and server begin by sending a version +string of the form +.BI venti- versions - comment \en \fR. +The +.I versions +field is a list of acceptable versions separated by +colons. +The protocol described here is version +.B 02 . +The client is responsible for choosing a common +version and sending it in the +.B VtThello +message, described below. 
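The version exchange described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of any Venti client: the function names are hypothetical, and it assumes that version tokens contain no hyphen, so the first hyphen after the version list starts the comment.

```python
def make_version_string(versions, comment):
    """Build the venti-versions-comment line that each side sends first."""
    return ("venti-" + ":".join(versions) + "-" + comment + "\n").encode("utf-8")


def parse_version_string(line):
    """Split b'venti-02:04-comment\\n' into (['02', '04'], 'comment')."""
    text = line.decode("utf-8")
    if not text.endswith("\n"):
        raise ValueError("version string must end with a newline")
    prefix, sep, rest = text[:-1].partition("-")
    if prefix != "venti" or not sep:
        raise ValueError("version string must start with 'venti-'")
    # Assumption: the version list itself contains no hyphen.
    versions, sep, comment = rest.partition("-")
    if not sep or not versions:
        raise ValueError("version string needs versions and a comment")
    return versions.split(":"), comment


def choose_version(client_versions, server_line):
    """Return the first client version the server also lists, or None."""
    server_versions, _ = parse_version_string(server_line)
    for version in client_versions:
        if version in server_versions:
            return version
    return None
```

The client would place the value returned by choose_version in the version field of its VtThello message; a result of None means the two sides share no version and the session cannot proceed.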
+.PP +After the initial version exchange, the client transmits +.I requests +.RI ( T-messages ) +to the server, which subsequently returns +.I replies +.RI ( R-messages ) +to the client. +The combined act of transmitting (receiving) a request +of a particular type, and receiving (transmitting) its reply +is called a +.I transaction +of that type. +.PP +Each message consists of a sequence of bytes. +Two-byte fields hold unsigned integers represented +in big-endian order (most significant byte first). +Data items of variable lengths are represented by +a one-byte field specifying a count, +.IR n , +followed by +.I n +bytes of data. +Text strings are represented similarly, +using a two-byte count with +the text itself stored as a UTF-8 encoded sequence +of Unicode characters (see +.IR utf (7)). +Text strings are not +.SM NUL\c +-terminated: +.I n +counts the bytes of UTF-8 data, which include no final +zero byte. +The +.SM NUL +character is illegal in text strings in the Venti protocol. +The maximum string length in Venti is 1024 bytes. +.PP +Each Venti message begins with a two-byte size field +specifying the length in bytes of the message, +not including the length field itself. +The next byte is the message type, one of the constants +in the enumeration in the include file +.BR . +The next byte is an identifying +.IR tag , +used to match responses with requests. +The remaining bytes are parameters of different sizes. +In the message descriptions, the number of bytes in a field +is given in brackets after the field name. +The notation +.IR parameter [ n ] +where +.I n +is not a constant represents a variable-length parameter: +.IR n [1] +followed by +.I n +bytes of data forming the +.IR parameter . +The notation +.IR string [ s ] +(using a literal +.I s +character) +is shorthand for +.IR s [2] +followed by +.I s +bytes of UTF-8 text. 
+The notation +.IR parameter [] +where +.I parameter +is the last field in the message represents a +variable-length field that comprises all remaining +bytes in the message. +.PP +All Venti RPC messages are prefixed with a field +.IR size [2] +giving the length of the message that follows +(not including the +.I size +field itself). +The message bodies are: +.ta \w'\fLVtTgoodbye 'u +.IP +.ne 2v +.B VtThello +.IR tag [1] +.IR version [ s ] +.IR uid [ s ] +.IR strength [1] +.IR crypto [ n ] +.IR codec [ n ] +.br +.B VtRhello +.IR tag [1] +.IR sid [ s ] +.IR rcrypto [1] +.IR rcodec [1] +.IP +.ne 2v +.B VtTping +.IR tag [1] +.br +.B VtRping +.IR tag [1] +.IP +.ne 2v +.B VtTread +.IR tag [1] +.IR score [20] +.IR type [1] +.IR pad [1] +.IR count [2] +.br +.B VtRread +.IR tag [1] +.IR data [] +.IP +.ne 2v +.B VtTwrite +.IR tag [1] +.IR type [1] +.IR pad [3] +.IR data [] +.br +.B VtRwrite +.IR tag [1] +.IR score [20] +.IP +.ne 2v +.B VtTsync +.IR tag [1] +.br +.B VtRsync +.IR tag [1] +.IP +.ne 2v +.B VtRerror +.IR tag [1] +.IR error [ s ] +.IP +.ne 2v +.B VtTgoodbye +.IR tag [1] +.PP +Each T-message has a one-byte +.I tag +field, chosen and used by the client to identify the message. +The server will echo the request's +.I tag +field in the reply. +Clients should arrange that no two outstanding +messages have the same tag field so that responses +can be distinguished. +.PP +The type of an R-message will be either one greater than +the type of the corresponding T-message or +.BR VtRerror , +indicating that the request failed. +In the latter case, the +.I error +field contains a string describing the reason for failure. +.PP +Venti connections must begin with a +.B hello +transaction. +The +.B VtThello +message contains the protocol +.I version +that the client has chosen to use. +The fields +.IR strength , +.IR crypto , +and +.IR codec +could be used to add authentication, encryption, +and compression to the Venti session +but are currently ignored.
+The +.I rcrypto +and +.I rcodec +fields in the +.B VtRhello +response are similarly ignored. +The +.IR uid +and +.IR sid +fields are intended to be the identity +of the client and server but, given the lack of +authentication, should be treated only as advisory. +The initial +.B hello +should be the only +.B hello +transaction during the session. +.PP +The +.B ping +message has no effect and +is used mainly for debugging. +Servers should respond immediately to pings. +.PP +The +.B read +message requests a block with the given +.I score +and +.I type . +Use +.I vttodisktype +and +.I vtfromdisktype +(see +.IR venti (3)) +to convert a block type enumeration value +.RB ( VtDataType , +etc.) +to the +.I type +used on disk and in the protocol. +The +.I count +field specifies the maximum expected size +of the block. +The +.I data +in the reply is the block's contents. +.PP +The +.B write +message writes a new block of the given +.I type +with contents +.I data +to the server. +The response includes the +.I score +to use to read the block, +which should be the SHA1 hash of +.IR data . +.PP +The Venti server may buffer written blocks in memory, +waiting until after responding to the +.B write +message before writing them to +permanent storage. +The server will delay the response to a +.B sync +message until after all blocks in earlier +.B write +messages have been written to permanent storage. +.PP +The +.B goodbye +message ends a session. There is no +.BR VtRgoodbye : +upon receiving the +.BR VtTgoodbye +message, the server tears down the connection. +.SH SEE ALSO +.IR venti (1), +.IR venti (3) diff --git a/man/man7/venti.conf.7 b/man/man7/venti.conf.7 new file mode 100644 index 00000000..000d8aa4 --- /dev/null +++ b/man/man7/venti.conf.7 @@ -0,0 +1,360 @@ +.TH VENTI.CONF 7 +.SH NAME +venti.conf \- venti configuration +.SH DESCRIPTION +Venti is a SHA1-addressed archival storage server. +See +.IR venti (7) +for a full introduction to the system.
+This page documents the structure and operation of the server. +.PP +A venti server requires multiple disks or disk partitions, +each of which must be properly formatted before the server +can be run. +.SS Disk +The venti server maintains three disk structures, typically +stored on raw disk partitions: +the append-only +.IR "data log" , +which holds, in sequential order, +the contents of every block written to the server; +the +.IR index , +which helps locate a block in the data log given its score; +and optionally the +.IR "bloom filter" , +a concise summary of which scores are present in the index. +The data log is the primary storage. +To improve the robustness, it should be stored on +a device that provides RAID functionality. +The index and the bloom filter are optimizations +employed to access the data log efficiently and can be rebuilt +if lost or damaged. +.PP +The data log is logically split into sections called +.IR arenas , +typically sized for easy offline backup +(e.g., 500MB). +A data log may comprise many disks, each storing +one or more arenas. +Such disks are called +.IR "arena partitions" . +Arena partitions are filled in the order given in the configuration. +.PP +The index is logically split into block-sized pieces called +.IR buckets , +each of which is responsible for a particular range of scores. +An index may be split across many disks, each storing many buckets. +Such disks are called +.IR "index sections" . +.PP +The index must be sized so that no bucket is full. +When a bucket fills, the server must be shut down and +the index made larger. +Since scores appear random, each bucket will contain +approximately the same number of entries. +Index entries are 40 bytes long. Assuming that a typical block +being written to the server is 8192 bytes and compresses to 4096 +bytes, the active index is expected to be about 1% of +the active data log. +Storing smaller blocks increases the relative index footprint; +storing larger blocks decreases it. 
+To allow variation in both block size and the random distribution +of scores to buckets, the suggested index size is 5% of +the active data log. +.PP +The (optional) bloom filter is a large bitmap that is stored on disk but +also kept completely in memory while the venti server runs. +It helps the venti server efficiently detect scores that are +.I not +already stored in the index. +The bloom filter starts out zeroed. +Each score recorded in the bloom filter is hashed to choose +.I nhash +bits to set in the bloom filter. +A score is definitely not stored in the index if any of its +.I nhash +bits is not set. +The bloom filter thus has two parameters: +.I nhash +(maximum 32) +and the total bitmap size +(maximum 512MB, 2\s-2\u32\d\s+2 bits). +.PP +The bloom filter should be sized so that +.I nhash +\(mu +.I nblock +\(<= +0.7 \(mu +.IR b , +where +.I nblock +is the expected number of blocks stored on the server +and +.I b +is the bitmap size in bits. +The false positive rate of the bloom filter when sized +this way is approximately 2\s-2\u\-\fInhash\fR\d\s+2. +.I Nhash +less than 10 are not very useful; +.I nhash +greater than 24 are probably a waste of memory. +.I Fmtbloom +(see +.IR venti-fmt (8)) +can be given either +.I nhash +or +.IR nblock ; +if given +.IR nblock , +it will derive an appropriate +.IR nhash . +.SS Memory +Venti can make effective use of large amounts of memory +for various caches. +.PP +The +.I "lump cache +holds recently-accessed venti data blocks, which the server refers to as +.IR lumps . +The lump cache should be at least 1MB but can profitably be much larger. +The lump cache can be thought of as the level-1 cache: +read requests handled by the lump cache can +be served instantly. +.PP +The +.I "block cache +holds recently-accessed +.I disk +blocks from the arena partitions. +The block cache needs to be able to simultaneously hold two blocks +from each arena plus four blocks for the currently-filling arena.
+The block cache can be thought of as the level-2 cache: +read requests handled by the block cache are slower than those +handled by the lump cache, since the lump data must be extracted +from the raw disk blocks and possibly decompressed, but no +disk accesses are necessary. +.PP +The +.I "index cache +holds recently-accessed or prefetched +index entries. +The index cache needs to be able to hold index entries +for three or four arenas, at least, in order for prefetching +to work properly. Each index entry is 50 bytes. +Assuming 500MB arenas of +128,000 blocks that are 4096 bytes each after compression, +the minimum index cache size is about 6MB. +The index cache can be thought of as the level-3 cache: +read requests handled by the index cache must still go +to disk to fetch the arena blocks, but the costly random +access to the index is avoided. +.PP +The size of the index cache determines how long venti +can sustain its `burst' write throughput, during which time +the only disk accesses on the critical path +are sequential writes to the arena partitions. +For example, if you want to be able to sustain 10MB/s +for an hour, you need enough index cache to hold entries +for 36GB of blocks. Assuming 8192-byte blocks, +you need room for almost five million index entries. +Since index entries are 50 bytes each, you need 250MB +of index cache. +If the background index update process can make a single +pass through the index in an hour, which is possible, +then you can sustain the 10MB/s indefinitely (at least until +the arenas are all filled). +.PP +The +.I "bloom filter +requires memory equal to its size on disk, +as discussed above. +.PP +A reasonable starting allocation is to +divide memory equally (in thirds) between +the bloom filter, the index cache, and the lump and block caches; +the third of memory allocated to the lump and block caches +should be split unevenly, with more (say, two thirds) +going to the block cache. 
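The sizing rules in this section reduce to simple arithmetic, sketched below in Python. The function names and dictionary keys are hypothetical; the 50-byte entry size and the 8192-byte block size are the figures quoted in the text, and the split follows the suggested starting allocation rather than any allocation venti performs itself.

```python
INDEX_ENTRY_BYTES = 50  # in-memory index entry size quoted in the text


def split_memory(total_bytes):
    """Divide memory in thirds: bloom filter, index cache, lump+block caches.

    The last third is split unevenly, two thirds to the block cache.
    """
    third = total_bytes // 3
    block_cache = third * 2 // 3
    lump_cache = third - block_cache
    return {
        "bloom": third,
        "icmem": third,
        "bcmem": block_cache,
        "mem": lump_cache,
    }


def burst_icache_bytes(rate_bytes_per_sec, seconds, block_bytes=8192):
    """Index cache needed to absorb a write burst of the given rate and length."""
    total = rate_bytes_per_sec * seconds
    entries = -(-total // block_bytes)  # ceiling division: one entry per block
    return entries * INDEX_ENTRY_BYTES
```

For the example in the text, burst_icache_bytes(10_000_000, 3600) gives about 220MB for roughly 4.4 million entries; the text rounds these up to five million entries and 250MB.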
+.SS Network +The venti server announces two network services, one +(conventionally TCP port +.BR venti , +17034) serving +the venti protocol as described in +.IR venti (7), +and one serving HTTP +(conventionally TCP port +.BR http , +80). +.PP +The venti web server provides the following +URLs for accessing status information: +.TP +.B /index +A summary of the usage of the arenas and index sections. +.TP +.B /xindex +An XML version of +.BR /index . +.TP +.B /storage +Brief storage totals. +.TP +.BI /set/ variable +The current integer value of +.IR variable . +Variables are: +.BR compress , +whether or not to compress blocks +(for debugging); +.BR logging , +whether to write entries to the debugging logs; +.BR stats , +whether to collect run-time statistics; +.BR icachesleeptime , +the time in milliseconds between successive updates +of megabytes of the index cache; +.BR arenasumsleeptime , +the time in milliseconds between reads while +checksumming an arena in the background. +The two sleep times should be (but are not) managed by venti; +they exist to provide more experience with their effects. +The other variables exist only for debugging and +performance measurement. +.TP +.BI /set/ variable / value +Set +.I variable +to +.IR value . +.TP +.BI /graph/ name / param / param / \fR... +A PNG image graphing the named run-time statistic over time. +The details of names and parameters are undocumented; +see +.B httpd.c +in the venti sources. +.TP +.B /log +A list of all debugging logs present in the server's memory. +.TP +.BI /log/ name +The contents of the debugging log with the given +.IR name . +.TP +.B /flushicache +Force venti to begin flushing the index cache to disk. +The response will not be sent until the flush +has completed. +.TP +.B /flushdcache +Force venti to begin flushing the arena block cache to disk. +The response will not be sent until the flush +has completed.
+.PD +.PP +Requests for other files are served by consulting a +directory named in the configuration file +(see +.B webroot +below). +.SS Configuration File +A venti configuration file +enumerates the various index sections and +arenas that constitute a venti system. +The components are indicated by the name of the file, typically +a disk partition, in which they reside. The configuration +file is the only location that file names are used. Internally, +venti uses the names assigned when the components were formatted +with +.I fmtarenas +or +.I fmtisect +(see +.IR venti-fmt (8)). +In particular, only the configuration needs to be +changed if a component is moved to a different file. +.PP +The configuration file consists of lines in the form described below. +Lines starting with +.B # +are comments. +.TP +.BI index " name +Names the index for the system. +.TP +.BI arenas " file +.I File +is an arena partition, formatted using +.IR fmtarenas . +.TP +.BI isect " file +.I File +is an index section, formatted using +.IR fmtisect . +.PP +After formatting a venti system using +.IR fmtindex , +the order of arenas and index sections should not be changed. +Additional arenas can be appended to the configuration; +run +.I fmtindex +with the +.B -a +flag to update the index. +.PP +The configuration file also holds configuration parameters +for the venti server itself. +These are: +.TF httpaddr netaddr +.TP +.BI mem " size +lump cache size +.TP +.BI bcmem " size +block cache size +.TP +.BI icmem " size +index cache size +.TP +.BI addr " netaddr +network address to announce venti service +(default +.BR tcp!*!venti ) +.TP +.BI httpaddr " netaddr +network address to announce HTTP service +(default +.BR tcp!*!http ) +.TP +.B queuewrites +queue writes in memory +(default is not to queue) +.PD +See the server description in +.IR venti (8) +for explanations of these variables. 
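As an illustration of the file format described above, the following Python sketch reads a configuration into a dictionary. It is not the parser venti uses: the function names are hypothetical, the handling of leading whitespace is a simplification, and the size suffixes (k, m, g as powers of 1024) are an assumption, since this page does not define the syntax of size values.

```python
# Assumption: sizes are integers with an optional k/m/g suffix (powers of 1024).
SIZE_SUFFIXES = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert '10M' or '4096' to a byte count."""
    suffix = text[-1:].lower()
    if suffix in SIZE_SUFFIXES:
        return int(text[:-1]) * SIZE_SUFFIXES[suffix]
    return int(text)


def parse_config(text):
    """Parse venti.conf text; arenas and isect keep their file order."""
    conf = {
        "index": None, "arenas": [], "isect": [],
        "mem": None, "bcmem": None, "icmem": None,
        "addr": "tcp!*!venti", "httpaddr": "tcp!*!http",
        "queuewrites": False,
    }
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, *args = line.split()
        if key == "queuewrites":
            if args:
                raise ValueError("queuewrites takes no argument")
            conf[key] = True
        elif len(args) != 1:
            raise ValueError("expected one argument: " + line)
        elif key in ("arenas", "isect"):
            conf[key].append(args[0])
        elif key in ("mem", "bcmem", "icmem"):
            conf[key] = parse_size(args[0])
        elif key in ("index", "addr", "httpaddr"):
            conf[key] = args[0]
        else:
            raise ValueError("unknown directive: " + key)
    return conf
```

Keeping arenas and isect as ordered lists reflects the rule that their order must not change once the system has been formatted with fmtindex.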
+.SH EXAMPLE +.IP +.EX +index main +isect /tmp/disks/isect0 +isect /tmp/disks/isect1 +arenas /tmp/disks/arenas +mem 10M +bcmem 20M +icmem 30M +.EE +.SH "SEE ALSO" +.IR venti (8), +.IR venti-fmt (8) +.SH BUGS +Setting up a venti server is too complicated. +.PP +Venti should not require the user to decide how to +partition its memory usage.