aboutsummaryrefslogtreecommitdiff
path: root/man/man7
diff options
context:
space:
mode:
authorrsc <devnull@localhost>2005-07-12 15:24:18 +0000
committerrsc <devnull@localhost>2005-07-12 15:24:18 +0000
commitbe7cbb4ef2cb02aa9ac48c02dc1ee585a8e49043 (patch)
treebf1d493c17a924df86dd05099caf4c07bc11c0d7 /man/man7
parenta0d146edd7a7de6236a0d60baafeeb59f8452aae (diff)
downloadplan9port-be7cbb4ef2cb02aa9ac48c02dc1ee585a8e49043.tar.gz
plan9port-be7cbb4ef2cb02aa9ac48c02dc1ee585a8e49043.tar.bz2
plan9port-be7cbb4ef2cb02aa9ac48c02dc1ee585a8e49043.zip
venti, now with documentation!
Diffstat (limited to 'man/man7')
-rw-r--r--man/man7/venti.7439
-rw-r--r--man/man7/venti.conf.7360
2 files changed, 799 insertions, 0 deletions
diff --git a/man/man7/venti.7 b/man/man7/venti.7
new file mode 100644
index 00000000..efab4e99
--- /dev/null
+++ b/man/man7/venti.7
@@ -0,0 +1,439 @@
+.TH VENTI 7
+.SH NAME
+venti \- archival storage server
+.SH DESCRIPTION
+Venti is a block storage server intended for archival data.
+In a Venti server, the SHA1 hash of a block's contents acts
+as the block identifier for read and write operations.
+This approach enforces a write-once policy, preventing
+accidental or malicious destruction of data. In addition,
+duplicate copies of a block are coalesced, reducing the
+consumption of storage and simplifying the implementation
+of clients.
+.PP
+This manual page documents the basic concepts of
+block storage using Venti as well as the Venti network protocol.
+.PP
+.IR Venti (1)
+documents some simple clients.
+.IR Vac (1),
+.IR vbackup (1),
+.IR vacfs (4),
+and
+.IR vnfs (4)
+are more complex clients.
+.PP
+.IR Venti (3)
+describes a C library interface for accessing
+Venti servers and manipulating Venti data structures.
+.PP
+.IR Venti.conf (7)
+describes the Venti server configuration file.
+.PP
+.IR Venti (8)
+describes the programs used to run a Venti server.
+.PP
+.SS "Scores
+The SHA1 hash that identifies a block is called its
+.IR score .
+The score of the zero-length block is called the
+.IR "zero score" .
+.PP
+Scores may have an optional
+.IB label :
+prefix, typically used to
+describe the format of the data.
+For example,
+.IR vac (1)
+uses a
+.B vac:
+prefix, while
+.IR vbackup (1)
+uses prefixes corresponding to the file system
+types:
+.BR ext2: ,
+.BR ffs: ,
+and so on.
+.SS "Files and Directories
+Venti accepts blocks up to 56 kilobytes in size.
+By convention, Venti clients use hash trees of blocks to
+represent arbitrary-size data
+.IR files .
+The data to be stored is split into fixed-size
+blocks and written to the server, producing a list
+of scores.
+The resulting list of scores is split into fixed-size pointer
+blocks (using only an integral number of scores per block)
+and written to the server, producing a smaller list
+of scores.
+The process continues, eventually ending with the
+score for the hash tree's top-most block.
+Each file stored this way is summarized by
+a
+.B VtEntry
+structure recording the top-most score, the depth
+of the tree, the data block size, and the pointer block size.
+One or more
+.B VtEntry
+structures can be concatenated
+and stored as a special file called a
+.IR directory .
+In this
+manner, arbitrary trees of files can be constructed
+and stored.
+.PP
+Scores passed between programs conventionally refer
+to
+.B VtRoot
+blocks, which contain descriptive information
+as well as the score of a block containing a small number
+of
+.B VtEntries .
+.SS "Block Types
+To allow programs to traverse these structures without
+needing to understand their higher-level meanings,
+Venti tags each block with a type. The types are:
+.PP
+.nf
+.ft L
+ VtDataType 000 \f1data\fL
+ VtDataType+1 001 \fRscores of \fPVtDataType\fR blocks\fL
+ VtDataType+2 002 \fRscores of \fPVtDataType+1\fR blocks\fL
+ \fR\&...\fL
+ VtDirType 010 VtEntry\fR structures\fL
+ VtDirType+1 011 \fRscores of \fLVtDirType\fR blocks\fL
+ VtDirType+2 012 \fRscores of \fLVtDirType+1\fR blocks\fL
+ \fR\&...\fL
+ VtRootType 020 VtRoot\fR structure\fL
+.fi
+.PP
+The octal numbers listed are the type numbers used
+by the commands below.
+(For historical reasons, the type numbers used on
+disk and on the wire are different from the above.
+They do not distinguish
+.BI VtDataType+ n
+blocks from
+.BI VtDirType+ n
+blocks.)
+.SS "Zero Truncation
+To avoid storing the same short data blocks padded with
+differing numbers of zeros, Venti clients working with fixed-size
+blocks conventionally
+`zero truncate' the blocks before writing them to the server.
+For example, if a 1024-byte data block contains the
+11-byte string
+.RB ` hello " " world '
+followed by 1013 zero bytes,
+a client would store only the 11-byte block.
+When the client later read the block from the server,
+it would append zeros to the end as necessary to
+reach the expected size.
+.PP
+When truncating pointer blocks
+.RB ( VtDataType+ \fIn
+and
+.BI VtDirType+ n
+blocks),
+trailing zero scores are removed
+instead of trailing zero bytes.
+.PP
+Because of the truncation convention,
+any file consisting entirely of zero bytes,
+no matter what the length, will be represented by the zero score:
+the data blocks contain all zeros and are thus truncated
+to the empty block, and the pointer blocks contain all zero scores
+and are thus also truncated to the empty block,
+and so on up the hash tree.
+.SS NETWORK PROTOCOL
+A Venti session begins when a
+.I client
+connects to the network address served by a Venti
+.IR server ;
+the conventional address is
+.BI tcp! server !venti
+(the
+.B venti
+port is 17034).
+Both client and server begin by sending a version
+string of the form
+.BI venti- versions - comment \en \fR.
+The
+.I versions
+field is a list of acceptable versions separated by
+colons.
+The protocol described here is version
+.B 02 .
+The client is responsible for choosing a common
+version and sending it in the
+.B VtThello
+message, described below.
+.PP
+After the initial version exchange, the client transmits
+.I requests
+.RI ( T-messages )
+to the server, which subsequently returns
+.I replies
+.RI ( R-messages )
+to the client.
+The combined act of transmitting (receiving) a request
+of a particular type, and receiving (transmitting) its reply
+is called a
+.I transaction
+of that type.
+.PP
+Each message consists of a sequence of bytes.
+Two-byte fields hold unsigned integers represented
+in big-endian order (most significant byte first).
+Data items of variable lengths are represented by
+a one-byte field specifying a count,
+.IR n ,
+followed by
+.I n
+bytes of data.
+Text strings are represented similarly,
+using a two-byte count with
+the text itself stored as a UTF-8 encoded sequence
+of Unicode characters (see
+.IR utf (7)).
+Text strings are not
+.SM NUL\c
+-terminated:
+.I n
+counts the bytes of UTF-8 data, which include no final
+zero byte.
+The
+.SM NUL
+character is illegal in text strings in the Venti protocol.
+The maximum string length in Venti is 1024 bytes.
+.PP
+Each Venti message begins with a two-byte size field
+specifying the length in bytes of the message,
+not including the length field itself.
+The next byte is the message type, one of the constants
+in the enumeration in the include file
+.BR <venti.h> .
+The next byte is an identifying
+.IR tag ,
+used to match responses with requests.
+The remaining bytes are parameters of different sizes.
+In the message descriptions, the number of bytes in a field
+is given in brackets after the field name.
+The notation
+.IR parameter [ n ]
+where
+.I n
+is not a constant represents a variable-length parameter:
+.IR n [1]
+followed by
+.I n
+bytes of data forming the
+.IR parameter .
+The notation
+.IR string [ s ]
+(using a literal
+.I s
+character)
+is shorthand for
+.IR s [2]
+followed by
+.I s
+bytes of UTF-8 text.
+The notation
+.IR parameter []
+where
+.I parameter
+is the last field in the message represents a
+variable-length field that comprises all remaining
+bytes in the message.
+.PP
+All Venti RPC messages are prefixed with a field
+.IR size [2]
+giving the length of the message that follows
+(not including the
+.I size
+field itself).
+The message bodies are:
+.ta \w'\fLVtTgoodbye 'u
+.IP
+.ne 2v
+.B VtThello
+.IR tag [1]
+.IR version [ s ]
+.IR uid [ s ]
+.IR strength [1]
+.IR crypto [ n ]
+.IR codec [ n ]
+.br
+.B VtRhello
+.IR tag [1]
+.IR sid [ s ]
+.IR rcrypto [1]
+.IR rcodec [1]
+.IP
+.ne 2v
+.B VtTping
+.IR tag [1]
+.br
+.B VtRping
+.IR tag [1]
+.IP
+.ne 2v
+.B VtTread
+.IR tag [1]
+.IR score [20]
+.IR type [1]
+.IR pad [1]
+.IR count [2]
+.br
+.B VtRead
+.IR tag [1]
+.IR data []
+.IP
+.ne 2v
+.B VtTwrite
+.IR tag [1]
+.IR type [1]
+.IR pad [3]
+.IR data []
+.br
+.B VtRwrite
+.IR tag [1]
+.IR score [20]
+.IP
+.ne 2v
+.B VtTsync
+.IR tag [1]
+.br
+.B VtRsync
+.IR tag [1]
+.IP
+.ne 2v
+.B VtRerror
+.IR tag [1]
+.IR error [ s ]
+.IP
+.ne 2v
+.B VtTgoodbye
+.IR tag [1]
+.PP
+Each T-message has a one-byte
+.I tag
+field, chosen and used by the client to identify the message.
+The server will echo the request's
+.I tag
+field in the reply.
+Clients should arrange that no two outstanding
+messages have the same tag field so that responses
+can be distinguished.
+.PP
+The type of an R-message will either be one greater than
+the type of the corresponding T-message or
+.BR Rerror ,
+indicating that the request failed.
+In the latter case, the
+.I error
+field contains a string describing the reason for failure.
+.PP
+Venti connections must begin with a
+.B hello
+transaction.
+The
+.B VtThello
+message contains the protocol
+.I version
+that the client has chosen to use.
+The fields
+.IR strength ,
+.IR crypto ,
+and
+.IR codec
+could be used to add authentication, encryption,
+and compression to the Venti session
+but are currently ignored.
+The
+.IR rcrypto ,
+and
+.I rcodec
+fields in the
+.B VtRhello
+response are similarly ignored.
+The
+.IR uid
+and
+.IR sid
+fields are intended to be the identity
+of the client and server but, given the lack of
+authentication, should be treated only as advisory.
+The initial
+.B hello
+should be the only
+.B hello
+transaction during the session.
+.PP
+The
+.B ping
+message has no effect and
+is used mainly for debugging.
+Servers should respond immediately to pings.
+.PP
+The
+.B read
+message requests a block with the given
+.I score
+and
+.I type .
+Use
+.I vttodisktype
+and
+.I vtfromdisktype
+(see
+.IR venti (3))
+to convert a block type enumeration value
+.RB ( VtDataType ,
+etc.)
+to the
+.I type
+used on disk and in the protocol.
+The
+.I count
+field specifies the maximum expected size
+of the block.
+The
+.I data
+in the reply is the block's contents.
+.PP
+The
+.B write
+message writes a new block of the given
+.I type
+with contents
+.I data
+to the server.
+The response includes the
+.I score
+to use to read the block,
+which should be the SHA1 hash of
+.IR data .
+.PP
+The Venti server may buffer written blocks in memory,
+waiting until after responding to the
+.B write
+message before writing them to
+permanent storage.
+The server will delay the response to a
+.B sync
+message until after all blocks in earlier
+.B write
+messages have been written to permanent storage.
+.PP
+The
+.B goodbye
+message ends a session. There is no
+.BR VtRgoodbye :
+upon receiving the
+.BR VtTgoodbye
+message, the server terminates up the connection.
+.SH SEE ALSO
+.IR venti (1),
+.IR venti (3)
diff --git a/man/man7/venti.conf.7 b/man/man7/venti.conf.7
new file mode 100644
index 00000000..000d8aa4
--- /dev/null
+++ b/man/man7/venti.conf.7
@@ -0,0 +1,360 @@
+.TH VENTI.CONF 7
+.SH NAME
+venti.conf \- venti configuration
+.SH DESCRIPTION
+Venti is a SHA1-addressed archival storage server.
+See
+.IR venti (7)
+for a full introduction to the system.
+This page documents the structure and operation of the server.
+.PP
+A venti server requires multiple disks or disk partitions,
+each of which must be properly formatted before the server
+can be run.
+.SS Disk
+The venti server maintains three disk structures, typically
+stored on raw disk partitions:
+the append-only
+.IR "data log" ,
+which holds, in sequential order,
+the contents of every block written to the server;
+the
+.IR index ,
+which helps locate a block in the data log given its score;
+and optionally the
+.IR "bloom filter" ,
+a concise summary of which scores are present in the index.
+The data log is the primary storage.
+To improve the robustness, it should be stored on
+a device that provides RAID functionality.
+The index and the bloom filter are optimizations
+employed to access the data log efficiently and can be rebuilt
+if lost or damaged.
+.PP
+The data log is logically split into sections called
+.IR arenas ,
+typically sized for easy offline backup
+(e.g., 500MB).
+A data log may comprise many disks, each storing
+one or more arenas.
+Such disks are called
+.IR "arena partitions" .
+Arena partitions are filled in the order given in the configuration.
+.PP
+The index is logically split into block-sized pieces called
+.IR buckets ,
+each of which is responsible for a particular range of scores.
+An index may be split across many disks, each storing many buckets.
+Such disks are called
+.IR "index sections" .
+.PP
+The index must be sized so that no bucket is full.
+When a bucket fills, the server must be shut down and
+the index made larger.
+Since scores appear random, each bucket will contain
+approximately the same number of entries.
+Index entries are 40 bytes long. Assuming that a typical block
+being written to the server is 8192 bytes and compresses to 4096
+bytes, the active index is expected to be about 1% of
+the active data log.
+Storing smaller blocks increases the relative index footprint;
+storing larger blocks decreases it.
+To allow variation in both block size and the random distribution
+of scores to buckets, the suggested index size is 5% of
+the active data log.
+.PP
+The (optional) bloom filter is a large bitmap that is stored on disk but
+also kept completely in memory while the venti server runs.
+It helps the venti server efficiently detect scores that are
+.I not
+already stored in the index.
+The bloom filter starts out zeroed.
+Each score recorded in the bloom filter is hashed to choose
+.I nhash
+bits to set in the bloom filter.
+A score is definitely not stored in the index of any of its
+.I nhash
+bits are not set.
+The bloom filter thus has two parameters:
+.I nhash
+(maximum 32)
+and the total bitmap size
+(maximum 512MB, 2\s-2\u32\d\s+2 bits).
+.PP
+The bloom filter should be sized so that
+.I nhash
+\(ti
+.I nblock
+\(ti
+0.7
+\(<=
+0.7 \(ti
+.IR b ,
+where
+.I nblock
+is the expected number of blocks stored on the server
+and
+.I b
+is the bitmap size in bits.
+The false positive rate of the bloom filter when sized
+this way is approximately 2\s-2\u\-\fInblock\fR\d\s+2.
+.I Nhash
+less than 10 are not very useful;
+.I nhash
+greater than 24 are probably a waste of memory.
+.I Fmtbloom
+(see
+.IR venti-fmt (8))
+can be given either
+.I nhash
+or
+.IR nblock ;
+if given
+.IR nblock ,
+it will derive an appropriate
+.IR nhash .
+.SS Memory
+Venti can make effective use of large amounts of memory
+for various caches.
+.PP
+The
+.I "lump cache
+holds recently-accessed venti data blocks, which the server refers to as
+.IR lumps .
+The lump cache should be at least 1MB but can profitably be much larger.
+The lump cache can be thought of as the level-1 cache:
+read requests handled by the lump cache can
+be served instantly.
+.PP
+The
+.I "block cache
+holds recently-accessed
+.I disk
+blocks from the arena partitions.
+The block cache needs to be able to simultaneously hold two blocks
+from each arena plus four blocks for the currently-filling arena.
+The block cache can be thought of as the level-2 cache:
+read requests handled by the block cache are slower than those
+handled by the lump cache, since the lump data must be extracted
+from the raw disk blocks and possibly decompressed, but no
+disk accesses are necessary.
+.PP
+The
+.I "index cache
+holds recently-accessed or prefetched
+index entries.
+The index cache needs to be able to hold index entries
+for three or four arenas, at least, in order for prefetching
+to work properly. Each index entry is 50 bytes.
+Assuming 500MB arenas of
+128,000 blocks that are 4096 bytes each after compression,
+the minimum index cache size is about 6MB.
+The index cache can be thought of as the level-3 cache:
+read requests handled by the index cache must still go
+to disk to fetch the arena blocks, but the costly random
+access to the index is avoided.
+.PP
+The size of the index cache determines how long venti
+can sustain its `burst' write throughput, during which time
+the only disk accesses on the critical path
+are sequential writes to the arena partitions.
+For example, if you want to be able to sustain 10MB/s
+for an hour, you need enough index cache to hold entries
+for 36GB of blocks. Assuming 8192-byte blocks,
+you need room for almost five million index entries.
+Since index entries are 50 bytes each, you need 250MB
+of index cache.
+If the background index update process can make a single
+pass through the index in an hour, which is possible,
+then you can sustain the 10MB/s indefinitely (at least until
+the arenas are all filled).
+.PP
+The
+.I "bloom filter
+requires memory equal to its size on disk,
+as discussed above.
+.PP
+A reasonable starting allocation is to
+divide memory equally (in thirds) between
+the bloom filter, the index cache, and the lump and block caches;
+the third of memory allocated to the lump and block caches
+should be split unevenly, with more (say, two thirds)
+going to the block cache.
+.SS Network
+The venti server announces two network services, one
+(conventionally TCP port
+.BR venti ,
+17034) serving
+the venti protocol as described in
+.IR venti (7),
+and one serving HTTP
+(conventionally TCP port
+.BR venti ,
+80).
+.PP
+The venti web server provides the following
+URLs for accessing status information:
+.TP
+.B /index
+A summary of the usage of the arenas and index sections.
+.TP
+.B /xindex
+An XML version of
+.BR /index .
+.TP
+.B /storage
+Brief storage totals.
+.TP
+.BI /set/ variable
+The current integer value of
+.IR variable .
+Variables are:
+.BR compress ,
+whether or not to compress blocks
+(for debugging);
+.BR logging ,
+whether to write entries to the debugging logs;
+.BR stats ,
+whether to collect run-time statistics;
+.BR icachesleeptime ,
+the time in milliseconds between successive updates
+of megabytes of the index cache;
+.BR arenasumsleeptime ,
+the time in milliseconds between reads while
+checksumming an arena in the background.
+The two sleep times should be (but are not) managed by venti;
+they exist to provide more experience with their effects.
+The other variables exist only for debugging and
+performance measurement.
+.TP
+.BI /set/ variable / value
+Set
+.I variable
+to
+.IR value .
+.TP
+.BI /graph/ name / param / param / \fR...
+A PNG image graphing the named run-time statistic over time.
+The details of names and parameters are undocumented;
+see
+.B httpd.c
+in the venti sources.
+.TP
+.B /log
+A list of all debugging logs present in the server's memory.
+.TP
+.BI /log/ name
+The contents of the debugging log with the given
+.IR name .
+.TP
+.B /flushicache
+Force venti to begin flushing the index cache to disk.
+The request response will not be sent until the flush
+has completed.
+.TP
+.B /flushdcache
+Force venti to begin flushing the arena block cache to disk.
+The request response will not be sent until the flush
+has completed.
+.PD
+.PP
+Requests for other files are served by consulting a
+directory named in the configuration file
+(see
+.B webroot
+below).
+.SS Configuration File
+A venti configuration file
+enumerates the various index sections and
+arenas that constitute a venti system.
+The components are indicated by the name of the file, typically
+a disk partition, in which they reside. The configuration
+file is the only location that file names are used. Internally,
+venti uses the names assigned when the components were formatted
+with
+.I fmtarenas
+or
+.I fmtisect
+(see
+.IR venti-fmt (8)).
+In particular, only the configuration needs to be
+changed if a component is moved to a different file.
+.PP
+The configuration file consists of lines in the form described below.
+Lines starting with
+.B #
+are comments.
+.TP
+.BI index " name
+Names the index for the system.
+.TP
+.BI arenas " file
+.I File
+is an arena partition, formatted using
+.IR fmtarenas .
+.TP
+.BI isect " file
+.I File
+is an index section, formatted using
+.IR fmtisect .
+.PP
+After formatting a venti system using
+.IR fmtindex ,
+the order of arenas and index sections should not be changed.
+Additional arenas can be appended to the configuration;
+run
+.I fmtindex
+with the
+.B -a
+flag to update the index.
+.PP
+The configuration file also holds configuration parameters
+for the venti server itself.
+These are:
+.TF httpaddr netaddr
+.TP
+.BI mem " size
+lump cache size
+.TP
+.BI bcmem " size
+block cache size
+.TP
+.BI icmem " size
+index cache size
+.TP
+.BI addr " netaddr
+network address to announce venti service
+(default
+.BR tcp!*!venti )
+.TP
+.BI httpaddr " netaddr
+network address to announce HTTP service
+(default
+.BR tcp!*!http )
+.TP
+.B queuewrites
+queue writes in memory
+(default is not to queue)
+.PD
+See the server description in
+.IR venti (8)
+for explanations of these variables.
+.SH EXAMPLE
+.IP
+.EX
+index main
+isect /tmp/disks/isect0
+isect /tmp/disks/isect1
+arenas /tmp/disks/arenas
+mem 10M
+bcmem 20M
+icmem 30M
+.EE
+.SH "SEE ALSO"
+.IR venti (8),
+.IR venti-fmt (8)
+.SH BUGS
+Setting up a venti server is too complicated.
+.PP
+Venti should not require the user to decide how to
+partition its memory usage.