.TH VENTI 7
.SH NAME
venti \- archival storage server
.SH DESCRIPTION
Venti is a block storage server intended for archival data.
In a Venti server, the SHA1 hash of a block's contents acts
as the block identifier for read and write operations.
This approach enforces a write-once policy, preventing
accidental or malicious destruction of data. In addition,
duplicate copies of a block are coalesced, reducing the
consumption of storage and simplifying the implementation
of clients.
.PP
This manual page documents the basic concepts of
block storage using Venti as well as the Venti network protocol.
.PP
.IR Venti (1)
documents some simple clients.
.IR Vac (1),
.IR vacfs (4),
and
.IR vbackup (8)
are more complex clients.
.PP
.IR Venti (3)
describes a C library interface for accessing
Venti servers and manipulating Venti data structures.
.PP
.IR Venti (8)
describes the programs used to run a Venti server.
.PP
.SS "Scores
The SHA1 hash that identifies a block is called its
.IR score .
The score of the zero-length block is called the
.IR "zero score" .
.PP
Scores may have an optional
.IB label :
prefix, typically used to
describe the format of the data.
For example,
.IR vac (1)
uses a
.B vac:
prefix, while
.IR vbackup (8)
uses prefixes corresponding to the file system
types:
.BR ext2: ,
.BR ffs: ,
and so on.
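.PP
As a rough illustration of the two ideas above, the following Python sketch
computes a score with the standard
.B hashlib
module and splits an optional label prefix from a printed score.
The helper names
.B score
and
.B split_label
are hypothetical and are not part of any Venti library.
.PP
.nf
```python
import hashlib

def score(block: bytes) -> str:
    """Return the score of a block: the SHA1 hash of its contents, in hex."""
    return hashlib.sha1(block).hexdigest()

def split_label(text: str):
    """Split an optional 'label:' prefix from a printed score.

    Returns (label, score); label is None when there is no prefix.
    """
    label, sep, rest = text.rpartition(":")
    return (label if sep else None), rest

# The zero score is the score of the zero-length block.
ZERO_SCORE = score(b"")
```
.fi
.PP
For instance, splitting the string formed by the label vac: followed by a
score yields the label vac and the bare score.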
.SS "Files and Directories
Venti accepts blocks up to 56 kilobytes in size.
By convention, Venti clients use hash trees of blocks to
represent arbitrary-size data
.IR files .
The data to be stored is split into fixed-size
blocks and written to the server, producing a list
of scores.
The resulting list of scores is split into fixed-size pointer
blocks (using only an integral number of scores per block)
and written to the server, producing a smaller list
of scores.
The process continues, eventually ending with the
score for the hash tree's top-most block.
Each file stored this way is summarized by
a
.B VtEntry
structure recording the top-most score, the depth
of the tree, the data block size, and the pointer block size.
One or more
.B VtEntry
structures can be concatenated
and stored as a special file called a
.IR directory .
In this
manner, arbitrary trees of files can be constructed
and stored.
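.PP
The construction of a hash tree can be sketched as follows.
This is a simplified model under stated assumptions, not the code used by any
real client: a Python dictionary stands in for the server, no zero truncation
is applied, and the returned depth simply counts the pointer levels above the
data blocks.
The names
.B write_block
and
.B write_file
are hypothetical.
.PP
.nf
```python
import hashlib

SCORE_SIZE = 20  # bytes in a SHA1 score

def write_block(store: dict, block: bytes) -> bytes:
    """Store a block under its SHA1 score and return the score."""
    s = hashlib.sha1(block).digest()
    store[s] = block  # identical blocks coalesce under one key
    return s

def write_file(store: dict, data: bytes, dsize: int, psize: int):
    """Write data as a hash tree; return (top score, depth)."""
    per_block = psize // SCORE_SIZE  # integral number of scores per block
    if dsize <= 0 or per_block < 2:
        raise ValueError("block sizes too small")
    # Split the data into fixed-size blocks; an empty file is one empty block.
    chunks = [data[i:i + dsize] for i in range(0, len(data), dsize)] or [b""]
    scores = [write_block(store, c) for c in chunks]
    depth = 0
    # Pack scores into pointer blocks until a single score remains.
    while len(scores) > 1:
        groups = [scores[i:i + per_block]
                  for i in range(0, len(scores), per_block)]
        scores = [write_block(store, b"".join(g)) for g in groups]
        depth += 1
    return scores[0], depth
```
.fi
.PP
A file that fits in a single data block has depth 0 in this model, and its
top-most score is simply the score of that block.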
.PP
Scores passed between programs conventionally refer
to
.B VtRoot
blocks, which contain descriptive information
as well as the score of a directory block containing a small number
of directory entries.
.PP
Conventionally, programs do not mix data and directory entries
in the same file. Instead, they keep two separate files, one with
directory entries and one with metadata referencing those
entries by position.
Keeping this parallel representation is a minor annoyance
but makes it possible for general programs like
.I venti/copy
(see
.IR venti (1))
to traverse the block tree without knowing the specific details
of any particular program's data.
.SS "Block Types
To allow programs to traverse these structures without
needing to understand their higher-level meanings,
Venti tags each block with a type. The types are:
.PP
.nf
.ft L
VtDataType     000  \fRdata\fL
VtDataType+1   001  \fRscores of \fLVtDataType\fR blocks\fL
VtDataType+2   002  \fRscores of \fLVtDataType+1\fR blocks\fL
\fR\&...\fL
VtDirType      010  VtEntry\fR structures\fL
VtDirType+1    011  \fRscores of \fLVtDirType\fR blocks\fL
VtDirType+2    012  \fRscores of \fLVtDirType+1\fR blocks\fL
\fR\&...\fL
VtRootType     020  VtRoot\fR structure\fL
.fi
.PP
The octal numbers listed are the type numbers used
by the commands below.
(For historical reasons, the type numbers used on
disk and on the wire are different from the above.
They do not distinguish
.BI VtDataType+ n
blocks from
.BI VtDirType+ n
blocks.)
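.PP
To make the numbering in the table concrete, the following Python sketch
maps one of the octal type numbers to its description.
It assumes that the low three bits of the number give the height of a block
above the leaf blocks, which matches the rows shown but is an inference from
the table.
The function name
.B describe_type
is hypothetical, and the sketch does not cover the different numbers used on
disk and on the wire.
.PP
.nf
```python
VT_DATA_TYPE = 0o00
VT_DIR_TYPE = 0o10
VT_ROOT_TYPE = 0o20

def describe_type(t: int) -> str:
    """Describe a block type number as listed in the table above."""
    if not 0 <= t <= VT_ROOT_TYPE:
        raise ValueError("unknown block type")
    if t == VT_ROOT_TYPE:
        return "VtRoot structure"
    # Assumption: the low three bits give the height above the leaf blocks.
    base, depth = t & 0o70, t & 0o07
    name, leaf = (("VtDataType", "data") if base == VT_DATA_TYPE
                  else ("VtDirType", "VtEntry structures"))
    if depth == 0:
        return leaf
    below = name if depth == 1 else "%s+%d" % (name, depth - 1)
    return "scores of %s blocks" % below
```
.fi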
.SS "Zero Truncation
To avoid storing the same short data blocks padded with
differing numbers of zeros, Venti clients working with fixed-size
blocks conventionally
`zero truncate' the blocks before writing them to the server.
For example, if a 1024-byte data block contains the
11-byte string
.RB ` hello " " world '
followed by 1013 zero bytes,
a client would store only the 11-byte block.
When the client later reads the block back from the server,
it appends zero bytes to the end as necessary to
reach the expected size.
.PP
When truncating pointer blocks
.RB ( VtDataType+ \fIn
and
.BI VtDirType+ n
blocks),
trailing zero scores are removed
instead of trailing zero bytes.
.PP
Because of the truncation convention,
any file consisting entirely of zero bytes,
no matter what its length, will be represented by the zero score:
the data blocks contain all zeros and are thus truncated
to the empty block, and the pointer blocks contain all zero scores
and are thus also truncated to the empty block,
and so on up the hash tree.
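.PP
The truncation convention can be sketched in Python as follows.
The function names are hypothetical, and the sketch takes a zero score to be
the 20-byte SHA1 hash of the empty block, as defined in the Scores section.
.PP
.nf
```python
import hashlib

SCORE_SIZE = 20
ZERO_SCORE = hashlib.sha1(b"").digest()  # score of the zero-length block

def truncate_data(block: bytes) -> bytes:
    """Drop trailing zero bytes from a data block."""
    return block.rstrip(bytes(1))

def truncate_pointers(block: bytes) -> bytes:
    """Drop trailing zero scores from a pointer block."""
    if len(block) % SCORE_SIZE:
        raise ValueError("pointer block is not a whole number of scores")
    n = len(block)
    while n > 0 and block[n - SCORE_SIZE:n] == ZERO_SCORE:
        n -= SCORE_SIZE
    return block[:n]

def pad_data(block: bytes, size: int) -> bytes:
    """Restore a truncated data block to its expected size."""
    return block.ljust(size, bytes(1))
```
.fi
.PP
Applied to the example above, truncating the 1024-byte block leaves the
11-byte string, and padding that string to 1024 bytes restores the original
block.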
.SS "Network Protocol
A Venti session begins when a
.I client
connects to the network address served by a Venti
.IR server ;
the conventional address is
.BI tcp! server !venti
(the
.B venti
port is 17034).
Both client and server begin by sending a version
string of the form
.BI venti- versions - comment \en \fR.
The
.I versions
field is a list of acceptable versions separated by
colons.
The protocol described here is version
.BR 02 .
The client is responsible for choosing a common
version and sending it in the
.B VtThello
message, described below.
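.PP
One way a client might choose a common version is sketched below in Python.
The text above does not say how to choose when several versions are shared,
so preferring the first shared entry in the client's own list is an
assumption, and the names
.B parse_versions
and
.B choose_version
are hypothetical.
.PP
.nf
```python
def parse_versions(line: str) -> list:
    """Return the version list from a 'venti-versions-comment' line."""
    # Split on the first two hyphens only, so the comment may contain hyphens.
    prefix, versions, _comment = line.strip().split("-", 2)
    if prefix != "venti":
        raise ValueError("not a venti version string")
    return versions.split(":")

def choose_version(ours: str, theirs: str) -> str:
    """Pick the first of our versions that the peer also accepts."""
    offered = parse_versions(theirs)
    for v in parse_versions(ours):
        if v in offered:
            return v
    raise ValueError("no common version")
```
.fi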
.PP
After the initial version exchange, the client transmits
.I requests
.RI ( T-messages )
to the server, which subsequently returns
.I replies
.RI ( R-messages )
to the client.
The combined act of transmitting (receiving) a request
of a particular type, and receiving (transmitting) its reply
is called a
.I transaction
of that type.
.PP
Each message consists of a sequence of bytes.
Two-byte fields hold unsigned integers represented
in big-endian order (most significant byte first).
Data items of variable lengths are represented by
a one-byte field specifying a count,
.IR n ,
followed by
.I n
bytes of data.
Text strings are represented similarly,
using a two-byte count with
the text itself stored as a UTF-encoded sequence
of Unicode characters (see
.IR utf (7)).
Text strings are not
.SM NUL\c
-terminated:
.I n
counts the bytes of UTF data, which include no final
zero byte.
The
.SM NUL
character is illegal in text strings in the Venti protocol.
The maximum string length in Venti is 1024 bytes.
.PP
Each Venti message begins with a two-byte size field
specifying the length in bytes of the message,
not including the length field itself.
The next byte is the message type, one of the constants
in the enumeration in the include file
.BR <venti.h> .
The next byte is an identifying
.IR tag ,
used to match responses to requests.
The remaining bytes are parameters of different sizes.
In the message descriptions, the number of bytes in a field
is given in brackets after the field name.
The notation
.IR parameter [ n ]
where
.I n
is not a constant represents a variable-length parameter:
.IR n [1]
followed by
.I n
bytes of data forming the
.IR parameter .
The notation
.IR string [ s ]
(using a literal
.I s
character)
is shorthand for
.IR s [2]
followed by
.I s
bytes of UTF-8 text.
The notation
.IR parameter []
where
.I parameter
is the last field in the message represents a
variable-length field that comprises all remaining
bytes in the message.
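.PP
The encoding rules above can be sketched in Python with the standard
.B struct
module.
The function names are hypothetical.
The numeric message types are not listed in this page, so the sketch takes the
type as a parameter; real values come from the enumeration in
.BR <venti.h> .
.PP
.nf
```python
import struct

def pack_string(text: str) -> bytes:
    """Encode string[s]: a two-byte count followed by UTF-8 text."""
    raw = text.encode("utf-8")
    if len(raw) > 1024 or 0 in raw:
        raise ValueError("string too long or contains NUL")
    return struct.pack(">H", len(raw)) + raw

def pack_var(data: bytes) -> bytes:
    """Encode parameter[n]: a one-byte count followed by n bytes."""
    if len(data) > 255:
        raise ValueError("parameter too long for a one-byte count")
    return bytes([len(data)]) + data

def pack_message(mtype: int, tag: int, params: bytes = b"") -> bytes:
    """Frame a message: size[2] type[1] tag[1] parameters."""
    body = bytes([mtype, tag]) + params
    # The size counts everything after the size field itself.
    return struct.pack(">H", len(body)) + body
```
.fi
.PP
A message with no parameters therefore occupies four bytes on the wire:
a size of 2, the type, and the tag.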
.PP
All Venti RPC messages are prefixed with a field
.IR size [2]
giving the length of the message that follows
(not including the
.I size
field itself).
The message bodies are:
.ta \w'\fLVtTgoodbye 'u
.IP
.ne 2v
.B VtThello
.IR tag [1]
.IR version [ s ]
.IR uid [ s ]
.IR strength [1]
.IR crypto [ n ]
.IR codec [ n ]
.br
.B VtRhello
.IR tag [1]
.IR sid [ s ]
.IR rcrypto [1]
.IR rcodec [1]
.IP
.ne 2v
.B VtTping
.IR tag [1]
.br
.B VtRping
.IR tag [1]
.IP
.ne 2v
.B VtTread
.IR tag [1]
.IR score [20]
.IR type [1]
.IR pad [1]
.IR count [2]
.br
.B VtRread
.IR tag [1]
.IR data []
.IP
.ne 2v
.B VtTwrite
.IR tag [1]
.IR type [1]
.IR pad [3]
.IR data []
.br
.B VtRwrite
.IR tag [1]
.IR score [20]
.IP
.ne 2v
.B VtTsync
.IR tag [1]
.br
.B VtRsync
.IR tag [1]
.IP
.ne 2v
.B VtRerror
.IR tag [1]
.IR error [ s ]
.IP
.ne 2v
.B VtTgoodbye
.IR tag [1]
.PP
Each T-message has a one-byte
.I tag
field, chosen and used by the client to identify the message.
The server will echo the request's
.I tag
field in the reply.
Clients should arrange that no two outstanding
messages have the same tag field so that responses
can be distinguished.
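.PP
A client could keep tags distinct with a small pool, sketched below in Python.
The class
.B TagPool
is hypothetical, and the sketch assumes that all 256 one-byte values are
usable as tags, which this page does not state.
.PP
.nf
```python
class TagPool:
    """Hand out one-byte tags so no two outstanding requests share one."""

    def __init__(self):
        self.free = list(range(256))  # assumption: every value is usable
        self.pending = {}             # tag -> request awaiting its reply

    def start(self, request):
        """Reserve a tag for a request about to be sent."""
        if not self.free:
            raise RuntimeError("all tags are outstanding")
        tag = self.free.pop(0)
        self.pending[tag] = request
        return tag

    def finish(self, tag):
        """Release the tag echoed in a reply and return its request."""
        request = self.pending.pop(tag)
        self.free.append(tag)
        return request
```
.fi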
.PP
The type of an R-message will either be one greater than
the type of the corresponding T-message or
.BR VtRerror ,
indicating that the request failed.
In the latter case, the
.I error
field contains a string describing the reason for failure.
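.PP
The reply rule can be checked as in the following Python sketch, which
operates on a reply whose
.I size
field has already been removed.
The function name
.B check_reply
is hypothetical, and the numeric value used for the error reply type is an
assumption; the authoritative constant is in
.BR <venti.h> .
.PP
.nf
```python
import struct

VT_RERROR = 1  # assumed value; take the real constant from <venti.h>

def check_reply(ttype: int, tag: int, reply: bytes) -> bytes:
    """Validate a reply (without its size field); return its parameters."""
    rtype, rtag = reply[0], reply[1]
    if rtag != tag:
        raise ValueError("reply tag does not match request")
    if rtype == VT_RERROR:
        # error[s]: a two-byte count followed by UTF-8 text
        (n,) = struct.unpack(">H", reply[2:4])
        raise RuntimeError(reply[4:4 + n].decode("utf-8"))
    if rtype != ttype + 1:
        raise ValueError("unexpected reply type")
    return reply[2:]
```
.fi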
.PP
Venti connections must begin with a
.B hello
transaction.
The
.B VtThello
message contains the protocol
.I version
that the client has chosen to use.
The fields
.IR strength ,
.IR crypto ,
and
.IR codec
could be used to add authentication, encryption,
and compression to the Venti session
but are currently ignored.
The
.I rcrypto
and
.I rcodec
fields in the
.B VtRhello
response are similarly ignored.
The
.IR uid
and
.IR sid
fields are intended to be the identity
of the client and server but, given the lack of
authentication, should be treated only as advisory.
The initial
.B hello
should be the only
.B hello
transaction during the session.
.PP
The
.B ping
message has no effect and
is used mainly for debugging.
Servers should respond immediately to pings.
.PP
The
.B read
message requests a block with the given
.I score
and
.IR type .
Use
.I vttodisktype
and
.I vtfromdisktype
(see
.IR venti (3))
to convert between a block type enumeration value
.RB ( VtDataType ,
etc.)
and the
.I type
used on disk and in the protocol.
The
.I count
field specifies the maximum expected size
of the block.
The
.I data
in the reply is the block's contents.
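.PP
The parameters of a read request, and a check a cautious client might apply
to the reply, can be sketched in Python as follows.
The function names are hypothetical, and verifying the returned data against
the requested score is an optional client-side check, not a step required by
the protocol as described here.
The
.B disk_type
argument is assumed to have been converted to the on-the-wire type already.
.PP
.nf
```python
import hashlib
import struct

def pack_tread_params(score: bytes, disk_type: int, count: int) -> bytes:
    """Encode score[20] type[1] pad[1] count[2] for a read request."""
    if len(score) != 20:
        raise ValueError("score must be 20 bytes")
    return score + struct.pack(">BBH", disk_type, 0, count)

def check_read(score: bytes, count: int, data: bytes) -> bytes:
    """Confirm that returned data fits the request and hashes to score."""
    if len(data) > count:
        raise ValueError("block larger than requested count")
    if hashlib.sha1(data).digest() != score:
        raise ValueError("data does not match requested score")
    return data
```
.fi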
.PP
The
.B write
message writes a new block of the given
.I type
with contents
.I data
to the server.
The response includes the
.I score
to use to read the block,
which should be the SHA1 hash of
.IR data .
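.PP
A write can be sketched the same way in Python.
The function names are hypothetical, and the
.B disk_type
argument is assumed to be the on-the-wire type.
Comparing the returned score with a locally computed SHA1 hash is a check a
client may choose to make.
.PP
.nf
```python
import hashlib

def pack_twrite_params(disk_type: int, data: bytes) -> bytes:
    """Encode type[1] pad[3] data[] for a write request."""
    return bytes([disk_type, 0, 0, 0]) + data

def check_rwrite(data: bytes, params: bytes) -> bytes:
    """Return the score from a write reply after checking it against data."""
    score = params[:20]
    if len(score) != 20:
        raise ValueError("reply too short to hold a score")
    if score != hashlib.sha1(data).digest():
        raise ValueError("score is not the SHA1 hash of the data")
    return score
```
.fi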
.PP
The Venti server may buffer written blocks in memory,
waiting until after responding to the
.B write
message before writing them to
permanent storage.
The server will delay the response to a
.B sync
message until after all blocks in earlier
.B write
messages have been written to permanent storage.
.PP
The
.B goodbye
message ends a session. There is no
.BR VtRgoodbye :
upon receiving the
.BR VtTgoodbye
message, the server terminates the connection.
.SH SEE ALSO
.IR venti (1),
.IR venti (3),
.IR venti (8)
.br
Sean Quinlan and Sean Dorward,
``Venti: a new approach to archival storage'',
.I "Usenix Conference on File and Storage Technologies" ,
2002.