Internal documentation for developers
-------------------------------------
Coding guidelines
Design decisions
Storage of binary data
Directory tree in database
Transaction Policies
Self-containment
StatFS statistics
Testing
References
Coding guidelines
-----------------
8-character TABs.
Standard C99, portable to all platforms supporting FUSE (Linux,
FreeBSD, OpenBSD, NetBSD, Darwin/MacOSX). No C++, normally no C99.
Some newer 64-bit things like 'endian.h', 'stdint.h' are allowed,
if there is no other option.
Do not introduce unnecessary 3rd party dependencies in addition
to the required 'libfuse' and 'libpq'.
Use the internal fuse_opts to parse command line options.
Use the native 'libpq', not abstractions. The database operations
are simple enough. If possible, avoid string manipulations, for
example for timestamps (we are on a low-level OS-abstraction layer,
so 'struct timespec' and epochs are fine).
Consistency of the data should be ensured in the database.
We don't want to implement an 'fsck' for pgfuse.
Logging is currently done only over syslog.
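As an illustration of the timestamp guideline above, here is a minimal
sketch of converting a 'struct timespec' to an integer without going
through strings. It assumes timestamps are exchanged as 64-bit
microsecond counts relative to 2000-01-01 UTC (PostgreSQL's integer
timestamp representation); the function names are hypothetical and not
taken from the pgfuse sources.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Seconds between the Unix epoch (1970-01-01) and 2000-01-01, both UTC. */
#define PG_EPOCH_OFFSET_SECS INT64_C(946684800)

/* Convert a 'struct timespec' to microseconds since 2000-01-01.
 * Precision below one microsecond is truncated. */
int64_t timespec_to_pg_usecs(struct timespec ts)
{
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
	return ((int64_t)ts.tv_sec - PG_EPOCH_OFFSET_SECS) * INT64_C(1000000)
		+ ts.tv_nsec / 1000;
}

/* Inverse conversion (hypothetical helper for reading values back). */
struct timespec pg_usecs_to_timespec(int64_t usecs)
{
	struct timespec ts;
	int64_t secs = usecs / 1000000;
	int64_t rem = usecs % 1000000;

	/* C division truncates toward zero; normalize negative remainders. */
	if (rem < 0) {
		rem += 1000000;
		secs -= 1;
	}
	ts.tv_sec = (time_t)(secs + PG_EPOCH_OFFSET_SECS);
	ts.tv_nsec = (long)(rem * 1000);
	return ts;
}
```

Only integer arithmetic is involved, so no locale or date-format
parsing can interfere.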
Design decisions
----------------
Storage of binary data
----------------------
Options:
One ByteA field
All data in a big bytea column: needs memory of the size of the
complete file on client (pgfuse) and server (PostgreSQL) side,
is ok for small files (was the first proof-of-concept implementation).
Multiple ByteA of equal size
As in Mysqlfs, simulate blocks as bytea fields of fixed size with
a block number. The block size has to be carefully tuned with the
filesystem, PostgreSQL and FUSE parameters.
Should give good average performance, the "One ByteA field" variant
for small files is still as efficient as before.
Blobs
They are streamable, but we lack some security (this was the
case before PostgreSQL 9.0, they were publicly readable) and we
lack referential integrity.
The functions to manipulate the blobs are not so nice.
It's also questionable whether they could be faster than a bytea.
Some unsorted thoughts:
Streams are mere abstractions and not really needed from the database
interface.
COPY FROM and COPY TO as a fast, non-transactional mode?
Pad blocks in data or not? Or all but the last one, allowing very
small files to be stored efficiently.
How to tune the block sizes? What factors influence the experiment?
At the moment we store padded blocks of fixed size (STANDARD_BLOCK_SIZE),
not really sure if that is good or bad.
The block size should be computed (small files have only one block,
all others have the block size of the first full block). This makes
us more independent of configuration or command line options, which
could otherwise mount wrong data. The block size should be available
when initializing the PgFuse filesystem.
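The fixed-size block scheme discussed above boils down to simple
integer arithmetic. The following is a minimal sketch, not the pgfuse
implementation: the type and function names are hypothetical, and the
block size is passed in as a parameter rather than taken from
STANDARD_BLOCK_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Position of a byte inside a file that is stored as fixed-size blocks. */
typedef struct {
	int64_t block_no;	/* index of the block containing the byte */
	size_t offset;		/* position of the byte inside that block */
} block_pos_t;

/* Map a file offset to a block number and an offset inside the block. */
block_pos_t block_position(int64_t file_offset, size_t block_size)
{
	block_pos_t pos;

	assert(file_offset >= 0 && block_size > 0);
	pos.block_no = file_offset / (int64_t)block_size;
	pos.offset = (size_t)(file_offset % (int64_t)block_size);
	return pos;
}

/* Number of blocks needed to hold 'file_size' bytes (rounded up). */
int64_t blocks_needed(int64_t file_size, size_t block_size)
{
	assert(file_size >= 0 && block_size > 0);
	return (file_size + (int64_t)block_size - 1) / (int64_t)block_size;
}
```

A read or write at an arbitrary offset would then touch the blocks
from 'block_position(offset)' up to 'block_position(offset + length - 1)'.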
Directory tree in database
--------------------------
Every inode has a 'parent_id', self-referencing. To avoid the NULL
reference there is a root element pointing to itself. Paths are
decomposed programmatically by descending the tree. This requires
no additional redundant storage, is easy to change in renames and
gives acceptable read performance.
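The descent described above can be sketched as follows. This is an
illustration only: it walks a small in-memory table, whereas a real
implementation would presumably issue one lookup per path component
against the database. The 'dir_entry_t' type and both function names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a self-referencing directory table. */
typedef struct {
	int id;
	int parent_id;		/* the root entry points to itself */
	const char *name;
} dir_entry_t;

/* Find the entry called 'name' (of length 'len') below 'parent_id'.
 * Returns its id, or -1 if there is no such entry. */
static int find_child(const dir_entry_t *tab, size_t n, int parent_id,
		      const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tab[i].id == tab[i].parent_id)
			continue;	/* the root is nobody's child */
		if (tab[i].parent_id == parent_id
		    && strlen(tab[i].name) == len
		    && strncmp(tab[i].name, name, len) == 0)
			return tab[i].id;
	}
	return -1;
}

/* Resolve an absolute path to an id by descending from the root.
 * Returns -1 if any component is missing. */
int resolve_path(const dir_entry_t *tab, size_t n, int root_id,
		 const char *path)
{
	int id = root_id;
	const char *p = path;

	assert(tab != NULL && path != NULL);
	while (*p != '\0') {
		size_t len;

		while (*p == '/')
			p++;		/* skip separators */
		len = strcspn(p, "/");
		if (len == 0)
			break;		/* trailing slash or end of path */
		id = find_child(tab, n, id, p, len);
		if (id < 0)
			return -1;
		p += len;
	}
	return id;
}
```

A rename only has to change 'parent_id' and 'name' of a single entry;
no stored full paths need to be rewritten.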
Transaction Policies
--------------------
Fundamental question: What file operations should form a database transaction?
One extreme: isolate threads and run all file operations in one transaction.
This is most likely an illusion, as we can't assume that FUSE threads
are assigned to a specific file (on the contrary!).
Other extreme: "autocommit" (every write, every read, etc.), which allows
for parallel usage. We trust in FUSE and the locking there. The database
should help us to serialize the operations.
Currently the second option is chosen.
Self-containment
----------------
React decently to loss of database connections. Try to reestablish
the connection, the loss of database connection could be temporary.
Try to reexecute the file system operation.
What should be reported back as permanent error state to FUSE after
a certain timeout when the database doesn't appear again?
EIO seems a good option (as if the disk had temporary I/O
problems).
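One way to structure such a retry is sketched below. This is a sketch
under stated assumptions, not the pgfuse implementation: the callback
types, the OP_CONN_LOST marker and the 'demo_*' stand-ins are all
hypothetical. In practice the callbacks would wrap the libpq-based
operation and the reconnect attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical marker returned by an operation that lost the connection. */
#define OP_CONN_LOST 1

typedef int (*fs_op_fn)(void *ctx);	/* 0, -errno or OP_CONN_LOST */
typedef int (*reconnect_fn)(void *ctx);	/* 0 on success */

/* Run 'op'; on connection loss try to reconnect and re-execute it.
 * Gives up with -EIO once 'timeout_secs' have passed. */
int run_with_reconnect(fs_op_fn op, reconnect_fn reconnect, void *ctx,
		       int timeout_secs)
{
	time_t deadline;

	assert(op != NULL && reconnect != NULL && timeout_secs >= 0);
	deadline = time(NULL) + timeout_secs;
	for (;;) {
		int res = op(ctx);

		if (res != OP_CONN_LOST)
			return res;	/* success or an ordinary error */
		while (reconnect(ctx) != 0) {
			if (time(NULL) >= deadline)
				return -EIO;
			sleep(1);	/* back off before the next attempt */
		}
		if (time(NULL) >= deadline)
			return -EIO;	/* connection keeps dropping */
	}
}

/* Demo stand-ins: an operation that fails a fixed number of times. */
typedef struct {
	int failures_left;
	int reconnects;
} demo_ctx_t;

int demo_op(void *p)
{
	demo_ctx_t *c = p;

	if (c->failures_left > 0) {
		c->failures_left--;
		return OP_CONN_LOST;
	}
	return 0;
}

int demo_reconnect(void *p)
{
	demo_ctx_t *c = p;

	c->reconnects++;
	return 0;
}

int demo_reconnect_fail(void *p)
{
	(void)p;
	return -1;
}
```

Calling 'run_with_reconnect(demo_op, demo_reconnect, &ctx, 5)' with
'ctx.failures_left = 2' re-executes the operation after each reconnect
and finally returns 0; with 'demo_reconnect_fail' and a timeout of 0
it returns -EIO.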
StatFS statistics
-----------------
Per se hard to measure, a database can span many machines, disks, etc.
The information about disk usage is intentionally hidden from the user!
Nevertheless we can get some usable data for some scenarios, mainly
if things run on a single machine with several partitions (or SANs).
A) What to measure?
Calculate the size of:
a) virtual things like blocks and inodes (that's the way we went)
b) get physical disk usage data from the database directly
B) Detect location of tables on disk
a) environment variable PGDATA
Is only really set for the postgresql startup script and for the
DBA service account (postgres); we should maybe not assume
wrong setups for normal users.
b) standard location (probed)
$PGDATA/base
show data_directory;
select setting from pg_settings where name = 'data_directory';
/var/lib/postgres/data
ERROR: must be superuser to examine "data_directory"
Can't do this, as the db user must be underprivileged:
ArchLinux: /var/lib/postgres
Centos/RHEL/Fedora: /var/lib/pgsql
Debian/Ubuntu: /var/lib/postgresql/9.1/main (/var/lib/postgresql should be fine)
This is a rough sketch of the algorithm:
1) Get a list of oids containing the tablespaces of PgFuse tables and indexes
select distinct reltablespace FROM pg_class WHERE relname in ( 'dir', 'data', 'data_dir_id_idx', 'data_block_no_idx', 'dir_parent_id_idx' );
[0,55877]
If there is a '0' in this list, replace it with the OID of the default tablespace:
select dattablespace from pg_database where datname=current_database();
[55025,55877]
2) Get tablespace locations (version dependent)
< 9.2
select spclocation from pg_tablespace where oid = 55025;
>= 9.2
select pg_tablespace_location('55025');
["/media/sd/test"]
or we get nothing, in this case, the tablespace resides in PGDATA.
We assume nobody makes symlinks there to point to other disks! So
we add $PGDATA to the list of directories
3) Resolve the list of paths containing the relevant tablespaces to
the list of entries in /etc/mtab (getmntent_r), make it unique,
then use 'statfs' to retrieve the data, possibly taking the
minimum if there are several.
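The matching part of step 3 can be sketched as a longest-prefix search
over the mount directories. This is only an illustration with
hypothetical function names: the list of mount directories is passed in
as an array here, whereas a real implementation would fill it from
getmntent_r and then call 'statfs' on each matched mount point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Non-zero if 'mnt' is a prefix of 'path' ending on a path component
 * boundary, so "/media/sd" matches "/media/sd/test" but not
 * "/media/sdcard". */
static int is_mount_prefix(const char *mnt, const char *path)
{
	size_t len = strlen(mnt);

	if (strncmp(mnt, path, len) != 0)
		return 0;
	if (len > 0 && mnt[len - 1] == '/')
		return 1;	/* e.g. "/" matches every absolute path */
	return path[len] == '\0' || path[len] == '/';
}

/* Return the index of the mount directory that contains 'path',
 * i.e. the longest matching prefix, or -1 if none matches. */
int find_mount_point(const char *const *mounts, size_t n, const char *path)
{
	int best = -1;
	size_t best_len = 0;
	size_t i;

	assert(mounts != NULL && path != NULL);
	for (i = 0; i < n; i++) {
		size_t len = strlen(mounts[i]);

		if (is_mount_prefix(mounts[i], path)
		    && (best < 0 || len > best_len)) {
			best = (int)i;
			best_len = len;
		}
	}
	return best;
}
```

Several tablespace paths may resolve to the same index, which gives the
"make it unique" step for free if the indexes are collected in a set.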
Testing
-------
The makefile contains some basic functionality tests (mostly using
commands of the shell).
bonnie is a good stress and performance tester. Don't despair because
of poor performance, that's normal. :-)
Another option is to have many shells open and do some things in
parallel, like:
while true; do
mkdir mnt/dir
rmdir mnt/dir
done
We should actually write a filesystem operation simulator doing all
kinds of random operations with certain probabilities.
http://www.iozone.org/ is another option.
So is one of the tools listed at http://ltp.sourceforge.net/tooltable.php.
Not tried yet though.
References
----------
Good FUSE tutorials at:
http://www.cs.hmc.edu/~geoff/classes/hmc.cs135.201109/homework/fuse/fuse_doc.html
http://www.cs.nmsu.edu/~pfeiffer/fuse-tutorial/