-
Notifications
You must be signed in to change notification settings - Fork 7
/
draft-commonsmachinery-urn-blockhash-00.txt
560 lines (360 loc) · 21 KB
/
draft-commonsmachinery-urn-blockhash-00.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
Network Working Group P. Liljenberg
Internet-Draft Commons Machinery
Intended status: Experimental December 2015
Expires: June 3, 2016
The blockhash perceptual image URN scheme
draft-commonsmachinery-urn-blockhash-00
Abstract
This document describes a URN namespace for identifying images by
their likeness, encoded as a perceptual hash based on the image
content. The naming scheme relies on an algorithm called blockhash.
The hash is easy to compute but still has a high probability of
positively identifying an image after it has been shared through
various Internet services, which typically strip other identifying
metadata.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 3, 2016.
Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
Liljenberg Expires June 3, 2016 [Page 1]
Internet-Draft Blockhash URN December 2015
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
1. Conventions used in this document
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
in this document are to be interpreted as defined in "Key words for
use in RFCs to Indicate Requirement Levels" [RFC2119].
2. Introduction
Most URN namespaces are intended to identify a unique resource
[RFC1630], e.g. by a unique ID that is assigned by a registry, or
from a cryptographic hash derived from the resource itself.
The blockhash URN namespace specified in this document instead
identifies a unique likeness of a resource, specifically of an image.
It is intended to be used in situations where an exact match may not
be possible, e.g. when comparing two instances of an image that might
have been resized or recoded in different file formats.
This is achieved by using a perceptual hash based on the blockhash
algorithm [BLOCKHASH]. A small change in an image also results in a
correspondingly small change in the blockhash. This is in contrast
to a cryptographic hash where even small changes to a file result in
unpredictable and often large changes in the hash.
This allows not only the exact comparison of two blockhashes, but
also makes it possible to determine how similar two images are. The
similary is measured as the hamming distance, i.e. the number of bits
in the two hashes that are different. Tools using blockhashes can
then choose a maximum hamming distance that is relevant for the
particular use case and the precision needed. Efficient algorithms
exist to query a set of hashes to find the ones within a given
distance from a particular hash, e.g. HmSearch [HMSEARCH].
The primary use case for blockhashes is to implement services that
can perform a reverse image lookup, i.e. based on the image itself
find out information about it or find similar images.
This namespace specification is for a formal namespace. The
specification adheres to the guidelines given in "Uniform Resource
Names (URN) Namespace Definition Mechanisms" [RFC3406].
Liljenberg Expires June 3, 2016 [Page 2]
Internet-Draft Blockhash URN December 2015
3. Specification Template
3.1. Namespace ID
"blockhash" requested.
3.2. Registration information
Registration version number: 1
Registration date: 201?-??-??
3.3. Declared registration of the namespace
Registering organisation:
Commons Machinery in Sweden AB Aesgatan 5 646 32 Gnesta Sweden
http://commonsmachinery.se/
Designated Contact Person:
Jonas Oeberg jonas@commonsmachinery.se
3.4. Declaration of syntactic structure
The Namespace Specific Strings (NSS) of all URNs assigned by the
schema described in this document will conform to the syntax defined
in section 2.2 of [RFC2141]. The formal syntax of the NSS is defined
by the following normative ABNF [RFC5234] rules for "blockhash-nss":
blockhash-nss = hexhash
hexhash = 1*hex
hex = %x30-39 / %x41-46 / %x61-66 ; 0-9, A-F, a-f
The following are comments and restrictions not captured by the above
grammar.
Though both upper-case and lowercase characters are allowed in the
hexadecimal value, lowercase characters SHOULD be used.
The number of bits encoded in "hexhash" MUST be M, where, M = N * N
and N is a multiple of 4. It is not expected that hash lengths other
than 64, 144 and 256 bits will be useful. Blockhashes SHOULD be 256
bits long unless there is a specific need for a different size.
Liljenberg Expires June 3, 2016 [Page 3]
Internet-Draft Blockhash URN December 2015
3.5. Relevant ancillary documentation
None as yet.
3.6. Identifier uniqueness considerations
A blockhash URN identifies the unique likeness of an image, rather
than a specific image file. Within the blockhash scheme, all images
that produce the same blockhash are considered identicial.
3.7. Identifier persistence considerations
The algorithm that calculate a blockhash from an image also establish
a persistent identifier. The process can later be repeated on the to
reproduce the identifier at any point in the lifetime of the image.
3.8. Process of identifier assignment
A blockhash identifier is calculated from a digital image, and only
depend on access to the pixel representation of the image. It can be
read from a local or remote file, from an in-memory pixmap or canvas,
or another suitable source.
A blockhash SHOULD be calculated following these steps:
1. Convert the image to RGB format with an optional alpha channel
(if not already done). The standard conversion method of the
file format SHOULD be used, avoiding any file- or system-specific
colour profiles. Any orientation tag in the image metadata
SHOULD be applied.
2. Construct a grid of N by N blocks covering the image. N MUST be
a multiple of 4.
3. Initialise each block to 0.
4. For each pixel:
A. Decide which blocks the pixel overlaps. It can be one, two
or four blocks, depending on the pixel coordinate.
B. If the image has an alpha channel and the alpha for the pixel
is 0 (indicating transparency), set the RGB components to
their maximum values, otherwise use them as they are.
C. Add the sum of the RGB components from the previous step to
each of the blocks the pixel covers, in proportion to the
area of the pixel that is in each block.
Liljenberg Expires June 3, 2016 [Page 4]
Internet-Draft Blockhash URN December 2015
5. Divide the block grid into four horizontal groups, each
containing N columns and N/4 rows. For each group, determine the
median value of all the blocks contained in the group.
6. Let H be the theoretically maximum cumulative value of all pixels
in a block divided by two (half the value space).
7. For each block, starting with the upper left and progressing row
by row until the lower right, add a bit to the result hash:
A. If the block value is higher than the median value of the
horizontal group the block belongs to, add a 1.
B. If the block value and median have the same value, and the
median is above above H, add a 1. Note that imprecisions
from floating point calculations or other implementation
details must be taken into consideration when determining if
the block value and the median are the same value.
C. Otherwise, add a 0.
8. Encode the hash in hexadecimal, interpreting the sequence of bits
as eight-bit bytes.
9. Construct a blockhash URN according to the grammar above using
the hexadecimal representation of the hash.
Implementations MAY use a different process which yield similar
results, but SHOULD make users aware if the modified process can
introduce differences to the standard algorithm which may result in
false negatives. One possible alternative to the algorithm described
above is to resize the image to ensure that each block holds an even
number of pixels to avoid the calculations in 4a and 4c.
3.9. Process of identifier resolution
Not specified.
3.10. Rules for lexical equivalence
Lexical equivalence of two blockhash identifiers are determined by
first converting both hexadecimal strings to lower-case or upper-
case, and then comparing the resulting strings.
Alternatively, the hexadecimal strings can be converted to byte or
bit sequences, which are then compared directly.
Liljenberg Expires June 3, 2016 [Page 5]
Internet-Draft Blockhash URN December 2015
3.11. Conformance with URN syntax
No special considerations.
3.12. Validation mechanism
None specified.
3.13. Scope
Global.
4. IANA Considerations
This document includes a URN namespace registration that is to be
entered into the IANA registry for URN NIDs.
5. Namespace Considerations
A new URN namespaces is needed as there are no existing namespaces
for representing perceptual hashes as URNs in general, and none
specifically for the blockhash algorithm.
[I-D.thiemann-hash-urn] proposes a namespace for cryptographic
hashes, but it is intentionally limited to only allow the hash
algorithms specified in the internet draft.
In a similar vein this namespace is limited to blockhash itself, to
avoid having to establish a secondary level of namespace assignments
for different perceptual hash algorithms.
The blockhash algorithm was choosen to allow even relatively
restricted environments (e.g. javascript code in a web browser) to
produce an image hash, that is still sufficiently precise for
identifying images as they are shared across the Internet, even after
resizes, recodings and smaller edits.
More complex perceptual hashes exist that can track more complex
changes, e.g. large crops or rotations, but they also put higher
requirements on the implementation. Such larger changes would often
mean that the edit is no longer the same image but a derivate work.
Blockhash is intended to only match images that can be considered the
same work, so not matching derivatives is actually a benefit.
Representing the hash as a hexadecimal string rather than as a more
compact base32 is choosen to align the format more closely with the
underlying bit structure. This allow code to determine a rough
hamming distance between two blockhashes just by comparing the
Liljenberg Expires June 3, 2016 [Page 6]
Internet-Draft Blockhash URN December 2015
hexadecimal strings, rather than first having to convert them into
binary sequences.
The restriction that the grid width and height must be the same allow
some scope to implementations to choose a reasonable hash size while
ensuring that the hash size can always be reconstructed from the hash
length. Requiring that the grid width and height must be a multiple
of 4 ensures that the hash size is a multiple of eight, and thus will
fill up a sequence of bytes exactly.
The original blockhash algorithm [BLOCKHASH] calculated a single
median value for the entire block grid. During testing it was found
that this often resulted collisions in images with a lot of contrast
between different parts of the image. An example is outdoor images
of landscapes with a bright sky and a darker ground. An image-wide
median results in the bright areas mostly being represented by 0 bits
and the darker with 1 bits, losing a lot of detail that could help
discern different images with similar contrast structure. By
calculating a median value for each of four horizontal groups and
comparing the block values to the median in the corresponding group
help retain such contrast and reduce the collision frequency.
6. Community Considerations
Internet services today make it very simple to share an image, but
most loses the context of the original image in the process. The
context is where and when the image was created or a photo taken, by
whom, and what the conditions are for properly sharing the image. It
is needed both to avoid copyright infringements, and to give credit
to the original creator.
There already exist several reverse image search services, but they
are generally proprietary, closed implementations, and require users
to upload the image to the service, potentially violating the user's
privacy.
Having a standardised way to refer to a perceptual hash of an image
will provide new capabilities directly to Internet users to find out
the context of images, even when they have passed through several
steps of sharing.
The blockhash on its own is not sufficient, but it will enable new
services that can index large collections of images. Client-side
tools can then calculate a blockhash and query these indices to find
more information about the image, preserving the user privacy.
Liljenberg Expires June 3, 2016 [Page 7]
Internet-Draft Blockhash URN December 2015
The scope of this document does not cover processes for finding and
querying such indices, but enable such future work to build on
blockhash URNs.
7. Security Considerations
Blockhashes are generally only useful when they relate to images that
are already public and being shared. As such, producing a blockhash
of the image and sharing that will not change the security or privacy
situation significantly.
For all images that are a few times larger than the grid size, it
will be practically impossible to reconstruct the original image
based on its blockhash. Thus sharing a blockhash of a private image
does not infringe on the user's privacy directly. However, if the
image is later shared, the previously shared blockhash can be used to
infer that the user had access of the image at the earlier time.
Images that are not much bigger than the grid size could be
reproduced in sufficient detail to be identified from a blockhash.
Such images are usually icons and other non-sensitive content, but
implementation can also mitigate against this by requiring the user
to confirm that such a small image should be hashed.
8. References
8.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC3406] Daigle, L., van Gulik, D., Iannella, R., and P. Faltstrom,
"Uniform Resource Names (URN) Namespace Definition
Mechanisms", BCP 66, RFC 3406, October 2002.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234, January 2008.
8.2. Informative References
[RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
Unifying Syntax for the Expression of Names and Addresses
of Objects on the Network as used in the World-Wide Web",
RFC 1630, June 1994.
Liljenberg Expires June 3, 2016 [Page 8]
Internet-Draft Blockhash URN December 2015
[BLOCKHASH]
Yang, B., Gu, F., and X. Niu, "Block Mean Value Based
Image Perceptual Hashing", December 2006.
[HMSEARCH]
Zhang, X., Qin, J., Wang, W., Sun, Y., and J. Lu,
"HmSearch: an efficient hamming distance query processing
algorithm", July 2013.
[I-D.thiemann-hash-urn]
Thiemann, P., "A URN Namespace For Identifiers Based on
Cryptographic Hashes", draft-thiemann-hash-urn-01 (work in
progress), September 2003.
Appendix A. Experimental results
On a sample of 100,000 random images from Wikimedia Commons, the
algorithm generated colliding matches for 1,036 images, meaning that
it has a collision rate of ca 1%. The vast majority of such
collisions are where between two to four images generate the same
hash; only rarely does a hash match more than four images.
+--------------------------------+-------------+
| Number of images in collision | Collisions |
+--------------------------------+-------------+
| 2 | 247 |
| | |
| 3 | 51 |
| | |
| 4 | 19 |
| | |
| 5 | 8 |
| | |
| 6 | 9 |
| | |
| 7 | 3 |
| | |
| 8 or more | 16 |
+--------------------------------+-------------+
Table 1: Blockhash collisions rate
Taking 4,000 unique sample images (not counting collisions) and doing
a crosswise comparison of them, calculating the hamming distance
between one hash and every other hash, we get the following
distribution of hamming distances (up to 10 bits):
Liljenberg Expires June 3, 2016 [Page 9]
Internet-Draft Blockhash URN December 2015
+-------------------+----------------+--------------------+
| Hamming distance | Cross matches | Cross matches (%) |
+-------------------+----------------+--------------------+
| 1 | 0 | 0% |
| | | |
| 2 | 2 | 0,05% |
| | | |
| 3 | 2 | 0,05% |
| | | |
| 4 | 7 | 0,175% |
| | | |
| 5 | 7 | 0,175% |
| | | |
| 6 | 25 | 0,625% |
| | | |
| 7 | 30 | 0,75% |
| | | |
| 8 | 47 | 1,175% |
| | | |
| 9 | 55 | 1,375% |
| | | |
| 10 | 73 | 1,825% |
+-------------------+----------------+--------------------+
Table 2: False positive match rate
Author's Address
Peter Liljenberg
Commons Machinery
Aesgatan 5
Gnesta 64632
SE
Email: dev@commonsmachinery.se
URI: http://commonsmachinery.se/
Liljenberg Expires June 3, 2016 [Page 10]