Last week's vtools genesis received a warm welcome, and I appreciate all the testing people did. And with testing came bugs. One of the compilation warnings was quietly resolved in a regrind, before anyone but mod6 had a chance to look. Meanwhile hanbot found another two issues.
One is related to compilation: a C11 feature, the _Noreturn attribute on one of the functions, prevented vtools from compiling on older versions of GCC[1]. Even though I specified in the makefile that the source code is C99 conformant, I was missing the -pedantic flag[2], which would've warned me that _Noreturn is not available in C99. The fix was trivial: use the GCC-specific __attribute__((noreturn)) instead[3].
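As a rough illustration of the kind of guard that avoids this class of breakage, here is a minimal sketch. The macro VT_NORETURN and the helpers die and checked_len are hypothetical names made up for this example; they are not taken from the vtools source, which simply uses the GCC attribute.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical macro: pick whichever "does not return" spelling
 * the compiler in use understands, or nothing at all. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define VT_NORETURN _Noreturn
#elif defined(__GNUC__)
#  define VT_NORETURN __attribute__((noreturn))
#else
#  define VT_NORETURN
#endif

/* Hypothetical helper: print a message to stderr and terminate. */
VT_NORETURN void die(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "fatal: %s\n", msg);
    exit(EXIT_FAILURE);
}

/* Returns the length of s; never returns if s is NULL. */
size_t checked_len(const char *s)
{
    size_t n = 0;

    if (s == NULL)
        die("null string");
    while (s[n] != '\0')
        n++;
    return n;
}
```

Built with -std=c99 -pedantic this should take the __attribute__ branch on GCC, and the _Noreturn branch only when the compiler announces C11 or later.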
The other bug is a lot more serious. Unix diff is a line-oriented tool, and according to POSIX a line "is a sequence of zero or more non-<newline> characters plus a terminating <newline> character." A file that doesn't end in a newline is not even considered a proper text file[4]. Nevertheless this is a convention that's not often followed, nor do I think it should be, but supporting the alternative needs special-case handling. The Unified Format that we use puts a "\ No newline at end of file" marker in the appropriate places. Fuzzy patches can afford to be sloppy with newlines, but vpatch makes hard claims by including the hashes of the source and target files. Lacking the marker, patch adds an extra newline on press, invalidating the hash. This is exactly what happened with the current vdiff[5]: I took a wrong turn at a conditional flag, and instead of producing correct output, vdiff was simply warning the operator about the missing newlines.
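To make the marker rule concrete, here is a small sketch of how the added lines of a hunk could be written out for a buffer that may or may not end in a newline. This is not the diffutils or vdiff code; lacks_final_newline and emit_added_lines are hypothetical helpers written only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: true when the buffer is non-empty and its
 * last byte is not a newline. */
int lacks_final_newline(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len > 0 && buf[len - 1] != '\n';
}

/* Hypothetical helper: write every line of buf as an added ('+') line
 * of a unified diff hunk, followed by the marker when the last line
 * has no terminating newline. */
void emit_added_lines(FILE *out, const char *buf, size_t len)
{
    size_t start = 0;

    while (start < len) {
        size_t end = start;

        while (end < len && buf[end] != '\n')
            end++;
        fputc('+', out);
        fwrite(buf + start, 1, end - start, out);
        fputc('\n', out);
        if (end == len) /* ran off the end without seeing '\n' */
            fputs("\\ No newline at end of file\n", out);
        start = end + 1;
    }
}
```

The point is that the marker belongs in the patch itself, since a warning on stderr is lost by the time the patch gets pressed.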
A fix for both of these issues is available in two identical patches, one against the genesis and the other on top of the SHA release from last week. The reason for two separate patches is that this week's vtools is diverging into SHA and Keccak versions. The SHA release I intend to support until we have the necessary tooling to both press and display Keccak patches. As such, people who want to continue testing vdiff in their current workflow should grab vdiff_sha_fixes_newline_gcc to fix the bugs that came up last week.
Keccak support turned out to be a lot trickier than I expected. I'm using Diana Coman's smg_keccak, which is written in Ada.
I ran into two problems. One is that the interface to smg_keccak assumes stateless operation: you pass input and output buffers to the Sponge procedure, and you get the result immediately in a single pass. Diff, on the other hand, reads files block by block, so a conventional hashing interface would be more appropriate. You set up a context, then you incrementally feed new buffers to the context, and eventually you end the process by closing the context and producing the hash result. This is the technique I ended up implementing,
-- state based Sponge
type Keccak_Context (Block_Len: Keccak_Rate := Default_Bitrate) is
   record
      Internal: State := (others => (others => 0));
      Block: Bitstream(1..Block_Len) := (others => 0);
      Pos: Natural;
   end record;

procedure KeccakBegin(Ctx: in out Keccak_Context);
procedure KeccakHash(Ctx: in out Keccak_Context;
                     Input: Bitstream);
procedure KeccakEnd(Ctx: in out Keccak_Context;
                    Output: out Bitstream);
This is essentially the Sponge procedure, with the state externalized into the Keccak_Context record. KeccakBegin zeroes out the context. There is some complexity, which I handled somewhat inelegantly: KeccakHash treats the context's Block as a circular buffer, mapping Input at a moving index. When the index gets to a boundary, i.e. the Block bitstream is filled, the whole block is fed to AbsorbBlock and the Keccak_Function is applied. This is identical to what Sponge does, except my code for handling the circular buffer is dirty. Ada, it turns out, allows you to number a sequence from an arbitrary base. Diana has some elegant slicing going on in her Sponge, but I fell back to potato programming with variables being carefully updated in a goto loop. Finally KeccakEnd pads whatever's left in the block and finishes the hash. Experienced Ada programmers are invited to read the procedures and suggest improvements.
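The begin/feed/end pattern is easier to see in a toy version. The sketch below is written in C rather than Ada and is emphatically not Keccak: the 8-byte block, the multiply-and-xor mixer and the padding are stand-ins chosen only to show how input of arbitrary size gets buffered until a full block is available. All names (toy_ctx, toy_begin, toy_hash, toy_end) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 8 /* toy block size, unrelated to Keccak's bitrate */

typedef struct {
    uint64_t state;                 /* stand-in for the sponge state */
    unsigned char block[BLOCK_LEN]; /* partially filled input block */
    size_t pos;                     /* bytes currently held in block */
} toy_ctx;

/* Stand-in for AbsorbBlock plus the permutation: mix one full block. */
static void toy_absorb(toy_ctx *ctx)
{
    for (size_t i = 0; i < BLOCK_LEN; i++) {
        ctx->state ^= ctx->block[i];
        ctx->state *= 1099511628211ULL; /* arbitrary odd multiplier */
    }
}

void toy_begin(toy_ctx *ctx)
{
    assert(ctx != NULL);
    memset(ctx, 0, sizeof *ctx);
    ctx->state = 14695981039346656037ULL; /* arbitrary starting value */
}

/* Copy input into the block; absorb whenever the block fills up. */
void toy_hash(toy_ctx *ctx, const unsigned char *in, size_t len)
{
    assert(ctx != NULL);
    while (len > 0) {
        size_t room = BLOCK_LEN - ctx->pos;
        size_t n = len < room ? len : room;

        memcpy(ctx->block + ctx->pos, in, n);
        ctx->pos += n;
        in += n;
        len -= n;
        if (ctx->pos == BLOCK_LEN) {
            toy_absorb(ctx);
            ctx->pos = 0;
        }
    }
}

/* Pad whatever is left (toy padding: 0x80 then zeros) and finish. */
uint64_t toy_end(toy_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->block[ctx->pos++] = 0x80;
    memset(ctx->block + ctx->pos, 0, BLOCK_LEN - ctx->pos);
    toy_absorb(ctx);
    return ctx->state;
}
```

The property worth checking in any such interface is that the result does not depend on how the input was chopped up: feeding a message in one call, or one byte at a time, has to produce the same value.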
The second problem I ran into is that the C interoperability code that Diana Coman kindly provided no longer worked for me. I wrote a separate file that exposes Keccak_Context to C. You can guess the purpose of that code by the very frequent appearance of the letter "C".
package Keccak_C is
   subtype C_Context is Keccak_Context(Block_Len=>Default_Bitrate);
   type C_Context_Access is access C_Context;
   procedure C_Get_Size(Size: out Interfaces.C.size_t);
   pragma Export (C, C_Get_Size, "keccak_get_ctx_byte_size");
   function C_Begin return C_Context_Access;
   pragma Export (C, C_Begin, "keccak_begin");
   procedure C_Hash(Ctx: C_Context_Access;
                    Input: Interfaces.C.Char_Array;
                    Len: Interfaces.C.Size_T);
   pragma Export (C, C_Hash, "keccak_hash");
   procedure C_End(Ctx: C_Context_Access;
                   Output: out Interfaces.C.Char_Array;
                   Len: Interfaces.C.Size_T);
   pragma Export (C, C_End, "keccak_end");
   procedure C_Deallocate(Ctx: in out C_Context_Access);
   pragma Export (C, C_Deallocate, "keccak_free");
end Keccak_C;
With corresponding C side headers,
extern void *keccak_begin();
extern void keccak_get_ctx_byte_size(size_t *size);
extern void keccak_hash(void *ctx, char *array, size_t size);
extern void keccak_end(void *ctx, char *out, size_t size);
extern void keccak_free(void **ctx);
extern void adainit();
extern void adafinal();
The relevant code uses two techniques for passing data back and forth. C_Context_Access is what Ada calls an access type, or what in other languages is called a pointer. From the perspective of C it's an opaque pointer that we get from Ada and simply pass around through the lifetime of the operation. The underlying record is dynamically allocated with keccak_begin, and it's the user's responsibility to deallocate the memory with keccak_free. Note that the latter takes a pointer to a pointer, because that's how Ada seems to demand it, and after freeing, the pointer itself is set to NULL. I think this is a much cleaner convention than what's standard in the C world, i.e. freeing without also updating the pointer.
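The same convention can be used in plain C, independent of Ada. A minimal sketch, with a made-up opaque type and made-up function names (ctx_t, ctx_new, ctx_free) that have nothing to do with the keccak bindings themselves:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context type, standing in for any heap-allocated state. */
typedef struct {
    int dummy;
} ctx_t;

/* Allocate a zeroed context; returns NULL on allocation failure. */
ctx_t *ctx_new(void)
{
    return calloc(1, sizeof(ctx_t));
}

/* Free the context and reset the caller's pointer, so a stale pointer
 * cannot be used or freed a second time by accident. */
void ctx_free(ctx_t **pp)
{
    assert(pp != NULL);
    free(*pp); /* free(NULL) is a no-op, so calling this twice is harmless */
    *pp = NULL;
}
```

The cost is one extra level of indirection at the call site, ctx_free(&c) instead of free(c), which seems a fair price for never holding a dangling pointer.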
The other technique has already been demonstrated in Diana's Hash function, and it's used in both keccak_hash and keccak_end. We use a pointer to a character buffer and explicitly pass the size of that buffer. On the Ada side we then allocate size amount of storage and populate it with data from the buffer. A similar but reverse technique is used to get the data out. Ada's array interoperability seems to be limited to character arrays[6], and by default the interning functions assume that the data is NUL terminated the way a C string would be. So it's important to explicitly tell Interfaces.C.To_Ada and Interfaces.C.To_C to avoid NUL handling by passing Trim_Nul => False and Append_Nul => False respectively.
Writing the necessary Ada interoperability code was frankly a pain in the ass, but I don't think it was Ada's fault at any point. Every single change at my hands would produce a boundary check warning or an explicit exception, but once the issue was resolved resultant code would ultimately make sense, as opposed to being an arbitrary hack the way it is sometimes done in other languages. The result seems to work, and even produce testable results, but I'm not sure of its overall reliability. I expect there to be bugs.
The Keccak patch gets its own subtree in the vtools graph. As I mentioned, I'm applying fixes on top of genesis, but vdiff's code is also dependent on a creatively named "keccak" vpatch, which is smg_keccak.ads and smg_keccak.adb taken verbatim from the eucrypt project. This way, while reading vdiff_keccak it would be obvious what changes I've introduced. Still lacking rename and other similar operations in v machinery, I copied the smg files into the vtools hierarchy; otherwise they are identical to their source.
make can still be used to produce a vdiff file, but it's a thin wrapper on top of a gprbuild project. The resultant vdiff works as expected, except it produces a keccak hash[7].
vtools on a minimalist compiler, like PCC, and see how many of these GCC-isms will fall out [↩]
% echo -n test > foo
% hexdump foo
0000000 6574 7473
0000004
% shasum -a 512 foo
ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff foo
% vdiff bar foo|tee p1
vdiff: foo: No newline at end of file
--- bar false
+++ foo ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff
@@ -0,0 +1 @@
+test
% patch<p1
% hexdump bar
0000000 6574 7473 000a
0000005
% shasum -a 512 bar
0e3e75234abc68f4378a86b3f4b32a198ba301845b0cd6e50106e874345700cc6663a86c1ea125dc5e92be17c98f9a0f85ca9d5f595db2012f7cc3571945c123 bar
Correct behavior
% ./vdiff qux foo|tee p2
--- qux false
+++ foo ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff
@@ -0,0 +1 @@
+test
\ No newline at end of file
% patch < p2
% shasum -a 512 qux
ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff qux
[↩]
Posted on 2018-02-21 18:00 vtools
The original vdiff as written by asciilifeform is a tiny awk script that calls out to standard[1] sha512sum and diff. It was a great illustration of the concept, but inadvertently suffered from in-band issues and poor portability across the unixes in use by the republic. This post introduces the vtools project, which is supposed to address some of these shortcomings as well as deliver some long expected features.
The first release consists of two patches: one is the genesis, a stripped down version of diff[2], the other is SHA-512 code bolted to the differ. I've decided to keep the two parts separate, since the next release will be explicitly about replacing SHA-512 with Keccak, and because the coupling between the two parts might be educational for people who might want to hack on this differ themselves. Lacking a republican SHA, I'm using one I've lifted from Busybox.
The result is feature equivalent with the current vdiff, and should produce equivalent patches on the same code base.[3]
I took diff from GNU diffutils 3.6 and stripped it down to the parts absolutely necessary for the functioning of vdiff. Specifically, awk vdiff passes the -rNu flags, which make the operation recursive, produce diffs for missing files, and generate output in unified format. The diff codebase is split between lib and src. The former includes copy-pasted code that's shared between GNU projects. There's a lot of redundancy there; diffutils carries an entire compatibility substrate with it, a lot of which I could eliminate at the expense of "portability". It's unclear to me how much the result has suffered, since the functionality in the lib folder lacks internal consistency. Code that theoretically could run on DOS shares space with code that has hard unix assumptions baked in. The other directory, src, is where diff proper lives. The distinction is arbitrary, but I've kept it for now, because it is aiding me in the exploration of the code.
The project has a top-level Makefile which will from now on build all the different tools. Functionality ought to be self evident, press the tree, call make at top level, you get a vdiff executable. Vdiff takes some extra arguments, left over from diff, I pruned them down to only ones that are still at all relevant, but their availability shouldn't be relied on.
Testing, comments and hate are welcome.[4]
It takes about 9.5s to generate the entire set of patches from the current trb stable using awk vdiff; this implementation, on the other hand, takes 1.2s on my machine. For reference it takes about 0.2s to simply diff those files (producing broken vpatches). The way I tested this, I generated all the intermediate presses (test1.tbz2) for trb from genesis to makefiles, and then diffed all those presses against each other.[5]
cloc on diffutils-3.6
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                              338           9412          12349          49897
Bourne Shell                    96           9135           6533          40712
PO File                         32           9082          13029          29382
C/C++ Header                   166           4272           7421          22080
m4                             197           1283           1387          20227
TeX                              1            812           3694           7175
make                            13           1760           1469           3875
Perl                             1            103            117            451
sed                              2              0              0             16
-------------------------------------------------------------------------------
SUM:                           846          35859          45999         173815
-------------------------------------------------------------------------------
cloc on the fresh press of vdiff
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               15            784            951           2740
C/C++ Header                    12            280            351            609
make                             1              5              0             34
-------------------------------------------------------------------------------
SUM:                            28           1069           1302           3383
-------------------------------------------------------------------------------
Lines of code is a somewhat meaningless metric in this case, since vdiff is not a replacement for diff proper. Perhaps a more relevant metric would've been a vdiff written from scratch, but lacking that we can marvel at great savings.
strcmp to sort names the same for everyone. Well, it turns out that en_US.UTF-8 places dot before dash, while the C locale, which is what strcmp uses, places it after. This means that, for example, mpi, which contains files named mpi.h and mpi-internal.h, will have files in a different order when produced by C vdiff as opposed to awk vdiff. This might've been prevented if the latter had LC_ALL=C set, but as it stands most of the extant vpatches have been produced with whatever the system locale was.
This can be easily demonstrated,
# mkdir a b
# echo foo > b/mpi.h
# echo foo > b/mpi-internal.h
# LC_ALL=C diff -ruN a b | grep '^diff'
diff -ruN a/mpi-internal.h b/mpi-internal.h
diff -ruN a/mpi.h b/mpi.h
# LC_ALL=en_US.UTF-8 diff -ruN a b | grep '^diff'
diff -ruN a/mpi.h b/mpi.h
diff -ruN a/mpi-internal.h b/mpi-internal.h
Now would be a good time to introduce standard republican alphabetic order.
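The C locale half of that demonstration comes down to plain byte comparison, which can be sketched in a few lines. The function names cmp_bytes and sort_names are made up for this example and are not the ones used in vdiff.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C strings, ordering them by raw
 * byte value, which is what strcmp does regardless of the locale. */
static int cmp_bytes(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    assert(sa != NULL && sb != NULL);
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort file names the way the C locale would. */
void sort_names(const char **names, size_t n)
{
    qsort(names, n, sizeof *names, cmp_bytes);
}
```

Since '-' is 0x2d and '.' is 0x2e in ASCII, mpi-internal.h ends up before mpi.h, matching the LC_ALL=C output above. A locale-aware comparison such as strcoll may order them differently depending on the environment.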
As with internationalization, there are potentially some changes that are the result of the cut, but these should be considered bugs. There is one explicit change though, which is related to the diagnostic output. What a differ does when it encounters files that it can't produce a diff for is standardized by POSIX. It is expected to produce messages of the format "File foo is a directory while file bar is a regular file", etc. and output them in band. In the case of diff this is perhaps useful behavior, but the vpatch format doesn't recognize these sorts of messages, so the patch author has to remove them by hand. These messages are notoriously hard to spot, and during testing I found a leftover "Binary files a/logbot/logbot.fasl and b/logbot/logbot.fasl differ" in the published logbot's vpatch. So all the diagnostic messages now go to stderr. Vdiff's output should be standard format vpatch. [↩]
wget http://btcbase.org/data/vtools/test1.tbz2
tar xjf test1.tbz2
cd test1
bash run.bash <path to vdiff>
[↩]
Posted on 2018-02-08 20:25 notes
baggage policy is for travel from United States to Uruguay
cabin 23kg/ 50lb
check 2 x 23kg/50lb
OVERWEIGHT BAGGAGE FEE
from 24kg/ 52lb to 32kg/ 70lb US$100
from 33kg/ 72lb to 45kg/ 99lb US$200
OVERSIZED BAGGAGE FEE
159 - 272 cm linear (63 - 107 inches) US$150 per bag
US$175 for each additional bag up to 23 kg each + overweight fees
No more than 2 additional bags allowed.
RESTRICTIONS
Individual pieces over 45 kg (99 lb) and 272 cm (107 inches) linear are not allowed
during 1 Jul - 31 Aug, 1 Dec - 31 Jan no additional baggage allowed
Anything weighing more than 100 lb must be sent as cargo.
Each piece can measure up to 62 combined linear inches (158 cm) (height + length + width).
Pieces with a combined linear measurement between 63 inches (159 cm) and 107 inches (272 cm) are considered excess baggage.
Pieces whose combined linear measurements exceed 107 inches (272 cm) will not be accepted as baggage and must be transported as cargo.