BARKS IN THE WIND

v.py updated for vtools

Posted on 2018-09-30 23:09 technical, vtools

As has been discussed in the logs, vtools doesn't stand on its own as a V implementation. Instead it's a collection of tools for working with vpatches. V authors can use vtools so as to not rely on often brittle GNU utilities.

On my own workbench I've been using a patched-up version of asciilifeform's original v.py, where I replaced the call to GNU patch with one to vpatch. The replacement is essentially a drop-in [1], with the advantage of being much stricter about which vpatches are accepted and of making sure that the press hashes are valid.

I have barely touched v.py otherwise [2], so I consider this a proof of concept release. It consists of two patches: the original v.py version 99 genesis [3] and my own modifications.

Unless you have a working keccak v, you'll need to bootstrap manually. Assuming you have a working vtools build,

PATH=/path/to/vtools:$PATH
mkdir {wot,seals,patches}
curl --silent http://wot.deedbot.org/BDDE12104FE81BE7F83B698F5356DE4752432A9E.asc -o wot/phf.asc
gpg --import wot/phf.asc
curl --silent -o patches/v99.vpatch  http://btcbase.org/data/vpy/v99.vpatch
curl --silent -o patches/v98.vpatch  http://btcbase.org/data/vpy/v98.vpatch
curl --silent -o seals/v99.vpatch.phf.sig  http://btcbase.org/data/vpy/v99.vpatch.phf.sig
curl --silent -o seals/v98.vpatch.phf.sig  http://btcbase.org/data/vpy/v98.vpatch.phf.sig
gpg --verify seals/v99.vpatch.phf.sig patches/v99.vpatch
gpg --verify seals/v98.vpatch.phf.sig patches/v98.vpatch
cat patches/v99.vpatch patches/v98.vpatch | vpatch
pip install python-gnupg
chmod +x v/v.py
./v/v.py --wot ./wot -fingers --seals ./seals ./patches p ./patches/v98.vpatch v_press

You now have a self-pressed v.py in the v_press directory!

Some things to note: the bulk of the bootstrapping effort is verifying the patch signatures, something that v does for you. On the other hand, you can just cat any number of patches into the vpatch utility and it will produce a verified press. Asciilifeform's v.py uses stock python, but it does depend on the python-gnupg package, which can be installed through pip or whatever global packaging system you use (on gentoo it's emerge python-gnupg).

  1. Right now vpatch doesn't support a target directory and presses into the current directory, so I had to make some changes to accommodate v.py's concept of destination. I also hardened the call out to an external process, though perhaps unnecessarily.
  2. I have also added support for subkeys.
  3. The first release of v.py is actually version 100, but the diff between 100 and 99 is in my opinion entirely cosmetic, so I avoided pedantically reconstructing the entire chain, and started with a canonical version of v.py.

vtools complete keccak prerelease

Posted on 2018-04-07 20:20 notes, technical, vtools

I'm going to call this post a vtools pre-release. I'm deferring the proper release write-up till Wednesday, but meanwhile the relevant release work has been done, and it's a good time to point interested parties to the bits so that further log discussion can happen. I doubt that my write-ups stand on their own, that is, without also closely following the goings-on in the logs, but this post in particular is only of interest to specific people.

I've reground the project around a manifest file. From previous conversations, it seems like the format is mostly inconsequential, so I'm using <date> <nick>\n<message>. There's an example of the manifest file press; you can see the implicit press order in the btcbase annotation, and how that manifest change looks in a vpatch file. In the process I've discovered that the btcbase presser isn't working quite right, so at the moment /tree/ shouldn't be relied on for exploration of the press. [1]

Keccak vdiff/vpatch are now at feature parity with the existing shell-based tooling; specifically, vpatch now supports the no newline directive. We're going to start working with a complete round trip in mp-wp, which is going to be a keccak-only release. I would still like to make vpatch work with SHA-512 though.

Current complete patchset, with vtools_vpatch_newline keccak and vdiff_sha_static SHA-512 heads,

http://btcbase.org/data/vtools/keccak.vpatch
http://btcbase.org/data/vtools/keccak.vpatch.phf.sig
http://btcbase.org/data/vtools/vdiff_fixes_newline_gcc.vpatch
http://btcbase.org/data/vtools/vdiff_fixes_newline_gcc.vpatch.phf.sig
http://btcbase.org/data/vtools/vdiff_keccak.vpatch
http://btcbase.org/data/vtools/vdiff_keccak.vpatch.phf.sig
http://btcbase.org/data/vtools/vdiff_sha_fixes_newline_gcc.vpatch
http://btcbase.org/data/vtools/vdiff_sha_fixes_newline_gcc.vpatch.phf.sig
http://btcbase.org/data/vtools/vdiff_sha_static.vpatch
http://btcbase.org/data/vtools/vdiff_sha_static.vpatch.phf.sig
http://btcbase.org/data/vtools/vtools_fixes_bitrate_char_array.vpatch
http://btcbase.org/data/vtools/vtools_fixes_bitrate_char_array.vpatch.phf.sig
http://btcbase.org/data/vtools/vtools_fixes_static_tohex.vpatch
http://btcbase.org/data/vtools/vtools_fixes_static_tohex.vpatch.phf.sig
http://btcbase.org/data/vtools/vtools_genesis.vpatch
http://btcbase.org/data/vtools/vtools_genesis.vpatch.phf.sig
http://btcbase.org/data/vtools/vtools_vdiff_sha.vpatch
http://btcbase.org/data/vtools/vtools_vdiff_sha.vpatch.phf.sig
http://btcbase.org/data/vtools/vtools_vpatch.vpatch
http://btcbase.org/data/vtools/vtools_vpatch.vpatch.phf.sig
http://btcbase.org/data/vtools/vtools_vpatch_newline.vpatch
http://btcbase.org/data/vtools/vtools_vpatch_newline.vpatch.phf.sig
  1. e.g. vtools_vpatch_newline's manifest contains two extra entries at the end, which is the result of a buggy press rather than the contents of the relevant vpatches.

vtools vpatch

Posted on 2018-03-22 15:00 technical, vtools

I just wrapped up a busy three weeks' worth of a family trip and two back-to-back conferences. I completely forgot how exhausting conferences are, and how little time they leave for anything else. There's a short backlog of posts that I'm going to publish once I'm back home, but now that I've had a chance to recover a bit, I'm going to release what I managed to work on during my travels.

I present for your consideration a proof of concept release of a vpatcher that can press keccak patches. This implementation was modeled on the vpatch parser/press that's used in btcbase to render patches and, more importantly, to produce an in-memory press. While the general architecture was chosen upfront, this implementation was authored incrementally, a development style that is surprisingly well supported by Ada [1], so the bulk of functionality is contained in a single file. At this point I'm not convinced that some kind of split is required, though splitting it in the style of ascii's FFI might bring some clarity.

Vpatch is essentially a standard Unix patch tool that supports strict application of a unified diff and verification of V hashes. It takes a patch on standard input and attempts to press its content into the current directory. Existing files are patched, new files are created, old files are removed. Processing is done one file at a time, and the operation terminates with an error when an issue is encountered. The patcher keeps a temporary file around for the result output, which gets moved into place once the file's hash has been verified. This means that atomicity is preserved at the file level but not at the patch level, and a failed press leaves the directory in an inconsistent state. Individual files are always in either the previous or the new state, which means that you get to inspect the offending file, but you have to fully redo the press on failure. This is a decision that I might have to reconsider, at the expense of increased complexity. Right now very little is kept in memory: information about the current file being patched, the current hunk and whatever simple data is used to track the state.
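The temp-file-then-move step can be sketched in a few lines of Python. This is only an illustration of the idea, not the patcher's Ada code: write_verified is a hypothetical name, and SHA-512 from hashlib stands in for Keccak.

```python
import hashlib
import os
import tempfile


def write_verified(path, data, expected_hex):
    """Replace path with data, but only if data hashes to expected_hex.

    Hypothetical helper; SHA-512 is a stand-in for the Keccak hash.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temporary file next to the target so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        if hashlib.sha512(data).hexdigest() != expected_hex:
            raise ValueError("to hash doesn't match")
        # The target is only touched once the result has been verified.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure the previous file is left as it was.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Each file ends up either untouched or fully replaced, which matches the file-level (but not patch-level) atomicity described above.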

To build the patcher use make vpatch, or call gprbuild on vpatch.gpr directly. To test the patcher, I reground the trb stable release. Since the patcher doesn't verify sigs, I haven't signed the regrind, and it's provided for testing purposes only.

Press the genesis,

% vpatch < ps/1.vpatch
creating bitcoin/.gitignore
creating bitcoin/COPYING
...
creating bitcoin/src/wallet.cpp
creating bitcoin/src/wallet.h

Apply the next patch on top of it,

% vpatch < ps/2.vpatch
patching bitcoin/src/bitcoinrpc.cpp
patching bitcoin/src/db.cpp
patching bitcoin/src/headers.h
patching bitcoin/src/init.cpp
deleting bitcoin/src/qtui.h
patching bitcoin/src/util.h
patching bitcoin/src/wallet.cpp

If we now try to rerun genesis in the same folder we get,

% vpatch < ps/1.vpatch
creating bitcoin/.gitignore

raised VPATCH.STATE : attempt to create a file, but file already exists

Likewise, an attempt to reapply the second patch results in failure, since the files on disk no longer have the hash that the patch expects [2],

% vpatch < ps/2.vpatch
patching bitcoin/src/bitcoinrpc.cpp

raised VPATCH.STATE : from hash doesn't match

Same will happen if we attempt to apply a significantly later patch, since the necessary intermediate patches are missing,

% vpatch < ps/12.vpatch
patching bitcoin/src/main.cpp

raised VPATCH.STATE : from hash doesn't match

Finally applying the correct patch succeeds,

% vpatch < ps/3.vpatch
patching bitcoin/src/db.cpp
patching bitcoin/src/init.cpp
patching bitcoin/src/main.cpp
patching bitcoin/src/main.h
patching bitcoin/src/makefile.linux-mingw
patching bitcoin/src/makefile.unix
patching bitcoin/src/net.cpp

Supporting pipe streaming means that we can start vpatch and incrementally feed it patches, moving the tree towards the press top. (In this case patches that we're pressing are named in order from 1 to 27.)

% cat ps/{1..27}.vpatch | vpatch
creating bitcoin/.gitignore
creating bitcoin/COPYING
...
creating bitcoin/deps/Makefile
creating bitcoin/deps/Manifest.sha512
patching bitcoin/src/db.cpp
patching bitcoin/src/init.cpp
patching bitcoin/src/main.cpp
creating bitcoin/verify.mk

The press can be sanity checked using e.g. checksum file from btcbase, but obviously the tool itself does both input and output hash verification as it goes.

There are some known issues. The biggest one is that \ No newline at end of file doesn't work yet, and the patcher fails when it encounters the directive. Halfway through development I discovered that Text_IO is idiosyncratic: there's no machinery to produce a line without a newline at the end, or to figure out whether or not an existing file has one. [3] Ada always outputs a valid text file and there's no way to avoid it with Text_IO. Future developers beware! A solution to this problem is to use Sequential_IO specialized on Character, but that means writing your own high-level procedures like Get_Line. In the work-in-progress modification that uses Sequential_IO I was able to build a drop-in replacement for Text_IO with a minimum of changes to existing code, by gradually adding the missing functionality.

To understand the way this patcher works it's helpful to have some idea about the diff format that we're using. There are three data structures that I use to keep track of patch data. The header,

   type Header (From_L, To_L: Natural) is record
      From_Hash: Hash;
      From_File: String(1..From_L);
      To_Hash: Hash;
      To_File: String(1..To_L);
   end record;

which holds the source and the destination file information and corresponds to

diff -uNr a/bitcoin/.gitignore b/bitcoin/.gitignore
--- a/bitcoin/.gitignore false
+++ b/bitcoin/.gitignore 6654c7489c311585d7d3...

lines.

A hash is a variant record type that either holds a specific value or, when there's no hash (indicated by the false label), is explicitly marked empty. That happens when the file is either created or removed, with an empty from or to hash respectively.

   type Hash_Type is (Empty, Value);
   type Hash(The_Type: Hash_Type := Empty) is record
      case The_Type is
         when Value =>
            Value: String(1..Hash_Length);
         when Empty =>
            null;
      end case;
   end record;

We distinguish between three possible file operations,

   type Patch_Op is (Op_Create, Op_Delete, Op_Patch);

which depend on the presence of hashes. If we only have an input hash, the file is being deleted; likewise, if we only have an output hash, the file is being newly created. Both hashes indicate that the file is being patched.
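A minimal Python sketch of this classification, assuming header lines shaped like the example above. PatchOp, parse_header_line and classify are hypothetical names that mirror the Ada types rather than the patcher's actual code, and splitting on the last space is an assumption about the line layout.

```python
from enum import Enum
from typing import Optional, Tuple


class PatchOp(Enum):
    CREATE = "create"
    DELETE = "delete"
    PATCH = "patch"


def parse_header_line(line: str) -> Tuple[str, Optional[str]]:
    """Split '--- path hash' or '+++ path hash' into (path, hash or None)."""
    _marker, rest = line.rstrip("\n").split(" ", 1)
    path, hash_field = rest.rsplit(" ", 1)
    # The literal label "false" marks an empty hash.
    return path, (None if hash_field == "false" else hash_field)


def classify(from_hash: Optional[str], to_hash: Optional[str]) -> PatchOp:
    """Pick the file operation from which of the two hashes are present."""
    if from_hash is None and to_hash is None:
        raise ValueError("header carries no hash at all")
    if from_hash is None:
        return PatchOp.CREATE   # only an output hash: new file
    if to_hash is None:
        return PatchOp.DELETE   # only an input hash: file removed
    return PatchOp.PATCH        # both hashes: existing file changed
```

For the .gitignore header shown earlier, the from hash is empty and the to hash is present, so the sketch classifies it as a create.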

Each header is followed by one or more "hunks", a line count prelude followed by the line related commands,

   type Line_Numbers is record
      Start: Natural;
      Count: Natural;
   end record;

   type Hunk is record
      From_File_Line_Numbers: Line_Numbers;
      To_File_Line_Numbers: Line_Numbers;
   end record;

A hunk holds the actual change details, that is, a sequence of optional context lines which we use for sanity checking ("patch claims line foo; does the input file actually have foo in that line?"), followed by some number of additions or deletions, followed by optional context lines. The line commands are not actually stored in memory; instead they are processed as they are encountered. A typical hunk looks like this,

@@ -435,7 +432,7 @@
     {
         BOOST_FOREACH(string strAddr, mapMultiArgs["-addnode"])
         {
-            CAddress addr(strAddr, fAllowDNS);
+            CAddress addr(strAddr);
             addr.nTime = 0; // so it won't relay unless successfully connected
             if (addr.IsValid())
                 AddAddress(addr);

Only the first line, the @@ prelude, is kept in memory. Each record has a corresponding Get procedure which reads the record from the input stream. This way you can say, e.g., Get(A_Hunk) and that'll read the @@ -435,7 +432,7 @@ line from the input stream.
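As a rough Python counterpart to Get(A_Hunk), the prelude can be parsed with a regular expression. LineNumbers and Hunk mirror the Ada records above, while parse_prelude is a hypothetical name; the actual patcher reads the stream character by character rather than matching a pattern.

```python
import re
from typing import NamedTuple


class LineNumbers(NamedTuple):
    start: int
    count: int


class Hunk(NamedTuple):
    from_file_line_numbers: LineNumbers
    to_file_line_numbers: LineNumbers


# "@@ -start[,count] +start[,count] @@"
_PRELUDE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_prelude(line: str) -> Hunk:
    """Read the line counts from a unified diff hunk prelude."""
    match = _PRELUDE.match(line)
    if match is None:
        raise ValueError("not a hunk prelude: %r" % line)
    from_start, from_count, to_start, to_count = match.groups()
    # In unified format an omitted count stands for a single line.
    return Hunk(
        LineNumbers(int(from_start), 1 if from_count is None else int(from_count)),
        LineNumbers(int(to_start), 1 if to_count is None else int(to_count)),
    )
```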

The parser is naive in that it looks at the stream one character at a time and dispatches to various handlers, rarely needing to read the whole of something before making a decision. This is a traditional lisp parsing technique, which is also well supported in Ada. The bulk of work happens inside an unreasonably large Process_Hunks_For_Header procedure. I will eventually attempt to refactor it, but right now it does all the pre and post checks, parses the hunk body and performs relevant modifications. It relies on record Gets to parse the input patch. There are two loops in the body: Hunk_Loop, which handles each hunk under the header, and Hunk_Body_Loop, which actually handles individual line changes within a hunk. The core of the hunk body loop is a dispatch on the first character in the line,

            exit Hunk_Body_Loop when From_Count = 0 and To_Count = 0;
            Look_Ahead(C, EOL);
            if EOL then
               raise Parse with "blank line in hunk";
            end if;
            case C is
               when '+' => -- line added
...
               when '-' => -- line deleted
...
               when ' ' => -- line stays the same
...
               when others =>
                  raise Parse with "unexpected character "
...
            end case;

The attentive reader will note that we exit the loop based exclusively on the expected line count, which is the information communicated in the hunk @@ prelude.
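The same count-driven dispatch, sketched in Python under simplifying assumptions: lines arrive as strings without their trailing newline, and apply_hunk_body is a hypothetical name. It is a restatement of the loop's logic, not a translation of Process_Hunks_For_Header.

```python
def apply_hunk_body(patch_lines, source_lines, from_count, to_count):
    """Consume one hunk body and return the resulting output lines.

    patch_lines:  iterator positioned just after the @@ prelude
    source_lines: iterator positioned at the hunk's first source line
    """
    output = []
    # Exit is decided by the counts from the prelude alone.
    while not (from_count == 0 and to_count == 0):
        line = next(patch_lines, None)
        if line is None:
            raise ValueError("patch ended inside a hunk")
        if line == "":
            raise ValueError("blank line in hunk")
        command, text = line[0], line[1:]
        if command == "+":                      # line added
            output.append(text)
            to_count -= 1
        elif command == "-" or command == " ":  # deleted or unchanged
            # Both kinds must agree with what the source file has.
            if next(source_lines, None) != text:
                raise ValueError("source line doesn't match patch")
            from_count -= 1
            if command == " ":                  # unchanged lines are kept
                output.append(text)
                to_count -= 1
        else:
            raise ValueError("unexpected character %r" % command)
        if from_count < 0 or to_count < 0:
            raise ValueError("hunk has more lines than its prelude claims")
    return output
```

Because the loop stops as soon as both counts reach zero, whatever follows in the patch stream (the next hunk or the next header) is left unread for the caller.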

There are some obvious improvements for future releases, starting with the aforementioned newline at end of file issue. I'd also like to port this implementation to the SHA-512 branch, to allow testing in the current workflow. The SHA port in particular will let me test Ada to C interoperability. Going back to the Wednesday schedule, I will address one of these in the next release.

  1. I'm more and more impressed with Ada as a language. Unfortunately, after extensive use I've run into various issues with the core library, which generally left a poor impression. A lot of the decisions very much leave the impression of http://btcbase.org/log/2018-03-08#1787319
  2. Vpatch is a streaming tool, and none of the files are read twice. So hashing happens online, and we hash AND attempt to patch in parallel. If the patching attempt fails (because the patch information doesn't match the file's contents), we complete hashing and report either of the errors. Hashing has higher priority, but if the hash is valid we'll report the patching error instead.
  3. End_Of_File also has an odd behavior in that it will report True if there's an end of file OR if there's a newline followed by end of file. This means that if you're using a traditional "while not eof" loop, you're going to lose the last newline.

vtools C interop, other fixes

Posted on 2018-03-08 02:15 technical, vtools

The original plan to get vpatch released this week fell through; instead there are more bug fixes.

The exercise of trimming GNU diff left a bad taste in my mouth: the end result, while significantly reduced and thus easier to study, is still a significant chunk of "clever" C code clocking in at 3383 lines. But the exercise was worthwhile [1] in that it allowed me to explicitly preserve diff's quirks when it comes to hunk construction, in order to be able to replicate existing vpatches.

The same consideration doesn't apply to a patcher, since a patcher is entirely dumb machinery, a kind of player piano, executing instructions from a pre-recorded tape. As I've been enjoying my brief foray into Ada programming, thrust as it was on me by the republic, I decided to stick to the environment and use it to implement the patcher also. There are some rational reasons for using Ada for the patcher instead of C. Where a differ can afford to be sloppy in operation (an operator can identify issues by reading the patch), a press absolutely must result in the tree of files claimed by the press chain, or fail explicitly.

The current version of the Ada patcher was modeled on btcbase's internal Lisp implementation, and at 490 lines, can successfully press trb's "stable" branch. Unfortunately, in the process of testing the patcher I've discovered another bug in the current keccak differ, which ate up the rest of my allocated time.

The possibility of that bug was hinted at in the recent conversation with diana_coman on the subject of Ada and C interoperability. The vtools interface to SMG's Keccak has two functions that, among other things, transfer arrays of characters between diff's C code and Keccak's Ada: C_Hash and C_End. The first takes in the text of the files under consideration, in chunks, and the latter sends back the final hash value, encoded as an array of bits and represented on the C side as a string. Last week's vdiff uses the Interfaces.C.Char_Array type to point to a shared char buffer, and the standard functions Interfaces.C.To_Ada / Interfaces.C.To_C to convert the data between languages.

Well, in our conversation, Diana mentioned that in her experiments To_Ada sometimes stopped too soon and failed to copy the entire contents of the buffer. My reaction at the time was essentially "well, works for me" [2], except now it doesn't, and on the most recent pass the issue came back with a vengeance, almost entirely failing to transfer the data [3]. The test environment ostensibly stayed the same, so I'm completely mystified by the behavior. I'll attempt a dive into Ada's code to at least understand what's going on, but meanwhile, I rewrote the offending bits using yet another C to Ada copy method.

This is not the approach that diana_coman took in her code, which still uses Interfaces.C.Char_Array for the data type, but instead of using To_C/To_Ada her functions explicitly walk the buffer in a loop, copying each character one by one. I instead listened to the siren's call of Ada's documentation [4] and went with Interfaces.C.Pointers, which even provides a helpful example of a roundtrip at the end of the section. The details of the implementation can be seen in the patch, but they follow almost directly the blueprint in the documentation. The approach is similar to what diana_coman does, in that characters are copied in an explicit loop, but instead of Char_Array I'm now using a pointer abstraction, which mimics C's behavior and requires explicit advancement.

The method that I've implemented turned out to have already been condemned by ave1. Apparently he wrote what looks like an extensive investigation into dealing with char array. All in all much back to the drawing board.

The patch also includes a backport of bounds error fix for smg_keccak, documented in extensive detail on diana_coman's blog.

  1. besides, that is, personal educational value
  2. clearly there's a need for a better stress testing environment on my part
  3. although when it comes to hashing, "almost" entirely is the proverbial "shit soup"
  4. standard!!1

vtools keccak

Posted on 2018-02-28 23:55 technical, vtools

Last week's vtools genesis received a warm welcome, and I appreciate all the testing people did. And with testing came bugs. One of the compilation warnings was quietly resolved in a regrind, before anyone but mod6 had a chance to look. Meanwhile hanbot found another two issues.

One is related to compilation: a C11 feature, the _Noreturn attribute on one of the functions, prevented vtools from compiling on older versions of GCC [1]. Even though in the makefile I specified that the source code is C99 conformant, what I was missing is the -pedantic flag [2], which would've warned me that _Noreturn is not available in C99. The fix was trivial, and that was to use the GCC-specific __attribute__((noreturn)) [3].

The other bug is a lot more serious. Unix diff is a line-oriented tool, and according to POSIX a line "is a sequence of zero or more non-<newline> characters plus a terminating <newline> character." A file that doesn't end in a newline is not even considered a proper text file [4]. Nevertheless this is a convention that's not often followed, nor do I think it should be, but support for the alternative needs special-case handling. The Unified Format that we use puts \ No newline at end of file in the appropriate places. Fuzzy patches can afford to be sloppy with newlines, but vpatch makes hard claims by including the hashes of the source and target files. Lacking the marker, patch adds an extra newline on press, invalidating the hash. This is exactly what happened with the current vdiff [5]: I took a wrong turn at a conditional flag, and instead of producing correct output, vdiff was simply warning the operator about the missing newlines.
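The effect of that one stray newline on the hash is easy to reproduce. The following Python sketch uses hashlib's SHA-512 on the same four-byte "test" content that the footnote's shell session works with; sha512_hex is just a local helper name.

```python
import hashlib


def sha512_hex(data: bytes) -> str:
    """Hex SHA-512 digest, comparable to shasum -a 512 output."""
    return hashlib.sha512(data).hexdigest()


without_newline = b"test"    # what `echo -n test` writes
with_newline = b"test\n"     # what a press yields when the marker is missing

print(sha512_hex(without_newline))
print(sha512_hex(with_newline))
# A single trailing byte is enough to make the two digests unrelated.
print(sha512_hex(without_newline) == sha512_hex(with_newline))
```

The first digest is the one the vpatch claims for the target file, so a press that silently appends the newline can never satisfy it.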

A fix for both of these issues is available in two identical patches, one against the genesis and the other one is on top of SHA release from last week. The reason for two separate patches is that this week's vtools is diverging into SHA and Keccak versions. The SHA release I intend to support until we have the necessary tooling to both press and display Keccak patches. As such people who want to continue testing vdiff in their current workflow should grab vdiff_sha_fixes_newline_gcc to fix the bugs that came up last week.

Keccak support turned out to be a lot trickier than I expected. I'm using Diana Coman's smg_keccak, which is written in Ada.

I ran into two problems, one is that the interface to smg_keccak assumes stateless operation. You pass input and output buffers to Sponge procedure. And you get the immediate result in a single pass. Diff on the other hand reads files block by block, so a conventional hashing interface would be more appropriate. You setup a context, then you incrementally feed new buffers to the context, and eventually you end the process by closing the context and producing the hash result. This is the technique I ended up implementing,

   -- state based Sponge
   type Keccak_Context (Block_Len: Keccak_Rate := Default_Bitrate) is
      record
         Internal: State := (others => (others => 0));
         Block: Bitstream(1..Block_Len) := (others => 0);
         Pos: Natural;
      end record;

   procedure KeccakBegin(Ctx: in out Keccak_Context);
   procedure KeccakHash(Ctx: in out Keccak_Context;
                        Input: Bitstream);
   procedure KeccakEnd(Ctx: in out Keccak_Context;
                       Output: out Bitstream);
398

This is essentially the Sponge procedure, with the state externalized into the Keccak_Context record. KeccakBegin zeroes out the context. There is some complexity, which I handled somewhat inelegantly: KeccakHash treats the context's Block as a circular buffer, mapping Input at a moving index. When the index gets to a boundary, i.e. the Block bitstream is filled, the whole block is fed to AbsorbBlock and the Keccak_Function is applied. This is identical to what Sponge does, except my code for handling the circular buffer is dirty. Ada, it turns out, allows you to number a sequence from an arbitrary base. Diana has some elegant slicing going on in her Sponge, but I fell back to potato programming with variables being carefully updated in a goto loop. Finally KeccakEnd pads whatever's left in the block, and finishes the hash. Experienced Ada programmers are invited to read the procedures and suggest improvements.
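The begin / hash / end shape of that interface, reduced to its buffering logic, looks roughly like this in Python. It is a sketch under loud assumptions: BlockContext and run are hypothetical names, the buffer holds bytes rather than bits, absorb is a caller-supplied stand-in for AbsorbBlock plus the Keccak function, and padding is left out entirely.

```python
class BlockContext:
    """Collect arbitrarily sized input into fixed-size blocks."""

    def __init__(self, block_len, absorb):
        self.block_len = block_len   # e.g. 168 bytes for a 1344-bit rate
        self.absorb = absorb         # stand-in for the real absorb step
        self.begin()

    def begin(self):
        """Reset the buffer, in the spirit of KeccakBegin."""
        self.block = bytearray(self.block_len)
        self.pos = 0

    def feed(self, data):
        """Copy input at a moving index; hand off each filled block."""
        for byte in data:
            self.block[self.pos] = byte
            self.pos += 1
            if self.pos == self.block_len:
                self.absorb(bytes(self.block))
                self.pos = 0

    def end(self):
        """Return the unabsorbed tail; real padding is not modeled."""
        tail = bytes(self.block[:self.pos])
        self.begin()
        return tail


def run(chunks, block_len=3):
    """Feed chunks through a context and collect what was absorbed."""
    blocks = []
    ctx = BlockContext(block_len, blocks.append)
    for chunk in chunks:
        ctx.feed(chunk)
    return blocks, ctx.end()
```

The point of the design is that the absorbed blocks depend only on the concatenated input, not on how the caller happened to chunk it, which is what lets diff feed the file block by block.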

Second problem I ran into is that the C interoperability code that Diana Coman kindly provided no longer worked for me. I wrote a separate file that exposes Keccak_Context to C. You can guess the purpose of that code, by the very frequent appearance of letter "C".

package Keccak_C is
   subtype C_Context is Keccak_Context(Block_Len=>Default_Bitrate);
   type C_Context_Access is access C_Context;
   procedure C_Get_Size(Size: out Interfaces.C.size_t);
   pragma Export (C, C_Get_Size, "keccak_get_ctx_byte_size");
   function C_Begin return C_Context_Access;
   pragma Export (C, C_Begin, "keccak_begin");
   procedure C_Hash(Ctx: C_Context_Access;
                    Input: Interfaces.C.Char_Array;
                    Len: Interfaces.C.Size_T);
   pragma Export (C, C_Hash, "keccak_hash");
   procedure C_End(Ctx: C_Context_Access;
                   Output: out Interfaces.C.Char_Array;
                   Len: Interfaces.C.Size_T);
   pragma Export (C, C_End, "keccak_end");
   procedure C_Deallocate(Ctx: in out C_Context_Access);
   pragma Export (C, C_Deallocate, "keccak_free");
end Keccak_C;

With corresponding C side headers,

extern void *keccak_begin();
extern void keccak_get_ctx_byte_size(size_t *size);
extern void keccak_hash(void *ctx, char *array, size_t size);
extern void keccak_end(void *ctx, char *out, size_t size);
extern void keccak_free(void **ctx);
extern void adainit();
extern void adafinal();

The relevant code uses two techniques for passing data back and forth. C_Context_Access is what Ada calls an access type, or what in other languages is called a pointer. From the perspective of C it's an opaque pointer that we get from Ada, and simply pass around through the lifetime of the operation. The underlying record is dynamically allocated with keccak_begin, and it's the user's responsibility to deallocate the memory with keccak_free. Note that the latter takes a pointer to a pointer, because that's how Ada seems to demand it, and after freeing, the pointer itself is set to NULL. I think this is a much cleaner convention than what's standard in the C world, i.e. freeing without also updating the pointer.

The other technique has already been demonstrated in Diana's Hash function and it's used in both keccak_hash and keccak_end. We use a pointer to a character buffer, and explicitly pass the size of that buffer. On the Ada side we then allocate size amount of storage and populate it with data from the buffer. A similar but reverse technique is used to get the data out. Ada's array interoperability seems to be limited to character arrays [6], and by default the interning functions assume that the data is NUL-terminated the way a C string would be. So it's important to explicitly tell Interfaces.C.To_Ada and Interfaces.C.To_C to avoid NUL handling by passing Trim_Nul => False and Append_Nul => False respectively.
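The NUL hazard is not specific to Ada. As an analogy only (this is Python's ctypes, not the Interfaces.C code from the patch), the same buffer can be read as a C string, which stops at the first NUL, or by explicit length, which returns every byte:

```python
import ctypes

# Five bytes with a NUL in the middle, as raw hash output may well contain.
payload = b"ab\x00cd"
buf = ctypes.create_string_buffer(payload, len(payload))

as_c_string = buf.value   # C string view: cut at the first NUL
by_length = buf.raw       # whole buffer: uses the known size

print(as_c_string)
print(by_length)
```

Passing the size explicitly and turning off the NUL handling, as described above, corresponds to the second reading.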

Writing the necessary Ada interoperability code was frankly a pain in the ass, but I don't think it was Ada's fault at any point. Every single change at my hands would produce a boundary check warning or an explicit exception, but once the issue was resolved resultant code would ultimately make sense, as opposed to being an arbitrary hack the way it is sometimes done in other languages. The result seems to work, and even produce testable results, but I'm not sure of its overall reliability. I expect there to be bugs.

Keccak patch gets its own subtree in the vtools graph. As I mentioned I'm applying fixes on top of genesis, but vdiff's code is also dependent on a creatively named "keccak" vpatch, which is smg_keccak.ads and smg_keccak.adb taken verbatim from eucrypt project. This way while reading vdiff_keccak it would be obvious what changes I've introduced. Still lacking rename and other similar operations in v machinery, I copied the smg files into the vtools hierarchy, otherwise they are identical to their source.

make can still be used to produce a vdiff file, but it's a thin wrapper on top of a gprbuild project. The resultant vdiff works as expected, except it produces a keccak hash [7].

  1. anything before 4.7
  2. go figure, it's not enough to just ask for a standard, you also need to declare yourself a pedant
  3. One possible exercise would be to attempt to compile vtools on a minimalist compiler, like PCC, and see how many of these GCC-isms will fall out
  4. POSIX again says that a text file is "A file that contains characters organized into zero or more lines."
  5. % echo -n test > foo
    % hexdump foo
    0000000 6574 7473
    0000004
    % shasum -a 512 foo
    ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff  foo
    % vdiff bar foo|tee p1
    vdiff: foo: No newline at end of file
    
    --- bar false
    +++ foo ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff
    @@ -0,0 +1 @@
    +test
    % patch<p1
    % hexdump bar
    0000000 6574 7473 000a
    0000005
    % shasum -a 512 bar
    0e3e75234abc68f4378a86b3f4b32a198ba301845b0cd6e50106e874345700cc6663a86c1ea125dc5e92be17c98f9a0f85ca9d5f595db2012f7cc3571945c123  bar

    Correct behavior

    % ./vdiff qux foo|tee p2
    --- qux false
    +++ foo ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff
    @@ -0,0 +1 @@
    +test
    \ No newline at end of file
    % patch < p2
    % shasum -a 512 qux
    ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff  qux


  6. corrections are welcome
  7. Apparently keccak allows arbitrary sized hashes, but the one in vdiff is set to 64 so it matches SHA output. Another special constant is the bitrate that's set to eucrypt's 1344. The implications of both elude me, but I'm sure necessary adjustments will become obvious with reflection and use.

vtools genesis

Posted on 2018-02-21 18:00 vtools

Original vdiff as written by asciilifeform is a tiny awk script that calls out to standard [1] sha512sum and diff. It was a great illustration of the concept, but inadvertently suffered from in-band issues and poor portability across the unixes in use by the republic. This post introduces the vtools project that is supposed to address some of these shortcomings as well as deliver some long expected features.

First release consists of two patches, one is the genesis, a stripped down version of diff [2], the other is SHA-512 code bolted to the differ. I've decided to keep the two parts separate, since the next release will be explicitly about replacing SHA-512 with Keccak, and because the coupling between two parts might be educational for people who might want to hack on this differ themselves. Lacking a republican SHA, I'm using one I've lifted from Busybox.

The result is feature equivalent with the current vdiff, and should produce equivalent patches on the same code base. [3]

I took diff from GNU diffutils 3.6 and stripped it down to the parts absolutely necessary for the functioning of vdiff. Specifically, awk vdiff passes the -rNu flags, which make the operation recursive, produce diffs for missing files, and generate output in unified format. The diff codebase is split between lib and src. The former includes copy-pasted code that's shared between GNU projects. There's a lot of redundancy there: diffutils carries an entire compatibility substrate with it, and a lot of that code I could eliminate at the expense of "portability". It's unclear to me how much the result has suffered, since the functionality in the lib folder lacks internal consistency. Code that theoretically could run on DOS shares space with code that has hard unix assumptions baked in. The other directory, src, is where diff proper lives. The distinction is arbitrary, but I've kept it for now, because it is aiding me in the exploration of the code.

The project has a top-level Makefile, which will from now on build all the different tools. Functionality ought to be self-evident: press the tree, call make at top level, and you get a vdiff executable. Vdiff takes some extra arguments left over from diff; I pruned them down to only the ones that are still at all relevant, but their availability shouldn't be relied on.

Testing, comments and hate are welcome.[4]

Somewhat meaningless statistics

It takes about 9.5s to generate the entire set of patches from the current trb stable using awk vdiff, whereas this implementation takes 1.2s on my machine. For reference, it takes about 0.2s to simply diff those files (producing broken vpatches). To test this, I generated all the intermediate presses (test1.tbz2) for trb from genesis to makefiles, and then diffed all those presses against each other.[5]
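
One reading of "diffed all those presses against each other" is a run over every ordered pair of press directories. The sketch below follows that reading; diff_all_pairs is a hypothetical helper, not the run.bash shipped in test1.tbz2, and the directory names are whatever you pass in.

```shell
#!/bin/sh
# Sketch only: diff_all_pairs is a hypothetical helper, not the published run.bash.
# Runs the given differ on every ordered pair of distinct directories,
# discards the patches, and prints how many runs were made.
diff_all_pairs() {
    differ=$1
    shift
    runs=0
    for old in "$@"; do
        for new in "$@"; do
            [ "$old" = "$new" ] && continue
            # A differ exits non-zero when the trees differ; that is expected here.
            "$differ" "$old" "$new" > /dev/null 2>&1 || true
            runs=$((runs + 1))
        done
    done
    echo "$runs"
}
```

To compare implementations, wrap a call such as diff_all_pairs ./vdiff followed by the press directories in your shell's timing facility, for example the time keyword in bash.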

cloc on diffutils-3.6

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                              338           9412          12349          49897
Bourne Shell                    96           9135           6533          40712
PO File                         32           9082          13029          29382
C/C++ Header                   166           4272           7421          22080
m4                             197           1283           1387          20227
TeX                              1            812           3694           7175
make                            13           1760           1469           3875
Perl                             1            103            117            451
sed                              2              0              0             16
-------------------------------------------------------------------------------
SUM:                           846          35859          45999         173815
-------------------------------------------------------------------------------

cloc on the fresh press of vdiff

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               15            784            951           2740
C/C++ Header                    12            280            351            609
make                             1              5              0             34
-------------------------------------------------------------------------------
SUM:                            28           1069           1302           3383
-------------------------------------------------------------------------------

Lines of code is a somewhat meaningless metric in this case, since vdiff is not a replacement for diff proper. Perhaps a more relevant comparison would've been against a vdiff written from scratch, but lacking that we can marvel at the great savings.

  1. no such thing in unix world
  2. yes, I removed autotools
  3. This is of course not true, because there's always something. When walking the trees for comparison, diff sorts file names according to some collation rules. Getting rid of internationalization means that collation rules can no longer be switched by the operator; instead the C differ uses the standard library's strcmp to sort names the same for everyone. Well, it turns out that en_US.UTF-8 places dot before dash, while the C locale, whose ordering strcmp matches, places it after. This means that, for example, mpi, which contains files named mpi.h and mpi-internal.h, will have its files in a different order when produced by C vdiff as opposed to awk vdiff. This might've been prevented if the latter had LC_ALL=C set, but as it stands most of the extant vpatches have been produced with whatever the system locale was.

    This can be easily demonstrated,

    # mkdir a b
    # echo foo > b/mpi.h
    # echo foo > b/mpi-internal.h
    # LC_ALL=C diff -ruN a b | grep '^diff'
    diff -ruN a/mpi-internal.h b/mpi-internal.h
    diff -ruN a/mpi.h b/mpi.h
    # LC_ALL=en_US.UTF-8 diff -ruN a b | grep '^diff'
    diff -ruN a/mpi.h b/mpi.h
    diff -ruN a/mpi-internal.h b/mpi-internal.h

    Now would be a good time to introduce standard republican alphabetic order.

    As with internationalization, there are potentially some other changes that are the result of the cut, but these should be considered bugs. There is one explicit change, though, which is related to the diagnostic output. What a differ does when it encounters files that it can't produce a diff for is standardized by POSIX. It is expected to produce messages of the format "File foo is a directory while file bar is a regular file", etc., and output them in band. In the case of diff this is perhaps useful behavior, but the vpatch format doesn't recognize this sort of message, so the patch author has to remove them by hand. These messages are notoriously hard to spot, and during testing I found a leftover "Binary files a/logbot/logbot.fasl and b/logbot/logbot.fasl differ" in the published logbot's vpatch. So all the diagnostic messages now go to stderr. Vdiff's output should be a standard format vpatch.

  4. This theme is new, and stolen from elsewhere; the comment section specifically is visually quirky. Bear with me while I figure it out.
  5. The same dataset can be used to test a vdiff'er on your system,

    wget http://btcbase.org/data/vtools/test1.tbz2
    tar xjf test1.tbz2
    cd test1
    bash run.bash <path to vdiff>
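
Footnote 3 mentions in-band diagnostics that slipped into a published vpatch. For patches made with an older differ, a check along the following lines might catch them. It is only a sketch: find_inband_diagnostics is a hypothetical helper, and the patterns cover just the two message shapes quoted in that footnote. Since every legitimate content line in a unified diff starts with a space, "+", "-" or a backslash, a line that starts with "Binary files" or "File" cannot be patch content.

```shell
#!/bin/sh
# Sketch only: find_inband_diagnostics is a hypothetical helper, and the
# patterns cover only the two diff messages quoted in footnote 3.
# Prints offending lines prefixed with their line numbers.
# Exit status is 0 when at least one such line is found, 1 when none are.
find_inband_diagnostics() {
    grep -nE '^(Binary files .* and .* differ|File .* is a .* while file .* is a .*)$' "$1"
}
```

Calling find_inband_diagnostics with the path of a vpatch prints nothing when the patch is clean.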

vdiff fixes

Posted on 2018-01-22 11:23 technical, vtools

original vdiff posted by asciilifeform suffers from a bug warned against in the man page,

# NOTE: If using a pipe, co-process, or socket to getline, or from print
# or printf within a loop, you must use close() to create new instances
# of the command or socket. AWK does not automatically close pipes,
# sockets, or co-processes when they return EOF.

sha512sum pipe is opened as many times as there are files in the diffed folders, so at some point, on a large folder, it's going to hit a "too many files open" system exception. the solution is of course to ensure that sha512sum pipe is closed after each execution:

#!/bin/sh
diff -uNr $1 $2 | awk 'm = /^(---|\+\+\+)/{cmd="sha512sum \"" $2 "\" 2>/dev/null ";s=cmd| getline x; if (s) { split(x, a, " "); o = a[1]; } else {o = "false";} close(cmd); print $1 " " $2 " " o} !m { print $0 }'

we can verify that the fix results in correct behavior by running the following,

mkdir -p a b; echo foo|tee b/{0..100}
ulimit -n 15
vdiff a b

(removing close(cmd) from above will simulate original behavior, and on my systems results in gawk: cmd. line:1: (FILENAME=- FNR=28) fatal: cannot open pipe `sha512sum "b/12" 2>/dev/null ' (Too many open files))
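
For readability, the same fix can be spelled out over several lines. This is a sketch of the one-liner above rather than a replacement for it; vdiff_sketch and the awk variable names are made up for illustration, and it inherits the original's limitations (paths with spaces, content lines that happen to start with "---").

```shell
#!/bin/sh
# Sketch only: vdiff_sketch is a hypothetical name for an expanded form of
# the fixed one-liner. It assumes diff, awk and sha512sum are on PATH.
vdiff_sketch() {
    diff -uNr "$1" "$2" | awk '
    /^(---|\+\+\+)/ {
        cmd = "sha512sum \"" $2 "\" 2>/dev/null"
        hash = "false"                # used when the file does not exist
        if ((cmd | getline line) > 0) {
            split(line, fields, " ")
            hash = fields[1]          # first field is the hex digest
        }
        close(cmd)                    # release the pipe before the next file
        print $1 " " $2 " " hash
        next
    }
    { print }
    '
}
```

The close(cmd) call is the whole fix: awk keys open pipes by their command string, so without it every distinct file name leaves one more descriptor open until the limit is hit.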