BARKS IN THE WIND: vtools vpatch

Posted on 2018-03-22 15:00 technical, vtools

I just wrapped up busy three weeks worth of a family trip and two back to back conferences. I completely forgot how exhausting conferences are, and how little time they leave for anything else. There's a short backlog of posts, that I'm going to publish once I'm back home, but now that I had a chance to recover a bit, I'm going to release what I managed to work on during my travels.

I present for your consideration a proof of concept release of a vpatcher, that can press keccak patches. This implementation was modeled on vpatch parser/press that's used in btcbase to render patches and more importantly to produce an in-memory press. While the general architecture was chosen upfront, this implementation was authored incrementally, a development style that is surprisingly well supported by Ada1, so the bulk of functionality is contained in a single file. At this point I'm not convinced that some kind of split is required, though splitting it in the style of ascii's FFI might bring some clarity.

Vpatch is essentially standard unix patch tool, that supports strict application of a unified diff and verification of V hashes. It takes a patch in a standard input stream and attempts to press whatever content into the current directory. Existing files are patched, new files are created, old files are removed. Processing is done one file at a time, and the operation terminates with an error when an issue is encountered. Patcher keeps a temporary file around for the result output, which gets moved in place once the file's hash has been verified. This means that atomicity is preserved at a file level, but not at the patch level and failed press results in an inconsistent state of the directory. Individual files are always either in the previous or new state, which means that you get to inspect the offending file, but you have to fully redo the press on failure. This is a decision that I might have to reconsider, at the expense of increased complexity. Right now very little is kept in memory: information about the current file being patched, the current hunk and whatever simple data used to track the state.

To build the patcher use make vpatch, or call grpbuild on vpatch.gpr directly. To test the patcher, I reground trb stable release. Since the patcher doesn't verify sigs, I haven't signed the regrind, and it's provided for testing purposes only.

Press the genesis,

% vpatch < ps/1.vpatch
creating bitcoin/.gitignore
creating bitcoin/COPYING
...
creating bitcoin/src/wallet.cpp
creating bitcoin/src/wallet.h

Apply the next patch on top of it,

% vpatch < ps/2.vpatch
patching bitcoin/src/bitcoinrpc.cpp
patching bitcoin/src/db.cpp
patching bitcoin/src/headers.h
patching bitcoin/src/init.cpp
deleting bitcoin/src/qtui.h
patching bitcoin/src/util.h
patching bitcoin/src/wallet.cpp

If we now try to rerun genesis in the same folder we get,

% vpatch < ps/1.vpatch
creating bitcoin/.gitignore

raised VPATCH.STATE : attempt to create a file, but file already exists

Likewise attempt to reapply second patch results in failure, since whatever files have invalid hash2

% vpatch < ps/2.vpatch
patching bitcoin/src/bitcoinrpc.cpp

raised VPATCH.STATE : from hash doesn't match

Same will happen if we attempt to apply a significantly later patch, since the necessary intermediate patches are missing,

% vpatch < ps/12.vpatch
patching bitcoin/src/main.cpp

raised VPATCH.STATE : from hash doesn't match

Finally applying the correct patch succeeds,

% vpatch < ps/3.vpatch
patching bitcoin/src/db.cpp
patching bitcoin/src/init.cpp
patching bitcoin/src/main.cpp
patching bitcoin/src/main.h
patching bitcoin/src/makefile.linux-mingw
patching bitcoin/src/makefile.unix
patching bitcoin/src/net.cpp

Supporting pipe streaming means that we can start vpatch and incrementally feed it patches, moving the tree towards the press top. (In this case patches that we're pressing are named in order from 1 to 27.)

% cat ps/{1..27}.vpatch | vpatch
creating bitcoin/.gitignore
creating bitcoin/COPYING
...
creating bitcoin/deps/Makefile
creating bitcoin/deps/Manifest.sha512
patching bitcoin/src/db.cpp
patching bitcoin/src/init.cpp
patching bitcoin/src/main.cpp
creating bitcoin/verify.mk

The press can be sanity checked using e.g. checksum file from btcbase, but obviously the tool itself does both input and output hash verification as it goes.

There are some known issues, the biggest one is that \ No newline at end of file doesn't work yet, and the patcher fails when it encounters the dirrective. Half way through development I discovered that Text_IO is idiosyncratic: there's no machinery to produce a line without a newline at the end, or to figure out whether or not existing file has one.3 Ada always outputs a valid text file and there's no way to avoid it with Text_IO. Future developers beware! A solution to this problem is to use Sequential_IO specialized on Character, but that means writing own high level procedures like Get_Line. In the work in progress modification that uses Sequential_IO I was able to build a drop in replacement for Text_IO with minimum of changes to existing code, by gradually adding the missing functionality.

To understand the way this patcher works it's helpful to have some idea about the diff format, that we're using. There's three data structures that I use to keep track of patch data. The header,

   type Header (From_L, To_L: Natural) Is record
      From_Hash: Hash;
      From_File: String(1..From_L);
      To_Hash: Hash;
      To_File: String(1..To_L);
   end record;

which holds the source and the destination file information and corresponds to

diff -uNr a/bitcoin/.gitignore b/bitcoin/.gitignore
--- a/bitcoin/.gitignore false
+++ b/bitcoin/.gitignore 6654c7489c311585d7d3...

lines.

A hash is a variant record type that can either hold a specific value or when there's no hash, indicated by false label, it's explicitly marked empty, which happens when the file is either created or removed, with an empty from and to respectively.

   type Hash_Type is (Empty, Value);
   type Hash(The_Type: Hash_Type := Empty) is record
      case The_Type is
         when Value =>
            Value: String(1..Hash_Length);
         when Empty =>
            null;
      end case;
   end record;

We distinguish between three possible file operations,

   type Patch_Op is (Op_Create, Op_Delete, Op_Patch);

which depend on the presence of hashes. If we only have input hash, that means that the file has been deleted, likewise only output hash the file is being newly created. Both hashes indicate that the file is being patched.

Each header is followed by one or more "hunks", a line count prelude followed by the line related commands,

   type Line_Numbers is record
      Start: Natural;
      Count: Natural;
   end record;

   type Hunk is record
      From_File_Line_Numbers: Line_Numbers;
      To_File_Line_Numbers: Line_Numbers;
   end record;

A hunk holds the actual change details that is a sequence of optional context lines which we use for sanity checking ("patch claims line foo does input file actually have foo in that line"), followed by some number of additions or deletions, followed by optional context lines. The line commands are not actually stored in memory, instead they are processed as they are encountered. A typical hunk looks like this,

@@ -435,7 +432,7 @@
     {
         BOOST_FOREACH(string strAddr, mapMultiArgs["-addnode"])
         {
-            CAddress addr(strAddr, fAllowDNS);
+            CAddress addr(strAddr);
             addr.nTime = 0; // so it won't relay unless successfully connected
             if (addr.IsValid())
                 AddAddress(addr);

Only the line in bold is kept in memory. Each record has a corresponding Get procedure which reads the record from input stream. This way you can say, e.g. Get(A_Hunk) and that'll read the @@ -435,7 +432,7 @@ line from input stream.

The parser is naive in that it looks at the stream one character at a time, and dispatches to various handlers, rarelly needing to read the whole of something, before making a decision. This is a traditional lisp parsing technique, which is also well supported in Ada. The bulk of work happens inside an unreasonably large Process_Hunks_For_Header procedure. I will eventually attempt to refactor it, but right now it does all the pre and post checks, parses the hunk body and performs relevant modifications. It relies on record Gets to parse the input patch. The are two loops in the body, Hunk_Loop, which handles each hunk under the header, and Hunk_Body_Loop, which actually handles individual line changes within a hunk. The core of hunk body loop is a dispatch on the first character in the line,

            exit Hunk_Body_Loop when From_Count = 0 and To_Count = 0;
            Look_Ahead(C, EOL);
            if EOL then
               raise Parse with "blank line in hunk";
            end if;
            case C is
               when '+' => -- line added
...
               when '-' => -- line deleted
...
               when ' ' => -- line stays the same
...
               when others =>
                  raise Parse with "unexpected character "
...
            end case;

Attentive reader will note that we exit the loop based exclusively on the expected line count, which is the information communicated in the hunk @@ prelude.

There are some obvious improvements for future releases, the aformentioned newline at end of file issue. I'd also like to port this implementation to SHA 512 branch, to allow testing in the current workflow. The SHA port particularly will let me test Ada to C interoperability. Going back to the Wednesday schedule, I will address one of these in the next release.

  1. I'm more and more impressed with Ada as a language, unfortunately after extensive use I've ran into various issues with core library, which generally left a poor impression. A lot of the decision very much leave the impression of http://btcbase.org/log/2018-03-08#1787319 []
  2. Vpatch is a streaming tool, and none of the files are read twice. So hashing is happening online, and we hash AND attempt to patch in parellel. If patching attempt fails (because patch information doesn't match file's contents), we complete hashing and report either of the errors. Hashing has higher priority, but if the hash is valid we'll report patching error instead. []
  3. End_Of_File also has an odd behavior in that it will report True, if there's an end of file OR if there's a newline followed by end of file. This means that if you're using traditional "while not eof" loop, you're going to lose last newline. []

One way to get rid of the dependence on the Ada standard console I/O (incl. Text_IO) is illustrated in FFA -- from Ch. 4 and onward.

Posted 2018-03-2420:48 by Stanislav Datskovskiy

Ty, I studied it before I started working on vpatch. Getting rid of Ada's standard IO functions is the ultimate goal, considering how uneven the standard is, but there are two reasons why I haven't done it upfront.

One is that Text_IO comes with a bunch of reader functions, that I didn't want to spend time rewriting, e.g. @@ reader does Get(A_Line_Numbers.Start); to grab a textual integer from input stream, elsewhere I use Get(V(1..No_Hash_Label'Length)); to read a length of string, etc.

The second reason is that Ada specifications for IO are decent and also provide the naming/calling conventions that I see no reason to deviate from, e.g. Get(Thing)/Get(File,Thing) to read Thing from Standard_Input/File, End_Of_File, Create, etc. Right now file reading uses Text_IO, I'm going to replace it with code that uses Sequential_IO specialized on Character. This later approach is functionally equivalent to your GetChar/PutChar approach, but at the same time it's a subset of functionality provided by Text_IO, so I can drop in new Sequential_IO based functionality, and only rewrite missing procedures to the interface expected by current Text_IO based code. Presumably I can then further refine this approach by replacing Sequential_IO specialized on Character by code that uses GetChar/PutChar but conforms to same interfaces.

Posted 2018-03-2515:54 by phf

Post a comment

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>