BARKS IN THE WIND: vtools genesis

Posted on 2018-02-21 18:00 vtools

The original vdiff, as written by asciilifeform, is a tiny awk script that calls out to standard[1] sha512sum and diff. It was a great illustration of the concept, but it inadvertently suffered from in-band issues and from poor portability across the unixes in use by the republic. This post introduces the vtools project, which is supposed to address some of these shortcomings as well as deliver some long-expected features.

The first release consists of two patches: one is the genesis, a stripped-down version of diff[2]; the other is SHA-512 code bolted to the differ. I've decided to keep the two parts separate, since the next release will be explicitly about replacing SHA-512 with Keccak, and because the coupling between the two parts might be educational for people who want to hack on this differ themselves. Lacking a republican SHA, I'm using one I've lifted from Busybox.
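
As a rough illustration of what bolting a hash onto a differ amounts to, here is a minimal Python sketch. It is not the vtools code. It assumes, as the awk vdiff does to my knowledge, that the hex SHA-512 of each side follows the path on the `---` and `+++` lines and that a missing file is marked `false`; the helper names `file_hash` and `vpatch_header` are made up for this example.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """Return the hex SHA-512 of a file, or "false" if it does not exist."""
    if not os.path.isfile(path):
        return "false"  # assumed marker for a missing side, not taken from this post
    with open(path, "rb") as f:
        return hashlib.sha512(f.read()).hexdigest()


def vpatch_header(old_path, new_path):
    """Build the two header lines for one file, with a hash after each path."""
    return [
        "--- %s %s" % (old_path, file_hash(old_path)),
        "+++ %s %s" % (new_path, file_hash(new_path)),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        new = os.path.join(tmp, "b.txt")
        with open(new, "w") as f:
            f.write("foo\n")
        # a.txt is never created, so its side of the header reads "false".
        for line in vpatch_header(os.path.join(tmp, "a.txt"), new):
            print(line)
```

Keeping the hashing in its own function is what makes a later swap of SHA-512 for another hash a local change.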

The result is feature-equivalent with the current vdiff, and should produce equivalent patches on the same code base.[3]

I took diff from GNU diffutils 3.6 and stripped it down to the parts absolutely necessary for the functioning of vdiff. Specifically, awk vdiff passes the -rNu flags, which make the operation recursive, produce diffs for missing files (treating them as empty), and generate output in unified format. The diff codebase is split between lib and src. The former includes copy-pasted code that's shared between GNU projects. There's a lot of redundancy there: diffutils carries an entire compatibility substrate with it, and a lot of that code I could eliminate at the expense of "portability". It's unclear to me how much the result has suffered, since the functionality in the lib folder lacks internal consistency: code that theoretically could run on DOS shares space with code that has hard unix assumptions baked in. The other directory, src, is where diff proper lives. The distinction is arbitrary, but I've kept it for now because it is aiding me in the exploration of the code.
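
For readers who have not looked at what the three flags do together, the following Python sketch mimics them with the standard library. It is only an approximation under stated assumptions: `tree_diff` and `read_lines` are names invented here, files are read as text, and the case where a name is a directory on one side and a regular file on the other is not handled.

```python
import difflib
import os


def read_lines(path):
    """Read a file as a list of lines; a missing file counts as empty (-N)."""
    if not os.path.isfile(path):
        return []
    with open(path, "r") as f:
        return f.readlines()


def tree_diff(a, b, rel=""):
    """Yield unified-diff lines (-u) for two trees, walked recursively (-r)."""
    names = set()
    for root in (a, b):
        directory = os.path.join(root, rel)
        if os.path.isdir(directory):
            names.update(os.listdir(directory))
    # sorted() compares code points, which for ASCII names agrees with strcmp.
    for name in sorted(names):
        sub = os.path.join(rel, name)
        pa, pb = os.path.join(a, sub), os.path.join(b, sub)
        if os.path.isdir(pa) or os.path.isdir(pb):
            # Descend; a file-versus-directory mismatch is silently skipped here.
            yield from tree_diff(a, b, sub)
        else:
            # unified_diff yields nothing when both sides are identical.
            yield from difflib.unified_diff(read_lines(pa), read_lines(pb), pa, pb)
```

Calling `"".join(tree_diff("a", "b"))` returns the combined patch text for two directories named `a` and `b`.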

The project has a top-level Makefile, which will from now on build all the different tools. Functionality ought to be self-evident: press the tree, call make at the top level, and you get a vdiff executable. Vdiff takes some extra arguments left over from diff. I pruned them down to only the ones that are still at all relevant, but their availability shouldn't be relied on.

Testing, comments and hate are welcome.[4]

Somewhat meaningless statistics

It takes about 9.5s to generate the entire set of patches from the current trb stable using awk vdiff; this implementation takes 1.2s on my machine. For reference, it takes about 0.2s to simply diff those files (producing broken vpatches). To test this, I generated all the intermediate presses (test1.tbz2) for trb from genesis to makefiles, and then diffed all those presses against each other.[5]

cloc on diffutils-3.6

Language                     files          blank        comment           code
C                              338           9412          12349          49897
Bourne Shell                    96           9135           6533          40712
PO File                         32           9082          13029          29382
C/C++ Header                   166           4272           7421          22080
m4                             197           1283           1387          20227
TeX                              1            812           3694           7175
make                            13           1760           1469           3875
Perl                             1            103            117            451
sed                              2              0              0             16
SUM:                           846          35859          45999         173815

cloc on the fresh press of vdiff

Language                     files          blank        comment           code
C                               15            784            951           2740
C/C++ Header                    12            280            351            609
make                             1              5              0             34
SUM:                            28           1069           1302           3383

Lines of code is a somewhat meaningless metric in this case, since vdiff is not a replacement for diff proper. Perhaps a more relevant comparison would've been against a vdiff written from scratch, but lacking that we can marvel at the great savings.

  1. no such thing in unix world
  2. yes, I removed autotools
  3. This is of course not true, because there's always something. When walking the trees for comparison, diff sorts file names according to some collation rules. Getting rid of internationalization means that collation rules can no longer be switched by the operator; instead the C differ uses the standard library's strcmp to sort names the same for everyone. Well, it turns out that en_US.UTF-8 places dot before dash, while the C locale, which is what strcmp uses, places it after. This means that, for example, mpi, which contains files named mpi.h and mpi-internal.h, will have its files in a different order when produced by C vdiff as opposed to awk vdiff. This might've been prevented if the latter had LC_ALL=C set, but as it stands most of the extant vpatches have been produced with whatever the system locale was.

    This can be easily demonstrated,

    # mkdir a b
    # echo foo > b/mpi.h
    # echo foo > b/mpi-internal.h
    # LC_ALL=C diff -ruN a b | grep '^diff'
    diff -ruN a/mpi-internal.h b/mpi-internal.h
    diff -ruN a/mpi.h b/mpi.h
    # LC_ALL=en_US.UTF-8 diff -ruN a b | grep '^diff'
    diff -ruN a/mpi.h b/mpi.h
    diff -ruN a/mpi-internal.h b/mpi-internal.h

    Now would be a good time to introduce standard republican alphabetic order.

    As with the internationalization, there are potentially some other changes that are the result of the cut, but these should be considered bugs. There is one explicit change, though, which is related to the diagnostic output. What a differ does when it encounters files that it can't produce a diff for is standardized by POSIX: it is expected to produce messages of the format "File foo is a directory while file bar is a regular file", etc., and to output them in band. In the case of diff this is perhaps useful behavior, but the vpatch format doesn't recognize these sorts of messages, so the patch author has to remove them by hand. These messages are notoriously hard to spot, and during testing I found a leftover "Binary files a/logbot/logbot.fasl and b/logbot/logbot.fasl differ" in the published logbot vpatch. So all the diagnostic messages now go to stderr, and vdiff's output should be a standard-format vpatch.

  4. This theme is new, and stolen from elsewhere; the comment section specifically is visually quirky. Bear with me while I figure it out.
  5. The same dataset can be used to test a vdiff'er on your system,

    tar xjf test1.tbz2
    cd test1
    bash run.bash <path to vdiff>
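
The ordering difference from footnote 3 can also be checked without diff. The Python sketch below sorts the same two names by code point, which for ASCII names is what strcmp does, and then by the collation rules of a named locale. The function names are invented for this example, and the locale-aware result depends on the libc and on which locales are installed, so no output is claimed for it.

```python
import functools
import locale


def codepoint_order(names):
    """Sort by code point; '-' (0x2d) comes before '.' (0x2e)."""
    return sorted(names)


def locale_order(names, loc):
    """Sort with strcoll under the given locale, or return None if unavailable."""
    old = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, loc)
        return sorted(names, key=functools.cmp_to_key(locale.strcoll))
    except locale.Error:
        return None  # the requested locale is not installed on this system
    finally:
        locale.setlocale(locale.LC_COLLATE, old)


if __name__ == "__main__":
    names = ["mpi.h", "mpi-internal.h"]
    print(codepoint_order(names))  # ['mpi-internal.h', 'mpi.h']
    print(locale_order(names, "en_US.UTF-8"))
```

Pinning the comparison to code points, as the C differ does with strcmp, makes the order independent of the operator's environment.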

[...] a choice of vdiff tools to use, since phf conveniently just published the first part of his work on vtools. I can therefore happily report that his patches press fine and his resulting vdiff worked on this [...]

Posted 2018-08-07 09:02 by EuCrypt Chapter 11: Serpent « Ossasepia

[...] and Ada programming languages use "--" as comment marker. This was part of the motivation behind vtools, which took the approach of avoiding the system's existing "diff" program in favor of a [...]

Posted 2020-03-31 14:02 by Adventures in the forest of V « Fixpoint
