SORTIE Collaborative Demo Report

At Waterloo and Queen's, the CPPX and PBS tools participated in the SORTIE collaborative demo. This report was written 2001 Sep 19 by Andrew Malton.

Section 1. Introduction

The Software Bookshelf is a web-based paradigm for the presentation and navigation of information representing large software systems. The Portable Bookshelf (PBS) is one implementation of this concept. The PBS Toolkit is our set of tools for the generation of a PBS Bookshelf.

CPPX is a free, open source, general purpose parser and fact extractor for C++. It relies on the preprocessing, parsing, and semantic analysis of GNU g++, and produces a graph according to the Datrix fact model, in GXL, TA, or VCG format, suitable for use in architecture recovery, data flow analysis, pointer analysis, program slicing, query techniques, source code visualization, object recovery, restructuring, refactoring, remodularization, and the like. When invoked with a source file, CPPX first passes the file to g++, which compiles it with a "compile-only" option. There are hooks within the GCC architecture for extracting debugging information, and upon them we have hung code to dump GCC's internal semantic graph to a (binary) format convenient for CPPX's purposes. The schema (rules and regulations) of GCC's graph is suitable for normal compilation (viz. for execution) but not for design recovery. So after the dump, CPPX applies a collection of tiny "systolic" transformation steps to make the graph conform to a schema more suitable for design analysis. The final result graph is emitted in GXL (or TA, if you like). The target schema is that of Bell Canada's Datrix project.
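For readers who have not seen GXL, the following is a minimal sketch of the kind of document a fact extractor might emit: typed nodes joined by typed edges. It is not CPPX output. The type names "Function" and "Calls", the node ids, and the graph id are illustrative assumptions and are not taken from the Datrix schema.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

def tiny_gxl(nodes, edges):
    """Build a GXL document from (id, type) nodes and (src, dst, type) edges."""
    gxl = ET.Element("gxl", {"xmlns:xlink": XLINK})
    graph = ET.SubElement(gxl, "graph", {"id": "example"})
    for node_id, node_type in nodes:
        node = ET.SubElement(graph, "node", {"id": node_id})
        # Each element points at its schema type through an xlink reference.
        ET.SubElement(node, "type", {"xlink:href": "#" + node_type})
    for src, dst, edge_type in edges:
        edge = ET.SubElement(graph, "edge", {"from": src, "to": dst})
        ET.SubElement(edge, "type", {"xlink:href": "#" + edge_type})
    return ET.tostring(gxl, encoding="unicode")

# Hypothetical facts: one function (n1) calls another (n2).
doc = tiny_gxl(
    nodes=[("n1", "Function"), ("n2", "Function")],
    edges=[("n1", "n2", "Calls")],
)
print(doc)
```

A real extraction would also carry attributes (names, source positions) on each node; they are omitted here to keep the shape of the format visible.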

The CPPX compiler was built between January and April 2001. It would be desirable and possible to add a linker, so that the GXL results of separate compilations can be combined into one graph, but we have not done it yet. It would also be a good thing to exercise CPPX more thoroughly, and remove some of its bugs. Interested researchers can look at the web site.

For the SORTIE demo, the following people did some work:

Section 2. Experience Report

Our goal for this project was to make a Software Bookshelf for SORTIE visualized in PBS based on data extracted by CPPX. In addition to participating in the Collaborative Demo itself, meeting our own goal would

In principle a Software Bookshelf can be based on any suitable and useful graph schema; but in practice it is built up from lower-level material extracted by fact extractors which are part of the PBS toolkit and expect data to conform to particular schemas which have been developed already. PBS includes software tools to analyse the lower-level material and produce a suitable high-level view. This cannot, in principle, be fully automated, as it relies on informal information about the subject software's structure. Sometimes this information is obtained from interviews with developers, and in general it need not be wholly extracted from the code.

The following steps would be required:

  1. Migrate the SORTIE code base to g++ from Borland.
  2. Extract a (Datrix) fact base from the migrated code base, using CPPX.
  3. Slice out from the Datrix fact base a factbase view suitable as the basis of a software bookshelf.
  4. Migrate the slice (by hand, or using grok or other tools) to the basic level of architecture which PBS already understands and can process.
  5. Use already-existing PBS tools (plus some knowledge of the code's structure obtained by inspection) to elevate the basic level to the software bookshelf level.
  6. Analyse and discuss possible improvements in the SORTIE architecture, based on the Bookshelf model.
  7. Make recommendations for architectural improvement.
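Steps 3 and 4 above amount to filtering a fact base and renaming what survives. The following is a minimal sketch of that idea, assuming facts are held as (relation, source, target) triples. The relation names on both sides of the mapping are hypothetical; they are not the actual Datrix or PBS vocabulary.

```python
# Hypothetical mapping from extractor relation names to the names a
# downstream tool expects. Neither side is taken from Datrix or PBS.
RELATION_MAP = {"Calls": "call", "References": "use"}

def slice_and_rename(facts, relation_map):
    """Keep only facts whose relation is mapped, renaming the relation."""
    out = []
    for rel, src, dst in facts:
        if rel in relation_map:
            out.append((relation_map[rel], src, dst))
    return out

# A toy fact base with one relation that the slice discards.
facts = [
    ("Calls", "main", "run"),
    ("Declares", "main.cpp", "main"),
    ("References", "run", "grid"),
]
basic = slice_and_rename(facts, RELATION_MAP)
for fact in basic:
    print(*fact)
```

In practice the slice would also have to decide which entities belong to the subject system at all, which is the harder part discussed later in this report.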

To date Malton has spent about 7 days learning Borland C++, learning the source code of SORTIE, and migrating it to g++. Svetinovic spent three or four days creating scripts for steps 4 and 5 above (using already-existing Datrix databases). Dean and Malton both spent several days making CPPX cooperate with the latest release (3.0) of GCC (the GNU Compiler Collection, which includes g++). We have spent no time on steps 5 through 7 because all our difficulties arose prior to having an architectural view.

The difficulties were:

Here follows some discussion of the difficulties:

Section 2.1 Source Incompleteness

CPPX operates by actually compiling the subject code, using a nonrobust parsing technique. The advantage of this is that the resulting low-level extracted facts are as correct as the underlying compiler; the disadvantage is of course that a buildable code base, or at least a syntax-analysable one, is required.

The given code base naturally depends on support software which is part of the Borland C++ Builder IDE. This includes:

all of which is accessed through header files which are not part of the SORTIE release. There are 45 missing directly-included header files. At first I hoped to reconstruct them by hand, or by reference to the (supplied) DFM files, or to use g++'s versions of standard things. But after a couple of days of nosing about I gave this up as hopeless: I would lose all the advantage of CPPX's low-level accuracy. I retained the hope of using standard includes from g++ until I realized just how variable "standard includes" are between systems!

So, we decided to buy Borland C++ Builder Standard Edition (BCB) and copy its include library (fair use!). The SORTIE system was revealed to depend on 207 include files. Some of these are essential "user-supplied" code which Borland stores in its own directory vcl/ instead of in the application source directory, and must be rebuilt before exporting the missing includes.

After preprocessing, the 31KLOC of the SORTIE code base became about 50MLOC, so the resulting code base for low-level analysis is actually quite large. I have not analysed the composition of the resulting code base very carefully, but it seems that about 40% of the inclusion was Windows support, 30% was VCL support, about 15% was STL stuff, and the rest (15%) was expansion of SORTIE code itself (including stuff embedded in VCL directories).

Section 2.2 Syntax and Semantic Incompatibilities

Neither Borland C++, which is the source language of SORTIE, nor GNU C++, which is the language whose compiler is the front end for CPPX, guarantees conformance to ANSI C++. Consequently, to use CPPX we must do a source migration. The migration was the one completed technical work of our SORTIE effort. At this time 58 of the 63 SORTIE source files have been syntax-analysed by g++. (Note, of course, that this doesn't mean I've ported SORTIE to open source!) The remaining problems are not significant, but it is a work in progress.

Between the two compilers there are preprocessor inconsistencies and syntax inconsistencies and semantic inconsistencies.

I handled preprocessor inconsistencies by trial-and-error, introducing new macros by hand, and in some cases introducing new stub redirectors. I have a list of these inconsistencies but it makes fairly boring reading.

I handled syntax inconsistencies by means of an editor script, since they are mostly (not all!) fixable locally, that is, without needing remote information such as would be obtained from a symbol table. There are about 16 to 20 things to look for. Typical things which other collaborators saw in the SORTIE code base proper are __fastcall and __published. But there are many other things in the missing include files. Because of the intermediate fixup step, compilation then required:

  1. apply the preprocessor (using -E)
  2. apply the editor script
  3. apply the compiler (using -c)
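The three steps above can be sketched as a small driver. This is an illustration only, not the editor script actually used: the two rewrite rules (dropping __fastcall and turning __published into public) are assumptions about what a local fix might look like, and they ignore whatever extra meaning Borland attaches to those keywords. The function names are hypothetical.

```python
import re
import subprocess

def fix_borland_syntax(text):
    """Apply local, symbol-table-free rewrites to preprocessed source."""
    # Assumed rule: a calling-convention keyword can simply be dropped.
    text = re.sub(r"\b__fastcall\b\s*", "", text)
    # Assumed rule: treat a __published section as an ordinary public one.
    text = re.sub(r"\b__published\b", "public", text)
    return text

def compile_with_fixup(source, obj, compiler="g++"):
    """Preprocess, fix up, then compile the fixed text to an object file."""
    # Step 1: apply the preprocessor only (-E) and capture its output.
    preprocessed = subprocess.run(
        [compiler, "-E", source],
        check=True, capture_output=True, text=True,
    ).stdout
    # Step 2: apply the fixups to the preprocessed text.
    fixed = fix_borland_syntax(preprocessed)
    # Step 3: compile only (-c), reading already-preprocessed C++ from stdin.
    subprocess.run(
        [compiler, "-x", "c++-cpp-output", "-c", "-o", obj, "-"],
        input=fixed, check=True, text=True,
    )

print(fix_borland_syntax("void __fastcall Run();"))
```

Running the fixups after preprocessing rather than before means that the included headers are repaired in the same pass as the application source.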

I handled semantic inconsistencies, on the other hand, by editing the code base. This was because the inconsistencies were rarer and much more specific than the syntax inconsistencies, and because when they occurred in header files I wanted to have the same replacement everywhere.

In an appendix there's a list of the inconsistencies, incompatibilities, and extensions which I discovered and had to deal with. Others still lurk.

Section 2.3 Large Code Bases

The code base as delivered has 29 023 nonblank lines of code. The code base after inclusion has 50 352 095 nonblank lines of code, 312 312 class definitions, 203 004 variable and data member definitions, and 1 876 560 method definitions. (These data are obtained from the g++ dumps). Of course probably 99% of these are irrelevant to the "real" SORTIE design, but at the low fact level at which CPPX is designed to work, all these entities are recovered, and then analysis is needed to separate the identifiers made visible by source inclusion from the identifiers which are "really" part of the SORTIE code base. We have not solved this problem.

(These numbers are really true, although I have difficulty believing them. I suspect that the best explanation for why there are so many classes and so many many methods is that OO architectural style and C++ overloading combine to generate vast numbers of little-used variations.)
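To put the figures of this section in proportion, the following short calculation derives a few ratios from them. It uses only the counts reported above; the variable names are ours.

```python
# Counts as reported in this section (from the g++ dumps).
delivered_lines = 29_023
expanded_lines = 50_352_095
classes = 312_312
data_defs = 203_004
methods = 1_876_560

# How much larger the code base becomes once all includes are expanded.
expansion = expanded_lines / delivered_lines
# Average number of method definitions per class definition.
methods_per_class = methods / classes
# Fraction of the expanded lines that were delivered as SORTIE source.
delivered_share = delivered_lines / expanded_lines

print(f"expansion factor: about {expansion:.0f}x")
print(f"methods per class: about {methods_per_class:.1f}")
print(f"delivered share of expanded lines: {delivered_share:.3%}")
```

The expansion is roughly 1,700-fold, with an average of about six method definitions per class definition.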

Section 2.4 Architectural Style

SORTIE is a Borland C++ Builder application, although there is evidence that it used to be in plain old C. This means that it is (now) code hung on the hooks of a vertical framework, and (as is typical of builder-managed code bases) there is lots of auto-generated code and lots of user-supplied design squirreled away into binary files, obscure directories, and hidden behind toolbar panels.

Furthermore, the basic structure of the framework is event-driven, reactive, and graphical. The builder model doesn't attempt to separate the computational aspects from the user-interface ones. In fact, it can be argued that SORTIE, and systems built with builders generally, are not "source based" at all. That is, the design medium is not source code. Another way to say this is: we don't have the source code, the maintenance artifacts, of SORTIE. What we have is the Builder's internal "machine language", which happens to be a dialect of C++. Design recovery from such source-encoded implementations is inherently more difficult than design recovery from real source code, or at least requires careful choice of assumptions, because

Of course the "real" application code ought to exhibit such good characteristics, but the point is that by looking at the "source" it is difficult or impossible to distinguish the real maintenance artifacts within it.

Section 3. Collaboration Partners

We considered making use of Tim Lethbridge's extracted DMM model of the SORTIE code and visualizing it using PBS. However, after consideration of the goals of the Collaboration as a whole and our own as well, we decided to put our efforts into using CPPX and dealing with the migration issues. Since some of those issues have an impact at the middle-model level (e.g. the use of __property and __closure), we knew that without an accurate low-level parse, even the middle-level facts wouldn't be perfectly reliable.

Section 4. Solution to tasks

The only conclusion we have drawn from this partial analysis is that the SORTIE application seems to participate in the vertical framework architecture of BCB to a sufficient extent that we could not understand it at the "source" level in the time we allocated to the project. This suggests a need for frameworks-oriented design recovery and analysis, but doesn't say much about the architecture of SORTIE.

Section 5. Appendix - Summary of Migration issues

Those written in bold face appeared in the SORTIE code base itself. The others appeared in the missing include files. This is not a complete list; it's just all I know at the moment.

Section 5.1 Preprocessor issues

Section 5.2 Syntax issues

Section 5.3 Semantic issues