REPORT Holger Kienle and Xiaomin Wu 1. Introduction 1.1. Tool Rigi is an interactive, visual tool designed to help developers better understand and redocument their software. Rigi includes parsers to read the source code of the subject software and produce a graph of extracted artifacts such as procedures, variables, calls, and data accesses. To manage the complexity of the graph, an editor allows the software engineer to automatically or manually collapse related artifacts into subsystems. These subsystems typically represent concepts such as abstract data types or personnel assignments. The created hierarchy can be navigated, analyzed, and presented using various automatic or user-guided graphical layouts. The discovered structural information is useful for making informed development and management decisions. The information serves as documentation that is up-to-date and accurate because it is derived from the actual source code. Thus, Rigi helps to understand legacy software systems where the existing documentation may be missing or lacking. Rigi aids reengineering tasks that need to discover design information in existing software. Rigi also includes a tool to view the subject software's source code using a web browser. The tool produces a hypertext version of the source code that includes hypertext links for related source artifacts (for example, each function call has a link to the definition of the function that was called). 1.2. Team The team consists of two graduate students (Holger Kienle and Xiaomin Wu). Both had little prior knowledge of the Rigi tool. One team member (Holger Kienle) had previous experience with the Bauhaus reverse engineering tool, which is based on the Rigi visualization engine. He was involved in a previous tool demonstration that consisted of reengineering the xfig application [CEK00]. Another graduate student (Johannes Martin) who is intimately familiar with Rigi was consulted various times on technical issues. He is the maintainer of the Rigi C and C++ parsers and familiar with parsing programs for use with Rigi [MWW00]. He did not, however, participate in the SORTIE reengineering effort. 2. Experience Report The following steps were taken to analyze SORTIE with Rigi: 1. Familiarization with Rigi 2. Familiarization with SORTIE 3. Generation of RSF - with Rigi C++ parser. - with TkSee C++ parser. The first approach involved the writing of "stubs" for the Borland VCL library. The second approach involved the development of a converter from GXL to RSF. 4. Postprocessing of RSF 5. Structural redocumentation with Rigi 2.1. Familiarization with Rigi Since both team members did not have prior experience with the Rigi tool, some time was spent upfront to gain a basic understanding of its capabilities. Familiarization with Rigi included reading the "Rigi's User Manual" (which was found to be a valuable resource, albeit sometimes not detailed enough on topics such as RSF and RCL), and running the semi-automatic demos that are part of the Rigi distribution. During this first exploration of Rigi we encountered several problems. Rigi is extremely slow on Linux with modern window managers (such as Gnome and KDE1/2). Invocation of a Web browser from Rigi to view the hypertext version of the source code did not work. Lastly, we got confused with the various RSF formats (3-tuple unstructured RSF, partly structured RSF, structured RSF, and 4-tuple unstructured RSF). Different tools expect different RSF formats as input and can produce different RSF formats. As a reaction to the encountered difficulties, we set up a Rigi Wiki [RigiWiki] (with the TWiki software [TWiki]) to document solutions for the problems that we encountered. The Rigi Wiki is a central information source with links to already existing (but previously scattered) information and new information derived from our experiences. For example, we started a Rigi FAQ (which as of this writing contains ten questions and answers) to help new Rigi users. Familiarization with Rigi took us about 16 hours, but these were spread over about two weeks. We frequently got stuck and had to ask questions to Johannes Martin via email. It would have been valuable to have a knowledgeable person as part of the team or in close physical proximity. 2.2. Familiarization with SORTIE Initial information about the SORTIE code stated that it was written in C++. Based on this assumption, we planned to briefly inspect the SORTIE code to pinpoint possible compilation problems prior to running the code through the Rigi C++ parser. This inspection revealed that the "C++" source is a proprietary C++ dialect developed by Borland. Furthermore, the code is tightly coupled with a Borland specific GUI library, the Visual Component Library (VCL): Borland C++ "is an unusual product in that it is a C++ tool built on top of an Object Pascal object framework called the Visual Component Library, or VCL. In fact, all of the VCL, and almost all of the IDE, is written not in C++, but in Object Pascal. [VCL]" "VCL is based on components, properties, and events, while OWL and MFC have none of these features. In particular, events support something called the delegation model, which is an alternative to simple inheritance. The VCL fully supports all standard OOP concepts such as inheritance, polymorphism, and encapsulation. What it brings to the party are components, properties, and events. (Events are also known as closures.) [VCL]" It seems that VCL was introduced with Borland C++ 5.0. We had access to an older Borland C++ 4.5 compiler, but VCL was not part of it. Soon, it became obvious to us that in order to compile and understand the SORTIE code, knowledge about Borland C++ is necessary. Since we neither had a Borland compiler nor the accompanying documentation, we snooped around the Web. The most valuable source proved to be an online book [BB] and the Borland Community Web site [BC]. We also downloaded a free version of the Borland C++ compiler, but this test version (not too surprisingly) did not contain the VCL. 2.2.1. Borland IDE Part of Borland's C++ IDE (dubbed C++ Builder 5 [BBuilder]) is a GUI builder, which automatically generates code. This makes it hard to tell in SORTIE which part of the code has been generated and which part is hand-written. Since we do not have access to the Borland IDE, we can only make intelligent guesses about its capabilities. The GUI has the capability to assemble a Form interactively. Information about these Forms are stored in a file with the suffix dfm: "This file contains the details of the objects contained in a form. It can be view as text by right clicking on the form and selecting view as text from the pop-up menu, or it can be converted to text and back using the convert.exe found in the bin directory. Caution should be used in altering this file as changes to it could prevent the IDE from being able to load the form. This file is critical to moving or rebuilding the project. [BDSS]" The SORTIE code comes with the (binary) .dfm files. In order to inspect the contents of these files, we asked Sachen (the SORTIE domain expert) to convert the .dfm files to text by running the convert.exe utility. Each Form is associated with a C++ class, which resides in its own .h/.cpp file. The skeleton of the automatically generated .h file looks as follows: class TMyForm : public TForm { __published: // IDE-managed Components ... private: // User declarations public: // User declarations virtual __fastcall TMyForm(TComponent* Owner); }; extern TMyForm *MyForm; After the __published keyword, the components of the Form and the components' methods are listed. The components, along with their properties and events, are described in the corresponding .dfm file. The skeleton of the automatically generated .cpp file looks as follows: #pragma resource "*.dfm" TMyForm *MyForm; __fastcall TMyForm::TMyForm(TComponent* Owner) : TForm(Owner) { } The pragma directive indicates that a corresponding .dfm file exists. Every Form has a corresponding global variable. This variable is typically initialized as follows: Application->CreateForm(__classid(TMyForm), &MyForm); SORTIE does this initialization in the file SORTIEBC.cpp (see below). The SORTIE source has 48 .dfm files out of 63 .cpp files. This indicates that a large part of the code has been written with the help of the GUI builder. The 63 .cpp files can be grouped as follows: 48 GUI-builder based (corresponding .dfm file) 1 SORTIEBC.cpp, main application file (automatically generated) 1 globals.cpp, empty 13 User created classes (not subclassed from VCL) (Except for SORTIEBC.cpp, every .cpp files has a corresponding .h file.) Most of the contents of the .dfm files deal with properties that do not help program understanding, such as form width, font color, and title bar name. However, at least two properties are of interest: (1) events and (2) pull-down menus. Both are (partially) lost in the generated code. The VCL introduces the concept of events, which can be specified with the IDE: "Events provide a means for a component to notify the user of some pre-defined occurrence within the component. Such an occurrence might be a button click or the pressing of a key on a keyboard. [BS]" "...C++Builder generates the function declaration, and saves the relationship between the event and event handler into the .dfm file. The .dfm file is compiled into the target application's resource. When an application built with C++Builder starts up, VCL reads the form from the resource and assign properties and events of the components on the form. [BS]" By looking at the textual representation of the .dfm files, we identified the following events: OnActivate 4 OnChange 77 OnClick 246 OnClose 5 OnCreate 2 OnDblClick 17 OnDragDrop 2 OnEnter 4 OnExit 2 OnKeyPress 1 OnMouseDown 3 OnMouseMove 2 OnMouseUp 19 OnPaint 7 OnResize 4 OnScroll 2 OnShow 23 There is a convention that for every event "OnE" associated with component "C", C's owner (i.e., the Form) contains a method void __fastcall CE(TObject *Sender); Unfortunately, this convention is not always obeyed in SORTIE. This makes it hard--or even impossible without the .dfm files--to identify a component's events. For example, the component PointGLIMenu has an OnClick event called GLITestClick. Another useful information that is contained in the .dfm files is the pull-down menu structure along with the menu items' callbacks (all OnClick events). Here is the pull-down menu structure of the main application window (extracted from sortiapp.dfm). The text in round brackets are bubble-help messages. File Load SEM file... - New parameter file... (Create a parameter file from scratch) Load parameter file... (Load an existing parameter file) Save parameter file... (Save current parameters to disk) - Open playback file... - Open text editor... (Open text editor) - Create map file... Load map file... Save tree map... - Load Harvest Regime... - Create light location file... - Exit (Quit SORTIE W) Edit Location... (Plot and sky parameters dialog) Species... (Species parameters dialog) Initial densities... (Set initial population characteristics) Substrate... Disturbance (Create or review a harvest regime) Harvest... Wind... View Legend Plot map --Graphs-------- Rel. basal area Abs. basal area Rel. density Abs. density Substrate --Histograms----- Dbh distribution Height distribution --Tables---------- Data Table Harvest Table --Other------- GLI Map Speed Bar Output Setup Output Files... Write data files... (Select data to save) Write playback file... - Setup harvest output... - Calculate GLI Entire Plot... From Point File... Model Run (Run SORTIE W) Pause (Pause run) Reset (Reset run) - Load run... (Open previously saved model run) Save run... (Save current run) Window Cascade Tile Arrange Icons Help About... The .dfm files also reveal the type of files that can be saved and loaded by the SORTIE application. This information is kept in the Filter setting of the TSaveDialog and TOpenDialog components: *.sgli GLI Test Results *.sll Light location files GLI location file *.hvr Harvest Results Files Harvest Regime Files *.sem Sortie Elevation Map Files *.par Parameter files *.spb SORTIE playback files *.stmf Tree Map Files *.stm Alternate version (of Tree Map Files) The IDE also automatically generates the main application file. For SORTIE, the file's name is SORTIEBC.cpp. The file lists all entities that represent the application and the main function. 2.2.2. Borland C++ Dialect An article available at the Borland Community Web site discusses several C++ enhancements [Intersimone]: - Borland-specific Keywords: The following table lists the keywords along with the number of their occurrence in the SORTIE code. __classid 47 (all except one in SORTIEBC.cpp and sortiapp.cpp) __closure 0 __fastcall 1501 __property 0 __published 49 (In this table and elsewhere, the number of occurrence has been determined with grep, followed by a brief inspection of the results. Hence, the actual number might be less!) The keywords __fastcall and __published can be safely eliminated for the compilation process. The __classid keyword (which is rather a builtin function) takes the name of a class. It is the VCL's introspection mechanism. Luckily, here are few occurrences and they are concentrated in three SORTIE files. - Exception handling: Borland introduces a finalization mechanism with the __finally keyword. Luckily, SORTIE does not make use of this feature. The following types are caught in catch statements: Exception 1 (all in SORTIEBC.cpp) EAccessViolation 4 (all in srtmdiio.cpp) std::bad_alloc 28 char* 1 (all in srtmdiio.cpp) Exception is the root class of the VCL exception hierarchy. EAccessViolation is a subclass of Exception. The occurrences of std::bad_alloc are used to guard sequences of new statements. The SORTIE application throws only 10 exceptions. All of them appear in a single file (srtmdiio.cpp) and throw a char* type. In summary, exception handling can be safely ignored for the analysis of SORTIE, since it is not part of the "normal" control flow. This is fortunate since most tools do not support exception. For example, neither the Rigi C++ schema nor the Dagstuhl Middle Model (DMM) can currently express exception code. - Data types: VCL introduces several specific data types (for compatibility with the underlying ObjectPascal). Of these, SORTIE seems to make only use of AnsiString (185 occurrences), which is a specialized string class similar to STL's ANSI string class. The following table lists several of AnsiString's methods: c_str 572 (returns a char* version of the string) Length() 2 ToInt() 1 ToDouble() 1 2.2.3. Discussion All findings up to this point have been done by manual inspection of the code and searching/filtering with grep, sed, and sort. At this stage, we did not try to understand the control and data flow within SORTIE. Our objective was to understand the compilation model of the Borland C++ Builder, to get to know the Borland C++ dialect, and to get a rough idea of the code base. Our findings in this phase gave valuable information for the parsing process. An important part of the tool collaboration is to share results with other researchers. We published the online resources that we found as well as the information described in this section on the first author's Web site [Kienle]. The link to the Web site was posted to the sortie-demo mailing list [SD]. As pointed out to us by Jens Jahnke, it is questionable whether neither having a Borland C++ compiler, nor having the VCL's API is a realistic reverse engineering scenario. (Granted, the source code of the VCL might be lost or not accessible, but the published API to compile against should be available.) Furthermore, while it is common for reverse engineering tasks to ignore library code (such as the C standard library) in order to not clutter the analysis results, it is questionable whether this should be done in the case of object-oriented framework code. For object-oriented code it might be important to know the full inheritance hierarchy along with redefined methods. This is especially important for object-oriented frameworks, such as the VCL, to detect the framework's "hooks." Finally, it would have been helpful to have Sachen's "A Summary of the User Requirements for the SORTIE Reengineering Project" [Gendron] available earlier. Reverse engineering program code without even knowing the domain is hard (and probably not realistic). This step was accoplished in roughly 20 hours spread over a couple of weeks. 2.3. Generation of RSF Our code inspection revealed that we had to anticipate parsing problems. Because neither of us had experiences in parsing C++ code for RSF generation, we decided to pursuit two separate parsing approaches. The first approach is using the Rigi C++ parser to parse the source code. The second approach is using the TkSee C/C++ parser, which generates GXL output. Both parsers' function is the extraction of artifacts from the subject system. 2.3.1 Rigi C++ Parser The Rigi C++ Parser is built on top of the IBM Visual Age C++ compiler. In order to generate RSF, the Rigi C++ parser expects code that will pass the IBM Visual Age C++ compiler. Xiaomin describes the difficulties in her final report [Wu01]: "When using the Rigi C++ parser to parse the program, the source code was loaded into the Visual Age C++ compiler first. However, it is very difficult to get the code compiled. Thousand lines of error messages came out. By analyzing the error messages, we found that the problems can be classified to two categories. The first one is syntax errors. It is caused by the syntax differences between Borland C++ and Visual Age C++. One example is that some keywords that are unique to Borland C++ can not be identified by Visual Age C++. Some other errors are about the operators, variable types, identifiers, etc. To solve this problem, some modifications have been made to the source code. Another kind of problem is about the class library of C++ compilers. The subject system was developed using the Borland C++ builder, and referenced the Borland C++ class library a lot, but these classes are not available in Visual Age C++ library. To solve this problem, many supplements to the Visual Age C++ library have been made. For example, 37 dummy header files that should be in the VCL (Visual Component Library) of Borland C++ have been written and put into the class library of Visual Age C++ (under the ./include/ subdirectory of VA), and a header file that contains all the definitions of the missing super-classes has been written and put into the class library of Visual Age C++. One thing I need to mention here is that the definitions of these super-classes might not be the same as what they are in the Borland class library, since our only purpose is to get the source code compiled and I wrote the classes according to the information provided by the error messages, so only the variables and functions that were referenced by SORTIE were reconstructed in the super-class." The generated output has to be postprocessed in order clean up the RSF. Since the Rigi C++ parser generates output by querying Visual Age's repository (called CodeStore), it also reports artifacts in CodeStore that are not present in the original source (such as standard operators and constructors inserted during the compilation process). Luckily, these can be easily filtered because of the employed naming scheme. All names look as follows: name^file^line_number^column The following example shows C++ code along with the parser generated names: class X { char c; // X::c^X.cpp^2^3 void setChar (char c) {...} // c^X.cpp^3^17 } Artifacts that do not have a corresponding source location show up as name^^0^0. Artifacts that do not have a source code name show up as @number^file^line_number^column. (In a few circumstance, the parser also emits nodes with the name (10 occurrences, all in library code) and (76 occurrences; 28 of these in library code)). 2.3.2 TkSee C++ Parser (University of Ottawa) The TkSee C++ Parser, which is still in an early development stage, is based on Source-Navigator [SN]. Source-Navigator seems to have only a partial parser. This means on one hand that the parser is more robust (i.e., can deal with C++ flavors and variations), but on the other hand, the parse is not as precise. In fact, the parser does not necessarily report if it gets out of sync and produces wrong results. For example, the TkSee C++ Parser's home page states [Marchen]: "Access specifier __published [...] is treated as class member variable of type int." In order to obtain a meaningful parse, we applied the following preprocessing steps to the code: (1) removal of Borland specific keywords, pragmas, and directives and (2) fixing of file names ("sortiefile" sed script provided with the SORTIE source). We finally ran the parser (version 1.39b) on the source to obtain GXL output. The GXL output adheres to the Dagstuhl Middle Model (DMM) [DMM01]. This fact was our primary motivation for using this parser. We iterated through several parser releases and submitted several bug reports. 2.3.2.1. Schema Conversion The second part of the parsing process consisted of the transformation from GXL to RSF. This is not a mere translation of exchange formats, since the underlying schema on both sides of the translation process needs to be preserved. Hence, we had to find a mapping from the DMM into the Rigi C++ schema. Furthermore, in order to compare the RSF obtained with this approach to the RSF obtained with the Rigi C++ parser, we tried to generate RSF that is close to the Rigi C++ parser. The RSF Rigi C++ schema looks as follows: - Riginode: Unknown Collapse Function Prototype Variable Constant Datatype Subsystem Class Namespace - Rigiarc: extends Class Class isInNamespace Unknown Namespace isMemberOf Unknown Class calls Function Unknown accesses Function Unknown isTheSameAs Datatype Unknown returns Unknown Unknown references Function Unknown hasType Unknown Unknown isDefinedIn Unknown Function isCastTo Unknown Unknown level composite multiarc - Rigiattr: attr Node file attr Node lineno attr Node position attr Node annotate attr Node nodeurl attr Node tagged attr Node classKind attr Node accessKind attr Arc name attr Arc annotate attr Arc arcurl Both schemas are middle-model schemas (i.e., they have coarse-grained artifacts). Still, there is often not a direct one-to-one mapping. The following tables show the (preliminary) mapping of nodes, edges, and attributes between the two schemas. cppparse refers to the Rigi C++ domain as specified in the Rigi HTML documentation (author: Johannes Martin). - Nodes cppparse DMM Function Routine who is pointed to by a Defines arc: SourcePart --Defines--> Routine Method who is pointed to by a Defines arc: SourcePart --Defines--> Method --IsMethodOf--> Class Prototype Routine who is pointed to by a Declares arc: SourcePart --Declares--> Routine Method who is pointed to by a Declares arc: SourcePart --Declares--> Method --IsMethodOf--> Class Variable Variable (who is pointed to by Defines/Declares arc) SourcePart --Defines/Declares--> Variable Formal Parameter SourcePart --Defines--> Variable --IsParameterOf--> Routine SourcePart --Defines--> Variable --IsParameterOf--> Method PARSER BUG: FormalParameter should be used instead of Variable. Field SourcePart --Defines--> Field --IsFieldOf--> Class Constant Constant (who is pointed to by Defines arc): SourcePart --Defines--> Constant SourcePart --Defines--> EnumerationLiteral (supported since parser version 1.37b) Datatype Type who is pointed to by a Defines arc: SourcePart --Defines--> Type SourcePart --Defines--> EnumeratedType NOTE: EnumeratedTypes (in Classes) are missing a IsPartOf/IsEnumeratedTypeOf arc. NOTE: Predefined types (int, bool) are not captured. These have only Declares arcs: SourcePart --Declares--> Type PARSER BUG: typedefs are not handled correctly -- they get lost. Class Class (who is pointed to by a Defines arc): SourcePart --Defines--> Class Namespace - NOTE: NOT (YET) PART OF DMM - Relationships cppparse DMM extends Class --InheritsFrom--> Class SourcePart --Defines--> Class --InheritsFrom--> Class isInNamespace - NOTE: NOT (YET) PART OF DMM isMemberOf Method --IsMethodOf--> Class Field --IsFieldOf--> Class Q: Are there other members as well (e.g, enums and typedefs in classes). isTheSameAs Type --IsDefinedInTermsOf--> Type PARSER BUG: Not handled yet (version 1.39b) accesses SetsComponent/UsesComponent PARSER BUG: Not handled yet (version 1.39b). Not too useful for SORTIE anyway: Since SORTIE directly accesses public instance variables (instead of accessor functions), lots of spurious references would be generated. calls Routine/Method --Invokes--> Routine/Method returns Routine --IsReturnValueOf--> Type Method --IsReturnValueOf--> Type references Method --Uses--> Variable Note: Formal parameters (BUG: represented as Variables) show up as Uses. To detect them: Variable --IsParameterOf--> Method hasType Variable --IsOfType--> Type Constant --IsOfType--> Type EnumerationLiteral --IsEmunerationLiteralOf--> EnumeratedType isDefinedIn SourcePart --Contains--> SourcePart --Defines--> X X = { Variable, Constant, Routine, Type, EnumeratedType } - Attributes cppparse DMM classKind n/a accessKind "access" attribute (of Method/Field) Translation from one schema to the other proved to be not straightforward. Even though there is documentation for DMM, we had to closely analyze the GXL to understand the semantic of the parser's output. Furthermore, the parser slightly deviates from the DMM (e.g., instead of FormalParameter the more general type Variable is emitted for formal parameters), generates wrong output (e.g., an earlier version of the parser produced incorrect method containment and access attributes for methods) and does not yet emit all DMM constructs (e.g., an earlier version of the parser did not emit Constants). Our findings resulted in several bug reports and suggested improvements of the parser. On the Rigi side, we made the following observations: The frequent use of "Unkown" (see Rigi C++ schema above) for source and target nodes severely limits understanding of the schema specification -- it is less self-documenting than other schemas. Similarly, Rigi's attribute model is too imprecise. The C++ schema is not sufficiently documented (e.g., missing documentation of classKind and accessKind attributes). Furthermore, the documentation was partially wrong in explaining the naming scheme and very vague on which artifacts actually get extracted from the code. The latter can be partially explained by the fact that the Rigi parser selectively dumps the CodeStore without necessarily knowing CodeStore's working. Finally, the Rigi C++ schema has no representation for enumeration literals. Since SORTIE uses enums, this would be desirable. On the positive side, our findings improved documentation and the discovery of a significant parser bug. 2.3.2.2. XSL Implementation Originally, we were hoping to use Jeff Michaud's GXL <-> RSF converter written in Java, but is still is in early beta stage and currently not further developed [GXLRSF]. Our converter that performs the mapping from GXL to RSF has been written in XSL. XSL is reasonably well suited for a prototype implementation. The XSL code is written by hand. There is a single script for each artifact that is extracted. This approach yields redundant code, but the code is easier to read and debug. This is important since XSL is not a language that is human-readable in a reasonable way (IMHO). As an example, the following XSL code matches subclass relationships (i.e., SourcePart --Defines--> Class --InheritsFrom--> Class) and outputs a corresponding RSF extends tuple: # Generated from gxl2rsf_extends.xsl extends Not show is the utility code (in util.xsl) that handles the printing of names. A significant part of the code deals with generating name that are similar to the ones generated by the Rigi C++ parser. XSL was an interesting experiment for us. Since we had no prior XSL knowledge, the code was rewritten several times as our understanding increased. The resulting code has not been optimized for speed, but readability. We had trouble finding a good XSL processor until Sergei Marchen pointed us to Xalan. The Xalan for Java implementation (ran uncompiled on a JVM) was stable enough to run our scripts, but execution speed proved to be a problem, especially for large GXL files. Translation of all files took around 9 days and bigger files (such as srticode.cpp) took more than a day. The parser translates every source file to a corresponding GXL file. (The Ottawa parser can also generate a single GXL file capturing all source files, but this file is too big to process with an XSL script.) Unfortunately, this means that names of artifacts that do not have a definition cannot be resolved. These names have to be resolved by a primitive "linker" in a separate postprocessing step. For example, for the class TABAGraphForm is represented with two nodes: TABAGraphForm^abagraph.h^21^6 and TABAGraphForm^^^. The first, full name is generated when the class's definition is parsed (in abagraph.h). The latter is generated whenever there is a reference to the class in a different file than abagraph.h. We are currently in the process of writing a linker that translates abbreviated names to full names. Even full names can be ambiguous because the naming scheme (taken from the original Rigi parser) is not unambiguous. Here is Sergei's reply to the question "Will I get the same result when I process every file separatley compared to using the -l/-s option with all files?": "They will be still valid but completely independent from each other. The thing is that IDs of the "same" nodes in a different files will be different and vice versa - "different" nodes in different files can have the same ID because in this case in every file the numbering of nodes will be started from "1" every time you run the parser. In case of list of files (specified by "-l" switch or autogenerated by "-s (-f)" switch(es)) parser keeps track of elements IDs so node, representing type "int" for example will have the same ID in every GXL file containing an "edge(s)" to 'type "int"'. So some or all of those files can be kind of "linked" together and they still will correctly represent relationships between nodes. If you specify "-A" switch parser just outputs all the information into one file and this file will contain only one node for type "int" and it will most likely have same ID as in case of multiple files output... Same about all other things. However locally defined (and referenced) variables will have different IDs from the globally defined variables with the same name so we can easily distinguish one from another." Development of the converter took about 25 hours, spread over several weeks. 2.3.2.3. Comparison of RSF and GXL It is interesting to do a comparison of the file sizes between GXL and RSF for all of SORTIE: GXL RSF lines 1.495.211 36.929 characters 35.143.782 2.921.113 Note that the conversion from GXL to RSF looses some information (e.g., the path attribute and the Contains relationships of SourceParts). Still, even when this is taken into account, GXL's file size is still considerably larger. 2.3.3. Comparison We would like to compare the RSF generated by both approaches. Identifying their differences (especially inconsistencies) should help us to detect parser bugs/characteristics and strengths/weaknesses of the respective approaches. So far we did not have time to do this. 2.4. Postprocessing of RSF Xiaoming started to analyze SORTIE based on the RSF obtained with the Rigi parser. Her findings are summarized in her report [Wu01]. We did not have the time to analyze SORTIE based on the RSF obtained with the Ottawa C++ parser. 3. Collaboration Partners 3.1. Peer Tools As described above, we used the TkSee C++ Parser (University of Ottawa), which is a parser similar in functionality to the Rigi C++ parser. The TkSee C++ parser output is GXL (and TA++), which forced us to write a converter from GXL to RSF. Writing of the converter took a considerable amount of time -- time that could have been more profitably spent. We hope that the research community will converge towards a common exchange format (preferably GXL) to ease tool interoperability. But even if a common exchange format exists, the problem of schema conversion remains. In fact, the schema conversion from DMM to Rigi's C++ schema was the most difficult part of the translation process. 3.2. Complementary Tools In the future, we would like to use the clustering results of the TkSee participants, but this will mean, of course, yet another converter. Besides Rigi we used a variety of Unix tools, mainly grep, sed, and awk. We also used emacs to browse and search in files (etags to search in emacs over multiple files). In general, there were few offers of participants of output that they produced. Thus, we currently do not even know if there was potential to use other's output. (Maybe participants talked to each other with private email instead of posting results; this was the case to a certain extent in our communication with the TkSee participants.) We offered output, but did not receive response of participants. It appears that Andrew Malton used the textual output of the dfm files that we posted, but his lead him to a dead end. 3.3 General Observations Communication among the participants was far less than we expected. As of this writing, only 22 messages were sent over the email list. Personally, I did not like the eGroups list. After the list was switched over to Yahoo, I did not register again, because I do not want to give personal information away. It is not very reassuring for that "Yahoo! will try to provide more relevant content and advertising based on the information collected on this page and on the Yahoo! products and services you use." Having a mailing list in addition to a central, open, non-commercial bulletin-board such as a Wiki might be a better way (but this is a matter of personal taste I guess). On way to increase collaboration might be to encourage participants to post results in stages. Every participant could have a status bar showing their current stage. For example, participants should post their obtained output (and experiences) as soon as they succeed parsing the code. Thus, other participants learn early on from first successful parsing experiences or can skip the parsing stage by using other participant's output. Since parsing proved to be a difficult part of the assignment (by far more difficult than xfig), this could have save redundant (and frustrating) work for other teams. References [BB] Borland C++ Builder, http://read.cnread.net/dnwl/cxsj/c/bcb/. [BC] Borland Community Web site, http://community.borland.com/. [BBuilder] Borland C++ Builder home page, http://www.borland.com/bcppbuilder/. [BDSS] Borland Developer Support Staff, "Delphi 3 file types with descriptions", Borland Community Web site, http://community.borland.com/article/0,1410,16552,00.html. [BS] Borland Staff, "Borland C++Builder White Paper: C++Builder and VCL", Borland Community Web site, http://community.borland.com/article/0,1410,10095,00.html. [CEK00] Jorg Czeranski, Thomas Eisenbarth, Holger Kienle, Rainer Koschke, and Daniel Simon, "Analyzing xfig Using the Bauhaus Tool", 7th Working Conference on Reverse Engineering (WCRE '00), pages 197-199, Brisbane, November 2000. [DMM01] "The Dagstuhl Middle Model (DMM)", http://scgwiki.iam.unibe.ch:8080/Exchange/2. [Gendron] Sachen Gendron, "A Summary of the User Requirements for the SORTIE Reengineering Project", http://www.csr.UVic.CA/chisel/collab/casestudy.html [GXLRSF] GXL to RSF converter, http://crown.csc.UVic.CA/cgi-bin/twiki/view/Main/RigiReleases. [Intersimone] David Intersimone, "VCL++: Applying the Power of the VCL to C++ by Randy Haben", Borland Community Web site, http://community.borland.com/article/0,1410,20565,00.html. [Kienle] Holger Kienle, "SORTIE Reverse Engineering Tools Collaboration Page", http://cedar.csc.uvic.ca/kienle/view/Main/SORTIE. [Marchen] Sergei Marchen, "C/C++ Parser with TA++ and GXL output", University of Ottawa, http://137.122.24.96/dmm/. [MWW00] Johannes Martin, Kenny Wong, Bruce Winter, and Hausi Muller, "Analyzing xfig Using the Rigi Tool Suite", 7th Working Conference on Reverse Engineering (WCRE '00), pages 207-209, Brisbane, November 2000. [RigiWiki] http://crown.csc.UVic.CA/cgi-bin/twiki/view/Main/WebHome. [SD] Sortie-demo mailing list and Yahoo! Groups, http://sg.groups.yahoo.com/group/sortie-demo. [SN] "The Source-Navigator IDE", http://sources.redhat.com/sourcenav/. [TWiki] TWiki home page, http://www.twiki.org/. [VCL] http://read.cnread.net/dnwl/cxsj/c/bcb/001.htm. [Wu01] Xiaomin Wu, "Final Report of Directed study -- Topic: Software Reverse Engineering", http://www.csc.uvic.ca/~xwu/final-report.htm.