Software Composition Group - SORTIE Report
Michele Lanza, Gabriela Arevalo, Daniel Schweizer, Daniele Talerico
Table of Contents
- Introduction
  - Description of Tools
  - Reverse Engineering Team
- Preliminary Work
- Analysis
  - Metric Analysis
  - Statistical Analysis
- Sortie Explained
  - Our View
  - Conclusion
- Suggestions & Opinions
- Conclusion
1. Introduction
1.1. Description of Tools
For our analysis we used the SCG P.U.R.E. toolset written by members and students of the Software Composition Group, as well as one commercial parser.
- Sniff+, a commercial parser and integrated development environment for various languages with which we have parsed the Sortie system.
- Moose, our language independent reengineering environment, written by various members of the SCG since 1998.
- CodeCrawler, a visualization tool which combines visualization techniques with metrics. CodeCrawler is written by Michele Lanza and is based on Moose.
- MooseClassifier, an extension to CodeCrawler written by Daniele Talerico.
- MooseExplorer, a tool which enables textual navigation of Moose Models, developed as part of the diploma by Pietro Malorgio.
- MooseFinder, a query engine which enables us to compose complex queries. It was part of the diploma thesis of Lukas Steiger.
- MooseNavigator, an extension to CodeCrawler written by Daniel Schweizer as part of his diploma thesis.
1.2. Reverse Engineering Team
The team for the Sortie experience was composed of the following people:
- Gabriela Arevalo, who completed her master's degree on software architecture and is currently doing her Ph.D. on components.
- Michele Lanza, who completed his master's degree on reverse engineering and is currently doing his Ph.D. on reverse engineering and software evolution.
- Daniel Schweizer, currently doing his diploma thesis on reverse engineering and navigation of metamodels.
- Daniele Talerico, currently doing his diploma thesis on reverse engineering.
2. Preliminary Work
Parsing & Loading
The first step consisted of parsing the source code using
Sniff+. Parsing did not pose a major problem. Using a tool called
Sniff2Famix we used the symbol table generated by Sniff+ to generate a
CDIF file, which contains a textual representation of all software
entities contained in Sortie. Using that file we can load Sortie into
Moose, our Reengineering Environment. Moose is
language-independent. This whole process took less than one hour.
3. Analysis
Our analysis is based on two aspects of the SORTIE system:
- Structure of the different parts of the system (definition of classes, attributes
and methods).
- Communication and collaboration between different parts of the system.
3.1. Metric Analysis
We ran our metric engine on Sortie, which took a few seconds. An
overview of Sortie can be seen in the table below.
| Entities | Number |
|---|---|
| Classes + (Structs) | 63 + (6) |
| Methods | 763 |
| Attributes | 1935 |
| Functions | 5 |
| Inheritance Definitions | 10 |
| Invocations | 683 |
| Attribute Accesses | 6736 |
| Formal Parameters | 985 |
| Global Variables | 72 |
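To put the raw counts in proportion, the following minimal sketch derives a few ratios from the table. It assumes that the 6 structs are counted together with the 63 classes; the type and function names are illustrative and not part of our toolset. Under that assumption the system has roughly 28 attributes but only about 11 methods per class, and close to 10 attribute accesses per invocation.

```cpp
#include <cassert>

// Counts copied from the overview table; "classes" includes the 6 structs
// (an assumption made for this sketch).
struct SystemCounts {
    int classes;
    int methods;
    int attributes;
    int invocations;
    int attributeAccesses;
};

const SystemCounts kSortie = {63 + 6, 763, 1935, 683, 6736};

// Average number of `part` items per `whole` item.
double perUnit(int part, int whole) {
    assert(whole > 0);
    return static_cast<double>(part) / whole;
}

double attributesPerClass(const SystemCounts& s) {
    return perUnit(s.attributes, s.classes);
}

double methodsPerClass(const SystemCounts& s) {
    return perUnit(s.methods, s.classes);
}

double accessesPerInvocation(const SystemCounts& s) {
    return perUnit(s.attributeAccesses, s.invocations);
}
```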
Considering the huge number of attributes and attribute accesses and
the low number of inheritance relationships, we first made a general
analysis of the system, and afterwards a study of specific parts of
the system. In the figure below we see a first visualization of the
system using CodeCrawler.
From this picture we can see aspects like inheritance, size of the
classes (number of defined attributes and methods) and namespaces.
Inheritance definitions: The figure shows the inheritance
hierarchy of the system, which is very flat. There are only 10
inheritance definitions.
Namespaces: The colors represent the different (artificial)
namespaces we have detected. It seems the classes can be categorized
into the following groups:
- Dialog Classes (12), which contain the substrings
Dialog (10) or Dlg (2).
- Form Classes (33), which contain the substrings Form
(22) or Fm (11).
- Sortie Classes (6), which contain the substring
Sortie.
- Structs (6), which have lowercase names with two exceptions:
TGridSubstrate and TPlotPoint.
- The Rest (11), which do not fit the above conventions. Some of
these are plain breaches of the naming policy, i.e., the
classes should have one of the above substrings in their name but they
do not.
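The substring-based grouping can be sketched as follows. This is only an illustration of the naming convention, not the code of our tools: the enum, the function names and the order in which the substrings are tested are assumptions. Structs are left out because they are recognized by their kind rather than by a substring.

```cpp
#include <cassert>
#include <string>

// Artificial namespaces derived from class names (structs are not handled here).
enum class NameGroup { Dialog, Form, Sortie, Rest };

// True if `name` contains `part` as a substring (case-sensitive).
bool contains(const std::string& name, const std::string& part) {
    return name.find(part) != std::string::npos;
}

// Assign a class to a group; the test order is an assumption of this sketch.
NameGroup classify(const std::string& className) {
    assert(!className.empty());
    if (contains(className, "Dialog") || contains(className, "Dlg")) {
        return NameGroup::Dialog;
    }
    if (contains(className, "Form") || contains(className, "Fm")) {
        return NameGroup::Form;
    }
    if (contains(className, "Sortie")) {
        return NameGroup::Sortie;
    }
    return NameGroup::Rest;
}
```

With this rule, THarvestDialog falls into the Dialog group, TTreeMapFm into the Form group, TSortieIO into the Sortie group and TMainWindow into the rest.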
Size of Classes:
In the above view we display all classes of the system and use metrics
to render the size of the nodes as follows:
- The wider a class is the more attributes it defines (NOA)
- The taller the class is the more methods it defines (NOM)
Below we see a summary of some metrics of the largest classes of Sortie.
| Class Name | NOA (Number of Attributes) | NOM (Number of Methods) | WLOC (Total Lines of Code) | Average LOC per Method |
|---|---|---|---|---|
| TMainWindow | 237 | 78 | 2099 | 27 |
| THarvestDialog | 124 | 66 | 2746 | 42 |
| TSpeciesDialog | 255 | 12 | 567 | 47 |
| TPlotDialog | 144 | 25 | 744 | 30 |
| TSortieIO | 146 | 28 | 2946 | 105 |
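The last column appears to be WLOC divided by NOM, rounded to the nearest integer. The following sketch reproduces it under that assumption; the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// One row of the table above.
struct ClassMetrics {
    const char* name;
    int noa;   // number of attributes
    int nom;   // number of methods
    int wloc;  // total lines of code
};

// Average lines of code per method, rounded to the nearest integer.
int averageLocPerMethod(const ClassMetrics& m) {
    assert(m.nom > 0);
    return static_cast<int>(std::lround(static_cast<double>(m.wloc) / m.nom));
}

// Attributes per method: a rough indicator of data-heavy classes.
double attributesPerMethod(const ClassMetrics& m) {
    assert(m.nom > 0);
    return static_cast<double>(m.noa) / m.nom;
}
```

For TSpeciesDialog, for example, this gives 567 / 12 ≈ 47 lines per method and more than 21 attributes per method.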
3.2. Statistical Analysis
Going into more depth, we used the number of attributes and methods
per class to examine, from a statistical point of view, how the
classes are distributed. In the figures below, we show where the
classes presented in the table above are located in the distribution.
Subsequently, we present the following distributions:
- Number of Defined Attributes per Class
- Number of Defined Methods per Class
- Number of Defined Attributes compared to Defined Methods
- Defined and Invoked Methods
Number of Defined Attributes per Class
Based on the number of attributes defined in each class, the next
figure shows that the distribution of the classes is not uniform. Most
of the classes have fewer than 20 attributes, but then we see a steep
increase. As we saw in the previous table, TSortieIO and
TPlotDialog each have roughly 150 attributes,
and the classes THarvestDialog and TSpeciesDialog have
approximately 240 and 260 attributes respectively. The largest
classes in number of attributes belong to the Dialog namespace.
Number of Defined Methods per Class
Thinking in terms of behavior, we analyzed the number of methods
defined in the classes. This distribution is more uniform than the
previous one, except for the classes THarvestDialog and
TMainWindow that contain approximately 70 and 80 methods
respectively. This distribution has the same features as we detected
with attributes. The largest classes belong to the Dialog
namespace.
Number of defined Attributes compared to defined Methods
As the first figure already suggested, a comparison of the number of
attributes and methods shows that many classes seem to be data
containers with almost no behavior: in the first figure the classes are
wider than they are tall. In the following picture, we present the
information in a different way, with the number of attributes in blue
and the number of methods in red. For example, the class TSpeciesDialog has
little behavior compared to the number of defined attributes.
Defined and Invoked Methods
The concept of classes as data containers can also be seen when we
compare the defined methods and the invoked methods
of a class in the system. The next picture shows that only a few of
the classes have methods that are invoked in the rest of the
system. This fact makes the system appear smaller than it is, if we think in
terms of the level of interaction between the classes. When we look at the
list of non-invoked methods, we see that most of them are related
to communication with the user interface, for example
OKBtnClick(TObject*), SpeciesListBoxDblClick(TObject*),
CancelBtnClick(TObject*) in the class TWindstormDialog.
Most of these "non-invoked" methods are referenced in the *.dfm files, as we
verified.
The classes TSetDensitySizeForm, TDisturbanceDialog,
TWindstormDialog, TEditLambdaForm, TTimeStepFm, TPlotDimForm,
TSetMapscaleForm, TPrinterTypeForm, TSpatialInterp,
TSaveHarvestResultsDialog, TGrowthEquationsForm, TAboutFm and
TSavePBFm have two main features:
- none of their methods is invoked in the rest of the system
- their methods do not invoke any method of the rest of the system
As we said previously, these classes seem to belong to the interface
part of the system. This reduces the number of classes that really
model the domain of this system.
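The two features above amount to a simple check on the invocation graph. The following is a minimal sketch of such a check, assuming invocations are available as (caller class, callee class) pairs; it is not the query we ran with our tools, and all names in it are illustrative.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One invocation edge: a method of `first` invokes a method of `second`.
using Invocation = std::pair<std::string, std::string>;

// Returns the classes that neither invoke nor are invoked by any other class.
std::vector<std::string> isolatedClasses(
    const std::vector<std::string>& classes,
    const std::vector<Invocation>& invocations) {
    std::set<std::string> connected;
    for (const auto& inv : invocations) {
        assert(!inv.first.empty() && !inv.second.empty());
        if (inv.first == inv.second) {
            continue;  // calls within one class do not count as collaboration
        }
        connected.insert(inv.first);
        connected.insert(inv.second);
    }
    std::vector<std::string> result;
    for (const auto& cls : classes) {
        if (connected.count(cls) == 0) {
            result.push_back(cls);
        }
    }
    return result;
}
```

Note that such a check only sees invocations present in the source code; handlers that are wired up in .dfm files remain invisible to it, which is why these classes show up as isolated.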
Global Variables
In the system, we at first discovered 72 global variables. On
closer analysis, we see that the number of global variables is close
to the number of classes (69). When we look at the code, we see
declarations like extern TTreeMapFm *TreeMapFm or TTreeMapFm
*TreeMapFm, where TreeMapFm is the global variable. The
keyword extern declares a name with external linkage without
defining it, so a variable defined in one file can be used in other
files that see the extern declaration. Since these variables are
essentially one instance pointer per class, we consider that
there are no genuine global variables in this system. We think this is also a sign of
inexperience of the developer during the porting of Sortie from C to
C++.
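The pattern can be illustrated with a small, self-contained sketch. The class body is a simplified stand-in (the real TTreeMapFm is a GUI form class), and the width member and the treeMapWidth() function are made up for illustration; only the two declarations of TreeMapFm mirror what we found in the code.

```cpp
#include <cassert>

// Simplified stand-in for a form class; the members are hypothetical.
class TTreeMapFm {
public:
    int width = 0;
};

// Header side: a declaration only, no storage is allocated here.
extern TTreeMapFm *TreeMapFm;

// Source side: the single definition of the variable.
TTreeMapFm *TreeMapFm = nullptr;

// Any file that sees the extern declaration can reach the form instance.
int treeMapWidth() {
    assert(TreeMapFm != nullptr);
    return TreeMapFm->width;
}
```

In a real project the extern declaration lives in the header and the definition in exactly one source file; they are shown in one file here only to keep the sketch compilable on its own.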
Conclusions about the system structure
After looking at the metrics we draw some conclusions:
- The large number of attributes and the use of GUI classes indicate a mixing of domain model
and GUI. This can only be described as a poor implementation decision.
- The large number of lines of code of the classes also points to
present or future problems: working with large files is bad from a
cognitive point of view.
- The average length of the methods is also somewhat high, in
certain cases over 100 lines (without counting the .h file). This can be an
indicator of a procedural coding style, or of a lack of a
refactoring policy on the part of the developer.
4. Sortie Explained
4.1. Our View
To understand the basic structure of Sortie we looked at the central
class called TMainWindow, which is also by far the biggest
class. This class contains a method called RunSimulation(),
which is the key to the understanding of Sortie. The whole program is
basically a procedural system written in C++. We know it was ported
from C, which strengthens this supposition.
TMainWindow::RunSimulation() contains several calls to certain parts of the system. Although we do not possess any domain knowledge, judging from the comments within this method these appear to be calls to subparts which handle the following:
- Harvest
- Light
- GLI
- Bath Light
- Growth
- Windstorm
- Mortality
- Substrate
- Disperse
- Planting
- Demographics
- I/O
Using this information we generated the following figure:
In this figure we see all classes and structs of Sortie. The edges represent the invocations between the classes. We obtained this figure after removing 3 classes which do not carry any domain logic, but which get invoked a lot:
- TYesNoForm
- TErrorForm
- TAboutForm
4.2. Conclusion
Every subpart of Sortie mentioned above has a similar structure: a
class called TSortiexxx (the yellow ones in the figure) which can use
Dialog classes (red) and/or Form classes (cyan). The Dialogs are used
to set parameters. The Form classes are probably there for the
output. The last "piece" is the IO part, where files are saved,
etc. There is also a "piece" for batch processing. The fact that the
program was ported by someone who is not an expert in O-O is indicated by the
missing encapsulation (attributes are directly accessed all the time and
there are virtually no private methods), by the flat hierarchies and in general
by the non-O-O way of writing code.
5. Suggestions & Opinions
Here is a list of suggestions and opinions about the Sortie system:
- Domain mixed with GUI: We suspect that during the port from C
to C++ the developer(s) made extensive use of a technology they
embraced with too much fervor: the GUI framework of Borland
C++. Nearly all classes are subclasses of Borland GUI classes, which
results in a dangerous mixing of concerns: porting the Sortie system
to other C++ dialects or even languages will involve problems.
- Procedural Coding Style: In the figure below we see the
collaboration relationships between the classes and structs in the
system. What strikes the eye is the low number of edges: 3. This means
that the whole system, although written in a (hybrid) object-oriented
language, does not exploit that paradigm: the functionality within the
Sortie classes is not used by communicating objects, but rather by a
sequence of classes independent of each other. It is also noteworthy that
the collaborations are between classes and structs. Again, this is a sign of
procedural thinking.
- Domain Dispersion: A general impression of the system is that the actual domain is dispersed throughout the system. Therefore it is hard to locate a certain aspect of the domain within a certain class or number of classes. This has two negative effects on the system:
  - Low extensibility: If the domain needs to be extended, for example with a new type of forest, the developer needs to patch the code in several places.
  - Low migration potential: The dependence introduced by the domain dispersion and further reinforced by the GUI-guided development makes it nearly impossible to migrate this product to another programming language or even C++ dialect.
- We identified a particularly strange imbalance between data and behavior in a number of classes, some of which we list below:
  - TEditSpeciesForm (16 Attributes, 6 Methods, 63 LOC)
  - TGrowthEquationsForm (41 Attributes, 1 Method, 0 LOC)
  - TSavePBFm (9 Attributes, 1 Method, 0 LOC)
  - TSetAxisForm (15 Attributes, 1 Method, 0 LOC)
  - TSetDensitySizeForm (9 Attributes, 6 Methods, 32 LOC)
  - TSizeClassDialog (65 Attributes, 3 Methods, 90 LOC)
  - TWindstormDialog (71 Attributes, 5 Methods, 60 LOC)

  One would expect these classes to be abstract because they introduce methods without implementing them, and contain a lot of data. We can come up with possible reasons for this:
  - The classes in question are unfinished and still under development.
  - The GUI-guided development style leads the developer to first define the GUI and then to assign functionality to the system depending on the GUI. This can leave holes in the assigned functionality.
  - The development style which Borland Builder supports enables the developer to link GUI elements and functionality using .dfm files. This has the negative effect that the domain is further dispersed within non-source files.
- The lack of polymorphism is a negative sign: one would expect that
polymorphism would be used to model the different kinds of forests,
soils, etc., but this seems not to be the case.
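To make the last point concrete, the following sketch shows the kind of polymorphic design one might expect. It is purely hypothetical: none of these classes exist in Sortie, and the growth formulas and coefficients are placeholders without any domain meaning.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical domain interface; none of these names exist in Sortie.
class ForestType {
public:
    virtual ~ForestType() = default;
    // Growth per time step for a tree of the given diameter (placeholder model).
    virtual double growth(double diameter) const = 0;
};

class ConiferForest : public ForestType {
public:
    double growth(double diameter) const override { return 0.02 * diameter; }
};

class DeciduousForest : public ForestType {
public:
    double growth(double diameter) const override { return 0.03 * diameter; }
};

// The simulation code depends only on the interface, so a new kind of forest
// is added by writing one new subclass instead of patching several places.
double totalGrowth(const std::vector<std::unique_ptr<ForestType>>& forests,
                   double diameter) {
    assert(diameter >= 0.0);
    double sum = 0.0;
    for (const auto& forest : forests) {
        sum += forest->growth(diameter);
    }
    return sum;
}
```

The design choice illustrated here is that domain variation lives in subclasses of a GUI-independent interface, which also addresses the mixing of domain and GUI noted above.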
6. Conclusion
After looking at the Sortie case study we must say that the
reengineering requirements are somewhat unrealistic. We do not think
that the code of Sortie can easily be reused and, considering the small
size of Sortie, we would rather propose to rewrite the system using the
existing knowledge. For a new architecture we propose to first obtain a
clean notion and implementation of the domain models present in this
system and to document them thoroughly. This would at least make it possible to
implement new types of forests, etc. with little programming
effort. The requirement that non-programmers be able to introduce new
types of forests, etc. is unrealistic in this setting and would
require a major implementation effort as well as a major economic effort to move
Sortie towards a framework architecture.
Software Composition Group, 18-09-2001