Draft: 21 November, 1998
The following is prepared in anticipation of the CASCON meetings where we will discuss exchange formats for information about software systems. I will use the acronym SIEF to represent such formats in general.
After presenting the general problem, I will outline requirements for a SIEF. I will illustrate the requirements by giving a proposal for a SIEF which is a slight extension of TA++. TA++ is the format developed at the University of Ottawa; it adds a specific scheme to the Tuple-Attribute (TA) language developed by Ric Holt's group at the University of Toronto. The '++' suffix indicates that we have provided a specific scheme which all tools using TA++ must understand.
The main problem is that there are many tools for manipulating information about software systems when performing reverse engineering or program understanding. However, there are several different languages for storing that information. Each toolset has its own advantages and specialized functions: in particular, each toolset can parse and process code written in a certain set of programming languages, and perform its own types of analyses. It would be nice if I could take data extracted by one tool, and use other tools to analyze that data (and vice-versa). To do that currently, one has to develop a parser for every language one wishes to analyze. Furthermore, one has to reparse the code for different analysis tools -- often a non-trivial process.
Following the development of a standard SIEF, it will be possible to create a public repository of SIEF data representing a wide variety of systems (software guinea pigs). Research groups can then perform scientific research into program understanding, without having to write their own parsing infrastructure. In particular, researchers will be able to validate their findings by running experiments on data from many different systems.
There are alternatives to the standard proposed herein. One alternative is to use the CASE Data Interchange Format (CDIF) or the XMI format developed by IBM. However, these formats contain a large amount of analysis-level information that we do not need.
a) Providing a representation of high-level architectural information about very large software systems, in particular large legacy systems. Note: This contrasts with an alternative objective of storing complete detail about what each line of code represents.
b) Providing a medium of exchange of data among heterogeneous parsers and program understanding tools.
c) Ease of processability. Rationale: To allow efficient transfer of data and computation with the data. Rationale: If processing the data is not easy, then using SIEF does not have much of an advantage over re-parsing the original source code on demand. SIEF data should contain high-level abstractions that have been extracted from code.
a) The SIEF data should be stored in simple ASCII. Rationale: This will allow: a) Human readability on all platforms; b) Processing by all parser architectures. The only possible drawback is potential problems representing the names of software objects that are written in non-ASCII character sets. The solution to this is to allow encoding of such names into ASCII.
b) The SIEF data syntax should not involve complex nested structures (e.g. nested parentheses). Rationale: It is important to make the data as easy to process as possible. One design alternative would involve many nested substructures (perhaps even approaching the level of detail of parse trees); such detailed data defeats the ease-of-processing objective. Implication: Simple relations with attributes would seem appropriate.
c) There should be an object-oriented hierarchy of classes (types) of 'software objects', with inheritance of associations (relations) and attributes from superclasses to subclasses. Rationale: There are many classes of objects that one might expect to find in a SIEF -- we have found that these fall into a natural hierarchy. Ensuring that they are organized into a hierarchy will facilitate their use by object-oriented software and will add power to the representation. Note: See section 5 for more specific requirements.
d) The object-oriented hierarchy should have no multiple-inheritance. Rationale: Makes it easier to implement systems in languages that do not support multiple inheritance.
e) SIEF data should be stored as files; one SIEF data file per input source file. There will need to be a way to uniquely refer to objects from file to file. Rationale: a) The alternative, generating one large file, might result in large, unwieldy results. b) This permits program understanding tools to work with subsets of files that represent subsystems.
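Requirement a) above leaves open how non-ASCII names would be encoded into ASCII. As one possible illustration only, the following Python sketch uses percent-encoding of the UTF-8 bytes of a name. The choice of percent-encoding and the function names encode_name and decode_name are my assumptions for this example; they are not part of TA or of this proposal.

```python
from urllib.parse import quote, unquote


def encode_name(name: str) -> str:
    """Encode a software-object name into plain ASCII.

    Assumption: percent-encoding of the UTF-8 bytes. Letters, digits and
    the characters . _ - ~ / are left as they are; everything else,
    including whitespace and '!', is escaped.
    """
    return quote(name, safe="._-/")


def decode_name(encoded: str) -> str:
    """Recover the original name from its ASCII encoding."""
    return unquote(encoded)
```

A side effect of this particular choice is that whitespace and the '!' character are also escaped, so an encoded name can never be confused with a field separator in the data.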
a) All systems using the SIEF should be able to read SIEF data that conforms to a basic specification derived from these requirements.
b) There will be a 'core' set of SIEF classes, associations and attributes that all SIEF programs will be able to understand, and that will be generated by all conforming parsers from all programming languages where it makes sense. Rationale: To ensure that we can run experiments we want to be confident that certain basic information about any system is available. Note: If a class such as 'ClassSource' is in the core, this requirement does not imply that instances of 'ClassSource' would be generated from a non-object-oriented system. However one would expect that instances of 'Routine' or one of its subclasses would be found in any system. Note: Section 5 provides more detailed requirements.
c) There will be an 'auxiliary' set of classes, associations and attributes that, while considered standard, do not have to be produced by all parsers, even if the information is present in the source code. Rationale: This permits smaller, simpler parsers and systems. At the same time, it permits more advanced uses of the SIEF.
d) The syntax of the SIEF should permit parsers or other programs to generate SIEF data with new (non-standard) classes, relations or attributes without affecting programs which read the SIEF. These will be called 'extensions'.
e) Programs using the SIEF should be able to recognize and ignore classes and attributes that are auxiliary or extensions.
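To make the last requirement concrete, here is a minimal Python sketch of a reader that keeps only the instances whose class it knows and drops every link that touches an ignored object. It assumes one three-token fact per line, in the style of the example data at the end of this document, and names without embedded whitespace. The function name filter_facts and the idea of passing in a set of known classes are assumptions made for this illustration.

```python
def filter_facts(lines, known_classes):
    """Keep $INSTANCE facts of known classes and the links between them.

    Assumption: each fact is one line of three whitespace-separated
    tokens, either '$INSTANCE object class' or 'relation object object'.
    """
    instances = {}  # object name -> class name
    links = []      # (relation, source object, target object)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a tuple fact (e.g. a section header)
        head, first, second = parts
        if head == "$INSTANCE":
            if second in known_classes:
                instances[first] = second
            # instances of unknown classes are silently ignored
        else:
            links.append((head, first, second))
    # A link is only meaningful if both of its ends were kept.
    kept_links = [l for l in links if l[1] in instances and l[2] in instances]
    return instances, kept_links
```

A tool that understands only the core classes would then skip, for example, the SpecialCodeExistence instances in the example data, since that class is not part of the scheme given in this document.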
a) The SIEF should be able to represent software systems where there are multiple items (e.g. routines, variables) with the same name. The SIEF should be able to distinguish the different definitions of items with the same name. Rationale: 1) There can be many items with the same name in different scopes. 2) There can be items with the same name in the same scope, but guarded by different preprocessor directives.
b) The SIEF does not need to contain all the information required to resolve the scope of references to items with the same name: The SIEF must be able to represent 'reference existences' or 'unresolved references'. Rationale: 1) Fully resolving all references within scope may be too expensive or require too much storage space. 2) It may be desirable to store information about a partial or incomplete system. 3) References to elements from external libraries might not be easily resolvable. 4) Many program comprehension tasks can be performed without full resolution of scope and references. None of this precludes extending the SIEF to allow for resolution of references and scope.
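The following Python sketch shows how a tool reading the SIEF might later narrow down an unresolved reference: it matches each call existence against all routine definitions with the same name and returns every candidate, without attempting scope resolution. It relies on the 'identifier!name' convention for object names used in the example data at the end of this document. The function names and the identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def split_object_name(obj: str):
    """Split 'identifier!name' into (identifier, name)."""
    ident, _, name = obj.partition("!")
    return ident, name


def candidate_targets(call_existences, routine_sources):
    """Map each call existence to all routine definitions sharing its name.

    No scope resolution is attempted: several candidates, or none at all,
    are both legitimate outcomes.
    """
    by_name = defaultdict(list)
    for routine in routine_sources:
        by_name[split_object_name(routine)[1]].append(routine)
    return {
        call: by_name.get(split_object_name(call)[1], [])
        for call in call_existences
    }
```

For instance, given the calls 186!list and 253!tabstrt and two routine definitions both named list, the first call maps to both definitions and the second maps to an empty list, which a tool can treat as a reference to an external or unknown routine.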
a) The core classes and attributes should be able to represent architectural information present in:
b) The above does not imply that conforming systems have to have parsers for any of these languages.
c) The SIEF should be capable of handling systems composed of multiple source languages.
d) The SIEF should not impose restrictions on the naming conventions for software objects (file names, directory names etc.).
e) The SIEF should, in general, represent information about code as programmers would see it. In particular, it should be capable of containing information about un-preprocessed code.
f) The SIEF should be able to store (as auxiliary information, not core information) basic data about the system it represents, including (where available from the original system):
g) References to software objects are given as either:
h) Core classes whose instances should be generated by any conforming parser, where found in input data, include the following. Each is followed by a line of TA syntax.
SCHEME TUPLE :
SoftwareObject: Ultimate abstract superclass of all SIEF classes
$INHERIT SoftwareObject $ENTITY
SourceUnit: Abstract class representing any block of text containing source.
$INHERIT SourceUnit SoftwareObject
SourceFile: Representing discrete files.
$INHERIT SourceFile SourceUnit
SourceWithinFile: Abstract subclass of SourceUnit. Represents units of source code text that are found within files, but which are considered discrete architectural entities.
$INHERIT SourceWithinFile SourceUnit
ClassSource: Representing the source code of any class present in the system.
$INHERIT ClassSource SourceWithinFile
RoutineSource: Representing the source code of any routine. Whether the routine is called a function, procedure, subroutine, method etc. depends on the source language -- in the SIEF this distinction is not explicit. Specific attributes and associations will provide information about such distinctions.
$INHERIT RoutineSource SourceWithinFile
Definition: Abstract superclass of classes that represent definitions found in the source code. See requirement 4a.
$INHERIT Definition SoftwareObject
TypedDefinition: Abstract superclass of definitions that have a type.
$INHERIT TypedDefinition Definition
StandaloneDefinition: Abstract superclass of typed definitions that can be independently referred to by other elements in the source code, as opposed to elements, like fields, that cannot be independently referred to.
$INHERIT StandaloneDefinition TypedDefinition
TypeDef: Represents definitions of types.
$INHERIT TypeDef StandaloneDefinition
DatumDef: Represents definitions of global variables. It is not necessary for definitions of local variables to be generated.
$INHERIT DatumDef StandaloneDefinition
ReferenceExistence: Abstract superclass of classes that represent unspecified or unresolved references found in a SourceUnit. See requirement 4b.
$INHERIT ReferenceExistence SoftwareObject
FileInclusionExistence: Represents the inclusion in a file of another file, which may or may not exist or be known.
$INHERIT FileInclusionExistence ReferenceExistence
DataUseExistence: Represents a reference in a SourceUnit to data which may or may not exist or be known. Only references to global data are required to be generated. Where more than one reference to the same data occurs, only one DataUseExistence is generated. Note that a user of the SIEF might process the information to conclude that the DataUseExistence actually will be bound to a particular StandaloneDefinition or even to a SourceWithinFile (class or routine) that is being referred to via a variable.
$INHERIT DataUseExistence ReferenceExistence
RoutineCallExistence: Represents a call in a SourceUnit to a routine which may or may not exist or be known. Where more than one call to the same routine occurs, only one RoutineCallExistence is generated. Note that a user of the SIEF might process the information to conclude that the call is being made to a particular RoutineSource. Alternatively, the call may be indirect, via a variable that changes at run time and contains a reference to a RoutineSource.
$INHERIT RoutineCallExistence ReferenceExistence
TypeUseExistence: Represents a use in a SourceUnit of a type which may or may not exist or be known. Where more than one use of the same type occurs, only one TypeUseExistence is generated. Note that a user of the SIEF might process the information to conclude that the TypeUseExistence refers to a particular TypeDef or ClassSource.
$INHERIT TypeUseExistence DataUseExistence
i) Core associations whose links should be generated by any conforming parser, where found in the input data, include the following. Each relation is shown in TA notation, with the related classes following. It is believed that the relations are self explanatory.
potentiallyIncludedIn FileInclusionExistence SourceUnit
definedBy StandaloneDefinition SourceUnit
containingSource SourceWithinFile SourceUnit
- Note that since this association is inherited, routines can contain other routines, classes can contain routines (methods) etc.
potentiallyCalledBy RoutineCallExistence RoutineSource
usedInSource DataUseExistence SourceUnit
ofType TypedDefinition TypeUseExistence
j) Core attributes whose occurrences should be generated by any conforming parser, where found in the input data, include:
SourceFile { path }
- The path identifies the file within the directory structure.
SourceWithinFile { startChar endChar }
StandaloneDefinition { startChar endChar }
- Identify the start and end character within the file, at which the source code or definition is found. Note that line number is not enough because some items might start in the middle of lines.
k) Auxiliary classes that, when instances are present in SIEF files, will be interpreted in a consistent manner by all interested tools, include:
Subsystem can be used to represent packages, directories etc.
$INHERIT Subsystem SoftwareObject
CommentTermExistence can be used to represent important words or phrases found in comments.
$INHERIT CommentTermExistence ReferenceExistence
ManifestConstExistence can be used to represent character strings, numbers etc. that are explicit in the source code.
$INHERIT ManifestConstExistence DataUseExistence
EnumerationConst can be used to represent values that are defined as part of an enumerated type.
$INHERIT EnumerationConst Definition
Field can be used to represent the components of a record type.
$INHERIT Field TypedDefinition
Two specific subclasses of TypeDef:
$INHERIT RecordTypeDef TypeDef
$INHERIT EnumeratedTypeDef TypeDef
l) Auxiliary associations that, when links are present in SIEF files, will be interpreted in a consistent manner by all interested tools, include:
declaredAsFormalArgsIn DatumDef RoutineSource
returnType RoutineSource TypeUseExistence
foundInSource CommentTermExistence SourceUnit
isEnumerationMemberOf EnumerationConst EnumeratedTypeDef
isFieldMemberOf Field RecordTypeDef
isMemberOf SourceFile Subsystem
m) Auxiliary attributes that, when occurrences are present in SIEF files, will be interpreted in a consistent manner by all interested tools, include:
SourceFile { version dateChanged size }
- The version is an arbitrary string
- The dateChanged is in format yyyymmdd
- The size is in characters.
RoutineSource { isClassMethod }
- For a method, flags that it is a class method (static in C++/Java)
RoutineSource { visibility }
- Language-dependent string (e.g. "public").
DatumDef { isConst }
- Flags whether it is a constant
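As a small illustration of how a tool might interpret two of the attributes above, the following Python sketch converts a dateChanged value in yyyymmdd format into a date, and uses startChar and endChar to cut the corresponding text out of a file's contents. The requirements do not say whether character offsets are zero-based or whether endChar is inclusive; the sketch assumes zero-based offsets with an exclusive end, and both function names are hypothetical.

```python
from datetime import date, datetime


def parse_date_changed(value: str) -> date:
    """Interpret a dateChanged attribute given in yyyymmdd format."""
    return datetime.strptime(value, "%Y%m%d").date()


def extract_span(text: str, start_char: int, end_char: int) -> str:
    """Return the source text of an object from its character offsets.

    Assumption: offsets are zero-based and end_char is exclusive.
    A different convention would need an adjustment of +/- 1 here.
    """
    return text[start_char:end_char]
```

In the example data at the end of this document, ftf_tbl has startChar = 231 and endChar = 238, a difference of 7, which matches the length of the name and is at least consistent with an exclusive end offset.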
The TA syntax should be used. This has a good balance of simplicity (few syntactic constructs) and compactness of the resulting data.
We could use RIGI, developed at the University of Victoria by Hausi A. Müller's group; however, the attribute extensions of TA make TA seem a better choice.
Another alternative is XML, but it would be more verbose since there would be many more tags etc. in the resulting data. Using TA would result in a more specialized format than if we used XML, but it would probably be more efficient.
The following is the complete TA scheme we propose. This is the metadata. For those who don't know TA,
SCHEME TUPLE :
$INHERIT SoftwareObject $ENTITY
$INHERIT SourceUnit SoftwareObject
$INHERIT Definition SoftwareObject
$INHERIT ReferenceExistence SoftwareObject
$INHERIT Subsystem SoftwareObject
$INHERIT SourceFile SourceUnit
$INHERIT SourceWithinFile SourceUnit
$INHERIT RoutineSource SourceWithinFile
$INHERIT ClassSource SourceWithinFile
$INHERIT TypedDefinition Definition
$INHERIT EnumerationConst Definition
$INHERIT StandaloneDefinition TypedDefinition
$INHERIT Field TypedDefinition
$INHERIT TypeDef StandaloneDefinition
$INHERIT DatumDef StandaloneDefinition
$INHERIT RecordTypeDef TypeDef
$INHERIT EnumeratedTypeDef TypeDef
$INHERIT CommentTermExistence ReferenceExistence
$INHERIT FileInclusionExistence ReferenceExistence
$INHERIT DataUseExistence ReferenceExistence
$INHERIT RoutineCallExistence ReferenceExistence
$INHERIT ManifestConstExistence DataUseExistence
$INHERIT TypeUseExistence DataUseExistence
potentiallyIncludedIn FileInclusionExistence SourceUnit
definedBy StandaloneDefinition SourceUnit
containingSource SourceWithinFile SourceUnit
declaredAsFormalArgsIn DatumDef RoutineSource
returnType RoutineSource TypeUseExistence
potentiallyCalledBy RoutineCallExistence RoutineSource
foundInSource CommentTermExistence SourceUnit
usedInSource DataUseExistence SourceUnit
ofType TypedDefinition TypeUseExistence
isEnumerationMemberOf EnumerationConst EnumeratedTypeDef
isFieldMemberOf Field RecordTypeDef
isMemberOf SourceFile Subsystem
SCHEME ATTRIBUTE :
SourceFile { path version dateChanged size }
SourceWithinFile { startChar endChar }
RoutineSource { isClassMethod, visibility }
StandaloneDefinition { startChar endChar }
DatumDef { isConst }
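The single-inheritance hierarchy in this scheme is simple to process. The Python sketch below reads the $INHERIT statements (assumed here to be supplied one per line), rejects a scheme that gives a class two different superclasses, and walks the chain of superclasses so that a tool can ask whether one class is a kind of another. The function names parse_inherits, ancestors and is_a are assumptions for this illustration, not part of TA.

```python
def parse_inherits(scheme_lines):
    """Build a child -> superclass map from '$INHERIT child super' lines."""
    parent = {}
    for line in scheme_lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "$INHERIT":
            _, child, sup = parts
            if child in parent and parent[child] != sup:
                # Requirement 2d: no multiple inheritance.
                raise ValueError(f"multiple inheritance for {child}")
            parent[child] = sup
    return parent


def ancestors(cls, parent):
    """Return the superclasses of cls, nearest first."""
    chain = []
    while cls in parent:
        cls = parent[cls]
        if cls in chain:
            raise ValueError("inheritance cycle")
        chain.append(cls)
    return chain


def is_a(cls, sup, parent):
    """True if cls is sup or inherits from it, directly or indirectly."""
    return cls == sup or sup in ancestors(cls, parent)
```

With the scheme above, is_a("TypeUseExistence", "ReferenceExistence", parent) holds, so a link declared on ReferenceExistence or any of its superclasses also applies to instances of TypeUseExistence.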
The following file, called ftfmctab.asm.ta, describes source code for a file in assembly language called ftfmctab.asm. It has been slightly fictionalized to protect confidentiality of information.
One thing to note: All names of software objects are composed of a unique object identifier followed by a '!' followed by the name of the object. This helps us keep track of the many objects which have the same name, and also makes processing more efficient.
FACT TUPLE:
$INSTANCE 0!ftfmctab.asm SourceFile
$INSTANCE 81!ftfmctab SpecialCodeExistence
foundInSource 81!ftfmctab 0!ftfmctab.asm
$INSTANCE 171!tabmac.inc FileInclusionExistence
potentiallyIncludedIn 171!tabmac.inc 0!ftfmctab.asm
$INSTANCE 186!list RoutineCallExistence
potentiallyCalledBy 186!list 0!ftfmctab.asm
$INSTANCE 193!ftfmctab SpecialCodeExistence
foundInSource 193!ftfmctab 0!ftfmctab.asm
$INSTANCE 261!ftf_tbl DataUseExistence
usedInSource 261!ftf_tbl 0!ftfmctab.asm
$INSTANCE 253!tabstrt RoutineCallExistence
potentiallyCalledBy 253!tabstrt 0!ftfmctab.asm
$INSTANCE 272!tabadd RoutineCallExistence
potentiallyCalledBy 272!tabadd 0!ftfmctab.asm
$INSTANCE 292!tstart RoutineCallExistence
potentiallyCalledBy 292!tstart 0!ftfmctab.asm
$INSTANCE 307!chartab RoutineCallExistence
potentiallyCalledBy 307!chartab 0!ftfmctab.asm
$INSTANCE 231!ftf_tbl DatumDef
definedBy 231!ftf_tbl 0!ftfmctab.asm
$INSTANCE 231!ASSEMBLER_TYPE TypeUseExistence
type 231!ftf_tbl 231!ASSEMBLER_TYPE
usedInSource 231!ASSEMBLER_TYPE 0!ftfmctab.asm
FACT ATTRIBUTE:
0!ftfmctab.asm { dateChanged = 19970101 path = /usr/bigsystem/version4/source/ size = 329 version = 1}
231!ftf_tbl { endChar = 238 isConst = 0 startChar = 231}
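To give a feel for how little machinery is needed to read such data, here is a minimal Python sketch that extracts the FACT ATTRIBUTE records into a dictionary. It is not a full TA parser: it assumes records of the form 'object { name = value ... }' with values that contain neither whitespace nor braces, as in the example above, and it leaves all values as strings. The function name parse_fact_attributes is hypothetical.

```python
import re

# One record: an object name followed by a brace-delimited attribute list.
ATTR_RECORD = re.compile(r"(\S+)\s*\{([^}]*)\}")
# One attribute inside the braces: name = value (value has no whitespace).
ATTR_PAIR = re.compile(r"(\w+)\s*=\s*(\S+)")


def parse_fact_attributes(text):
    """Return {object name: {attribute name: value string}}.

    Assumption: 'text' is the part of the file that follows the
    'FACT ATTRIBUTE:' header.
    """
    result = {}
    for obj, body in ATTR_RECORD.findall(text):
        result[obj] = dict(ATTR_PAIR.findall(body))
    return result
```

Applied to the two attribute records above, this yields an entry for 0!ftfmctab.asm with the keys dateChanged, path, size and version, and an entry for 231!ftf_tbl with the keys endChar, isConst and startChar. Converting the strings to numbers or dates is left to the consuming tool.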