Requirements and Proposal for a Software Information Exchange Format (SIEF) Standard

Draft: 21 November, 1998

Introduction

The following is prepared in anticipation of the CASCON meetings where we will discuss exchange formats for information about software systems. I will use the acronym SIEF to represent such formats in general.

After presenting the general problem, I will outline requirements for an SIEF. I will illustrate the requirements by giving a proposal for an SIEF that is a slight extension of TA++. TA++ is the format developed at the University of Ottawa; it adds a specific scheme to the Tuple-Attribute (TA) language developed by Ric Holt's group at the University of Toronto. The '++' suffix indicates that we have provided a specific scheme which all tools using TA++ must understand.

What is the problem we are trying to solve?

The main problem is that there are many tools for manipulating information about software systems when performing reverse engineering or program understanding. However, there are several different languages for storing that information. Each toolset has its own advantages and specialized functions: in particular, each toolset can parse and process code written in a certain set of programming languages, and perform its own types of analyses. It would be nice if I could take data extracted by one tool, and use other tools to analyze that data (and vice-versa). To do that currently, one has to develop a parser for every language one wishes to analyze. Furthermore, one has to reparse the code for different analysis tools -- often a non-trivial process.

Following the development of a standard SIEF, it will be possible to create a public repository of SIEF data representing a wide variety of systems (software guinea pigs). Research groups can then perform scientific research into program understanding, without having to write their own parsing infrastructure. In particular, researchers will be able to validate their findings by running experiments on data from many different systems.

There are alternatives to the standard proposed herein. One alternative is to use the CASE Data Interchange Format (CDIF) or the XMI format developed by IBM. However, these formats contain a large amount of analysis-level information that we do not need.

Requirements for an SIEF

1. Main objectives behind the SIEF

a) Providing a representation of high-level architectural information about very large software systems, in particular large legacy systems. Note: This contrasts with an alternative objective of storing complete detail about what each line of code represents.

b) Providing a medium of exchange of data among heterogeneous parsers and program understanding tools.

c) Ease of processability. Rationale: To allow efficient transfer of data and computation with the data. If processing the data is not easy, then using SIEF does not have much of an advantage over re-parsing the original source code on demand. SIEF data should contain high-level abstractions that have been extracted from code.

2. General characteristics of SIEF data

a) The SIEF data should be stored in simple ASCII. Rationale: This will allow: a) Human readability on all platforms; b) Processing by all parser architectures. The only possible drawback is potential problems representing the names of software objects that are written in non-ASCII character sets. The solution to this is to allow encoding of such names into ASCII.

b) The SIEF data syntax should not involve complex nested structures (e.g. nested parentheses). Rationale: It is important to make the data as easy to process as possible. One design alternative would involve many nested substructures (perhaps even approaching the level of detail of parse trees); such detailed data defeats the ease-of-processing objective. Implication: Simple relations with attributes would seem appropriate.

c) There should be an object-oriented hierarchy of classes (types) of 'software objects', with inheritance of associations (relations) and attributes from superclasses to subclasses. Rationale: There are many classes of objects that one might expect to find in an SIEF -- we have found that these fall into a natural hierarchy. Ensuring that they are organized into a hierarchy will facilitate their use by object oriented software and will add power to the representation. Note: See section 5 for more specific requirements.

d) The object-oriented hierarchy should have no multiple-inheritance. Rationale: Makes it easier to implement systems in languages that do not support multiple inheritance.

e) SIEF data should be stored as files: one SIEF data file per input source file. There will need to be a way to uniquely refer to objects from file to file. Rationale: a) The alternative, generating one large file, might result in large, unwieldy results. b) This permits program understanding tools to work with subsets of files that represent subsystems.
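Requirement 2a calls for encoding non-ASCII names into ASCII but does not fix a scheme. The following Python sketch shows one possibility, percent-encoding of the UTF-8 bytes; the choice of percent-encoding and the function names encode_name and decode_name are assumptions made for illustration, not part of TA or TA++.

```python
from urllib.parse import quote, unquote


def encode_name(name: str) -> str:
    """Percent-encode a name so that the result is pure ASCII."""
    # safe="" also escapes '/', '!' and spaces, so an encoded name cannot be
    # confused with an 'identifier!name' separator or with field whitespace.
    return quote(name, safe="")


def decode_name(encoded: str) -> str:
    """Recover the original name from its ASCII form."""
    return unquote(encoded)


original = "größe_berechnen"
ascii_form = encode_name(original)
print(ascii_form)
print(decode_name(ascii_form) == original)
```

A scheme of this kind keeps the data readable for ASCII-only names, which pass through unchanged, while remaining reversible for all others.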

3. Conformity and Extensibility

a) All systems using the SIEF should be able to read SIEF data that conforms to a basic specification derived from these requirements.

b) There will be a 'core' set of SIEF classes, associations and attributes that all SIEF programs will be able to understand, and that will be generated by all conforming parsers from all programming languages where it makes sense. Rationale: To ensure that we can run experiments we want to be confident that certain basic information about any system is available. Note: If a class such as 'ClassSource' is in the core, this requirement does not imply that instances of 'ClassSource' would be generated from a non-object-oriented system. However one would expect that instances of 'Routine' or one of its subclasses would be found in any system. Note: Section 5 provides more detailed requirements.

c) There will be an 'auxiliary' set of classes, associations and attributes that, while considered standard, do not have to be produced by all parsers, even if the information is present in the source code. Rationale: This permits smaller, simpler parsers and systems. At the same time, it permits more advanced uses of the SIEF.

d) The syntax of the SIEF should permit parsers or other programs to generate SIEF data with new (non-standard) classes, relations or attributes without affecting programs which read the SIEF. These will be called 'extensions'.

e) Programs using the SIEF should be able to recognize and ignore classes and attributes that are auxiliary or extensions.
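As an illustration of how a reader might ignore classes it does not understand, here is a minimal Python sketch. It assumes the three-field fact lines shown in Appendix 2, assumes that names contain no whitespace, and assumes that each $INSTANCE line precedes the links that mention the object. The set KNOWN_CLASSES and the function filter_known are hypothetical names.

```python
# Hypothetical subset of classes that a particular tool understands.
KNOWN_CLASSES = {"SourceFile", "RoutineCallExistence", "DataUseExistence"}


def filter_known(fact_lines, known_classes=KNOWN_CLASSES):
    """Keep $INSTANCE lines of known classes and links whose objects were all kept."""
    kept_objects = set()
    result = []
    for line in fact_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a three-field fact line; ignored in this sketch
        first, second, third = parts
        if first == "$INSTANCE":
            if third in known_classes:
                kept_objects.add(second)
                result.append(line)
        elif second in kept_objects and third in kept_objects:
            result.append(line)  # a link between two objects the tool understands
    return result


facts = [
    "$INSTANCE 0!ftfmctab.asm SourceFile",
    "$INSTANCE 81!ftfmctab SpecialCodeExistence",
    "foundInSource 81!ftfmctab 0!ftfmctab.asm",
    "$INSTANCE 186!list RoutineCallExistence",
    "potentiallyCalledBy 186!list 0!ftfmctab.asm",
]
for line in filter_known(facts):
    print(line)
```

Here the SpecialCodeExistence instance and the link that mentions it are dropped silently, while the remaining facts pass through unchanged.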

4. Duplicate Names, Scoping, Definitions and References

a) The SIEF should be able to represent software systems where there are multiple items (e.g. routines, variables) with the same name. The SIEF should be able to distinguish the different definitions of items with the same name. Rationale: 1) There can be many items with the same name in different scopes. 2) There can be items with the same name in the same scope, but guarded by different preprocessor directives.

b) The SIEF does not need to contain all the information required to resolve the scope of references to items with the same name. The SIEF must be able to represent 'reference existences' or 'unresolved references'. Rationale: 1) Fully resolving all references within scope may be too expensive or require too much storage space. 2) It may be desirable to store information about a partial or incomplete system. 3) References to elements from external libraries might not be easily resolvable. 4) Many program comprehension tasks can be performed without full resolution of scope and references. None of this precludes extending the SIEF to allow for resolution of references and scope.
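To illustrate what a tool can still do with unresolved references, the following Python sketch matches a reference against definitions purely by name. It assumes the 'identifier!name' form used for object names in Appendix 2; the function candidate_definitions is a hypothetical helper, and a real tool would apply language-specific scope rules on top of it.

```python
def source_name(object_name: str) -> str:
    """Strip the leading 'identifier!' part, if any, from an object name."""
    return object_name.split("!", 1)[-1]


def candidate_definitions(reference: str, definitions: list) -> list:
    """Return every definition whose source-level name equals the reference's name."""
    wanted = source_name(reference)
    return [d for d in definitions if source_name(d) == wanted]


definitions = ["231!ftf_tbl", "512!ftf_tbl", "600!other_tbl"]
print(candidate_definitions("261!ftf_tbl", definitions))  # two candidates: ambiguous
print(candidate_definitions("186!list", definitions))     # none: stays unresolved
```

A reference may thus have zero, one or several candidates, which is why the SIEF records only that the reference exists.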

5. Specific Languages and Features that Should be Supported

a) The core classes and attributes should be able to represent architectural information present in:

i. Both object-oriented and non-object-oriented languages, in general.
ii. As a minimum: C, C++, Java, Pascal, COBOL, FORTRAN, Assembler. Note: The extensibility requirements should allow other languages to be represented, but this list is presented to guide development.

b) The above does not imply that conforming systems have to have parsers for any of these languages.

c) The SIEF should be capable of handling systems composed of multiple source languages.

d) The SIEF should not impose restrictions on the naming conventions for software objects (file names, directory names etc.)

e) The SIEF should, in general, represent information about code as programmers would see it. In particular, it should be capable of containing information about un-preprocessed code.

f) The SIEF should be able to store (as auxiliary information, not core information) basic data about the system it represents, including (where available from the original system):

i. Programming language version and / or compiler version.
ii. Software system version (including where stored)
iii. Dates of creation of files and of versions of files.

g) References to software objects are given as either:

i. The name as found in the source code, if it exists and is unique.
ii. A unique identifier followed by an exclamation mark followed by the name.
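The two reference forms in g) can be told apart mechanically. The Python sketch below splits a reference at the first exclamation mark; it assumes that unique identifiers never contain '!' themselves, and the names ObjectRef and parse_ref are hypothetical.

```python
from typing import NamedTuple, Optional


class ObjectRef(NamedTuple):
    """A reference to a software object, split into its two parts."""
    uid: Optional[str]  # unique identifier, or None for a bare name
    name: str           # name as found in the source code


def parse_ref(token: str) -> ObjectRef:
    """Split 'identifier!name' at the first '!'; a token without '!' is a bare name."""
    uid, sep, name = token.partition("!")
    if not sep:
        return ObjectRef(None, token)
    return ObjectRef(uid, name)


print(parse_ref("231!ftf_tbl"))
print(parse_ref("main"))
```

Splitting at the first '!' rather than the last keeps any exclamation marks that belong to the name itself.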

h) Core classes whose instances should be generated by any conforming parser, where found in input data, include the following. Each is followed by a line of TA syntax.

    SCHEME TUPLE :

SoftwareObject: Ultimate abstract superclass of all SIEF classes

    $INHERIT  SoftwareObject           $ENTITY

SourceUnit: Abstract class representing any block of text containing source.

    $INHERIT  SourceUnit               SoftwareObject

SourceFile: Representing discrete files.

    $INHERIT  SourceFile               SourceUnit

SourceWithinFile: Abstract subclass of SourceUnit. Represents units of source code text that are found within files, but which are considered discrete architectural entities.

    $INHERIT  SourceWithinFile         SourceUnit

ClassSource: Representing the source code of any class present in the system.

    $INHERIT  ClassSource              SourceWithinFile

RoutineSource: Representing the source code of any routine. Whether the routine is called a function, procedure, subroutine, method etc. depends on the source language -- in the SIEF this distinction is not explicit. Specific attributes and associations will provide information about such distinctions.

    $INHERIT  RoutineSource            SourceWithinFile

Definition: Abstract superclass of classes that represent definitions found in the source code. See requirement 4a.

    $INHERIT  Definition               SoftwareObject

TypedDefinition: Abstract superclass of definitions that have a type.

    $INHERIT  TypedDefinition          Definition

StandaloneDefinition: Abstract superclass of typed definitions that can be independently referred to by other elements in the source code, as opposed to elements, like fields, that cannot be independently referred to.

    $INHERIT  StandaloneDefinition     TypedDefinition

TypeDef: Represents definitions of types.

    $INHERIT  TypeDef                  StandaloneDefinition

DatumDef: Represents definitions of global variables. It is not necessary for definitions of local variables to be generated.

    $INHERIT  DatumDef                 StandaloneDefinition

ReferenceExistence: Abstract superclass of classes that represent unspecified or unresolved references found in a SourceUnit. See requirement 4b.

    $INHERIT  ReferenceExistence       SoftwareObject

FileInclusionExistence: Represents the inclusion in a file of another file, which may or may not exist or be known.

    $INHERIT  FileInclusionExistence   ReferenceExistence

DataUseExistence: Represents a reference in a SourceUnit to data which may or may not exist or be known. Only references to global data are required to be generated. Where more than one reference to the same data occurs, only one DataUseExistence is generated. Note that a user of the SIEF might process the information to conclude that the DataUseExistence actually will be bound to a particular StandaloneDefinition or even to a SourceWithinFile (class or routine) that is being referred to via a variable.

    $INHERIT  DataUseExistence         ReferenceExistence

RoutineCallExistence: Represents a call in a SourceUnit to a routine which may or may not exist or be known. Where more than one call to the same routine occurs, only one RoutineCallExistence is generated. Note that a user of the SIEF might process the information to conclude that the call is being made to a particular RoutineSource. Alternatively, the call may be indirect, through a variable that changes at run time and contains a reference to a RoutineSource.

    $INHERIT  RoutineCallExistence     ReferenceExistence

TypeUseExistence: Represents a use in a SourceUnit of a type which may or may not exist or be known. Where more than one use of the same type occurs, only one TypeUseExistence is generated. Note that a user of the SIEF might process the information to conclude that the TypeUseExistence refers to a particular TypeDef or ClassSource.

    $INHERIT  TypeUseExistence         DataUseExistence

i) Core associations whose links should be generated by any conforming parser, where found in the input data, include the following. Each relation is shown in TA notation, with the related classes following. It is believed that the relations are self explanatory.

    potentiallyIncludedIn  FileInclusionExistence SourceUnit

    definedBy              StandaloneDefinition   SourceUnit

    containingSource       SourceWithinFile       SourceUnit

- Note that since this association is inherited, routines can contain other routines, classes can contain routines (methods) etc.

    potentiallyCalledBy    RoutineCallExistence   RoutineSource

    usedInSource           DataUseExistence       SourceUnit

    ofType                 TypedDefinition        TypeUseExistence
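Because associations are inherited, a link is acceptable whenever the classes of its two ends are the declared classes or subclasses of them. A minimal Python sketch of such a check follows; the dictionaries reproduce only a fragment of the core scheme, and is_a and link_allowed are hypothetical helper names.

```python
# Fragment of the core hierarchy: class -> its single superclass.
PARENT = {
    "SourceUnit": "SoftwareObject",
    "SourceFile": "SourceUnit",
    "SourceWithinFile": "SourceUnit",
    "ClassSource": "SourceWithinFile",
    "RoutineSource": "SourceWithinFile",
}

# Association name -> (class of first argument, class of second argument).
ASSOCIATIONS = {"containingSource": ("SourceWithinFile", "SourceUnit")}


def is_a(cls, ancestor):
    """True if cls equals ancestor or inherits from it along the single-parent chain."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def link_allowed(relation, first_cls, second_cls):
    """Check both ends of a link against the declared classes, honouring inheritance."""
    first_decl, second_decl = ASSOCIATIONS[relation]
    return is_a(first_cls, first_decl) and is_a(second_cls, second_decl)


print(link_allowed("containingSource", "RoutineSource", "ClassSource"))
print(link_allowed("containingSource", "SourceFile", "RoutineSource"))
```

The first call is accepted because RoutineSource is a SourceWithinFile and ClassSource is a SourceUnit; the second is rejected because a SourceFile is not a SourceWithinFile.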

j) Core attributes whose occurrences should be generated by any conforming parser, where found in the input data, include:

    SourceFile             { path }

- The path identifies the file within the directory structure.

    SourceWithinFile       { startChar endChar }
    StandaloneDefinition   { startChar endChar }

- Identify the start and end character within the file, at which the source code or definition is found. Note that line number is not enough because some items might start in the middle of lines.
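The following Python sketch shows how a tool might recover the text of an object from its startChar and endChar attributes. The requirement does not say whether offsets start at 0 or 1, or whether endChar is inclusive; the sketch assumes 0-based offsets with an inclusive end, and extract_span is a hypothetical helper.

```python
def extract_span(file_text: str, start_char: int, end_char: int) -> str:
    """Return the characters start_char..end_char (0-based, end inclusive: an assumption)."""
    if not 0 <= start_char <= end_char < len(file_text):
        raise ValueError("span lies outside the file")
    return file_text[start_char:end_char + 1]


# Two definitions on one line: a line number alone could not separate them.
text = "int a; int b;\n"
print(extract_span(text, 0, 5))
print(extract_span(text, 7, 12))
```

Character offsets pick out each definition exactly, which is the reason the requirement prefers them to line numbers.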

k) Auxiliary classes that, when instances are present in SIEF files, will be interpreted in a consistent manner by all interested tools, include:

Subsystem can be used to represent packages, directories etc.

    $INHERIT  Subsystem                SoftwareObject

CommentTermExistence can be used to represent important words or phrases found in comments.

    $INHERIT  CommentTermExistence     ReferenceExistence

ManifestConstExistence can be used to represent character strings, numbers etc. that are explicit in the source code.

    $INHERIT  ManifestConstExistence   DataUseExistence

EnumerationConst can be used to represent values that are defined as part of an enumerated type.

    $INHERIT  EnumerationConst         Definition

Field can be used to represent the components of a record type.

    $INHERIT  Field                    TypedDefinition

Two specific subclasses of TypeDef:

    $INHERIT  RecordTypeDef            TypeDef
    $INHERIT  EnumeratedTypeDef        TypeDef

l) Auxiliary associations that, when links are present in SIEF files, will be interpreted in a consistent manner by all interested tools, include:

    declaredAsFormalArgsIn DatumDef               RoutineSource
    returnType             RoutineSource          TypeUseExistence
    foundInSource          CommentTermExistence   SourceUnit
    isEnumerationMemberOf  EnumerationConst       EnumeratedTypeDef
    isFieldMemberOf        Field                  RecordTypeDef
    isMemberOf             SourceFile             Subsystem

m) Auxiliary attributes that, when occurrences are present in SIEF files, will be interpreted in a consistent manner by all interested tools, include:

    SourceFile             { version dateChanged size }

- The version is an arbitrary string.

- The dateChanged is in format yyyymmdd

- The size is in characters.

    RoutineSource          { isClassMethod }

- For a method, flags that it is a class method (static in C++/Java)

    RoutineSource          { visibility }

- Language-dependent string (e.g. "public").

    DatumDef               { isConst }

- Flags whether it is a constant
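Attribute values arrive as plain text, so a reading tool has to convert them. The Python sketch below converts the three non-string kinds mentioned above. It assumes that flags are written as 0 or 1, as isConst is in Appendix 2; the function names are hypothetical.

```python
from datetime import date, datetime


def parse_date_changed(value: str) -> date:
    """Convert a dateChanged value in yyyymmdd form into a date."""
    return datetime.strptime(value, "%Y%m%d").date()


def parse_size(value: str) -> int:
    """Convert a size value (a count of characters) into an integer."""
    return int(value)


def parse_flag(value: str) -> bool:
    """Convert a flag such as isConst or isClassMethod; 0/1 encoding is an assumption."""
    if value not in ("0", "1"):
        raise ValueError(f"unexpected flag value: {value!r}")
    return value == "1"


print(parse_date_changed("19970101"))
print(parse_size("329"))
print(parse_flag("0"))
```

Rejecting unexpected flag values, rather than treating them as false, makes it easier to notice files written with a different convention.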

6. Syntax

The TA syntax should be used. This has a good balance of simplicity (few syntactic constructs) and compactness of the resulting data.

We could use RIGI, developed at the University of Victoria by Hausi A. Müller's group; however, the attribute extensions of TA make TA seem a better choice.

Another alternative is XML, but it would be more wordy since there would be many more tags etc. in the resulting data. Using TA would result in a more specialized format than if we used XML, but it would probably be more efficient.

APPENDIX 1. Complete TA++ scheme

The following is the complete TA scheme we propose. This is the metadata. For those who don't know TA:

- The $INHERIT lines describe the classes in the inheritance hierarchy.
- The lines that follow them describe other relations that can exist between classes.
- The lines following SCHEME ATTRIBUTE give the attributes which are expected to be found for instances of each class.
- The actual data, introduced by FACT TUPLE and FACT ATTRIBUTE, is illustrated in Appendix 2.

    SCHEME TUPLE :

    $INHERIT  SoftwareObject           $ENTITY
    $INHERIT  SourceUnit               SoftwareObject
    $INHERIT  Definition               SoftwareObject
    $INHERIT  ReferenceExistence       SoftwareObject
    $INHERIT  Subsystem                SoftwareObject
    $INHERIT  SourceFile               SourceUnit
    $INHERIT  SourceWithinFile         SourceUnit
    $INHERIT  RoutineSource            SourceWithinFile
    $INHERIT  ClassSource              SourceWithinFile
    $INHERIT  TypedDefinition          Definition
    $INHERIT  EnumerationConst         Definition
    $INHERIT  StandaloneDefinition     TypedDefinition
    $INHERIT  Field                    TypedDefinition
    $INHERIT  TypeDef                  StandaloneDefinition
    $INHERIT  DatumDef                 StandaloneDefinition
    $INHERIT  RecordTypeDef            TypeDef
    $INHERIT  EnumeratedTypeDef        TypeDef
    $INHERIT  CommentTermExistence     ReferenceExistence
    $INHERIT  FileInclusionExistence   ReferenceExistence 
    $INHERIT  DataUseExistence         ReferenceExistence
    $INHERIT  RoutineCallExistence     ReferenceExistence
    $INHERIT  ManifestConstExistence   DataUseExistence
    $INHERIT  TypeUseExistence         DataUseExistence 
    potentiallyIncludedIn  FileInclusionExistence SourceUnit
    definedBy              StandaloneDefinition   SourceUnit
    containingSource       SourceWithinFile       SourceUnit      
    declaredAsFormalArgsIn DatumDef               RoutineSource    
    returnType             RoutineSource          TypeUseExistence
    potentiallyCalledBy    RoutineCallExistence   RoutineSource   
    foundInSource          CommentTermExistence   SourceUnit
    usedInSource           DataUseExistence       SourceUnit
    ofType                 TypedDefinition        TypeUseExistence
    isEnumerationMemberOf  EnumerationConst       EnumeratedTypeDef 
    isFieldMemberOf        Field                  RecordTypeDef
    isMemberOf             SourceFile             Subsystem

    SCHEME ATTRIBUTE :
    
    SourceFile             { path version dateChanged size }
    SourceWithinFile       { startChar endChar }
    RoutineSource          { isClassMethod visibility }
    StandaloneDefinition   { startChar endChar }
    DatumDef               { isConst }
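As a sketch of how a tool might use this scheme, the Python code below reads $INHERIT lines into a child-to-parent map, rejects a second superclass for the same class (requirement 2d), and collects the attributes a class inherits. Only a fragment of the scheme is embedded, the attribute table is written out by hand rather than parsed, and the function names are hypothetical.

```python
SCHEME_TUPLE = """\
$INHERIT  SoftwareObject           $ENTITY
$INHERIT  SourceUnit               SoftwareObject
$INHERIT  SourceWithinFile         SourceUnit
$INHERIT  RoutineSource            SourceWithinFile
"""

# Attributes declared directly on each class (fragment of SCHEME ATTRIBUTE).
DECLARED = {
    "SourceWithinFile": ["startChar", "endChar"],
    "RoutineSource": ["isClassMethod", "visibility"],
}


def parse_inherit(text):
    """Build a child -> parent map; a second parent for one class is an error."""
    parents = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "$INHERIT":
            child, parent = parts[1], parts[2]
            if child in parents:
                raise ValueError(f"{child} has more than one superclass")
            parents[child] = parent
    return parents


def ancestors(cls, parents):
    """Return cls followed by its superclasses, nearest first (assumes no cycles)."""
    chain = [cls]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def all_attributes(cls, parents, declared):
    """Collect attributes from the most general superclass down to cls itself."""
    attrs = []
    for c in reversed(ancestors(cls, parents)):
        attrs.extend(declared.get(c, []))
    return attrs


parents = parse_inherit(SCHEME_TUPLE)
print(ancestors("RoutineSource", parents))
print(all_attributes("RoutineSource", parents, DECLARED))
```

Since each class has a single superclass, the ancestor chain is a simple list and no conflict-resolution rule is needed when attributes are inherited.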

APPENDIX 2: Example file

The following file, called ftfmctab.asm.ta, describes source code for a file in assembly language called ftfmctab.asm. It has been slightly fictionalized to protect confidentiality of information.

One thing to note: All names of software objects are composed of a unique object identifier followed by a '!' followed by the name of the object. This helps us keep track of the many objects which have the same name, and also makes processing more efficient.

FACT TUPLE:

$INSTANCE 0!ftfmctab.asm SourceFile
$INSTANCE 81!ftfmctab SpecialCodeExistence
foundInSource 81!ftfmctab 0!ftfmctab.asm
$INSTANCE 171!tabmac.inc FileInclusionExistence
potentiallyIncludedIn 171!tabmac.inc 0!ftfmctab.asm
$INSTANCE 186!list RoutineCallExistence
potentiallyCalledBy 186!list 0!ftfmctab.asm
$INSTANCE 193!ftfmctab SpecialCodeExistence
foundInSource 193!ftfmctab 0!ftfmctab.asm
$INSTANCE 261!ftf_tbl DataUseExistence
usedInSource 261!ftf_tbl 0!ftfmctab.asm
$INSTANCE 253!tabstrt RoutineCallExistence
potentiallyCalledBy 253!tabstrt 0!ftfmctab.asm
$INSTANCE 272!tabadd RoutineCallExistence
potentiallyCalledBy 272!tabadd 0!ftfmctab.asm
$INSTANCE 292!tstart RoutineCallExistence
potentiallyCalledBy 292!tstart 0!ftfmctab.asm
$INSTANCE 307!chartab RoutineCallExistence
potentiallyCalledBy 307!chartab 0!ftfmctab.asm
$INSTANCE 231!ftf_tbl DatumDef
definedBy 231!ftf_tbl 0!ftfmctab.asm
$INSTANCE 231!ASSEMBLER_TYPE TypeUseExistence
ofType 231!ftf_tbl 231!ASSEMBLER_TYPE
usedInSource 231!ASSEMBLER_TYPE 0!ftfmctab.asm
FACT ATTRIBUTE:

0!ftfmctab.asm { 
        dateChanged = 19970101
        path = /usr/bigsystem/version4/source/
        size = 329
        version = 1}
231!ftf_tbl { 
        endChar = 238
        isConst = 0
        startChar = 231}
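A file of this shape can be read with very little code. The following Python sketch is not a full TA parser: it assumes three whitespace-separated fields per fact line, attribute values without embedded spaces, and the section order shown above. The function parse_ta_facts is a hypothetical name.

```python
import re


def parse_ta_facts(text):
    """Split a TA++ data file into instances, links and attribute blocks."""
    instances = {}   # object name -> class name
    links = []       # (relation, first object, second object)
    attributes = {}  # object name -> {attribute name: value as text}
    tuple_part, _, attr_part = text.partition("FACT ATTRIBUTE")
    for line in tuple_part.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] == "FACT":
            continue  # blank line or section header
        if parts[0] == "$INSTANCE":
            instances[parts[1]] = parts[2]
        else:
            links.append(tuple(parts))
    # Each attribute block has the form: name { key = value ... }
    for name, body in re.findall(r"(\S+)\s*\{([^}]*)\}", attr_part):
        attributes[name] = dict(re.findall(r"(\S+)\s*=\s*(\S+)", body))
    return instances, links, attributes


sample = """FACT TUPLE:

$INSTANCE 0!ftfmctab.asm SourceFile
$INSTANCE 231!ftf_tbl DatumDef
definedBy 231!ftf_tbl 0!ftfmctab.asm
FACT ATTRIBUTE:

0!ftfmctab.asm {
        dateChanged = 19970101
        size = 329}
231!ftf_tbl {
        isConst = 0}
"""
instances, links, attributes = parse_ta_facts(sample)
print(instances)
print(links)
print(attributes)
```

That so short a reader suffices is a consequence of requirement 2b, which rules out nested structures.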

Acknowledgements

This work was sponsored by the Consortium for Software Engineering Research (CSER) and Mitel Corporation, and supported by NSERC. Nicolas Anquetil and others in the KBRE group were major contributors to the design of TA++.