Supporting Software Maintenance by Mining Software Update Records

 

By

Jelber Sayyad Shirabad

 

 

 

 

 

 

 

 

 

 

 

 

 

Thesis submitted to the Faculty of Graduate and Post-Doctoral Studies in partial fulfillment of the requirements for the degree of

 

Doctor of Philosophy in Computer Science

 

Ottawa-Carleton Institute for Computer Science

School of Information Technology and Engineering

University of Ottawa

Ottawa, Ontario, Canada

 


 

Jelber Sayyad Shirabad, May 2003



 

 

 

To my father, mother and sister

for their endless love and support

 

 

 

 

 

 

 

 

 

 



 

 

 

I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated.

Poul Anderson (1926-2001) Science Fiction Writer

 



Abstract

 

It is well known that maintenance is the most expensive stage of the software life cycle. Most large real world software systems consist of a very large number of source code files. Important knowledge about different aspects of a software system is embedded in a rich set of implicit relationships among these files. Those relationships are partly reflected in system documentation at its different levels, but more often than not are never made explicit and become part of the expertise of system maintainers. Finding existing relations between source code components is a difficult task, especially in the case of legacy systems.

When a maintenance programmer is looking at a piece of code in a source file, one of the important questions that he or she needs to answer is: which other files should I know about, i.e. what else might be relevant to this piece of code? This is an example of a more general Relevance Relation that maps a set of entities in a software system into a relevance value.

How can we discover and render explicit these relationships without looking over the shoulder of a programmer involved in a maintenance task? We turn to inductive methods that are capable of extracting structural patterns or models from data. They can learn concepts or models from experience observed in the past to predict outcomes of future unseen cases.

This thesis lies at the intersection of two research fields, an intersection that has been widely ignored by researchers in the machine learning and software engineering communities. It investigates the application of inductive methods to daily software maintenance at the source code level. Therefore in this thesis we first lay out the general idea of relevance among different entities in a software system. Then, using inductive learning methods and a variety of data sources used by maintenance programmers, we extract (i.e. learn) what we call a maintenance relevance relation among files in a large legacy system. In effect we learn from past maintenance experience, in the form of problem reports and update records, to be able to make predictions that are useful in future maintenance activities. This relation, which is called the Co-update relation, predicts whether updating one source file may require a change in another file.

To learn the Co-update relation we have performed a large number of experiments using syntactic features such as function calls or variable definitions. We have also performed experiments that use text based features, such as source code comments and problem reports, and combinations of these features. The results obtained show that while using syntactic features produces models with encouraging predictive power, using text based features yields highly accurate models, with precision and recall measures that make these models viable for use in a real world setting. As part of the contribution of this thesis we also report on challenges encountered in the process and the lessons learned.


Acknowledgments

 

I would like to express my deepest gratitude and thanks to my supervisors Dr. Stan Matwin and Dr. Timothy C. Lethbridge for imparting their knowledge and wisdom to me, and for their support, care, trust and friendship. I thank Tim for his very generous financial support through the years, which relieved the financial burden associated with my studies to a great extent. I would also like to thank Stan for his support over the last few years, which allowed me to stay focused on my research at a time when I had to face fluctuations in funds due to external factors. He also generously provided me with a new and more powerful PC in the last year of my study, which sped up running the experiments quite considerably and was instrumental in allowing me to include some of the more interesting results in this thesis.

Tim was the primary proofreader of my thesis and I would like to thank him for his thorough review of its many drafts. His feedback allowed me to present the thesis and our joint publications in a way that is understandable to an audience who may not necessarily have a background in machine learning. I am also grateful to Stan for the corrections and comments I received from him, especially on machine learning related topics, and for the many pointers he provided, which certainly helped me improve the quality of my research and this thesis. I have learned a lot from both my supervisors.

I would like to express my appreciation and thanks to the members of my defense committee: Dr. Eleni Stroulia, the external examiner from the University of Alberta, Dr. Lionel Briand of Carleton University, and Doctors Nathalie Japkowicz and Liam Payton from the University of Ottawa, for their insightful comments and suggestions to improve the quality of this thesis. I am grateful to Dr. Robert Holte for his comments on my thesis proposal and our follow-up discussions, which have greatly influenced my research.

Funding for this research was made available by the Consortium for Software Engineering Research (CSER), Mitel Networks, and the Natural Sciences and Engineering Research Council of Canada (NSERC). The University of Ottawa provided me with an admission scholarship and a tuition fee waiver for the first two years of my studies. This was a great help for me and I am very thankful to the university for it.

Many thanks to the members of the SX 2000 team, the support staff, and the research management at Mitel Networks, and in particular Steve Lyon, Steve Szeto, Soo Tung, Ian Duncan, Chris Halford, Peter Perry and Daisy Fung, for the assistance they provided.

Special thanks to Dr. Sylvia Boyd, who trusted me with teaching the Prolog Concepts Laboratory for the first time, and Dr. Lou Birta for his continuous encouragement and friendship.

Over the years I have received excellent software and hardware support from Keith White, Michel Racine, Marc Fortier, Roger Montcalm, and Alan Stewart. Their assistance is greatly appreciated.

My life as a student would have been more challenging if it wasn't for many good people who, over the years of my study at the University of Ottawa, consistently showed their friendship. In no particular order, I would like to thank my colleagues and friends Viviana Nastase, the Azizi family, Francisco Herrera, Gina Gonzalez, Rossana Andrade, Stephane Som, Alan Williams, Masahide Nakamura, Khalid Khidhir, Felipe Contreras, the Yeghikian family, the Zamora family, Nicolas Anquetil, Kathia Oliveira, Amy Yi, Chris Drummond, Artush Abolian, Jack Dadourian, Ana Shannette and Vahik Manookian for their help, caring and many pleasant memories over the years. A special thank you goes to Vivi for allowing me to use her computer at the department during those last days before submitting my thesis for the defense. I also thank her for the excellent work she did as my teaching assistant, as I have come to realize the impact a knowledgeable and reliable TA has on the time that I have to spend on teaching a course. Many thanks to my uncle and aunt, who are my family in Canada. Visiting them during Christmas and other occasions helped me to unwind and think less about my research. I have enjoyed many of the casual conversations we had around the dinner table. At this juncture, let me also thank everyone else who in one way or another has had a positive impact on my life and study, and whom I have omitted in this acknowledgment.

I still remember my mother's expression when I first told her that I was going to Canada to continue my studies. I remember my father's, mother's, and sister's faces as we said goodbye in the airport. I saw love then, and I have been blessed with their love and unwavering support during all these years that we have been apart. I feel most blessed for having them as my family, and from the bottom of my heart I thank them for their sacrifices and all they have done for me. The road to the PhD had many ups and downs for me. Now that I am at the end of this road, I feel very humbled by the whole experience, but I know my family is very proud of me. As a small token of my appreciation, I am dedicating this thesis to them.



Table of Contents

 

 

 

Abstract       ........................................................................................................................ i

Acknowledgments........................................................................................................... iii

Table of Contents........................................................................................................... vii

Index of Figures............................................................................................................... xi

Index of Tables.............................................................................................................. xiii

Chapter 1 Motivation........................................................................................................ 1

1.1 Legacy Software Systems and Software Maintenance.............................................. 1

1.2 Comprehending Legacy Systems............................................................................... 3

1.3 Knowledge Based Software Engineering vs. Application of Inductive Methods in Software Engineering.................................................................................................. 5

1.4 The Goal of the Thesis.............................................................................................. 6

1.5 Definitions and an Application................................................................................. 8

1.6 Contribution of this Thesis..................................................................................... 15

Chapter 2 Related Research.......................................................................................... 17

2.1 Introduction............................................................................................................. 17

2.2 AI and Software Maintenance................................................................................. 18

2.2.1 Knowledge Based Software Engineering........................................................... 19

2.2.2 Knowledge Based Systems Applied to Software Maintenance....................... 22

2.2.2.1 Code Generation (COGEN)...................................................................... 22

2.2.2.2 MACS....................................................................................................... 23

2.2.2.3 Problem Manager (PM)............................................................................. 24

2.2.2.4 Intelligent Program Editor (IPE)................................................................ 26

2.2.2.5 Intelligent Semantic Lisp-program Analyzer (ISLA)................................ 27

2.2.2.6 Maintainer's Assistant............................................................................... 27

2.2.2.7 Program Understanding Using Plan Recognition Algorithms.................... 28

2.2.2.8 Model Oriented Reengineering Process for HCI (MORPH)..................... 30

2.2.2.9 Programmer's Apprentice......................................................................... 31

2.2.2.10 Recognizer............................................................................................... 32

2.2.2.11 Program Analysis Tool (PAT)................................................................ 33

2.2.2.12 LASSIE.................................................................................................... 34

2.2.2.13 Knowledge-Based Assistant for Reverse Engineering (KARE).............. 35

2.2.2.14 Summary of Knowledge-Based Maintenance Support Systems............. 37

2.2.3 Inductive Approaches Applied to Software Maintenance............................... 38

2.2.4 Application of Inductive Learning in Software Maintenance........................... 39

2.2.4.1 Fault Density Prediction........................................................................... 39

2.2.4.2 Design Recovery........................................................................................ 40

2.2.4.3 Locating High-Risk Software modules...................................................... 40

2.2.4.4 Software Quality Prediction...................................................................... 42

2.2.4.5 Case-Base Reasoning and Inductive Logic Programming Applied to Software Reuse.......................................................................................... 43

2.2.4.6 Estimating Software Development Effort................................................. 44

2.2.5 Summary........................................................................................................... 45

Chapter 3 Methodology.................................................................................................. 47

3.1. Introduction............................................................................................................ 47

3.2. General Approach.................................................................................................. 49

3.2.1 Pre-Learning Stage............................................................................................ 51

3.2.1.1 Identifying Sources of Data and Knowledge............................................. 51

3.2.1.2 Acquire Data and Knowledge.................................................................... 53

3.2.1.2.1 Creating examples of Relevant and Not Relevant File Pairs............... 54

3.2.1.2.2 Finding Relevant and Not Relevant File Pairs.................................... 54

3.2.1.2.3 Heuristics Based on File Update Information.................................... 56

3.2.1.2.4 Features Used in Defining the Relevance Concept............................. 60

3.2.1.2.5 Features Extracted From the Source Code.......................................... 61

3.2.1.2.6 File Reference Attributes.................................................................... 63

3.2.1.2.7 Data Flow Attributes......................................................................... 65

3.2.1.2.8 Control Flow Attributes..................................................................... 66

3.2.1.2.9 File Name Related Attributes............................................................. 70

3.2.1.2.10 Text Based Attributes...................................................................... 72

3.2.1.3 Preprocessing............................................................................................. 73

3.2.2 Learning Stage................................................................................................... 73

3.2.2.1 Transformation and Encoding.................................................................... 73

3.2.2.2 Learning..................................................................................................... 74

3.2.2.3. Evaluating the Usefulness of the Learned Concept.................................. 75

3.2.2.4. Parameter Tuning...................................................................................... 79

3.2.3 Post Learning Stage........................................................................................... 79

3.3. Limitations of Our Research................................................................................... 80

Chapter 4 Experiments................................................................................................... 81

4.1 Background on the System Under Study................................................................ 81

4.2 Creating Training and Testing Data sets.................................................................. 82

4.3 Removing Class Noise............................................................................................. 82

4.4 Background on Experimental Setup......................................................................... 83

4.4.1 Experiments Using C5.0................................................................................... 84

4.4.2 Experiments With Set Covering Machines....................................................... 95

4.5 The Base Experiments........................................................................................... 101

4.5.1 Creating File Pair Data Sets for the Base Experiments................................... 102

4.5.2 Creating Training and Testing Data Sets for the Base Experiments............... 106

4.5.3 Repeating Relevant Examples......................................................................... 110

4.6 Analysis of the Decision Trees............................................................................. 113

4.7 Using the 1R Algorithm to Create Classifiers....................................................... 114

4.8 Removing Class Noise........................................................................................... 116

4.9 Discretizing Numeric Attributes........................................................................... 119

4.10 Using Text Based Features.................................................................................. 128

4.10.1 Source File Comment Words as Features..................................................... 132

4.10.2 Problem Report Words as Features.............................................................. 135

4.11 Combination of Features..................................................................................... 140

4.11.1 Combination of Syntactic and Source File Comment Word Features.......... 142

4.11.2 Combination of Syntactic and Problem Report Word Features................... 145

4.11.3 Combination of Source File Comment and Problem Report Features.......... 148

4.11.3.1 Juxtaposition of Source File Comment and Problem Report Features.. 148

4.11.3.2 Union of Source File Comment and Problem Report Features............. 151

4.12 Summary.............................................................................................................. 155

Chapter 5 Conclusion and Future Work.................................................................... 159

5.1 Summary and Concluding Remarks....................................................................... 159

5.2 Future Work.......................................................................................................... 165

Appendix A A Three-Class Learning Problem.................................................................. 169

A.1 Heuristics Based on User Interactions................................................................. 169

A.2 Collecting the Result of Software Engineers' Interactions.................................... 170

A.3 Cleaning the Log................................................................................................... 171

A.4 Dividing the Logs into Sessions............................................................................ 171

A.5 Extracting Pairs of Potentially Relevant Software Units...................................... 172

A.6 Example Label Conflict Resolution...................................................................... 173

A.7 Format of Log Files................................................................................................ 173

Appendix B An Alternative to the Hold Out Method................................................ 177

B.1 Basic Training and Testing File Pairs.................................................................... 178

B.2 Alternatives to Basic Training and Testing File Pairs........................................... 179

B.2.1 Alternative to the Training Set....................................................................... 179

B.3 Precision, Recall and F-Measure Comparisons.................................................... 181

B.4 A Hold Out Evaluation Approach........................................................................ 187

B.5 Summary............................................................................................................... 193

Appendix C Detailed Experiment Results................................................................. 195

Appendix D A Classifier Evaluation Questionnaire................................................ 207

D.1 A Sample Questionnaire....................................................................................... 208

D.2 Instructions Accompanying the Questionnaire.................................................... 211

Appendix E A Sample Decision Tree.......................................................................... 215

E.1 A Decision Tree Learned From Problem Report Feature Set............................... 215

References  .................................................................................................................... 227

 


 


Index of Figures

 

 

 

Figure 1.1    Continuous and Discrete Versions of a Relevance Relation R......................... 9

Figure 1.2    A System Relevance Graph and its Two System Relevance Subgraphs..... 10

Figure 3.1    Creating a Classifier from a Set of Examples................................................ 48

Figure 3.2    Classifying a New Case................................................................................ 49

Figure 3.3    The Process of Learning a Maintenance Relevance Relation....................... 50

Figure 3.4    The Process of Submitting an Update to Address a Problem Report.......... 53

Figure 3.5    File pairs, Features, and Examples............................................................... 55

Figure 3.6    Relation Between Relevant and Not Relevant Pairs.................................... 60

Figure 3.7    A Decision Tree........................................................................................... 75

Figure 3.8    A Confusion Matrix..................................................................................... 75

Figure 3.9    An ROC Curve............................................................................................. 79

Figure 4.1    ROC Comparison of Stratified Training Experiments................................. 94

Figure 4.2    Creating Training Sets with Different Class Ratios................................... 108

Figure 4.3    ROC Plots of Base Experiments 1 and 2................................................... 109

Figure 4.4    ROC Plots of Base Experiments 3 and 4................................................... 112

Figure 4.5    Comparing the Best ROC Plots when Relevant Pairs are and are not Repeated..................................................................................................... 112

Figure 4.6    ROC Plots of C5.0 and 1R for Base Experiment 3.................................... 115

Figure 4.7    ROC Plots of Class Noise Removal Experiments...................................... 116

Figure 4.8    Decision Tree Size Plots of Class Noise Removal Experiments................ 118

Figure 4.9    ROC Plots for Method 1........................................................................... 120

Figure 4.10  Decision Tree Size Plots of Method 1....................................................... 121

Figure 4.11  ROC Plots for Method 2........................................................................... 123

Figure 4.12  Decision Tree Size Plots of Method 2....................................................... 124

Figure 4.13  ROC Comparison of the Last Stages of Methods 1 and 2......................... 126

Figure 4.14  Decision Tree Size Plots of Methods 1 and 2............................................ 126

Figure 4.15  Creation of Bag of Words Vectors............................................................. 129

Figure 4.16  ROC Comparison of the File Comment and Syntactic Attribute Based Classifiers................................................................................................... 133

Figure 4.17  Creating Source File-Bag of Words Using Problem Reports...................... 135

Figure 4.18  ROC Comparison of the Problem Report and Syntactic Attribute Based Classifiers................................................................................................... 138

Figure 4.19  ROC Comparison of the File Comment Classifiers with and without Syntactic Attributes................................................................................... 143

Figure 4.20  ROC Comparison of the Problem Report Classifiers with and without Syntactic Attributes................................................................................... 146

Figure 4.21  ROC Comparison of the Problem Report Feature Classifiers with Juxtaposed Combination of Used Problem Report and Source File Comment Features Classifiers.................................................................................... 149

Figure 4.22  ROC Comparison of the Problem Report Feature Classifiers with the Union of Used Problem Report and Source File Comment Feature Classifiers.... 152

Figure 4.23  ROC Comparison of the Problem Report Feature Classifiers with the Union of Used Problem Report and Source File Comment Feature Classifiers.... 154

Figure B.1   Effect of Different Test Repositories on the Precision of the Relevant Class Using 1997-1998 G20 Training Sets............................................................... 181

Figure B.2   Effect of Different Test Repositories on the Recall of the Relevant Class Using 1997-1998 G20 Training Sets........................................................... 182

Figure B.3   Effect of Different Test Repositories on the F1 Measure of the Relevant Class Using 1997-1998 G20 Training Sets............................................................... 182

Figure B.4   Effect of Different Repositories on the Precision of the Relevant Class Using 1997-1998 NGSL Training Sets....................................................... 184

Figure B.5   Effect of Different Test Repositories on the Recall of the Relevant Class Using 1997-1998 NGSL Training Sets....................................................... 184

Figure B.6   Effect of Different Test Repositories on the F1 Measure of the Relevant Class Using 1997-1998 NGSL Training Sets........................................................... 185

Figure B.7   Comparison of the Best Precision Results in Figures B.1 and B.4............ 186

Figure B.8   Comparison of the best Recall Results in Figures B.2 and B.5.................. 186

Figure B.9   Comparison of the Best F1 Measure Results in Figures B.3 and B.6........ 187

Figure B.10 Comparison of the Best Precision Results in Figure B.7 and 2/3 - 1/3 Split.................................................................................................................... 189

Figure B.11 Comparison of the Best Recall Results in Figure B.8 and 2/3 - 1/3 Split.. 189

Figure B.12 Comparison of the Best F1 Measure Results in Figure B.9 and 2/3 - 1/3 Split.................................................................................................................... 190

Figure B.13 Comparison of the Best Precision Results in Figure B.7 and 2/3 - 1/3 Repeated Relevant Split............................................................................. 191

Figure B.14 Comparison of the Best Recall Results in Figure B.8 and 2/3 - 1/3 Repeated Relevant Split............................................................................................. 192

Figure B.15 Comparison of the Best F1 Measure Results in Figure B.9 and 2/3 - 1/3 Repeated Relevant split............................................................................. 192

Figure E.1   Decision Tree Generated by Experiment 16 for Imbalance Ratio 50......... 215

 


Index of Tables

 

 

 

Table 2.1     Summary of KBSE Systems Discussed....................................................... 37

Table 3.1     Distribution of Files by Their Type............................................................ 58

Table 3.2     Summary of Syntactic Attributes................................................................ 72

Table 4.1     Distribution of SX 2000 Source Code Among Different Files..................... 82

Table 4.2     Training and Testing Class Ratios for Skewed Experiments........................ 85

Table 4.3     Result of Removing Class Noise from Skewed Training Sets...................... 86

Table 4.4     Learning from Less Skewed Training Sets.................................................... 88

Table 4.5     The Effect of Removing Class Noise from Training Sets............................. 88

Table 4.6     Removing Training Relevant Examples from Testing Not Relevant Examples (Noisy Data)................................................................................................ 89

Table 4.7     Removing Training Relevant Examples from Testing Not Relevant Examples (Noise Free Data)......................................................................................... 90

Table 4.8     Training and Testing Class Ratios for Single Combination Experiments..... 90

Table 4.9     Single Combination Versus All Permutation................................................ 91

Table 4.10   Using Relevant Based Approach to Create the Training Not Relevant Examples...................................................................................................... 91

Table 4.11   The Effect of Using Relevant Based Training Files..................................... 92

Table 4.12   Pascal Only Repositories............................................................................. 93

Table 4.13   Results of Experimenting With Pascal Only File Pairs................................ 93

Table 4.14   Training and Testing Repositories Used for Stratified Training Experiments...................................................................................................................... 94

Table 4.15   Conjunction of Balls for Loss Ratio 1 and Distance Metrics L1 and L2..... 98

Table 4.16   Disjunction of Balls for Loss Ratio 1 and Distance Metrics L1 and L2...... 99

Table 4.17   Conjunction of Balls for Loss Ratio 10, Distance Metrics L1, and FSF 1 and 2.................................................................................................................... 99

Table 4.18   Conjunction of Balls for Loss Ratio 10, Distance Metrics L2, and FSF 1 and 2.................................................................................................................. 100

Table 4.19   Disjunction of Balls for Loss Ratio 10, Distance Metrics L2, and FSF 1 and 2.................................................................................................................... 100

Table 4.20   Disjunction of Balls for Loss Ratio 10, Distance Metrics L2, and FSF 1 and 2.................................................................................................................. 101

Table 4.21   Data to which File Pair Labeling Heuristics were Applied........................ 102

Table 4.22   Distribution of Group Sizes Created by Co-Update Heuristic................. 104

Table 4.23   Training and Testing Repositories Used in Base Experiments 1 and 2...... 105

Table 4.24   Attributes Used in the Base Experiments.................................................. 107

Table 4.25   Training and Testing Repositories Used in Base Experiments 3 and 4...... 111

Table 4.26   Top Nodes in the Decision Trees of Base Experiment 3........................... 114

Table 4.27   The Normalized Distance of ROC Points from the Perfect Classifier for Class Noise Removal Experiments............................................................. 117

Table 4.28   Decision Tree Size of Class Noise Removal Experiments......................... 118

Table 4.29   The Normalized Distance of ROC Points from the Perfect Classifier for Method 1.................................................................................................... 121

Table 4.30   The Normalized Distance of ROC Points from the Perfect Classifier for Method 2.................................................................................................... 124

Table 4.31   The Normalized Distance of ROC Points from the Perfect Classifier for Methods 1 and 2........................................................................................ 127

Table 4.32   Decision Tree Size Comparison of Methods 1 and 2................................ 127

Table 4.33   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.16.................................................................................... 134

Table 4.34   Decision Tree Size Comparison for Plots in Figure 4.16........................... 134

Table 4.35   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.18.................................................................................... 138

Table 4.36   Decision Tree Size Comparison for Plots in Figure 4.18........................... 140

Table 4.37   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.19.................................................................................... 144

Table 4.38   Decision Tree Size Comparison for Plots in Figure 4.19........................... 144

Table 4.39   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.20.................................................................................... 147

Table 4.40   Decision Tree Size Comparison for Plots in Figure 4.20 and Syntactic Attribute Based Decision Trees................................................................. 147

Table 4.41   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.21.................................................................................... 150

Table 4.42   Decision Tree Size Comparison for Plots in Figure 4.21........................... 150

Table 4.43   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.22.................................................................................... 152

Table 4.44   Decision Tree Size Comparison for Plots in Figure 4.22........................... 153

Table 4.45   The Normalized Distance of ROC Points from the Perfect Classifier for Plots in Figure 4.23.................................................................................... 154

Table 4.46   Decision Tree Size Comparison for Plots in Figure 4.23........................... 155

Table 4.47   Experiments Summary............................................................................... 157

Table 4.47   Experiments Summary (Continued)............................................................ 158

Table A.1     Format of the Log Lines............................................................................. 174

Table A.2     Log Commands and their Meaning............................................................. 174

Table A.2     Log Commands and their Meaning (Continued)........................................ 175

Table B.1     Alternatives to Basic Training and Testing Pairs....................................... 180

Table B.2     Alternatives to Basic Training and Testing Pairs (Extended Version)....... 188

Table B.3     Training and Testing Sets Using the Hold Out Method with Repeated Relevant Examples..................................................................................... 191

Table C.1    Detailed Data for Base Experiment 1 (Table 4.24 First Half).................... 196

Table C.2    Detailed Data for Base Experiment 2 (Table 4.24 Second Half)................ 196

Table C.3    Detailed Data for Base Experiment 3 (Table 4.26 First Half).................... 197

Table C.4    Detailed Data for Base Experiment 4 (Table 4.26 Second Half)................ 197

Table C.5    Detailed Data for Experiment 5.................................................................. 198

Table C.6    Detailed Data for Experiment 6 (Single Copy Noise Removal)................. 198

Table C.7    Detailed Data for Experiment 6 (Multi-Copy Noise Removal)................. 199

Table C.8    Detailed Data for Experiment 8.................................................................. 199

Table C.9    Detailed Data for Experiment 9.................................................................. 200

Table C.10  Detailed Data for Experiment 11................................................................ 200

Table C.11  Detailed Data for Experiment 12................................................................ 201

Table C.12  Detailed Data for Experiment 13................................................................ 201

Table C.13  Detailed Data for Experiment 14................................................................ 202

Table C.14  Detailed Data for Experiment 15................................................................ 202

Table C.15  Detailed Data for Experiment 16................................................................ 203

Table C.16  Detailed Data for Experiment 17................................................................ 203

Table C.17  Detailed Data for Experiment 18................................................................ 204

Table C.18  Detailed Data for Experiment 19................................................................ 204

Table C.19  Detailed Data for Experiment 20................................................................ 205

Table C.20  Detailed Data for Experiment 21................................................................ 205

 


 

 


Chapter 1

 

Motivation

 

 

 

This thesis is devoted to the development and testing of an approach to extract models of a software system that are geared towards assisting software maintainers. The objective here is to learn relations from past maintenance experience that could be useful in future source code level maintenance activities.

1.1 Legacy Software Systems and Software Maintenance

Legacy software systems are a fact of life in the software engineering community. Many important software systems controlling essential aspects of modern society were developed many years ago. Legacy systems are old systems that still need to be maintained [Sommerville 2001 p. 582]. It was estimated that in 1990 there were 120 billion lines of code in existence, and that at the time the majority of the systems containing them were already considered to be legacy systems [Ulrich 1990][1].

Most legacy systems exhibit the following properties:

• They are old

• They are large

• They are essential for the operation of the organization which uses them

• The existing documentation about them is often incomplete or out of date

• Many of them were designed before the time of structured programming [Choi and Scacchi 1990]

• They have been developed and modified by several people over many years

• There is no single person who possesses a complete understanding of the system

Although systems written in the 1960s and running on mainframes are the classical cases of legacy systems, in practice much newer systems, developed in the 1980s and 1990s using structured design and other modern techniques, are also considered to be legacy systems at the beginning of the twenty-first century.

According to ANSI/IEEE standard 729-1983, maintenance is "the modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a changed environment" [Chikofsky and Cross 1990 p. 14].

It is known that software maintenance is a time consuming and costly part of the software life cycle. Whether newer software development technologies, such as object orientation, have been effective in practice at alleviating the maintenance burden is perhaps debatable, but the above statement has been, and remains, true for legacy software systems.

While design decisions and the rationale for them are sometimes documented, more often modifications to the source code are the main or the only indication of design changes [Rugaber et al 1990 p. 46]. The lack of up-to-date requirements, design and code documents has played a major role in making maintenance a difficult, expensive and error prone task.

The cost of maintenance varies widely depending on the application domain. It can range from broadly comparable with the system development cost to as much as four times the cost of development for real time embedded systems[2] [Sommerville 2001 p. 607]. There have also been reports of extreme cases in which the maintenance cost has been as high as 130 times the development cost [Boehm 1975]. Other researchers have suggested that on average 67% of the total cost of the product can be attributed to maintenance [Lientz, Swanson, and Tompkins 1978][Schach 2002 p. 494], while Tracz puts this number close to 70% [Tracz 1988]. There are others who suggest an 80-20 rule, which says that 20% of the development work is performed during original application development and the other 80% is performed during maintenance [Conger 1994 p. 740], and the indications are that the cost is likely to increase [McCartney 1991 p. xvii]. [Shari 2001 p. 479] suggests that in the first decade of the new century maintenance will account for 80% of a typical system's lifetime cost.

Software maintenance has been referred to as a severe problem [McCartney 1991 p. xviii], or even as being the most difficult of all aspects of software production [Devanbu and Ballard 1991 p. 26] [Schach 2002 p. 494].

Yet despite a general understanding of the importance of maintenance in the software life cycle, and an abundance of very large legacy software systems, academia has not paid as much attention to the topic as it deserves. Papers related to software maintenance constitute less than 1% of the papers published annually on software topics [Kwon et al 1998 p.1].

1.2 Comprehending Legacy Systems

The four kinds of maintenance activities are:

• Corrective, which is concerned with fixing reported errors

• Adaptive, which means changing the software to accommodate changes in the environment, such as new hardware

• Perfective, which involves implementing new requirements (enhancement) [Sommerville 2001 p. 605]

• Preventive, also known as software reengineering, which makes changes to computer programs so that they can be more easily corrected, adapted, and enhanced [Pressman 2001 p. 23].

About 20 percent of maintenance work is dedicated to fixing mistakes. The remaining 80 percent is dedicated to adapting the system to a new external environment, implementing enhancement requests from users, and reengineering an application for future use [Pressman 2001 p. 805].

Regardless of the type of maintenance, to maintain a software system one needs to understand it. Better understanding of the software system will assist the maintenance programmer in performing his or her task.

One of the most important research areas related to software maintenance is Reverse Engineering. Reverse Engineering is the process of analyzing a system to:

• Identify its components and the relations among them;

• Create an alternative representation for the system at the same or higher level of abstraction

Two widely discussed areas of reverse engineering are redocumentation and design recovery.

Redocumentation involves providing alternative views of a system at relatively the same abstraction level, for example data flow or control flow diagrams of the system.

Design recovery uses domain knowledge, external information, and deductive and fuzzy reasoning to identify higher level abstractions than the ones that can be obtained by examining the system directly [Chikofsky and Cross 1990 p.15].

In general, most tools developed to assist programmers with comprehending software start from the source code and try to present it at a more abstract level that is easier for a human to understand.

Traditional reverse engineering tools supporting redocumentation mostly rely on static source code analysis methods [Palthepu et al 1997]. They provide graphical views of the interconnections between components of the software system, sometimes at different levels of abstraction, but tend to leave the question of how components are related to each other in the context of software maintenance to the software engineer. Examples of systems that provide such a capability are Rigi, Imagix4D, and SNiFF++ [Bellay and Gall 1996]. The software engineer must make this decision by browsing through the provided information. This becomes more problematic when the relation is non-trivial and is influenced by a combination of other relations among components.

According to Biggerstaff, "Design recovery recreates design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about problem and application domain. In short, design recovery must reproduce all of the information required for a person to fully understand what a program does, how it does it, why it does it, and so forth. Thus, design recovery deals with a far wider range of information than found in conventional software engineering representations or code" [Biggerstaff 1989 p. 36].

In relatively recent years there has been a revival of emphasis on the importance of knowledge and its incorporation in reverse engineering tools [Clayton et al 1998 p. 76][Bellay and Gall 1998][Kontogiannis 1995 p. 95]. Such systems in effect fall into the intersection of Reverse Engineering and Knowledge Based Software Engineering (KBSE).

1.3 Knowledge Based Software Engineering vs. Application of Inductive Methods in Software Engineering

The application of artificial intelligence technology to software engineering is known as Knowledge Based Software Engineering (KBSE) [Lowry and Duran 1989 p.243]. While this definition is fairly broad, most KBSE systems explicitly encode the knowledge that they employ [McCartney 1991 p. xix]. KBSE systems designed for assisting software engineers in low-level everyday maintenance tasks have the potential of representing and deducing the relations among components of a software system, at the expense of requiring a fairly extensive body of knowledge and employing sometimes computationally demanding deduction and other algorithms. In other words, most such systems are fairly knowledge rich. As we will discuss in Chapter 2, KBSE systems tend to employ expert systems or knowledge bases as their underlying technology.

While there has been a fair body of work that has applied deductive methods to different aspects of software engineering, the application of inductive methods (also known as Machine Learning) in software engineering has received far less attention. In Chapter 2 we will review some of the research conducted in this area.

It has been argued that learning systems have the potential of going beyond the performance of an expert, finding new relations among concepts by examining examples of successfully solved cases. In effect this could allow the incorporation of knowledge into a system without the need for a knowledge engineer. In other words, using inductive methods to extract the knowledge that helps a software engineer in understanding a software system is an alternative to more traditional KBSE techniques. We should point out that this does not mean that one cannot incorporate expert knowledge in the process. On the contrary, it is believed that such a contribution can increase the potential gain obtained by using inductive methods [Weiss and Kulikowski 1991 p. 3]. However, as will become clearer in Chapters 2 and 3, unlike in KBSE systems, expert knowledge is not coded in the form of a large knowledge base.

1.4 The Goal of the Thesis

Software engineers who maintain legacy software systems usually work very close to the source code. This often implies that they have to know about the components of the system involved in a problem, e.g. files and routines, and the interconnections of these components with other components of the system. As will be discussed in Chapter 2, KBSE research, in general, has been concerned with topics that may not directly address software maintenance activities, or at least not at this level of granularity. Although we will present examples that provide assistance at the source code level, most of them are prototypes that have not been applied to real world software systems. This general trend of lack of focus on software maintenance is partly due to the ideal of KBSE which, if fully implemented, would remove source level maintenance activities by automating the process of generating the software from its specification. While this would be of great value for systems to be created in the future, it does not apply to the large body of currently existing legacy systems.

On the other hand, inductive methods, which have been successfully used in different domains, have not been extensively applied to software maintenance. Software maintenance requires knowledge and experience. Maintenance assistance tools can benefit from such knowledge. Inductive methods are capable of extracting structural patterns or models from data. They can learn concepts from experience observed in the past to predict outcomes of future unseen cases. These methods do not require the knowledge acquisition step necessary to build a knowledge based system. Consequently, there seems to be a great opportunity in using these methods to aid software engineers in software maintenance.

Broadly speaking, the aim of this thesis is to investigate the application of inductive methods, which are the focus of the field of Machine Learning [Mitchell 1997], to source code level daily software maintenance. This is an intersection of two research fields which has been widely ignored by researchers in these communities. Considering the importance of maintenance, and the maturity of inductive learning methods, we believe the time has come for software engineering and machine learning researchers to join forces in assisting software engineers in the maintenance of software systems, especially legacy software, at the source code level. We also believe research in this area can benefit the machine learning community by providing a rich and nontrivial application area that challenges the community to improve its existing techniques, algorithms, and the overall current state of the technology.

Using inductive learning methods and a variety of data sources used by maintenance programmers, we extract, or learn, what we call a maintenance relevance relation among entities in the system under consideration. This can be seen as a design recovery process. The hope is that the extracted relation will inform and assist the maintenance programmer in understanding some of the existing complex relations or interactions within the target system. These relations may reveal the intrinsic interconnections between different components of the subject system which:

• may or may not be documented;

• are the consequence of the original design and implementation decisions, or side effects of changes that the system has gone through over its lifetime

Therefore in this thesis we lay out the general idea of relevance among entities in a software system, and then proceed to apply inductive learning methods to learn a specific relation which is geared towards the daily maintenance task. As part of the contribution of this thesis we report on challenges encountered in the process, the lessons learned, and the results achieved, whether positive or negative. Since research at the intersection of machine learning and software engineering is very rare, we strongly believe in the value of reporting negative results, as it may help researchers in adjusting their expectations and avoiding pitfalls.

1.5 Definitions and an Application

In this section we provide the definitions of Relevance Relation and other concepts closely related to it. We will also describe a specific application that aims to extract a useful relevance relation in the context of software maintenance.

Definition: A Software Entity is any semantically or syntactically valid construct in a software system that can be seen as a unit with a well defined purpose[3].

Examples of software entities include documents, source files, routines, modules, variables etc.

Definition: A Predictor is a relation that maps one or more software entities to a value reflecting a prediction made about these entities.

Definition: A Relevance Relation is a predictor that maps two or more software entities to a value r quantifying how relevant, i.e. connected or related, the entities are to each other. In other words r shows the strength of relevance among the entities. Therefore, the meaning of "relevant" here is embedded in, and dependent on, the definition of the relation.

In its most general form, a relevance relation maps the entities into a real number between 0 and 1 showing the strength of the relevance among them. In this setting 0 stands for the lack of relevance, and 1 shows 100% relevance. We call such a relation a Continuous Relevance Relation. However, a real number may not necessarily always be the best choice, in which case the strength may be one of k discrete values. In the latter case we call R a Discrete Relevance Relation. For instance if k is two, we could have a Boolean relevance relation R that maps entities e1, ..., en (n ≥ 2) to true if, based on the definition of R, the entities e1, ..., en are Relevant to each other. If R maps e1, ..., en to a false value, this means that, based on the definition of R, e1, ..., en are Not Relevant to each other.

Figure 1.1 shows the two varieties of relevance relations.

Figure 1.1 Continuous and Discrete Versions of a Relevance Relation R
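To make the two varieties concrete, the following sketch (illustrative Python, not part of the thesis tooling; the file names and the name-prefix similarity measure are invented for the example) expresses a continuous relevance relation as a function returning a value in [0, 1], and a Boolean, i.e. two-valued discrete, relevance relation as its thresholded counterpart:

import os

def continuous_relevance(e1: str, e2: str) -> float:
    """Continuous relevance relation: maps a pair of entities into [0, 1]."""
    # Toy strength measure: proportion of shared file-name prefix
    # (purely illustrative; not a relation used in this thesis).
    shared = len(os.path.commonprefix([e1, e2]))
    return shared / max(len(e1), len(e2))

def boolean_relevance(e1: str, e2: str, threshold: float = 0.5) -> bool:
    """Discrete (k = 2) relevance relation: True means Relevant, False means Not Relevant."""
    return continuous_relevance(e1, e2) >= threshold

print(continuous_relevance("trunk_io.pas", "trunk_db.pas"))  # 0.5
print(boolean_relevance("trunk_io.pas", "trunk_db.pas"))     # True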

Definition: A System Relevance Graph for R, or SRG_R, is a hypergraph where the set of vertices is a subset of the set of software entities in the system. Each hyperedge in the graph connects vertices that are relevant to each other according to a relevance relation R[4]. A System Relevance Subgraph SRG_R,r is a subgraph of SRG_R where the software entities on each hyperedge are mapped to the relevance value r by the relevance relation R.

Figure 1.2-a shows a system relevance graph for a system with five entities e1, ..., e5 and a relevance relation R that maps every two entities to one of two relevance values, e.g. a Boolean relevance relation. Figures 1.2-b and 1.2-c show the subgraphs formed for each of these relevance values. Assume the solid lines correspond to a mapping to a true value and the broken lines correspond to a mapping to a false value. In this case, the above graphs show that based on the definition of relation R the following pairs of entities are relevant to each other,

(e1,e3), (e1,e5), (e3,e5), (e4,e5)

and the following pairs are not relevant to each other,

(e1,e2), (e1,e4), (e2,e3), (e2,e4), (e2,e5), (e3,e4)

Figure 1.2 A System Relevance Graph and its Two System Relevance Subgraphs
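The sketch below (illustrative Python, not from the thesis) encodes the Boolean relation R of Figure 1.2 as an explicit mapping from entity pairs to relevance values, and partitions the ten possible pairs of the five entities into the two system relevance subgraphs:

from itertools import combinations

entities = ["e1", "e2", "e3", "e4", "e5"]

# The pairs mapped to true by R (the solid lines of Figure 1.2-b).
relevant_pairs = {("e1", "e3"), ("e1", "e5"), ("e3", "e5"), ("e4", "e5")}

def R(a: str, b: str) -> bool:
    """Boolean relevance relation over pairs (m = 2) of entities."""
    return tuple(sorted((a, b))) in relevant_pairs

# Each hyperedge (an ordinary edge here, since m = 2) falls into exactly one
# subgraph, keyed by the relevance value that R assigns to it.
subgraphs = {True: set(), False: set()}
for a, b in combinations(entities, 2):
    subgraphs[R(a, b)].add((a, b))

print(subgraphs[True])   # the 4 Relevant pairs (Figure 1.2-b)
print(subgraphs[False])  # the 6 Not Relevant pairs (Figure 1.2-c)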

If SRG_R has n vertices, and the relevance relation R maps m entities to a relevance value, the maximum number of system relevance subgraphs is:

C(n, m), the number of hyperedges with m vertices in a graph with n vertices
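For instance, reading this bound as the binomial coefficient C(n, m): in the five-entity system of Figure 1.2 with a pairwise relation (n = 5, m = 2) there are C(5, 2) = 10 hyperedges, matching the 4 relevant plus 6 not relevant pairs enumerated above. A Boolean R therefore yields at most two non-empty subgraphs, while a relation that assigned a distinct relevance value to every hyperedge would, in the extreme, produce ten.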

Definition: A System Static Relevance Graph or SSRG_R is an SRG in which R is defined in terms of static relations[5] in the source code, e.g. calls or called-by relations for routines.

The number of nodes in an SRG in a given release of a software system is constant. Therefore the topology of the SRG only depends on the definition of the relevance relation among the nodes. This means that a software system can be depicted by different SRGs. Two SRGs of the same system may share one or more connected sub-graphs suggesting the possibility of the existence of a meta‑relation between two relevance relations. For example an SRG based on the Is-In-Subsystem Relevance Relation and an SRG based on the Documents-Entity Relevance Relation may share a common sub‑graph suggesting that there is documentation for a subsystem.

We provided the definition of redocumentation and design recovery in section 1.2. Note that SRGs can be used in both cases during a reverse engineering effort. The level of abstraction represented by an SRG depends upon the definition of the relevance relation it is based on. The notion of relevance relation as defined in this thesis is flexible, in that the definition of the relation can capture simple or very complex interactions between the entities mapped by the relation. The definition can be based on a combination of the existing relations among the entities themselves, or between the entities and other entities in the system. For instance a relevance relation between files can be defined in terms of the file inclusion relation among files, and a variable reference relation among files and variables to which they refer.

Most organizations that produce software have access to tools that allow programmers to browse or query low level static source code relations, such as what are the routines called or what are the variables referred to in a file. TkSee [Lethbridge and Anquetil 1997][Lethbridge 2000], a tool in use at Mitel Networks[6] to browse and explore software systems, is an example of this. By using such tools and studying source code it is possible to reverse engineer and associate design decisions with the source code; however, some SSRGs can speed up this process considerably. TkSee uses a database that stores information generated by parsing the source files in the system. For instance, for each routine r in the software system, this database keeps a list of routines called by r[7]. Using this information one can create an SSRGcalls(ri,rj)→Boolean in which calls is a relevance relation that maps two routine entities ri and rj to true if ri calls rj in the source code, and false otherwise. Note that SSRGcalls(ri,rj)→Boolean is an example of an SRG that can be used in redocumenting a system.
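The sketch below illustrates how such a calls-based SSRG could be assembled from a parsed-source database of this kind; the routine names and the dictionary format are invented for illustration and do not reflect the actual TkSee schema.

# Hypothetical parsed-source database: for each routine, the routines it calls.
calls_db = {
    "init_switch": ["alloc_table", "load_config"],
    "load_config": ["read_file"],
    "alloc_table": [],
    "read_file":   [],
}

def calls(ri: str, rj: str) -> bool:
    """Boolean relevance relation: True iff routine ri calls routine rj."""
    return rj in calls_db.get(ri, [])

# The SSRG for 'calls': vertices are routines, edges connect routines mapped to true.
ssrg_calls = {(ri, rj)
              for ri, callees in calls_db.items()
              for rj in callees}

print(calls("init_switch", "load_config"))  # True
print(sorted(ssrg_calls))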

However, SRGs that are based on relevance relations defined in terms of a single source code level relation form the most basic and least complex subset of all possible SRGs. Although these SRGs can be extremely useful for a developer in the maintenance task, they fail to capture non trivial relations that exist among entities in a software system.

In this thesis we investigate the existence of other relevance relations that may assist a maintenance programmer in his or her job. To be more specific, within the context of maintenance activity, we seek answers to the following questions:

      Using a combination of static source code relations and/or other sources of information, can we find other useful and non trivial relevance relation(s) in a software system?

A Static Relevance Relation R is defined in terms of static source code relations.

Within the context of the system under study, and among different static relations in the source code that are candidates to be used in the definition of R, we would like to investigate the existence of an importance order among these relations. In other words, we would like to be able to answer the following question:

What are the more important static relations which a maintenance programmer should be aware of [8]?

We believe finding a Maintenance Relevance Relation (MRR) could provide an answer to these questions. Such a relation should somehow incorporate information about the changes applied to the system as part of the maintenance process. Consequently, it has the potential to reflect the interconnections among the entities of a system, which are influenced by the changes applied to the system. Such interconnections may or may not conform to the original or perceived design of the system. Instead, they could show the existence and interplay of the actual interconnections among the entities, which cause changes in one entity to influence other entities.

The definition of software entity given above is very broad in nature. It ranges from a single variable, to the whole system. In the context of software maintenance activity, there is a much smaller set of software entities that seem to be interesting. Examples of such software entities are files, routines, variables, modules, subsystems and documents.

When a maintenance programmer is looking at a piece of code, such as a file or a routine, one of the important questions that he or she needs to answer is:

Which other files should I know about, i.e. what else might be relevant to this piece of code?

While this is a question faced in daily source level maintenance tasks, the ability to answer it is essential in understanding and maintaining any software system regardless of the type of maintenance activity.

In this thesis we will be focusing on a special kind of SRG in which the nodes are files[9]. We would like to learn a special class of maintenance relevance relations called the Co‑update relation:

            Co-update(fi, fj) → {Relevant, Not-Relevant} where,

i ≠ j and fi and fj are any two files in the system.

Co-update(fi, fj) = Relevant means that a change in fi may result in a change in fj, and vice versa[10].

A relevance relation such as Co-update could be used to assist a maintenance programmer in answering the above question about files.

To learn the Co-update MRR we propose the application of inductive learning methods to the data extracted from[11] the following sources of data corresponding to a real world legacy system[12]:

      Static source code analysis

      Historical maintenance records

In other words, we mine[13] these sources of data to extract the Co-update relation. The idea here is that such a relation, learned from examples of past maintenance experience, could be used to assist software engineers, especially newcomers to the team, in similar future maintenance activities.
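As a rough illustration of this mining step, the sketch below labels a pair of files as Relevant when the two files appear together in at least one historical update record; the record format, the file names, and the simple grouping heuristic are illustrative assumptions only, and Chapter 3 describes the labeling heuristics actually used in this work.

from itertools import combinations

# Hypothetical update records: each record lists the files changed together in one update.
updates = [
    {"call_proc.c", "call_proc.h", "trunk.c"},
    {"trunk.c", "route.c"},
    {"ui_menu.c"},
]

all_files = set().union(*updates)

def co_update(fi: str, fj: str) -> str:
    """Co-update(fi, fj) -> {Relevant, Not-Relevant}, for fi != fj."""
    together = any(fi in record and fj in record for record in updates)
    return "Relevant" if together else "Not-Relevant"

# Labeled examples for learning: one example per unordered pair of files.
examples = {(fi, fj): co_update(fi, fj)
            for fi, fj in combinations(sorted(all_files), 2)}

print(co_update("call_proc.c", "trunk.c"))  # Relevant
print(co_update("route.c", "ui_menu.c"))    # Not-Relevant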

In theory there is no need to restrict the number of entities to be two. Our decision to learn this special case of relevance relation is based on the following observations:

      A two-entity relation provides the most basic kind of relevance relation in terms of the number of entities, which implies that it can potentially capture more general interconnections among entities of the system. A relation with three or more entities is, by its nature, a more constrained and specialized form of relation.

      On a more pragmatic note, as one would expect there are usually fewer examples of individual changes involving larger numbers of entities. As will be discussed in Chapter 4, this is also the case for the particular software system used in this study.

It could also be argued that a mapping to a value between 0 and 1 is a better choice. Our more restricted definition is partly due to the learning method that we have experimented with, namely decision tree learning, and partly due to the fact that in a large software system an SRG can be very complex. It is not clear to us whether providing a software engineer with a large set of objects with different relevance rankings will be more beneficial than providing him or her with a smaller set that is believed to contain the relevant objects. Of course, a real valued variable can easily be discretized to a Boolean variable; however, this introduces the issue of finding a proper threshold value[14].

One of the challenges that we have faced in our research is labeling the examples from which the above MRR is learned. As will be discussed in Chapter 3, this thesis also proposes heuristics to automatically label the examples, without direct involvement of expert software engineers.

The remainder of this thesis is structured as follows:

In Chapter 2 we present a subset of research conducted in the area of software maintenance that is relevant to the subject of this thesis. We will cover research in KBSE and the application of inductive methods to software maintenance. Chapter 3 will provide the details of the method used to learn a Maintenance Relevance Relation. In Chapter 4, we present the experiments we have performed and results we have obtained. Chapter 5 will discuss our conclusions and plans for future work.

1.6 Contribution of this Thesis

Below we list the contributions of this thesis, i.e. the specific technical issues that to the best of our knowledge were not addressed before, and whose solutions presented in this work add to the state of the art in either machine learning or software maintenance methods:

      Proposing a representation of the maintenance task conducive to the use of machine learning.

      Engineering attributes such that on the one hand they can represent properties of software modules and on the other hand, values for these attributes can be acquired from the source code and software maintenance records.

      Producing a large repository of examples[15] from real-life data appropriate for the use of machine learning tools and techniques.

      Experimenting with a number of machine learning methods including decision tree learning, one rule learner, and set covering machines to learn the Co-update maintenance relevance relation and running empirical comparisons of these methods.

      Setting up an experimental protocol for the above including performing more than 570 training and testing runs using the syntactic attribute set, collecting the results, representing them in the form of ROC curves, tables etc.

      Analyzing the results and drawing conclusions about the relative performance of different training/testing split methods, learning methods, and example refinement techniques such as class noise removal etc.

      Introducing the textual attributes which exploit the semantic information implicit in the source file comments and problem reports

      Performing more than 140 experiments with textual attributes and their combinations together and with syntactic attributes. Collecting results and representing them in the form of a variety of tables and plots including ROC curves.

      Analyzing the results and drawing conclusions about the relative performance of source file comment, problem report, and combination feature sets.

      Preparing and submitting (at the time of this writing) several research publications. Publication of [Sayyad Shirabad et al 2000, 2001].


Chapter 2

 

Related Research

 

 

 

2.1 Introduction

It is a known fact that software maintenance is a time-consuming and costly part of the software life cycle.

Software maintenance encompasses a wide range of topics including reverse engineering, re-engineering, program transformation, program comprehension, impact analysis, regression testing, reuse, software configuration management, web‑based software maintenance, maintenance process models, and maintenance standards [Kwon et al 1998 p.3].

Considering the wide range of topics falling under the umbrella of software maintenance, no single method or paradigm is expected to provide an answer to all of its problems. Our focus is on assisting maintenance programmers with the following problem:

When a maintenance programmer is looking at a piece of code, such as a source file, one of the important questions he or she needs to answer is:

Which other files should I know about, i.e. what else might be relevant to this piece of code?

As discussed in Chapter 1, to provide an answer to this question we will learn a special kind of relevance relation among files in the system called the Co-update Maintenance Relevance Relation. We do this by applying machine learning techniques to data sets created by extracting information from the source code and program update history.

The more traditional approach to the application of AI to software engineering has focused on the usage of expert systems and, more generally, knowledge based techniques. However, the lack of direct attention to supporting maintenance programmers can also be seen in KBSE community research. In the relatively recent ICSE 16 workshop on Research Issues in the Intersection between Software Engineering and Artificial Intelligence, the discussions were about knowledge representation and acquisition, reuse and reverse engineering, validation and verification, and software process [Kontogiannis and Selfridge 1995 p. 94]. However, maintenance was identified as one of the areas of high cost leverage [Kontogiannis and Selfridge 1995 p.96].

In the following section we will review the research projects that employ AI technology, and which we believe are directly applicable to the daily task of software maintenance. In general, these projects fall into one or more of the reverse engineering, re-engineering, program transformation, and program comprehension research areas.

2.2 AI and Software Maintenance

Artificial intelligence is one of the oldest research areas in computer science. The Dartmouth Summer Research Project on Artificial Intelligence, organized in 1956 by John McCarthy, has been considered by many the birthplace of AI [Russell and Norvig 1995]. As such, AI covers a large body of methods and algorithms. Software engineering has been an application target for AI since the early 1980s, when research in Knowledge Based Software Engineering (KBSE) started[16]. In general, the methods used in traditional KBSE rely heavily on coded knowledge and sometimes deductive methods. On the other hand, inductive classification methods, which are mostly considered to be part of Machine Learning research, attempt to extract models of the subject of interest from a set of labeled examples, without encoding a large amount of knowledge.

In the following two sections, we will survey some of the previous research that could be considered relevant to the topic of this thesis. Section 2.2.1 focuses on KBSE, while section 2.2.2 presents research that employs inductive methods.

2.2.1 Knowledge Based Software Engineering

In 1983, the Rome Air Development Center (now Rome Laboratory) published a report calling for the development of a Knowledge-Based Software Engineering Assistant (KBSA) which would employ artificial intelligence techniques to support all phases of the software development process [Green et al 1983]. This initiative eventually evolved into KBSA conferences, which later were renamed as Knowledge-Based Software Engineering (KBSE) conferences, and as of 1996 as Automated Software Engineering Conferences.

Knowledge Based Software Engineering, KBSE, is the application of AI technology to software engineering [Lowry and Duran 1989 p.243]. Expressed another way, it is an approach that integrates methods from AI and software engineering [McCartney 1991 p. xvii]. The prevalent feature of KBSE technology, as far as AI is concerned, is the use of knowledge-based technology through explicit coding of the knowledge. Practically, the focus of AI is dealing with knowledge in terms of representation, reasoning, and extracting useful information from large amounts of it [McCartney 1991 p. xix].

While different people have used the term knowledge base to refer to different things, it seems that the following two definitions are widely applicable across different projects:

A knowledge base is:

      A collection of simple facts and general rules representing some universe of discourse [Frost  1986 p. 3], or

      A database augmented with the rules for reasoning about the data and inference engine that applies the rules [Lowry and Duran 1989 p. 245]

An alternative definition that includes the notion of inheritance and type is:

      A knowledge base is a collection of interrelated concepts[17] [Lethbridge T.C. 1994 pp. 193, 190].

To go beyond the capabilities of CASE tools of the 1980s, KBSE uses knowledge-based and other AI techniques. The overall objective is to provide intelligent computer based assistance for all parts of the software life cycle [Lowry and Duran 1989 p. 245].

According to Green, KBSE has the following five goals: [Green et al 1983 pp. 381-382]

      To formalize the artifacts of software development and software engineering activities that produce these artifacts [Scherlis and Scott 1983].

      To use knowledge representation technology to record, organize, and retrieve the knowledge behind the design decisions that result in a software system.

      To produce knowledge-based assistance to synthesize and validate source code from formal specification. This will enable maintenance to be performed by altering the specification and then replaying the steps for synthesizing source code, with appropriate modifications. There is also a need for knowledge-based assistance to recover high level specifications from source code.

      To produce knowledge-based assistance to develop and validate specifications.

      To produce knowledge-based assistance to manage large software projects. [Lowry and Duran 1989 pp. 245-246].

The third goal above sheds some light on why maintenance, particularly at the source code level, has not been the primary focus of the KBSE research community. According to Lowry, the KBSE view of software evolution and maintenance envisions changes happening by modifying the specification and re-deriving the implementation, rather than by directly modifying the implementation. This is a fundamental change in approach to the software life cycle [Lowry and Duran 1989 p. 245].

From early days of KBSE, researchers have attempted to create systems that will allow the synthesis of programs by going from a formal specification to executable code. Some of the approaches to the problem are theorem proving, transformational implementation, and interactive programming environments. [McCartney 1991 p. xxi].

The above view is perhaps very attractive for forward engineering. Once the technology is feasible and software is created in this manner, it could also be maintained by altering the specification and regenerating an updated version of the system. Some of the issues with this view of software development and maintenance are:

      The researchers have mostly focused on the application of purely deductive methods in theorem proving which have the following problems:

      The proof procedures are computationally expensive, which means that large programs cannot be produced

      These techniques require that the user provide a formal domain theory and specification, which is expensive and demands skills which are not commonly possessed by the average programmer. [McCartney 1991 p. xxii].

      There exist a large number of legacy software systems that have not been specified formally. It is highly unlikely that these systems will ever be formally specified, unless this can be done automatically. Unfortunately, automatic extraction of formal specifications can turn out to be an even more challenging task than one would expect [Kontogiannis and Selfridge 1995 p.95].

The goal of KBSE research in specification acquisition is to help users produce complete, consistent, and correct specifications. Specifications here refer to the following categories:

      a requirements document describes what the customer wants;

      a specification document describes the external behavior of a software system including the user interface; and

      a design document describes the internal structure of the software system, particularly, the hierarchical decomposition of the system into modules and the external data formats used by the modules. [Lowry and Duran 1989 p. 253].

The major idea is to use knowledge acquisition techniques to formulate a semantic model of a domain. Having such a semantic model in place, intelligent assistants can help users in developing the above mentioned specifications [Lowry and Duran 1989 p. 253]. Unfortunately, as is the case for program synthesis, domain modeling is the bottleneck in developing intelligent assistants for specification acquisition [Lowry and Duran 1989 p. 253][Iscoe et al. 1989].

Reverse Engineering is the process of analyzing, documenting and abstracting existing code. It is the first step in re-engineering. Despite the fact that the potential for applying KBSE techniques to reverse engineering has existed since the late 80s and early 90s, the task of extracting high level specifications from the code has always been a challenge [Green 1991 p. xv], especially for large complex systems [Kontogiannis and Selfridge 1995 p.95].

KBSE applied to software re-engineering uses knowledge bases to represent detailed formal representations of software objects ranging from requirements to code. The argument here has been that these detailed representations will result in a more intelligent analysis, which in turn will result in better reverse engineering and re-engineering [Green 1991 p. xv].

2.2.2 Knowledge Based Systems Applied to Software Maintenance

In this section we briefly describe some of the knowledge based systems that are designed to assist maintenance programmers with their task.

2.2.2.1 Code Generation (COGEN)

COde GENeration (COGEN) [Liu 1995, Liu et al 1994] is a knowledge based system designed to be used for adaptive maintenance, which is a special case of the general class of maintenance activities. More specifically, it is a CASE tool/expert system which assists software maintainers with the technology upgrade of legacy systems.

COGEN was used to convert one of the US Federal Aviation Administration systems from an older technology to a newer one. Some of the applications had more than 2 million lines of code. The system had over 1800 COBOL programs. The conversion involved reimplementing the applications' user interface and database transaction aspects, along with the conversion of the language and environment features. An initial experiment with plan recognition and planning for program understanding had shown that these methods could not scale up to the complexity of the conversion problem.

COGEN employs a database of expert conversion rules and a set of tools used in conversion tasks. The file to be processed is loaded into the system and a syntax tree is generated for it. This syntax tree is also translated into Prolog clauses and stored in the Prolog database. The knowledge base and the translator were implemented in Quintus Prolog. The system allows queries regarding program structure and various data and control flow features to be entered by software engineers (SEs). The transformations are performed on the syntax tree and the target source code is generated from the altered tree.

To acquire the conversion expertise, SEs who were knowledgeable in both the source and target technology standards were consulted. When there was more than one conversion scenario, a special rule was created for each case. A total of 264 conversion rules were implemented in COGEN. The design and implementation of the knowledge base took 3 man years.

The SEs using COGEN should be knowledgeable in both source and target platforms so that they can verify the resulting translations. The translation rules can be disabled by SEs. Conversion of online programs requires more involvement of SEs. They insert specialized tokens into the program which indicate where possible maps begin and end. The SEs usually must correct the generated code, because screen building heuristics are incomplete. They can also guide various stages of translation by fine tuning, setting or selecting parameters which determine specific conversion choices.

2.2.2.2 MACS

MACS is a research project within the European ESPRIT II programme. The basic premise of MACS is maintenance through understanding. Maintenance here refers to all maintenance activities. MACS offers multiple views of the same system along different dimensions known as worlds. MACS offers assistance with:

      understanding the structure of the program (Change Management World, Abstraction Recovery World, and Method Worlds)

      understanding the application design and development rationale (Reasoning World)

      helping the maintainer in carrying out the maintenance activities [Desclaux 1990 pp. 146-147]

MACS provides the above services through the usage of specialized knowledge bases which contain knowledge about application type, language, software development method, and background knowledge about maintenance activity and the MACS tool-set itself [Desclaux 1990 p. 146].

The navigation between different worlds supported by MACS is facilitated by another world called Interconnection World [Desclaux 1990 p. 147].

The system also maintains the knowledge about maintenance activities in the form of an Augmented Transition Network [Woods 1970]. Frame-based knowledge representation is used to represent the knowledge about application, software development methods, and programming language domains [Desclaux 1990 p. 147].

The rationale for design and maintenance decisions is kept in the Reasoning World but the system does not force the user to enter these decisions.

2.2.2.3 Problem Manager (PM)

PM is one of the expert managers of Carnegie Group's knowledge-based software development environment. In general, PM deals with the issue of problem management during software maintenance. These problems include bugs, changes, and enhancements. PM was built using the Knowledge Craft expert system development shell. The shell language CRL employed by Knowledge Craft allows presentation of concepts in frame-like objects called schemas [Alperin and Kedzierski 1987 p. 325]. The knowledge base employed by PM contains representations of software concepts such as configurations, activities, modules, people, and the relationships between them [Alperin and Kedzierski 1987 p. 324]. PM can use its knowledge along with the knowledge of other expert managers in the system to report, trace, and process problems [Alperin and Kedzierski 1987 p. 321]. It interacts with the environments of other expert managers such as configuration and scheduling managers.

When a problem is detected, the maintainer can use the problem reporter component of PM to report it. For each developer, the system maintains a knowledge profile. For example, if the maintainer is an expert, then the system will ask him or her to point out the module in which the problem occurred. The expert is then asked about the existence of a version of the target system in which the problem did not occur. By doing so, the system can find and display the modules which have changed in the correct version to generate the version which includes the error. The maintainer can view the description of a module, view its sub-modules, or select it as the cause of the problem. He or she can read other problem reports involving different modules of the system, and look for the occurrence of the same problem in the past. It is also possible to view the relations between different modules, or, by using the plan manager, to find out which person is in charge of a module[18]. The user is then asked to judge the complexity and priority of the problem and provide additional information if desired.

The person in charge of fixing the problem is presented with all the problems related to a module in the form of a hierarchy. A problem report can be reassigned to another person. The problems can be scheduled to be fixed, and a PERT chart editor shows the scheduled problems with the critical path(s), and allows the user to manipulate the chart.

A great amount of effort has been put into determining a good representation of the information involved in the life cycle of the software [Alperin and Kedzierski 1987 p. 324]. The users of PM should first enter the project's configuration information into the system [Alperin and Kedzierski 1987 p. 326]. It is recommended to use the Configuration Manager, another assistant tool, at the start of a project to set up the system environment; however, for large existing systems, which are the most important and challenging ones in terms of maintenance, this will most probably be a prohibitive factor in the usage of PM.

2.2.2.4 Intelligent Program Editor (IPE)

Intelligent Program Editor (IPE) applies artificial intelligence to the task of manipulating and analyzing programs [Shapiro and McCune 1983]. The paper describes a proposal for this system. IPE is a knowledge based tool to support program maintenance and development. It represents a deep understanding of program structure and typical tasks performed by programmers. It explicitly represents textual, syntactic, semantic, and application related structures in programs [Shapiro and McCune 1983 p.226]. The IPE interacts with other intelligent tools such as the Documentation Assistant, and uses the knowledge provided by them. The system represents the programming process in a knowledge base known as the Programming Context Model (PCM), and uses a database called the Extended Program Model (EPM). The EPM represents the functional structure of the code and provides access to it, while the PCM identifies typical sequences of activities in the process of developing and maintaining programs [Shapiro and McCune 1983 p.227]. The structure of the program is represented from different points of view. These vary from low-level textual structures to explicit semantic structures that document the programmer's intent for writing a particular piece of code. Other forms of knowledge used by the system include a vocabulary of program constructs, typical programming patterns which are similar to the idea of clichés (to be discussed later), intentional annotations, intentional aggregations and textual documentation. The knowledge employed by the system is represented in the form of a complex tree or graph structure of frames. The prototype was planned to be implemented on a Symbolics 3600 machine, and initially targeted the Ada and CMS-2 languages. It was expected that the research would rely heavily on methods for information elicitation from the users.

2.2.2.5 Intelligent Semantic Lisp-program Analyzer (ISLA)

Intelligent Semantic Lisp-program Analyzer (ISLA) automatically scans Lisp code for code segments that may be causing a semantic error. It recommends changes to existing Lisp code to make the code less error prone and more understandable [Walczak 1992 p. 102]. Semantic errors in ISLA are divided into different classes. The system takes into account the experience and programming style of the programmer in the heuristics used to locate potential semantic errors. Besides using syntactic clues to locate possible semantic errors, the system also uses knowledge gained by analyzing real world Lisp code, such as program length or complexity [Walczak 1992 p. 103]. ISLA uses heuristic production rules to evaluate Lisp code and to make suggestions for improvements and corrections to possible semantic or logical errors. Each Semantic Error Class is stored as a collection of production rules in a rule-based knowledge base [Walczak 1992 p. 104].

2.2.2.6 Maintainer's Assistant

Maintainer's Assistant, a project from the Center for Software Maintenance at the University of Durham, is a code analysis tool intended to help programmers in understanding and restructuring programs [Ward et al 1988 p. 1, 12]. This is a knowledge based system which uses program plans, or clichés, as building blocks from which algorithms are constructed. To comprehend a program, it is transformed into a composition of recognized plans [Calliss et al 1988 pp. 3-4]. Other proposed sources of knowledge are:

      the knowledge about how maintenance programmers do their work,

      program class plans which are a group of plans common to a particular type of program,

      program specific knowledge which includes the internal representation of source code together with knowledge gained from using other code analysis tools [Calliss et al 1988 pp. 4-5]

The source program is viewed through a browser in Maintainer's Assistant. The assistant allows the source to be manipulated by:

      modifying the source using editing commands,

      applying a transformation from the predefined transformations library, and

      asking the system to search for a sequence of transformations which will produce a desired effect [Ward et al 1988 p. 6]

If transformations are too complicated to be derived automatically, a programmer can explicitly select the suitable transformation. If the applicability conditions of a transformation cannot be automatically verified, the system asks the user to confirm them and records this fact as part of the generated documentation [Ward et al 1988 p. 7].

2.2.2.7 Program Understanding Using Plan Recognition Algorithms

Program understanding is often described as the process of finding program plans in the source code. Most program understanding algorithms use a library of programming plans, and apply different heuristics to locate instances of these plans in the source code. Because of the close relationship between program understanding and plan recognition, researchers have applied plan recognition algorithms to program understanding. Program understanding can be precise or imprecise. In the case of precise program understanding every instance of a particular plan is recognized, without any false positives or false negatives. In imprecise program understanding, the plan recognition mechanism is allowed to guess about the existence of a plan or miss instances of a plan in the library. Plan recognition algorithms determine the best unified context which causally explains a set of perceived events as they are observed. A Context is a hierarchical set of goals and plans which describe the observed actions.

This process assumes that there exists a body of knowledge that describes and limits the type and combination of plans which may occur. Most AI plan recognition algorithms are based on the algorithm presented by Kautz and Allen [Kautz and Allen 1986]. Kautz and Allen's approach is based on deductive inference. It applies a specialized forward chaining process to rules that are based on an exhaustive body of knowledge about actions in a particular domain, in the form of an action hierarchy [Quilici et al 1998 pp. 1-4]. As discussed in [Quilici et al 1998 pp.8-13], the Kautz and Allen plan recognition approach suffers from the following problems:

      It may find an incorrect explanation for the program, even when there is sufficient knowledge to eliminate the wrong plan

      It may select a very misleading explanation graph

      It is not efficient

Diagram-based Environment for Cooperative Object Oriented Plan Extraction (DECODE) [Chin and Quilici 1996] is an interactive (cooperative) environment in which the programmer and the system work together to understand legacy software. It is used to extract designs from legacy COBOL programs. DECODE has three components: an automated programming plan recognizer, a knowledge base which is used to record extracted design information, and a structured notebook for editing and querying the extracted design. DECODE's algorithm is code driven, as opposed to the library driven algorithm used in Concept Recognizer [Kozacynski et al 1994]. The details of DECODE's concept recognition algorithm can be found in [Quilici 1994]. Library driven approaches consider all plans in the library, while code driven approaches consider only the subset of those plans that contain already recognized components. In Concept Recognizer, plans are divided into two parts: a description of plan attributes, and a set of common implementation patterns. The implementation patterns are represented as a set of components of the plan and a set of constraints (relations which must hold between components). DECODE extends the plan representation used in Concept Recognizer by adding support for indexing and organization of the plan library, to reduce the number of constraints that must be evaluated and the amount of matching that must be performed between the code and the plan library. DECODE also extends the plan representations to support the definition of plans which are conditionally implied by other plans.

DECODE builds an initial knowledge base about a program by forming a partial understanding of it by examining it for standard programming patterns. This partial understanding can be extended by a programmer using the structured notebook. The structured notebook allows the programmer to create design primitives, examine selected code fragments and link them to design primitives. It also enables the programmer to ask conceptual queries about the code and its relationship with the extracted design. For instance, the programmer can query the system about the purpose of a particular piece of code, the location of the code corresponding to a particular design element, or unrecognized design elements in the source code.

An alternative approach to plan recognition is modeling the problem as a Constraint-Satisfaction Problem (CSP) [Quilici et al 1998][Quilici and Woods 1997]. A CSP consists of a set of variables, a finite domain value set for each variable, and a set of constraints among variables that restrict domain value assignments. Each variable ranges over a domain consisting of actual program Abstract Syntax Trees (ASTs) or recognized subplans that satisfy a set of constraints on the type of the variable. The actual occurrences of each of these components in the source code correspond to possible domain values for the variable. The data flow and control flow relations which must hold between different components are modeled as inter-variable constraints between plan variables. This modeling is called MAP-CSP. A solution to the MAP-CSP is any assignment of domain values to plan variables that satisfies the constraints between variables, and corresponds to an instance of a plan that has been identified.
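To give a flavor of the MAP-CSP formulation, the schematic Python sketch below uses an invented two-component plan, toy stand-ins for AST occurrences, and a crude proxy for a data flow constraint; it simply enumerates the assignments that satisfy the inter-variable constraint, whereas real recognizers operate over full abstract syntax trees and much richer constraint sets.

from itertools import product

# Hypothetical plan with two component variables; each domain lists the occurrences
# (construct kind, source line) of the required construct in the program being analyzed.
domains = {
    "loop_init":   [("assign", 10), ("assign", 42)],
    "loop_update": [("increment", 12), ("increment", 45)],
}

def same_variable(a, b):
    # Stand-in for a data flow constraint: require the two components to be close
    # together, as a crude proxy for operating on the same program variable.
    return abs(a[1] - b[1]) <= 5

# A recognized plan instance is any assignment of domain values that satisfies
# all inter-variable constraints.
solutions = [
    {"loop_init": init, "loop_update": update}
    for init, update in product(domains["loop_init"], domains["loop_update"])
    if same_variable(init, update)
]

print(solutions)  # each solution corresponds to one recognized instance of the plan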

The plan library is normally divided into layers, in which plans at each level are constructed from lower level plans. The bottom layer corresponds to plans whose components are only ASTs. The next layer includes the plans whose components are either AST items or plans from a lower level, and so on. Quilici has also proposed a CSP framework for evaluating program understanding algorithms [Quilici and Woods 1997].

2.2.2.8 Model Oriented Reengineering Process for HCI[19] (MORPH)

The Model Oriented Reengineering Process for HCI (MORPH) [Moore and Rugaber 1997] [Rugaber 1999] is a process for reengineering text based legacy applications' user interfaces to a WIMP[20] style graphical user interface, and a toolset that supports the process. The steps in the process are:

      Analyzing the source code to detect interaction objects

      Building a model of the existing user interface based on the results obtained in the first step

      Transforming the resulting model to a graphical environment

MORPH maintains a set of user interface abstractions that can be recognized from the legacy code. A hierarchy of concepts, composed of abstract interaction objects[21], is stored in MORPH's knowledge base. To allow transformation from abstractions to a particular Graphical User Interface (GUI) toolkit by inferencing, components of each toolkit are also represented in the knowledge base. After a domain analysis of user interfaces, MORPH's knowledge base was built in CLASSIC [Borgida et al 1989]. The domain analysis was performed in both top down and bottom up fashion. A taxonomy of possible user interactions was built by analyzing 22 legacy systems, using static and dynamic methods. After studying the user interface community literature, the taxonomy was augmented with interactions that were not found in the legacy systems but could conceivably be part of a text based user interface.

CLASSIC provides a variety of inferencing capabilities. In particular it provides classification, which allows concepts to be identified as more general or specific cases of a given concept. Classification was used in MORPH to infer the most specific toolkit object for the description of a given abstract object.

2.2.2.9 Programmer's Apprentice

The goal of the Programmer's Apprentice project [Waters 1982][Rich and Waters 1988][Rich and Waters 1990] is to create a software system that will act as a junior partner and critic for a software engineer.

The Programmer's Apprentice relies heavily on shared knowledge between the software engineer and the apprentice. Two kinds of knowledge have been envisioned [Rich and Waters 1990 p.4]:

      knowledge about the system under consideration. For instance knowledge about system requirements, design, implementation and dependencies among them.

      a large shared vocabulary of software engineering terms and concepts. These shared concepts are called clichés. Clichés range from high level requirements and design ideas to low level implementation techniques.

The major activity in the Programmer's Apprentice project is identifying and codifying useful clichés and relations between them. Each apprentice requires a large library of clichés [Rich and Waters 1990 p.5].

A cliché contains both fixed parts and parts that vary in the different occurrences of it. A cliché can also have constraints that limit the implementation of parts, or compute some parts from other parts.

Plan calculus is used as a formal representation for programs and algorithmic clichés.

2.2.2.10 Recognizer

Recognizer is a prototype that demonstrates the feasibility of cliché recognition. Recognizer can build a hierarchical representation of a program given a library of algorithmic clichés [Rich and Waters 1990 p.171]. The demonstration system is applied to programs written in Common Lisp. Given an appropriate plan library, Recognizer can create a document which explains the nature of the program. Recognizer's main output is a graph grammar derivation tree. A paraphraser generates the document from the derivation tree.

At the heart of the Recognizer is a flow-graph parser. A flow graph is a labeled, directed, acyclic graph in which input and output ports of the nodes are connected by edges. Each node type has a fixed number of input and output ports and fan-in and fan-out are allowed. A flow graph is derived from a context-free flow-graph grammar. The grammar is a set of rewrite rules, which specify how a node in the graph is replaced by a subgraph [Rich and Waters 1990 p.175].

The flow graph to be parsed by the Recognizer is computed from the source code. Also, the cliché library is translated to an attributed flow graph grammar. Plans are encoded as individual rules. The flow graph generated from the source code is parsed against the grammar, and a set of constraints is derived from the cliché library. The result is the derivation tree [Rich and Waters 1990 pp.173-180].

2.2.2.11 Program Analysis Tool (PAT)

Program Analysis Tool (PAT) is a knowledge-based tool for analyzing programs [Harandi and Ning 1990]. PAT tries to help maintainers answer the following questions:

      What does the program do?

      How are the high level concepts encoded in terms of low level concepts?

      Are the recognized concepts implemented incorrectly? [Harandi and Ning 1990 p. 75]

PAT first converts the source program to a set of language independent objects called events, and stores them in an event base. Events represent Program Knowledge. They are organized in an object oriented event classification hierarchy. At the lowest level, events represent language constructs like statements and declarations. At the highest level, events can represent algorithms for common problems such as sorting. An event can inherit attributes from its ancestors in the event hierarchy, and it can also have its own attributes. The Understander uses this event base to recognize high level events (function oriented concepts) and adds the newly recognized events to the event base. The process is repeated until there are no more high level events that can be recognized. This final set of events provides the answer to the first question above.

The Understander uses a deductive inference rule engine. It uses a library of program plans as inference rules to derive new high level events. Plans represent the Analysis Knowledge of PAT. Each plan contains understanding, paraphrasing, and debugging knowledge. When the Understander identifies a new event, a Justification Based Truth Maintenance System (JTMS) records the result and its justification. The Explanation Generator component uses the JTMS to show how high-level events are derived from low level events, and answers the second question. After the events are loaded into the event base, the Understander starts the recognition process by calling the plans from the plan base and testing whether their events match the input events.

The program plans contain knowledge about near miss implementations of the events that are recognized by the plan. Using this knowledge, a debugger can recognize possible mis-implementations, which answers the third question.

The Editor component of the system allows the maintainer to interactively modify the program. If necessary, the JTMS is updated appropriately. The Paraphraser component can create program documentation by tracing the JTMS from the top, and from the information in the text slot of each event.

A PAT analysis does not prove anything rigorously. PAT was a prototype with 100 program event classes and a few dozen plan rules. The authors believe it would require at least several hundred event classes and plans to be practically applied.

2.2.2.12 LaSSIE

LaSSIE [Devanbu and Ballard 1991] is an experimental knowledge-based information system. It has been built to assist maintainers of an AT&T telephone switch or private branch exchange (PBX) system called Definity 75/85. This PBX system consists of approximately 1 million lines of C code. LaSSIE runs on Symbolics 3600 machines under ZetaLisp/Flavors and has been partly ported to run on Sun workstations. It consists of a knowledge base, a window interface, a graphical browsing tool and a customized version of a natural language interface. It has been designed to be used in a formulate-retrieve-reformulate setting. If the retrieved information is not satisfactory, the query can be reformulated using a description of the retrieved items or by exploring the knowledge base.

The knowledge base is a crucial component of LaSSIE. It primarily captures the conceptual viewpoint of the functioning of a software system, and some information regarding its structure. The description of the actions performed by Definity 75/85 was coded in the KANDOR knowledge representation system [Patel-Schneider 1984], which classifies them into a concept hierarchy using a formally defined subsumption inference operation. The LaSSIE knowledge base contains 200 frames and 3800 individuals describing the functional, architectural and code level concepts and their interrelationships in Definity 75/85. At the top level, the relationships between concepts are captured by various slot-filler relationships. The most specific action types correspond to individual functions or source files.

In addition to the above information, LaSSIE also maintains a knowledge base about the C language, C programming conventions, Unix file structure and directories, C source files, header files, object files, and other detailed information such as defined-in and referenced-in relationships, along with Definity 75/85 software methodology information such as file naming conventions.

The natural language interface of LaSSIE maintains data structures for each of several types of knowledge about Definity 75/85, including a taxonomy of the domain, a lexicon and a list of compatibility tuples, which indicate plausible associations among objects.

While LaSSIE was successful in handling many classes of queries about the Definity 75/85 system, the authors admit that constructing such a knowledge base is labor intensive work.

2.2.2.13 Knowledge-Based Assistant for Reverse Engineering (KARE)

Knowledge based Assistant for Reverse Engineering (KARE) is a prototype knowledge based reverse engineering tool for C programs [Palthepu 1997]. It uses granularity hierarchies for representing programming clichés. Granularity hierarchies are directed graphs in which nodes represent strategies. Strategies are connected to each other by two kinds of links:

      abstraction which provides generalization and approximation relations

      aggregation which provides the part‑whole relation.

Aggregation links relating an object to its parts are grouped together in what are called K-clusters. Abstraction links relating refined versions of a concept to more abstract concepts are clustered in L-clusters. Intra-cluster links in K-clusters provide AND semantics, while inter-cluster links provide OR semantics. Links in an L-cluster provide an XOR semantic. The lowest level concepts can be directly detected by observers, which encode recognizers that can match instances of them in the real world. Contexts encode information regarding where in the real world a particular object appears. They provide a focusing mechanism to limit the search space for recognizing an object. Constraints allow restrictions to be placed on recognition, and Controls encode information used by the recognition algorithm. In the context of reverse engineering, granularity hierarchies are used to capture human strategic programming knowledge. KARE has three large clichés containing approximately 200 objects.

Agenda is a data structure in KARE that contains a list of clichés that the recognition algorithm should process. The recognition algorithm works bottom-up. The user can specify part of the source code to which the recognition algorithm should be applied. A strategy object can be recognized if any of its refinements are recognized, or if all of its aggregation children in a K-cluster are recognized, and the associated constraints are satisfied. The recognizers in KARE are functions that work on abstract syntax trees of C programs.

The reverse engineer using KARE is supposed to select the relevant files for cliché recognition. The user can also select a specific cliché to be recognized by KARE. He or she can also guide the recognition by intervening via the agenda. KARE was tested on three subsets of the NCSA Mosaic version 2.6 source files. These subsets contained 1320, 5300, and 10909 lines of code. The system only had three large clichés, to recognize linked lists, searching algorithms in composite data structures, and date and time operations [Palthepu 1998 pp. 99-105].

2.2.2.14 Summary of Knowledge-Based Maintenance Support Systems

Table 2.1 shows a summary of KBSE systems discussed in this chapter.

Table 2.1 Summary of KBSE Systems Discussed

Name | Specialization | Representation | Type of Knowledge
COGEN | Adaptive | Program syntax tree, Prolog facts and rules | Transformation rules
MACS |  | Augmented Transition Network, frame based representation | Language, software development method, maintenance activity, tool set
PM |  | Frame-like schemata | Configuration, activities, modules, people, and the relationships between them
IPE |  | A tree or graph of frames | Textual, syntactic, semantic, and application related structures in programs; programming process; programming patterns
ISLA | Lisp programs' semantic errors | Production rules | Programming style of the programmer, syntactic clues, domain analysis
Maintainer's Assistant | Program understanding/restructuring | Program plans | Program plans, maintenance activity, program type plans, source code
DECODE |  | Program plans, AST | A plan library
MORPH | Adaptive | CLASSIC (slot and filler) | User interface interaction object hierarchy
Programmer's Apprentice |  | Plan calculus, cliché | Application knowledge, a cliché library
Recognizer | Program understanding | Flow graph, context-free flow grammar | A cliché library
PAT | Program understanding / mis-implementation detection | Program plans, event objects | A plan library, an event hierarchy
LaSSIE |  | Frame based knowledge base system | Functional, architectural, and source code level concepts and their interrelations; C programming language concepts; software methodology
KARE |  | Granularity hierarchy | A cliché library of programming strategies

 

2.2.3 Inductive Approaches Applied to Software Maintenance

Knowledge acquisition is at the heart of the process of building most knowledge based software assistants. However, this process suffers from the following difficulties:

      it requires great effort to build and maintain a knowledge base,

      there is a shortage of trained knowledge engineers to interview experts and capture their knowledge,

      it is very time consuming, leading to lengthy development, and

      it must be continued if the system is to be maintained in routine use at a high level of performance.

The argument in favor of learning systems is that by examining the record of successfully solved cases they have the potential:

      to exceed the performance of experts and

      to discover new relationships among concepts and hypotheses

Thus the process of learning automatically holds out the promise of incorporating knowledge into the system without the need of a knowledge engineer.

Yet, there are strong practical reasons to expect that what can be learned directly from sample experience alone is limited, if it ignores the context within which problem solving is carried out. Thus there is a need to combine domain-specific expert knowledge with learning approaches [Weiss and Kulikowski 1991 p. 3]

Expert systems, based on explicit encoding of an experts knowledge, are viewed as alternatives to learning systems. In some applications, where expertise is limited, these learning methods may surpass an expert system in performance, as they can aggregate knowledge that has yet to be formalized. In other instances, learning approaches may provide a minimal threshold of performance that must be surpassed in order to justify the investment of building an expert system. [Weiss and Kulikowski 1991 p. 3]

Perhaps for philosophical reasons, such as the goal of eliminating the need to maintain source code by automatically generating programs from their specifications, not much work has been done in the KBSE community to directly address source level maintenance issues of legacy systems; the body of work applying inductive techniques to assist maintenance programmers is even smaller. In this section we present some applications of inductive methods in software engineering that could aid maintenance programmers in their work.

2.2.4 Application of Inductive Learning in Software Maintenance

In this section we briefly describe some of the projects that employ inductive learning techniques which are designed to assist maintenance programmers with their task.

2.2.4.1 Fault Density Prediction

Inductive logic programming (ILP) approaches have been applied to predict fault density in C++ classes [Cohen and Devanbu 1999,1997]. Each training example is a C++ class definition, represented as a calling tree. There are two classes, positive and negative, indicating whether there were faults in the class implementation. There were 122 examples, collected from the study of an entire software engineering class over one semester. Fifty-eight examples were classified as positive. The subjects were asked to develop a medium sized information management system. The projects were tested by an independent group of software professionals and the components with faults were recorded. Relations describing class coupling and inheritance were extracted from the source code. This amounted to 20,929 background facts. Two ILP systems, FOIL [Quinlan 1990] and FLIPPER [Cohen 1995a], were used in this study. A 10-fold cross validation[22] was used to estimate the error rates. While this study was more concerned with machine learning issues, the best reported error rate was 19.7%, which was obtained by FLIPPER. The study also showed that the error rate of propositional learners such as C4.5 and RIPPER, given appropriate features, was not statistically significantly worse than the best of the ILP results.
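As a reminder of how such error estimates are produced, the generic sketch below runs a k-fold cross validation loop over a placeholder learner; the data, labels, and learner are invented for illustration and do not correspond to the ILP systems or the dataset of the cited study.

import random

def k_fold_error(examples, labels, train_fn, k=10, seed=0):
    """Estimate the error rate of a learner by k-fold cross validation (generic sketch)."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for test in folds:
        train = [i for i in idx if i not in test]
        model = train_fn([examples[i] for i in train], [labels[i] for i in train])
        errors += sum(model(examples[i]) != labels[i] for i in test)
    return errors / len(examples)

# Placeholder learner: always predicts the majority class of its training labels.
def majority_learner(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = [[0], [1], [2], [3], [4], [5]]
labels = ["pos", "pos", "neg", "pos", "neg", "pos"]
print(k_fold_error(data, labels, majority_learner, k=3))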

2.2.4.2 Design Recovery

Merlo [Merlo et al 1993] [Merlo and De Mori 1994] reports on the usage of neural nets in design recovery. The system is intended to be used along with more traditional approaches to design recovery. They have used a combination of top down domain analysis and a bottom up informal information analysis using neural networks. The informal information analyzed consists of comments and identifier names. The approach incorporates a human expert as part of the process. The expert provides a taxonomy of concepts recognizable from the comments in the source code, the abbreviations and other system tables used for preprocessing the source code, and selects a network architecture and topology.

Experiments were performed using simple feed forward networks and recurrent networks with local feedback [Gori et al 1989]. The results showed that the neural network classifiers performed better than a lexical matcher in terms of percentage of correctness (more than 60% better accuracy); in other words, the neural network classifications were closer to the correct classification than those of the lexical matcher.

2.2.4.3 Locating High-Risk Software Modules

Porter has used classification tree analysis (CTA) to automatically isolate high-risk software modules [Porter 1994]. The term high-risk denotes software modules that possess some specific type of fault, such as an interface or logic fault. High-risk properties of interest, such as modules that have one or more interface faults, or that take more than a certain number of hours to maintain, define a target class. A target class is a set of modules that are likely to possess the property of interest. For each target class one classification model is generated. CTA is derived from the classification algorithms of ID3 [Quinlan 1986] and CART [Breiman et al 1984]. The data for this study was gathered from six NASA projects ranging in size from 7000 to 34000 lines of code. Each system had between 99 and 366 modules[23]. Over 1400 modules from these six systems were examined. Each module was represented by 19 attributes, covering a wide range of information from development effort, faults, and changes to size and static analysis measures. Correctness, completeness and consistency were measured for the generated trees[24]. Experiments showed that across all tree applications, the fault class was correctly identified for 72.0% of the modules; 82% of the modules that had a fault were correctly identified; and 17% of high-risk predictions were actually high risk. The performance of CTA was compared to two simple strategies: randomly selecting n modules, and selecting the n largest modules. Applying CTA to the data gathered from actual software projects, with the goal of identifying four specific types of faults, showed that:

      CTA was an effective and feasible approach for identifying software modules with specific types of faults.

      Even when the percentage of actual faulty modules in the system was low, classification trees were effective in identifying them.

      Trees were successful in identifying most of the high-risk modules.

      Simpler classification strategies did not perform better than CTA.

Briand and his colleagues have applied logistic regression and Optimized Set Reduction (OSR) methods to build models that support the identification and understanding of high-risk components in Ada designs [Briand et al 1993]. They used measures from the design phase to identify potential problems in the delivered product.

The OSR method was developed at the University of Maryland, and is based on both statistics and the machine learning principles of decision tree induction [Briand et al 1992]. Through the use of a search algorithm it generates a collection of logical expressions, known as patterns, which characterize the observable trends in the data. The models are built for two high-risk classes, high isolation cost and high completion cost. A component is considered to have high isolation cost if it requires more than one day to isolate and understand. A component is placed in the high completion cost class if correcting a defect in it takes more than one day, after isolating the component. The data for the experiment came from a number of Ada systems at the NASA/Goddard Space Flight Center Flight Dynamics Division. A random selection of 150 components from three Ada systems was used to build and evaluate the models. For both classes of interest, an equal number of components was used to avoid building biased models. A 1-out (leave-one-out) cross-validation[25] approach was used in building and validating the models. If multiple patterns generate conflicting classifications for one component, the patterns that do not show significant reliability are first eliminated, and then, among the remaining patterns, the one with the highest reliability is selected.

The models were evaluated for correctness and completeness in identifying high risk components, and interpretability. This study showed that:

      Logistic regression and OSR had similar results in terms of high class correctness, but in terms of average completeness and correctness OSR performed much better than logistic regression. The logistic regression method could not achieve a comparable level of completeness without loss of correctness.

      For Ada systems, it was possible to build accurate risk models during the design phase to help designers prevent difficulties in the later phases.

      Patterns were more stable and interpretable structures than regression equations when the theoretical underlying assumptions were not met. On the other hand OSR can be relatively less accurate if the assumptions underlying the logistic regression analysis are met.

      Computation for OSR is more intensive than for an analytical model.

2.2.4.4 Software Quality Prediction

Almeida and Matwin have applied machine learning to the task of software quality prediction [Almeida and Matwin 1999]. They view software quality as a multi-dimensional concept that contains properties such as modifiability, changeability, etc. Each quality dimension is treated as an unknown. It is assumed that software units such as files, modules, procedures etc. can be described by a number of attributes whose values are available. The value of the particular quality measure for which the model is built should also be known for each software unit. The training data is a collection of attribute values for each unit and the corresponding value of the quality measure for that unit. Consequently, predicting the quality measure class can be handled as a classification task.

In their study Almeida and Matwin used 19 metrics, including size metrics (KLOC, function points), comment lines, blank lines, Halstead metrics and others. They applied the method to a set of 355 COBOL programs from Bell Canada. For this data they built five different models using NewID [Feng and Mitchie 1994 pp.65-67], CN2 [Clark and Niblett 1989][Clark and Boswell 1991], C4.5, C4.5 rules [Quinlan 1993] and FOIL [Quinlan 1990]. Except for the FOIL results, the accuracy, completeness and correctness evaluations for the models generated by the other methods were very close. While the experiments using FOIL generated poorer models, this has been attributed to the lack of relational attributes in the representation used. The model generated by C4.5 rules was judged to be the most comprehensible one. The results show that, on unseen data, using high vs. low as values of the class label, the cost of alteration can be correctly predicted three times out of four.

2.2.4.5 Case-Based Reasoning and Inductive Logic Programming Applied to Software Reuse

Fouqué and Matwin have applied machine learning to help and enhance the reuse process [Fouqué and Matwin 1993]. CAESAR is a case based reasoning system for compositional reuse of C programs. A librarian is expected to choose a library of reusable programs to be used by CAESAR. The librarian should also be familiar with the application domain to understand the structural decomposition of the program in functional terms, and be able to define a rich enough vocabulary to describe the functions and data types which are in the library [Fouqué and Matwin 1993 p. 167]. The specification of the problem is given in the form of Prolog-like high level goals representing the top-level functions of the specification. This specification is adapted to the content of the case base. The user can select one or more adapted problem specifications and ask CAESAR to construct the corresponding code. The refined version is matched against the case base, the associated cases are retrieved, and composition is performed. The user then evaluates the behavior of the solution. Based on the analysis of the success of its adaptation, CAESAR may suggest to the librarian the addition of some new knowledge to the case base.

A very important property of any case based reasoning system is completeness. Increasing completeness requires an ability to acquire new cases or knowledge. CAESAR uses Inductive Logic Programming (ILP) [Muggleton and Buntine 1988] to find regularities in the theory it has built over the course of several sessions and then proposes new knowledge to be added to the case base [Fouqué and Matwin 1993 p. 186].

2.2.4.6 Estimating Software Development Effort

Srinivasan and Fisher [Srinivasan and Fisher 1995] have built models for estimating software development effort using regression trees[26] [Breiman et al 1984] and the neural network backpropagation method [Rumelhart et al 1986]. It was assumed that the effort was measured in months. The results were compared to the ones obtained from the COnstructive COst MOdel (COCOMO) [Boehm 1981], Function Points [Albrecht and Gaffney 1983], and SOftware LIfecycle Management (SLIM) [Putnam 1978] models.

In the case of the model built using backpropagation, the inputs to the network correspond to different project attributes, and the output of the network corresponds to the estimated development effort. Each value of a nominal attribute was given its own input line, with the values 1.0 and 0.0 representing the presence or absence of that value, respectively.
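As an illustration of this encoding, the following sketch converts a nominal project attribute into separate 0.0/1.0 network inputs; the attribute name and its values are hypothetical, chosen only to mirror the kind of project data described above.

    def one_hot(value, possible_values):
        """Encode a nominal attribute as one input line per possible value."""
        return [1.0 if value == v else 0.0 for v in possible_values]

    # e.g., a hypothetical 'application type' attribute with three possible values
    print(one_hot("business", ["business", "scientific", "system"]))   # [1.0, 0.0, 0.0]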

The training set used was the data about the 63 projects from which the COCOMO model was developed [Boehm 1981]. These projects include instances of business, scientific, and system software, written in a variety of languages including FORTRAN, COBOL, PL/I, etc. The test data set was built from the data previously used in a comparative study of COCOMO, Function Points, and SLIM [Kemerer 1987]. Kemerer's database covered 15 projects, mainly business applications written in COBOL. The results obtained from the experiment show that there exists a strong linear relationship between the estimated effort and the actual development time. Both learning approaches were competitive with the traditional models examined by Kemerer; the learning systems performed better than the COCOMO and Function Point models, and worse than SLIM [Srinivasan and Fisher 1995 p. 131].

2.2.5 Summary

This chapter provides a review of some of the existing work in knowledge based software engineering (KBSE) and machine learning that is considered most relevant to the topic of this thesis. In summary, most KBSE systems employ some sort of knowledge repository. Whether this is a full-fledged knowledge base, a concept hierarchy, an event hierarchy, a granularity hierarchy or a plan library, each one of these systems explicitly encodes some form of knowledge. Many of the systems reviewed in this chapter have not been applied to real world large legacy systems, mostly because acquiring and encoding such knowledge is very time consuming and often requires knowledge of a particular application domain.

The application of machine learning methods to software engineering is still not very widespread. As reported in this chapter, they have been used in problems such as fault detection, concept assignment, software quality estimation, and software development effort estimation. This thesis is an attempt to apply machine learning techniques to an unexplored area, namely building models of relevance among components of a software system, with the goal of, directly or indirectly, assisting software engineers in low level software maintenance activities.

 


Chapter 3

 

Methodology

 

 

 

3.1. Introduction

As we discussed in Chapter 1, the purpose of this research is to investigate the possibility of learning the concept of Relevance among different software constructs, such as files or routines, in a large software system[27]. We are employing machine learning algorithms to induce the definition of a Relevance Relation, herein referred to as Relevance, within the context of software maintenance activity. As will be further discussed in the following sections, the particular relation we investigate is the Co-update relation, i.e., whether a change in one file may require a change in another file. In this chapter we discuss our methodology in detail. The chapter also includes a discussion of the limitations of the approach as it stands.

The most flexible relevance relation is a mapping of m objects into a real value between 0 and 1, indicating the degree of relevance between these objects. In the software maintenance context, an object could be a file, routine, variable, document, etc. The discrete version of such a relation is a function that will map Software Objects (SOs) of interest into k discrete values, or categories of relevance. This discrete version can be readily translated into a classification task, where we would like to classify m given objects, as one of k alternative classes of relevance.

The technique we use to learn the Co-update relation falls into the category of Supervised Learning methods. In supervised learning, the goal is to learn the Concept of interest, e.g. the Co-update relevance relation, from a set of labeled, or pre-classified, examples of that concept. The words Concept and Model are used interchangeably in this thesis, and an example is also referred to as a Case. The output of this learning process is a Classifier.

A very common representation for examples is called Feature Vector Representation. First a set of features, or Attributes, of the concept of interest is chosen. Then, for each pre-classified, or pre-categorized, example of the concept, a vector of values corresponding to the selected features is created. The set of all such examples used to learn the concept of interest is referred to as the Training Set.
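In our setting, for instance, an example describes a pair of files. The following sketch shows what a single case in feature vector form might look like, using a small subset of the attributes defined later in this chapter; the concrete values are made up for illustration.

    # One pre-classified example (case) in feature vector representation.
    example = {
        "number_of_shared_files_included": 3,     # integer attribute
        "directly_include_you": True,             # Boolean attribute
        "same_extension": False,                  # Boolean attribute
        "class": "Relevant",                      # label assigned by the Co-update heuristic
    }

    # A training set is simply a collection of such vectors.
    training_set = [example]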

As shown in Figures 3.1 and 3.2, the training set is input to a learning system, which based on the given examples, outputs a classifier. The generated classifier can be used in future to classify unseen, i.e. unclassified, examples or cases.

Figure 3.1 Creating a Classifier from a Set of Examples

Figure 3.2 Classifying a New Case

The rest of this chapter is dedicated to the process of generating such classifiers, particularly classifiers that are an instance of the Co-update relation. Some finer details are discussed in Chapter 4, which presents the experiments performed and empirical evaluation of the results.

3.2. General Approach

The process of learning a Relevance Relation in the context of software maintenance is shown in Figure 3.3. The process itself is general enough to be used in other applications of machine learning, and it is closely related to the processes suggested by other researchers such as [Saitta and Neri 1998] and [Piatetsky-Shapiro et al. 1996]. However, as the figure suggests, the data and knowledge sources are geared towards those available in a software development and maintenance environment. To that end, the discussion that follows will be mostly focused on source code level software maintenance and, more specifically, on the particulars of our research and the lessons we have learned.

Figure 3.3 The Process of Learning a Maintenance Relevance Relation

The process can be divided into three stages: Pre-learning, Learning, and Post-learning. However, as seen in Figure 3.3, there are backward arrows, indicating that at many steps in the process one may need to revisit an earlier step. The following sections provide more details about each of these stages.

3.2.1 Pre-Learning Stage

In the pre-learning stage, we start by determining what sources of knowledge and data are available to be used in solving the problem of interest. This step is followed by the actual acquisition of the knowledge and data, and by processing the raw information to bring it to a form that can be used in the learning phase. It is estimated that these steps contribute 50-80% of all the effort in real life data mining projects [Dorian 1999]. Obviously, the assumption is that the maintenance problem we are trying to address is already defined. In this research, the problem is to study the feasibility of learning a useful Maintenance Relevance Relation, and to report on the difficulties and challenges involved in the process. In practice, the particular problem of interest could be much more specific.

3.2.1.1 Identifying Sources of Data and Knowledge

Some of the sources of knowledge and data available in software development/maintenance environment are:

      Source code

      Bug tracking and software configuration systems

      Documentation

      Experts in the software system of interest

      Software Engineers' interaction with the source code while maintaining the software

In our research we have used the first three resources above fairly extensively. The details are explained in the relevant sections in the rest of this chapter. Due to the limited availability of the SEs' time, we had to minimize the burden on them by using some of the available documents, at the expense of our own time. The usage of the fifth resource above is discussed in Appendix A.

Although most medium to large size software development companies keep a record of the changes made to their software, almost universally this information is not stored with data mining or machine learning in mind. One may also need to pull together data from disparate sources and process it further to create the information and data to be used in the learning phase.

To be able to learn the Co-update relation, we need to find instances of files being changed together during the maintenance process. To do this, we need to acquire an understanding of the process of applying changes to the source files in the system. Figure 3.4 depicts the main steps of this process in our target company. Although the terminology and some of the finer details may vary from one company to another, the figure is general enough that one would expect to see a similar process followed in many other companies.

First a problem report is posted in SMS, a software management system developed at Mitel Networks. SMS provides the functionality of source management and bug tracking systems. Using SMS, developers can keep track of the problems reported about a system, along with a history of responses to each problem.

Each problem report is assigned to an SE. After his or her investigation of the report, the SE may conclude that the program source needs to be changed. The changes to the system are recorded in the form of an update.

"An update is the basic unit of identifying actual (as opposed to planned) updates to the software." [Bryson and Kielstra 1992 p. 9]

The update and the corresponding changes to the source code are not considered final before they go through testing and verification. These steps may be repeated until a decision is made that the changes are acceptable, at which time the update is assigned a closed status. Once an update is closed, it is considered frozen, meaning it cannot be reopened or altered in any way. If at a future time it is determined that a closed update was inadequate, a new update needs to be initiated to address the problem.

Figure 3.4 The Process of Submitting an Update to Address a Problem Report

3.2.1.2 Acquire Data and Knowledge

SMS provides queries that show the files changed by an update. However, to create a relatively reliable list of the files changed as a result of an update, the results generated by more than one query needed to be programmatically combined and filtered.

3.2.1.2.1 Creating examples of Relevant and Not Relevant File Pairs

To classify the relevance between m software objects, e.g., files, one needs to define k classes, or categories, of relevance. This thesis presents a two-class maintenance relevance relation between a pair of files, called the Co-update relation; in other words, both m and k are equal to two[28]. A three-class formulation is also discussed briefly in Appendix A.

The two classes of relevance used in this thesis are:

      Relevant

      Not Relevant

Ideally, an expert should provide the best examples of these two classes of relevance between pairs of files. An expert could also dismiss some of the examples extracted by other means as unimportant or non-representative. However, due to the industrial setting of our project, the size of the software system under study, and the shortage of SE time, we cannot rely on SEs to classify each pair of software objects. In other words, we cannot directly apply supervised machine learning techniques that depend on expert-provided class labels. Therefore, we have opted for a hybrid of heuristics and supervised learning which relies solely on the stored maintenance data. One would expect to generate even better results than those reported here, should there be a well established expert support structure in the target organization.

In the following sections, we discuss the heuristics used to find examples of Relevant and Not Relevant file pairs. The attributes used in describing the examples are defined in Sections 3.2.1.2.5 and 4.10.

3.2.1.2.2 Finding Relevant and Not Relevant File Pairs

The first step in creating examples of the Relevant and Not Relevant classes is to find the file pairs associated with the examples. Once the file pairs are found, one can generate the examples by calculating the values of the predefined features and assigning each example the appropriate class label. This is shown in Figure 3.5.

Figure 3.5 File pairs, Features, and Examples

A question that may arise in the reader's mind is: why should we learn if there are already heuristics that can label examples? Perhaps the most interesting feature of a learning system is its ability to generalize beyond the examples used to learn a concept. This is in contrast to a system that memorizes examples of a concept and is not capable of classifying unseen examples. As discussed in the following section, the heuristics that suggest example labels are based on information extracted for a certain period of time and provide a snapshot of the system for that particular time window. They generate a set of file pairs with the corresponding class labels; we cannot find the class label for a new file pair that is not in the set generated by these heuristics. However, the models generated by a learning system can make predictions about examples that were not seen during the training phase.

A second benefit of learning over the simple use of heuristics is that if the learning method generates an explainable model, e.g. a decision tree, we may be able to document nontrivial relations among files. As we mentioned in Chapter 1, such information can be used as part of a reverse engineering effort.

3.2.1.2.3 Heuristics Based on File Update Information

Our heuristic to classify a training example as Relevant relies on information stored in SMS.

Co-update Heuristic: Files which have been changed together in the same update are Relevant to each other[29].

Motivation: Updates are capable of capturing design decisions at the level of implementation. They can indirectly show the parts of the SRG that have been traversed in the process of system maintenance over a certain period of time. In other words, they can provide clues to the kinds of relations that exist among software entities that have been subject to the maintenance process.

Not Relevant Heuristic: Two files that have never[30] been changed together are Not Relevant to each other.

If T is the period of time to which the Co-update heuristic was applied, and T' is the period of time to which the Not Relevant heuristic is applied, then T' includes T, i.e., T ⊆ T'.

Motivation: If two files have not been changed together in response to problems occurring in a software system over a period of a certain number of years, this can be considered good evidence of independence between these files, or perhaps of the existence of a relation that is not affected by the typical maintenance activities applied to these files.

An update can change zero or more files. We are interested in updates that affect more than one file; these are the files that we consider to be Relevant to each other. If an update changes n files, we generate C(n, 2) = n(n-1)/2 pairs of relevant files[31]. However, some updates change a large number of files. In this thesis the number of files changed by an update is referred to as the group size. A group size of n is written as Gn, and if there is no group size limit it is indicated as NGSL[32]. It seems logical to assume that the smaller the size of a group, the closer the relation between its member files. Another way of interpreting this is that for a small n the problem addressed by the update was probably very specific, affecting a small number of files which are closely related to each other. As will be discussed in the next chapter, in the system that we studied one can limit the size of a group and still cover most of the updates. One clear alternative to limiting the group size is not limiting it at all.
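A minimal sketch of the Co-update heuristic and the group size restriction described above is given below; updates are represented simply as lists of the file names they changed, which is an abstraction of the information extracted from SMS.

    from itertools import combinations

    def relevant_pairs(updates, group_size_limit=None):
        """Apply the Co-update heuristic: files changed by the same update are Relevant.

        updates          -- iterable of updates, each a collection of changed file names
        group_size_limit -- skip updates changing more than this many files (None = NGSL)
        """
        pairs = set()
        for changed_files in updates:
            files = sorted(set(changed_files))
            if len(files) < 2:
                continue                 # updates touching 0 or 1 file yield no pairs
            if group_size_limit is not None and len(files) > group_size_limit:
                continue                 # enforce the Gn group size restriction
            pairs.update(combinations(files, 2))    # C(n, 2) = n(n-1)/2 pairs per update
        return pairs

    # e.g., two updates with a group size limit of 20 (G20)
    S_R = relevant_pairs([["a.pas", "a.typ", "b.pas"], ["x.pas"]], group_size_limit=20)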

Corresponding to a set SR of relevant file pairs there is a set SNR of not relevant file pairs, where SR ∩ SNR = ∅. Effectively, each file f is paired with j files in the set fR of files relevant to f, and with k files in the set fNR of files not relevant to f, where |fR| = j, |fNR| = k, and fR ∩ fNR = ∅.

We denote the training set of file pairs as TRS, and the testing set of file pairs as TSS. Each of these sets has two subsets, corresponding to relevant and not relevant file pairs. They are denoted as TRSR, TRSNR, TSSR, and TSSNR.

In general, to create the set SNR of not relevant file pairs, we generate a set of potential file pairs in the target software system, and then remove all the pairs that we know are relevant to each other. In other words, we are making a closed world assumption.

While labeling two files as relevant is based on facts, i.e., update records, labeling a pair of files as not relevant is based on the lack of facts. Consequently, due to this lack of knowledge, there is an inherent error in labeling file pairs as not relevant. The larger the number of updates considered, the larger the set of relevant files, and consequently the smaller and more accurately labeled the set of not relevant pairs will be. However, short of experts' feedback or other sources of information, the only way one can obtain more updates is to consider a longer period of the change history of the system. Assuming such a history exists, and considering the fact that software systems evolve over time, the larger this history window, the higher the possibility that the extracted information is no longer valid for the current version of the system. In other words, there is a limit on the number of useful relevant file pairs. This issue is further discussed in the future work chapter.

Table 3.1 shows the distribution of files, by type, in the release of the system that we used to generate the set of negative (Not Relevant) pairs. The .pas, .typ, and .if files are Pascal source files used to define the main program and procedures, and type and interface definitions, while the .asm and .inc files are assembler source files.

Table 3.1 Distribution of Files by Their Type

File Type        File Count
pas              1464
typ              917
if               1134
asm              863
inc              308
Total            4686

 

If we were to simply generate the set of potential file pairs by generating a set that contains all possible combinations of these files, without pairing a file with itself, we would have

C(4686, 2) = (4686 × 4685)/2 = 10,976,955

pairs of files. The larger the initial set of potential file pairs, the larger the number of not relevant pairs will be. This in turn implies a higher possibility that pairs will be mistakenly labeled as not relevant.

Focusing only on Pascal related files, i.e., the 3515 files with extensions .pas, .typ, and .if, would generate

C(3515, 2) = (3515 × 3514)/2 = 6,175,855

pairs of files. This is still a very large number of pairs which, besides imposing severe computational limitations, introduces an extreme imbalance between the relevant and not relevant pairs. This is due to the fact that the number of files changed together in a year, i.e., the relevant files, tends to be much smaller than the number of all possible combinations of file pairs. We will discuss the issue of imbalance further in the next chapter.

As we mentioned above, the number of mislabeled pairs grows with the size of SNR. One way of reducing the size of SNR is, instead of choosing all possible pairs, or even all possible Pascal pairs, to focus on the files in SR. In other words, we only pair files in SR with other files in the system. Due to the smaller size of SR, the number of files used to generate the pairs will be smaller, which means the number of generated pairs will be smaller. The number of Not Relevant pairs can be further reduced if we only focus on files fi that appear as the first file in some pair (fi, fj) in SR. Unless stated otherwise, this is the way the set of potential not relevant pairs is generated. To generate SNR, we always remove the elements of SR from which this potential set of Not Relevant pairs was generated. As will be discussed in the next chapter, we may also remove other pairs from this set; for instance, we may have access to additional sources of knowledge indicating that some of the pairs in the potential Not Relevant set are indeed Relevant. The general relation between Relevant and Not Relevant pairs is shown in Figure 3.6.

In the remainder of this thesis we use the notation Sc,n,y to denote a set S of file pairs of class c, with a group size restriction of n, for the time period indicated by y, where

c ∈ {R, NR}, with R = Relevant and NR = Not Relevant

n is a positive integer or NGSL

y is a valid year or year range y1-y2 where y1 < y2

Figure 3.6 Relation Between Relevant and Not Relevant Pairs

Definition: first_of_pairs(S) = {xi | (xi, yj) ∈ S}, where S is a set of file pairs

Definition: dnrp[33](SR, F2, FRem) = {(x, y) | (x, y) ∈ (first_of_pairs(SR) × F2) − FRem}

where SR is a set of relevant file pairs, F2 is a set of files[34] which can be the second element of a file pair, and FRem is a set of file pairs that should not be included in the resulting set.

Definition: PAS = {f | f is a Pascal file, i.e., a .pas, .if, or .typ file}

Please note that both first_of_pairs and dnrp generate sets, which implies that there is no duplication among their elements.
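A minimal sketch of these two definitions, with file pairs represented as Python tuples, is given below; it mirrors the set definitions above and is not the actual tooling used in this research.

    def first_of_pairs(pairs):
        """Set of files appearing as the first element of some pair."""
        return {x for (x, _) in pairs}

    def dnrp(s_r, f2, f_rem):
        """Candidate Not Relevant pairs: pair each first element of S_R with the files
        in F2 (a file is not paired with itself), then remove the pairs in F_Rem."""
        candidates = {(x, y) for x in first_of_pairs(s_r) for y in f2 if x != y}
        return candidates - set(f_rem)

    # e.g., pair the first files of S_R with all Pascal files, excluding the known Relevant pairs:
    # S_NR_candidates = dnrp(S_R, PAS, S_R)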

3.2.1.2.4 Features Used in Defining the Relevance Concept

As discussed in the previous section, the main knowledge and data source used in labeling pairs of files is SMS. Features that are used to describe examples of the concept to be learned are also based on the available sources of knowledge and data. The main source of data we have used in defining features and extracting their values is the source code. There are also a number of features that one could extract from the information stored in SMS. In the following sections we define all the candidate features used in learning the Co-update relation.

3.2.1.2.5 Features Extracted From the Source Code

Source code is the most up to date and reliable documentation of the behavior of a program and the intricate relations that exist among its constituent parts. In most large software systems, the source code is stored in more than one source file, in an attempt to organize the code into smaller, ideally more cohesive and manageable pieces. Therefore, we can view the source code on at least two different levels of granularity:

      As an organized collection of source files

      At the individual file level

The organization of source files is mostly influenced by the operating system under which the software is developed and maintained. Where files are stored and how they are named can provide useful information regarding the relations among them. Most modern operating systems incorporate the concept of directories or folders as part of their file system support. Directories can be used to group the files in a software system into smaller subsystems. Also, the conventions used in naming files, or the directories that hold them, can provide additional clues for learning tasks that are focused on the relations between files (e.g. the Co-update MRR). In the system that we studied, there was no breakdown of the system into subsystem directories; however, as discussed below, we have used features that are based on file names.

At the individual file level, the content of a file is very much dependent on the programming language and environment[35] used to create the software. Therefore, some of the source file level attributes or features used in the learning task, may be unique to a particular programming language, while others may be based on general concepts, such as routine calls, that are shared among a wide spectrum of programming languages.

As is discussed in section 4.10.1, comments in source files are another source of information that can be used as features in learning an MRR. Although constraints are placed on comments by the syntax of the language, due to their textual nature they can be used to convey virtually any type and form of information.

Below, we present a list of attributes that can be extracted from the source code. Most of these attributes are based on programming language constructs such as files, routines, types, and variables. They are extracted by most reverse engineering systems. As part of our study, we are interested in knowing how much such attributes alone can help us in creating a useful relevance relation such as the Co-update relation. This list also includes file name based attributes.

While this list may not be complete, it covers many of the essential language capabilities used by programmers in creating software systems. Although this list is compiled for the programming language used in our target software system, it does represent a reasonable subset of features found in many modern procedural languages and, by extension, some of the object oriented languages currently in use.

It should also be noted that availability of such attributes is highly dependent on the quality of tools such as parsers, which extract the information used in calculating the attributes from the source code.

Before defining our suggested attributes, we provide the definition of the terms that are used to describe the attribute set. Unless stated otherwise, in this thesis a routine means a procedure or a function.

Definition: System-wide calculation cost is the cost of calculating the value of an attribute for the whole system. This is the cost that must be endured should the attribute turn out to be important in defining the relevance relation.

Definition: A Software Object (SO) is any one of a file, a routine, or a variable. A Software Unit (SU) is a file or a routine.

Definition: Interesting Software Unit (ISU) is a software unit for which we want to find its relevant SUs. Other Software Unit (OSU) is a software unit that has been classified to be Relevant, or Not Relevant to ISU. While for all the attributes given in this section ISU and OSU are files, most of the attributes can be used for routines with relatively simple changes to their definition.

In what follows we will present our initial set of suggested attributes and algorithms to calculate their values. SU1 and SU2 stand for any two software units.

3.2.1.2.6 File Reference Attributes

Attribute Name: Number of shared files included

Meaning: Number of shared files included by both ISU and OSU.

Type: Integer

Motivation: The higher the number of the shared files included by ISU and OSU, the higher is the chance of closeness of their functionality, and consequently their relevance.

How to compute:

      Find the sets of file names included by ISU and OSU.

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system
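Most of the integer valued attributes defined in this and the following sections reduce to the size of the intersection of two sets. A minimal sketch for this first attribute is shown below; the include lists are assumed to be available from the parsed source code, and the file names are hypothetical.

    def number_of_shared_files_included(includes_of_isu, includes_of_osu):
        """Size of the intersection of the sets of file names included by ISU and OSU."""
        return len(set(includes_of_isu) & set(includes_of_osu))

    print(number_of_shared_files_included(["a.if", "a.typ", "util.if"],
                                          ["b.if", "a.typ", "util.if"]))   # 2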

Attribute Name: Directly include you

Meaning: ISU includes OSU.

Type: Boolean

Motivation: File inclusion is a mechanism of sharing data and functionality. This in effect implies the existence of certain degree of connection between two objects, perhaps a component of a more comprehensive Relevance relation.

How to compute:

      Directly from Data base[36]

System-wide calculation cost:

N times computation of the above algorithm for N files in the system

Attribute Name: Directly Include me

Meaning: OSU includes ISU.

Type: Boolean

Motivation: Similar reasons to the one for Include you

How to compute:

      Directly from Data base

System-wide calculation cost:

N times computation of the above algorithm for N files in the system

Attribute Name: Transitively include you

Meaning: ISU includes an SU which directly or indirectly includes OSU.

Type: Boolean

Motivation: Similar reasons to the one for Include you.

How to compute:

      While there is any unchecked SU included by ISU

      Return true if SU directly or transitively includes OSU

System-wide calculation Cost:

N times computation of the above algorithm for N files in the system

Attribute Name: Transitively include me

Meaning: OSU includes an SU which directly or indirectly includes ISU.

Type: Boolean

Motivation: Similar reasons to the one for Include you.

How to compute:

      While there is any unchecked SU included by OSU

      Return true if SU directly or transitively includes ISU

System-wide calculation Cost:

N times computation of the above algorithm for N files in the system

Direct and transitive inclusion can be collapsed into a single numeric attribute indicating the depth of inclusion. In this case, 0 stands for SU1 not including SU2, 1 for SU1 directly including SU2, and any value greater than one for SU1 transitively including SU2. It is not clear to us whether the depth of inclusion is more informative than its simpler Boolean counterpart. Boolean attributes tend to provide simpler interpretations, while numeric values provide more information at the expense of the additional cost of decoding the meaning behind the values. It might be a worthwhile exercise to make a comparative study of the learned concepts under these alternative coding schemes.
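As an illustration of the alternative numeric encoding, a breadth-first search over the inclusion relation yields the depth of inclusion (0 = no inclusion, 1 = direct, greater than 1 = transitive). The sketch below assumes the inclusion relation is available as a mapping from each file to the files it directly includes; the file names are hypothetical.

    from collections import deque

    def inclusion_depth(includes, su1, su2):
        """Depth at which su1 includes su2: 0 = not included, 1 = direct, >1 = transitive."""
        seen = {su1}
        queue = deque([(su1, 0)])
        while queue:
            current, depth = queue.popleft()
            for included in includes.get(current, ()):
                if included == su2:
                    return depth + 1
                if included not in seen:
                    seen.add(included)
                    queue.append((included, depth + 1))
        return 0

    # e.g., a.pas includes a.if, which in turn includes a.typ
    includes = {"a.pas": ["a.if"], "a.if": ["a.typ"]}
    print(inclusion_depth(includes, "a.pas", "a.typ"))   # 2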

Attribute Name: Number of files including us

Meaning: Number of files that include both ISU and OSU

Type: Integer

Motivation: The higher the number of files that include both ISU and OSU, the higher the probability that ISU and OSU should, most of the time, appear together, i.e., using one will most probably be followed by using the other.

How to compute:

      Find sets of files including ISU and OSU

      Find the size of the intersection of these two sets

System-wide calculation Cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

3.2.1.2.7 Data Flow Attributes

Attribute Name: Number of shared directly referred Types

Meaning: Number of shared references to data types made directly by ISU and OSU

Type: Integer

Motivation: The higher the number of shared referred data types, the higher the possibility of closeness of the functionality of SUs referring to them.

How to compute:

      Find the sets of data types directly referenced by ISU and OSU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Number of shared directly referred non‑type data items

Meaning: Number of shared references directly made to all data items but types, by ISU and OSU

Type: Integer

Motivation: The same as Number of shared directly referred Types

How to compute:

      Find the sets of non type data items directly referenced by ISU and OSU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

3.2.1.2.8 Control Flow Attributes

Definition: A Static Routine Reference Graph (SRRG) for a routine R is a directed graph with a node R and, for every routine Ri statically referred to by R[37], an edge leaving R and entering the node Ri of the SRRG for Ri. If routine R does not refer to any other routine, the SRRG for R contains only the node R.

Definition: A routine R is directly referred to by file F if it is referred to in the executed portion of F. A routine R is indirectly referred to by file F if it is referred to by a routine Ri which is defined in F.

Definition: A File Static Routine Reference Graph (FSRRG) for a file F is the directed graph generated by the SRRGs of all routines referenced in F, i.e., routines directly or indirectly referred to by F.
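A minimal sketch of how the FSRRG based attributes defined below can be computed is given here; it assumes the static call relation has been extracted into a mapping from each routine to the routines it refers to, which is a simplification of the reverse engineering database rather than the actual extraction tooling.

    def srrg_nodes(calls, routine):
        """Nodes of the SRRG rooted at `routine`: the routine plus all routines reachable from it."""
        nodes, stack = set(), [routine]
        while stack:
            r = stack.pop()
            if r not in nodes:
                nodes.add(r)
                stack.extend(calls.get(r, ()))
        return nodes

    def fsrrg_nodes(calls, referenced_routines):
        """Union of the SRRG nodes of all routines directly or indirectly referred to by a file."""
        nodes = set()
        for r in referenced_routines:
            nodes |= srrg_nodes(calls, r)
        return nodes

    # The attribute "Number of nodes shared by FSRRGs" for a pair of files is then
    # len(fsrrg_nodes(calls, refs_of_isu) & fsrrg_nodes(calls, refs_of_osu)).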

Attribute Name: Number of directly Referred/Defined routines

Meaning: Number of routines that are directly referred to by ISU and that are defined in OSU

Type: Integer

Motivation: The higher the number, the higher is the coupling between ISU and OSU.

How to compute:

      Create the sets of routines directly referred in ISU and routines defined in OSU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Number of FSRRG Referred/Defined routines

Meaning: Number of nodes (routines) that are referred to in ISU's FSRRG and defined in OSU

Type: Integer

Motivation: The higher the number, the higher is the (indirect) coupling between ISU and OSU.

How to compute:

      Create the sets of (nodes) routines in FSRRG of ISU and routines defined in OSU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

 

Attribute Name: Number of Defined/Directly Referred routines

Meaning: Number of routines that are directly referred to by OSU and defined in ISU

Type: Integer

Motivation: The higher the number, the higher is the coupling between ISU and OSU.

How to compute:

      Create the sets of routines directly referred in OSU and routines defined in ISU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Number of Defined/FSRRG Referred routines

Meaning: Number of nodes (routines) that are referred to in OSU's FSRRG and defined in ISU

Type: Integer

Motivation: The higher the number, the higher is the (indirect) coupling between ISU and OSU.

How to compute:

      Create the sets of (nodes) routines in FSRRG of OSU and routines defined in ISU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Number of Shared routines directly referred to

Meaning: Number of routines directly referred to by both ISU and OSU

Type: Integer

Motivation: The higher the number of shared routines referred to in ISU and OSU, the higher the chance of them performing closely related functions.

How to compute:

      Create the sets of routines directly referred to in ISU and OSU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Number of Shared routines among all routines referred

Meaning: Number of shared routines among routines directly and indirectly referred by ISU and OSU.

Type: Integer

Motivation: The higher the number of shared routines referred to in ISU and OSU, directly and indirectly, the higher the chance of them performing closely related functions.

How to compute:

      Create the sets of routines directly or indirectly referred to in ISU and OSU

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Number of nodes shared by FSRRGs[38]

Meaning: Number of routines shared by FSRRGs of ISU and OSU

Type: Integer

Motivation: The higher the number of shared routines referred to in the FSRRGs of ISU and OSU, the higher the chance of them performing closely related functions.

How to compute:

      Create the sets of nodes in FSRRGs of ISU and OSU, by finding the union of the nodes in simple paths of FSRRGs

      Find the size of the intersection of these two sets

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

3.2.1.2.9 File Name Related Attributes

Attribute Name: Common prefix length[39]

Meaning: The number of characters in the shared common prefix of the names of ISU and OSU.

Type: Integer

Motivation: A shared common-prefix is usually indicative of some sort of grouping in terms of function, or subsystem divisions etc.

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

Attribute Name: Same File Name

Meaning: ISU and OSU have the same file name (with the extension ignored).

Type: Boolean

Motivation: Usually different components of a program are distributed among files with the same name, but with different extensions.

System-wide calculation cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system

 

Attribute Name: ISU File Extension

Meaning: The file extension of ISU

Type: Text

Motivation: File extensions are most of the time indicative of the type of data the files contain, e.g., a pas extension indicates a Pascal source file while an asm extension indicates an assembler source file. In practice, only certain combinations of file extensions are used together.

System-wide calculation Cost:

N times retrieval of the value of extension for N files in the system

Attribute Name: OSU File Extension[40]

Meaning: The file extension of OSU

Type: Text

Motivation: The same as ISU File Extension .

System-wide calculation Cost:

N times computation of the above algorithm for N files in the system

Attribute Name: Same extension

Meaning: ISU and OSU have the same file extension

Type: Boolean

Motivation: Providing a Boolean relation between two attribute values.

System-wide calculation Cost:

C(N, 2) = N(N-1)/2 times computation of the above algorithm for N files in the system
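The file name based attributes above can be computed directly from the file names. A minimal sketch follows; the file names used in the example are hypothetical.

    import os

    def common_prefix_length(name1, name2):
        """Number of leading characters shared by the two file names."""
        return len(os.path.commonprefix([name1, name2]))

    def file_name_attributes(isu, osu):
        isu_base, isu_ext = os.path.splitext(isu)
        osu_base, osu_ext = os.path.splitext(osu)
        return {
            "common_prefix_length": common_prefix_length(isu, osu),
            "same_file_name": isu_base == osu_base,       # extensions ignored
            "isu_file_extension": isu_ext.lstrip("."),
            "osu_file_extension": osu_ext.lstrip("."),
            "same_extension": isu_ext == osu_ext,
        }

    print(file_name_attributes("callproc.pas", "callproc.typ"))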

Table 3.2 Summary of Syntactic Attributes

Attribute Name                                                Attribute Type
Number of shared files included *                             Integer
Directly include you *                                        Boolean
Directly Include me *                                         Boolean
Transitively include you                                      Boolean
Transitively include me                                       Boolean
Number of files including us                                  Integer
Number of shared directly referred Types *                    Integer
Number of shared directly referred non type data items *      Integer
Number of Directly Referred/Defined routines *                Integer
Number of FSRRG Referred/Defined routines *                   Integer
Number of Defined/Directly Referred routines *                Integer
Number of Defined/FSRRG Referred routines *                   Integer
Number of Shared routines directly referred *                 Integer
Number of Shared routines among All routines referred *       Integer
Number of nodes shared by FSRRGs *                            Integer
Common prefix length *                                        Integer
Same File Name *                                              Boolean
ISU File Extension *                                          Text
OSU File Extension *                                          Text
Same extension *                                              Boolean

 

Table 3.2 is a summary of the attributes introduced in this chapter. Attributes marked with a * are used in the experiments reported in the next chapter.

3.2.1.2.10 Text Based Attributes

The attributes defined in the previous sections were mostly based on syntactic constructs present in the system source files. Undoubtedly the source code holds a great wealth of information about a software system. However, SMS, by its nature as a bug tracking system, can also provide additional features or attributes that may be useful in a task such as learning the Co-update MRR. One such source for extracting attributes is the problem report.

Problem reports stored in SMS can be seen as text documents. Text classification is one of the application areas of machine learning, and the attributes used in text classification are the words appearing in the documents. In Section 4.10 we will present a method to adapt and apply text classification techniques to the task of learning the Co-update MRR. We will also present experiments that use problem reports stored in SMS and source file comments as sources of text based attributes.
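As a simple illustration of word based attributes, the sketch below extracts the set of words occurring in a piece of text; the tokenization and the stop word list are hypothetical simplifications of the text processing described in Section 4.10, and the problem report text is made up.

    import re

    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "when"}   # illustrative only

    def word_features(text):
        """Bag-of-words attributes: the words occurring in a problem report or comment block."""
        words = re.findall(r"[a-z]+", text.lower())
        return {w for w in words if w not in STOP_WORDS}

    report = "Call processing fails when the trunk is busy"
    print(word_features(report))   # {'call', 'processing', 'fails', 'trunk', 'busy'}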

3.2.1.3 Preprocessing

The preprocessing stage deals with cleaning the data, removing noise, and making decisions such as what to do with features that have unknown values. The experiments in the next chapter will show the effects of such operations and decisions on the generated results.

3.2.2 Learning Stage

The input to the learning stage is one or more data sets produced in the pre-learning stage. However, in many applications this data needs to be transformed into a data set that can be directly read by the learning system. In the next step the learning system takes this data set and creates a classifier. The classifier should be empirically evaluated to make sure it provides the desired level of performance according to an appropriate measure or measures. If this evaluation succeeds, the process moves to the post-learning stage. If the desired level of performance is not achieved, one can either return to a previous step in the process, or change the value of a parameter of the learning algorithm and generate a new classifier.

3.2.2.1 Transformation and Encoding

Depending on the nature of the application and the data generated in the pre-learning stage, there may be a need to use a subset of the data, or a subset of the available features. The selection of a subset of the data is an issue particularly when there is an imbalance in the distribution of the classes. Many learning algorithms tend to be influenced by extreme skew in the distribution of data among the classes. This is the case in our datasets, and we will discuss how we address it when we present the experiment setups in the next chapter. The output of the transformation step is the training data set to be used by the learning system.
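One common transformation of this kind is to undersample the majority class to a chosen ratio. The sketch below is given only as an illustration of the idea; the sampling strategy actually used in our experiments is described together with the experiment setups in Chapter 4.

    import random

    def undersample(majority, minority, ratio=1.0, seed=0):
        """Keep all minority class examples and a random subset of the majority class,
        so that the majority class has about ratio * len(minority) examples."""
        rng = random.Random(seed)
        keep = min(len(majority), int(ratio * len(minority)))
        return rng.sample(majority, keep) + list(minority)

    # e.g., training_set = undersample(not_relevant_examples, relevant_examples, ratio=2.0)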

3.2.2.2 Learning

In the learning step (also referred to as the modeling step), the learning system reads the training examples in the training set and generates a classifier. This was depicted in Figure 3.1. The learning system used in our experiments is C5.0 [RuleQuest Research 1999], an advanced version of the C4.5 decision tree learner [Quinlan 1993].

Figure 3.7 shows a decision tree. In a typical decision tree, each non-leaf node is a test of the value of a feature, and each leaf node stores the classification for the cases that reach it. A case is classified by starting at the top of the tree and following the branches of the tree, based on the outcome of each test, until the case reaches a leaf.
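A minimal sketch of this classification procedure over a small hand-built tree is shown below; the tree, its features and thresholds are purely illustrative and are not the output of C5.0.

    # A node is either a leaf (a class label string) or a dictionary describing a test.
    tree = {
        "feature": "number_of_shared_files_included",
        "threshold": 2,
        "le": "Not Relevant",                      # branch taken when value <= threshold
        "gt": {"feature": "common_prefix_length",
               "threshold": 4,
               "le": "Not Relevant",
               "gt": "Relevant"},
    }

    def classify(node, case):
        """Start at the root and follow the test outcomes until a leaf is reached."""
        while isinstance(node, dict):
            branch = "le" if case[node["feature"]] <= node["threshold"] else "gt"
            node = node[branch]
        return node

    print(classify(tree, {"number_of_shared_files_included": 5, "common_prefix_length": 6}))   # Relevant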

We have used this method for the following reasons:

      Decision tree learning algorithms have been widely studied by machine learning researchers and learners such as C4.5 have been successfully used in other projects

      The decision tree is an explainable model. An expert can study a tree to verify the correctness or reasonableness of the learned model. The study of the tree could also result in finding relations that were not known before. This is in contrast to methods such as neural networks or support vector machines (see section 4.4.2), where the classifier is treated as a black box. In these methods, the reason for the class assigned to a case cannot be explained.

      During the course of our research we have tried alternative methods, but none were able to generate significantly better results than decision trees[41]; many times the decision trees generated better results.

Figure 3.7 A Decision Tree

3.2.2.3. Evaluating the Usefulness of the Learned Concept

We have used the term usefulness above in a very broad sense. For the purpose of the empirical evaluation of the learned Relevance concept, a more tangible measure is needed. The evaluation process requires the learned classifier to be tested on a test set that is independent of the training set used to generate the classifier. This section introduces the performance measures we have employed in this thesis. The details of the test sets used in evaluating the classifiers are presented in the next chapter.

Figure 3.8 shows a confusion matrix. The entries in the matrix are the classifications made by a classifier versus the actual classes, in a two-class setting. Most established performance measures in machine learning are defined based on such a table.

 

 

                                Classified As
                                Not Relevant        Relevant
True Class    Not Relevant      a                   b
              Relevant          c                   d

Figure 3.8 A Confusion Matrix

Accuracy is the proportion of correctly classified examples among all classified examples. Although accuracy has been widely used in machine learning research, recent studies have questioned the appropriateness of accuracy as the sole measure for evaluating the performance of an induced concept [Provost and Fawcett 1997][Provost et al. 1998]. In the experiments chapter, we will make the case that accuracy alone indeed cannot be considered the proper measure for our particular problem, and we will report the results of calculating the alternative measures discussed below. In terms of the confusion matrix above, accuracy is defined as:

Accuracy = (a + d) / (a + b + c + d)

Precision is a standard measure from information retrieval, and is used in machine learning. Conceptually, precision tells one the proportion of returned results that are in fact valid (i.e. assigned the correct class). Precision for class Relevant above is defined as:

PrecisionR = d / (b + d)

Recall is a complementary metric that tells one what proportion of the cases really belonging to a class were identified as such by the classifier. Recall for class Relevant above is defined as:

RecallR = d / (c + d)

Another well known measure, called F-measure, combines Precision and Recall, by allowing a parameter b which reflects the importance of recall versus precision. F-measure is defined as[42]:

Fmeasureb = ((b² + 1) × P × R) / (b² × P + R),

where P and R are precision and recall values, respectively. The following observations can be made for the F-measure:

      F0 is equivalent to Precision. This is when recall has no importance compared to precision.

      F∞ is equivalent to Recall. This is when precision has no importance compared to recall.
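A minimal sketch of these measures, computed directly from the confusion matrix entries a, b, c and d of Figure 3.8, is given below with made-up counts.

    def accuracy(a, b, c, d):
        return (a + d) / (a + b + c + d)

    def precision_relevant(a, b, c, d):
        return d / (b + d)     # correct Relevant predictions over all Relevant predictions

    def recall_relevant(a, b, c, d):
        return d / (c + d)     # correct Relevant predictions over all truly Relevant cases

    def f_measure(p, r, beta=1.0):
        return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

    # e.g., a = 90, b = 10, c = 5, d = 20
    p, r = precision_relevant(90, 10, 5, 20), recall_relevant(90, 10, 5, 20)
    print(accuracy(90, 10, 5, 20), p, r, f_measure(p, r))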

For all of the above measures, the goal is to achieve as high a value as possible. However, as will be discussed, there is no guarantee of improving the results for all measures at the same time. As a matter of fact, an improvement in one measure is often accompanied by a degradation of another. Therefore, a plot that captures most of the interesting measures and visually assists in evaluating improvement, or the lack thereof, is desirable.

ROC graphs, such as the one shown for two classifiers A and B in Figure 3.9, plot the False Positive rate (x-axis) versus the True Positive rate (y-axis)[43].

ROC curves are generated by varying the value of a parameter and creating a point for each value that the parameter takes. If a classifier or the learning algorithm does not accept a parameter, then a single ROC point represents the classifier. For instance, by changing the threshold on the confidence in a classification, one can obtain a set of (FP, TP) pairs corresponding to different threshold values; the ROC curve can then be created from these points. In Figure 3.9, where the plot for classifier B is to the north west of the plot for classifier A, one can visually identify the superior classifier by plotting the curves.

The true positive rate is the proportion of positive examples that were correctly identified. As can be seen, the true positive rate is the same as recall.

TPR = d / (c + d)

The false positive rate is the proportion of negative examples, e.g., Not Relevant examples, that were incorrectly classified as positive, e.g., as Relevant.

FPR = b / (a + b)
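The sketch below shows how a set of (FPR, TPR) points for an ROC curve can be obtained from a classifier that reports a confidence value for the Relevant class; the scored cases are a hypothetical stand-in for the classifier's output, and both classes are assumed to be present in the test set.

    def roc_points(scored_cases, thresholds):
        """scored_cases -- list of (confidence_of_relevant, true_label) pairs.
        Returns one (FPR, TPR) point per confidence threshold."""
        points = []
        for t in thresholds:
            a = b = c = d = 0
            for score, label in scored_cases:
                predicted = "Relevant" if score >= t else "Not Relevant"
                if label == "Relevant" and predicted == "Relevant":
                    d += 1                      # true positive
                elif label == "Relevant":
                    c += 1                      # false negative
                elif predicted == "Relevant":
                    b += 1                      # false positive
                else:
                    a += 1                      # true negative
            points.append((b / (a + b), d / (c + d)))
        return points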

Some of the benefits of ROC curves include:

      They can convey information contained in a confusion matrix

      They are independent of the class distribution in the data and of the classification error costs [Provost et al. 1998]. The issue of imbalance among class distributions appears in most real world, complex classification problems, and this is also the case in our research.

In an ROC curve the following holds:

      Point (0,1) corresponds to perfect classification, where every positive case is classified correctly and no negative case is classified as positive. Better classifiers have (FP, TP) values closer to this point, i.e., to the north west of inferior classifiers.

      Point (1,0) is a classifier which misclassifies every case

      Point (1,1) is a classifier that classifies every case as positive

      Point (0,0) is a classifier which classifies every case as negative

 

Figure 3.9 An ROC Curve

3.2.2.4. Parameter Tuning

Most algorithms employed by learning systems expose a variety of parameters that can be set by the user. One approach that may improve the results generated by the classifier is to change the value of one or more parameters and learn the concept of interest under the new parameter assignment. Note that in this case the training set is not altered; therefore, any changes in the results are a consequence of the changes in parameter values.

3.2.3 Post Learning Stage

The post learning phase mostly deals with putting the learned classifier into use. This is a compulsory step in deployed applications [Saitta and Neri 1998]. As a result of feedback received from the user, one may revise some of the decisions made in the pre learning and learning stages, such as data representation, sampling techniques, learning algorithms, etc. Such changes will require repetition of some of the earlier steps of the process, which is the reason for the backward arrow from this stage to the previous ones shown in Figure 3.1.

However, such an endeavor demands a different set of resources and research priorities than those available in an academic setting such as ours. We will discuss this topic further in the future work chapter (Chapter 5).

3.3. Limitations of Our Research

We can see at least two limitations of the methodology and the research presented in this thesis.

      To be able to learn from past software maintenance experience, the changes applied to the system must be recorded properly. The information must be detailed enough to allow creation of appropriate attribute or class labels to describe the concept or the model we are interested in learning. For instance, if the changes applied to source files could not be traced back to updates, we would not have been able to automatically create the sets of Relevant and Not Relevant file pairs and the corresponding examples. Similarly, if problem reports were not recorded, or we could not link a file to the problem reports affecting it, we could not use problem report features in learning the Co‑update relation. Of course, theoretically, examples of the concept of interest can be provided by other sources such as human experts. However, in practice, for most interesting models in software engineering in general, and software maintenance in particular, this will not be feasible.

      There must be enough examples of the class of interest to be able to learn a model that shows high predictive qualities for this class. For instance, as will be discussed in Chapter 4, there is a large imbalance between the examples of the Relevant and Not Relevant classes. We estimate that at least a few thousand examples of the Relevant class are needed to make the creation of useful models possible.


Chapter 4

 

Experiments

 

 

 

This chapter presents and discusses the results obtained in learning the Co‑update relevance relation among files. We first provide a brief overview of the actual software system used in the study. Then we proceed with the discussion of the creation of the example data sets used in the learning and evaluation process. This is followed by the presentation and analysis of the results of the experiments performed. The first group of experiments presented is referred to as the base experiments. Each subsequent experiment alters one or more aspects of the base experiments, and we report on the results obtained.

4.1 Background on the System Under Study

SX 2000 is a large legacy telephone switching system[44]. It was originally created in 1983 by Mitel Networks Corporation. The software is written in the Mitel Pascal programming language and Motorola 68000 assembly language. The system source code is distributed among five major types of files[45]. Table 4.1 shows the distribution of these files based on their type in a release of the software used in this research.

Table 4.1 Distribution of SX 2000 Source Code Among Different Files

File Type | Usage                    | Number of Files | Commented Lines of Code
asm       | Assembler code           |             867 |                 330,907
if        | Pascal interface file    |           1,150 |                  86,237
inc       | Assembler include file   |             309 |                  43,974
pas       | Pascal source            |           1,491 |               1,316,043
typ       | Pascal type declaration  |             937 |                  99,745
Total     |                          |           4,754 |               1,876,906

 

4.2 Creating Training and Testing Data sets

As discussed in Chapter 3, to create the training and testing data sets used in learning and evaluation of the Co‑update relevance relation, first we need to find file pairs that are labeled as either Relevant or Not Relevant.

After finding candidate file pairs, any classification conflict between two file pair tuples, e.g. (fi, fj, Relevant) and (fi, fj, Not Relevant), must be resolved at this stage. However, according to the definition of the class assignment heuristics introduced in Chapter 3, such a conflict will not occur in the two class setting presented here.

In the next step, for each one of these (fi, fj, class) tuples we create an example tuple (a1, a2, ..., an, class) by calculating the values of the attributes (or features) a1, a2, ..., an used in defining the Co‑update relation.
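A minimal sketch (ours, with hypothetical attribute extractors rather than the features actually used in the thesis) of turning a labeled file pair tuple (fi, fj, class) into an example tuple (a1, ..., an, class):

    def make_example(file_pair, attribute_extractors):
        # file_pair is (f1, f2, label); each extractor maps a file pair
        # to one attribute value
        f1, f2, label = file_pair
        attributes = tuple(extract(f1, f2) for extract in attribute_extractors)
        return attributes + (label,)

    # two toy extractors: same directory and same file extension
    extractors = [
        lambda f1, f2: int(f1.rsplit("/", 1)[0] == f2.rsplit("/", 1)[0]),
        lambda f1, f2: int(f1.rsplit(".", 1)[-1] == f2.rsplit(".", 1)[-1]),
    ]
    print(make_example(("call/p1.pas", "call/p2.pas", "Relevant"), extractors))
    # -> (1, 1, 'Relevant')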

In the remainder of this thesis whenever there is a reference to a Relevant or Not Relevant (file) pair, it is understood that we are referring to a file pair tuple with a class value of Relevant or Not Relevant respectively. By the same token, a Relevant or Not Relevant example is an example tuple corresponding to a Relevant or Not Relevant file pair tuple. The relation between the file pair tuples and the corresponding examples was shown in Figure 3.5.

4.3 Removing Class Noise

When two examples have the same values for all attributes but different class labels, the examples are said to contain class noise. Although our example labeling heuristics create disjoint file pair sets, i.e. Relevant and Not Relevant file pairs, the attribute values corresponding to two distinct file pairs can be identical, thus introducing class noise into the data sets. Noise is known to be harmful to learning, and therefore, whenever possible, it ought to be removed from the data.

To remove this class noise from our training sets we use the following heuristic method:

1.     Find count(p, c), the number of examples with attribute value pattern p and class c, for each pattern and class in the data set

2.     For each unique attribute value pattern p in the dataset

Find c = argmaxc' count(p, c'), i.e. the class label that occurs most frequently with pattern p[46]

Create N examples with attribute value pattern p and class label c

If N above is set to 1, then we refer to this method as single copy class noise removal. If N is set to count(p, c), we refer to the method as multi‑copy class noise removal.
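A minimal sketch (ours, assuming examples are represented as hashable (attribute value pattern, class label) pairs) of the two class noise removal variants described above:

    from collections import Counter

    def remove_class_noise(examples, multi_copy=False):
        counts = Counter(examples)            # count(p, c) for every (pattern, class)
        patterns = {p for p, _ in counts}
        cleaned = []
        for p in patterns:
            # keep the class label that occurs most often with pattern p
            best_class, best_count = max(
                ((c, n) for (q, c), n in counts.items() if q == p),
                key=lambda item: item[1])
            n_copies = best_count if multi_copy else 1   # single vs multi copy
            cleaned.extend([(p, best_class)] * n_copies)
        return cleaned

    noisy = [((1, 0), "Relevant"), ((1, 0), "Not Relevant"), ((1, 0), "Not Relevant")]
    print(remove_class_noise(noisy))                   # [((1, 0), 'Not Relevant')]
    print(remove_class_noise(noisy, multi_copy=True))  # two copies of that example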

4.4 Background on Experimental Setup

As is the case for any research that deals with real world problems and environments, our research has faced many obstacles and gone through many revisions. While in section 4.5 we discuss the experiments that are referred to as the Base Experiments, in reality they have been among the most recent experiments performed during our research. Indeed, the main focus of Chapter 4 is to provide a few select results that are deemed to be both interesting and compliant with what is perceived as an acceptable evaluation method by many machine learning researchers at the time of this writing[47]. However, the fact remains that the experiments presented in this chapter largely build on lessons learned and ideas explored in many earlier experiments we performed.

Although the lack of uniformity of experimental setup and evaluation methods among those past experiments prohibits us from presenting them as our main results, they have played an important role in defining the setup for the reported experiments, and we believe discussing some of them can serve as a road map to what have become our Base experiments in section 4.5. Therefore, this section presents some of the more influential past experiments; however, the reader can skip this section and go directly to section 4.5 without any loss of continuity.

4.4.1 Experiments Using C5.0

While perhaps in many machine learning applications the examples of the classes of interest are already labeled or classified, this is not the case in our research. As discussed in Chapter 3, examples of the Relevant and Not Relevant classes are generated by the Co‑update and Not Relevant heuristics respectively. Essentially, the Co‑update heuristic is based on recorded facts about updates applied to the system under study, while the Not Relevant heuristic is based on the lack of such evidence. Therefore, between the two heuristics, the Not Relevant heuristic is the one with a much higher potential for introducing example label errors. The complementary nature of these heuristics dictates that the larger the set of Relevant examples is, the smaller the set of Not Relevant examples will be.

The updates applied to the software system that is the subject of our research have been recorded since the early 1980s. Not surprisingly, the system has gone through many changes during its relatively long life. Even though one may think there is an abundance of update records, their applicability in terms of reflecting the current state of the system is questionable. Consequently, the very first factor that one needs to consider is the time period during which the updates were applied to the system. The time period also plays an essential role if chronological splitting is used to evaluate the learned classifiers. The evaluation method used in this chapter is based on what is known as the hold out (splitting) method. In Appendix B we present an alternative chronological splitting method. Our general goal here is to achieve one or more of the following objectives:

      Improve measures such as precision and recall of each class, specially the rare Relevant class

      Reduce the computational overhead by reducing the number of examples to process

      Reduce the number of mislabeled training examples

Clearly these objectives cannot always be achieved together.

The very first experiment we will discuss is the case where we simply learn from skewed training sets. We generated all the training examples for years 1997-1998, and testing examples for year 1999. This is an example of chronological splitting. In both cases the group size of updates used to create the Relevant examples was limited to at most 20 files[48]. Training Not Relevant file pairs were created by pairing all file names in the system and then removing all Relevant file pairs for 1995-1998 from this large file pair set[49]. As usual the examples were generated for each of these file pairs.

The set of testing Not Relevant pairs was generated by first pairing the first files in the set of Relevant pairs with other files in the system. The set of Relevant pairs was then removed from this intermediate set to create the final set of Not Relevant pairs. This is in essence the same method discussed in Chapter 3 and used in the rest of Chapter 4. We refer to this approach of creating Not Relevant pairs as the Relevant Based method, to acknowledge the fact that they were created from the Relevant pairs and not from pairing all files in the system. The number of examples for each class and their ratio in the training and testing repositories are shown in Table 4.2.
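A minimal sketch (ours, with illustrative file names) of the Relevant Based construction just described: the first file of every Relevant pair is paired with every other file in the system, and the known Relevant pairs are then subtracted.

    def relevant_based_not_relevant(relevant_pairs, all_files):
        first_files = {f1 for f1, _ in relevant_pairs}
        candidates = {(f1, f2) for f1 in first_files
                               for f2 in all_files if f1 != f2}
        return candidates - set(relevant_pairs)

    relevant = {("a.pas", "b.pas"), ("c.pas", "d.typ")}
    files = ["a.pas", "b.pas", "c.pas", "d.typ", "e.if"]
    print(sorted(relevant_based_not_relevant(relevant, files)))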

Table 4.2 Training and Testing Class Ratios for Skewed Experiments

 

         | Relevant | Not Relevant | Relevant/Not Relevant Ratio
Training | 2630     | 6755620      | 0.0004
Testing  | 2868     | 1444813      | 0.002

 

To perform the experiment we split the Relevant and Not Relevant examples in the training and testing repositories into 10 parts and created training and testing files that were one‑tenth the size of the training and testing repositories. The results for these 10 training and testing runs were then macro averaged[50]. These experiments confirmed our suspicion that in the face of very high skewness, such as the one shown in Table 4.2, C5.0 creates a classifier that always selects the majority class. Obviously such a classifier is of no practical use.

As we discussed earlier, short of additional sources of information, the Not Relevant heuristic labels a pair of files as Not Relevant if it is not known that these files are relevant to each other, or in other words if they are not labeled Relevant. This is a major source of class noise in our training data sets[51].

We applied the single copy noise removal method discussed in section 4.3 to the training sets in the skewed experiments, and each time tested the learned classifiers on the whole testing repository and generated the macro averaged results. The results are shown in Table 4.3.

Table 4.3 Result of Removing Class Noise from Skewed Training Sets

 

Training Set | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
Noisy        | 0                  | 0               | 99.7                   | 100.0               | 1
Noise Free   | 9.3                | 6.1             | 99.8                   | 99.9                | 19.2 ± 9.6

Looking back at Table 3.2 we see that there are 11 attributes that take integer values. Studies by researchers such as [Dougherty et al. 1995] show that discretizing numeric attributes before induction sometimes significantly improves accuracy. When discretizing numeric attributes, a range of values is mapped to a single value. Consequently, discretization can itself introduce class noise into a data set, so to avoid this unwanted noise one should remove class noise again after discretization. The discretization method we used was entropy-based discretization [Fayyad and Irani 1993]. We applied the following sequence of operations to the skewed training sets reported earlier: 1) class noise removal, 2) discretization, 3) class noise removal. The testing sets were the one‑tenth size test sets used for the skewed experiments. The results showed that the macro-averaged precision and recall of the Relevant class increased from 0 to 1.9% and 18.9% respectively. Compared to simple class noise removal, this method improves the recall of the Relevant class at the expense of misclassifying many more Not Relevant examples as Relevant, and therefore reduces the precision of the Relevant class. Thus, if the recall of the Relevant class is deemed to be more important than its precision, one could use this combination of discretization and class noise removal.
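A minimal sketch (ours) of the three-step sequence just described. The equal-width binning below is only a runnable stand-in for the entropy-based discretization used in the thesis, and the noise removal is the single copy heuristic of section 4.3; examples are assumed to be (attribute value pattern, class label) pairs.

    from collections import Counter

    def single_copy_noise_removal(examples):
        # keep one copy of each pattern, labeled with its majority class
        counts = Counter(examples)            # keys are (pattern, label) pairs
        best = {}
        for (pattern, label), n in counts.items():
            if pattern not in best or n > best[pattern][1]:
                best[pattern] = (label, n)
        return [(pattern, label) for pattern, (label, _) in best.items()]

    def equal_width_binning(examples, n_bins=5, max_value=100):
        # stand-in discretizer: map each integer attribute to one of
        # n_bins equal-width intervals
        width = max_value / n_bins
        return [(tuple(min(int(v / width), n_bins - 1) for v in pattern), label)
                for pattern, label in examples]

    def prepare_training_set(examples):
        cleaned = single_copy_noise_removal(examples)
        discretized = equal_width_binning(cleaned)
        # discretization can map distinct patterns onto the same pattern,
        # so noise removal is applied a second time
        return single_copy_noise_removal(discretized)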

To further investigate the effect of skewness in the training sets on the results, we decided to run experiments that learn from less skewed data sets but test on the skewed data sets. To that end we created 10 training sets that used all the Relevant examples in the skewed training repository discussed above with 1, 5 and 10 times as many Not Relevant examples from the same repository. We also split the testing repository into 10 testing sets and used them to test the classifiers generated from these training sets. The partial results for these 10 experiments were then macro averaged. These macro averaged results are shown in Table 4.4. It seems that as the skewness in the training data sets increases, the precision of the Relevant class and the recall of the Not Relevant class increase, while the recall of the Relevant class decreases. Unfortunately, the average size of the generated decision trees also increases with the increase in the skewness of the training data sets. Of course, at very high degrees of skewness the generated trees shrink again, as they eventually become single node majority class classifiers.

Table 4.4 Learning from Less Skewed Training Sets

Skew | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
1    | 1.0                | 68.5            | 99.9                   | 87.0                | 147.7 ± 14.0
5    | 2.8                | 47.9            | 99.9                   | 96.8                | 216.3 ± 23.7
10   | 4.4                | 40.2            | 99.9                   | 98.3                | 240.8 ± 25.0

 

An interesting question is what the effect would be if we chose other skewness ratios in the training sets. We have partially addressed this in our more recent experiments, as reported in the rest of Chapter 4, by repeating the experiments for 18 different skewness values.

We then proceeded to see the effect of class noise removal on the training sets. The new training sets were tested with the same testing sets as the noisy training sets and the results of 10 experiments were once again macro averaged. As shown in Table 4.5, these experiments seem to indicate that one could reduce the average size of decision trees at the expense of some reduction in the recall of the Relevant class. Training sets with skewness ratio 1 show an exception where the recall of the Relevant class improves but other measures, except for the average size of the decision tree, either degrade or stay unchanged.

Table 4.5 The Effect of Removing Class Noise from Training Sets

Skew | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
1    | 0.8                | 72.1            | 99.9                   | 82.1                | 81.6 ± 12.2
5    | 3.1                | 45.5            | 99.9                   | 97.2                | 121.1 ± 14.9
10   | 5.0                | 37.1            | 99.9                   | 98.6                | 126.4 ± 17.0

 

The other avenue that we pursued to improve our results was further correcting the labeling errors in the testing set. As discussed before, most of our labeling errors belong to the Not Relevant class. When using chronological splitting, the Not Relevant examples in the training and testing repositories are generated independently. This means that we may label an example as Relevant in the training sets but label it as Not Relevant in the testing set; in other words, there is a discrepancy between training and testing examples. One way to correct the Not Relevant labeling noise in the testing repository is to remove the known Relevant examples in the training set from the testing sets. We ran experiments to verify the effect of this strategy using the balanced training sets (skewness ratio of 1) discussed above. In other words, we removed the Relevant examples in the training sets from the 10 testing sets, then retested the classifiers on these new 10 test sets and macro averaged the results. We found that for balanced training sets the effect of removing the training Relevant examples from the testing Not Relevant examples was negligible (a 0.2% increase in the recall of the Not Relevant class). This idea was tested more generally by first removing the Relevant examples in the training repository from the Not Relevant examples in the testing repository, and then randomly creating 10 balanced training sets and 10 one‑tenth size testing sets from the testing repository and calculating the macro averaged results. As shown in Table 4.6, when experimenting with noisy training sets, this correction improves the results, although not necessarily the same measures each time[52]. The numbers in parentheses show the difference between the entries in Table 4.6 and the corresponding entries in Table 4.4.

Table 4.6 Removing Training Relevant Examples from Testing Not Relevant Examples (Noisy Data)

Skew | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
1    | 1.0                | 68.5            | 99.9                   | 87.1 (+0.1)         | 147.7 ± 14.0
5    | 3.0 (+0.2)         | 48.8 (+0.9)     | 99.9                   | 96.8                | 216.3
10   | 4.7 (+0.3)         | 41.0 (+0.8)     | 99.9                   | 98.4 (+0.1)         | 240.8

 

Table 4.7 shows that the improvements are also achieved when using noise free training sets. Although limited in their nature, these experiments indicated the potential for correcting testing Not Relevant examples by removing the training Relevant examples. Minimally this has the effect of reducing the testing set, and it could also result in some improvement in accuracy, albeit minor.

Table 4.7 Removing Training Relevant Examples from Testing Not Relevant Examples (Noise Free Data)

Skew | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
1    | 0.8                | 72.4 (+0.3)     | 99.9                   | 82.1                | 81.6 ± 12.2
5    | 3.2 (+0.1)         | 46.2 (+0.7)     | 99.9                   | 97.2                | 121.1 ± 14.9
10   | 5.3 (+0.3)         | 37.3 (+0.2)     | 99.9                   | 98.7 (+0.1)         | 126.4 ± 17.0

 

We also tried an alternative method where, instead of removing the examples corresponding to training Relevant file pairs from the testing Not Relevant examples, we removed all testing Not Relevant examples whose attribute value pattern matched the attribute value pattern of a training Relevant example. Once again we calculated the macro averaged result for 10 runs. Interestingly, while there were fewer misclassifications of Not Relevant examples as Relevant, since after removing the Relevant example patterns there were overall many fewer Not Relevant examples in the testing set, the end result was a mere 0.4% improvement in the precision of the Relevant class, at the expense of an 11.9% loss in the recall of the Not Relevant class.
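A minimal sketch (ours, with illustrative data structures) contrasting the two testing-set corrections discussed above: dropping testing Not Relevant pairs that appear as Relevant pairs in the training set, versus dropping testing Not Relevant examples whose attribute value pattern matches that of some training Relevant example.

    def remove_training_relevant_pairs(test_not_relevant_pairs, train_relevant_pairs):
        # pair-based correction
        return set(test_not_relevant_pairs) - set(train_relevant_pairs)

    def remove_matching_patterns(test_not_relevant_examples, train_relevant_examples):
        # pattern-based correction; examples are (pattern, label) pairs
        relevant_patterns = {pattern for pattern, _ in train_relevant_examples}
        return [(pattern, label) for pattern, label in test_not_relevant_examples
                if pattern not in relevant_patterns]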

To further reduce the size of the training and testing repositories, we opted for a method referred to as single combination, where for the two permutations of a pair of files f1 and f2 only one example is generated. Table 4.8 shows the result of using this strategy on the training and testing repositories. The numbers in parentheses show the percentage of reduction in the number of examples compared to the original skewed repositories.
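A minimal sketch (ours) of the single combination reduction: the two permutations (f1, f2) and (f2, f1) are collapsed into one canonical, sorted pair, so only one example per unordered file pair remains.

    def single_combination(file_pairs):
        return {tuple(sorted(pair)) for pair in file_pairs}

    print(single_combination({("a.pas", "b.pas"), ("b.pas", "a.pas"), ("c.pas", "d.typ")}))
    # -> {('a.pas', 'b.pas'), ('c.pas', 'd.typ')}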

Table 4.8 Training and Testing Class Ratios for Single Combination Experiments

 

         | Relevant      | Not Relevant     | Relevant/Not Relevant Ratio
Training | 1814 (-31.0%) | 5711680 (-15.5%) | 0.0003
Testing  | 1991 (-31.0%) | 1398409 (-3.2%)  | 0.0014

 

We experimented with training sets created from this smaller training repository by creating 10 non skewed training sets (with the same number of Relevant and Not Relevant examples). We tested the generated classifiers using the larger testing repository shown in Table 4.2 and the new smaller testing repository shown in Table 4.8. For each repository, 10 random samples of one‑tenth size were drawn to perform the testing. The results are presented in Table 4.9, where the Combination row corresponds to the smaller testing repository. The numbers in parentheses show the difference between the entry in the table and the corresponding entry in Table 4.4 for a skewness ratio of 1. The results show that for this ratio of skewness the recall of the Not Relevant class and the precision of the Relevant class decrease, while the recall of the Relevant class increases. But perhaps the most important difference is the average size of the decision tree, which has been reduced from 147.7 to 69.3, a 53% reduction in size.

Table 4.9 Single Combination Versus All Permutation

Testing     | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
Combination | 0.6 (-0.4)         | 71.7 (+3.2)     | 99.9                   | 84.1 (-2.9)         | 69.3 ± 7.8
Permutation | 0.8 (-0.2)         | 73.3 (+4.8)     | 99.9                   | 83.0 (-4.0)         |

 

 

To further reduce the size of the training repository we applied the Relevant Based approach to create the Not Relevant training examples. Once again the training Relevant pairs were created from updates limited to a group size of 20 for the years 1997-1998, and all the Relevant examples for the years 1995-1998 were removed from the Not Relevant examples created. Table 4.10 shows the effect of this method on the number of examples, where the first two rows show the training repository when all valid file pairings are used to create the Not Relevant examples and when the Relevant Based method is used. The third row shows the testing repository, which was already created using the Relevant Based method.

Table 4.10 Using Relevant Based Approach to Create the Training Not Relevant Examples

 

               | Relevant | Not Relevant | Relevant/Not Relevant Ratio
All Pairs      | 2630     | 6755620      | 0.0004
Relevant Based | 2630     | 1042524      | 0.0025
Testing        | 2868     | 1444813      | 0.0020

 

In Table 4.11, results of using Relevant Based training sets for skewness ratios of 1, 5 and 10 are shown. For each ratio we generated 10 less skewed training sets that included all the training Relevant pairs. We macro averaged the results over the same 10 testing sets used in the experiments shown in Table 4.4. For each level of skewness, the first line shows the result for the noisy training sets and the second line shows the result for the noise free training sets; the better entry of each pair is marked with an asterisk. Comparing the entries in this table with the corresponding entries in Tables 4.4 and 4.5 shows that while applying the Relevant Based method decreases the size of the training repository considerably, it does not degrade the other measures dramatically, at least for the ratios shown here.

Table 4.11 The Effect of Using Relevant Based Training Files

Skew | Training Set | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
1    | Noisy        | 1.0*               | 67.3            | 99.9                   | 87.4*               | 154.8 ± 14.0
1    | Noise Free   | 0.7                | 72.3*           | 99.9                   | 80.7                | 92.1 ± 18.3*
5    | Noisy        | 3.0                | 47.5*           | 99.9                   | 96.9                | 198.0 ± 19.3
5    | Noise Free   | 3.1*               | 45.7            | 99.9                   | 97.2*               | 113.9 ± 11.9*
10   | Noisy        | 4.4                | 40.0*           | 99.9                   | 98.3                | 245.3 ± 13.5
10   | Noise Free   | 5.1*               | 36.1            | 99.9                   | 98.7*               | 129.6 ± 9.7*

 

The next step towards reducing the size of the training and testing sets was removing Assembly language files from the examples. Due to its unstructured format, the Assembly source code imposed certain difficulties in extracting information such as routine definitions and calls. Our Assembler parsers were never as accurate as the Pascal parser and were consequently prone to introducing noise into the data sets. Therefore we decided to remove examples that were based on Assembly language files from our training and testing repositories. The distribution of examples in the new non‑assembly repositories is shown in Table 4.12. The result of learning from non skewed training sets (skewness ratio 1) based on the new training repository, compared to the corresponding entry from Table 4.4, is shown in Table 4.13.

Table 4.12 Pascal Only Repositories

 

         | Relevant      | Not Relevant     | Relevant/Not Relevant Ratio
Training | 1642 (-37.6%) | 752915 (-88.9%)  | 0.0022
Testing  | 1861 (-35.1%) | 1036692 (-28.2%) | 0.0018

 

As can be seen, while for the non skewed data sets the precision and recall values worsened, the average size of the generated decision trees was reduced by 42%.

Table 4.13 Results of Experimenting With Pascal Only File Pairs

File Types     | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Tree Size
With Assembler | 1.0                | 68.5            | 99.9                   | 87.0                | 147.7 ± 14.0
No Assembler   | 0.8                | 65.7            | 99.9                   | 84.9                | 85.9 ± 13.2

 

In all the experiments presented here, the training Not Relevant examples were randomly selected from the training repository. By doing so we would theoretically preserve the proportion of attribute value patterns among training Not Relevant examples. However, in practice this is not guaranteed, most notably when the random numbers generated are not uniformly distributed. To avoid such potential problems we implemented a stratified sampling procedure where each Not Relevant attribute value pattern p that appears in the training repository with a proportion P will appear with a similar proportion in the training sets; a minimal sketch of this sampling procedure is given after the list of ratios below. We then proceeded with three sets of experiments, each with the following 18 different skewness ratios:

1, 2, ..., 10, 15, 20, 25, 30, 35, 40, 45, 50
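A minimal sketch (ours, with the grouping and sampling details assumed rather than taken from the thesis implementation) of stratified sampling of Not Relevant examples: examples are grouped by attribute value pattern, and each pattern contributes to the sample in roughly the same proportion as in the full training repository.

    from collections import defaultdict
    import random

    def stratified_not_relevant_sample(not_relevant_examples, sample_size):
        by_pattern = defaultdict(list)
        for pattern, label in not_relevant_examples:
            by_pattern[pattern].append((pattern, label))
        total = len(not_relevant_examples)
        sample = []
        for pattern, group in by_pattern.items():
            # each pattern keeps (approximately) its original proportion
            k = max(1, round(sample_size * len(group) / total))
            sample.extend(random.sample(group, min(k, len(group))))
        return sample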

In experiment set 1 we used a training and testing repository very similar to the one described in Table 4.12, and we made sure all file pairs used to generate the repository appeared in the same release of the software.

In experiment set 2 we removed all the known Relevant pairs for the 1995-1998 time period from the testing Not Relevant pairs that were generated from the 1999 updates. We remind the reader that this set of Relevant file pairs had already been removed from the training Not Relevant sets. The motivation for doing so is to reduce the Not Relevant labeling heuristic error by not labeling known Relevant file pairs as Not Relevant.

Finally, we repeated the same procedure for a testing repository generated from 1999 updates with no group size limit. The size of the three training and testing repositories used is shown in Table 4.14, while Figure 4.1 shows the ROC plots for the three sets of experiments, a total of 54 training/testing tasks. The results are based on testing the generated classifiers on the whole testing repository as a single test set.

Table 4.14 Training and Testing Repositories Used for Stratified Training Experiments

Exp. # | Training/Testing | Relevant | Not Relevant | Relevant/Not Relevant Ratio
1      | Training         | 1642     | 752915       | 0.0022
1      | Testing          | 1853     | 1036692      | 0.0018
2      | Testing          | 1853     | 992716       | 0.0019
3      | Testing          | 14819    | 1007535      | 0.015

 

Figure 4.1 ROC Comparison of Stratified Training Experiments

As can be seen, removing all the known Relevant examples from 1995-1998 from the Not Relevant examples of the testing repository improves the results. However, using all the Relevant file pairs in 1999 without imposing a group size limit degrades the results.

The lessons learned from the experiments discussed above suggest the following:

1.     Train with less skewed data sets

2.     Focus on the Pascal (non assembler) file pairs only

3.     Remove the known Relevant examples from the testing Not Relevant examples

4.     Use only one combination of file pairs instead of their permutation

5.     Use the Relevant Based approach to create training Not Relevant examples

6.     Use stratified samples from the training Not Relevant repository to generate less skewed training sets

4.4.2 Experiments With Set Covering Machines

The set covering machine (SCM) [Marchand and Shawe-Taylor 2001][Marchand and Shawe-Taylor 2002] is a new learning method that generalizes classical algorithms ([Valiant 1984] and [Haussler 1988]) for learning from Boolean attributes. Results reported in [Marchand and Shawe-Taylor 2001][Marchand and Shawe-Taylor 2002] show the SCM to be a competitive induction algorithm compared to other classifier inducers, such as SVMs, that also produce non‑explainable models. We also had the benefit of local access to Dr. Mario Marchand, who was the designer and developer of the first SCM learning software. He provided the software and invaluable insight into the inner workings of the algorithm and its implementation. Thus we decided to perform some experiments using this algorithm and compare the results to the C5.0 results.

The SCM induction algorithm can generate classifiers which are conjunctions or disjunctions of Boolean features; however, the attributes describing an example do not have to be Boolean themselves. Similar to Support Vector Machines, SCMs can map the original input space into a new high-dimensional feature space. In the case of the SCM these new features are Boolean, and the induced classifiers are either conjunctions or disjunctions of these features.

While the original Valiant algorithm only selects features that are consistent with all the positive training examples, i.e. features that correctly classify them, the SCM algorithm allows a feature to make some mistakes on the positive training examples in order to obtain better generalization, and consequently better results on unseen examples. The SCM algorithm employs a greedy method to choose the best feature at each stage until the termination criterion is met. The ranking function used to make this choice provides what is known as the usefulness of a feature h and is defined as [Marchand and Shawe-Taylor JOURNAL]:

Uh = |Qh| - p·|Rh|,

where |Qh| is the number of negative examples covered by feature h, and |Rh| is the number of positive examples misclassified by feature h. The goal here is to select a feature that is consistent with as many negative examples as possible while misclassifying as few positive examples as possible. The parameter p is known as the penalty value and indicates the factor by which we penalize the misclassification of positive examples by a feature. The SCM implementation that we experimented with introduces two other variants of this function, controlled by a parameter named Feature Score Function (FSF). The above definition corresponds to an FSF value of 0, while the two other alternatives are selected by FSF values of 1 and 2.
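A minimal sketch (ours, simplified; for instance it omits the stopping criteria and FSF variants of the actual implementation) of greedy feature selection driven by the usefulness score Uh = |Qh| - p·|Rh| for a conjunctive SCM. Features are assumed to be callables returning True or False for an example.

    def greedy_scm_conjunction(features, positives, negatives, penalty, max_features):
        chosen = []
        remaining_neg = list(negatives)
        remaining_pos = list(positives)
        while remaining_neg and len(chosen) < max_features:
            def usefulness(h):
                q = sum(1 for x in remaining_neg if not h(x))  # negatives covered by h
                r = sum(1 for x in remaining_pos if not h(x))  # positives misclassified by h
                return q - penalty * r
            best = max(features, key=usefulness)
            if all(best(x) for x in remaining_neg):
                break  # the best feature rejects no remaining negatives; stop
            chosen.append(best)
            # in a conjunction, negatives rejected by `best` are now covered,
            # and positives rejected by `best` are misclassified in any case
            remaining_neg = [x for x in remaining_neg if best(x)]
            remaining_pos = [x for x in remaining_pos if best(x)]
        return chosen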

As is shown in Table 3.2, most of our attributes are not Boolean. Therefore, to be able to use the SCM algorithm, this input space must be transformed into a Boolean feature space. One such transformation is achieved by creating a feature hi,r as a data-dependent ball of radius r centered on each training example xi, as defined below:

hi,r(x) = hr(x, xi) = yi if d(x, xi) ≤ r, and ȳi otherwise,

where yi ∈ {0, 1} is the class of example xi, ȳi is the Boolean complement of yi, and d(x, xi) is the distance between x and xi. The real valued r is the radius of the ball. While this value can theoretically approach infinity, in practice the largest useful value of r is the distance between the two most distant examples in the training set. Thus, for m training examples, this approach will generate at most O(m²) Boolean features that the SCM induction algorithm has to consider. The above discussion is also valid for a disjunctive SCM, with minor adjustments as discussed in [Marchand and Shawe-Taylor JOURNAL].

In our experiments we have used two distance functions or metrics. The L1 metric or norm is the Manhattan or city block distance, which for two vectors x1 and x2 of dimension n is calculated as:

dL1(x1, x2) = Σ k=1..n |x1,k - x2,k|

The L2 metric is the well known Euclidean distance, calculated as:

dL2(x1, x2) = √( Σ k=1..n (x1,k - x2,k)² )
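A minimal sketch (ours, with illustrative vectors and classes) of the two distance metrics above and of a data-dependent ball feature centred on a training example: inside the ball the feature outputs the class yi of the centre, outside it outputs the complement of yi.

    import math

    def l1_distance(x1, x2):
        return sum(abs(a - b) for a, b in zip(x1, x2))

    def l2_distance(x1, x2):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

    def ball_feature(centre, y_centre, radius, distance=l1_distance):
        def h(x):
            return y_centre if distance(x, centre) <= radius else 1 - y_centre
        return h

    h = ball_feature(centre=(1, 2, 0), y_centre=1, radius=3.0)
    print(h((2, 2, 1)), h((10, 10, 10)))   # inside -> 1, outside -> 0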

The SCM induction algorithm also accepts two parameters for positive and negative loss ratios. In essence, these parameters are used to internally weigh the importance of a misclassification of a positive example versus the misclassification of a negative example. For instance, if the positive to negative loss ratio is set to 10, then a mistake on classifying a positive example is 10 times more costly than misclassifying a negative example.

In the experiments reported in this section we have created conjunctive and disjunctive classifiers using the L1 and L2 distance metrics, for positive to negative loss ratios of 1 and 10. We have also experimented with the following 16 penalty values:

0.1,0.2,0.3,0.4,0.5,1.0,1.5,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0

In Table 4.15 we show the results obtained for a conjunctive SCM with a positive to negative loss ratio[53] of 1, for the L1 and L2 distance metrics. The second row of the table shows the results obtained using C5.0 for the same data set; this result is our reference for comparison. The training set used for all the experiments reported in this section was based on the 1997-1998 updates, while the testing set was based on the 1999 updates. The group size was limited to 20, and in the case of the training set all the Relevant pairs for the 1995-1998 time period were removed from the Not Relevant examples. The skewness ratio was set to 1, meaning the training set was balanced.

Table 4.15 Conjunction of Balls for Loss Ratio 1 and Distance Metrics L1 and L2

Type: Conjunction   FSF: 1   Pos/Neg Loss: 1   Skew: 1
Penalty values tried: 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0

Classifier       | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Number of Nodes | Best Penalty Value
SCM, L1 metric   | 4.0                | 11.3            | 99.8                   | 99.5                | 5               | 0.1
C5.0 (reference) | 0.8                | 67.9            | 99.9                   | 85.5                | 63              |
SCM, L2 metric   | 3.5                | 7.1             | 99.8                   | 99.6                | 5               | 0.1

 

This setup was used with the above mentioned 16 penalty values for each metric, so Table 4.15 provides the best result obtained on the testing set, i.e. the one with the minimum classification error, among the 16 experiments per metric. The tables report the size of the SCM created along with the penalty value that gives the best result. As can be seen in this table, the L1 metric generated better results than the L2 metric; however, comparing these results with the corresponding C5.0 result shows that the improvement in the precision of the Relevant class, which is the class of interest for us, has come at the expense of a major degradation of the recall of this class. Table 4.16 shows the same setup for a disjunctive SCM, with very similar observations. The clear difference shows up in the size of the SCM and in the penalty value that provides the machine with the least classification error: the disjunctive SCM generated has a smaller size, and the best result is obtained for the largest penalty value experimented with, as opposed to the smallest.

Table 4.16 Disjunction of Balls for Loss Ratio 1 and Distance Metrics L1 and L2

Type: Disjunction   FSF: 1   Pos/Neg Loss: 1   Skew: 1
Penalty values tried: 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0

Classifier     | Relevant Precision | Relevant Recall | Not Relevant Precision | Not Relevant Recall | Number of Nodes | Best Penalty Value
SCM, L1 metric | 3.3                | 11.9            | 99.8                   | 99.4                | 1               | 10.0
SCM, L2 metric | 3.2                | 7.9             | 99.8                   | 99.6                | 1               | 10.0

 

Tables 4.17 and 4.18 each show the best results obtained from 32 experiments, created by inducing conjunctive SCMs for a positive to negative loss ratio of 10 and FSF values of 1 and 2. The experiments in Table 4.17 use the L1 distance metric, while the experiments in Table 4.18 use the L2 metric. The corresponding results shown in these two tables are very close, with the L1 metric showing a slight advantage by creating smaller SCMs and a slightly higher recall for the Relevant class.

Table 4.17 Conjunction of Balls for Loss Ratio 10, Distance Metrics L1, and FSF 1 and 2

Type: Conjunction   Metric: L1   FSF: 1   Pos/Neg Loss: 10   Skew: 1
Penalty values tried: 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0