The Wayback Machine - https://web.archive.org/web/20200914055813/https://github.com/ai-se/Jitterbug

Identifying Self-Admitted Technical Debts with Jitterbug: A Two-step Approach

Cite as:

@misc{yu2020identifying,
    title={Identifying Self-Admitted Technical Debts with Jitterbug: A Two-step Approach},
    author={Zhe Yu and Fahmid Morshed Fahid and Huy Tu and Tim Menzies},
    year={2020},
    eprint={2002.11049},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}

Data

  • Original from Maldonado and Shihab, "Detecting and quantifying different types of self-admitted technical debt," in 2015 IEEE 7th International Workshop on Managing Technical Debt (MTD). IEEE, 2015, pp. 9–15.
  • Corrected: 439 labels were manually checked; 431 of them were corrected.

Experiments

Setup

Jitterbug$ pip install -r requirements.txt
Jitterbug$ python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
Jitterbug$ cd src

RQ1: How to find the strong patterns of the "easy to find" SATDs in Step 1?

  • Prepare data:
src$ python main.py parse
  • Find patterns with Easy (target project = apache-ant-1.7.0):
src$ python main.py find_patterns

{'fp': 367.0, 'tp': 2493.0, 'fitness': 0.8716783216783217}
{'fp': 53.0, 'tp': 330.0, 'fitness': 0.8616187989556136}
{'fp': 7.0, 'tp': 87.0, 'fitness': 0.925531914893617}
{'fp': 3.0, 'tp': 46.0, 'fitness': 0.9387755102040817}
{'fp': 28.0, 'tp': 61.0, 'fitness': 0.6853932584269663}
Patterns:
[u'todo' u'fixme' u'hack' u'workaround']
Precisions on training set:
{u'fixme': 0.8616187989556136, u'todo': 0.8716783216783217, u'workaround': 0.9387755102040817, u'hack': 0.925531914893617}
src$ python main.py Easy_results original
src$ python main.py MAT_results original
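From the output above, each pattern's fitness works out to tp / (tp + fp), i.e. its precision on the training set (e.g. 2493 / (2493 + 367) ≈ 0.8717 for "todo"). A minimal sketch of such a keyword matcher and its fitness score; function and variable names here are illustrative, not the repo's actual API:

```python
def pattern_fitness(comments, labels, pattern):
    """Count occurrences of `pattern` in comments: tp where the ground-truth
    label is 'yes' (SATD), fp otherwise. Fitness = tp / (tp + fp)."""
    tp = fp = 0
    for text, label in zip(comments, labels):
        if pattern in text.lower():
            if label == "yes":
                tp += 1
            else:
                fp += 1
    return {"fp": float(fp), "tp": float(tp),
            "fitness": tp / (tp + fp) if tp + fp else 0.0}

comments = ["TODO: fix this later", "clean code", "HACK around the parser bug"]
labels = ["yes", "no", "yes"]
print(pattern_fitness(comments, labels, "todo"))
# {'fp': 0.0, 'tp': 1.0, 'fitness': 1.0}
```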

Can the ground truth be wrong?

  • Find conflicting labels (GT=no AND Easy=yes), save as csv files under the conflicts directory:
src$ python main.py validate_ground_truth
  • Validate the conflicting labels manually; the results are under the validate directory.
  • Summarize validation results and save as validate_sum.csv:
src$ python main.py summarize_validate
  • Correct ground truth labels with the validation results, new data saved under corrected directory:
src$ python main.py correct_ground_truth
  • Test Easy on every target project with corrected labels, save the output as step1_Easy_corrected.csv, and write the data with the "easy to find" SATDs removed to the rest directory:
src$ python main.py Easy_results corrected
src$ python main.py MAT_results corrected
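The conflict check above boils down to selecting comments where the ground truth says "no" but one of Easy's patterns fires. A self-contained sketch under that assumption; the record fields and helper name are hypothetical, not the repo's actual code:

```python
PATTERNS = ("todo", "fixme", "hack", "workaround")

def find_conflicts(rows):
    """Return rows where ground truth = 'no' but a pattern fires (Easy = 'yes')."""
    return [r for r in rows
            if r["label"] == "no"
            and any(p in r["text"].lower() for p in PATTERNS)]

rows = [
    {"text": "TODO: remove before release", "label": "no"},   # conflict
    {"text": "regular comment", "label": "no"},               # no pattern
    {"text": "temporary workaround", "label": "yes"},         # agrees with Easy
]
print(len(find_conflicts(rows)))  # 1
```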

RQ2: How to better find the "hard to find" SATDs with less human effort in Step 2?

  • Test Hard, TM, and other supervised learners on every target project with the "easy to find" SATDs removed, save the results (rest_*.csv) to the results directory, and dump them as rest_result.pickle:
src$ python main.py rest_results
  • Plot the recall-cost curves of the Step 2 experiments to the figures_rest directory:
src$ python main.py plot_recall_cost rest
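A recall-cost curve plots the fraction of true SATDs found (recall) against the fraction of comments a human has reviewed (cost), following the learner's ranking. A minimal sketch of how such curve points could be computed (this is not the repo's plotting code, just the underlying arithmetic):

```python
def recall_cost_curve(order, labels):
    """order: indices of comments in the sequence a reviewer inspects them;
    labels: 1 for SATD, 0 otherwise. Returns a list of (cost, recall) points."""
    total = sum(labels)
    found = 0
    points = []
    for k, i in enumerate(order, start=1):
        found += labels[i]
        points.append((k / len(labels), found / total))
    return points

labels = [1, 0, 1, 0, 0]
curve = recall_cost_curve([0, 2, 1, 3, 4], labels)  # a good ranking ranks SATDs first
print(curve[1])  # after reviewing 2 of 5 comments (cost 0.4), recall is already 1.0
```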

When to stop Hard in Step 2?

  • Use the estimator to stop at 90% recall, and plot the curves to the figures_est directory:
src$ python main.py estimate_results
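The idea behind estimator-based stopping: while the human reviews comments, estimate the total number of SATDs in the remaining data and stop once the count found reaches 90% of that estimate. A toy sketch of the stopping condition only; the estimator that produces `estimated_total` is out of scope here, and the names are illustrative:

```python
def should_stop(found, estimated_total, target=0.9):
    """Stop reviewing once the fraction of SATDs found reaches `target`
    of the estimated total (the estimate comes from a separate model)."""
    return found / estimated_total >= target

print(should_stop(27, 30))  # True: 27 of an estimated 30 SATDs found (90%)
print(should_stop(20, 30))  # False: only ~67% of the estimate found so far
```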

RQ3: Overall how does Jitterbug perform?

src$ python main.py overall_results
src$ python main.py plot_recall_cost overall

How does Jitterbug perform when targeting 90% of the "hard to find" SATDs?

Collect precision, recall, F1, and cost results of Jitterbug on the original dataset. Save results as stopping_0.9_original.csv.

src$ python main.py stopping_results original

Collect precision, recall, F1, and cost results of Jitterbug on the corrected dataset. Save results as stopping_0.9_corrected.csv.

src$ python main.py stopping_results corrected
