Skip to content

AutoFL

License: GPL v3 DOI Docker

Automatic source code file annotation using weak labeling.

Overview

AutoFL is a tool designed for automatic annotation of source code files through weak labeling techniques. It provides both an API and a web-based UI for easy analysis of projects across different languages.

Setup

To set up the repository along with its UI submodule, clone it using:

git clone --recursive git@github.com:SasCezar/AutoFL.git AutoFL

Optional Model Setup

For advanced features like semantic-based labeling, download models as required. For example, to use w2v-so, download the model from here and place it in the data/models/w2v-so folder. Alternatively, you can provide a custom path in the configuration files.

Usage

To run the tool using Docker, navigate to the project directory (where the docker-compose.yaml file is located) and execute:

docker compose up

API Endpoint

To analyze the files of a project, make a POST request to the following endpoint:

curl -X POST -d '{"name": "<PROJECT_NAME>", "remote": "<PROJECT_REMOTE>", "languages": ["<PROGRAMMING_LANGUAGE>"]}' localhost:8000/label/files -H "content-type: application/json"

For instance, to analyze the project at https://github.com/mickleness/pumpernickel, use:

curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json"

Web UI

AutoFL provides a web-based UI accessible locally at http://localhost:8501:

UI

For more details, check the UI repository.

Configuration

AutoFL uses Hydra to manage configurations. The configuration files can be found in the config folder. The main configuration file, main.yaml, allows you to customize various options:

  • local: Choose between local or Docker environments. Docker is the default.
  • taxonomy: Set the taxonomy for labeling. Currently supports gitranking. You can add custom taxonomies.
  • annotator: Specify the annotators to use. The default is simple, offering good results without dependencies on language models.
  • version_strategy: Select the versioning strategy. The default is latest.
  • dataloader: Choose the dataloader. The default is postgres.
  • writer: Set the writer for storing results. The default is postgres.

Additional configurations can be added by creating new files in the corresponding component folders.

Functionalities

  • Annotation (UI/API/Script)
  • File-Level
  • Package-Level
  • Project-Level
  • Batch Analysis (Script Only)
  • Temporal Analysis (TODO)
  • Classification (TODO)

Supported Languages

  • Java
  • Python (untested)
  • C (untested)
  • C++ (untested)
  • C# (untested)

Development

AutoFL is composed of multiple components, as shown in the architecture diagram below:

Architecture

Adding Support for New Languages

To add support for additional languages, a language-specific parser is required. You can use tree-sitter to develop a parser quickly.

Parser Details

The parser needs to be located in the parser/languages folder. It should extend the BaseParser class, which follows this structure:

class ParserBase(ABC):
"""
Abstract class for a programming language parser.
"""

    def __init__(self, library_path: Path | str):
        """
        :param library_path: Path to the tree-sitter languages.so file. The file has to contain the
        language parser. See tree-sitter for more details
        """
        ...

To implement the parsing logic, create a class that handles extracting identifiers. For Python, the parser might look like:

class PythonParser(ParserBase, lang=Extension.python.name):
    """
    Python-specific parser using a generic grammar for multiple versions. Utilizes tree-sitter for AST extraction.
    """

    def __init__(self, library_path: Path | str):
        ...

A custom parser independent of tree-sitter can also be developed. For more details, refer to the implementation of ParserBase.

Known Issues

  • Dependency Installation: The setup process may take significant time (~10 minutes), and dependency installations might fail due to timeouts. This appears to be a network-related issue, and retrying often resolves it. Future updates will aim to simplify dependencies.
  • ~~Indefinite Analysis Loops~~: ~~In some projects, the analysis may loop indefinitely. This issue is currently under investigation.~~ Seems solved in the latest version. Will monitor for further occurrences.

Docker Image Availability

AutoFL is also available as a Docker image. You can pull the image from Docker Hub using:

docker pull cezarsas/autofl

Find more details and updates at the Docker Hub page.

Disclaimer

This tool is in active development and may not function as expected in some cases. It has been tested primarily on Docker versions 24.0.7 and 25.0.0 for Ubuntu 22.04. Limited testing has been performed on Windows and MacOS, where functionality may vary.

If you encounter any issues, please open an issue on GitHub, make a pull request, or contact me at c.a.sas@rug.nl.

Citation

If you find this tool useful, please cite our work:

Paper

@article{sas2024multigranular,
title     = {Multi-granular Software Annotation using File-level Weak Labelling},
author    = {Cezar Sas and Andrea Capiluppi},
journal   = {Empirical Software Engineering},
volume    = {29},
number    = {1},
pages     = {12},
year      = {2024},
url       = {https://doi.org/10.1007/s10664-023-10423-7},
doi       = {10.1007/s10664-023-10423-7}
}

Note: The code used in this paper is available at CodeGraphClassification. However, AutoFL provides enhanced features, is more user-friendly, and includes a UI.

Tool

@software{sas2023autofl,
          author    = {Sas, Cezar and Capiluppi, Andrea},
          month     = oct,
          title     = {{AutoFL}},
          url       = {https://github.com/SasCezar/AutoFL},
          version   = {0.5.0},
          year      = {2024},
          url       = {https://doi.org/10.5281/zenodo.10255368},
          doi       = {10.5281/zenodo.10255368}
}