Software Engineering
See recent articles
Showing new listings for Monday, 6 October 2025
- [1] arXiv:2510.02387 [pdf, html, other]
-
Title: CWM: An Open-Weights LLM for Research on Code Generation with World ModelsFAIR CodeGen team. Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Alex Gu, Michael Hassid, Daniel Haziza, Badr Youbi Idrissi, Christian Keller, Rahul Kindi, Hugh Leather, Gallil Maimon, Aram Markosyan, Francisco Massa, Pierre-Emmanuel Mazaré, Vegard Mella, Naila Murray, Keyur Muzumdar, Peter O'Hearn, Matteo Pagliardini, Dmitrii Pedchenko, Tal Remez, Volker Seeker, Marco Selvi, Oren Sultan, Sida Wang, Luca Wehrstedt, Ori Yoran, Lingming Zhang, Taco Cohen, Yossi Adi, Gabriel SynnaeveComments: 58 pagesSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.
- [2] arXiv:2510.02389 [pdf, html, other]
-
Title: From Trace to Line: LLM Agent for Real-World OSS Vulnerability LocalizationSubjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detections - offering limited actionable guidance to engineers who need precise line-level localization and targeted patches in real-world software development. We present T2L-Agent (Trace-to-Line Agent), a project-level, end-to-end framework that plans its own analysis and progressively narrows scope from modules to exact vulnerable lines. T2L-Agent couples multi-round feedback with an Agentic Trace Analyzer (ATA) that fuses runtime evidence - crash points, stack traces, and coverage deltas - with AST-based code chunking, enabling iterative refinement beyond single pass predictions and translating symptoms into actionable, line-level diagnoses. To benchmark line-level vulnerability discovery, we introduce T2L-ARVO, a diverse, expert-verified 50-case benchmark spanning five crash families and real-world projects. T2L-ARVO is specifically designed to support both coarse-grained detection and fine-grained localization, enabling rigorous evaluation of systems that aim to move beyond file-level predictions. On T2L-ARVO, T2L-Agent achieves up to 58.0% detection and 54.8% line-level localization, substantially outperforming baselines. Together, the framework and benchmark push LLM-based vulnerability detection from coarse identification toward deployable, robust, precision diagnostics that reduce noise and accelerate patching in open-source software workflows.
- [3] arXiv:2510.02393 [pdf, html, other]
-
Title: AP2O: Correcting LLM-Generated Code Errors Type by Type Like Humans via Adaptive Progressive Preference OptimizationSubjects: Software Engineering (cs.SE)
LLMs' code generation capabilities have yielded substantial improvements in the effectiveness of programming tasks. However, LLM-generated code still suffers from compilation and runtime errors. Existing offline preference optimization methods primarily focus on enhancing LLMs' coding abilities using pass/fail signals in the preference data, overlooking the deep-level error types in the failed codes. To address this, we propose Adaptively Progressive Preference Optimization (AP2O) for coding (i.e., AP2O-Coder), a method that guides LLMs adaptively and methodically to reduce code errors for code generation. Specifically, we construct an error notebook from failed codes and progressively optimize the LLM to correct errors type by type. Furthermore, we adaptively replay error types to tailor to the LLM's changing weaknesses throughout the training process. Through extensive experiments on both code and general LLMs (Llama, Qwen, and DeepSeek series) with parameters ranging from 0.5B to 34B, our AP2O-Coder improves code generation performance by up to 3% in pass@k while using less preference data. Code: this https URL
- [4] arXiv:2510.02404 [pdf, html, other]
-
Title: Dynamic Function Configuration and its Management in Serverless Computing: A Taxonomy and Future DirectionsComments: 34 pages, 2 figures, 2 tables, journalSubjects: Software Engineering (cs.SE)
The serverless cloud computing model offers a framework where the service provider abstracts the underlying infrastructure management from developers. In this serverless model, FaaS provides an event-driven, function-oriented computing service characterised by fine-grained, usage-based pricing that eliminates cost for idle resources. Platforms like AWS Lambda, Azure Functions, and Cloud Run Functions require developers to configure their function(s) with minimum operational resources for its successful execution. This resource allocation influences both the operational expense and the performance quality of these functions. However, a noticeable lack of platform transparency forces developers to rely on expert knowledge or experience-based ad-hoc decisions to request desired function resources. This makes optimal resource configuration a non-trivial task while adhering to performance constraints. Furthermore, while commercial platforms often scale resources like CPU and network bandwidth proportional to memory, open-source frameworks permit independent configuration of function resources, introducing additional complexity for developers aiming to optimise their functions. These complexities have directed researchers to resolve developer challenges and advance towards an efficient server-less execution model. In this article, we identify different aspects of resource configuration techniques in FaaS settings and propose a taxonomy of factors that influence function design, configuration, run-time cost, and performance guarantees. We conduct an analysis of existing literature on resource configuration to present a comprehensive review of current studies on function configuration. We also identify existing research gaps and suggest future research directions to enhance function configuration and strengthen the capabilities of serverless computing environments to drive its broader adoption.
- [5] arXiv:2510.02504 [pdf, html, other]
-
Title: Product Manager Practices for Delegating Work to Generative AI: "Accountability must not be delegated to non-human actors"Mara Ulloa, Jenna L. Butler, Sankeerti Haniyur, Courtney Miller, Barrett Amos, Advait Sarkar, Margaret-Anne StoreyComments: 12 pages, 4 figures, 1 tableSubjects: Software Engineering (cs.SE)
Generative AI (GenAI) is changing the nature of knowledge work, particularly for Product Managers (PMs) in software development teams. While much software engineering research has focused on developers' interactions with GenAI, there is less understanding of how the work of PMs is evolving due to GenAI. To address this gap, we conducted a mixed-methods study at Microsoft, a large, multinational software company: surveying 885 PMs, analyzing telemetry data for a subset of PMs (N=731), and interviewing a subset of 15 PMs. We contribute: (1) PMs' current GenAI adoption rates, uses cases, and perceived benefits and barriers and; (2) a framework capturing how PMs assess which tasks to delegate to GenAI; (3) PMs adaptation practices for integrating GenAI into their roles and perceptions of how their role is evolving. We end by discussing implications on the broader GenAI workflow adoption process and software development roles.
- [6] arXiv:2510.02534 [pdf, html, other]
-
Title: ZeroFalse: Improving Precision in Static Analysis with LLMsMohsen Iranmanesh (Simon Fraser University), Sina Moradi Sabet (Amirkabir University of Technology), Sina Marefat (K. N. Toosi University of Technology), Ali Javidi Ghasr (Ferdowsi University of Mashhad), Allison Wilson (Cyber Risk Solutions), Iman Sharafaldin (Forward Security), Mohammad A. Tayebi (Simon Fraser University)Subjects: Software Engineering (cs.SE)
Static Application Security Testing (SAST) tools are integral to modern software development, yet their adoption is undermined by excessive false positives that weaken developer trust and demand costly manual triage. We present ZeroFalse, a framework that integrates static analysis with large language models (LLMs) to reduce false positives while preserving coverage. ZeroFalse treats static analyzer outputs as structured contracts, enriching them with flow-sensitive traces, contextual evidence, and CWE-specific knowledge before adjudication by an LLM. This design preserves the systematic reach of static analysis while leveraging the reasoning capabilities of LLMs. We evaluate ZeroFalse across both benchmarks and real-world projects using ten state-of-the-art LLMs. Our best-performing models achieve F1-scores of 0.912 on the OWASP Java Benchmark and 0.955 on the OpenVuln dataset, maintaining recall and precision above 90%. Results further show that CWE-specialized prompting consistently outperforms generic prompts, and reasoning-oriented LLMs provide the most reliable precision-recall balance. These findings position ZeroFalse as a practical and scalable approach for enhancing the reliability of SAST and supporting its integration into real-world CI/CD pipelines.
- [7] arXiv:2510.02585 [pdf, html, other]
-
Title: Key Considerations for Auto-Scaling: Lessons from Benchmark MicroservicesSubjects: Software Engineering (cs.SE)
Microservices have become the dominant architectural paradigm for building scalable and modular cloud-native systems. However, achieving effective auto-scaling in such systems remains a non-trivial challenge, as it depends not only on advanced scaling techniques but also on sound design, implementation, and deployment practices. Yet, these foundational aspects are often overlooked in existing benchmarks, making it difficult to evaluate autoscaling methods under realistic conditions. In this paper, we identify a set of practical auto-scaling considerations by applying several state-of-the-art autoscaling methods to widely used microservice benchmarks. To structure these findings, we classify the issues based on when they arise during the software lifecycle: Architecture, Implementation, and Deployment. The Architecture phase covers high-level decisions such as service decomposition and inter-service dependencies. The Implementation phase includes aspects like initialization overhead, metrics instrumentation, and error propagation. The Deployment phase focuses on runtime configurations such as resource limits and health checks. We validate these considerations using the Sock-Shop benchmark and evaluate diverse auto-scaling strategies, including threshold-based, control-theoretic, learning-based, black-box optimization, and dependency-aware approaches. Our findings show that overlooking key lifecycle concerns can degrade autoscaler performance, while addressing them leads to more stable and efficient scaling. These results underscore the importance of lifecycle-aware engineering for unlocking the full potential of auto-scaling in microservice-based systems.
- [8] arXiv:2510.02609 [pdf, html, other]
-
Title: RedCodeAgent: Automatic Red-teaming Agent against Diverse Code AgentsChengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo LiSubjects: Software Engineering (cs.SE)
Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters, enabling dynamic execution, debugging, and interactive programming capabilities. While these advancements have streamlined complex workflows, they have also introduced critical safety and security risks. Current static safety benchmarks and red-teaming tools are inadequate for identifying emerging real-world risky scenarios, as they fail to cover certain boundary conditions, such as the combined effects of different jailbreak tools. In this work, we propose RedCodeAgent, the first automated red-teaming agent designed to systematically uncover vulnerabilities in diverse code agents. With an adaptive memory module, RedCodeAgent can leverage existing jailbreak knowledge, dynamically select the most effective red-teaming tools and tool combinations in a tailored toolbox for a given input query, thus identifying vulnerabilities that might otherwise be overlooked. For reliable evaluation, we develop simulated sandbox environments to additionally evaluate the execution results of code agents, mitigating potential biases of LLM-based judges that only rely on static code. Through extensive evaluations across multiple state-of-the-art code agents, diverse risky scenarios, and various programming languages, RedCodeAgent consistently outperforms existing red-teaming methods, achieving higher attack success rates and lower rejection rates with high efficiency. We further validate RedCodeAgent on real-world code assistants, e.g., Cursor and Codeium, exposing previously unidentified security risks. By automating and optimizing red-teaming processes, RedCodeAgent enables scalable, adaptive, and effective safety assessments of code agents.
- [9] arXiv:2510.02634 [pdf, other]
-
Title: Automatic Building Code Review: A Case StudySubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Building officials, particularly those in resource-constrained or rural jurisdictions, face labor-intensive, error-prone, and costly manual reviews of design documents as projects increase in size and complexity. The growing adoption of Building Information Modeling (BIM) and Large Language Models (LLMs) presents opportunities for automated code review (ACR) solutions. This study introduces a novel agent-driven framework that integrates BIM-based data extraction with automated verification using both retrieval-augmented generation (RAG) and Model Context Protocol (MCP) agent pipelines. The framework employs LLM-enabled agents to extract geometry, schedules, and system attributes from heterogeneous file types, which are then processed for building code checking through two complementary mechanisms: (1) direct API calls to the US Department of Energy COMcheck engine, providing deterministic and audit-ready outputs, and (2) RAG-based reasoning over rule provisions, enabling flexible interpretation where coverage is incomplete or ambiguous.
The framework was evaluated through case demonstrations, including automated extraction of geometric attributes (such as surface area, tilt, and insulation values), parsing of operational schedules, and validation of lighting allowances under ASHRAE Standard 90.1-2022. Comparative performance tests across multiple LLMs showed that GPT-4o achieved the best balance of efficiency and stability, while smaller models exhibited inconsistencies or failures. Results confirm that MCP agent pipelines outperform RAG reasoning pipelines in rigor and reliability. This work advances ACR research by demonstrating a scalable, interoperable, and production-ready approach that bridges BIM with authoritative code review tools. - [10] arXiv:2510.02718 [pdf, html, other]
-
Title: Using Fourier Analysis and Mutant Clustering to Accelerate DNN Mutation TestingComments: 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)Subjects: Software Engineering (cs.SE)
Deep neural network (DNN) mutation analysis is a promising approach to evaluating test set adequacy. Due to the large number of generated mutants that must be tested on large datasets, mutation analysis is costly. In this paper, we present a technique, named DM#, for accelerating DNN mutation testing using Fourier analysis. The key insight is that DNN outputs are real-valued functions suitable for Fourier analysis that can be leveraged to quantify mutant behavior using only a few data points. DM# uses the quantified mutant behavior to cluster the mutants so that the ones with similar behavior fall into the same group. A representative from each group is then selected for testing, and the result of the test, e.g., whether the mutant is killed or survived, is reused for all other mutants represented by the selected mutant, obviating the need for testing other mutants. 14 DNN models of sizes ranging from thousands to millions of parameters, trained on different datasets, are used to evaluate DM# and compare it to several baseline techniques. Our results provide empirical evidence on the effectiveness of DM# in accelerating mutation testing by 28.38%, on average, at the average cost of only 0.72% error in mutation score. Moreover, on average, DM# incurs 11.78, 15.16, and 114.36 times less mutation score error compared to random mutant selection, boundary sample selection, and random sample selection techniques, respectively, while generally offering comparable speed-up.
- [11] arXiv:2510.02773 [pdf, html, other]
-
Title: Automated Repair of OpenID Connect Programs (Extended Version)Comments: This is an extended version. The original paper is accepted to ASE 2025Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
OpenID Connect has revolutionized online authentication based on single sign-on (SSO) by providing a secure and convenient method for accessing multiple services with a single set of credentials. Despite its widespread adoption, critical security bugs in OpenID Connect have resulted in significant financial losses and security breaches, highlighting the need for robust mitigation strategies. Automated program repair presents a promising solution for generating candidate patches for OpenID implementations. However, challenges such as domain-specific complexities and the necessity for precise fault localization and patch verification must be addressed. We propose AuthFix, a counterexample-guided repair engine leveraging LLMs for automated OpenID bug fixing. AuthFix integrates three key components: fault localization, patch synthesis, and patch verification. By employing a novel Petri-net-based model checker, AuthFix ensures the correctness of patches by effectively modeling interactions. Our evaluation on a dataset of OpenID bugs demonstrates that AuthFix successfully generated correct patches for 17 out of 23 bugs (74%), with a high proportion of patches semantically equivalent to developer-written fixes.
- [12] arXiv:2510.02854 [pdf, html, other]
-
Title: C2|Q>: A Robust Framework for Bridging Classical and Quantum Software DevelopmentBoshuai Ye, Arif Ali Khan, Teemu Pihkakoski, Peng Liang, Muhammad Azeem Akbar, Matti Silveri, Lauri MalmiComments: 46 pages, 8 images, 14 tables, Manuscript submitted to a Journal (2025)Subjects: Software Engineering (cs.SE)
Quantum Software Engineering (QSE) is emerging as a critical discipline to make quantum computing accessible to a broader developer community; however, most quantum development environments still require developers to engage with low-level details across the software stack - including problem encoding, circuit construction, algorithm configuration, hardware selection, and result interpretation - making them difficult for classical software engineers to use. To bridge this gap, we present C2|Q>: a hardware-agnostic quantum software development framework that translates classical specifications (code) into quantum-executable programs while preserving methodological rigor. The framework applies modular software engineering principles by classifying the workflow into three core modules: an encoder that classifies problems, produces Quantum-Compatible Formats (QCFs), and constructs quantum circuits, a deployment module that generates circuits and recommends hardware based on fidelity, runtime, and cost, and a decoder that interprets quantum outputs into classical solutions. In evaluation, the encoder module achieved a 93.8% completion rate, the hardware recommendation module consistently selected the appropriate quantum devices for workloads scaling up to 56 qubits, and the full C2|Q>: workflow successfully processed classical specifications (434 Python snippets and 100 JSON inputs) with completion rates of 93.8% and 100%, respectively. For case study problems executed on publicly available NISQ hardware, C2|Q>: reduced the required implementation effort by nearly 40X compared to manual implementations using low-level quantum software development kits (SDKs), with empirical runs limited to small- and medium-sized instances consistent with current NISQ capabilities. The open-source implementation of C2|Q>: is available at this https URL
- [13] arXiv:2510.02887 [pdf, html, other]
-
Title: GramTrans: A Better Code Representation Approach in Code GenerationZhao Zhang, Qingyuan Liang, Zeyu Sun, Yizhou Chen, Guoqing Wang, Yican Sun, Lu Zhang, Ge Li, Yingfei XiongSubjects: Software Engineering (cs.SE)
Code generation has shown great promise in assisting software development. A fundamental yet underexplored question is how the choice of code representation affects model performance. While existing studies employ various representations, such as treating code as plain text, grammar rule sequences, or syntax tree sequences, they lack a principled understanding of the relationship between parsing difficulty and model effectiveness. This paper proposes a conjecture: the easier a representation is to parse, the better performance the model achieves. We formalize this idea using grammar classes, where representations in simpler classes (e.g., LL(1)) are easier to parse. Through a controlled experiment on a Python-based DSL, we show that parsing difficulty strongly correlates with model performance. Motivated by this finding, we present GramTrans, a general approach that automatically transforms a context-free language into a representation within the LL(1) class. GramTrans introduces a novel hierarchical conflict elimination algorithm, enabling a flexible trade-off between syntactic simplicity and token efficiency. We evaluate GramTrans on both Python and Java using three code generation models: StarCoder 1B, DeepSeek-Coder 1.3B, and Qwen2.5 1.5B. Across multiple benchmarks, GramTrans consistently delivers significant improvements over baseline representations. Furthermore, our analysis of existing representations reconfirms the strong alignment between parsing difficulty and model performance, providing additional support for the conjecture.
- [14] arXiv:2510.02917 [pdf, html, other]
-
Title: Mechanistic Interpretability of Code Correctness in LLMs via Sparse AutoencodersSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
As Large Language Models become integral to software development, with substantial portions of AI-suggested code entering production, understanding their internal correctness mechanisms becomes critical for safe deployment. We apply sparse autoencoders to decompose LLM representations, identifying directions that correspond to code correctness. We select predictor directions using t-statistics and steering directions through separation scores from base model representations, then analyze their mechanistic properties through steering, attention analysis, and weight orthogonalization. We find that code correctness directions in LLMs reliably predict incorrect code, while correction capabilities, though statistically significant, involve tradeoffs between fixing errors and preserving correct code. Mechanistically, successful code generation depends on attending to test cases rather than problem descriptions. Moreover, directions identified in base models retain their effectiveness after instruction-tuning, suggesting code correctness mechanisms learned during pre-training are repurposed during fine-tuning. Our mechanistic insights suggest three practical applications: prompting strategies should prioritize test examples over elaborate problem descriptions, predictor directions can serve as error alarms for developer review, and these same predictors can guide selective steering, intervening only when errors are anticipated to prevent the code corruption from constant steering.
- [15] arXiv:2510.02934 [pdf, html, other]
-
Title: Model-Agnostic Correctness Assessment for LLM-Generated Code via Dynamic Internal Representation SelectionSubjects: Software Engineering (cs.SE)
Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and are increasingly integrated into the software development process. However, ensuring the correctness of LLM-generated code remains a critical concern. Prior work has shown that the internal representations of LLMs encode meaningful signals for assessing code correctness. Nevertheless, the existing methods rely on representations from pre-selected/fixed layers and token positions, which could limit its generalizability across diverse model architectures and tasks. In this work, we introduce AUTOPROBE, a novel model-agnostic approach that dynamically selects the most informative internal representations for code correctness assessment. AUTOPROBE employs an attention-based mechanism to learn importance scores for hidden states, enabling it to focus on the most relevant features. These weighted representations are then aggregated and passed to a probing classifier to predict code correctness across multiple dimensions, including compilability, functionality, and security. To evaluate the performance of AUTOPROBE, we conduct extensive experiments across multiple benchmarks and code LLMs. Our experimental results show that AUTOPROBE consistently outperforms the baselines. For security assessment, AUTOPROBE surpasses the state-of-the-art white-box approach by 18%. For compilability and functionality assessment, AUTOPROBE demonstrates its highest robustness to code complexity, with the performance higher than the other approaches by up to 19% and 111%, respectively. These findings highlight that dynamically selecting important internal signals enables AUTOPROBE to serve as a robust and generalizable solution for assessing the correctness of code generated by various LLMs.
- [16] arXiv:2510.02991 [pdf, html, other]
-
Title: Tracing and Metrics Design Patterns for Monitoring Cloud-native ApplicationsComments: Accepted for publication in the EuroPLoP 2025 proceedingsSubjects: Software Engineering (cs.SE)
Observability helps ensure the reliability and maintainability of cloud-native applications. As software architectures become increasingly distributed and subject to change, it becomes a greater challenge to diagnose system issues effectively, often having to deal with fragmented observability and more difficult root cause analysis. This paper builds upon our previous work and introduces three design patterns that address key challenges in monitoring cloud-native applications. Distributed Tracing improves visibility into request flows across services, aiding in latency analysis and root cause detection, Application Metrics provides a structured approach to instrumenting applications with meaningful performance indicators, enabling real-time monitoring and anomaly detection, and Infrastructure Metrics focuses on monitoring the environment in which the system is operated, helping teams assess resource utilization, scalability, and operational health. These patterns are derived from industry practices and observability frameworks and aim to offer guidance for software practitioners.
- [17] arXiv:2510.03005 [pdf, html, other]
-
Title: Patterns for Teaching Agile with Student Projects - Team and Project SetupComments: Accepted for publication in the EuroPLoP 2025 proceedingsSubjects: Software Engineering (cs.SE)
Higher education courses teaching about agile software development (ASD) have increased in commonality as the ideas behind the Agile Manifesto became more commonplace in the industry. However, a lot of the literature on how ASD is applied in the classroom does not provide much actionable advice, focusing on frameworks or even moving beyond the software development area into teaching in an agile way. We, therefore, showcase early work on a pattern language that focuses on teaching ASD practices to university students, which stems from our own experiences as educators in higher education contexts. We present five patterns, specifically focused on team and project setup phase: Capping Team Size, Smaller Project Scope, Business Non-Critical Project, Self-assembling Teams, and Team Chooses Topic as a starting point for developing the overall pattern language.
- [18] arXiv:2510.03029 [pdf, html, other]
-
Title: Investigating The Smells of LLM Generated CodeSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Context: Large Language Models (LLMs) are increasingly being used to generate program code. Much research has been reported on the functional correctness of generated code, but there is far less on code quality.
Objectives: In this study, we propose a scenario-based method of evaluating the quality of LLM-generated code to identify the weakest scenarios in which the quality of LLM generated code should be improved.
Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline formed from reference solutions of professionally written code. The test dataset is divided into various subsets according to the topics of the code and complexity of the coding tasks to represent different scenarios of using LLMs for code generation. We will also present an automated test system for this purpose and report experiments with the Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon.
Results: We find that LLM-generated code has a higher incidence of code smells compared to reference solutions. Falcon performed the least badly, with a smell increase of 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%) and finally Codex (84.97%). The average smell increase across all LLMs was 63.34%, comprising 73.35% for implementation smells and 21.42% for design smells. We also found that the increase in code smells is greater for more complex coding tasks and for more advanced topics, such as those involving object-orientated concepts.
Conclusion: In terms of code smells, LLM's performances on various coding task complexities and topics are highly correlated to the quality of human written code in the corresponding scenarios. However, the quality of LLM generated code is noticeably poorer than human written code. - [19] arXiv:2510.03050 [pdf, html, other]
-
Title: Refactoring Towards Microservices: Preparing the Ground for Service ExtractionComments: Accepted for publication in the EuroPLoP 2025 proceedingsSubjects: Software Engineering (cs.SE)
As organizations increasingly transition from monolithic systems to microservices, they aim to achieve higher availability, automatic scaling, simplified infrastructure management, enhanced collaboration, and streamlined deployments. However, this migration process remains largely manual and labour-intensive. While existing literature offers various strategies for decomposing monoliths, these approaches primarily focus on architecture-level guidance, often overlooking the code-level challenges and dependencies that developers must address during the migration. This article introduces a catalogue of seven refactorings specifically designed to support the transition to a microservices architecture with a focus on handling dependencies. The catalogue provides developers with a systematic guide that consolidates refactorings identified in the literature and addresses the critical gap in systematizing the process at the code level. By offering a structured, step-by-step approach, this work simplifies the migration process and lays the groundwork for its potential automation, empowering developers to implement these changes efficiently and effectively.
- [20] arXiv:2510.03071 [pdf, html, other]
-
Title: State Field Coverage: A Metric for Oracle QualitySubjects: Software Engineering (cs.SE)
The effectiveness of testing in uncovering software defects depends not only on the characteristics of the test inputs and how thoroughly they exercise the software, but also on the quality of the oracles used to determine whether the software behaves as expected. Therefore, assessing the quality of oracles is crucial to improve the overall effectiveness of the testing process. Existing metrics have been used for this purpose, but they either fail to provide a comprehensive basis for guiding oracle improvement, or they are tailored to specific types of oracles, thus limiting their generality.
In this paper, we introduce state field coverage, a novel metric for assessing oracle quality. This metric measures the proportion of an object's state, as statically defined by its class fields, that an oracle may access during test execution. The main intuition of our metric is that oracles with a higher state field coverage are more likely to detect faults in the software under analysis, as they inspect a larger portion of the object states to determine whether tests pass or not.
We implement a mechanism to statically compute the state field coverage metric. Being statically computed, the metric is efficient and provides direct guidance for improving test oracles by identifying state fields that remain unexamined. We evaluate state field coverage through experiments involving 273 representation invariants and 249,027 test assertions. The results show that state field coverage is a well-suited metric for assessing oracle quality, as it strongly correlates with the oracles' fault-detection ability, measured by mutation score. - [21] arXiv:2510.03178 [pdf, html, other]
-
Title: When Names Disappear: Revealing What LLMs Actually Understand About CodeSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs' code understanding and generalization.
- [22] arXiv:2510.03217 [pdf, html, other]
-
Title: Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program RepairJosé Cambronero, Michele Tufano, Sherry Shi, Renyao Wei, Grant Uy, Runxiang Cheng, Chin-Jung Liu, Shiying Pan, Satish Chandra, Pat RondonSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry, but ultimately agent-generated patches still need to be reviewed by a human before committing them to ensure they address the bug. Showing unlikely patches to developers can lead to substantial noise, wasting valuable developer time and eroding trust in automated code changes. We introduce two complementary LLM-based policies to reduce such noise: bug abstention and patch validation policies. Bug abstention excludes bugs that the agentic APR system is unlikely to fix. Patch validation rejects patches that are unlikely to be a good fix for the given bug. We evaluate both policies on three sets of bugs from Google's codebase, and their candidate patches generated by an internal agentic APR system. On a set of 174 human-reported bugs, removing bugs and patch trajectories rejected by our policies can raise success rates by up to 13 percentage points and 15 percentage points, respectively, and by up to 39 percentage points in combination. On null pointer exceptions and sanitizer-reported bugs with machine-generated bug reports, patch validation also improves average single-sample success rates. This two-policy approach provides a practical path to the reliable, industrial-scale deployment of agentic APR systems.
New submissions (showing 22 of 22 entries)
- [23] arXiv:2510.02572 (cross-list from cs.LG) [pdf, html, other]
-
Title: Geospatial Machine Learning LibrariesComments: Book chapterSubjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Recent advances in machine learning have been supported by the emergence of domain-specific software libraries, enabling streamlined workflows and increased reproducibility. For geospatial machine learning (GeoML), the availability of Earth observation data has outpaced the development of domain libraries to handle its unique challenges, such as varying spatial resolutions, spectral properties, temporal cadence, data coverage, coordinate systems, and file formats. This chapter presents a comprehensive overview of GeoML libraries, analyzing their evolution, core functionalities, and the current ecosystem. It also introduces popular GeoML libraries such as TorchGeo, eo-learn, and Raster Vision, detailing their architecture, supported data types, and integration with ML frameworks. Additionally, it discusses common methodologies for data preprocessing, spatial--temporal joins, benchmarking, and the use of pretrained models. Through a case study in crop type mapping, it demonstrates practical applications of these tools. Best practices in software design, licensing, and testing are highlighted, along with open challenges and future directions, particularly the rise of foundation models and the need for governance in open-source geospatial software. Our aim is to guide practitioners, developers, and researchers in navigating and contributing to the rapidly evolving GeoML landscape.
- [24] arXiv:2510.02810 (cross-list from cs.LG) [pdf, html, other]
-
Title: Dissecting Transformers: A CLEAR Perspective towards Green AISubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
The rapid adoption of Large Language Models (LLMs) has raised significant environmental concerns. Unlike the one-time cost of training, LLM inference occurs continuously at a global scale and now dominates the AI energy footprint. Yet, most sustainability studies report only coarse, model-level metrics due to the lack of fine-grained measurement methods, treating energy efficiency more as an afterthought than as a primary objective. We present the first fine-grained empirical analysis of inference energy across core components of transformer architecture. We propose a novel methodology, Component-Level Energy Assessment via Repeated sampling (CLEAR), to overcome temporal mismatch between microsecond scale component execution and monitoring of millisecond (ms) scale energy sensors. Using CLEAR, we evaluate 15 models spanning four distinct architecture types and consistently keep component-wise energy variance below 9.5\% while capturing more than 90\% of the model's total energy as individual components. Our empirical analysis reveals that Attention blocks consume significantly more energy per floating-point operation (FLOP), indicating that energy consumption is not proportionally aligned with FLOP counts. This shows that FLOPs alone fail to capture the true energy cost at a component level. Our findings establish detailed component-level energy baselines and provide insight as an initial step to build energy-efficient transformer models through component-level optimizations.
- [25] arXiv:2510.03078 (cross-list from cs.AI) [pdf, html, other]
-
Title: From Facts to Foils: Designing and Evaluating Counterfactual Explanations for Smart EnvironmentsComments: Accepted at Ex-ASE 2025, co-located with the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025)Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Explainability is increasingly seen as an essential feature of rule-based smart environments. While counterfactual explanations, which describe what could have been done differently to achieve a desired outcome, are a powerful tool in eXplainable AI (XAI), no established methods exist for generating them in these rule-based domains. In this paper, we present the first formalization and implementation of counterfactual explanations tailored to this domain. It is implemented as a plugin that extends an existing explanation engine for smart environments. We conducted a user study (N=17) to evaluate our generated counterfactuals against traditional causal explanations. The results show that user preference is highly contextual: causal explanations are favored for their linguistic simplicity and in time-pressured situations, while counterfactuals are preferred for their actionable content, particularly when a user wants to resolve a problem. Our work contributes a practical framework for a new type of explanation in smart environments and provides empirical evidence to guide the choice of when each explanation type is most effective.
Cross submissions (showing 3 of 3 entries)
- [26] arXiv:2406.15676 (replaced) [pdf, html, other]
-
Title: Inferring Pluggable Types with Machine LearningSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Pluggable type systems allow programmers to extend the type system of a programming language to enforce semantic properties defined by the programmer. Pluggable type systems are difficult to deploy in legacy codebases because they require programmers to write type annotations manually. This paper investigates how to use machine learning to infer type qualifiers automatically. We propose a novel representation, NaP-AST, that encodes minimal dataflow hints for the effective inference of type qualifiers. We evaluate several model architectures for inferring type qualifiers, including Graph Transformer Network, Graph Convolutional Network and Large Language Model. We further validated these models by applying them to 12 open-source programs from a prior evaluation of the NullAway pluggable typechecker, lowering warnings in all but one unannotated project. We discovered that GTN shows the best performance, with a recall of .89 and precision of 0.6. Furthermore, we conduct a study to estimate the number of Java classes needed for good performance of the trained model. For our feasibility study, performance improved around 16k classes, and deteriorated due to overfitting around 22k classes.
- [27] arXiv:2502.18525 (replaced) [pdf, html, other]
-
Title: Programming with Pixels: Can Computer-Use Agents do Software Engineering?Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Computer-use agents (CUAs) hold the promise of performing a wide variety of general tasks, but current evaluations have primarily focused on simple scenarios. It therefore remains unclear whether such generalist agents can automate more sophisticated and specialized work such as software engineering (SWE). To investigate this, we introduce $\texttt{Programming with Pixels}$ (PwP), the first comprehensive computer-use environment for software engineering, where agents visually control an IDE to perform diverse software engineering tasks. To enable holistic evaluation, we also introduce \texttt{PwP-Bench}, a benchmark of 15 existing and new software-engineering tasks spanning multiple modalities, programming languages, and skillsets. We perform an extensive evaluation of state-of-the-art open-weight and closed-weight CUAs and find that when interacting purely visually, they perform significantly worse than specialized coding agents. However, when the same CUAs are given direct access to just two APIs-file editing and bash operations-performance jumps, often reaching the levels of specialized agents despite having a task-agnostic design. Furthermore, when given access to additional IDE tools via text APIs, all models show further gains. Our analysis shows that current CUAs fall short mainly due to limited visual grounding and the inability to take full advantage of the rich environment, leaving clear room for future this http URL establishes software engineering as a natural domain for benchmarking whether generalist computer-use agents can reach specialist-level performance on sophisticated tasks. Code and data released at this https URL
- [28] arXiv:2503.04076 (replaced) [pdf, html, other]
-
Title: Unmasking the Genuine Type Inference Capabilities of LLMs for Java Code SnippetsComments: under reviewSubjects: Software Engineering (cs.SE)
Type inference is crucial for reusing online code snippets. Although snippets are prevalently shared on platforms like StackOverflow, they often lack essential type information, such as fully qualified names (FQNs). Recent studies have leveraged Large Language Models (LLMs) to perform type inference for such code snippets, showing promising results. However, these results may suffer from data leakage, as the benchmark, StatType-SO, used for evaluation has been publicly available on GitHub since 2017. Consequently, it remains uncertain whether the strong performance of LLMs reflects genuine semantic understanding of code or is due to the ground truth being included in the training set.
This paper strives to comprehensively evaluate the genuine type inference capabilities of LLMs on Java code snippets and identify potential limitations of LLMs. First, we created ThaliaType, a new, previously unreleased benchmark suite designed for type inference evaluation. Second, using the StarCoder2 LLM as baseline, we uncovered data leakage from StatType-SO in StarCoder2's open-source training set and observed that other state-of-the-art LLMs exhibit similar performance drops when evaluated on ThaliaType, with precision decreasing by up to 59% and recall by up to 72%. Finally, we designed semantic-preserving code transformations to test the capabilities of LLMs in understanding the execution semantics of snippets. Results showed that LLMs' performance on StatType-SO is far less robust to these transformations than on ThaliaType, suggesting that the performance on StatType-SO may be biased by data leakage and have limited generalizability.
These findings highlight the importance of carefully designed, leakage-free benchmarks for evaluating LLMs on type inference tasks. We recommend future studies adopt ThaliaType for rigorous and reliable assessments of LLMs' genuine type inference capabilities. - [29] arXiv:2503.21710 (replaced) [pdf, html, other]
-
Title: Enhancing repository-level software repair via repository-aware knowledge graphsSubjects: Software Engineering (cs.SE)
Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches. Existing approaches, which primarily rely on large language models (LLMs), are hindered by semantic ambiguities, limited understanding of structural context, and insufficient reasoning capabilities. To address these limitations, we propose KGCompass with two innovations: (1) a novel repository-aware knowledge graph (KG) that accurately links repository artifacts (issues and pull requests) and codebase entities (files, classes, and functions), allowing us to effectively narrow down the vast search space to only 20 most relevant functions with accurate candidate fault locations and contextual information, and (2) a path-guided repair mechanism that leverages KG-mined entity paths, tracing through which allows us to augment LLMs with relevant contextual information to generate precise patches along with their explanations. Experimental results in the SWE-bench Lite demonstrate that KGCompass achieves state-of-the-art single-LLM repair performance (58.3%) and function-level fault location accuracy (56.0%) across open-source approaches with a single repair model, costing only $0.2 per repair. Among the bugs that KGCompass successfully localizes, 89.7% lack explicit location hints in the issue and are found only through multi-hop graph traversal, where pure LLMs struggle to locate bugs accurately. Relative to pure-LLM baselines, KGCompass lifts the resolved rate by 50.8% on Claude-4 Sonnet, 30.2% on Claude-3.5 Sonnet, 115.7% on DeepSeek-V3, and 156.4% on Qwen2.5 Max. These consistent improvements demonstrate that this graph-guided repair framework delivers model-agnostic, cost-efficient repair and sets a strong new baseline for repository-level repair.
- [30] arXiv:2506.15655 (replaced) [pdf, html, other]
-
Title: cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax TreeSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve actuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (\ourwork), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
- [31] arXiv:2506.24015 (replaced) [pdf, html, other]
-
Title: Hierarchical Knowledge Injection for Improving LLM-based Program RepairComments: Accepted at IEEE/ACM Automated Software Engineering (ASE) 2025 ConferenceSubjects: Software Engineering (cs.SE)
Prompting LLMs with bug-related context (e.g., error messages, stack traces) improves automated program repair, but many bugs still remain unresolved. In real-world projects, developers often rely on broader repository and project-level context beyond the local code to resolve such bugs. In this paper, we investigate how automatically extracting and providing such knowledge can improve LLM-based program repair. We propose a layered knowledge injection framework that incrementally augments LLMs with structured context. It starts with the Bug Knowledge Layer, which includes information such as the buggy function and failing tests; expands to the Repository Knowledge Layer, which adds structural dependencies, related files, and commit history; and finally injects the Project Knowledge Layer, which incorporates relevant details from documentation and previously fixed bugs. We evaluate this framework on a dataset of 314 bugs from BugsInPy using two LLMs (Llama 3.3 and GPT-4o-mini), and analyze fix rates across six bug types. By progressively injecting knowledge across layers, our approach achieves a fix rate of 79% (250/314) using Llama 3.3, a significant improvement of 23% over previous work. All bug types show improvement with the addition of repository-level context, while only a subset benefit further from project-level knowledge, highlighting that different bug types require different levels of contextual information for effective repair. We also analyze the remaining unresolved bugs and find that more complex and structurally isolated bugs, such as Program Anomaly and GUI bugs, remain difficult even after injecting all available information. Our results show that layered context injection improves program repair and suggest the need for interactive and adaptive APR systems.
- [32] arXiv:2508.01443 (replaced) [pdf, html, other]
-
Title: Tuning LLM-based Code Optimization via Meta-Prompting: An Industrial PerspectiveJingzhi Gong, Rafail Giavrimis, Paul Brookes, Vardan Voskanyan, Fan Wu, Mari Ashiga, Matthew Truscott, Mike Basios, Leslie Kanthan, Jie Xu, Zheng WangComments: Accepted by ASE'25 Industry ShowcaseSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
There is a growing interest in leveraging multiple large language models (LLMs) for automated code optimization. However, industrial platforms deploying multiple LLMs face a critical challenge: prompts optimized for one LLM often fail with others, requiring expensive model-specific prompt engineering. This cross-model prompt engineering bottleneck severely limits the practical deployment of multi-LLM systems in production environments. We introduce Meta-Prompted Code Optimization (MPCO), a framework that automatically generates high-quality, task-specific prompts across diverse LLMs while maintaining industrial efficiency requirements. MPCO leverages metaprompting to dynamically synthesize context-aware optimization prompts by integrating project metadata, task requirements, and LLM-specific contexts. It is an essential part of the ARTEMIS code optimization platform for automated validation and scaling. Our comprehensive evaluation on five real-world codebases with 366 hours of runtime benchmarking demonstrates MPCO's effectiveness: it achieves overall performance improvements up to 19.06% with the best statistical rank across all systems compared to baseline methods. Analysis shows that 96% of the top-performing optimizations stem from meaningful edits. Through systematic ablation studies and meta-prompter sensitivity analysis, we identify that comprehensive context integration is essential for effective meta-prompting and that major LLMs can serve effectively as meta-prompters, providing actionable insights for industrial practitioners.
- [33] arXiv:2508.20086 (replaced) [pdf, html, other]
-
Title: Smart Contract Intent Detection with Pre-trained Programming Language ModelComments: 11 pages, 5 figures, conferenceSubjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
Malicious developer intents in smart contracts constitute significant security threats to decentralized applications, leading to substantial economic losses. To address this, SmartIntentNN was previously introduced as a deep learning model for detecting unsafe developer intents. By combining the Universal Sentence Encoder, a K-means clustering-based intent highlighting mechanism, and a Bidirectional Long Short-Term Memory (BiLSTM) network, the model achieved an F1 score of 0.8633 on an evaluation set of 10,000 real-world smart contracts across ten distinct intent categories.
In this study, we present an enhanced version of this model, SmartIntentNN2 (Smart Contract Intent Neural Network V2). The primary enhancement is the integration of a BERT-based pre-trained programming language model, which we domain-adaptively pre-train on a dataset of 16,000 real-world smart contracts using a Masked Language Modeling objective. SmartIntentNN2 retains the BiLSTM-based multi-label classification network for intent detection. On the same evaluation set of 10,000 smart contracts, SmartIntentNN2 achieves superior performance with an accuracy of 0.9789, precision of 0.9090, recall of 0.9476, and an F1 score of 0.9279, substantially outperforming its predecessor and other baseline models. Notably, SmartIntentNN2 also delivers a 65.5% relative improvement in F1 score over GPT-4.1 on this specialized task. These results establish SmartIntentNN2 as a new state-of-the-art model for smart contract intent detection. - [34] arXiv:2510.00476 (replaced) [pdf, html, other]
-
Title: Analyzing Latent Concepts in Code Language ModelsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Interpreting the internal behavior of large language models trained on code remains a critical challenge, particularly for applications demanding trust, transparency, and semantic robustness. We propose Code Concept Analysis (CoCoA): a global post-hoc interpretability framework that uncovers emergent lexical, syntactic, and semantic structures in a code language model's representation space by clustering contextualized token embeddings into human-interpretable concept groups. We propose a hybrid annotation pipeline that combines static analysis tool-based syntactic alignment with prompt-engineered large language models (LLMs), enabling scalable labeling of latent concepts across abstraction levels. We analyse the distribution of concepts across layers and across three finetuning tasks. Emergent concept clusters can help identify unexpected latent interactions and be used to identify trends and biases within the model's learned representations. We further integrate LCA with local attribution methods to produce concept-grounded explanations, improving the coherence and interpretability of token-level saliency. Empirical evaluations across multiple models and tasks show that LCA discovers concepts that remain stable under semantic-preserving perturbations (average Cluster Sensitivity Index, CSI = 0.288) and evolve predictably with fine-tuning. In a user study on the programming-language classification task, concept-augmented explanations disambiguated token roles and improved human-centric explainability by 37 percentage points compared with token-level attributions using Integrated Gradients.
- [35] arXiv:2406.07714 (replaced) [pdf, html, other]
-
Title: LLAMAFUZZ: Large Language Model Enhanced Greybox FuzzingSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Greybox fuzzing has achieved success in revealing bugs and vulnerabilities in programs. However, randomized mutation strategies have limited the fuzzer's performance on structured data. Specialized fuzzers can handle complex structured data, but require additional efforts in grammar and suffer from low throughput.
In this paper, we explore the potential of utilizing the Large Language Model to enhance greybox fuzzing for structured data. We utilize the pre-trained knowledge of LLM about data conversion and format to generate new valid inputs. We further fine-tuned it with paired mutation seeds to learn structured format and mutation strategies effectively. Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLM to understand and mutate structured data to fuzzing. We conduct experiments on the standard bug-based benchmark Magma and a wide variety of real-world programs. LLAMAFUZZ outperforms our top competitor by 41 bugs on average. We also identified 47 unique bugs across all trials. Moreover, LLAMAFUZZ demonstrated consistent performance on both bug trigger and bug reached. Compared to AFL++, LLAMAFUZZ achieved 27.19% more branches in real-world program sets on average. We also demonstrate a case study to explain how LLMs enhance the fuzzing process in terms of code coverage. - [36] arXiv:2502.18044 (replaced) [pdf, html, other]
-
Title: S-Graphs 2.0 -- A Hierarchical-Semantic Optimization and Loop Closure for SLAMComments: 8 pages, 9 figures, Accepted in IEEE RA-L September 2025Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
The hierarchical structure of 3D scene graphs shows a high relevance for representations purposes, as it fits common patterns from man-made environments. But, additionally, the semantic and geometric information in such hierarchical representations could be leveraged to speed up the optimization and management of map elements and robot poses.
In this direction, we present our work Situational Graphs 2.0 (S-Graphs 2.0), which leverages the hierarchical structure of indoor scenes for efficient data management and optimization. Our algorithm begins by constructing a situational graph that represents the environment into four layers: Keyframes, Walls, Rooms, and Floors. Our first novelty lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers. Floor-level semantics allows us to propose a floor-based loop closure strategy, that effectively rejects false positive closures that typically appear due to aliasing between different floors of a building. Our second novelty lies in leveraging our representation hierarchy in the optimization. Our proposal consists of: (1) local optimization over a window of recent keyframes and their connected components across the four representation layers, (2) floor-level global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-level local optimization, marginalizing redundant keyframes that share observations within the room, which reduces the computational footprint. We validate our algorithm extensively in different real multi-floor environments. Our approach shows state-of-art-art accuracy metrics in large-scale multi-floor environments, estimating hierarchical representations up to 10x faster, in average, than competing baselines - [37] arXiv:2508.05865 (replaced) [pdf, html, other]
-
Title: Secure and Scalable Blockchain Voting: A Comparative Framework and the Role of Large Language ModelsComments: 9 pages, 8 figures, 1 tableSubjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Blockchain technology offers a promising foundation for modernizing E-Voting systems by enhancing transparency, decentralization, and security. Yet, real-world adoption remains limited due to persistent challenges such as scalability constraints, high computational demands, and complex privacy requirements. This paper presents a comparative framework for analyzing blockchain-based E-Voting architectures, consensus mechanisms, and cryptographic protocols. We examine the limitations of prevalent models like Proof of Work, Proof of Stake, and Delegated Proof of Stake, and propose optimization strategies that include hybrid consensus, lightweight cryptography, and decentralized identity management. Additionally, we explore the novel role of Large Language Models (LLMs) in smart contract generation, anomaly detection, and user interaction. Our findings offer a foundation for designing secure, scalable, and intelligent blockchain-based E-Voting systems suitable for national-scale deployment. This work lays the groundwork for building an end-to-end blockchain E-Voting prototype enhanced by LLM-guided smart contract generation and validation, supported by a systematic framework and simulation-based analysis.
- [38] arXiv:2509.02372 (replaced) [pdf, html, other]
-
Title: Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMsSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.