Towards an optimal debugging library framework

This article is intended as an overview of software-based debugging techniques and as motivation for a uniform execution representation and setup to efficiently mix and match the appropriate techniques for system-level debugging, with a focus on languages with statically optimizing compilers to keep complexity and scope limited. The reader may notice several documented deficits across platforms and tooling in documentation or functionality, which will hopefully be improved. The author accepts the irony of such statements, given “C having no ABI”/many systems in practice having no stable or formally specified ABI, but reality is simplified in this text for brevity and sanity.

Section 1 (theory) feels complete aside from simulation and hardware/software replacement techniques, and contains good first drafts for the notions of bug, debugging and the debugging process. Section 2 (practical) is tailored towards non-micro kernels based on the process abstraction, but is currently missing content and scalability numbers for tooling. The idea is to provide understanding and numbers to estimate, for system design, (1) whether formal proof of correctness is feasible and for what parts, and (2) which problems and methods are applicable for dynamic system analysis. Section 3 (future) will be about speculative and more advanced ideas, which should be feasible based on those numbers. It is planned to cover how to design systems for rewriting and debugging using formal methods, compilers and code synthesis.

Theory of debugging

A (software) system can be represented as an (often non-deterministic) state machine, such that a bug is a bad transition rule between those states. It is usually assumed that the developer/user knows correct and incorrect (bad) system states and that the code represents a somewhat correct model of the intended semantics. An execution witness is then the set of states and state transitions encountered on a specific program run. If the execution witness contains a “bad state”, then there must be a bug. Thus a debugger can be seen as a query engine over the states and transitions of a buggy execution witness.
In simpler terms, debugging is neither preventing bugs nor removing them, but locating the origin of the incorrect state.
Frequent operations include isolating the bug source to deterministic components, where encapsulation of non-determinism usually simplifies the process. In contrast, concurrent code is tricky to debug, because one needs to trace multiple execution flows to estimate where the incorrect state originated.
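As a toy illustration of this view (all names here are hypothetical, chosen for illustration, not a proposed API): the system below is a state machine with one bad transition rule, the execution witness is recorded, and the “debugger” is a query over that witness for the first transition into a known-bad state.

```c
// Toy model: a debugger as a query engine over an execution witness.
// transition() contains the "bug": the rule 3 -> 7 enters a bad state.
#include <stdio.h>

#define STEPS 8

static int transition(int s) { return s == 3 ? 7 : s + 1; } // buggy rule: 3 -> 7

static int is_bad(int s) { return s == 7; } // assumed-known bad state

int main(void) {
    int witness[STEPS + 1];
    witness[0] = 0;
    for (int i = 0; i < STEPS; i++)          // record the execution witness
        witness[i + 1] = transition(witness[i]);
    for (int i = 0; i < STEPS; i++)          // query: first bad transition
        if (!is_bad(witness[i]) && is_bad(witness[i + 1])) {
            printf("bad transition at step %d: %d -> %d\n",
                   i, witness[i], witness[i + 1]);
            return 1;
        }
    puts("no bad transition observed");
    return 0;
}
```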

The process of debugging means using static and dynamic (software) system analysis, and its automation and adaptation, to speed up the elimination of bugs (or bug classes) in the target systems (or classes of target systems).

One can generally categorize methods into the following list [automate, simplify, observe, understand, learn] (asoul)

  • automate the process to minimize errors/oversights during debugging, to guard against probabilistic errors, to document the process, etc
  • simplify and isolate system components and changes over time
  • observe the system while running it to trace state or state changes
  • understand the expected and actual code semantics to the degree necessary
  • learn, extend and ensure which system invariants are necessary and how they are satisfied by the involved systems, for example user-space processes, kernel, build system, compiler, source code, linker, object code, assembly, hardware etc

with the fundamental constraints being [finding, eensuring, limited] (feel)

  • finding out correct system components semantics
  • eensuring deterministic reproducibility of the problem
  • limited time and effort

Common static and dynamic (software) system analysis methods to run the system to feel a soul for the purpose of eliminating the bug (classes) are:

  • Specification meaning to “compare/get/write the details”, possibly formally, possibly for (software) system synthesis.
  • Formal Verification as ahead-of-time or compile-time invariant resolving. May be superfluous given (software) system synthesis based on Specification, or unfeasible due to complexity or a non-formal specification.
  • Validation as runtime invariant checks. Sanitizers as compiler runtime checks are common tools.
  • Testing as sample based runtime invariant checks. Coverage based fuzzers are common tools.
  • Stepping via a “classical debugger” to manipulate task execution context and memory, optionally with source code location translation, via REPL commands, graphically, via scripting or (rarely) freely programmable.
  • Logging as dumping (a simplification of) state with context for bugs (usually with timestamps in production systems).
  • Tracing as dumping (a simplification of) runtime behavior via temporal relations (usually timestamps). Can be immediate or sampled.
  • Recording as encoded dumping of a program run to replay it with the degree of time and state determinism specified beforehand.
  • Scheduling meaning logical or time-relation-based scheduling of processes or threads. Typical use cases are Undo’s “thread fuzzing”, rr’s “chaos mode”, using the kernel scheduler API or bounded model checking.
  • Reversal computing meaning to execute some code in reverse to (partially) reset the system to a previous state without recording and replaying. Typically used in simulations and for pure logic functionality of languages; it corresponds to applying the inverse of a bijective function (see the sketch after this list).
  • Time-reversal computing meaning Reversal computing with tracked time. Mostly used in simulations, because (if used) the source-code-to-assembly relation and (assembly) instruction timing must be fixed and known.
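As a minimal sketch of Reversal computing under the bijective-function view above (the transform is illustrative, not taken from any particular tool): each forward step is invertible, so an earlier state can be recovered by reverse execution instead of by recording.

```c
// Sketch: a bijective state transform and its exact inverse, so prior
// states are recoverable by reverse execution rather than by recording.
#include <assert.h>
#include <stdint.h>

static uint32_t step(uint32_t s) {   // bijective on uint32_t
    s += 0x9E3779B9u;                // addition mod 2^32 is invertible
    s ^= s >> 16;                    // x ^= x >> 16 is its own inverse on 32 bits
    return s;
}

static uint32_t unstep(uint32_t s) { // inverse operations in reverse order
    s ^= s >> 16;
    s -= 0x9E3779B9u;
    return s;
}

int main(void) {
    uint32_t s = 42;
    uint32_t two_forward = step(step(s));
    assert(unstep(unstep(two_forward)) == s); // two reverse steps recover s
    return 0;
}
```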

The core choices for what software system to run, based on the code and its semantics, are then typically a mix of

  • Machine code execution on the actual hardware to get hardware and timing behavior.
  • Simulation as partial or full execution on a simplified, imitative representation of the target hardware to get information for the simplified model.
  • Virtualisation as isolation or simplification of a hardware- or software subsystem to reduce system complexity.

Further, isolation and simplification are typically applied to all potential sub-components including, but not limited to, hardware, code versioning including dependencies, source system, compiler framework and target system. Methods are usually

  • Bisection via git or the actual binaries.
  • Reduction via removal of system parts or trying to reproduce with (a minimal) example.
  • Statistical analysis from collected data on how the problem manifests on given environment(s) etc.

Debugging is domain- and design-specific and relies on core component(s) of the system to be debugged to provide the necessary debug functionality. For example, software-based hardware debugging relies on interfaces to the hardware like JTAG, kernel debugging relies on kernel compilation or configuration and elevated (user) permissions, and user-space debugging relies on process and user permissions, system configuration or on the debuggee being a child process, via ptrace on Posix systems.
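A minimal sketch of the Posix child-process case (Linux-flavored ptrace; constants and stop behavior vary across Posix systems): the child requests tracing, the parent observes the exec stop and single-steps once.

```c
// Minimal ptrace sketch (Linux): parent traces a child it spawned.
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); // child: let parent trace us
        execlp("ls", "ls", (char *)NULL);      // stops with SIGTRAP at execve
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);                   // wait for the exec stop
    printf("child %d stopped at exec\n", pid);
    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL); // execute one instruction
    waitpid(pid, &status, 0);
    ptrace(PTRACE_CONT, pid, NULL, NULL);       // run to completion
    waitpid(pid, &status, 0);
    return 0;
}
```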

Without costly hardware devices to trace, and physical access to the computing unit for exact recording of the system behavior including time information, dynamic (software) system analysis (running the system) requires trade-offs on which program parts and aspects to inspect and collect data from. Therefore it depends on many factors, for example bug classes and target systems, to what degree the process of debugging can and should be automated or optimized.

Practical methods with trade-offs

Depending on the domain and environment, problematic behavior of hardware or software components must be (more or less) (1) avoided or (2) traceable, and various (domain) metrics exist as decision helpers. Very well designed systems explain to users how to debug functional behavior and time behavior, with internal and external system resources, up to the degree to which system usage and task execution correctness is intended. Access restrictions limit or rule out stepping, whereas storage limitations limit or rule out logging, tracing and recording.

Formal methods, Specification, (software) system synthesis and Formal Verification

(Highly) safety-critical systems or hardware are typically created from a formal Specification by (software) system synthesis or, when (full) synthesis is unfeasible, the implementations are formally verified. To my knowledge, no standards exist for (highly) security-critical systems that require formal Specification and Formal Verification or synthesis (2025-05-16).

For non-safety- or non-security-critical (sub)systems or hardware, semantics are usually not “set in stone”, so Formal Verification or (software) system synthesis is rarely an option. Formal models and (semi-)formal specifications are, however, commonly used for design, planning, testing, review and validation of fail-safe or core (software) system functionality.

Typically used models for C, C++, Zig and compiler backends are Integer Arithmetic, Modular Arithmetic and Saturation Arithmetic for integers, and Floating-Point Arithmetic (with possible rough edge cases like signaling NaN propagation) and Fixed-Point Arithmetic for real numbers. (Simplified) instances of Separation Logic may be used to model and check pointers and resources; for example, Safe Rust uses separation logic with lifetime inference and user annotations, based on strict aliasing of Unsafe Rust.
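To make the integer models concrete, here is a small sketch contrasting modular and saturating addition (assuming a GCC/Clang-style compiler for __builtin_add_overflow; the helper names are made up):

```c
// Modular vs. saturating addition on uint8_t.
#include <stdint.h>
#include <stdio.h>

static uint8_t add_modular(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);      // wraps mod 256; always defined for unsigned
}

static uint8_t add_saturating(uint8_t a, uint8_t b) {
    uint8_t r;
    if (__builtin_add_overflow(a, b, &r))
        return UINT8_MAX;         // clamp to the maximum instead of wrapping
    return r;
}

int main(void) {
    printf("modular:    250 + 10 = %u\n", add_modular(250, 10));    // 4
    printf("saturating: 250 + 10 = %u\n", add_saturating(250, 10)); // 255
    return 0;
}
```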

Typical relevant unsolved or incomplete models for compilers are

  1. hardware semantics, specifically around timing behavior and (if used) weak memory
  2. memory synchronization semantics for weak memory systems, with the ideas and suggested model from “Relaxed Memory Concurrency Re-executed” looking promising
  3. SIMD with specifically floating point NaN propagation
  4. pointer semantics, specifically in object code (initialization), serialization and deserialization, construction, and optimizations on pointers with arithmetic and tagging (see the sketch after these lists)
  5. constant-time code semantics, for example how to ensure data stays in the L1/L2 cache and operations take constant time
  6. ABI semantics, since specifications are not formal

and typical problems more related to platforms like Kernels are

  1. resource (tracking) semantics, for example how to track resources in a process group
  2. security semantics, for example how to model process group permissions.
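As a small illustration of why pointer semantics (item 4 above) are hard to pin down, here is a hypothetical low-bit pointer-tagging sketch; it relies on allocation alignment and pointer/integer round-trips whose provenance semantics current formal models handle poorly:

```c
// Low-bit pointer tagging: assumes malloc results are at least 8-byte
// aligned, so the low 3 bits are free to hold a tag.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_MASK ((uintptr_t)0x7)

static void *tag_ptr(void *p, unsigned tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
    return (void *)((uintptr_t)p | tag);   // ptr -> int -> ptr round-trip
}

static void *untag_ptr(void *p) {
    return (void *)((uintptr_t)p & ~TAG_MASK);
}

static unsigned get_tag(void *p) {
    return (unsigned)((uintptr_t)p & TAG_MASK);
}

int main(void) {
    long *x = malloc(sizeof *x);
    *x = 42;
    void *tagged = tag_ptr(x, 3);
    printf("tag=%u value=%ld\n", get_tag(tagged), *(long *)untag_ptr(tagged));
    free(x);
    return 0;
}
```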

For Validation, Sanitizers are typically the most efficient and simplest debugging tools for C and C++, whereas Zig implements them, except for the thread sanitizer, as allocator configuration and safety modes. Compile-time instrumented sanitizers have a 2x-4x slowdown, versus dynamic ones with a 20x-50x slowdown.

| Nr | Clang usage | Zig usage | Memory | Runtime | Comments |
|----|-------------|-----------|--------|---------|----------|
| 1 | -fsanitize=address | alloc + safety | 1x (3x stack) | 2x | Clang: 16+ TB of virt mem |
| 2 | -fsanitize=leak | allocator | 1x | 1x | on exit ?x? more mem+time |
| 3 | -fsanitize=memory | unimplemented | 2-3x | 3x | |
| 4 | -fsanitize=thread | -fsanitize=thread | 5-10x + 1MB/thread | 5-15x | Clang: ?x? (“lots of”) virt mem |
| 5 | -fsanitize=type | unimplemented | ? | ? | not enough data |
| 6 | -fsanitize=undefined | safety mode | 1x | ~1x | |
| 7 | -fsanitize=dataflow | unimplemented | 1-2x? | 1-4x? | wip, get variable dependencies |
| 8 | -fsanitize=memtag | unimplemented | ~1.0Yx? | ~1.0Yx? | wip, address cheri-like ptr tagging |
| 9 | -fsanitize=cfi | unimplemented | 1x | ~1x | forward edge ctrl flow protection |
| 10 | -fsanitize=safe-stack | unimplemented | 1x | ~1x | backward edge ctrl flow protection |
| 11 | -fsanitize=shadow-call-stack | unimplemented | 1x | ~1x | backward edge ctrl flow protection |

Sanitizers 1-6 are recommended for testing purposes and 7-11 for production by LLVM. Memory and slowdown numbers are only reported for LLVM sanitizers; Zig does not report its own numbers yet (2025-01-11). Slowdown for dynamic sanitizer versions increases by a factor of 10x compared to the listed static usage costs. The leak sanitizer only checks for memory leaks, not other system resources. Besides various kernel-specific tools to track system resources, Valgrind can be used on Posix systems for non-memory resources and Application Verifier on Windows. Address and thread sanitizers cannot be combined in Clang, and combined usage of the Zig implementation is limited by virtual memory usage. In Zig, aliasing can currently not be sanitized against, whereas Clang can only sanitize type-based aliasing, with no numbers reported by LLVM yet.
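A minimal example of the kind of bug the address sanitizer row above refers to (file name and flags illustrative; the Zig equivalent is covered by the -ODebug/-OReleaseSafe safety checks):

```c
// oob.c - heap out-of-bounds read, caught by AddressSanitizer:
//   clang -g -fsanitize=address oob.c && ./a.out
#include <stdlib.h>

int main(void) {
    int *a = malloc(4 * sizeof *a);
    int v = a[4];   // heap-buffer-overflow: one element past the allocation
    free(a);
    return v;
}
```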

Besides adjusting source code semantics via (1) sanitizers, one can (2) make one’s own dynamic source code adjustments or (3) use tooling based on kernel APIs to trace and optionally (3.1) run-time check the traced information or (3.2) run-time check kernel APIs and the underlying state. Kernels may further simplify access to information; for example, the proc file system simplifies access to process information.
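For instance, a process can inspect its own kernel-maintained state through the proc file system (Linux-specific sketch; the field selection is illustrative):

```c
// Print the memory-related fields of /proc/self/status (Linux).
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "Vm", 2) == 0)  // VmPeak, VmSize, VmRSS, ...
            fputs(line, stdout);
    fclose(f);
    return 0;
}
```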

Testing is very context- and use-case-dependent, with typical separations being pure/impure, time-invariant/time-variant, accurate/approximate and hardware/software (sub)system separation, ranging from simple unit tests up to integration and end-to-end tests based on statistical/probability analysis and system intuition about deterministic expected behavior derived from explicit or implicit requirements. TODO tools, hardware, software, mixed hw/sw examples
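A minimal sketch of the pure, deterministic end of that spectrum (function and properties invented for illustration): an example-based unit test plus a sampled property check.

```c
// Unit test and property-style check for a pure function.
#include <assert.h>
#include <stdint.h>

static uint32_t clamp_add(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    return r < a ? UINT32_MAX : r;   // saturate instead of wrapping
}

int main(void) {
    // example-based tests
    assert(clamp_add(1, 2) == 3);
    assert(clamp_add(UINT32_MAX, 1) == UINT32_MAX);
    // sampled property: the result is never smaller than either operand
    for (uint32_t i = 0; i < 4096; i++) {
        uint32_t a = i * 2654435761u;    // cheap pseudo-random samples
        uint32_t b = i * 40503u + 12345u;
        uint32_t r = clamp_add(a, b);
        assert(r >= a && r >= b);
    }
    return 0;
}
```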

Stepping

  • TODO time costs, sync options, etc

Logging

  • TODO

Tracing

  • TODO
    • “Debugging And Profiling .NET Core Apps on Linux”
    • https://github.com/goldshtn/linux-tracing-workshop
    • CPU sampling: Linux perf, bcc; Windows ETW; macOS Instruments, dtrace
    • dynamic tracing: Linux perf, systemtap, bcc; Windows nothing; macOS dtrace
    • static tracing: Linux LTTng; Windows ETW; macOS nothing
    • dump generation: Linux core_pattern, gcore; Windows procdump, WER; macOS kern.corefile, gcore
    • dump analysis: Linux gdb, lldb; Windows Visual Studio, windbg; macOS gdb, lldb
    • lwn.net: “Unifying kernel tracing”
    • babeltrace https://babeltrace.org/
    • There are no “works for all kernels” and “trace specific (group of) processes” solutions, so one has to write specific queries to constrain what data should be collected.
    • For low latency overhead analysis, dtrace or inspired systems like bpftrace, bcc and systemtap can be used.
    • ETW allows complete user-space captures.
    • Most related solutions use dtrace or … (TODO)
    • TODO: list standard kernel tracing tooling; focus on dtrace and the drawback of no “works for all kernels”/“trace processes” solution; standard tooling for checking traced information; tracers: dtrace, bpftrace, bcc, systemtap, ETW, darwin/macos?, other posix tools?; memory/runtime/latency overhead etc

Recording

  • TODO requirements: eliminate non-deterministic choices for replaying, others

Scheduling

  • TODO requirements: simplification methods, practicality

Reversal computing

  • TODO how and when to write bijective code to simplify debugging

Time-reversal computing

  • TODO use cases

The following is a list of typical problems with simple solution tactics. To keep the analysis simple, no virtual machine/emulator or simulation approaches are given.

  1. Hard(ware) problems Hardware design reviews with extensive focus on core components (power, battery, periphery, busses, memory/flash and debug/test infrastructure) to enable debugging and component tests against production and assembly defects are fundamental for software debugging, under the assumption that computing unit(s) and memory unit(s) can be trusted to work reliably enough. Depending on the goals, time channel analysis, formal methods to rule out logic errors, and fuzzing against bad temporal behavior (for example during speculative execution) are common methods besides various testing strategies based on statistical analysis.
  2. Kernel and platform problems The managing environment the code runs on can vary a lot. As an example, the typical four phases of the Linux boot process (system startup, bootloader stage, kernel stage and init process) each have their own debugging infrastructure and methods. Generally, working with (introspection-restricted) platforms requires 1. reverse engineering and "trying to find info" and/or 2. "using some tracing tool" and 3. for open source, "adjusting the source and staring at kernel dumps/using a debugger". Kernels are rarely designed for tracing, recording or formal verification due to internal complexity, and virtualisation is slow and hides many classes of synchronization bugs. Being complex, moving targets with no library design, design flaws and many performance tradeoffs, kernels are hard to fuzz test.
    Non-micro kernels use the process as the fundamental abstraction model, but no widely used kernel has a formal process group model. Since there is no API for a child process to communicate initialization status to its parent, and since signaling is racy, testing process groups is error-prone and a reliable testing setup is complex. Race conditions during read and write access to kernel objects like shared memory or file access (flush semantics) are the most common problems, besides general reliance on the timing of actions without handling the timeout path or resource over-saturation path (for example on high system load).
  3. Detectable Undefined Behavior clang -Werror -Weverything -fsanitize="undefined,type", zig -OReleaseSafe, zig -ODebug (see the overflow sketch after this list).
  4. Undetectable Undefined Behavior Staring at source code, at backend intermediate representation like LLVM IR, or at the resulting assembly, and reducing the problem. Unfortunately, backend optimizers like LLVM do not offer frontend language writers debug APIs and related tooling, as they were not designed for that purpose, so one has to manually invoke the optimizations to reproduce the problem. A bespoke debug API would allow recording, replaying and tracing the IR of each optimization step, ideally via reversal computing for optimal performance. Getting unoptimized LLVM IR works via zig --verbose-llvm-ir test.zig (so far without an option to store LTO artifacts) and clang -O3 -Xclang -disable-llvm-optzns -emit-llvm -S test.c with (if needed) LTO artifact storing via -plugin-opt=save-temps. Getting optimized LLVM IR works via clang -O3 -emit-llvm -S test.c and zig -femit-llvm-ir test.zig.
  5. Miscompilations Tools like Miri or Cerberus run the program in an interpreter, but may not cover all possible program semantics due to ambiguity and may not be feasible to apply, so the only good chance is to reduce the problem as in Undetectable Undefined Behavior.
  6. Memory problems
    1. Out-of-bounds (OOB) clang -fsanitize=address, zig -ODebug/-OReleaseSafe
    2. Null pointer dereference clang -fsanitize=address, zig -ODebug/-OReleaseSafe
    3. Type confusion clang -fsanitize="address,undefined", zig -ODebug/-OReleaseSafe
    4. Integer overflow clang -fsanitize=undefined, zig -ODebug/-OReleaseSafe
    5. Use after free clang -fsanitize=address, Zig allocator configuration
    6. Invalid stack access clang -fsanitize=address and ASAN_OPTIONS=detect_stack_use_after_return=1 with 1.3-2x runtime and 11MB fake stack per thread, unimplemented in Zig.
    7. Usage of uninitialized memory (UUM) clang -fsanitize=memory, unimplemented in Zig for partial initialization (the implementation only checks against any initialization, only if the value is used in a branch, and only if memory is not coerced to different types).
    8. Data races can be sanitized in Clang and Zig via -fsanitize=thread, but Zig offers no annotation for "intentionally racy reads and writes" like __attribute__((no_sanitize("thread"))).
    9. Illegal aliasing can only be checked for typed aliasing with clang -fsanitize=type, unimplemented in Zig.
    10. Stack overflow from recursions or during the call chain of the program. TODO list tooling to over-approximate stack usage upfront. TODO list tooling to measure stack usage.
  7. Resource leaks (freestanding/kernel) Only process-based resources are considered here, which are accessible by the process as
    1. memory (stack is covered in Memory problems, heap)
    2. handles/file descriptors
    3. child processes (without ownership delegation)
    There may be an association between 2. handles/file descriptors and 3. child processes (i.e. pidfd/GetProcessId, process group handles etc) xor 1. memory (i.e. on usage of mmap), or there may be none. There exists no tooling to check for point 3, and explaining the process group model with edge cases would make this article too long.
    1. Automatic tools for memory leak detection are Valgrind Massif (does not work on Windows) or custom allocators with such functionality (or any of the tools below for checking/profiling/tracing). Posix systems have, for memory profiling, Valgrind Massif (valgrind --tool=massif prog; ms_print massif.out.12345); for memory checks, Valgrind Memcheck (valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose prog); for memory analysis at runtime, gdb with pwndbg (for example using vmmap); and for memory analysis after runtime, coredumps, meaning gcore -o $TMPDIR/process $PID, cat /proc/$pid/smaps > $TMPDIR/TimeMemAction.txt or gdb -p $pid; dump memory memory.dump 0xSTART 0xEND; hexdump -C memory.dump. Windows systems have, for memory profiling, VMMap, RAMMap and DrMemory as graphical tools; for memory leaks, UMDH (gflags /i prog.exe +ust; $Env:_NT_EXECUTABLE_IMAGE_PATH="url_ms_sym_server"; umdh -p:$PID -f:b4leak.log; umdh b4leak.log afterleak.log > res.diff) and DrMemory; for memory analysis at runtime, Visual Studio (Code) with "Memory Usage"; and for analysis after runtime, windbg (gflags /i prog.exe +ust; WinDbgX.exe prog.exe; .dump /ma b4leak.dmp; .opendump leak.dmp; f5; ||1s; ||.; !heap -s; !heap -h HANDLE; !heap -p -a ADDRESS; !heap -flt s SIZE to find the stack of an allocation).
    File descriptor/handle leaks can be automatically detected on Posix with Valgrind (valgrind --track-fds=yes prog) and on Windows by manually checking with Handle, Process Explorer or ETW traces, or automatically with proprietary solutions.
    Examples for more direct access and control are /proc/PID_OF_PROCESS on many Posix systems, NtQuerySystemInformation with SYSTEM_HANDLE_INFORMATION and SYSTEM_HANDLE_TABLE_ENTRY_INFO on Windows, and sysctl, kvm and procmap on BSDs; various other kernel-specific trace options exist.
  8. Freezes (deadlocks, livelocks, signal safety, unbounded loops etc) LLVM has a not-well-documented deadlock sanitizer option, TSAN_OPTIONS=detect_deadlocks=1:second_deadlock_stack=1 (see the lock-order sketch after this list).
    Livelock detection, like infinite loop detection, would need annotation of progress and a step or time limit. So one good option is to simulate time or progress in the testing build mode and do runtime validation at intermediate steps. The same strategy can be applied to unbounded loops.
    Signal safety requires fail-safe programming, especially on Posix, and would be another article, also covering process group semantics. ptrace(PTRACE_GETSIGINFO, ..) and WaitForDebugEvent are options to trace signals, besides kernel tracers like ktrace and dtrace or ETW on Windows, but usually it is simpler to reproduce the behavior in a debugger with simplified code.
  9. Performance problems Extrapolating across multiple target hardware automatically is unfeasible. Simulating the CPU cache behavior of target hardware from any host hardware works via Valgrind's Cachegrind, with Valgrind's Callgrind adding call graph information; Callgrind visualization exists for every platform. Accurate tracing on target hardware can be obtained from hardware counters via Linux perf (perf_event_open), Darwin kperf (kpc_get_thread_counters) and Event Tracing for Windows (ETW) (StartTraceW), with Darwin having (yet) no CLI API and GUI. There are many approaches for profiling various program aspects with less accuracy and less space usage, too long to list here.
  10. Logic problems Logic problems of software systems can be described as problems related to incorrectly applied logic in how the code solves the intended and follow-up problems, ignoring hardware problems, kernel problems, different types of UB, miscompilations, memory problems, resource leaks, freezes and performance issues.
    This typically includes
    1. software requirements or their handling, TODO better phrase requirements and specification?
    2. (temporary) inconsistency of state (relations)
    3. incorrect math, for example not covering edge cases
    4. incorrect modeling of external and internal state and synchronization
    5. incorrect protocol handling
    and is usually caused by
    1. incorrect constraints on the design, meaning how the different parts should interact and work towards the goals for the use cases
    2. unclear, unspecified or incorrectly assumed hardware or software guarantees by components
    3. implementation oversights, unintended use cases, unfeasibility of a general solution due to constraints like time, money etc
    Those problems may be solved in the following ways
    1. Software requirements typically depend on the hardware and software platforms to specify the system type (how distributed, event handling idea like state machine or query system), the used protocols, platform requirements and provided functionality with assurances (guarantees are more rare). TODO rephrase Those are typically written in UML, which is very inflexible in contrast to an arbitrary graph for modeling system behavior. Mermaid for UML looks nice, but has scalability issues on bigger drawings. PlantUML does not look nice, but just works. draw.io for non-UML is unnecessarily complex, offers no data annotation and no sane export format to reuse the graph in other tools (the underlying representation was not checked), besides not being open source. In short, any tool with graph output will do, since none is good, and for smaller models ascii/utf8 drawings work fine.
      Handling means tracking progress, for which time feasibility and getting a feasible design are essential. TODO prototype, debugging and other things.
    2. (temporary) inconsistency of state (relations) TODO
    3. incorrect math, for example not covering edge cases TODO
    4. incorrect modeling of external and internal state and synchronization TODO
    5. incorrect protocol handling TODO
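The overflow sketch referenced in problem 3 (file name and flags illustrative): signed integer overflow is UB in C and is flagged at runtime by the undefined behavior sanitizer.

```c
// ub.c - signed overflow, caught at runtime:
//   clang -g -fsanitize=undefined ub.c && ./a.out 2147483647
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int x = argc > 1 ? atoi(argv[1]) : 0;
    int y = x + 1;   // UB when x == INT_MAX; UBSan reports the overflow
    printf("%d\n", y);
    return 0;
}
```

The lock-order sketch referenced in problem 8 (a classic inversion; the program may also genuinely deadlock on some schedules):

```c
// deadlock.c - lock-order inversion for TSan's deadlock detector:
//   clang -g -fsanitize=thread deadlock.c -lpthread
//   TSAN_OPTIONS=detect_deadlocks=1:second_deadlock_stack=1 ./a.out
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

static void *t1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);   // acquires in order a -> b
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   // acquires in order b -> a: cycle with t1
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

int main(void) {
    pthread_t x, y;
    pthread_create(&x, NULL, t1, NULL);
    pthread_create(&y, NULL, t2, NULL);
    pthread_join(x, NULL);
    pthread_join(y, NULL);
    puts("done");
    return 0;
}
```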
Ideally, only the system behavior and interactions of domain- and use-case-specific parts (2. Kernel and platform problems, 10. Logic problems) should need cognitive load from the programmer, whereas the other error classes should have standard approaches to isolate and eliminate them. Unifying debug tooling simplifies usage for greater developer productivity, and exposing it as a library allows automating this process.