The shim-layer problem

In real automotive firmware, signals are never referenced by their DBC name. Between the DBC definition and the C source file sits one or more abstraction layers:
DBC file:           BrakeDemand.Value  (ETH, CAN ID 0x4A2)

AUTOSAR RTE:        Rte_Read_BC_BrakeDemandVal(&val)

SWC source:         status = Rte_Read_BC_BrakeDemandVal(&brk_demand);
To write a HiL test for BrakeController, you need to know:
  • Which signals does it consume? (BrakeDemand.Value, EngineData.RPM, VehicleSpeed.Speed)
  • Which signals does it produce? (BrakeStatus.Active, BrakeStatus.Pressure)
Finding this out manually means reading every RTE header, tracing every COM callback, and matching them against a DBC that has hundreds of signals. For a medium-size SWC this takes hours. crucihil analyze does it in under a minute.

How it works — the full pipeline

Source files  →  tree-sitter  →  identifiers  →  filter  →  AI matching  →  JSON result
DBC files     →  cantools     →  signal corpus ──────────────────┘          (inputs + outputs
                                                                             with confidence)

Stage 1 — Identifier extraction (tree-sitter)

tree-sitter parses every .c, .cpp, .h, .hpp, .cc, .cxx file under --source and any --dep paths. It collects every node of type identifier, field_identifier, and type_identifier from the AST — a broad sweep that captures function calls, variable names, macro names, and type names. The result is a set of raw strings like:
Rte_Read_BC_BrakeDemandVal
COM_SIG_ENGINE_RPM
BrakeControllerInit
uint32_t
i
status

Stage 2 — Noise filter

A static blocklist removes:
  • C/C++ keywords (int, static, typedef, …)
  • AUTOSAR primitive types (uint8_t, Std_ReturnType, E_OK, …)
  • Common local variable names (value, result, status, i, …)
  • Any identifier shorter than 4 characters
  • Any identifier starting with __
The remaining identifiers are sorted longest-first (longer names are more likely to be meaningful shim identifiers) and capped at 500. This keeps the AI prompt within context budget.
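The filter rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not CruciHiL's implementation: the blocklist is a hypothetical, abbreviated stand-in, and the function and constant names are assumptions.

```python
# Hypothetical, abbreviated blocklist; the real list is not reproduced on this page.
BLOCKLIST = {
    "int", "static", "typedef",                       # C/C++ keywords
    "uint8_t", "uint32_t", "Std_ReturnType", "E_OK",  # AUTOSAR primitive types
    "value", "result", "status",                      # common local variable names
}
MIN_LENGTH = 4
MAX_IDENTIFIERS = 500


def filter_identifiers(raw, cap=MAX_IDENTIFIERS):
    """Drop noise, then return the longest identifiers first, capped at `cap`."""
    kept = {
        name for name in raw
        if name not in BLOCKLIST
        and len(name) >= MIN_LENGTH
        and not name.startswith("__")
    }
    # Longest-first; ties are broken alphabetically so the output is deterministic.
    return sorted(kept, key=lambda n: (-len(n), n))[:cap]


raw = ["Rte_Read_BC_BrakeDemandVal", "COM_SIG_ENGINE_RPM", "BrakeControllerInit",
       "uint32_t", "i", "status", "__attribute__"]
print(filter_identifiers(raw))
# ['Rte_Read_BC_BrakeDemandVal', 'BrakeControllerInit', 'COM_SIG_ENGINE_RPM']
```

Applying the cap after sorting means that, in an oversized input, the shortest names are the first to be dropped.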

Stage 3 — Signal corpus

DBC files are parsed with cantools. Every MessageName.SignalName pair becomes a corpus entry:
BrakeDemand.Value      [ETH, defs/chassis.dbc]
EngineData.RPM         [CAN, defs/powertrain.dbc]
BrakeStatus.Active     [CAN, defs/powertrain.dbc]
VehicleSpeed.Speed     [CAN, defs/powertrain.dbc]
Interface type is inferred from the TOML key name: can_dbc → CAN, eth_dbc → ETH. Explicit --dbc paths default to unknown.
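As a rough sketch of how such a corpus could be assembled: with cantools, `cantools.database.load_file(path).messages` yields message objects that expose `.name` and `.signals`. The example below uses stand-in objects with the same two attributes so that it runs without a DBC file. The `build_corpus` function and the entry field names are assumptions for illustration.

```python
from types import SimpleNamespace


def build_corpus(messages, interface, dbc_path):
    """Return one corpus entry per MessageName.SignalName pair."""
    return [
        {"signal": f"{msg.name}.{sig.name}", "interface": interface, "source": dbc_path}
        for msg in messages
        for sig in msg.signals
    ]


# Stand-in for the message list of a parsed DBC (same .name / .signals shape).
messages = [
    SimpleNamespace(name="EngineData", signals=[SimpleNamespace(name="RPM")]),
    SimpleNamespace(name="VehicleSpeed", signals=[SimpleNamespace(name="Speed")]),
]

for entry in build_corpus(messages, "CAN", "defs/powertrain.dbc"):
    print(f'{entry["signal"]:<22} [{entry["interface"]}, {entry["source"]}]')
```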

Stage 4 — AI matching

The AI receives:
  • The filtered identifier list (up to 500 entries)
  • The full signal corpus with interface labels
  • A system prompt that explains the direction rules (Read/Receive/Get → INPUT, Write/Send/Set → OUTPUT)
The AI returns JSON:
{
  "inputs": [
    {
      "signal": "BrakeDemand.Value",
      "interface": "ETH",
      "matched_identifier": "Rte_Read_BC_BrakeDemandVal",
      "confidence": 0.95
    }
  ],
  "outputs": [...],
  "unmatched_identifiers": [...]
}
The framework enriches each match with review_required: true when confidence < 0.85, and deduplicates by signal — keeping only the highest-confidence match when a signal is matched via multiple shim paths.
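The enrichment step can be pictured as follows. This is a minimal sketch of the two rules just described (flag below 0.85, keep the best match per signal); the function name is an assumption, and `COM_SIG_BRAKE_DEMAND` is a hypothetical second shim path added only to show deduplication.

```python
REVIEW_THRESHOLD = 0.85  # matches below this confidence are flagged for review


def enrich(matches, threshold=REVIEW_THRESHOLD):
    """Keep the highest-confidence match per signal and add review_required."""
    best = {}
    for match in matches:
        current = best.get(match["signal"])
        if current is None or match["confidence"] > current["confidence"]:
            best[match["signal"]] = match
    return [
        {**match, "review_required": match["confidence"] < threshold}
        for match in best.values()
    ]


matches = [
    {"signal": "BrakeDemand.Value", "matched_identifier": "Rte_Read_BC_BrakeDemandVal", "confidence": 0.95},
    {"signal": "BrakeDemand.Value", "matched_identifier": "COM_SIG_BRAKE_DEMAND", "confidence": 0.72},
    {"signal": "EngineData.RPM", "matched_identifier": "COM_SIG_ENGINE_RPM", "confidence": 0.80},
]
for m in enrich(matches):
    print(m["signal"], m["matched_identifier"], m["review_required"])
# BrakeDemand.Value Rte_Read_BC_BrakeDemandVal False
# EngineData.RPM COM_SIG_ENGINE_RPM True
```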

Confidence score system

Range          Label                    What it means
0.90 – 1.00    High                     Unmistakable match, e.g., Rte_Read_EC_EngineRPM → EngineData.RPM
0.85 – 0.89    High                     Strong semantic match with minor naming variation
0.85+          review_required: false   Safe to use in tests without manual verification
0.60 – 0.84    Medium                   Plausible but ambiguous; verify before using
Below 0.60     Omitted                  Not included in output
Medium-confidence matches (review_required: true) should be verified against the actual header before being used in test assertions. A false match here will cause a test to assert the wrong signal.

The direction inference rules

The AI determines input vs. output from the shim identifier’s verb:
Verb pattern                               Direction         Example
Rte_Read_*, Com_Receive*, get_*, *_read    INPUT             Rte_Read_BC_BrakeDemandVal
Rte_Write_*, Com_Send*, set_*, *_write     OUTPUT            Rte_Write_BC_BrakeActive
Enum ID like COM_SIG_*                     INPUT (default)   COM_SIG_ENGINE_RPM
Ambiguous identifiers are classified as INPUT when context is insufficient.
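In CruciHiL the AI applies these rules from its system prompt. For intuition, or for spot-checking a result by hand, the same rules can be approximated deterministically. The sketch below is not part of the tool; the function name and the `basis` return value are assumptions.

```python
import re

# Prefix and suffix patterns from the direction rules; matching is case-sensitive.
INPUT_PATTERN = re.compile(r"^(Rte_Read_|Com_Receive|get_)|_read$")
OUTPUT_PATTERN = re.compile(r"^(Rte_Write_|Com_Send|set_)|_write$")


def infer_direction(identifier):
    """Return (direction, basis), where basis is 'verb' or 'default'."""
    if INPUT_PATTERN.search(identifier):
        return "INPUT", "verb"
    if OUTPUT_PATTERN.search(identifier):
        return "OUTPUT", "verb"
    # Enum IDs such as COM_SIG_* and ambiguous names fall back to INPUT.
    return "INPUT", "default"


for name in ("Rte_Read_BC_BrakeDemandVal", "Rte_Write_BC_BrakeActive", "COM_SIG_ENGINE_RPM"):
    print(name, infer_direction(name))
# Rte_Read_BC_BrakeDemandVal ('INPUT', 'verb')
# Rte_Write_BC_BrakeActive ('OUTPUT', 'verb')
# COM_SIG_ENGINE_RPM ('INPUT', 'default')
```

Reporting the basis separately makes it easy to see which classifications rest on the INPUT default rather than on an explicit verb.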

Using --dep for better coverage

Many AUTOSAR SWCs have this pattern:
// brake_controller.c
Std_ReturnType ret = Rte_Read_BC_BrakeDemandVal(&demand);
The function Rte_Read_BC_BrakeDemandVal is declared in rte/Rte_BrakeController.h, not in the SWC source itself. Even without --dep rte/Rte_BrakeController.h, tree-sitter still finds the call in the SWC source, but passing the header adds type annotations that make the semantic match stronger.
Pass only the shim headers for the specific SWC you are analyzing. Avoid passing the entire rte/ directory — it adds identifiers from other SWCs and introduces noise.

Good dependency pattern

# Right: only the BrakeController's shim headers
crucihil analyze \
  --source swc/brake_controller \
  --component BrakeController \
  --rig rigs/bench.toml \
  --dep rte/Rte_BrakeController.h \
  --dep com/Com_BrakeController_Cfg.h

Avoid

# Wrong: entire RTE directory floods the AI with other SWCs' identifiers
crucihil analyze \
  --source swc/brake_controller \
  --component BrakeController \
  --rig rigs/bench.toml \
  --dep rte/

Multi-interface support

CruciHiL supports DBC-encoded definitions for any interface type. A single crucihil analyze call can match signals across multiple buses:
crucihil analyze \
  --source swc/chassis_controller \
  --component ChassisController \
  --dbc defs/powertrain_can.dbc \
  --dbc defs/chassis_eth.dbc
Interface types are inferred from TOML key names:
  • can_dbc = "..." → CAN
  • eth_dbc = "..." → ETH
Files passed directly with --dbc carry no TOML key, so their interface defaults to unknown.

What is NOT sent to the AI

CruciHiL never sends raw source code to the AI — only the extracted identifier list and the DBC signal corpus. This means:
  • Firmware IP (algorithms, constants, proprietary logic) stays on your machine
  • The AI never sees function bodies, comments, or string literals
  • Only a filtered list of identifier names (up to 500) is transmitted

Integration with generate_test_suite

The output of crucihil analyze feeds directly into test generation:
# In Claude/Copilot with MCP tools:
result = analyze_component(
    source_path="swc/brake_controller",
    component_name="BrakeController",
    rig_toml_path="rigs/bench.toml",
)

# Use high-confidence matches as context for test generation
signals = [m["signal"] for m in result["inputs"] + result["outputs"]
           if not m["review_required"]]

generate_test_suite(
    suite_name="brake_validation",
    description="Validate BrakeController signal interface",
    rig_toml_path="rigs/bench.toml",
    context_items=signals,
)

Tips for best results

Analyze one component at a time. The identifier cap (500) is calibrated for a single SWC. Pointing --source at an entire project directory will dilute the identifier space and reduce precision.
Use the rig TOML for DBC discovery. Specifying DBC files via [rig.definitions] in the TOML gives the AI interface type context (CAN vs ETH) that improves match quality.
Trust high-confidence matches, verify medium ones. Matches with confidence >= 0.85 are almost always correct. Spend review time on the 0.60–0.84 range.
Run with --output json and filter in CI. jq '.inputs[] | select(.review_required == false)' gives you only the high-confidence matches to feed into test generation.
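If jq is not available in the CI image, the same split can be done in Python. The sketch below partitions matches from both inputs and outputs, so a pipeline can pass the trusted ones on and report the rest for review. The function name is an assumption, and the sample values are illustrative rather than real tool output.

```python
import json


def partition_matches(result):
    """Split all matches into (trusted, needs_review) by the review_required flag."""
    matches = result.get("inputs", []) + result.get("outputs", [])
    # A missing flag is treated as needing review, the conservative choice.
    trusted = [m for m in matches if m.get("review_required", True) is False]
    needs_review = [m for m in matches if m.get("review_required", True) is not False]
    return trusted, needs_review


# Sample shaped like the JSON result; in CI this would be loaded from the tool's output.
sample = json.loads("""
{
  "inputs": [
    {"signal": "BrakeDemand.Value", "confidence": 0.95, "review_required": false},
    {"signal": "EngineData.RPM", "confidence": 0.72, "review_required": true}
  ],
  "outputs": [
    {"signal": "BrakeStatus.Active", "confidence": 0.91, "review_required": false}
  ]
}
""")

trusted, needs_review = partition_matches(sample)
print([m["signal"] for m in trusted])       # ['BrakeDemand.Value', 'BrakeStatus.Active']
print([m["signal"] for m in needs_review])  # ['EngineData.RPM']
```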

See also