work · bugbench
2026 · Eval · Agents · Android

Autonomous Bug-Report Testing for Android

Benchmarking how well autonomous agents reproduce bugs

With Dr. Tingting Yu. Bug reports are messy. Testing whether an autonomous system can reproduce them requires that both the agent and the harness judge success and failure correctly.

I’m using computer-use agents to extract APK info and evaluate the outcome of each executed report. The novel methods sit alongside the benchmark itself; both are research outputs.
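
One way to make "judging success and failure correctly" concrete is to compare the agent's own verdict on each report with an independently established label. The sketch below is illustrative only: `Attempt`, `score_verdicts`, and every field name are hypothetical and not taken from the benchmark, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One executed bug report (hypothetical record layout)."""
    report_id: str
    agent_says_reproduced: bool   # verdict produced by the agent
    actually_reproduced: bool     # independent ground-truth label


def score_verdicts(attempts):
    """Count where the agent's verdict agrees or disagrees with ground truth."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a in attempts:
        if a.agent_says_reproduced and a.actually_reproduced:
            counts["tp"] += 1   # agent claims success, bug really reproduced
        elif a.agent_says_reproduced:
            counts["fp"] += 1   # agent claims success, bug did not reproduce
        elif a.actually_reproduced:
            counts["fn"] += 1   # agent claims failure, bug did reproduce
        else:
            counts["tn"] += 1   # agent claims failure, correctly
    total = len(attempts)
    # Fraction of reports where the verdict matches the label (0.0 if empty).
    counts["agreement"] = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return counts


if __name__ == "__main__":
    sample = [
        Attempt("r1", True, True),
        Attempt("r2", True, False),
        Attempt("r3", False, True),
        Attempt("r4", False, False),
    ]
    print(score_verdicts(sample))
```

Separating false positives from false negatives matters here because an agent that over-reports success and one that under-reports it can have the same overall agreement rate while failing in different ways.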

Highlights

  • Eval + benchmarks for autonomous bug-report testing.
  • Computer-use agents extracting APK info & evaluating outcomes.