workbugbench
Autonomous Bug-Report Testing for Android
Benchmarking how well autonomous agents reproduce bugs
With Dr. Tingting Yu. Bug reports are messy; testing whether an autonomous system can reproduce them requires that both the agent and the harness measure success and failure correctly.
I’m using computer-use agents to extract APK info and evaluate the outcome of executed reports. The novel methods sit alongside the benchmark itself; both are research outputs.
Highlights
- Eval + benchmarks for autonomous bug-report testing.
- Computer-use agents extracting APK info & evaluating outcomes.