Evaluating Automated Theorem Proving Systems
Abstract
A key concern of Automated Theorem Proving (ATP) research is the development
of more powerful systems, capable of solving more difficult problems within
the same resource limits.
To build more powerful systems, it is important to understand which
systems, and hence which techniques, work well for which types of problems.
This talk deals with the empirical evaluation of general-purpose ATP systems
in the context of the TPTP project.
This includes the underlying principles and processes of the offline
evaluation, and the organization and outcomes of the annual CADE ATP System
Competition (CASC).
The offline evaluation provides a measure of the difficulty of each problem
in the TPTP, and a rating of each ATP system's performance on each class of
problems.
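As a minimal sketch of the rating idea, a problem's difficulty can be taken
as the fraction of evaluated systems that fail to solve it; the actual TPTP
scheme is more refined (roughly, it restricts attention to current
state-of-the-art systems), and all system names, problem names, and solve
results below are hypothetical.

    # Simplified sketch of a TPTP-style difficulty rating: the fraction
    # of evaluated systems that fail to solve a problem within the
    # resource limits. All names and results here are hypothetical.

    systems = {"SystemA", "SystemB", "SystemC"}

    # problem -> set of systems that solved it
    solved = {
        "PUZ001+1": {"SystemA", "SystemB", "SystemC"},
        "SET014+4": {"SystemA"},
        "NUM304+1": set(),
    }

    def difficulty(problem: str) -> float:
        # 0.00: solved by all systems; 1.00: solved by none.
        return 1.0 - len(solved[problem] & systems) / len(systems)

    for problem in sorted(solved):
        print(f"{problem}: rating {difficulty(problem):.2f}")

Under such a scheme, a rating of 0.00 marks a problem solved by every
system, while 1.00 marks one beyond the reach of all of them.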
The CASC evaluation is similar in structure to the offline evaluation and,
in addition to being a public evaluation of ATP systems, aims to provide
stimulus and insight for the development of more powerful systems.