Evaluating Automated Theorem Proving Systems
Abstract
A key concern of Automated Theorem Proving (ATP) research is the development
of more powerful systems, capable of solving more difficult problems within
the same resource limits.
To build more powerful systems, it is important to understand which
systems, and hence which techniques, work well for which types of problems.
This talk deals with the empirical evaluation of general-purpose ATP systems
in the context of the TPTP project.
This includes the underlying principles and processes of the offline
evaluation, and the organization and outcomes of the annual CADE ATP System
Competition (CASC).
The offline evaluation provides a measure of the difficulty of each problem
in the TPTP, and a rating of each ATP system's performance on each class of
problems.
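As a minimal sketch of the rating idea, a problem's difficulty can be taken
as the fraction of evaluated systems that fail to solve it; the actual TPTP
scheme is more refined (roughly, it restricts attention to current
state-of-the-art systems), and all system names, problem names, and solve
results below are hypothetical.

    # Simplified sketch of a TPTP-style difficulty rating: the fraction
    # of evaluated systems that fail to solve a problem within the
    # resource limits. All names and results here are hypothetical.

    systems = {"SystemA", "SystemB", "SystemC"}

    # problem -> set of systems that solved it
    solved = {
        "PUZ001+1": {"SystemA", "SystemB", "SystemC"},
        "SET014+4": {"SystemA"},
        "NUM304+1": set(),
    }

    def difficulty(problem: str) -> float:
        # 0.00: solved by all systems; 1.00: solved by none.
        return 1.0 - len(solved[problem] & systems) / len(systems)

    for problem in sorted(solved):
        print(f"{problem}: rating {difficulty(problem):.2f}")

Under such a scheme, a rating of 0.00 marks a problem solved by every
system, while 1.00 marks one beyond the reach of all of them.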
The CASC evaluation is similar in structure to the offline evaluation and,
in addition to being a public evaluation of ATP systems, aims to provide
stimulus and insight for the development of more powerful systems.