Design Changes

Discontinuance of the SEM Division

Although the SEM division was well motivated, and at CASC-16 many entrants expressed an interest and desire for such a division, at CASC-17 no systems were especially tuned to the chosen application domain of set theory. Therefore:

The SEM division has not been continued.

Addition of the EPR Division

There are many CNF problems, often from 'real world' applications, that have a finite Herbrand Universe. These problems can be converted to propositional problems, by generating all ground instances of the clauses and assigning a proposition to each unique ground atom. Such problems are decidable, and may be tackled using 1st order techniques, or by conversion to propositional form and using propositional techniques. Some approaches are capable of deciding such problems, while others can only determine that the clause set is unsatisfiable, and others can only determine that the clause set is satisfiable. It would be interesting to evaluate and compare the 1st order and propositional approaches. The addition of propositional systems to CASC would also contribute to making CASC a stimulating event. Therefore:

CASC-JC has an EPR division, using problems with a finite Herbrand Universe.
The EPR division has two categories:
- The EPT category: Effectively Propositional Theorems (unsatisfiable clauses)
- The EPS category: Effectively Propositional non-theorems (Satisfiable clauses)

Two Classes in the MIX Division

ATP systems have three main types of successful output:

A logical solution:
- For theorems, proofs in which each formula is either an input formula, or deduced (following the semantics of the logic) from the input formulae and other deduced formulae.
- For non-theorems, models.
A meta-solution, that proves that there is a logical solution. Refutations are meta-solutions.
An assurance ("yes") that there is a logical solution.

For theorems, logical solutions and meta-solutions are collectively called proof objects.

There is evidence that for some applications of ATP there is a need for the production of logical or meta-solutions, and that the output must be produced as part of the system run, i.e., the production of the solution cannot be deferred to a later run of the system. The need for output as part of the first system run appears to come mainly from applications whose problems are CNF theorems that would fall into the MIX division, i.e., there is a demand for the production of proof objects.

Current systems' proof objects vary in format and level of detail. For CNF theorems, many systems output at least a refutation, and the ATP community is experienced in examining these. For this type of problem, it therefore seems possible to say whether or not a system's proof objects are acceptable. For FOF theorems, most decent current systems convert the problem to CNF and work with the CNF. The output, although acceptable wrt the CNF, may not be meaningful in terms of the original FOF problem. For non-theorems, finite models can be easily presented, while for non-finite models the situation is less clear.

Therefore, CASC-JC ensures the appropriate recognition of the production of acceptable proof objects:

The MIX division has two classes:
- The Assurance class: Ranked according to the number of assurances of the existence of a proof (the only class used up to CASC-17).
- The Proof class: Ranked according to the number of acceptable proof objects output.
For the Proof class, only acceptable proof objects count towards winning. Entrants who wish to be eligible for winning the Proof class must supply representative samples of their proof objects before the competition. The competition panel decides whether or not a system's proof objects are acceptable.
The proof objects from the winner of the Proof class are checked by the panel. If any of the proof objects are unacceptable, i.e., they are significantly worse than the samples provided, then that system is retrospectively disqualified from the Proof class.

Two categories in the SAT division

Various ATP systems and techniques have been observed to be particularly well suited to problems with certain syntactic characteristics. Due to this specialization, empirical evaluation of ATP systems must be done in the context of problems that are reasonably homogeneous with respect to the systems. This motivates the syntactically defined divisions and categories of CASC. Evaluation of ATP systems within such divisions and categories makes it possible to say which systems work well for what types of problems. This identifies specialist capabilities of ATP systems, while general capabilities can be inferred from the separate division and category capabilities.

Recent analysis of ATP system performances on problems in the SAT division has shown that there is specialization between problems with equality and problems without equality. It is therefore appropriate to separate these two types of problems in the SAT division, so that the specialist capabilities can be observed.

Therefore:

The SAT division has two categories:
- The SNE category: SAT with no Equality
- The SEQ category: SAT with Equality

Non-availability of the TPTP version used

Over the last few CASCs there has been an increase in the extent to which systems have been tuned to the TPTP problems that have been likely to be eligible for use. This is problematic, as these sets of problems have been small in some categories/divisions, and thus tuning has become close to precomputation and storage of information for individual TPTP problems.

Various mechanisms have been proposed to limit the extent of the tuning for CASC, e.g., a limit on the number of strategies from which a system may select, based on problem characteristics. The most direct approach, one that corresponds directly to ensuring that "work done using the TPTP should also be useful in a broader context" (as discussed at the CASC panel at CADE-17), is to include unseen problems in CASC. Previously this was not possible, because the CASC organizers did not have the resources to independently evaluate the difficulty, and hence eligibility, of new problems. Rather they relied on performance data kindly submitted by developers. That in turn made it necessary to make the problems available in a TPTP release well in advance of CASC, and hence also available for tuning. This situation has now changed, as thanks to the Australian government the organizers now have almost dedicated use of a new 4 CPU SUN, so that new problems can be evaluated privately, without making the problems publically available.

Therefore:

CASC-JC will use TPTP v2.4.0, or a patch thereof.
The TPTP version used for the competition is not released until after the system installation deadline, so that some problems have not been previously seen by the entrants.
The CASC-JC selection mechanism is biased to ensure that up to 50% of the problems in each category are unseen. (The percentage is dependent on how many new problems are eligible in the category. It is also limited by the mechanism that prevents having an excessive number of very similar problems in any category.)

This change means that it is disadvantagous to tune excessively to previously available TPTP problems. Rather, systems that are fundamentally strong, and whose strategies are useful beyond previously used test problems, should succeed. Currently there are over 1600 problems prepared for TPTP v2.4.0 that are not in TPTP v2.3.0. We hope that many more will be contributed before TPTP v2.4.0 is finalized. (But note, because of the random selection of problems in CASC and the mechanism for avoiding many similar problems, it is impossible to bias the competition by submitting large numbers of problems that are solved by only some particular system.)

Eligibility of Non-standard Problems

The TPTP distinguishes versions of problems as one of standard, incomplete, augmented, especial, or biased. Previously, only standard and especial problems were eligible for use in CASC. Incomplete and augmented problems were excluded, as there was a perceived danger that the modifications that make the problems incomplete or augmented were made with a particular ATP system in mind, i.e., the problems were considered to be potentially biased. Recent discussions have led to the conclusion that the modifications would generally be effective for all ATP systems, and that the problems are therefore not biased. Therefore:

All except biased problems are eligible.

Changes to required System Properties

The ATP systems entered into CASC are run in various ways by the competition organizers, firstly to ensure that they have the required properties, and of course subsequently in the competition itself. In these runs problems with various strange names are used, and it is necessary that the ATP systems are robust to these. Therefore:

No assumptions may be made about the format of the problem file name.

In the competition divisions a wall clock time limit is imposed in addition to the CPU time limit, to prevent very high memory usage that causes swapping. The wall clock time limit is double the CPU time limit, and is imposed by sending a SIGALRM when the wall clock time limit is reached. Therefore:

The ATP systems that run on the general hardware have to be interruptable by a SIGALRM signal, so that the wall clock time limit can be imposed on each solution attempt.

In the MIX division, the system that outputs the most acceptable proof objects (maximally one per problem solved) to stdout is the winner of the Proof class. It is necessary that the proofs be easily identifiable, so that their acceptability can be verified. Therefore:

When outputing proofs for the MIX division proof class, the start and end of the proof output must be identified by distinguished strings (specified by the entrant).

Execution of the ATP systems on the general hardware is controlled by a perl script. The jobs are queued onto the workstations so that each workstation is running one job at a time. This means that multiple copies of the ATP systems have to be executable concurrently on different machines but in the same (NFS cross mounted) directory. Previously the competition organizers have had to modify many of the ATP systems to make this possible. Now the responsibility has been transferred to the entrants.

Multiple copies of the ATP systems have to be executable concurrently on different machines but in the same (NFS cross mounted) directory.