Design Changes
Discontinuance of the SEM Division
Although the SEM division was
well motivated, and at CASC-16 many entrants expressed an interest
and desire for such a division, at CASC-17 no systems were especially
tuned to the chosen application domain of set theory.
Therefore:
- The SEM division has not been continued.
Addition of the EPR Division
There are many CNF problems, often from 'real world' applications, that have
a finite Herbrand Universe.
These problems can be converted to propositional problems, by
generating all ground instances of the clauses and assigning a
proposition to each unique ground atom.
Such problems are decidable, and may be tackled using 1st order
techniques, or by conversion to propositional form and using propositional
techniques.
Some approaches are capable of deciding such problems, while others
can only determine that the clause set is unsatisfiable, and others can
only determine that the clause set is satisfiable.
It would be interesting to evaluate and compare the 1st order and
propositional approaches.
The addition of propositional systems to CASC would also contribute to making
CASC a stimulating event.
Therefore:
- CASC-JC has an EPR division, using problems with a finite
Herbrand Universe.
- The EPR division has two categories:
- The EPT category: Effectively Propositional Theorems
(unsatisfiable clauses)
- The EPS category: Effectively Propositional
non-theorems (Satisfiable clauses)
Two Classes in the MIX Division
ATP systems have three main types of successful output:
- A logical solution:
- For theorems, proofs in which each formula is either an input
formula, or deduced (following the semantics of the logic) from
the input formulae and other deduced formulae.
- For non-theorems, models.
- A meta-solution, that proves that there is a logical solution.
Refutations are meta-solutions.
- An assurance ("yes") that there is a logical solution.
For theorems, logical solutions and meta-solutions are collectively
called proof objects.
There is evidence that for some applications of ATP there is a need
for the production of logical or meta-solutions, and that the
output must be produced as part of the system run, i.e., the production
of the solution cannot be deferred to a later run of the system.
The need for output as part of the first system run appears to come
mainly from applications whose problems are CNF theorems that would
fall into the MIX division, i.e., there is a demand for the production
of proof objects.
Current systems' proof objects vary in format and level of detail.
For CNF theorems, many systems output at least a refutation, and the
ATP community is experienced in examining these.
For this type of problem, it therefore seems possible to say whether
or not a system's proof objects are acceptable.
For FOF theorems, most decent current systems convert the problem to
CNF and work with the CNF.
The output, although acceptable wrt the CNF, may not be meaningful in
terms of the original FOF problem.
For non-theorems, finite models can be easily presented, while for
non-finite models the situation is less clear.
Therefore, CASC-JC ensures the appropriate recognition of the production of
acceptable proof objects:
- The MIX division has two classes:
- The Assurance class:
Ranked according to the number of assurances of the existence
of a proof (the only class used up to CASC-17).
- The Proof class:
Ranked according to the number of acceptable proof objects
output.
- For the Proof class, only acceptable proof objects count towards
winning.
Entrants who wish to be eligible for winning the Proof class must supply
representative samples of their proof objects before the competition.
The competition panel decides whether or not a system's proof
objects are acceptable.
- The proof objects from the winner of the Proof class are checked
by the panel.
If any of the proof objects are unacceptable, i.e., they are significantly
worse than the samples provided, then that system is retrospectively
disqualified from the Proof class.
Two categories in the SAT division
Various ATP systems and techniques have been observed to be particularly
well suited to problems with certain syntactic characteristics.
Due to this specialization, empirical evaluation of ATP systems must be
done in the context of problems that are reasonably homogeneous with respect
to the systems.
This motivates the syntactically defined divisions and categories of CASC.
Evaluation of ATP systems within such divisions and categories makes it
possible to say which systems work well for what types of problems.
This identifies specialist capabilities of ATP systems, while general
capabilities can be inferred from the separate division and category
capabilities.
Recent analysis of ATP system performances on problems in the SAT division
has shown that there is specialization between problems with equality and
problems without equality. It is therefore appropriate to separate these
two types of problems in the SAT division, so that the specialist capabilities
can be observed.
Therefore:
- The SAT division has two categories:
- The SNE category: SAT with no Equality
- The SEQ category: SAT with Equality
Non-availability of the TPTP version used
Over the last few CASCs there has been an increase in the extent to which
systems have been tuned to the TPTP problems that have been likely to be
eligible for use. This is problematic, as these sets of problems have
been small in some categories/divisions, and thus tuning has become close
to precomputation and storage of information for individual TPTP problems.
Various mechanisms have been proposed to limit the extent of the tuning
for CASC, e.g., a limit on the number of strategies from which a system
may select, based on problem characteristics. The most direct approach,
one that corresponds directly to ensuring that "work done using the TPTP
should also be useful in a broader context" (as discussed at the CASC
panel at CADE-17), is to include unseen problems in CASC. Previously this
was not possible, because the CASC organizers did not have the resources
to independently evaluate the difficulty, and hence eligibility, of new
problems. Rather they relied on performance data kindly submitted by
developers. That in turn made it necessary to make the problems available
in a TPTP release well in advance of CASC, and hence also available for
tuning. This situation has now changed, as thanks to the Australian
government the organizers now have almost dedicated use of a new 4 CPU
SUN, so that new problems can be evaluated privately, without making
the problems publically available.
Therefore:
- CASC-JC will use TPTP v2.4.0, or a patch thereof.
- The TPTP version used for the competition is not released until after
the system installation deadline, so that some problems have not been
previously seen by the entrants.
- The CASC-JC selection mechanism is biased to ensure that up to 50%
of the problems in each category are unseen. (The percentage
is dependent on how many new problems are eligible in the category.
It is also limited by the mechanism that prevents having
an excessive number of very similar problems in any category.)
This change means that it is disadvantagous to tune excessively
to previously available TPTP problems. Rather, systems that are
fundamentally strong, and whose strategies are useful beyond previously
used test problems, should succeed. Currently there are over 1600 problems
prepared for TPTP v2.4.0 that are not in TPTP v2.3.0. We hope that many
more will be contributed before TPTP v2.4.0 is finalized. (But note,
because of the random selection of problems in CASC and the mechanism
for avoiding many similar problems, it is impossible to bias the
competition by submitting large numbers of problems that are solved by
only some particular system.)
Eligibility of Non-standard Problems
The TPTP distinguishes versions of problems as one of standard, incomplete,
augmented, especial, or biased.
Previously, only standard and especial problems were eligible for use in CASC.
Incomplete and augmented problems were excluded, as there was a perceived
danger that the modifications that make the problems incomplete or
augmented were made with a particular ATP system in mind, i.e., the problems
were considered to be potentially biased.
Recent discussions have led to the conclusion that the modifications would
generally be effective for all ATP systems, and that the problems are
therefore not biased.
Therefore:
- All except biased problems are eligible.
Changes to required System Properties
The ATP systems entered into CASC are run in various ways by the
competition organizers, firstly to ensure that they have the required
properties, and of course
subsequently in the competition itself.
In these runs problems with various strange names are used, and it is
necessary that the ATP systems are robust to these.
Therefore:
- No assumptions may be made about the format of the problem file name.
In the competition divisions a wall clock time limit is imposed in
addition to the CPU time limit, to prevent very high memory usage that causes
swapping. The wall clock time limit is double the CPU time limit, and
is imposed by sending a SIGALRM when the wall clock time limit
is reached.
Therefore:
- The ATP systems that run on the general hardware have to be
interruptable by a SIGALRM signal, so that the wall clock
time limit can be imposed on each solution attempt.
In the MIX division, the system that outputs the most acceptable proof
objects (maximally one per problem solved) to stdout is
the winner of the Proof class.
It is necessary that the proofs be easily identifiable, so that their
acceptability can be verified.
Therefore:
- When outputing proofs for the MIX division proof class, the start
and end of the proof output must be identified by distinguished strings
(specified by the entrant).
Execution of the ATP systems on the general hardware is controlled by a
perl script.
The jobs are queued onto the workstations so that each workstation is
running one job at a time.
This means that multiple copies of the ATP systems have to be executable
concurrently on different machines but in the same (NFS cross mounted)
directory.
Previously the competition organizers have had to modify many of the ATP
systems to make this possible.
Now the responsibility has been transferred to the entrants.
- Multiple copies of the ATP systems have to be executable concurrently
on different machines but in the same (NFS cross mounted) directory.