The IJCAR ATP System Challenge - IASC
Design and Organization
By Geoff Sutcliffe, Shuwei Chen,
Konstantin Korovin.
Motivation
- For many years CASC has focussed on breadth of coverage over TPTP problems.
- Regular bursts of enthusiasm for "hard problems", e.g., solving TPTP problems with
rating 1.00.
- Evidence of incremental development (as noted in the early days of CASC); see the system
version numbers from recent CASCs.
- A new event, with new stimulation, focussed on hard problems provided by the entrants ...
the IJCAR ATP System Challenge (IASC, pronounced "I ask"!).
A New CASC
- The CADE ATP System Competition (CASC) will run in CADE years
- The IJCAR ATP System Challenge (IASC), will run in IJCAR years.
- IASC is modelled after soccer competitions, e.g., the UEFA Champions League, that have
a league phase and a knockout phase.
The IASC league phase has only one group containing all systems, and the knockout phase
starts with 8 systems.
Read all the details below.
Systems and Problems
- Each entrant submits 3+2 challenge problems to me
- I keep the challenge problems secret until the day of the challenge (live at IJCAR).
- Each problem must be accomanied by a proof in TPTP format, which I'll verify.
- It is OK to submit a problem from the TPTP.
- Use just one format of problems, either FOF (because there will be more systems), or TF0
without arithmetic (because it's a better language).
- Systems are submitted and tested for soundness, as in the competition.
Competition
- League Phase: Each system plays a game against each other system, using their first 3
challenge problems, i.e., 6 problems per game.
- The time limit will be T min wall clock (I think T=5 is OK for the resources
available - it will depend on the number of entrants - see the example numbers below).
- A solution must be a proof in TPTP format, which I'll verify.
- At the end of the League Phase eight systems are chosen by the panel, based on the
systems' performances.
While the numbers of problems solved in the games will be a strong influence on the
choice, the panel can use their discretion, e.g., in the case of ties.
- The eight systems are given a ranking, the best is rank-1, down to the worst rank-8.
- Knockout Phase:
- Round 1: Four games are played - rank-1 v rank-5, rank-2 v rank-6,
rank-3 v rank-7, rank-4 v rank-8.
In each game all 5 challenge problems of the two systems are used, i.e., 10 problems per
game.
- The time limit is 2*T mins.
- If any game results in a tie, a tie-break is run using problems randomly selected
from the challenge problems from other systems in Round 1, until one system gets
ahead.
- A solution must be a proof in TPTP format, which I'll verify.
- The winners are winner-1-5, winner-2-6, etc.
- Round 2: Two games are played - the winner-1-5 v winner-3-7, and winner-2-6 v
winner-4-8.
- Same rules as Round 1.
- The winners are winner-1-5-3-7, winner-2-6-4-8, etc.
- Round 3 - The Finals: One game is played - winner-1-5-3-7 v winner-2-6-4-8.
- Same rules as Round 1.
- The winner is the "Champion of the Challange"
- Human Phase:
- This is for scientific interest, and is not a competition.
- All the problems are released online.
- Developers are given 1 week to tweek their systems.
- A second League Phase is run, using all 5 problems from all systems.
The same time limit applies as the original League Phase.
- The results will be reported in the Journal report of the Challenge.
A detailed evaluation with system decriptions by the developers with be published
separately.
Ideas and Questions
- Maybe increase the time limit as the Knockout Rounds progress?
- I'm nervous about having humans in the loop, because they introduce variability that might
be contraversial.
Having the systems submitted in advance, tested for soundness, then used consistently in
the competition is cleaner.
Practical Notes for Geoff and Other Organizers
- The League Phase can be run in parallel on the 30 nodes of StarExec.
If there are N systems, then there are N*6 system-problem pairs per game, but a system
needs to run its own 3 problems only once.
Maximally each pair will take T mins (and I expect a higher fraction unsolved than in the
traditional CASC because systems will submit problems that are "hard" for other systems).
- Each system has to attempt 3*N problems. That will take
3*N(problems) * N(systems) * T(mins)/30(nodes) mins, e.g.,
if N=10 and T=5, 30 min.
- In order to make the Knockout Phase exciting, like a real sports event where a system can
play only one game at a time, I'll use only two nodes per game (one per system).
Each game has 10 problems, 2 systems. So a game will take
10*2(problems * 2(systems) * T(mins)/2(nodes) mins, e.g.,
100 mins for T=5.
- The games in each Knockout Round are run in parallel, so for three rounds that'll take
maximally 300mins.
But each winning system in Rounds 2 and 3 has already run its own 5 problems, so it's less
than that.