Entrants' System Descriptions

CASC-J13

Connect++ 0.7.2
CSI++ 1.0
Drodi 4.1.1
E 3.5.1
FindProof 0.1
FMB4J 0.1
LEO-II 1.7.0
Leo-III 1.8.0
Mace4 2026-6A
mrs 0.2.0
Prover9 1109a
Prover9 2026-6A
SATResetCoP 1.0
SPASS-SCL 0.1.1
SUPr 1.0
Twee 2.7
Vampire 4.8
Vampire 5.0
Vampire 5.0.1
VIP 1.718
Zipperposition 2.1.9999

ProoVer 2026

CheckProof 0.1
GAPT 2.20
GDV 2.0
GDV-LP 2.0
mrs-proover 0.2.0
Nörgler 1.1
ProofCheck 1.0
ProofGuard 1.0
PyCheck 0.1
VaLeaDate 0.1

CASC-J13

Connect++ 0.7.2

Dr Sean B Holden
University of Cambridge, United Kingdom

Architecture

Connect++ Version 0.6.1 is the current publicly-available release of the Connect++ prover for first-order logic, introduced in [Hol23]. It is a connection prover using the same calculus and inference rules as leanCoP [Ott10]. That is, it uses the connection calculus with regularity and (optionally) with lemmas.

Strategies

The (default) proof search is essentially the same as that used by leanCoP Version 2.1. That is, it employs a search trying left branches of extensions first, with restricted backtracking as described in [Ott10], and using the leftmost literal of the relevant clause for all inference rules. It does not attempt to exactly reproduce leanCoP's search order. When not using a schedule, command-line options allow many of these choices to be altered; for example allowing backtracking restriction for extensions while still fully exploring left branches, or other modification of the backtracking restrictions. If run with its default schedule it uses one similar to that of leanCoP Version 2.1, including the various applications of definitional clause conversion. Alternatively it can read and apply arbitrary schedules if desired. At present Connect++ does not attempt to tune its proof search based on the characteristics of individual problems.

This is the first version of Connect++ that, in addition to using connection calculus, employs ideas from [RR21], for periodically grounding clauses and using a SAT solver to complete a proof by finding a finite unsatisfiable grounded set.

Implementation

Connect++ is implemented in C++ - minimally the 2017 standard - and built using cmake. Libraries from the Boost collection are used for parsing, hashing, some random number generation, and processing of command-line options. The system has a built-in proof checker for verifying its own output, but also includes a standalone checker implemented in SWI Prolog. Since version 0.7.0 it has also implemented the standard TPTP format for recording connection proofs proposed in [SHB26]. During the build process cmake reads a specified version of the CaDiCaL [BF+24] SAT solver from its GitHub repository, builds a competition version as a library, and links directly to this to perform SAT-solving.

As substitutions need to apply to an entire proof tree the system only represents each variable once and shares the representation, simultaneously maintaining a stack of substitutions making removal of substitutions under backtracking trivial. It also creates subterms only once and shares them; these are indexed allowing constant-time lookup, and nothing is ever removed from the index, meaning that if a term is constructed again after its initial construction no new memory allocation takes place and the term itself is obtained in constant time. At the same time, fresh copies of variables are recycled under backtracking - these two design choices appear to interact very effectively, as the recycling of the variables seems to make it quite likely that subterms already in the index can be reused. As new copies of clauses are often needed, clauses are themselves also retained during backtracking and reused where possible to minimize the need to make new ones.
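The sharing scheme described above can be illustrated with a minimal Python sketch. It is not Connect++ code (the system is written in C++), and the names `TermBank` and `Substitution` are hypothetical. It assumes only what the text states: each term is built once and found again by index lookup, and substitutions are kept on a stack so that backtracking simply pops them.

```python
class TermBank:
    """Hypothetical index of shared terms: each distinct term is built once."""

    def __init__(self):
        self._index = {}   # (symbol, argument ids) -> term id
        self._terms = []   # term id -> (symbol, argument ids)

    def make(self, symbol, *args):
        key = (symbol, args)
        term_id = self._index.get(key)
        if term_id is None:          # first construction: allocate once
            term_id = len(self._terms)
            self._terms.append(key)
            self._index[key] = term_id
        return term_id               # later constructions: lookup only

    def size(self):
        return len(self._terms)


class Substitution:
    """Variable bindings plus a trail, so undoing bindings is a series of pops."""

    def __init__(self):
        self.binding = {}  # variable name -> term id
        self.trail = []    # variables in the order they were bound

    def bind(self, var, term_id):
        # Assumes var is currently unbound, as it would be after dereferencing.
        self.binding[var] = term_id
        self.trail.append(var)

    def mark(self):
        return len(self.trail)

    def backtrack(self, mark):
        while len(self.trail) > mark:
            del self.binding[self.trail.pop()]


bank = TermBank()
a = bank.make("a")
first = bank.make("f", a)
second = bank.make("f", a)   # no new entry is created for the repeated term
print(first == second, bank.size())
```

Nothing is ever removed from the index in this sketch, which mirrors the design choice described above: a term rebuilt after backtracking costs one dictionary lookup.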

By default a standard recursive unification algorithm is used, but a polynomial-time version is optional.

If a schedule is used, it is assumed that different approaches to definitional clause conversion may be needed - typically all clauses, conjecture clauses only, or no clauses. As these choices can lead to different matrices, and the conversion itself can be expensive, the system stores and switches between the different matrices rather than converting multiple times.

As the system was developed with two guiding aims - to provide a clear implementation easily modified by others, somewhat in the spirit of MiniSAT [ES04], and to support experiments in machine learning for guiding the proof search - the implementation avoids the use of direct recursion in favour of a pair of stacks and an iterative implementation based on these, as described in [Hol23]. This allows complete and arbitrary control of backtracking restriction and other modifications to the proof search using typically quite simple modifications to the code.

The source and documentation are available at

    http://www.cl.cam.ac.uk/~sbh11/connect++.html

Expected Competition Performance

The system remains at an early stage of development and is currently undergoing systematic profiling and improvement. It is not expected at this stage to be competitive with the state-of-the-art, but is expected to be a distinct improvement on last year's version 0.7.0.

CSI++ 1.0

Guoyan Zeng
Xihua University, China

Architecture

CSI++ is an automated theorem prover for first-order logic, integrating contradiction separation inference with superposition-based reasoning. The system combines the CSI inference framework and the E (or GKC) prover in a cooperative architecture. CSI is a multi-layer inverse and parallel prover based on the Contradiction Separation Based Dynamic Multi-Clause Synergized Automated Deduction (S-CS) framework [XL+18], while E and GKC mainly provide efficient reasoning capabilities for superposition, equality, and resolution. The cooperation mechanism works as follows. CSI and E (or GKC) are first applied sequentially to the given problem. If either component succeeds, the proof search terminates successfully. Otherwise, CSI generates additional inferred clauses, especially clauses with at most two literals and unit clauses, which are then supplied to the superposition reasoning component together with the original clause set for further proof search. This integration is intended to combine the strengths of both reasoning styles. CSI is effective at generating useful unit clauses and simplifying the search space, while the superposition component provides strong equality handling ability. Their cooperation aims to improve performance on difficult first-order theorem proving problems.

Strategies

The CSI inference component adopts strategies similar to those used in the standalone CSE 1.6 system, including clause/literal selection, strategy scheduling, and CSC strategies. In the integrated system, dedicated equality handling strategies of the original CSE component are disabled in favor of the superposition-based equality reasoning module. The main additional strategies in the integrated system include:

Complementary ratio strategy: a measure and calculation method for estimating complementary relations between clauses, guiding clause selection and deduction path planning effectively.
Portfolio strategy: different clause selection schemes are applied during different stages of proof search.
Multi-goal extraction strategy: a literal selection strategy that extracts the multiple goal literals inside the S-CS and is used to obtain the inverse deduction goal clause.

Implementation

CSI++ is implemented mainly in C++, while Java is used for batch problem execution. The coordination and job dispatch between the contradiction separation inference module and the superposition reasoning module are implemented in C++.

Expected Competition Performance

We expect CSI++ to solve some difficult problems that cannot be solved by traditional superposition provers alone and to achieve competitive overall performance.

Acknowledgement: Development of CSI++ has been partially supported by the National Natural Science Foundation of China (NSFC) (Grant No. 62106206, 62206227), and the Key Project of Sichuan Science and Technology Innovation and Entrepreneurship Seeding Program (Grant No. 2024JDRC0084).

Drodi 4.1.1

Oscar Contreras
Amateur Programmer, Spain

Architecture

Drodi 4.1.1 is a very basic and lightweight automated theorem prover. It implements the following main features:

Ordered resolution and equality paramodulation inferences as well as demodulation and some other standard simplifications.
A basic implementation of clausal normal form conversion as in [NW01].
AVATAR architecture with a SAT solver [Vor14].
Limited Resource Strategy [RV03].
Discrimination trees.
KBO, non-recursive and lexicographic reduction orderings. KBO has been rewritten using the polynomial-time algorithm in [Loe06].
Literal selection including lookahead as in [HR+16].
SInE distance for clauses and symbols as in [Sud19].
Layered clause selection as in [GS20].
Stochastic strategy inspired by [Sud22].
Goal transformation for Unit Equality problems as in [Sma21].
SAT solver based subsumption and subsumption resolution inspired by [CR+24].
Learning data is now embedded in the executable. An external file with learning data is no longer required.
Drodi produces a (hopefully) verifiable proof in TPTP format.

Strategies

Drodi has a fair number of selectable strategies including but not limited to the following:

Otter, Discount and Limited Resource Strategy [RV03] saturation algorithms.
A basic implementation of AVATAR architecture [Vor14].
Several literal and term reduction orderings.
Several literal selection options [HR+16].
Several layered clause selection heuristics with adjustable selection ratios [GS20].
Classical clause relevancy pruning.
Drodi V4 has a new strategy portfolio inspired by (but not identical to) [BS23]. The strategies now have unequal time slices, and the set of strategies used depends on the problem features detected during preprocessing.
Some strategies are run a second time with a previously applied randomization to the problem [Sud22].
Drodi can generate learning data from successful proofs and use the data to guide clause selection. The approach is based on the enhanced ENIGMA method. However, unlike ENIGMA, the learning data is completely general and can be used with any kind of problem. This generality allows the same learning data to be used in both the FOF and UEQ CASC competition divisions. The learning data is generated over a set of TPTP problems before the CASC competition using built-in Drodi functions that include an L2 Support Vector Machine. Drodi's integrated learning functions are a generalization of ENIGMA [JU17, JU18]. Literal polarity, equality, Skolem and variable occurrences are stored in clause feature vectors. Unlike ENIGMA, instead of storing the specific functions and predicates themselves, only the SInE distance and arity of functions and non-equality predicates are stored in clause feature vectors, with different features assigned to predicates and functions.

Implementation

Drodi is implemented in C. It includes discrimination trees and hashing indexing. All the code is original, without special code libraries or code taken from other sources.

Expected Competition Performance

Due to the SAT-based subsumption and subsumption resolution, Drodi 4.1.1 solves around 5% more FOF problems than last year's version. Due to the new goal transformation, Drodi solves around 30% more UEQ problems than last year's version. All syntax problems in the proofs have hopefully also been corrected. Better results than last year are expected, but probably not enough to improve the ranking position.

E 3.5.1

Stephan Schulz
DHBW Stuttgart, Germany

Architecture

E [Sch02, Sch13, SCV19] is a purely equational theorem prover for many-sorted first-order logic with equality, and for monomorphic higher-order logic. It consists of an (optional) clausifier for pre-processing full first-order formulae into clausal form, and a saturation algorithm implementing an instance of the superposition calculus with negative literal selection and a number of redundancy elimination techniques, optionally with higher-order extensions [VB+21, VBS23]. E is based on the DISCOUNT-loop variant of the given-clause algorithm, i.e., a strict separation of active and passive facts. No special rules for non-equational literals have been implemented. Resolution is effectively simulated by paramodulation and equality resolution. As of E 2.1, PicoSAT [Bie08] can be used to periodically check the (on-the-fly grounded) proof state for propositional unsatisfiability.

Strategies

Proof search in E is primarily controlled by a literal selection strategy, a clause selection heuristic, and a simplification ordering. The prover supports a large number of pre-programmed literal selection strategies. Clause selection heuristics can be constructed on the fly by combining various parameterized primitive evaluation functions, or can be selected from a set of predefined heuristics. Clause evaluation heuristics are based on symbol-counting, but also take other clause properties into account. In particular, the search can prefer clauses from the set of support, or containing many symbols also present in the goal. Supported term orderings are several parameterized instances of Knuth-Bendix-Ordering (KBO) and Lexicographic Path Ordering (LPO), which can be lifted in different ways to literal orderings.

For CASC-J13, E implements a two-stage multi-core strategy-scheduling automatic mode. The total CPU time available is broken into several (unequal) time slices. For each time slice, the problem is classified into one of several classes, based on a number of simple features (number of clauses, maximal symbol arity, presence of equality, presence of non-unit and non-Horn clauses, possibly presence of certain axiom patterns, ...). For each class, a schedule of strategies is greedily constructed from experimental data as follows: The first strategy assigned to a schedule is the one that solves the most problems from this class in the first time slice. Each subsequent strategy is selected based on the number of solutions on problems not already solved by a preceding strategy. The strategies are then scheduled onto the available cores and run in parallel.
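The greedy construction described above can be sketched in Python as follows. This is an illustration only, not E's tooling: the function name `greedy_schedule`, the data layout, and the strategy and problem names are hypothetical, and the sketch simplifies by assuming one set of solved problems per strategy rather than separate data for each (unequal) time slice.

```python
def greedy_schedule(solved_by, num_slices):
    """Pick strategies greedily by the number of not-yet-solved problems they add.

    solved_by maps a strategy name to the set of problems of one class that
    the strategy solves within a time slice (hypothetical experimental data).
    """
    remaining = set().union(*solved_by.values())  # problems still unsolved
    unused = dict(solved_by)
    schedule = []
    while unused and remaining and len(schedule) < num_slices:
        # Ties are broken alphabetically so that the result is deterministic.
        best = max(sorted(unused), key=lambda s: len(unused[s] & remaining))
        gain = unused.pop(best) & remaining
        if not gain:          # no strategy solves anything new
            break
        schedule.append(best)
        remaining -= gain
    return schedule


data = {
    "s1": {"p1", "p2", "p3"},
    "s2": {"p3", "p4"},
    "s3": {"p1", "p2"},
    "s4": {"p5"},
}
print(greedy_schedule(data, 3))  # ['s1', 's2', 's4']
```

In the example, `s3` is never chosen because everything it solves is already covered by `s1`, which is the point of counting only problems not solved by a preceding strategy.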

About 1615 different strategies have been thoroughly evaluated on various TPTP versions, 420 of which made it into the prover either into the automatic mode or into at least one schedule.

Implementation

E is built around perfectly shared terms, i.e., each distinct term is only represented once in a term bank. The whole set of terms thus consists of a number of interconnected directed acyclic graphs. Term memory is managed by a simple mark-and-sweep garbage collector. Unconditional (forward) rewriting using unit clauses is implemented using perfect discrimination trees with size and age constraints. Whenever a possible simplification is detected, it is added as a rewrite link in the term bank. As a result, not only terms, but also rewrite steps are shared [Sch25]. Subsumption and contextual literal cutting (also known as subsumption resolution) are supported using feature vector indexing [Sch13]. Superposition and backward rewriting use fingerprint indexing [Sch12], a new technique combining ideas from feature vector indexing and path indexing. Finally, LPO and KBO are implemented using the elegant and efficient algorithms developed by Bernd Löchner in [Loe06, Loe06]. The prover and additional information are available at

    https://www.eprover.org

Expected Competition Performance

E 3.5 has a new strategy hacked into it manually at the last moment. If this works out, we expect performance to be at least as good as last year's version, though it might do better than last year in UEQ. The system is expected to perform well in most proof classes, but will at best complement top systems in the disproof classes.

FindProof 0.1

Nik Murzin
Wolfram Institute, USA

Architecture

FindProof [Mur26] is an equational theorem prover for unit-equality problems, based on unfailing (ordered) Knuth-Bendix completion [BDP89]. It continues the Waldmeister line of provers [HL02], which won the CASC unit-equality division at every competition in which it was ranked between 1997 and 2014; the completion loop, critical-pair selection, queue management, and normalisation reproduce the Waldmeister design, and on classical algebraic problems the default configuration reproduces Waldmeister's critical-pair selection sequence exactly.

The axioms are saturated towards a convergent rewrite system while the goals are kept normalised, and a proof is reported when the two sides of every goal are joined; a conjecture given as several negative units is proved as a multi-goal conjunction against one saturation. Equations that cannot be oriented by the reduction ordering are used unfailingly, so that both faces are superposed, which preserves refutational completeness. The Knuth-Bendix ordering and the lexicographic path ordering are supported, with symbol precedences generated automatically. Redundant critical pairs are discarded by ground-joinability testing, the connectedness criterion, forward and backward subsumption and demodulation, and right-hand-side interreduction. Proofs are reconstructed from the recorded completion trace, the closing rewrite chain re-derived against the final rule set, and rendered as a TPTP CNFRefutation.

The system is not restricted to unit equality. Full first-order input is handled by equationalization, in which propositions are Skolemized and encoded as equations over a Boolean-algebra axiomatisation, proved by the same completion core, and decoded back into a predicate-logic proof. The present entry competes in the UEQ division.

Strategies

The search is ordered by a weighted critical-pair selection heuristic over a priority queue, with a Waldmeister-style selection ratio that interleaves a first-in-first-out pick at a fixed cadence, so that fairness, and hence completeness, is retained. The default configuration is the Waldmeister strategy stack. The full strategy surface, some fifty-five switches covering the reduction ordering, critical-pair weighting, weight limits, redundancy criteria, goal treatment, and queue management, is exposed as command-line options, and named configurations modelled on the published defaults of Waldmeister, Twee, and Vampire are provided. In competition a fixed schedule of such configurations is run, each in its own child process under a slice of the wall-clock limit, stopping at the first proof. The schedule and its slice allocation are derived from coverage measurements on the problem sets of previous competitions; strategy selection is not tuned to individual problems.
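The interleaving of weighted and first-in-first-out selection can be sketched as below. This is a generic Python illustration of a selection ratio, not FindProof's C implementation; the class name `RatioQueue`, the `ratio` parameter, and the weights are hypothetical. Every `ratio`-th pick takes the oldest waiting item, so no item can be postponed forever by a stream of lighter ones.

```python
import heapq
from collections import deque


class RatioQueue:
    """Weighted queue in which every ratio-th pick is the oldest item (FIFO)."""

    def __init__(self, ratio):
        self.ratio = ratio
        self.heap = []       # entries (weight, age, item); lightest first
        self.fifo = deque()  # entries (age, item); oldest first
        self.done = set()    # ages of items already selected
        self.age = 0
        self.picks = 0

    def push(self, weight, item):
        # Ages are unique, so heap comparisons never reach the item itself.
        heapq.heappush(self.heap, (weight, self.age, item))
        self.fifo.append((self.age, item))
        self.age += 1

    def pop(self):
        self.picks += 1
        if self.picks % self.ratio == 0:
            while self.fifo:                      # fairness pick: oldest live item
                age, item = self.fifo.popleft()
                if age not in self.done:
                    self.done.add(age)
                    return item
        else:
            while self.heap:                      # heuristic pick: lightest live item
                _, age, item = heapq.heappop(self.heap)
                if age not in self.done:
                    self.done.add(age)
                    return item
        raise IndexError("pop from empty RatioQueue")


queue = RatioQueue(ratio=3)
for weight, name in [(5, "a"), (1, "b"), (3, "c"), (2, "d")]:
    queue.push(weight, name)
print([queue.pop() for _ in range(4)])  # ['b', 'd', 'a', 'c']
```

The heavy item `a` is selected third, ahead of the lighter `c`, because the third pick is the FIFO one.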

Implementation

FindProof is a single self-contained C program with no external dependencies. Terms are represented as flatterms over a term bank; rewriting and subsumption retrieval use perfect discrimination trees; the critical-pair queue is a binary heap over a compact twelve-byte critical-pair encoding; and the completion trace is kept off-heap in a packed store from which proof objects are reconstructed on demand. Problems in TPTP CNF syntax [Sut24] are read directly, with include directives resolved against the problem directory and the TPTP environment variable, and positive units taken as axioms and negative units as goals. Result lines follow the SZS ontology, with Satisfiable claimed only under a strategy with no completeness-breaking pruning and a ground goal. An optional Wolfram Language layer drives the same engine and adds proof objects verified by internal replay and the predicate-logic equationalization; it is a design feature, not a dependency, and the competition binary is the standalone C build. The system is available from the author.

Expected Competition Performance

FindProof is entered in the UEQ division. The Waldmeister default solves the standard equational classes, including Boolean and ternary Boolean algebra, group and ring theory, lattice theory, and combinatory logic, and the schedule adds coverage from ground-joinability and connectedness configurations on the associative-commutative classes where pure completion saturates slowly. The system is expected to trail the leading portfolio provers, whose unit-equality schedules have benefited from years of tuning, and a mid-field placement is anticipated.

FMB4J 0.1

Michael Rawson
University of Southampton, United Kingdom

    https://github.com/jh4n23/FMB4J

Expected Competition Performance

This system was implemented as an undergraduate project (!). It is not the state of the art in finite model building. However, the use of Vampire as a frontend boosts performance considerably. It can solve around a third of the satisfiable fragment of TPTP in 1 second.

LEO-II 1.7.0

Alexander Steen
University of Greifswald, Germany

Architecture

LEO-II [BP+08], the successor of LEO [BK98], is a higher-order ATP system based on extensional higher-order resolution. More precisely, LEO-II employs a refinement of extensional higher-order RUE resolution [Ben99]. LEO-II is designed to cooperate with specialist systems for fragments of higher-order logic. By default, LEO-II cooperates with the first-order ATP system E [Sch02]. LEO-II is often too weak to find a refutation amongst the steadily growing set of clauses on its own. However, some of the clauses in LEO-II's search space attain a special status: they are first-order clauses modulo the application of an appropriate transformation function. Therefore, LEO-II launches a cooperating first-order ATP system every n iterations of its (standard) resolution proof search loop (e.g., 10). If the first-order ATP system finds a refutation, it communicates its success to LEO-II in the standard SZS format. Communication between LEO-II and the cooperating first-order ATP system uses the TPTP language and standards.

Strategies

LEO-II employs an adapted "Otter loop". Moreover, LEO-II uses some basic strategy scheduling to try different search strategies or flag settings. These search strategies also include some different relevance filters.

Implementation

LEO-II is implemented in OCaml 4, and its problem representation language is the TPTP THF language [BRS08]. In fact, the development of LEO-II has largely paralleled the development of the TPTP THF language and related infrastructure [SB10]. LEO-II's parser supports the TPTP THF0 language and also the TPTP languages FOF and CNF.

Unfortunately the LEO-II system still uses only a very simple sequential collaboration model with first-order ATPs instead of using the more advanced, concurrent and resource-adaptive OANTS architecture [BS+08] as exploited by its predecessor LEO.

The LEO-II system is distributed under a BSD style license, and it is available from

    http://www.leoprover.org

Expected Competition Performance

LEO-II is not actively being developed anymore, hence there are no expected improvements to last year's CASC results.

Leo-III 1.8.0

Alexander Steen
University of Greifswald, Germany

Architecture

Leo-III [SB21], the successor of LEO-II [BP+08], is a higher-order ATP system based on extensional higher-order paramodulation with inference restrictions using a higher-order term ordering. The calculus contains dedicated extensionality rules and is augmented with equational simplification routines that have their intellectual roots in first-order superposition-based theorem proving. The saturation algorithm is a variant of the given clause loop procedure inspired by the first-order ATP system E.

Leo-III cooperates with external first-order ATPs that are called asynchronously during proof search; a focus is on cooperation with systems that support typed first-order (TFF) input. For this year's CASC E [Sch02, Sch13] is used as external system. However, cooperation is in general not limited to first-order systems. Further TPTP/TSTP-compliant external systems (such as higher-order ATPs or counter model generators) may be included using simple command-line arguments. If the saturation procedure loop (or one of the external provers) finds a proof, the system stops, generates the proof certificate and returns the result.

Strategies

Leo-III comes with several configuration parameters that influence its proof search by applying different heuristics and/or restricting inferences. These parameters can be chosen manually by the user on start-up. There is no time slicing, no strategy scheduling, or similar.

Implementation

Leo-III utilizes and instantiates the associated LeoPARD system platform [WSB15] for higher-order (HO) deduction systems implemented in Scala (currently using Scala 2.13 and running on a JVM with Java >= 11). The prover makes use of LeoPARD's data structures and implements its own reasoning logic on top. A hand-crafted parser is provided that supports all TPTP syntax dialects. It converts its produced concrete syntax tree to an internal TPTP AST data structure which is then transformed into polymorphically typed lambda terms. As of version 1.1, Leo-III supports all common TPTP dialects (CNF, FOF, TFF, THF) as well as their polymorphic variants [BP13, KRS16]. Since version 1.6.X (X >= 0) Leo-III also accepts non-classical problem input represented in non-classical TPTP, see ...

    https://tptp.org/NonClassicalLogic/

The term data structure of Leo-III uses a polymorphically typed spine term representation augmented with explicit substitutions and De Bruijn-indices. Furthermore, terms are perfectly shared during proof search, permitting constant-time equality checks between alpha-equivalent terms.
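The effect of De Bruijn indices on alpha-equivalence can be shown with a small untyped Python sketch. It does not reproduce Leo-III's Scala data structures (no types, spines, or explicit substitutions), and the tuple encoding and the names `to_de_bruijn` and `share` are hypothetical. Bound variable names are replaced by binder distances, so alpha-equivalent terms become identical, and a sharing table then reduces their comparison to an identity check.

```python
def to_de_bruijn(term, bound=()):
    """Convert a named lambda term to a nameless one.

    Input terms are tuples: ("var", name), ("lam", name, body), ("app", f, a).
    Index 0 refers to the innermost enclosing binder.
    """
    kind = term[0]
    if kind == "var":
        name = term[1]
        if name in bound:
            return ("bvar", bound.index(name))
        return ("free", name)
    if kind == "lam":
        _, name, body = term
        return ("lam", to_de_bruijn(body, (name,) + bound))
    if kind == "app":
        _, fun, arg = term
        return ("app", to_de_bruijn(fun, bound), to_de_bruijn(arg, bound))
    raise ValueError("unknown term kind: %r" % (kind,))


_shared = {}


def share(term):
    """Return the single stored copy of a structurally equal nameless term."""
    return _shared.setdefault(term, term)


t1 = ("lam", "x", ("lam", "y", ("app", ("var", "x"), ("var", "y"))))
t2 = ("lam", "a", ("lam", "b", ("app", ("var", "a"), ("var", "b"))))
print(share(to_de_bruijn(t1)) is share(to_de_bruijn(t2)))  # True
```

Once both terms are in the table, the `is` test compares object identity only, independent of term size.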

Leo-III's saturation procedure may at any point invoke external reasoning tools. To that end, Leo-III includes an encoding module which translates (polymorphic) higher-order clauses to polymorphic and monomorphic typed first-order clauses, whichever is supported by the external system. While LEO-II relied on cooperation with untyped first-order provers, Leo-III exploits the native type support in first-order provers (TFF logic) to remove clutter during translation and, in turn, to make external cooperation more effective.

Leo-III is available on GitHub:

    https://github.com/leoprover/Leo-III

Expected Competition Performance

Version 1.8.0 is, for all intents and purposes of CASC, equivalent to the version from previous years, except that some minor proof output bugs were fixed, and the support for reasoning in various quantified non-classical logics (not relevant to CASC) was improved. We do not expect Leo-III to be strongly competitive against more recent higher-order provers as Leo-III does not implement several standard features of effective systems (including time slicing and proper axiom selection).

Mace4 2026-6A

Jeff Machado
Independent Researcher, USA

Architecture

Mace4 [McCURL], version 2026-6A [ML26], is a finite model finder for first-order logic with equality, based on William McCune's LADR (Library for Automated Deduction Research) codebase. Given a set of clauses or formulas, Mace4 searches for a finite interpretation that satisfies them. For a fixed domain size it ground-instantiates the problem and applies a decision procedure based on ground rewriting modulo the input equalities, together with Davis-Putnam-style backtracking and negative propagation. Equalities are preserved through the search rather than flattened to propositional variables, so reasoning about function table cells stays first-order until a contradiction or a model is reached.

The set(arithmetic) command optionally enables an integer arithmetic interpretation of designated symbols, so that common operations such as sum and product are interpreted as such. This is provided as a convenience and is off by default.

Strategies

Mace4 starts at a small domain size (default 2) and iterates upward, building a ground model for each domain in turn, until a satisfying interpretation is found, the configured end_size is reached, or the wall-clock budget expires. Smallest-first iteration is effective on the finite-model problem profile, where most interpretations of interest are small and the search terminates quickly once the right domain size is reached. Within each domain, Mace4 prunes the search with least-number-heuristic (LNH) isomorphism reduction [Zha96] to break domain symmetry, ground propagation, negative-assignment propagation (the neg_assign and neg_assign_near flags), and negative elimination (neg_elim_near). Propagation proceeds by ground rewriting and derivation of negated equalities, detecting inconsistency early. A cell-overflow safety break stops the domain iterator when the function tables for the next domain size would exceed available memory, preventing runaway expansion on problems with no small finite model. These pruning techniques are enabled by default.
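Smallest-first iteration over domain sizes can be illustrated with a brute-force Python sketch. It is not Mace4's algorithm: there is no ground rewriting, propagation, or LNH symmetry reduction, and every operation table is simply enumerated. The function names and the example property (an idempotent, commutative, non-associative binary operation) are hypothetical choices made for the illustration.

```python
from itertools import product


def find_smallest_model(holds, start_size=2, end_size=3):
    """Try domain sizes in increasing order; return (size, table) or None.

    A candidate interpretation is a table for one binary function, stored as
    a dict from argument pairs to values. Enumeration is exhaustive, so this
    is only feasible for very small domains.
    """
    for n in range(start_size, end_size + 1):
        domain = range(n)
        cells = [(x, y) for x in domain for y in domain]
        for values in product(domain, repeat=len(cells)):
            table = dict(zip(cells, values))
            if holds(table, n):
                return n, table
    return None  # configured size range exhausted


def idempotent_commutative_nonassociative(f, n):
    d = range(n)
    return (
        all(f[x, x] == x for x in d)
        and all(f[x, y] == f[y, x] for x in d for y in d)
        and any(f[f[x, y], z] != f[x, f[y, z]] for x in d for y in d for z in d)
    )


size, table = find_smallest_model(idempotent_commutative_nonassociative)
print(size)  # 3
```

No interpretation of size 2 satisfies the property, since the only idempotent commutative operations on two elements are minimum and maximum, both associative. The search therefore moves on to size 3, where a model exists.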

For parallel search, the -cores N option (and -casc, which implies -cores 8) directs Mace4 to race up to N domain sizes at once rather than running them sequentially. Because the smallest model is the desired answer, Mace4 retains the smallest model found so far, keeps smaller sizes running while wall-clock time remains, and reports that model when it is proven minimal or when the time limit is reached. Racing sizes makes productive use of the available cores under a wall-clock limit, and the keep-smallest rule preserves the minimality of the reported interpretation.

Mace4 applies a single fixed configuration to every problem in a division, with the same command line for all problems. It performs no per-problem tuning, uses no machine learning, and stores no precomputed information about individual problems or their solutions. All techniques are general purpose.

Implementation

Mace4 is coded in C (C89), using the LADR libraries shared with Prover9. The 2026 version adds approximately 2,000 lines to McCune's 2009 Mace4 sources, concentrated in TPTP input handling, SZS status output, the TFI model emitter, the parallel domain-size scheduler, and competition-mode command-line plumbing, while remaining backward compatible with existing LADR input files. The ground search uses an estack-based representation of partial assignments to cell variables. The parallel scheduler is a single Mace4 process that forks one child per domain size; the children share the parsed and clausified problem in memory, and the parent collects results, retains the smallest model, and relays it for the winning size. The single-size search core is unchanged from McCune's original design; only this orchestration is new. Checkpoint and resume support lets the user halt and resume a search and recover from unplanned outages.

For TPTP input, Mace4 emits SZS status lines (Satisfiable for a satisfiable axiom set, CounterSatisfiable for a conjecture refuted by a finite model) and TFI-format interpretations in TPTP syntax, following the specification of Sutcliffe et al. [SS+26]. Each interpretation gives the domain together with the function and predicate denotations, wrapped in SZS output delimiters for verification by tools such as AGMV. The -ladr_out option additionally produces the original LADR-format model for tools that consume the legacy representation.

Mace4 compiles and runs at the command line with simple, documented commands, and installation instructions are given on the website. It has been tested on a variety of platforms and runs on essentially all commonly used operating systems, including Linux, Windows, and macOS 10.4 or later. Mace4 will also run directly in a modern web browser, with no download or compilation required; the code is fetched once and then executes on the local device, so an input file prepared in any text editor can be run and the results saved or verified without using the command line.

Mace4 and its HTML-formatted manual are available at

    https://prover9.org

Expected Competition Performance

In the FNT division, Mace4 is expected to be effective on problems with small finite models: its home territory of algebra (group theory, ring theory, lattice theory) and combinatorics, where smallest-first iteration reaches the right domain size quickly and the parallel scheduler covers several sizes at once. Against the modern SAT-based and finite-model-building systems, whose propositional encodings scale to larger domains, Mace4 is expected to trail overall; internal evaluation at competition-style limits supports a modest overall solve count concentrated on the small-model profile. For problems that require large domains or have no finite model, Mace4 will not produce a result within the time limit. It reports SZS Timeout when the time budget is exhausted, SZS GaveUp when the configured domain range is exhausted or the next domain size would exceed available memory, and SZS MemoryOut when memory is exhausted during search.

mrs 0.2.0

Olivier Roland
Independent Researcher, France

Architecture

mrs 0.2.0 is an automated theorem prover for first-order logic with equality, implementing the superposition calculus [BG94] within an Otter-style given-clause loop [Sch02]. The core inference rules are ordered binary resolution, factoring, equality resolution, equality factoring, and superposition (into both terms and literals), oriented by a Knuth-Bendix Ordering (KBO) or Lexicographic Path Ordering (LPO) with dynamic, rarity-based symbol precedence. Clause splitting on non-Horn clauses is delegated to the AVATAR architecture [Vor14], backed by the CaDiCaL CDCL SAT solver. EPR-structured problems are handled lazily through AVATAR splitting rather than eager ground pre-expansion, which previously caused memory exhaustion on large Effectively Propositional problems.

Term retrieval for unification and matching uses perfect discrimination trees that track variable bindings through traversal [McC92], eliminating false-positive candidates at the index level. Subsumption and subsumption resolution are accelerated by Feature Vector Indexing [Sch04]. Redundancy elimination includes forward/backward demodulation, forward/backward subsumption, subsumption resolution, condensation, tautology deletion, and global subsumption with orphan elimination (removal of the entire derived subtree of a clause that is later found to be subsumed).

mrs runs a portfolio of independent search strategies concurrently on separate threads, one per available CPU core, each maintaining its own clause set; the first strategy to find a refutation signals the others to stop via a shared atomic flag. Strategies additionally share a pool of derived unit equalities across threads to accelerate sibling searches without duplicating work; each shared entry carries its full justifying ancestor chain back to the original problem's axioms, so a clause adopted from a sibling thread is spliced into the receiving thread's own proof record with a complete, locally self-consistent derivation rather than an opaque fact.
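
The first-to-finish protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `threading.Event` plays the role of the shared atomic flag, the two strategies are hypothetical toys, and the sharing of unit equalities between threads is not modelled.

```python
import threading

def run_portfolio(strategies, problem):
    """Run each strategy on its own thread; the first success stops the rest."""
    stop = threading.Event()      # stands in for the shared atomic flag
    lock = threading.Lock()
    winner = []

    def worker(name, strategy):
        result = strategy(problem, stop)
        if result is not None:
            with lock:
                if not winner:            # only the first result is kept
                    winner.append((name, result))
                    stop.set()            # signal the sibling threads

    threads = [threading.Thread(target=worker, args=item)
               for item in strategies.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winner[0] if winner else None

def linear_scan(problem, stop):
    # Hypothetical strategy: find an integer square root by counting up.
    n = 0
    while not stop.is_set():
        if n * n == problem:
            return n
        if n * n > problem:
            return None
        n += 1
    return None

def idle(problem, stop):
    # Hypothetical strategy that never succeeds; it only polls the flag.
    for _ in range(1000):
        if stop.wait(0.01):
            break
    return None

outcome = run_portfolio({"linear_scan": linear_scan, "idle": idle}, 49)
print(outcome)
```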

Strategies

For CASC, mrs selects a division-tuned, data-driven portfolio of 8 strategies (matching the 8-core StarExec hardware) via --auto-schedule, which classifies the input problem as FNE, FEQ, or UEQ using rule-based syntactic checks (presence of equality literals, presence of function symbols of arity ≥ 1, unit-equality-only clause sets) and dispatches to the matching casc_* schedule.

Each per-division priority order was derived empirically: every one of mrs's 15 base strategies (varying clause weight function, literal selection, term ordering, Set-of-Support depth, and AVATAR on/off) was run solo against a large corpus of representative FOF/UEQ problems at the official time limit, and a greedy set-cover algorithm selected the minimal-redundancy 8-strategy subset that maximizes problems solved when run in parallel. This tuning is strictly at the division level (FNE/FEQ/UEQ), never at the level of individual problems or their solutions, and generalizes to unseen problems of the same syntactic class. The baseline strategies use age-weight ratios (e.g., every 5th or 6th given-clause pick by FIFO age, the rest by weight) to balance breadth and depth of search, several distinct clause weight functions (including symbol-count, function-depth penalties, Horn-clause penalties, and conjecture-symbol boosting), and Set-of-Support restrictions that skip inferences between two clauses both far from the negated conjecture.
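
A greedy set-cover selection of this kind can be sketched as follows. The strategy names and solved-problem sets are made up for illustration, and the tie-breaking rule (alphabetical order) is an assumption; the sketch shows the standard greedy heuristic, which approximates rather than guarantees the best possible subset.

```python
def greedy_portfolio(solved_by, k):
    """Pick up to k strategies, each time taking the one that adds the most
    problems not yet covered (the classic greedy set-cover heuristic)."""
    covered, chosen = set(), []
    remaining = dict(solved_by)
    while remaining and len(chosen) < k:
        # Ties are broken alphabetically: max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break                     # no strategy solves anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Hypothetical solo-run results: strategy name -> set of problems solved.
solved_by = {
    "kbo_age5": {"p1", "p2", "p3"},
    "lpo_age6": {"p3", "p4"},
    "sos_depth2": {"p1", "p2"},
    "avatar_off": {"p5"},
}
chosen, covered = greedy_portfolio(solved_by, 8)
print(chosen, sorted(covered))
```

Here `sos_depth2` is never chosen because everything it solves is already covered, which is the "minimal redundancy" effect the selection aims for.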

Implementation

mrs is implemented entirely in Rust (edition 2024), organized as a Cargo workspace of single-purpose crates: a zero-copy TPTP/TSTP parser built on the winnow parser-combinator library, a clausification pipeline (NNF/Skolemization/definitional CNF), a Robinson unification engine, the inference and ordering crate, the discrimination-tree/feature-vector indexing crate, the given-clause search loop and strategy scheduler, and a proof-extraction/TSTP-output crate. AVATAR splitting uses the cadical Rust bindings to the CaDiCaL SAT solver. mrs does not depend on or invoke any external ATP system; all reasoning is performed in-process. mrs produces a TSTP-format refutation proof on success.

mrs is open source (MIT OR Apache-2.0) and available from:

    https://github.com/newca12/mrs

Expected Competition Performance

mrs is entered in the FOF and UEQ divisions only. In local benchmarking against a representative CASC-30-era problem set (not run under official competition conditions), mrs solved 45/100 FNE-category, 98/400 FEQ-category, and roughly 39-40/300 UEQ problems, compared to reference local runs of Vampire 5.0.1 (82 FNE, 361 FEQ, 243 UEQ) and E 3.3.3 (67 FNE, 236 FEQ, 186 UEQ) on the same local harness. mrs is expected to be competitive with, but behind, the leading systems in both divisions.

Prover9 1109a

Bob Veroff on behalf of William McCune
University of New Mexico, USA

Architecture

Prover9, Version 2009-11A, is a resolution/paramodulation prover for first-order logic with equality. Its overall architecture is very similar to that of Otter-3.3 [McC03]. It uses the "given clause algorithm", in which not-yet-given clauses are available for rewriting and for other inference operations (sometimes called the "Otter loop").

Prover9 has available positive ordered (and nonordered) resolution and paramodulation, negative ordered (and nonordered) resolution, factoring, positive and negative hyperresolution, UR-resolution, and demodulation (term rewriting). Terms can be ordered with LPO, RPO, or KBO. Selection of the "given clause" is by an age-weight ratio.
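
Selection by an age-weight ratio can be sketched with two priority queues over the same passive clauses. This is an illustrative Python sketch, not Prover9's C implementation; the class name, the parameter names, and the example ratio of 1:2 are assumptions.

```python
import heapq
from itertools import count

class PassiveSet:
    """Passive clauses selectable by age (oldest first) or weight (lightest first)."""

    def __init__(self, age_picks=1, weight_picks=4):
        self.age_picks, self.weight_picks = age_picks, weight_picks
        self._by_age, self._by_weight = [], []
        self._ids = count()          # insertion order doubles as clause age
        self._selected = set()
        self._turn = 0

    def add(self, clause, weight):
        cid = next(self._ids)
        heapq.heappush(self._by_age, (cid, clause))
        heapq.heappush(self._by_weight, (weight, cid, clause))

    def _pop(self, heap):
        # Skip entries already selected through the other queue.
        while heap:
            entry = heapq.heappop(heap)
            cid, clause = entry[-2], entry[-1]
            if cid not in self._selected:
                self._selected.add(cid)
                return clause
        return None

    def select(self):
        cycle = self.age_picks + self.weight_picks
        use_age = self._turn % cycle < self.age_picks
        self._turn += 1
        return self._pop(self._by_age if use_age else self._by_weight)

passive = PassiveSet(age_picks=1, weight_picks=2)
for clause, weight in [("a", 5), ("b", 1), ("c", 3), ("d", 2), ("e", 4)]:
    passive.add(clause, weight)
order = [passive.select() for _ in range(5)]
print(order)
```

With a ratio of one age pick to two weight picks, the heavy but old clause "a" is still selected first, which is how the ratio keeps old clauses from being starved by lighter newcomers.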

Proofs can be given at two levels of detail: (1) standard, in which each line of the proof is a stored clause with detailed justification, and (2) expanded, with a separate line for each operation. When FOF problems are input, proof of transformation to clauses is not given.

Completeness is not guaranteed, so termination does not indicate satisfiability.

Strategies

Prover9 has available many strategies; the following statements apply to CASC.

Given a problem, Prover9 adjusts its inference rules and strategy according to syntactic properties of the input clauses such as the presence of equality and non-Horn clauses. Prover9 also does some preprocessing, for example, to eliminate predicates.

For CASC Prover9 uses KBO to order terms for demodulation and for the inference rules, with a simple rule for determining symbol precedence.

For the FOF problems, a preprocessing step attempts to reduce the problem to independent subproblems by a miniscope transformation; if the problem reduction succeeds, each subproblem is clausified and given to the ordinary search procedure; if the problem reduction fails, the original problem is clausified and given to the search procedure.

Implementation

Prover9 is coded in C, and it uses the LADR libraries. Some of the code descended from EQP [McC97]. (LADR has some AC functions, but Prover9 does not use them). Term data structures are not shared (as they are in Otter). Term indexing is used extensively, with discrimination tree indexing for finding rewrite rules and subsuming units, FPA/Path indexing for finding subsumed units, rewritable terms, and resolvable literals. Feature vector indexing [Sch04] is used for forward and backward nonunit subsumption. Prover9 is available from

    http://www.cs.unm.edu/~mccune/prover9/

Expected Competition Performance

Prover9 is the CASC fixed point, against which progress can be judged. Each year it is expected to do worse than the previous year, relative to the other systems.

Prover9 2026-6A

Jeff Machado
Independent Researcher, USA

Architecture

Prover9 [McCURL], version 2026-6A [ML26], is a resolution/paramodulation prover for first-order logic with equality, based on William McCune's LADR (Library for Automated Deduction Research) codebase. The system follows the "given clause" algorithm, in which not-yet-given clauses are available for rewriting and for other inference operations (the "Otter loop"). The prover provides positive ordered (and nonordered) resolution and paramodulation, negative ordered (and nonordered) resolution, factoring, positive and negative hyperresolution, UR-resolution, and demodulation (term rewriting). Term ordering options include LPO, RPO, or KBO. Selection of candidate clauses uses an age-weight ratio. The 2026 version adds native TPTP input parsing with automatic SZS status output, SInE [HV11] for large-theory problems, a machine-learned strategy selector, and a multi-core parallel portfolio scheduler. The underlying inference algorithms are unchanged from McCune's original design.

For TPTP input, proof output is in TPTP/TSTP format with SZS status lines and CNFRefutation delimiters. SIGXCPU and SIGALRM are handled for clean termination under both CPU and wall-clock time limits. In competition mode (-casc T), the system automatically configures TPTP output, multi-core scheduling, the ML strategy selector, and wall-clock timeout T for the entire portfolio run.

Strategies

Prover9 2026-6A employs a portfolio of 96 strategies executed via a sliding-window parallel scheduler. A compact decision tree (147 nodes, 74 leaves) selects and orders strategies for each input problem from a fixed vector of general syntactic features: formula and clause counts, symbol counts and symbols-per-formula, term-depth and literal-count estimates, an axiom-to-symbol ratio, presence of equality, and unit/Horn classification. The selector reads only these structural features and is given no problem name, file identifier, hash, or solution. The tree was built by running the 96 strategies over the TPTP library and learning which regions of this feature space favor which strategies. Its 74 leaves partition problems into broad structural classes, so the learned feature-to-strategy mapping is coarse by construction and generalizes across families of problems rather than fitting any one of them; it is expected to apply equally to new, unseen problems. The same trained selector is applied unchanged to all problems in all divisions. Strategy 0 (auto_default) runs McCune's original automatic mode, which adjusts inference rules and parameters according to syntactic properties of the input (presence of equality, non-Horn clauses, etc.). The remaining 95 specialist strategies vary parameters such as term ordering (KBO/LPO/RPO), age-weight ratio, inference rule selection (hyperresolution, UR-resolution, paramodulation), demodulation settings, literal selection, and SInE tolerance levels.

The parallel scheduler assigns each strategy a time slice and runs multiple strategies concurrently using a sliding-window approach. An initial breadth phase exposes the problem to diverse strategies, followed by a priority-based depth phase that allocates remaining time to the most promising strategies based on progress metrics such as given and kept clause counts and memory use.

SInE is enabled automatically for TPTP problems with more than 128 axioms, reducing large-theory problems to manageable subsets before search. Several specialist strategies vary the SInE tolerance to explore different axiom subsets.
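
The effect of the tolerance parameter can be seen in a simplified Python sketch of SInE-style selection in the spirit of [HV11]. It is not the Prover9 implementation: axioms are reduced to bare symbol sets, the function and parameter names are assumptions, and the toy axioms are invented for illustration.

```python
def sine_select(axioms, goal_symbols, tolerance=1.0, max_depth=None):
    """axioms: dict mapping axiom name -> set of symbols occurring in it.
    Returns the names of the axioms reachable from the goal symbols."""
    occ = {}
    for symbols in axioms.values():
        for s in symbols:
            occ[s] = occ.get(s, 0) + 1
    # A symbol triggers an axiom if it is among that axiom's rarest symbols,
    # where "rarest" is relaxed by the tolerance factor.
    triggers = {}
    for name, symbols in axioms.items():
        if not symbols:
            continue
        rarest = min(occ[s] for s in symbols)
        for s in symbols:
            if occ[s] <= tolerance * rarest:
                triggers.setdefault(s, set()).add(name)
    selected, known, frontier = set(), set(goal_symbols), set(goal_symbols)
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        new_axioms = set().union(*(triggers.get(s, set()) for s in frontier)) - selected
        selected |= new_axioms
        new_symbols = set().union(*(axioms[a] for a in new_axioms)) - known
        known |= new_symbols
        frontier = new_symbols
        depth += 1
    return selected

# Hypothetical toy theory.
axioms = {
    "ax_sub": {"subclass", "instance"},
    "ax_dog": {"dog", "mammal", "subclass"},
    "ax_mammal": {"mammal", "animal", "subclass"},
    "ax_car": {"car", "vehicle", "subclass"},
}
strict = sine_select(axioms, {"dog"}, tolerance=1.0)
relaxed = sine_select(axioms, {"dog"}, tolerance=2.0)
print(sorted(strict), sorted(relaxed))
```

Raising the tolerance lets the moderately common symbol `mammal` act as a trigger, so the relaxed run pulls in one more axiom, while the very common symbol `subclass` never triggers anything. Varying this parameter is what lets different strategies explore different axiom subsets.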

For FOF problems, a preprocessing step attempts to reduce the problem to independent subproblems by a miniscope transformation; if the reduction succeeds, each subproblem is clausified and given to the ordinary search procedure.

Implementation

Prover9 is coded in C (C89), using the LADR libraries. Some of the code descended from EQP [McC97]. The 2026 version adds approximately 10,000 lines to the Prover9-specific search code in McCune's original codebase and approximately 13,500 lines to the shared LADR libraries (TPTP parser and writer, SInE, demodulation enhancements, and proof-output machinery), while maintaining strict backward compatibility with existing LADR input files. Memory addressing and management are significantly improved from the 2009 version; over a trillion generated clauses have been processed effectively.

Indexing uses discrimination tree indexing for finding rewrite rules and subsuming units, FPA/Path indexing for finding subsumed units, rewritable terms, and resolvable literals. Feature vector indexing is used for forward and backward nonunit subsumption. To accelerate lookups at high-fanout index nodes, both FPA/Path and discrimination tree nodes can switch from linear child scanning to hash-table lookup; the switchover threshold is a runtime parameter (fpa_hash_threshold, default 4 children, and discrim_hash_threshold, off by default) so strategies can tune it per problem family. The hash-table code paths are unconditionally compiled in.
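
The switchover from linear scanning to hashed lookup can be sketched as below. This Python sketch is illustrative only and is not the LADR code: the class and method names are hypothetical, and whether the real switch happens at or above the threshold is an assumption (the sketch switches once a node has more children than the threshold).

```python
class IndexNode:
    """Index node whose children are scanned linearly until the fan-out
    exceeds a threshold, after which lookups go through a hash table."""

    def __init__(self, hash_threshold=4):
        self.hash_threshold = hash_threshold
        self._list = []       # (symbol, child) pairs, scanned linearly
        self._table = None    # dict symbol -> child once the node is hashed

    @property
    def hashed(self):
        return self._table is not None

    def child(self, symbol):
        if self._table is not None:
            return self._table.get(symbol)
        for sym, node in self._list:
            if sym == symbol:
                return node
        return None

    def add_child(self, symbol):
        node = self.child(symbol)
        if node is not None:
            return node
        node = IndexNode(self.hash_threshold)
        if self._table is not None:
            self._table[symbol] = node
        else:
            self._list.append((symbol, node))
            if len(self._list) > self.hash_threshold:
                # High fan-out: convert the child list to a hash table.
                self._table = dict(self._list)
                self._list = []
        return node

root = IndexNode(hash_threshold=4)
g_node = None
for sym in ["f", "g", "h", "a"]:
    node = root.add_child(sym)
    if sym == "g":
        g_node = node
before = root.hashed
root.add_child("b")
after = root.hashed
print(before, after, root.child("g") is g_node)
```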

The multi-core scheduler uses anonymous shared mmap regions and signals (SIGSTOP/SIGCONT/SIGALRM) for cooperative child process management, with open_memstream for capturing child proof output. Each child process runs an independent Prover9 instance with its own strategy configuration. Checkpoint and resume support lets the user halt and resume a search and recover from unplanned outages.

Prover9 may be compiled and run easily at the command line with documented, simple commands familiar to most users. Installation instructions are displayed conveniently on the website. Prover9 has been tested on a variety of platforms, and runs on essentially all commonly used computer operating systems including Linux, Windows, and macOS versions 10.4 or later. Prover9 will also run directly from any modern browser connected to the internet, without the need for the user to download or compile any code. Code execution takes place on the local device. Any local text editor may be used to prepare the input file, and the command line may be entirely avoided. Prover9 results are directly reported to the screen, and may be saved, verified, or processed according to the user's needs.

Prover9 and its HTML-formatted manual are available at

    https://prover9.org

Expected Competition Performance

In the FOF division, Prover9 2026-6A is expected to clearly outperform the fixed Prover9 1109a reference, on the strength of the multi-core portfolio, ML strategy selection, and SInE axiom selection; internal evaluation on rating-banded TPTP samples at competition-style limits shows an improvement of roughly a third over the 2009 baseline. The portfolio provides robustness across diverse problem types, and SInE brings large-theory problems into reach. In the UEQ division, Prover9's paramodulation and demodulation engine, a descendant of the EQP system that solved the Robbins Problem [McC97], remains a capable, well-tested equational reasoner, with the portfolio adding coverage through varied term orderings and demodulation strategies. A respectable showing is expected, though the leading dedicated equational systems are expected to remain ahead. Overall, we expect Prover9 2026-6A to finish comfortably ahead of the fixed Prover9 1109a reference while remaining well behind the modern superposition portfolio systems. Its particular strengths are equational and algebraic problems.

SATResetCoP 1.0

Martin Fixman
University of Cambridge, United Kingdom

Architecture

SATResetCoP is a connection prover for untyped first-order formulas without equality. It is coupled to an incremental SAT solver: ground instances of clauses encountered during tableau construction are accumulated as propositional clauses, and a theorem is reported when that clause set is propositionally unsatisfiable. The current SAT model is used to rank eligible tableau steps. When a tableau attempt has generated new ground clauses, it is reset to explore a fresh part of the search space while retaining the accumulated SAT state. This design builds on SATCoP [RR21] and the Connections framework [ROH23].

Strategies

At a tableau dead end, a reset at the current depth is selected when new ground clauses have been generated. Otherwise the depth bound is increased and a new tableau is started. Thus, local backtracking is traded for diverse clause generation while the SAT state is retained. The depth bound is increased by at most one when an iteration has stopped producing new ground clauses, following the iterative-deepening discipline used by leanCoP [Ott23]. This keeps early searches shallow while allowing deeper tableaux when they are required. The strategy is intended to generate a sufficiently informative set of ground clauses for a SAT refutation, rather than to close one particular tableau.

Implementation

SATResetCoP is implemented in Python using a fork of the Connections library [ROH23], which provides a framework for creating and testing connection provers. The fork adds a classical SAT-assisted calculus with leanCoP-style cuts and a strategy interface for selecting different prover controllers. These changes are currently being backported to newer versions of the library. The incremental SAT state is maintained by CaDiCaL through the pydical Python binding. The packaged system also depends on python-sat for optional experimental strategies. The final version can be downloaded from GitHub:

    https://github.com/mfixman/satresetcop

Expected Competition Performance

As a first version, SATResetCoP is not expected to be competitive with mature saturation provers such as Vampire in the general FOF category. Strong performance is expected against other connection provers on problems for which shallow tableau exploration quickly generates an unsatisfiable set of ground propositional clauses. The experimental results suggest particular strength on the TPTP SYN, SWC, and SWV categories, with improvements also seen on GEO and NUM. No category-specific tuning is used.

SPASS-SCL 0.1.1

Simon Schwarz
Max Planck Institute for Informatics, Germany

Architecture

SPASS-SCL-FOL 0.1.1 is a prototype implementing SCL(FOL) [BSW23] for first-order logic without equality. The focus of this year's development was not on improving performance, but rather on supporting proofs and models, implementing a first prototype integration of propositional reasoning, and clarifying the structure of the main loop. Unlike the classical main loops of saturation-based provers, this loop is centered on selecting ground literals to extend the trail and handling stuck states, which represent partial saturations.

Strategies

SPASS-SCL-FOL 0.1.1 comes without any complex strategy parameters. The main strategy parameter of SCL(FOL) is the selection of ground literals that should extend the trail. Currently, only very simple strategies are implemented. The main strategy selects ground literals by building linear models [BK+24] and considers literals that conflict during model building. As a prototype, SPASS-SCL-FOL does not use a portfolio.

Implementation

SPASS-SCL-FOL is implemented in C on top of the SPASS-Workbench and will become our next system after our SAT solver SPASS-SAT and our SMT solver SPASS-SATT. Due to its prototypical status, it is currently not published on our website.

Expected Competition Performance

SPASS-SCL-FOL 0.1.1 is still an early prototype for studying properties of SCL and has not been tuned for competition performance, so performance will be poor.

SUPr 1.0

Teddy Kim
Naval Postgraduate School, USA

Architecture

SUPr 1.0 is a saturation-based theorem prover built as a native component of the Sigma knowledge engineering environment (Rust implementation) (SigmaKEE-rs) [PS14]. SUPr is optimized to reason directly over the Suggested Upper Merged Ontology (SUMO) [NP01, Pea11], with additional support to solve standalone TPTP problems. The core search procedure is a given-clause saturation loop implementing ordered resolution and superposition with selective literal selection, forward and backward demodulation, and a Knuth-Bendix reduction ordering. Clauses and their constituent terms are represented in a content-addressed, hash-consed arena. This content-based addressing scheme is structured to allow batched, hardware-accelerated unification via binary arithmetic.

Strategies

SUPr is built specifically for reasoning over SUMO; consequently, SUPr's proving strategy depends on heavy caching of supporting evidence at the time of ingest for new axioms and use-case-specific semantic decoding.

Given that SUMO comprises hundreds of thousands of axioms, axiom selection is key to proving any problem that leverages the ontology. Novel pre-caching of SUMO Inference Engine (SInE) occurrence and trigger indices [HV11] allows for fast per-problem axiom selection. Axioms are additionally ranked by structural relevance to the conjecture [LX22], and those that rank highly but are dropped by SInE are rescued and introduced to the problem.
Beneath the given-clause loop, a generalized Datalog(¬) model-construction system extracts and caches Horn-clause-shaped rules from the axiom set to drive a background ground-fact materialization loop, which can short-circuit easy queries against the ontology.
Individually developed, sound subsystems are used to short-circuit the given-clause loop for common SUMO queries such as event calculus and transitive closures.
Search-loop tuning is organized as a small number of strategies running as concurrent threads that race for the solution. Threads share a single clause stack that remains deconflicted due to the atomicity of the system's content-based hashing scheme.
SUPr uses a content-based hashing system in which each term is identified by a 64-bit unsigned integer hash of its content and structure. In addition to serving as a unique identifier for the lifetime of the ontology, the hash is structured such that unification and paramodulation within the saturation loop can be conducted as bitwise arithmetic and parallelized via "Single Instruction, Multiple Data" (SIMD) on a single processor core.
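
The content-addressing idea alone can be illustrated with a small Python sketch. It is not SUPr's Rust implementation: the use of BLAKE2b, the byte encoding, and the `TermArena` name are assumptions, and the sketch does not attempt the bitwise unification or SIMD batching described above.

```python
import hashlib

class TermArena:
    """Hash-consed term store: structurally equal terms share one 64-bit id."""

    def __init__(self):
        self._terms = {}   # 64-bit id -> (symbol, tuple of argument ids)

    def __len__(self):
        return len(self._terms)

    def intern(self, symbol, *args):
        # The id depends only on the symbol and the ids of the arguments,
        # so it is stable across runs and independent of insertion order.
        h = hashlib.blake2b(digest_size=8)
        h.update(symbol.encode("utf-8") + b"\x00")
        for arg_id in args:
            h.update(arg_id.to_bytes(8, "big"))
        term_id = int.from_bytes(h.digest(), "big")
        existing = self._terms.setdefault(term_id, (symbol, args))
        if existing != (symbol, args):
            raise ValueError("64-bit hash collision")
        return term_id

arena = TermArena()
x = arena.intern("x")
fx_first = arena.intern("f", x)
fx_second = arena.intern("f", x)    # same content, so the same id
gx = arena.intern("g", x)
print(fx_first == fx_second, len(arena))
```

Because identity is derived from content, concurrent threads that build the same term arrive at the same identifier without coordinating, which is the property the shared clause stack relies on.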

Implementation

SUPr 1.0 is implemented in Rust. It is shipped as part of the SigmaKEE-rs v2.0.0+ command-line interface application. The system accepts both SUO-KIF and TPTP (CNF, FOF, TFF) syntax as input dialects, parsing either into a common internal sentence representation that is cached in a Lightning Memory-Mapped Database (LMDB). LMDB is a shared-memory B+ tree, which allows large axiom spaces to share a common memory addressing scheme on systems that lack the resources to hold a large ontology in memory all at once. Proofs are emitted in the aforementioned dialects, in Graphviz-compatible notation, or as procedurally generated plain-language prose. SigmaKEE-rs can be downloaded from its GitHub page:

    https://github.com/ontologyportal/sigma-rs

Expected Competition Performance

SUPr 1.0 is competing in the first-order category. It is expected to perform well on problems drawn from or resembling the SUMO ontology, where its axiom-selection feedback loop and discharge-subsystem prologue are specifically tuned, and on standalone TPTP problems small enough, or easy enough, for at least one portfolio lane to solve quickly. Harder, uniformly slow problems in equational domains are expected to be a weaker category: superposition-based simplification is present but comparatively lightly indexed next to mature systems in this competition.

Twee 2.7

Nick Smallbone
Chalmers University of Technology, Sweden

Architecture

Twee 2.7 [Sma21] is a theorem prover for unit equality problems based on unfailing completion [BDP89]. It implements a DISCOUNT loop, where the active set contains rewrite rules (and unorientable equations) and the passive set contains critical pairs. The basic calculus is not goal-directed, but Twee implements a transformation which improves goal direction for many problems.

Twee features ground joinability testing [MN90] and a connectedness test [BD88], which together eliminate many redundant inferences in the presence of unorientable equations. The ground joinability test performs case splits on the order of variables, in the style of [MN90], and discharges individual cases by rewriting modulo a variable ordering.

This year's version adds preliminary support for discovering interesting term patterns during proof search [AJS26].

Strategies

Twee's strategy is simple and it does not tune its heuristics or strategy based on the input problem. The term ordering is always KBO; by default, functions are ordered by number of occurrences and have weight 1. The proof loop repeats the following steps:

Select and normalise the lowest-scored critical pair, and if it is not redundant, add it as a rewrite rule to the active set.
Normalise the active rules with respect to each other.
Normalise the goal with respect to the active rules.

Each critical pair is scored using a weighted sum of the weight of both of its terms. Terms are treated as DAGs when computing weights, i.e., duplicate subterms are counted only once per term.
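
The DAG-based weight can be sketched as follows. This is an illustrative Python sketch rather than Twee's Haskell code: terms are nested tuples, every symbol (including variables) is assumed to have weight 1, and the two factors of the weighted sum are hypothetical parameters.

```python
def dag_weight(term):
    """Sum symbol weights over the distinct subterms of `term`.
    A term is either a string (variable or constant) or a tuple
    (function_symbol, arg1, arg2, ...)."""
    seen = set()

    def visit(t):
        if t in seen:
            return 0          # duplicate subterm: already counted
        seen.add(t)
        if isinstance(t, str):
            return 1
        return 1 + sum(visit(arg) for arg in t[1:])

    return visit(term)

def score(lhs, rhs, lhs_factor=1, rhs_factor=1):
    """Weighted sum of the DAG weights of both sides of a critical pair."""
    return lhs_factor * dag_weight(lhs) + rhs_factor * dag_weight(rhs)

# f(g(x), g(x)) has five symbols as a tree but only three distinct subterms.
t = ("f", ("g", "x"), ("g", "x"))
print(dag_weight(t), score(t, "x"))
```

The `seen` set is created afresh for each call, so sharing is counted within a single term only, matching "once per term" in the description above.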

For CASC, to take advantage of multiple cores, several versions of Twee run in parallel using different parameters (e.g., with the goal-directed transformation on or off).

Implementation

Twee is written in Haskell. Terms are represented as array-based flatterms for efficient unification and matching. Rewriting uses a perfect discrimination tree.

The passive set is represented compactly (12 bytes per critical pair) by storing only the information needed to reconstruct the critical pair, not the critical pair itself. Because of this, Twee can run for an hour or more without exhausting memory.

Twee uses an LCF-style kernel: all rules in the active set come with a certified proof object which traces back to the input axioms. When a conjecture is proved, the proof object is transformed into a human-readable proof. Proof construction does not harm efficiency because the proof kernel is invoked only when a new rule is accepted. In particular, reasoning about the passive set does not invoke the kernel.

Twee can be downloaded as open source from:

    https://nick8325.github.io/twee

Expected Competition Performance

Competing with the top provers.

Vampire 4.8

Michael Rawson
TU Wien, Austria

There have been a number of changes and improvements since Vampire 4.7, although it is still the same beast. Most significant from a competition point of view are long-awaited refreshed strategy schedules. As a result, several features present in previous competitions will now come into full force, including new rules for the evaluation and simplification of theory literals. A large number of completely new features and improvements also landed this year: highlights include a significant refactoring of the substitution tree implementation, the arrival of encompassment demodulation to Vampire, and support for parametric datatypes.

Vampire's higher-order support has also been re-implemented from the ground up. The new implementation is still at an early stage and its theoretical underpinnings are being developed. There is currently no documentation of either.

Architecture

Vampire [KV13] is an automatic theorem prover for first-order logic with extensions to theory-reasoning and higher-order logic. Vampire implements the calculi of ordered binary resolution, and superposition for handling equality. It also implements the Inst-gen calculus and a MACE-style finite model builder [RSV16]. Splitting in resolution-based proof search is controlled by the AVATAR architecture which uses a SAT or SMT solver to make splitting decisions [Vor14, RB+16]. A number of standard redundancy criteria and simplification techniques are used for pruning the search space: subsumption, tautology deletion, subsumption resolution and rewriting by ordered unit equalities. The reduction ordering is the Knuth-Bendix Ordering. Substitution tree and code tree indexes are used to implement all major operations on sets of terms, literals and clauses. Internally, Vampire works only with clausal normal form. Problems in the full first-order logic syntax are clausified during preprocessing [RSV16]. Vampire implements many useful preprocessing transformations including the SinE axiom selection algorithm. When a theorem is proved, the system produces a verifiable proof, which validates both the clausification phase and the refutation of the CNF.

Strategies

Vampire 4.8 provides a very large number of options for strategy selection. The most important ones are:

Choices of saturation algorithm:
- Limited Resource Strategy [RV03]
- DISCOUNT loop
- Otter loop
- Instantiation using the Inst-Gen calculus
- MACE-style finite model building with sort inference
Splitting via AVATAR [Vor14]
A variety of optional simplifications.
Parameterized reduction orderings.
A number of built-in literal selection functions and different modes of comparing literals [HR+16].
Age-weight ratio that specifies how strongly lighter clauses are preferred for inference selection. This has been extended with a layered clause selection approach [GS20].
Set-of-support strategy with extensions for theory reasoning.
For theory-reasoning:
- Ground equational reasoning via congruence closure.
- Addition of theory axioms and evaluation of interpreted functions [RSV21].
- Use of Z3 with AVATAR to restrict search to ground-theory-consistent splitting branches [RB+16].
- Specialised theory instantiation and unification [RSV18].
- Extensionality resolution with detection of extensionality axioms

The schedule for the new HOL implementation was developed using Snake, a strategy schedule construction tool described in more detail last year. The Snake schedule will this year fully embrace Vampire randomisation support [Sud22] and, in particular, every strategy will independently shuffle the input problem, to nullify (in expectation) the effect of problem scrambling done by the organisers.

Implementation

Vampire 4.8 is implemented in C++. It makes use of fixed versions of Minisat and Z3. See the website for more information and access to the GitHub repository.

Expected Competition Performance

Vampire 4.8 is the CASC-29 FNT winner.

Vampire 5.0

Michael Rawson
University of Southampton, United Kingdom

Vampire 5.0 remains similar in spirit to all previous versions, but a bumper crop of changes has been merged this competition cycle. Various non-competition improvements to Vampire have landed, including a program synthesis mode [HA+24] and partial support for the polymorphic SMT-LIB 2.7 standard, but for the competition we mention:

ALASCA [KK+23] for reasoning with linear arithmetic, with further VIRAS extensions [SKK24] for quantifier elimination.
Partial redundancy calculi [HKV25]
Stabilised and greatly enhanced runtime-specialised unidirectional term ordering checks [HC+25]
A variant of the ground joinability redundancy elimination rule, used in forward simplification.
Subsumption (resolution) via code trees.
Integration of the CaDiCaL SAT solver [BF+24] alongside Minisat.
More detailed output, including proofs that are (more) TSTP-compliant, reporting non-trivial preprocessing in saturations, and producing completely faithful finite models of the input.
Portability: Vampire is much more standards-compliant and portable than previously, with much-reduced dependence on platform-specific APIs and hardware architectures, aided by C++17.

Vampire's higher-order support remains very similar to last year, although a re-implementation intended for mainline Vampire is being merged in stages.

Architecture

Vampire [BB+25] is an automatic theorem prover for first-order logic with extensions to theory-reasoning and higher-order logic. Vampire implements several extensions of a core superposition calculus. It also implements a MACE-style finite model builder for finding finite counter-examples [RSV16]. Splitting in saturation-based proof search is controlled by the AVATAR architecture which uses a SAT or SMT solver to make splitting decisions [Vor14, RB+16]. A number of standard redundancy criteria and simplification techniques are used for pruning the search space: subsumption, tautology deletion, subsumption resolution and rewriting by ordered unit equalities. Substitution tree and code tree indices are used to implement all major operations on sets of terms, literals and clauses. Internally, Vampire works only with clausal normal form: problems not already in CNF are clausified during preprocessing [RSV16]. Vampire implements many preprocessing transformations, including the SInE axiom selection algorithm for large theories and blocked clause elimination.

Strategies

Vampire 5.0 provides a very large number of options for strategy selection. The most important ones are:

Choices of saturation algorithm:
- Limited Resource Strategy [RV03]
- DISCOUNT loop
- Otter loop
- MACE-style finite model building with sort inference
Splitting via AVATAR [Vor14]
A variety of optional simplifications.
Parameterized reduction orderings KBO and LPO.
A number of built-in literal selection functions and different modes of comparing literals [HR+16].
Age-weight ratio that specifies how strongly lighter clauses are preferred for inference selection. This has been extended with a layered clause selection approach [GS20].
The set-of-support strategy with extensions for theory reasoning.
For theory reasoning:
- Specialised calculi such as ALASCA.
- Addition of theory axioms and evaluation of interpreted functions [RSV21].
- Use of Z3 with AVATAR to restrict search to ground-theory-consistent splitting branches [RB+16].
- Specialised theory instantiation and unification [RSV18].
- Extensionality resolution with detection of extensionality axioms
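The age-weight ratio in the list above can be pictured with a small sketch. This is not Vampire's code: it assumes a ratio a:w means "a picks of the oldest clause for every w picks of the lightest clause", and the class name AgeWeightQueue and the string clauses are hypothetical.

```python
import heapq
from itertools import count


class AgeWeightQueue:
    """Passive clause store alternating between oldest-first and lightest-first picks."""

    def __init__(self, age_picks: int, weight_picks: int):
        if age_picks < 0 or weight_picks < 0 or age_picks + weight_picks == 0:
            raise ValueError("need a non-negative ratio with at least one positive part")
        self.age_picks = age_picks
        self.weight_picks = weight_picks
        self._by_age = []      # heap of (age, clause)
        self._by_weight = []   # heap of (weight, age, clause)
        self._taken = set()    # ages of clauses already selected (lazy deletion)
        self._ages = count()   # age = insertion order, so it is unique per clause
        self._turn = 0

    def add(self, clause, weight):
        age = next(self._ages)
        heapq.heappush(self._by_age, (age, clause))
        heapq.heappush(self._by_weight, (weight, age, clause))

    def pop(self):
        """Select the next clause: by age for the first part of each cycle, then by weight."""
        cycle = self.age_picks + self.weight_picks
        use_age = self._turn % cycle < self.age_picks
        self._turn += 1
        heap = self._by_age if use_age else self._by_weight
        while heap:
            *_, age, clause = heapq.heappop(heap)
            if age not in self._taken:  # skip clauses already picked via the other heap
                self._taken.add(age)
                return clause
        raise IndexError("no passive clauses left")
```

With a ratio of 1:2, one old clause is selected for every two light ones, so a heavy clause cannot be starved forever. Raising the weight part of the ratio prefers lighter clauses more strongly.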

Implementation

Vampire 5.0 is implemented in C++. It makes use of fixed versions of Minisat, CaDiCaL, GMP, VIRAS, and Z3. See the GitHub repository and associated wiki for more information.

Expected Competition Performance

Vampire 5.0 is the CASC-30 THF, FOF, and UEQ winner.

Vampire 5.0.1

Márton Hajdu
TU Wien, Austria

Vampire 5.0.1 remains similar in spirit to all previous versions, but a bumper crop of changes have been merged this competition cycle.

Various non-competition improvements to Vampire landed, but for the competition we mention:

Clause-selection guidance based on Graph and Recursive Neural Networks [Sud25].
Higher-order support has been mostly merged with mainline Vampire, therefore our HOL support enjoys all competition-winning improvements over the last years, while remaining efficient on non-higher-order problems.
Detailed propositional proof output via the CaDiCaL SAT solver [BF+24].
More detailed output, including proofs that are (more) TSTP-compliant.
Portability: Vampire is much more standards-compliant and portable than previously, with much-reduced dependence on platform-specific APIs and hardware architectures, aided by C++20.

Vampire's higher-order support remains very similar to last year, although a re-implementation intended for mainline Vampire is being merged in stages.

Architecture

Strategies

Vampire 5.0.1 provides a very large number of options for strategy selection. The most important ones are:

Choices of saturation algorithm:
- Limited Resource Strategy [RV03]
- DISCOUNT loop
- Otter loop
- MACE-style finite model building with sort inference
Splitting via AVATAR [Vor14]
A variety of optional simplifications.
Parameterized reduction orderings KBO and LPO.
A number of built-in literal selection functions and different modes of comparing literals [HR+16].
Age-weight ratio that specifies how strongly lighter clauses are preferred for inference selection. This has been extended with a layered clause selection approach [GS20]. Optionally, instead of age-weight alternation, clause selection can be fully delegated to a guiding neural network [Sud25].
The set-of-support strategy with extensions for theory reasoning.
For theory reasoning:
- Specialised calculi such as ALASCA.
- Addition of theory axioms and evaluation of interpreted functions [RSV21].
- Use of Z3 with AVATAR to restrict search to ground-theory-consistent splitting branches [RB+16].
- Specialised theory instantiation and unification [RSV18].
- Extensionality resolution with detection of extensionality axioms

For CASC, we prepared several clause-selection guiding networks trained on the TPTP library. As described in [Sud25], we independently checked test performance on a randomly held-out fraction of the library during training. The test performance, although lower than the training performance, is greater than that of the baseline, which indicates that the networks generalize to problems similar to those used in training.

Implementation

Vampire 5.0.1 is implemented in C++. It makes use of fixed versions of Minisat, CaDiCaL, GMP, VIRAS, and Z3. See the GitHub repository and associated wiki for more information. The neural clause-selection guidance relies on the LibTorch library (v2.10), which is statically linked into the competition submission so that it can run on StarExec without requiring the library to be installed there.

Expected Competition Performance

Vampire 5.0.1 should be an improvement on the previous version. A reasonably strong performance across all divisions is therefore expected.

VIP 1.718

Ilies Nokrani
Université Montpellier - LIRMM, France

Architecture

VIP 2026 is an automatic theorem prover for first-order logic in TPTP FOF syntax. It was developed under a short time constraint as a standalone prover, with extensive LLM-assisted implementation. The submitted system does not call another ATP system as a backend.

The core is a given-clause saturation loop. Problems are parsed from TPTP files, include directives are expanded, formulae are clausified, and clauses are processed by resolution-style and superposition-style rules. The main inferences are binary resolution, factoring, equality resolution, equality factoring, and equality-oriented paramodulation/superposition steps. Standard simplifications include tautology deletion, demodulation, forward simplification, subsumption, subsumption resolution, condensation, and contextual literal cutting. VIP contains a simple legacy engine and a more recent engine with stronger equality handling, indexing, and clause selection; the competition portfolio combines both. The system also includes SInE-style axiom selection, conservative AVATAR-style splitting, layered clause selection, and limited deduction-modulo-inspired preprocessing for selected definitional equivalences.

Strategies

VIP uses a sequential portfolio. The main CASC-oriented portfolio is casc-150. Each stage receives a fixed fraction of the available time and runs with fixed options for the FOF division. The submitted StarExec script uses the same command line for every FOF problem.
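A sequential portfolio with fixed time fractions, as described above, can be sketched as follows. This is an illustration only: the stage names, the fractions, and the run_stage callback are hypothetical and do not reproduce the casc-150 portfolio.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, float]  # (hypothetical stage name, share of the time budget)


def stage_limits(stages: Sequence[Stage], budget_s: float) -> List[Tuple[str, float]]:
    """Turn per-stage shares into per-stage time limits in seconds."""
    total = sum(share for _, share in stages)
    if total <= 0:
        raise ValueError("stage shares must sum to a positive number")
    return [(name, budget_s * share / total) for name, share in stages]


def run_portfolio(
    stages: Sequence[Stage],
    budget_s: float,
    run_stage: Callable[[str, float], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Run the stages in order; stop at the first one that returns a result."""
    for name, limit in stage_limits(stages, budget_s):
        result = run_stage(name, limit)  # None means "no answer within the limit"
        if result is not None:
            return name, result
    return None
```

Passing the budget as an argument mirrors the point made in the text that the time limit is supplied by the wrapper rather than hardcoded.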

Strategy scheduling is based only on general syntactic characteristics of the input, such as equality density, clause shape, unit-clause ratio, polarity patterns, and symbol occurrence information. It is not based on problem names, file paths, comments, TPTP headers, or stored information about individual problems or their solutions. The portfolio includes FEQ-oriented equality stages, FNE recovery stages using the legacy engine, SInE axiom-selection stages with different widths, legacy-guided modern stages, layered age/weight passive selection, and conservative splitting stages. The submitted StarExec script does not hardcode the CASC time limit; it accepts the announced wall-clock budget as a wrapper argument or environment value. The default generated-clause limit is 75000.

Implementation

VIP is implemented in OCaml and is built with Dune. The executable installed in the StarExec package is vip; ip is kept as a compatibility alias in the source tree. The delivered binary is statically linked.

TPTP include files are resolved according to the standard TPTP convention: first relative to the problem file, and otherwise relative to the TPTP root supplied by the TPTP environment variable. Internal data structures include feature-vector style indices for subsumption candidates, discrimination-style indices for rewriting candidates, and KBO-style term ordering for equality reasoning.
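The include-resolution convention just described can be sketched in a few lines. VIP itself is written in OCaml; this is an illustration of the lookup order only, and the function name resolve_include is hypothetical.

```python
import os
from pathlib import Path
from typing import Optional, Union


def resolve_include(
    include_name: str,
    problem_file: Union[str, Path],
    tptp_root: Optional[str] = None,
) -> Optional[Path]:
    """Locate an included file: next to the problem first, then under the TPTP root.

    If tptp_root is not given, the TPTP environment variable is consulted.
    Returns None when the file is found in neither place.
    """
    candidate = Path(problem_file).parent / include_name
    if candidate.is_file():
        return candidate
    root = tptp_root if tptp_root is not None else os.environ.get("TPTP")
    if root:
        candidate = Path(root) / include_name
        if candidate.is_file():
            return candidate
    return None
```

For example, include('Axioms/SET001+0.ax') in a problem outside the TPTP tree would normally be found through the second lookup.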

In competition mode, VIP writes all result information to standard output, emits an SZS status line, and, when a refutation is found, prints a TPTP/TSTP-style proof delimited by SZS output markers. The run script expects the problem file as its first argument and relies on the StarExec/TPTP environment for include-file resolution and resource enforcement. VIP is available from:

    https://github.com/delahayd/vip
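The competition-mode output described above follows the usual SZS conventions. A minimal sketch of such a report is given below; the function name szs_report and the choice of CNFRefutation as the output form are assumptions, not taken from VIP's source.

```python
from typing import Iterable


def szs_report(
    problem: str,
    status: str,
    proof_lines: Iterable[str] = (),
    dataform: str = "CNFRefutation",
) -> str:
    """Build the SZS status line and, if a proof is given, the delimited proof block."""
    proof = list(proof_lines)
    lines = [f"% SZS status {status} for {problem}"]
    if proof:
        lines.append(f"% SZS output start {dataform} for {problem}")
        lines.extend(proof)
        lines.append(f"% SZS output end {dataform} for {problem}")
    return "\n".join(lines)
```

Printing the status line before the proof lets a harness read the verdict even if the run is cut off while the proof is being written.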

Expected Competition Performance

VIP is expected to solve a useful subset of FOF theorem problems, with better performance on unsatisfiable problems where axiom selection, equality simplification, and staged saturation interact well. It is not expected to match mature ATP systems such as Vampire, E, or Zipperposition.

Zipperposition 2.1.9999

Jasmin Blanchette
Ludwig-Maximilians-Universität München, Germany

Architecture

Zipperposition is a superposition-based theorem prover for typed first-order logic with equality and for higher-order logic. It is a pragmatic implementation of a complete calculus for full higher-order logic [BB+21]. It features a number of extensions that include polymorphic types, user-defined rewriting on terms and formulas ("deduction modulo theories"), a lightweight variant of AVATAR for case splitting [EBT21], and Boolean reasoning [VN20]. The core architecture of the prover is based on saturation with an extensible set of rules for inferences and simplifications. Zipperposition uses a full higher-order unification algorithm that enables efficient integration of procedures for decidable fragments of higher-order unification [VBN20]. The initial calculus and main loop were imitations of an earlier version of E [Sch02]. With the implementation of higher-order superposition, the main loop had to be adapted to deal with possibly infinite sets of unifiers [VB+21].

Strategies

The system uses various strategies in a portfolio. The strategies are run in parallel, making use of all CPU cores available. We designed the portfolio of strategies by manual inspection of TPTP problems. Zipperposition's heuristics are inspired by efficient heuristics used in E. Various calculus extensions are used by the strategies [VB+21]. The portfolio mode distinguishes between first-order and higher-order problems. If the problem is first-order, all higher-order prover features are turned off. In particular, the prover uses standard first-order superposition calculus and disables collaboration with the backend prover (described below). Other than that, the portfolio is static and does not depend on the syntactic properties of the problem.

Implementation

The prover is implemented in OCaml. Term indexing is done using fingerprints for unification, perfect discrimination trees for rewriting, and feature vectors for subsumption. Some inference rules such as contextual literal cutting make heavy use of subsumption. For higher-order problems, some strategies use the E prover as an end-game backend prover.

Zipperposition's code can be found at

    https://github.com/sneeuwballen/zipperposition

and is entirely free software (BSD-licensed).

Zipperposition can also output graphic proofs using graphviz. Some tools to perform type inference and clausification for typed formulas are also provided, as well as a separate library for dealing with terms and formulas [Cru15].

Expected Competition Performance

The prover is expected to perform well on THF, about as well as last year's version. We expect to beat E.

ProoVer 2026

CheckProof 0.1

Nik Murzin
Wolfram Institute, USA

Overview

CheckProof 0.1 [Mur26] is a checker for first-order TSTP proofs. A proof is validated by replay: it is parsed into its sequence of annotated formulae, and each derived step is validated independently by dispatching on the inference rule that introduced it, so that one SZS status is reported for the whole proof [Sut08]. Each rule carries a distinct obligation. An instantiate step is required to be a substitution instance of a parent; a negated_conjecture step to be the negation of its parent, checked by entailment in both directions so that a wrong quantifier dual is detected; a skolemize step to introduce a fresh Skolem symbol that occurs in the result; and a consequence, horn, or deduction step to be entailed by its parents. Entailment is decided by refutation with superposition, resolution, and paramodulation [BG94]: the parents and the negated conclusion are clausified and saturated, and a derived empty clause certifies the step. The refutation is required to close with the empty clause. Checking is conservative, in keeping with the scoring, under which a wrong verdict is penalised more heavily than abstention: a step is reported unsound only when a saturation is completed that exhibits a counter-model, and a step whose inference rule is not implemented, or whose entailment is neither proved nor refuted within the resource bound, leaves the proof unverified rather than guessed.
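The obligation for an instantiate step amounts to one-way matching. The sketch below is not CheckProof's C code. It assumes terms and atoms are encoded as nested tuples (functor, arg, ...), constants as lowercase strings, and variables as strings starting with an uppercase letter (the TPTP convention), and it ignores quantifier handling.

```python
def is_var(term) -> bool:
    """Variables are strings that start with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, target, subst=None):
    """Return a substitution s with pattern*s == target, or None if there is none.

    Only variables of the pattern are bound; the target is treated as fixed.
    """
    subst = dict(subst or {})
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == target else None
        subst[pattern] = target
        return subst
    if isinstance(pattern, str) or isinstance(target, str):
        # A constant matches only itself; a compound never matches a bare string.
        return subst if pattern == target else None
    if pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p_arg, t_arg in zip(pattern[1:], target[1:]):
        subst = match(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst


def is_instance(parent, child) -> bool:
    """True if child is a substitution instance of parent."""
    return match(parent, child) is not None
```

The check is deliberately asymmetric: p(a) is an instance of p(X), but p(X) is not an instance of p(a).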

Implementation

CheckProof is implemented as a single self-contained C program, with no external dependencies beyond the C standard library. Proofs in TSTP syntax are read directly by a recursive-descent parser into the first-order term representation of the reasoning core. The obligations are discharged by FindProof's own clausal saturation engine, the same system entered in the CASC unit-equality division [Mur26]; no external prover is called, so equational atoms are reasoned about with superposition rather than treated as opaque. The checker is built from source by a single compiler invocation and is available from the author.

Expected Competition Performance

CheckProof is expected to be sound, accepting no invalid proof and rejecting no valid one, and to report a proof unverified when its steps use inference rules outside the implemented set or when an entailment cannot be settled within the time limit.

GAPT 2.20

Fabian Achammer
TU Wien, Austria

Overview

GAPT (General Architecture for Proof Theory) [EH+16] is a framework containing many data structures, algorithms, parsers and other components common in proof theory and automated deduction. As such it implements various proof calculi, including LK proofs and resolution proofs. The main idea of the GAPT proof checker is to attempt to import a given TSTP derivation as an acyclic directed graph of LK proofs which can then be combined into a compact representation of an LK proof of a sequent whose antecedent contains input axioms and the negation of the input conjecture and whose succedent contains $false. As a pre-pass GAPT performs other auxiliary checks like checking that the axiom file directives are correct. To check semantic entailment of plain inferences GAPT uses its internal superposition prover Escargot.

GAPT returns VerifiedBad if any of the checks fail, or any internal proof invariant is violated. If checking takes longer than the given time limit, GAPT returns Timeout. If an unexpected situation occurs (e.g., input syntax error, or unexpected exception thrown), GAPT returns Unknown. Otherwise, GAPT returns VerifiedGood. The checker not only outputs the SZS status, but in case of VerifiedBad it also gives a more detailed message as to what went wrong. For example, if Escargot can establish that a plain inference is incorrect, the checker outputs the step name of the incorrect inference.

Importing the TSTP derivation as an LK proof makes it possible, in principle, to compute a more elaborate version of the TSTP derivation that no longer requires external ATP tools to check.

Implementation

GAPT is implemented in Scala. It uses parboiled2 as a parsing library to parse the TPTP format. The internal superposition prover Escargot is used for checking plain inferences, so GAPT does not depend on an external ATP tool for proof checking.

The proof checking proceeds in the following way: First the input file is parsed in the TPTP format. Then the file is checked for unique formula names and for acyclicity of the derivation. Next, auxiliary checks are performed to ensure all steps have valid statuses and the file directives in axioms and conjectures are correct. Now, each step is assigned a first-order sequent where the step's parents' formulas are in the antecedent and the step's formula is in the succedent. For plain inference steps an attempt is made to find an LK proof for this sequent by using the internal superposition prover Escargot. For skolemization inference steps an LK proof for the sequent is constructed using a special skolemization LK rule. For negated conjecture steps an attempt is made to find an LK proof for both directions of the equivalence negation(conjecture) iff negated_conjecture. To make sure that skolem symbols are used appropriately across all inferences, e.g., the same skolem symbol is not used for different skolemized formulas, an additional check is performed across the whole derivation. Finally, for each step, LK cuts between the step's proofs and the step's parents' proofs are applied, which results in LK proofs whose antecedents contain axioms and the negation of the conjecture from the input file and whose succedent contains the formula of the given step. This results in a compact representation of the LK proof of the input refutation, i.e., an LK proof of the sequent whose antecedent consists of input axioms and the negation of the conjecture and whose succedent consists of $false. This representation avoids unfolding the whole LK proof into a tree and instead saves space by remaining a directed acyclic graph.

GAPT is available at:

    https://www.logic.at/gapt/

Expected Competition Performance

We expect GAPT to be sound, but as this is the first version of the GAPT derivation checker we don't expect fast results. It might struggle to verify long derivations or difficult plain inferences, since Escargot works very well on small examples, but is likely not able to compete with state-of-the-art resolution provers in this area. However, GAPT should be able to handle most reasonably granular input derivations correctly and we expect to receive few negative points.

GDV 2.0

Geoff Sutcliffe
University of Miami, USA

Overview

GDV 2.0 [Sut06] is a verifier for derivations in classical first-order and typed first-order logic, written in the TPTP format. GDV checks a derivation in four verification phases: structural verification, leaf verification, rule-specific verification, and inference verification.

Structural verification deals with non-logical aspects of a proof, including checking the syntax, that formulae are uniquely named, the derivation is acyclic, refutations have false roots, etc.
Leaf verification ensures that the leaves of the derivation match formulae in the original input problem, and that introduced formulae such as definitions meet requirements.
Rule-specific verification deals with special cases, e.g., splitting as implemented in SPASS [Wei01]. Special techniques are available for verifying correctly documented Skolemization steps; see Section 2.3 of [SBB25].
Inference verification uses external trusted ATP systems to verify each inference step, based on the SZS status [Sut08] of the inference record. For example, inference steps often have the SZS status thm indicating that the inferred formula is (supposed to be) a theorem of the parent formulae. In this case a proof obligation with the parent formulae as the axioms and the inferred formula as the conjecture is created, and discharged (or not) using a trusted theorem proving ATP system. Other SZS status values are treated with variants of that process.
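For a thm step, the proof obligation described above can be written out as a small TPTP problem. The sketch below only illustrates the shape of such an obligation. GDV is written in C, the function name proof_obligation is hypothetical, and only FOF formulae are handled.

```python
from typing import Mapping, Tuple


def proof_obligation(parents: Mapping[str, str], inferred: Tuple[str, str]) -> str:
    """Render parents as axioms and the inferred formula as the conjecture.

    parents maps formula names to FOF formula strings; inferred is (name, formula).
    """
    lines = [f"fof({name},axiom,({formula}))." for name, formula in parents.items()]
    name, formula = inferred
    lines.append(f"fof({name},conjecture,({formula})).")
    return "\n".join(lines) + "\n"
```

A trusted prover reporting Theorem on the resulting problem confirms the step. Any other outcome leaves it unconfirmed, which is not the same as refuted.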

Implementation

GDV is implemented in C. It is available from:

    https://github.com/TPTPWorld/GDV

GDV relies heavily on the JJParser library, which has to be downloaded separately into the same directory as GDV:

    https://github.com/TPTPWorld/JJParser

The external ATP systems run remotely, through the SystemOnTPTP service [Sut00].

Expected Competition Performance

The short time limit of 30s per proof, and the relatively slow process of running the remote ATP systems, mean that GDV is likely to time out often. GDV is sound, slow, but very sure of itself.

GDV-LP 2.0

Frédéric Blanqui
ENS Paris-Saclay, INRIA, France

Overview

GDV-LP 2.0 [SBB25] is a verifier for derivations in classical first-order and typed first-order logic, written in the TPTP format. GDV-LP checks a derivation in two steps:

Standard GDV (as described above) is run, using ZenonModulo [DD+13] as the trusted ATP system for discharging proof obligations. ZenonModulo is configured to output a LambdaPi term [HB20] for each discharge proof.
GDV-LP produces the necessary files that declare the formulae, the signatures of the symbols in the terms, a LambdaPi term for the root of the proof, and a lambdapi.pkg package file. ZenonModulo's LambdaPi terms are chained together from the root term, and passed to the lambdapi checker to be checked. The key strength of this added layer is that it is not necessary to trust the "trusted ATP system", here ZenonModulo. Additionally, tools other than lambdapi can be used to check the LambdaPi terms, e.g., dkcheck [Sai15] and kontroli [Far22].

Implementation

GDV-LP is implemented in C. It is available from:

    https://github.com/TPTPWorld/GDV

GDV relies heavily on the JJParser library, which has to be downloaded separately into the same directory as GDV:

    https://github.com/TPTPWorld/JJParser

ZenonModulo is implemented in OCaml. It is available from:

    https://github.com/Deducteam/zenon_modulo

lambdapi is written in OCaml. It is available from:

    https://github.com/Deducteam/lambdapi.git

ZenonModulo, other external ATP systems, and lambdapi run remotely, through the SystemOnTPTP service [Sut00].

Expected Competition Performance

The short time limit of 30s per proof, and the relatively slow process of running the remote ATP systems, mean that GDV-LP is likely to time out often, probably in the first non-LambdaPi step. GDV-LP is sound, slow, but very very sure of itself.

mrs-proover 0.2.0

Olivier Roland
Independent Researcher, France

Architecture

mrs-proover 0.2.0 is a proof checker for TSTP first-order refutation proofs, built as a companion to the mrs automated theorem prover. It follows the semantic verification paradigm pioneered by GDV [Sut06]: structural properties of the proof DAG are checked first (uniqueness of formula names, acyclicity, a single $false root), followed by leaf verification against the linked problem file, followed by per-step verification of each inference.
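The three structural properties named above (unique names, acyclicity, a single $false root) can be checked on a very simple proof representation. The sketch below is not the Rust implementation. It assumes each step is a (name, formula, parents) triple, and the function name structural_errors is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Step = Tuple[str, str, Sequence[str]]  # (name, formula text, parent names)


def structural_errors(steps: Sequence[Step]) -> List[str]:
    """Return a list of structural problems; an empty list means all checks passed."""
    errors = []
    names = Counter(name for name, _, _ in steps)
    errors += [f"duplicate name: {n}" for n in sorted(names) if names[n] > 1]

    parents = {name: list(ps) for name, _, ps in steps}
    formulas = {name: formula for name, formula, _ in steps}
    for name, ps in parents.items():
        errors += [f"{name}: unknown parent {p}" for p in ps if p not in parents]

    state = {}  # name -> "open" while on the current DFS path, "done" afterwards

    def on_cycle(name: str) -> bool:
        if state.get(name) == "done":
            return False
        if state.get(name) == "open":
            return True  # reached a node that is still on the current path
        state[name] = "open"
        cyclic = any(on_cycle(p) for p in parents[name] if p in parents)
        state[name] = "done"
        return cyclic

    if any(on_cycle(name) for name in parents):
        errors.append("derivation is cyclic")

    used = {p for ps in parents.values() for p in ps}
    roots = [name for name in parents if name not in used]
    if len(roots) != 1 or formulas[roots[0]] != "$false":
        errors.append("expected a single $false root")
    return errors
```

The recursive traversal is adequate for a sketch. A checker meant for long proofs would use an explicit stack to avoid recursion limits.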

A small set of inference patterns that are expected to recur in every proof — the negated-conjecture step, and Skolemization steps — are verified internally using dedicated structural checks rather than external provers, since these steps have precisely specified shapes (see the ProoVer Rules and Format page). Axiom and conjecture leaves are checked for alpha-equivalence against the named formula in the linked problem file, either internally or, when internal matching fails, by delegating to an external ATP. All other inference steps with status thm or cth are discharged as proof obligations to an ATP ladder.

mrs-proover reports Unknown (rather than guessing) whenever neither a positive nor a negative verdict can be established within the allotted resources for a given step; because the ProoVer scoring penalizes a false VerifiedGood on a bad proof ten times more heavily than a missed detection, the overall verdict policy is deliberately conservative: any single step found unsound anywhere in the proof yields VerifiedBad for the whole proof; otherwise any step left undecided yields Unknown; only if every step is positively confirmed does the proof yield VerifiedGood.
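The verdict policy above is a simple priority rule over per-step results. A minimal sketch follows. The enum and function names are hypothetical, and treating an empty proof as Unknown is an assumption not stated in the description.

```python
from enum import Enum
from typing import Iterable


class StepResult(Enum):
    SOUND = "Sound"
    UNSOUND = "Unsound"
    UNDECIDED = "Undecided"


def overall_verdict(step_results: Iterable[StepResult]) -> str:
    """Combine per-step results into one SZS verdict for the whole proof."""
    results = list(step_results)
    if any(r is StepResult.UNSOUND for r in results):
        return "VerifiedBad"      # one unsound step condemns the whole proof
    if not results or any(r is StepResult.UNDECIDED for r in results):
        return "Unknown"          # assumption: nothing to confirm also means Unknown
    return "VerifiedGood"         # every step was positively confirmed
```

Checking for unsound steps first means a definite error is reported even when other steps could not be decided.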

Implementation

mrs-proover is implemented in Rust (edition 2024), reusing the mrs TPTP/TSTP parser, clausifier, and unification/formula-lowering crates from the same workspace. The proof is loaded and its dependency structure is represented as a directed acyclic graph over annotated formulae; formulae are lowered from the TPTP AST into the shared mrs-core term/formula representation for alpha-equivalence and structural comparisons. For inference steps that cannot be settled by the internal structural checks, mrs-proover discharges the corresponding proof obligation (parent formulae as axioms, inferred formula as conjecture) to an ATP ladder that runs, per step, the in-process mrs prover first, then races any available external backends — E [Sch02] and Vampire [KV13] — in parallel; the first definite Sound/Unsound verdict from any ladder member wins and the remaining subprocesses for that step are cancelled. Steps can be verified concurrently across multiple CPU cores (default 8, matching the competition hardware).

mrs-proover has been cross-checked, using only its internal mrs backend (no external ATPs), against the 7 official ProoVer example proofs, against the leoprover/noergler PyRes correctness corpus (170 valid proofs, 170 corresponding falsified/mutated proofs), and against the ATP-Research-Project test corpus of correct and deliberately "evil" proofs. Across all of these, running the in-process backend alone never produced a false VerifiedGood on an evil proof and never produced a false VerifiedBad on a valid proof; adding the external E/Vampire backends only improves detection rate (fewer proofs left at Unknown), never soundness.

mrs-proover is open source (MIT OR Apache-2.0) and available from:

    https://github.com/newca12/mrs

as the mrs-proover crate in the same repository as the mrs ATP system.

Expected Competition Performance

mrs-proover is expected to correctly identify almost all well-formed valid proofs as VerifiedGood and almost all deliberately incorrect ("evil") proofs as VerifiedBad, given the internal structural checks tailored to the exact Skolemization and negated-conjecture formats specified by the ProoVer rules, backed by external ATP calls for the remaining steps. Given the highly asymmetric scoring (a single false VerifiedGood on an evil proof costs as much as ten correct verifications), mrs-proover is tuned to prefer reporting Unknown over guessing whenever a step cannot be positively confirmed or refuted within the time budget, so a non-trivial fraction of harder or unusually-shaped valid proofs may be conservatively scored 0 rather than +1.

Nörgler 1.1

Alexander Steen
University of Greifswald, Germany

Overview

The proof checker Nörgler 1.1 [TSS26, STS26] is designed as a light-weight system for the verification of TSTP refutations across various logic dialects, including propositional, untyped or typed first-order, and higher-order logics. The architecture is built upon the semantic verification paradigm pioneered by the GDV system [Sut06]. This approach combines structural property verification using dedicated checks with the verification of individual proof steps by re-proving each inference using trusted general-purpose automated theorem provers. A core design choice in the system is the integration of multi-core parallelization, which allows independent proof checking tasks to be processed concurrently, thereby significantly increasing performance and reducing overall verification time.

The verification process is executed via a multi-stage checking strategy. Initially, global structural properties of the proof (such as ensuring the proof graph is acyclic, verifying that formula names are unique, and confirming that the derivation correctly terminates in false) are validated. Following these global assessments, individual logical inferences are evaluated. This comprises checks of local structural properties (such as ensuring that a given step has an appropriate TSTP role and that all listed parents exist earlier in the proof) and the semantical verification of the inferences. Nörgler checks the faithfulness of all formulae taken from the problem file. With the exception of the negation of the conjecture and Skolemisation, which are verified by dedicated checks, the inferences are verified by invoking an external trusted prover. If this check does not succeed within the allocated resource limit, countermodel search is invoked via integrated model finders to actively detect incorrect proof steps. If a countermodel is found, the inference is rejected as invalid (resulting in a VerifiedBad status); however, if neither a successful proof nor a definitive countermodel can be established by the auxiliary tools, a status of Unknown is reported. If all checks are successful, Nörgler returns VerifiedGood.

Several notable features are incorporated into the system to enhance its flexibility and robustness. Fine-grained control over the strictness with which formatting conventions are enforced is provided, enabling the system to either gracefully handle imperfect prover outputs or strictly demand adherence to proof standards; see [TSS26] for details.

Implementation

Nörgler is implemented in Scala, making use of its native support for concurrency via Futures and functional data structures. The system incorporates the scala-tptp-parser library [Ste26] for robust parsing of the standard TPTP/TSTP grammar formats, and integrates the tptp-utils tool for formula manipulation [Ste25]. For its backend verification tasks, the well-known systems E [Sch02] and Mace4 [McC03] (for proving resp. disproving) are used. The system is freely available as open-source software (MIT license), and can be accessed online via its repository at GitHub:

    https://github.com/leoprover/noergler/

Expected Competition Performance

This is the very first version of Nörgler, so no reliable expectations can be formulated. The authors are positive, however, that Nörgler should not report false positives (i.e., reporting a successful verification on a flawed proof).

ProofCheck 1.0

Jeff Machado
Independent Researcher, USA

Overview

ProofCheck 1.0 is a verifier for first-order refutation proofs written in the TPTP format. A proof is checked by structural verification followed by independent verification of every inference step, and is reported as VerifiedGood, VerifiedBad, Unknown, or Timeout. The verifier has a single, strict mode: by construction, no derived step can be accepted without verification, and a step that cannot be decided is reported as Unknown rather than VerifiedGood, to avoid false acceptance.

Structural verification checks the syntax, unique naming, acyclicity, a $false root, and that the leaves match the input problem (located from the proof's % Proof : header). Leaf bodies are compared against their cited problem formulae canonically, with an ATP equivalence check as the fallback, and clausal leaves are validated by replaying the clausification of their source formula. Introduced definitions are checked for conservativity and Skolem-symbol laundering.

Inference verification discharges every thm and cth step by an ATP proof obligation built from exactly that step and its cited premises: the parents as axioms and the conclusion as conjecture, discharged by a local E [SCV19] raced against Mace4 [ML26], where a Mace4 countermodel refutes the step; a step that E leaves undecided is retried with Vampire [KV13] as a second entailment oracle before the step is hedged. SideStep, a built-in clausal inference engine, replays clausal steps structurally in process first; a step it refutes is rejected outright, while its acceptances are not trusted and the step still goes to the ATP. Skolemization (esa) steps are never sent to an ATP; they are verified structurally against the parent: the required skolemize(Var,Term) record, the Skolem term's arguments against the universally quantified variables in scope at the eliminated existential, symbol freshness across the proof and against the problem's own symbols, and one fresh symbol per eliminated (non-vacuous) existential. Constructs outside the specification are hedged to Unknown. A wall-clock deadline is self-imposed, so one of the four verdicts is always emitted within the time budget.
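Two of the structural Skolemization conditions listed above, freshness of the symbol and agreement of its arguments with the universal variables in scope, can be illustrated as follows. This is a sketch and not ProofCheck's C++ code. The function name is hypothetical, and requiring the arguments to be exactly the in-scope universal variables (each once) is an assumption about how strict the comparison is.

```python
from typing import Iterable, Sequence


def skolem_term_ok(
    functor: str,
    args: Sequence[str],
    universals_in_scope: Iterable[str],
    known_symbols: Iterable[str],
) -> bool:
    """Check one Skolem term f(args) introduced for one eliminated existential.

    known_symbols holds every symbol of the problem and of earlier proof steps.
    """
    if functor in set(known_symbols):
        return False  # not fresh: the symbol already occurs somewhere
    if len(set(args)) != len(args):
        return False  # a variable is repeated among the arguments
    # Assumed rule: the arguments are exactly the universals in scope.
    return set(args) == set(universals_in_scope)
```

A Skolem constant corresponds to an empty argument list, which is accepted only when no universal variable is in scope.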

Implementation

ProofCheck is implemented in C++, including the SideStep engine and the structural checkers. It compiles into a single static binary that invokes E [SCV19] and Vampire [KV13] as entailment provers, Mace4 [ML26] as the countermodel finder, and Prover9 [ML26] for clausification replay, all locally from the directory of the binary. It is available from:

    https://github.com/AlgorithmicTruth/proofcheck-releases

Expected Competition Performance

The entailment backends are local and most structural checks are in process, so ProofCheck verifies typical proofs in well under a second and rarely times out. It has been tested for soundness on an adversarial library of over 200 red-team proofs, a third-party evil proof suite, and several hundred mutated E proofs, with no false VerifiedGood observed, and for coverage on several hundred valid E and Prover9 proofs. Constructs the verifier does not model are reported as Unknown rather than accepted, so unanticipated cases reduce coverage, not soundness. ProofCheck is therefore expected to be sound and fast.

ProofGuard 1.0

Matthew Farah
McMaster University, Canada

Overview

ProofGuard 1.0 is an automatic checker for first-order logic proof certificates submitted in the TSTP format [Sut24]. A problem file and a proposed proof file are taken as input, and it is checked whether the proof establishes the claimed contradiction. The system is designed for refutation proofs. A proof is accepted only when every step of the argument can be followed and it can be confirmed that $false is correctly derived. When all required steps have been verified, the proof is reported as verified.

ProofGuard is intentionally conservative. When a proof step is unclear, unsupported, malformed, or cannot be checked within the available time, no guess is made. It is instead reported that the proof could not be verified. A proof is rejected as flawed only when a definite error can be identified, such as an incorrect negated conjecture, an invalid Skolemization step, or an inference which does not follow from its stated premises. This design reflects the competition setting, in which the incorrect acceptance of a bad proof is heavily penalized. Caution is therefore prioritized: only fully checked proofs are accepted, only demonstrably flawed proofs are rejected, and verification is otherwise reported as inconclusive.

Verdicts are reported using the required SZS statuses [Sut08]. An accepted proof is reported as VerifiedGood; a rejected proof is reported as VerifiedBad, together with a one-line reason naming the flawed step; an inconclusive verification is reported as Unknown; and an exhausted time budget is reported as Timeout.

Implementation

ProofGuard 1.0 (July 2026) is the first release of the system. The E prover, built from current upstream sources, is bundled with the system as the backend ATP.

Problem and proof parsing. FOF problem files and TSTP proof files are parsed by a recursive-descent parser, and formulas are represented as immutable dataclass terms. Nested inference(...) chains are flattened to their leaf premises.
Input-step validation. Steps imported from the problem file are checked against the referenced problem formula, modulo alpha-renaming and harmless syntactic normalizations such as literal order and equation orientation.
Internal rule checking. Specified rules are verified internally. Negated-conjecture steps are checked against the negation of the original conjecture, which may be consumed by no other step. Skolemization steps are checked for fresh Skolem symbols, consistent annotations, and correct dependency on the universal variables in scope.
Conservative de-Skolemization fallback. When Skolemization annotations are absent, de-Skolemization is attempted only when the transformation is unambiguous: each fresh Skolem term is replaced by a new existentially bound variable, after which the remaining theorem-status content is checked by the ATP. Whenever the transformation cannot be performed unambiguously, verification is reported as inconclusive.
ATP-backed consequence checking. Unspecified inference steps marked with theorem status are checked through the generation of local proof obligations for the E prover [SCV19]. The original conjecture is never independently re-proved.
Conservative failure policy. A demonstrably incorrect step causes a failed verification to be reported. A malformed, unsupported, ambiguous, timed-out, or internally failing check causes verification to be reported as inconclusive; malformed input never crashes the system.
Resource management. A global wall-clock budget is managed by the checker. Each ATP call is given a CPU limit derived from the remaining time, and the system is terminated conservatively before the external time limit is exceeded.
Packaging and reproducibility. ProofGuard is implemented in pure Python 3 with no third-party Python libraries. The delivery package is built with Docker and is fully self-contained: the CPython runtime, the E binary, and the checker are all included, glibc 2.17 is targeted, and the system is launched as ./proover-check PROBLEM PROOF.
Testing. The system was tested on the published ProoVer examples, on E-generated TSTP proofs over the TPTP library, on multi-system TSTP samples, and on mutation-fuzzed proof variants.
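
The resource-management item above can be illustrated with a small Python sketch. It is not ProofGuard's code: the class name, the safety reserve, and the policy of sharing the remaining time evenly among pending ATP calls are all assumptions made for illustration; the description only states that each CPU limit is derived from the remaining time.

```python
import time


class Budget:
    """Global wall-clock budget with a safety reserve (hypothetical helper)."""

    def __init__(self, total_seconds, reserve_seconds=1.0, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds
        self._reserve = reserve_seconds

    def remaining(self):
        """Seconds still usable before the reserve is touched."""
        return max(0.0, self._deadline - self._clock() - self._reserve)

    def atp_cpu_limit(self, pending_calls, floor=1):
        """Whole-second CPU limit for the next ATP call.

        Returns 0 when less than `floor` seconds are available per call,
        which the caller should treat as an inconclusive result.
        """
        share = self.remaining() / max(1, pending_calls)
        if share < floor:
            return 0
        return int(share)
```

With a 30 second budget, a 2 second reserve, and four pending calls, the first call would receive a limit of 7 seconds; once the limit drops to 0 the checker can stop and report an inconclusive verdict before the external time limit is reached.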

The source is available here:

    https://github.com/ValueAchooMatthew/ATP-Research-Project

Expected Competition Performance

All published example problems are classified correctly, and each evil proof is rejected for its intended flaw. Wrong answers are expected to be rare: across 58 TSTP proofs produced by five different systems, no correct proof was misclassified as flawed, and no mutation-fuzzed proof was ever accepted. The strongest performance is expected on invalid proofs, for which structural checks and countersatisfiability results are obtained cheaply and definitively. On the published 8 MB stress sample, a flaw (a missing Skolem dependency at step 219 of 223) is located in roughly 12 seconds, without any ATP call on the large formulas. Valid proofs written in the style of the published examples are expected to be verified within the time limit, while proofs relying on techniques outside the system’s coverage are reported as inconclusive rather than risked. Overall, it is expected that most lost points will be attributable to inconclusive verdicts rather than to penalties.

PyCheck 0.1

Stephan Schulz
DHBW Stuttgart, Germany

Overview

PyCheck is a simple (mostly) semantic proof checker for refutation-based proofs in the most commonly used subset of TPTP/TSTP CNF/FOF syntax. Inspired by GDV [Sut06], it performs both structural and semantic checks on proof files. Among the structural checks are existence of an explicit contradiction, acyclicity of the proof graph, and freshness of introduced (Skolem) symbols. Semantic checks include verification of "status(thm)" and "status(cth)" steps using an external prover.

Implementation

PyCheck reuses the parser and a lot of internal data structures from PyRes [SP20], and thus supports the same core of the TPTP-3 proof language, with support added for more detailed derivations.

PyCheck calls an external prover (usually E [SCV19]) to verify most proof steps (in particular those with thm and cth semantics). It uses internal processing to try to verify Skolemization steps, to check existence of axioms in the original input file, and to perform structural checks on the proof graph.
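
A freshness check for introduced symbols, one of the internal checks mentioned above, might look roughly as follows. This is a simplified sketch and not PyCheck's code: it works on formula text with a regular expression instead of PyRes data structures, ignores quoted atoms, and the function names are hypothetical.

```python
import re

# Lower-case identifiers that are not part of a variable name or a
# $-prefixed defined word such as $false. Quoted atoms are not handled.
_SYMBOL = re.compile(r"(?<![\w$])[a-z]\w*")


def symbols(formula):
    """Return the set of function and predicate symbols in a formula string."""
    return set(_SYMBOL.findall(formula))


def stale_skolem_symbols(new_symbols, earlier_formulas):
    """Return the introduced symbols that already occur in earlier formulas.

    An empty result means every introduced symbol is fresh.
    """
    used = set()
    for formula in earlier_formulas:
        used |= symbols(formula)
    return [s for s in sorted(new_symbols) if s in used]
```

A Skolemization step introducing sk1 would pass if stale_skolem_symbols({"sk1"}, earlier) is empty, where earlier collects the problem formulae and all previously derived steps.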

PyCheck can be downloaded from GitHub:

    https://github.com/eprover/PyCheck

It requires E (https://www.eprover.org) to be available in its search path (the ProoVer StarExec package should automatically do everything necessary).

Expected Competition Performance

PyCheck 0.1 is at a very early stage of maturity. It can handle all the problems given in the ProoVer description, but may fail to catch some particularly malicious structural problems of proofs. Overall, it should prove to be at least useful.

VaLeaDate 0.1

Jonas Bodingbauer
TU Wien, Austria

Overview

VaLeaDate v0.1 is a proof verification system for TPTP proofs. It leverages a recently developed proof output from Vampire, which generates proofs as Lean input files that are end-to-end verifiable by Lean [BH26]. The core approach of VaLeaDate is to invoke Vampire on each inference, inspired by GDV [Sut06], chain the resulting proof steps together, and subsequently verify the entire proof in Lean. Since the proof checking step is time-consuming, VaLeaDate first checks several structural properties of the proof and terminates early if any of these checks fails. These basic checks include:

Verifying that all parent nodes exist
Checking for acyclicity of the proof structure
Ensuring axioms are alpha-equivalent to the problem input

After that, all inferences are passed to the ATP Vampire, which processes them in a parallelized manner. The resulting proof steps are then chained together (in one or multiple files depending on the size of the proof output) and verified by Lean. Skolemization is handled within VaLeaDate, generating a skolemization section in the Lean proof output, which is then verified by Lean. The process transforms any input into NNF, skolemizes it, and then reconstructs it (if necessary) back to the expected form.
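
The NNF step of this pipeline can be illustrated with a short Python sketch over a hypothetical tuple-based formula representation. It is not VaLeaDate's Scala code, it omits implication and equivalence, and it does not perform the subsequent Skolemization or reconstruction.

```python
def nnf(formula, positive=True):
    """Push negations inwards until they sit directly on atoms.

    Formulas are tuples (an assumed encoding for this sketch):
    ("atom", name), ("not", f), ("and", f, g), ("or", f, g),
    ("forall", var, f), ("exists", var, f).
    `positive` is False while an odd number of negations is pending.
    """
    op = formula[0]
    if op == "atom":
        return formula if positive else ("not", formula)
    if op == "not":
        return nnf(formula[1], not positive)
    if op in ("and", "or"):
        dual = {"and": "or", "or": "and"}
        new_op = op if positive else dual[op]  # De Morgan
        return (new_op, nnf(formula[1], positive), nnf(formula[2], positive))
    if op in ("forall", "exists"):
        dual = {"forall": "exists", "exists": "forall"}
        new_op = op if positive else dual[op]  # quantifier duality
        return (new_op, formula[1], nnf(formula[2], positive))
    raise ValueError(f"unsupported connective: {op}")
```

For example, the negation of "for all X, p or not q" becomes "there exists X such that not p and q", which is the form a Skolemizer would then process.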

The system reports status Unknown for (some) syntactic errors or if Vampire does not produce a conclusive result (neither satisfiable nor a refutation). VerifiedBad is produced if any of the basic checks fail or if the Lean verification fails. Finally, VerifiedGood is produced if all checks pass and the final Lean verification also succeeds.
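
The verdict policy just described can be summarised as a small decision function. The following Python sketch is illustrative only: the parameter names are hypothetical, and the order in which the conditions are tested is an assumption, since the description does not specify precedence.

```python
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    GOOD = "VerifiedGood"
    BAD = "VerifiedBad"
    UNKNOWN = "Unknown"


def decide(parsed_ok: bool,
           basic_checks_ok: bool,
           vampire_conclusive: Iterable[bool],
           lean_ok: bool) -> Verdict:
    """Combine the outcomes of the individual stages into one verdict."""
    if not parsed_ok:
        return Verdict.UNKNOWN  # (some) syntactic errors
    if not basic_checks_ok:
        return Verdict.BAD      # a basic structural check failed
    if not all(vampire_conclusive):
        return Verdict.UNKNOWN  # Vampire was inconclusive on some inference
    return Verdict.GOOD if lean_ok else Verdict.BAD
```

For instance, decide(True, True, [True, True], True) yields Verdict.GOOD, while a single inconclusive Vampire run yields Verdict.UNKNOWN regardless of the other stages.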

Implementation

VaLeaDate is implemented in Scala and uses the scala-tptp-parser library from the leoprover project to parse TPTP input and proof files. Furthermore, it relies on Lean, a small Lean library VampLean, as well as a custom Vampire build configured for proof output (including VaLeaDate-specific functions). It aims to parallelize the computationally intensive parts of proof reconstruction and verification as much as possible.

Expected Competition Performance

It is challenging to predict the performance, but it is hoped that VaLeaDate does not produce any false positives. False positives can still occur due to the somewhat open definition of correctness or bugs introduced during the relatively short development period. Since verification in Lean introduces some overhead, the system might time out on many problems within the given 30 s time limit.