PROBLEM
Multiple platforms exist where people unite around emotions. Yet there is little on offer for rational thinkers.
This post introduces System II -- a protocol, and a demo implementation in Python, that aims to help groups of rational people and/or AI find better answers by thinking together rather than separately. The primary purpose of that implementation is to demonstrate that a functional System II, even if imperfect, can be created.
The code can be obtained from https://github.com/suprathermal/System-II [2] (mirror http://tung-sten.no-ip.com/MVP/Src/sources.zip) [3], or simply from the picture below, which is a PNG representation of the whole project source code:
DISCLAIMER
This is a PROTOTYPE. It implements just one possible approach. It may contain bugs. It is intentionally written in a crude and often primitive manner. It is definitely not perfect in terms of performance, functionality, or convenience.
It only does two things:
- It shows that every principle behind the design can be implemented in code.
- It shows that these pieces can be assembled into a working whole.
And next... Next we need to run this thing in all kinds of scenarios. See where it works, where it doesn't. Doing this alone is like trying to outplay yourself at chess, which is very inefficient. Other players are needed. Questions and scenarios that I haven't thought of are needed.
Therefore, the next move is yours. Take it. Study it. Try it. I permit rewriting and improving this prototype (subject to MIT license though). Moreover, I actively hope for it. Just try not to break the core principles (see the corresponding section in the full docs).
THE IDEA
Problems are often solved by voting. How does it work?
- A problem is presented to the participants.
- Each participant, based on their life experience, knowledge, culture, and reasoning skills, comes up with a method to find the "correct" answer to this problem.
- Then they solve it.
- And the answers of all participants are averaged in some way.
Although the models built in step 2 may be flawed and incomplete, and then solved with errors in step 3, this is not the main issue. It can be shown that under some reasonable assumptions, voting still converges to the correct answer.
The main drawback here is that almost all the mental work of step 2 is re-created by participants "from scratch." Therefore, successful thought achievements of some participants are rarely reused by others. As a result, most of the intermediate answers are of low quality, and the method, although it converges, does so with terrible friction and extremely slowly.
Recognizing this shortcoming, people invented debates. That is when not just the answers, but also the reasoning methods of some participants are brought to light, offering others an opportunity to adjust their models. But debates don't scale well: N >> 1 people cannot all consider all M >> 1 models devised by society.
We'd like something better.
And here is the idea: we should "average" not after obtaining the intermediate answers, but before that. So, first combine all the reasoning methods and data from step 2 into the best possible model, and only afterwards compute the answer with it.
There are many ways of doing that. Only one of them is presented here, perhaps not the most general or flexible, but probably the simplest. Here it is:
When rational people try to predict an outcome of a situation, they often use two types of statements:
- Arguments. E.g., "A drought happens when wind blows continuously from the East."
- Examples. E.g., "However, there was the wind from the East in Neverlandia, yet there was no drought."
There are multiple reasons why a discussion may never converge to an answer, but the most important of them is the limits of human memory and attention. Even with ten arguments in play things get convoluted, and checking each of them against each historical example is usually beyond the abilities of a human being in the course of the discussion.
Yet it is possible for ML. It can easily check billions of combinations; they only need to be formalized.
Suppose we need to predict an outcome of some situation. The relevant examples and arguments could be arranged into an ML train/test table like this:
Each cell contains a number. It could be the degree to which the argument agrees with the example in supporting some "positive" outcome. Or simply a directly measured quantity.
To be specific, suppose we want to predict a used car price. Then the table above may end up looking like this:
People interested in that car's price can fill out that table. Then we train the ML on the top part of it and ask it to predict the value for the bottom line. Which would be (after some reasonable assumptions) the best price estimate of that car, given the available data.
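To make this concrete, here is a minimal sketch (not the project's code) of what "train on the top part, predict the bottom row" means in Python. The column names and figures are invented for illustration:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy "examples x arguments" matrix; the columns and numbers are made up.
table = pd.DataFrame(
    {
        "mileage_km": [210_000, 90_000, 150_000, 30_000, 120_000],
        "age_years":  [15, 6, 10, 2, 8],
        "accidents":  [2, 0, 1, 0, 1],
        "price_usd":  [2_500, 11_000, 6_000, 19_000, 8_500],  # labels (known prices)
    },
    index=["car A", "car B", "car C", "car D", "car E"],
)

# The "bottom row": the car whose price we actually want to predict.
query = pd.DataFrame({"mileage_km": [100_000], "age_years": [7], "accidents": [0]},
                     index=["our car"])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(table.drop(columns="price_usd"), table["price_usd"])
print(model.predict(query))  # the best estimate given the (tiny) available data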
Many problems lend themselves well to that structure. A boiling point of a chemical? Examples are chemicals, arguments -- things that the boiling point may depend on, such as molecular weight or presence of C=O group. A court decision on some case? Examples are past precedents, arguments are laws. Or maybe we care to know whether some couple will stay together in 5 years? Then the examples are other couples, and arguments are the observable aspects of their relationships (e.g. whether they go to vacations together or not).
In the end, that is still a form of voting. The difference is that instead of averaging out the outcomes, we "average" (in a smart way) all the input information from all voters, and compute the answer from that.
Participation obviously does not have to be limited to human beings only. AI agents can also use System II. Moreover, I anticipate that this protocol can, in principle, extract objectively correct answers even from hallucinating or non-cooperative AI, by building a model of its knowledge. After all, it was originally designed to work with people who often have the same problems :)
ENGINEERING CHALLENGES
Of course, in practice this spherical horse won't fly for a multitude of reasons:
- The table may have missing entries. At a minimum, this could be due to data unavailability. Most ML algorithms handle missing data rather poorly.
- Even one discussion participant can intentionally inject a combination of examples, arguments, and data into the table to force an outcome of their liking.
- The table might contain noise, spam, and garbage. This could be either accidental or intentional, and apply to either individual cells or entire examples or arguments.
- Even honest opinions of participants on the figures in the cells can diverge drastically.
- An answer without an error margin is garbage.
- Insufficient data. Neural networks need at least hundreds of entries to start functioning, yet our entire table might be as small as 9 by 5.
- The significance of the examples could be incomparable. A fluke observation and an experiment repeated in hundreds of laboratories clearly should not have an equal impact on the computation, but here both would occupy one row and would be treated as equals by the ML.
- The process of filling out the table in practice is not trivial.
TECHNICAL SOLUTIONS
I'm not claiming that these solutions are the only possible or the best ones. They are only the results of my own trials, errors, and insights.
Let's proceed in order:
1. The table may have missing entries. At a minimum, this could be due to data unavailability. Most ML algorithms handle missing data rather poorly.
Dealing with it requires a regressor robust to missing data. Among the ready-made solutions is the HistGradientBoostingRegressor[4]. On small datasets, its performance isn't very impressive -- but it works right out of the box, at least.
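For illustration, here is a minimal sketch of that out-of-the-box behavior on synthetic toy data with deliberately punched holes:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=40)
X[rng.random(X.shape) < 0.2] = np.nan            # ~20% of the cells go missing

model = HistGradientBoostingRegressor().fit(X, y)  # NaN cells are accepted as-is
print(model.predict([[0.5, -1.0, np.nan, 0.0]]))   # predicts despite the blank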
Another solution is implemented in SparseRF.py. It's basically a RandomForest that iterates over ALL possible parameter combinations with non-missing data, weights interim predictions appropriately, and can use any internal regressors (not just decision trees). It handles up to 20-30% of blanks in the data well and generates decent test results. However, there is a significant drawback: the O(2^N) complexity in the number of features, with a practical limit around 12 dimensions and considerable slowdown starting at 5-7 variables. In practice, I choose SparseRF for smaller dimensionality, and HistGradientBoostingRegressor for larger scales.
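To convey the idea, here is a compressed sketch, not the actual SparseRF.py; the choice of Ridge as the internal regressor and the weighting by the amount of training data are assumptions made for illustration:

from itertools import combinations
import numpy as np
from sklearn.linear_model import Ridge

def fit_sparse(X, y, base=Ridge, min_rows=3):
    """Train one sub-model per feature subset, on the rows complete for that subset."""
    models = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):      # the O(2^N) part
            idx = list(cols)
            rows = ~np.isnan(X[:, idx]).any(axis=1)
            if rows.sum() >= min_rows:
                models.append((idx, rows.sum(), base().fit(X[np.ix_(rows, idx)], y[rows])))
    return models

def predict_sparse(models, x):
    """Average the sub-models whose features are all present in the query x."""
    preds, weights = [], []
    for idx, weight, model in models:
        if not np.isnan(x[idx]).any():
            preds.append(model.predict(x[idx].reshape(1, -1))[0])
            weights.append(weight)                 # here: weight by training data volume
    return np.average(preds, weights=weights) if preds else float("nan")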
Some tempting but incorrect alternatives are:
- Removing rows with missing data. Given that a real matrix can have numerous "holes", this could lead to throwing away (nearly) all of it.
- Filling in missing values. This is a widely used and extremely slippery path. If these values are calculated based on data external to the matrix, we should just incorporate that data into the matrix right away! If not, it means we are filling cells with values computable from the matrix itself, thus introducing no new information, which would make sense only if the imputation regressor is stronger than the one used for the primary task. Why not switch them, then? And no matter what we do, the filled-in values would end up being not exactly what is missing, but something "slightly different." Understanding how this "slightly" distorts the answer is a daunting problem in practice.
[Some text was replaced with its SHA256 value. Reason: "And some more thoughts", Hash:A0-CF-B2-2E-FA-75-B8-DE-E8-E0-F1-D3-87-25-A5-84-F2-D0-A9-37-CE-27-2A-1C-71-25-02-B0-57-1C-44-69]
2. Even one discussion participant can intentionally inject a combination of examples, arguments, and data into the table to force an outcome of their liking.
This should be handled via separation of duties. The person who proposed an argument or an example should not be the one to fill in the numbers for it. Furthermore, no one should be able to single-handedly introduce a large chunk of data with a predictable structure.
So, it should work like this:
- Participant A proposed an argument? Let the intersections with all examples be filled by other, randomly chosen participants.
- Participant B proposed an example? Again, let the intersections with all arguments be filled by other random people.
Also:
- The share of the input data contributed by each participant should never exceed a certain threshold. This is to prevent them from filling the matrix (especially a small one) at their discretion by creatively accepting or refusing questions.
- No one should be able to propose an argument ("feature") that alone predicts most of the answer, as such a feature could be easily manipulated.
But the main idea is this: keep the proposing of examples and arguments separate from the filling-in of the related data, spreading these duties among the participants of the discussion. This, at least theoretically and on average, supports the stability of the solution against intentional manipulations.
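A minimal sketch of that rule follows; the 30% cap and the participant names are illustrative assumptions, not the project's actual policy:

import random
from collections import Counter

MAX_SHARE = 0.3   # assumed cap on one participant's share of the filled cells

def assign_cell(cell, proposer, participants, filled_counts, total_cells):
    """Pick who answers the question for this cell: never the proposer,
    never someone already above the share cap."""
    candidates = [p for p in participants
                  if p != proposer
                  and (filled_counts[p] + 1) / max(total_cells, 1) <= MAX_SHARE]
    if not candidates:
        return None                        # hold the question for later
    assignee = random.choice(candidates)
    filled_counts[assignee] += 1
    return assignee

counts = Counter()
print(assign_cell(cell=("drought in Neverlandia", "wind from the East"),
                  proposer="alice", participants=["alice", "bob", "carol"],
                  filled_counts=counts, total_cells=20))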
3. The table might contain noise, spam, and garbage. This could be either accidental or intentional, and apply to either individual cells or entire examples or arguments.
Here we apply the following principle: if some data group is statistically proven to only worsen the solution quality (measured, for example, through R2), then it is spam. It carries negative information. Remove it.
This procedure applies at each computation cycle to each "data element" (an example, an argument, a participant). Detected spam "elements" are dropped. But they may be re-submitted, in case the removal was erroneous.
In theory this sounds simple, but in practice there are many nuances. Even for feature impact, various measurement techniques exist (removal, label shuffling, SHAP). And what if the "data" is a "row"? Or a group of numbers, features, and examples from one user scattered across the matrix? How do we measure the impact of such complex data?
Eventually, I settled on the removal method:
- Find a solution with the full data matrix.
- Record its quality (e.g., mean absolute error on training data, or R2).
- Temporarily remove an element (a user, an example, an argument, or all data from a single source).
- Construct a new solution.
- Measure its quality.
- Measure how much quality we've lost through that data removal.
This method is not perfect. It is crude. It might remove an honest feature. For instance, if two features carry similar values and are semantically close, removing one of them will simply "transfer the weight" to the other, and the solution quality will barely change. It would be like standing on two feet. If you lift one, you'll most likely not fall – but it doesn’t mean the lifted foot was doing nothing :)
Nevertheless, I chose this method. It is easy to generalize for any data shapes. It’s simple to write and easy to trace. And temporarily losing even important information isn’t as scary as permanently letting garbage into the model, since a good argument is likely to be repeated.
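Here is a sketch of that removal-based check, assuming quality is measured by cross-validated R2 and the "element" being tested is a single argument column (the same loop applies to examples, users, or sources):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def quality(X, y):
    """Cross-validated R2 of the model built on the given matrix."""
    return cross_val_score(HistGradientBoostingRegressor(random_state=0),
                           X, y, cv=3, scoring="r2").mean()

def suspected_spam_columns(X, y, tolerance=0.0):
    """Columns whose removal does not hurt (or even helps) the solution quality."""
    baseline = quality(X, y)
    suspects = []
    for j in range(X.shape[1]):
        if quality(np.delete(X, j, axis=1), y) >= baseline - tolerance:
            suspects.append(j)             # dropping column j cost us nothing
    return suspects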
Of course, better (but more complex) solutions are conceivable:
[Some text was replaced with its SHA256 value. Reason: "like this", Hash:7C-BF-49-55-E8-41-16-57-B2-00-A6-96-94-99-AE-68-80-AD-56-9F-E8-52-F8-90-65-92-2B-56-15-4F-33-D9]
[Some text was replaced with its SHA256 value. Reason: "or like this", Hash:3E-E5-EF-44-AB-4A-2D-0B-2F-9E-0E-F7-8C-19-AD-DC-F0-92-60-AC-A0-73-57-0A-AF-05-55-49-EB-BE-F3-CC]
4. Even honest opinions of participants on the figures in the cells can diverge drastically.
To address that, the unit of storage in each cell should be not a number, but a distribution, represented by the set of numbers provided by the participants.
This means we allow more than one figure per cell, and we actually want this. It would look as follows (for the case of predicting a war's outcome, and the question "does country A have more of something than country B?"):
Then we walk through the matrix and randomly sample some (ideally many) combinations of values from it:
Then, discard any duplicates and take what's remaining to be the training set.
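A minimal sketch of that sampling step for a single example row; the cell contents are made up for illustration:

import random

# One example row; each cell holds ALL numbers contributed by the participants.
row = {"has_more_tanks": [1, 1, 0], "has_more_allies": [0], "won_the_war": [1, 1]}

def sample_rows(cells, n_draws=50, seed=0):
    rng = random.Random(seed)
    drawn = {tuple(rng.choice(values) for values in cells.values())
             for _ in range(n_draws)}                 # a set drops duplicate combinations
    return [dict(zip(cells.keys(), combo)) for combo in drawn]

for candidate in sample_rows(row):
    print(candidate)          # each dict becomes one row of the training set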
[Some text was replaced with its SHA256 value. Reason: "There’s likely a better solution for this if one feels like extra coding.", Hash:14-1E-DA-FA-0C-C7-DC-38-AB-8E-24-27-56-46-3F-EF-C8-C1-D9-97-8F-0F-BD-53-EB-D8-63-7E-09-96-DD-A7]
As you can see, we don't discard anyone's responses before modeling. We include them in the process and create a model that captures the entire spectrum of opinions in the population. But later we examine which data helped us to create a coherent picture and which rather obstructed it. No censorship. Data is declared worthless only after the most thorough attempts to predict anything with it have failed. :)
Note this method won't work with respect to the values of labels (i.e., known outcomes of past examples). In principle, ML can distinguish truth from falsehood by the latter's contradictions with "solid" truths, even when those contradictions are non-trivial. But for this, there must exist "solid" truths in the problem, truths that are agreed upon by all or at least by the majority of the participants. If there is no agreement in the group even on issues like "Does the sun shine?", then there would be no shared statements to rely upon and backtrack the conclusions to.
Therefore, with labels we effectively just vote. We look at the spread of the proposed labels on each topic. If the spread is large and the count of participants is high, we may try eliminating 1-2 "outliers". If the spread still remains large after that, we discard the attempt and collect new label data. Otherwise, if the spread is small, we take its average and declare it the label. The goal of this process is not to establish "the true value" of the label (which we can't know!), but to make sure that only labels with high internal consensus in the group are admitted to the discussion, thus offering solid ground for a possible appeal to them. And only that.
(People can still make mistakes as a group. When their errors are few, ML can "flip" them and extract a positive signal. But sometimes it might be impossible to establish any consensus, and then the system should abandon the task as early as possible.)
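A sketch of such a label-consensus check; the spread metric and the thresholds here are assumptions for illustration, not the project's exact values:

import numpy as np

def consensus_label(values, max_rel_spread=0.2, max_outliers=2):
    """Return the agreed label, or None if the group has no real consensus."""
    vals = np.asarray(values, dtype=float)
    rel_spread = lambda v: np.max(np.abs(v - np.median(v))) / (abs(np.median(v)) + 1e-9)
    if rel_spread(vals) > max_rel_spread and len(vals) > 4:
        keep = np.argsort(np.abs(vals - np.median(vals)))[: len(vals) - max_outliers]
        vals = vals[keep]                  # drop the 1-2 most extreme answers
    if rel_spread(vals) <= max_rel_spread:
        return float(np.mean(vals))        # admitted to the discussion
    return None                            # discard and collect new label data

print(consensus_label([10.0, 10.5, 9.8, 42.0, 10.2]))    # -> about 10.2
print(consensus_label([10.0, 25.0, 42.0]))               # -> None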
5. An answer without an error margin is garbage.
This is the least of the problems. The standard procedure is resample, bootstrap & cross-validate. We randomly split the training matrix into, say, 80/20 train/test groups. Then train, predict on the test, calculate quality metrics, and predict the answer to the primary question(s). Repeat this process several times. Average the results, compute their spread, and obtain the estimate of the error.
With small data, though, this poses a challenge. When you have only some 13 data points, removing three of them "for testing" is rather risky. It's far from guaranteed that the regressor would train the same way on the remaining 10 points. If it doesn't, the measurements obtained on them might show a systematically worse picture than with the full data set. And at worst, the training might just fail outright.
To decrease the likelihood of such an outcome, we apply jackknifing[5]. It's like the bootstrap, but leaving out only one element at a time. Then, the error of the prediction is taken to be the median of the prediction errors of 12 (if there were 13 data points) models.
Of course, to answer the primary posted question (as opposed to just measuring the model's accuracy), training needs to occur on the ENTIRE matrix, without splitting into train/test.
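A minimal sketch of that jackknife loop (using a plain RandomForest for brevity; the real pipeline would use the missing-data-tolerant regressors discussed above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def jackknife_error(X, y):
    """Leave one example out, refit, record the error on the held-out point."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        model = RandomForestRegressor(random_state=0).fit(X[mask], y[mask])
        errors.append(abs(model.predict(X[i:i + 1])[0] - y[i]))
    return np.median(errors)               # robust error estimate from the N-1-sized fits

# The answer to the primary question still comes from a model trained on ALL rows.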
Oh, and do I need to explain why we can't use the average of the errors (or the root mean square deviation of the responses) for regressor quality assessment, but the median and MAD[6] instead? In many regressors, the answer emerges as a result of an X/Y division, where both variables are noisy. Consequently, if Y occasionally approaches zero, the "tail" of the regressor's errors will be distributed as a Pareto[7] with α = 1.0. Such a distribution does not have a mean, not to mention higher-order moments. The error of such a regressor, measured this way, will progressively seem to get worse the more tests you run trying to measure it "more reliably". This can easily drive one crazy, especially under time pressure.
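A tiny demonstration of that effect: the absolute ratio of two noisy (Gaussian) quantities has a heavy, Pareto-like tail, so the running mean keeps creeping upward with the sample size while the median stays put:

import numpy as np

rng = np.random.default_rng(1)
# |X / Y| with noisy X and Y: heavy tail with alpha ~ 1, hence no finite mean
errors = np.abs(rng.normal(size=1_000_000) / rng.normal(size=1_000_000))
for n in (100, 10_000, 1_000_000):
    print(n, "mean:", round(float(errors[:n].mean()), 2),
             "median:", round(float(np.median(errors[:n])), 2))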
[Some text was replaced with its SHA256 value. Reason: "In fact, even the median doesn't completely guarantee correct measurement, but that’s a topic for a separate article", Hash:58-D8-C4-08-A0-C6-CD-F0-2F-A1-4D-B8-F5-8F-A8-17-82-24-54-A0-6D-4C-94-A1-5A-59-AE-BD-18-81-07-80]
6. Insufficient data. Neural networks need at least hundreds of entries to start functioning, yet our entire table might be as small as 9 by 5.
Obviously, this is not a task for neural networks. Instead, methods like Random Forest or its relatives (XGBoost, Boosted Trees) are more suitable. These algorithms are known to perform better on small, noisy datasets with imperfect data (e.g., [8]). After being wrapped in meta-regressors robust to missing data, other methods like Ridge, KernelRidge, or KNearestNeighbors can also produce decent results. However, Lasso is better avoided under feature starvation conditions.
7. The significance of the examples could be incomparable. A fluke observation and an experiment repeated in hundreds of laboratories clearly should not have an equal impact on the computation, but here both would occupy one row and would be treated as equals by the ML.
I'm not sure I have found the best or even a good solution, but here is my approach.
Sure, the first thing that comes to mind when facing such an issue is to assign weights to observations. But where to get these weights? Even if participants agree that observations from an established lab carry more importance than those from a random V. Pupkin, quantifying this difference is challenging.
One could try to calculate an observation's weight by recursively applying the same discussion process we are developing, until the weight value becomes apparent. But the complexity of this algorithm could easily get unmanageable. Most likely, it just won't converge.
I opted for another approach: sources. Each observation should have a source, such as "science," "CNN," "rumors," "personal observation." Even if this attribution isn't very precise, it allows the model to group data and calculate the degree of usefulness of each group for solving the problem, which then makes it possible to discard completely unreliable sources. This process implicitly assigns greater weight to the more credible sources.
This, too, potentially opens up possibilities for manipulation by a malicious participant. Unable to "bend" the overall model, they might, for instance, respond to questions with garbage while citing "science" as the source, hoping to discredit it.
The defense against this is based on checking for garbage and the removal of "bad" data in a hierarchical order: example > user > argument > source. That is, if a user starts inputting pure garbage, the system will first discard their bad examples (while leaving the possibility to reintroduce them), and if the user persists... they get removed next :)
Of course, this protection is probabilistic, so it could sometimes be bypassed.
Another option left to the discretion of the end users is introduction of an additional feature column describing the reliability of an example. It could contain, say, (approximately) the base-10 logarithm of the number of people needed to be deceived to "flip" the consensus on the value of a figure. For example, if the observation is a UFO sighting witnessed only by V. Pupkin, this feature would be zero (since only Vasili needs to be deceived, and Log10(1) = 0). If a village of 1000 people saw that UFO, you would enter three. And if the observation asserts that "quantum mechanics generally works," you'd put at least 9.5. Because if not, 3+ billion people worldwide would find their cell phones suddenly not working :)
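That optional column is just a logarithm of a head count, e.g.:

from math import log10

def reliability(people_to_deceive: int) -> float:
    # base-10 log of how many people would have to be deceived to flip the figure
    return log10(max(people_to_deceive, 1))

print(reliability(1), reliability(1000), reliability(3_200_000_000))   # 0.0, 3.0, ~9.5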
8. The process of filling out the table in practice is not trivial.
Simply providing people with a link to a Google Sheet and hoping for the best won't work, especially given the potential issues outlined in points 2-4.
It takes a protocol that assigns 1-2 questions to each participant, adheres to all the constraints, and integrates the responses into a data table. In fact, designing and implementing this protocol consumed more than half of my time.
Naturally, it makes sense here to separate the transport from the question-generation process (a minimal interface sketch follows this list):
- Transport is merely the medium through which questions reach participants and their responses are returned. The Transport should not be aware of the question-generation process. It should be a plug-in module that could be replaced by any other with the same interface. In the current implementation, we have two: Telegram and FileTransport. The latter is a CSV file wrapper that simulates filling the matrix from that CSV by synthetic participants, for testing purposes. In theory anything else -- like email, FTP, or even pigeons -- could serve as Transport.
- Question Generation is the mechanism that, by looking at the existing matrix and some statistics from the participants, calculates the next questions to ask, and picks the participants to receive them, in order to continue filling/expanding the table. In contrast, it should know (almost) nothing about the Transport.
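Here is a hypothetical sketch of that split; the class and method names are illustrative and do not claim to match the actual classes in the repository:

from abc import ABC, abstractmethod

class Transport(ABC):
    """Delivers questions and collects answers; knows nothing about how
    the questions were generated."""

    @abstractmethod
    def send_question(self, participant_id: str, question: str) -> None: ...

    @abstractmethod
    def poll_answers(self) -> list[tuple[str, str]]:
        """Return (participant_id, answer) pairs received since the last poll."""

class ConsoleTransport(Transport):
    """Trivial stand-in for local experiments; a real one would wrap Telegram or a CSV file."""
    def send_question(self, participant_id, question):
        print(f"[to {participant_id}] {question}")
    def poll_answers(self):
        return []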
The algorithm for question generation is non-trivial. After numerous trials, hacks, patches, and reworks, I managed to implement one version in the file QManager.py. It is far from perfect, and I hope someone will write a better one. Currently, it functions as follows (a simplified sketch appears below):
- If there are examples without labels, generate high-priority questions requesting the labels until they are filled.
- Check if new arguments are needed. That is pretty much always the case, but there are interesting exceptions. For example, if we have significantly more arguments than examples, or if the matrix has too many missing data points, or lacks labels. In such cases, we skip requesting new arguments with a high (but not 100%) probability. If we don't skip, we generate a question like, "Can you suggest an argument/feature that would help solving the problem?"
- Check if new examples are needed. Examples are needed even more than features, but also with exceptions -- for instance, if the matrix is too sparse yet. In such cases, generating a request for a new example occurs with some low probability, and with 100% certainty otherwise.
- Check for cells with missing data in the matrix. If there are any, try to ask for values of that missing data.
- (Optionally) request new data for a random element in the matrix, even one possibly already filled, for the purpose of consolidating as many opinions and (ultimately) eliminating erroneous data entries.
When selecting the target users for the questions, care must be taken to ensure that no participant contributes a share of responses to any data category larger than that allowed by random chance. This prevents a single participant from controlling something like 51% of the examples or arguments.
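A deliberately simplified sketch of that priority order, returning one question type per call; the probabilities and thresholds are placeholders, not the values used in QManager.py:

import random

def next_question_kind(n_unlabeled, n_args, n_examples, sparsity, n_missing_cells):
    """Decide what kind of question to generate next (one kind per call)."""
    if n_unlabeled > 0:
        return "ask_label"                                    # highest priority
    # Usually ask for new arguments, but skip with high probability when there are
    # already more arguments than examples or the matrix is too sparse.
    skip_args = (n_args > n_examples or sparsity > 0.5) and random.random() < 0.9
    if not skip_args:
        return "ask_new_argument"
    # Examples are needed even more, unless the matrix is still too sparse.
    if random.random() < (0.2 if sparsity > 0.5 else 1.0):
        return "ask_new_example"
    if n_missing_cells > 0:
        return "ask_missing_cell_value"
    return "recheck_random_cell"                              # consolidate opinions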
References:
[2] https://github.com/suprathermal/System-II
[3] http://tung-sten.no-ip.com/MVP/Src/sources.zip
[4] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html
[5] https://en.wikipedia.org/wiki/Jackknife_resampling
[6] https://en.wikipedia.org/wiki/Median_absolute_deviation
[7] https://en.wikipedia.org/wiki/Pareto_distribution
[8] https://arxiv.org/abs/2207.08815
MORE?
Visit https://github.com/suprathermal/System-II[2] (or http://tung-sten.no-ip.com/MVP/Src/sources.zip)[3] to get the source code and full documentation.
And what about the picture at the beginning of the post? It contains the whole project, encoded as a PNG image. To unpack it:
- Save the above image as "src.png" to a local folder.
- In that folder, run the following Python code:

import struct
from PIL import Image

img = Image.open("src.png")
raw_data = img.tobytes()                      # raw pixel bytes of the image
size = struct.unpack("<I", raw_data[:4])[0]   # first 4 bytes encode the payload length
file_data = raw_data[4:4 + size]              # the embedded zip archive
with open("src.zip", "wb") as f:
    f.write(file_data)
- Open/extract the resulting src.zip.