I’d be interested in either a deeply analytical explication of that distinction, or a pointer to a careful technical one.
It seems to me that if cognition is a computation, or in any case if we limit ourselves to the world of computable functions, then memorization and understanding, as cognitive terms, need to reduce to some information-theoretic computational property.
And it seems to me that memorization is basically a lookup table, mapping input to output, while understanding is the ability to compute the output given the input. There isn’t a hard line. The “understanding” program can have low Kolmogorov complexity (entropy) and require high logical depth to produce a value, while in the limit of a lookup table it’s the opposite, and there are many options in between (practically a continuum, even in the discrete setting).
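To make the two poles concrete, here is a toy sketch, entirely my own construction (the names `build_table` and `square` are just illustrative): the same I/O mapping realized once as a memorized table and once as a short program that computes it.

```python
# Toy contrast between memorizing a mapping and computing it.

def build_table(n):
    """'Memorization': store every input/output pair explicitly.
    Description length grows with the size of the domain."""
    return {i: i * i for i in range(n)}

def square(x):
    """'Understanding' (in the compressed sense): a short program that
    computes the same mapping. Description length is constant."""
    return x * x

table = build_table(1000)

# Over the covered domain the two are extensionally identical ...
assert all(table[i] == square(i) for i in range(1000))

# ... but only the program has anything to say outside the table.
print(square(10**6))      # 1000000000000
print(table.get(10**6))   # None
```

Over the inputs the table covers, nothing distinguishes it from the program; the difference only shows up in description length and in behavior off the table.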
So in principle it seems to me that memorization and understanding are the same thing, with respect to the same universe of inputs.
What you may mean is that if we want to get valid outputs for inputs not on the lookup table, we need an algorithm that is “bigger” (in entropy plus logical depth) than the table.
Understood this way, the question becomes whether a finite set of examples in the dataset is sufficient, given LLM/transformer methods, to do the deed.
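As a toy stand-in for that question (again my own sketch, not a claim about how transformers actually generalize): a handful of examples compressed into a few parameters that then answer queries outside the sample.

```python
# Minimal sketch: compress a finite sample into a small rule, then query it
# beyond the sample. A stand-in for the idea, not for transformer training.
import numpy as np

# Five observed input/output pairs, generated by a rule the fitter never sees.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs**2 - 2.0 * xs + 1.0

# "Training": five pairs compressed into three coefficients.
coeffs = np.polyfit(xs, ys, deg=2)

# "Generalization": an input that was never in the sample.
print(np.polyval(coeffs, 10.0))   # ~281.0, i.e. 3*100 - 2*10 + 1
```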
And as a layperson, it seems to me highly likely that we are seeing that already.
I say this because the system isn’t making a lookup table; it’s compressing the I/O mapping into some kind of algorithm that has relatively low entropy and requires high logical depth to produce the output.
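To make that low-entropy / high-logical-depth contrast concrete, here is another toy of my own: the same long bit string available either from a tiny program that takes many steps to run, or from a large table that answers instantly.

```python
# Toy contrast: small description plus lots of computation, versus
# large description plus trivial lookup, for the same bit string.

def logistic_bits(n, x=0.1, r=3.9):
    """Short program, many steps: iterate the logistic map n times,
    emitting one bit per step."""
    bits = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        bits.append(1 if x > 0.5 else 0)
    return bits

N = 100_000
computed = logistic_bits(N)   # cheap to write down, costly to run
table = list(computed)        # costly to write down, cheap to look up

# Same extension; the difference is only in where the bits and the work live.
assert table[54_321] == logistic_bits(54_322)[-1]
```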
It would seem to be a special case, rather than the rule, that such an efficient compression, the one the system has annealed to, just so happens to reproduce exactly the lookup-table I/O mapping and nothing more. If the data in the sample, taken together, imply a relationship (a causal story) with richer entanglements, the data are just a “for instance.” Much as with science: the observations are finite, the projectable theory is compact, but the theory’s I/O table is gigantic and maybe infinite.
It’s true that there may be stuff that’s causally relevant but isn’t implied by the dataset, or that lies at a logical depth the annealing can’t or doesn’t reach now. There may even be a Gödel-type reason to think that has to be so. Or a more straightforward, old-style philosophy-of-science argument that no finite dataset can pick out one among the infinitely many models that fit it. “Grue” and all that problem-of-induction stuff.
But humans bumble around in this context and push the frontier beyond the finite dataset, and in general it would seem to me that LLM/transformers can do that too, and are doing it, unless there is some in-principle reason (does a math argument exist?) why that particular AI method can never, or usually won’t, anneal to a compressed representation that encodes causal factors (information generators, “ideas,” “understanding”) that supervene on the dataset but go beyond the dataset’s boundaries.