19 - Accelerating progress in Artificial General Intelligence: Choosing a benchmark
Brandon Rohrer
Submitted: Oct 08, 2009; Version 7 submitted Feb 25, 2010
Status: Accepted  /  Action Editor: Tsvi Achler


Comments

The following comments are based on Version 5 of the submission.

--- Major issues:

(1) Research goal and evaluation method

The author acknowledges that "A benchmark implies a goal and implicitly contains a success criterion", as well as the situation in the field of AGI where there is no generally accepted research goal. The author suggests that even though this is the situation, there still can be evaluation criteria and benchmarks that are "palatable to a majority of us", and guided by them, the research may eventually lead to "an emergent end goal".

This does not sound right. In any research field, the evaluation methods (criteria, benchmarks, etc.) depend on the research goal, rather than the other way around. Even though AGI has no generally agreed research goal, it does not prevent each researcher from clarifying his/her current research goal, and then establishing a proper evaluation method accordingly, even when any current understanding of intelligence will inevitably be revised with the progress of the research.

Without a clearly specified research goal, it is hard to justify the proposed criteria and benchmarks, or to identify their application scope. Presumably the author does not suggest applying the proposed benchmark to all existing projects in the field. In that case, which projects are within its scope, and which are beyond it?

Since each benchmark implies a research goal, it is better to state the goal as clearly as possible, even though it is by no means final or agreed by everyone in the field.

(2) The seven criteria

The author explains each criterion, but says little about their relationship, or the property of them as a whole. Even though each criterion sounds reasonable, it doesn't mean the following issues can be omitted:

a) Independence: Is there any redundancy in the list? For example, why can't "Specificity" and "Task Focus" be merged into a single criterion? Why can't "Breadth", "Low Cost", and "Range" be combined into something like "general applicability"?

b) Consistency: Can the criteria be satisfied together, in principle? Why do "Specificity" and "Breadth" not always contradict each other? Given the complexity of the notion "intelligence", can we expect a benchmark that is both simple and fits the everyday usage of the term? If compromise among the criteria is inevitable, how should it be handled?

c) Completeness: Are these seven enough to cover the various criteria proposed or used so far in the field? For example, should a benchmark not only measure the problem-solving performance of a system, but also its scalability, robustness, adaptivity, learning speed, and so on?

Though formal proof cannot be expected in this situation, some discussions are necessary, otherwise the proposed criteria look arbitrary.

(3) The Direction Task

To include a human coach in a performance test is a novel idea. Though it has its benefits, the restrictions should be carefully specified to prevent the human from doing too much. To limit the human involvement to communication with the computer is not enough, since it does not rule out the possibility that the human remotely controls all the activities of the computer. What kind of "directions" are allowed so that "coaching" does not become "controlling"? The discussion in 2.8.4 is far from enough.

Also, as the author admits in 2.9.3, this task is not specific enough. Actually it gives readers a feeling that every task currently used in testing AGI systems can be considered as a "direction task", since it allows all kinds of variations. If that is the case, then what is new in this idea?

(4) The AGI Battery

The discussion is vague in the most crucial part of this benchmark: which tasks should be included in the AGI Battery. Only one possible procedure (board selection from individual submissions) is briefly mentioned.

(5) Article structure and writing style

It is probably more natural to move subsections 2.8 to 2.11 to a separate Section 3 (and push the discussion to Section 4), so as to separate the (more general) discussions on evaluation criteria from the (more concrete) discussions on benchmark tasks designed according to the criteria.

The paper touches many topics in a brief and informal manner, which is not suitable for a journal article, even for a conceptual, non-technical one. It will be better to drop some minor topics, and to go deeper in the major ones, so as to produce more concrete conclusions.

--- Minor issues:

*. page 2: We are no closer to a single definition of the term "intelligence.

The closing quotation mark is missing.

*. page 5: The matching of human capability was the essence of the Turing Test and most AGI goal descriptions have been in a similar vein.

Accurately speaking, the Turing Test asks more than "human capability" --- it requires "human behaviors", that is, the machine not only solves human-solvable problems, but must do it in the "human way". Most of the current AGI projects are not aiming at such a goal.

*. page 7: task-based AGI vs. model-based AGI

These two are different (though related) research goals, rather than different approaches toward the same goal. Therefore, if a project aims at the former, "biological plausibility" is mostly irrelevant (except as an inspiration), no matter how much the machine has achieved.

*. page 8: reducible to practice

How does one "reduce" a theory to practice? Maybe it should be changed to "applicable to practice" or "instructive to practice".

*. page 8: the human intelligence assessment tools ... are based on the same model

But in the human case, "coaching" happens before, not during, the test, so "remote control" is not an issue.

*. page 11: [The direction task] encompasses all those tasks considered to be cognitive and the hallmarks of human intelligence.

Claims like this should be either supported with evidence, or modified, since they are far from obvious.

*. References

Goertzel, B., and Pennachin, C. 2007 --- "Editors" should be added

Duch et al. 2008 and Wang 2008b--- they are in the same proceedings, but listed differently

websites cited --- it is better to provide URLs
Pei Wang (Dec 4 2009 12:07PM)

I would first like to thank Dr. Wang for his thoughtful reading of the manuscript and perceptive suggestions. His comments have motivated a significant shift in tone, which is described below.


The following comments are based on Version 5 of the submission.

--- Major issues:

(1) Research goal and evaluation method

The author acknowledges that "A benchmark implies a goal and implicitly contains a success criterion", as well as the situation in the field of AGI where there is no generally accepted research goal. The author suggests that even though this is the situation, there still can be evaluation criteria and benchmarks that are "palatable to a majority of us", and guided by them, the research may eventually lead to "an emergent end goal".

This does not sound right. In any research field, the evaluation methods (criteria, benchmarks, etc.) depend on the research goal, rather than the other way around. Even though AGI has no generally agreed research goal, it does not prevent each researcher from clarifying his/her current research goal, and then establishing a proper evaluation method accordingly, even when any current understanding of intelligence will inevitably be revised with the progress of the research.

Without a clearly specified research goal, it is hard to justify the proposed criteria and benchmarks, or to identify their application scope. Presumably the author does not suggest applying the proposed benchmark to all existing projects in the field. In that case, which projects are within its scope, and which are beyond it?

Since each benchmark implies a research goal, it is better to state the goal as clearly as possible, even though it is by no means final or agreed by everyone in the field.


Dr. Wang’s comments helped to precipitate a subtle, but important change in the tone of the paper. Originally it was motivated by a desire to find a single, universal benchmark that would be good enough to cover the work of all participants in the AGI community. Now it is more modest in scope. It is intended to identify a benchmark for systems capable of natural world interaction. The title and abstract have been modified to reflect this, and Section 1.2, “A Benchmark as a Statement of Research Objectives,” has been added to clarify this position.


(2) The seven criteria

The author explains each criterion, but says little about their relationship, or the property of them as a whole. Even though each criterion sounds reasonable, it doesn't mean the following issues can be omitted:

a) Independence: Is there any redundancy in the list? For example, why can't "Specificity" and "Task Focus" be merged into a single criterion? Why can't "Breadth", "Low Cost", and "Range" be combined into something like "general applicability"?

b) Consistency: Can the criteria be satisfied together, in principle? Why do "Specificity" and "Breadth" not always contradict each other? Given the complexity of the notion "intelligence", can we expect a benchmark that is both simple and fits the everyday usage of the term? If compromise among the criteria is inevitable, how should it be handled?

c) Completeness: Are these seven enough to cover the various criteria proposed or used so far in the field? For example, should a benchmark not only measure the problem-solving performance of a system, but also its scalability, robustness, adaptivity, learning speed, and so on?

Though formal proof cannot be expected in this situation, some discussions are necessary, otherwise the proposed criteria look arbitrary.


These issues are now discussed in Section 2.8, “Evaluating the criteria.”


(3) The Direction Task

To include a human coach in a performance test is a novel idea. Though it has its benefits, the restrictions should be carefully specified to prevent the human from doing too much. To limit the human involvement to communication with the computer is not enough, since it does not rule out the possibility that the human remotely controls all the activities of the computer. What kind of "directions" are allowed so that "coaching" does not become "controlling"? The discussion in 2.8.4 is far from enough.


The following was added a couple of sections before 2.8.4 (into what is Section 3.1.1, “A Note on Coaching” in version 6 of the manuscript):

The coaching paradigm admits an extreme case in which the low level actions of the AGI candidate may be specifically evoked by the coach's verbal commands. In this "remote control" case, the AGI is not required to integrate data or make decisions. Those functions are performed by the coach. Remote control tasks are valid instances of the direction task, but they measure only the most rudimentary capabilities of an AGI candidate. Assuming equally competent coaches, they would not differentiate between AGI capabilities, and thus would not be particularly useful in the benchmarking process. This touches on a broader issue: for every AGI candidate, there will be many instances of the direction task that are trivially easy and many that are impossibly difficult. Since it is only the relative performance of one AGI candidate versus another that contributes to its assessment, that fact does not weaken the direction task's measurement effectiveness. It does, however, suggest that some additional constraints to ensure appropriate levels of difficulty may improve the direction task's efficiency.


Also, as the author admits in 2.9.3, this task is not specific enough. Actually it gives readers a feeling that every task currently used in testing AGI systems can be considered as a "direction task", since it allows all kinds of variations. If that is the case, then what is new in this idea?


The choice of “specificity” as a label for this criterion may have introduced some confusion. The direction task lacks specificity in the sense that it produces a relative performance measure, rather than an absolute one. It requires two AGI candidates to compete on the same task, unknown to each beforehand, as when college students take a final exam. It does not imply that the direction task is loosely defined. Its definition, while broad, is clear.

Dr. Wang’s observation that every task currently used in testing AGI systems can be considered an instance of the direction task is accurate. The difference with the direction task is that the AGI system, its designer, and its coach do not know what the specific evaluation task will be beforehand. What is new with the direction task is its breadth. In order to excel at the direction task, an AGI candidate must perform better than others in a broad set of individual tasks without foreknowledge of those tasks.
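
As a rough illustration of the relative measure described above, the following Python sketch draws a set of tasks that no candidate sees in advance and then ranks candidates only by how often each one beats the others on the same task. The function names (`draw_hidden_tasks`, `relative_standing`), the task pool, and the scoring numbers are hypothetical and are not taken from the manuscript; this is one possible reading of "relative performance", not the procedure the paper specifies.

```python
import random


def draw_hidden_tasks(task_pool, k, seed):
    """Pick k tasks from the pool; candidates do not see the draw beforehand.

    The seed stands in for whatever secret selection process an evaluator uses
    (an assumption made for this sketch).
    """
    return random.Random(seed).sample(list(task_pool), k)


def relative_standing(scores):
    """Return, for each candidate, the fraction of head-to-head comparisons won.

    scores maps a candidate name to a list of per-task scores, with every
    candidate scored on the same tasks in the same order. Ties count as half
    a win. Only the ordering of scores within a task matters, so the measure
    is relative rather than absolute.
    """
    names = list(scores)
    n_tasks = len(scores[names[0]])
    comparisons = (len(names) - 1) * n_tasks
    wins = {name: 0.0 for name in names}
    for t in range(n_tasks):
        for a in names:
            for b in names:
                if a == b:
                    continue
                if scores[a][t] > scores[b][t]:
                    wins[a] += 1.0
                elif scores[a][t] == scores[b][t]:
                    wins[a] += 0.5
    return {name: wins[name] / comparisons for name in names}


if __name__ == "__main__":
    pool = ["navigation", "object retrieval", "puzzle solving", "assembly"]
    tasks = draw_hidden_tasks(pool, k=3, seed=7)
    # Hypothetical per-task scores for two candidates on the drawn tasks.
    results = {"candidate_A": [3, 2, 5], "candidate_B": [2, 1, 9]}
    print(tasks, relative_standing(results))
```

Because only within-task ordering is used, rescaling every score by the same factor leaves the standings unchanged, which is the sense in which the measure lacks an absolute scale.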


(4) The AGI Battery

The discussion is vague in the most crucial part of this benchmark: which tasks should be included in the AGI Battery. Only one possible procedure (board selection from individual submissions) is briefly mentioned.


A paragraph of discussion was added (p. 17) on each of these two topics: the individual tasks to be included in the AGI battery and the selection procedure. The text now contains a partial list of potential task types: verbal communication, puzzle solving, object identification and classification, object retrieval, object manipulation and assembly, cooperation, combat, navigation, legged locomotion, and flight. It also outlines in much greater detail how a selection committee might be chosen, solicit proposals for battery tasks, evaluate them, improve them, and release them in a somewhat timely manner with minimal conflict of interest.


Brandon Rohrer (Jan 2 2010 9:54PM)


(5) Article structure and writing style

It is probably more natural to move subsections 2.8 to 2.11 to a separate Section 3 (and push the discussion to Section 4), so as to separate the (more general) discussions on evaluation criteria from the (more concrete) discussions on benchmark tasks designed according to the criteria.

Great suggestion. The paper is organized this way now.


The paper touches many topics in a brief and informal manner, which is not suitable for a journal article, even for a conceptual, non-technical one. It will be better to drop some minor topics, and to go deeper in the major ones, so as to produce more concrete conclusions.


The technical detail of the discussion was adjusted according to this comment. Some sections were expanded and deepened, as described above. Several paragraphs were removed entirely, as were a number of statements that were too informal in either their tone or the extent to which they were supportable. The re-focusing of the paper onto natural interaction tasks allowed the removal of several paragraphs that dealt (inadequately) with other research objectives, including biological fidelity and formal problem solving theory. They are now appropriately handled by the declaration that they are outside the scope of the proposed benchmarks.


--- Minor issues:

*. page 2: We are no closer to a single definition of the term "intelligence.

The closing quotation mark is missing.


added


*. page 5: The matching of human capability was the essence of the Turing Test and most AGI goal descriptions have been in a similar vein.

Accurately speaking, the Turing Test asks more than "human capability" --- it requires "human behaviors", that is, the machine not only solves human-solvable problems, but must do it in the "human way". Most of the current AGI projects are not aiming at such a goal.


The Turing Test referenced here is the original test proposed by Alan Turing in which (depending upon the interpretation) either a man impersonates a woman or a computer impersonates a human in a verbal communication task. In either case, the task is defined functionally. The methods, strategies, and processing mechanisms are not specified. Variations on the Turing Test are typically defined in a similar way. They are task based and do not specify how the task is achieved. Turing Tests may be considered to require human behaviors in that they often involve tasks (such as verbal communication) that are common to humans, but far more narrow than those addressed by general problem solvers.


*. page 7: task-based AGI vs. model-based AGI

These two are different (though related) research goals, rather than different approaches toward the same goal. Therefore, if a project aims at the former, "biological plausibility" is mostly irrelevant (except as an inspiration), no matter how much the machine has achieved.


This idea has been incorporated in the discussion on scoping the benchmark to natural world interaction. (p. 4)


*. page 8: reducible to practice

How does one "reduce" a theory to practice? Maybe it should be changed to "applicable to practice" or "instructive to practice".


This phrase is taken from terminology in patent law and only means that it must be possible to embody the theory in some way such that it performs its intended function. However, due to content editing, the phrase does not occur in the new version.


*. page 8: the human intelligence assessment tools ... are based on the same model

But in the human case, "coaching" happens before, not during, the test, so "remote control" is not an issue.


This is an excellent observation. This illustrates one strategy for avoiding the remote control case, that is, the case in which the task tests the coach’s intelligence and ingenuity rather than that of the machine. This falls under the limited coaching case described in Section 2.8.4 and is likely to be necessary as machine capabilities increase. Other tests, such as those of horses or very young children, require much more coaching and demonstrate another point on the continuum between remote control and full autonomy.


*. page 11: [The direction task] encompasses all those tasks considered to be cognitive and the hallmarks of human intelligence.

Claims like this should be either supported with evidence, or modified, since they are far from obvious.


This statement was modified as follows:

It encompasses many tasks considered to be cognitive and the hallmarks of human intelligence, including problem solving, language learning, visual perception, category learning, reasoning in uncertain environments, locomotion, manipulation, and social interaction.


*. References

Goertzel, B., and Pennachin, C. 2007 --- "Editors" should be added

Duch et al. 2008 and Wang 2008b--- they are in the same proceedings, but listed differently

websites cited --- it is better to provide URLs


These were all resolved.
Brandon Rohrer (Jan 2 2010 9:55PM)

(Review from Ben Goertzel)


I think this is a solid though not awesome paper, which makes a reasonable contribution to the discussion of tasks, metrics, evaluation, benchmarks and roadmaps for AGI.

I have no problem with the coaching task...

Regarding the battery task, however, there is one major issue that the author doesn't seem to address: "game-ability by narrow AI systems."
In particular, the battery test approach he suggests seems to be too easily game-able by a narrow AI system consisting of a switch statement wrapping up a bunch of narrow AI programs corresponding to the particular tasks.  The prize would then go to the team that created the best programming language and framework for rapidly creating narrow AI programs fulfilling the newly determined tasks-of-the-year....

I can think of a solution to this, such as having the contestants not know the specific tasks involved each year, so that their AGI systems have to deal with "mystery tasks" in a known environment.  But this of course gets more complicated...

Anyway I'd be happy to recommend acceptance of this paper, under the provision that the author must insert a thorough discussion of the issue mentioned above and its implications for the battery task.


Response to review #2:

I would like to thank Dr. Goertzel for his concise and insightful review. He expressed concern that the AGI battery could be gamed, in the sense that it might be possible to create a system that performs well on it that does not actually exhibit broadly intelligent behavior. As a result, this discussion has been included in section 3.4.1:

***
There is a possible threat to fitness if battery tasks are too narrow. This case would admit a solution that comprises a collection of narrow point solutions with an executive to select between them. This solution would be unsatisfying since it would not apply to any problems outside the task space.

One possible fix to this threat is to constrain the structure of the AGI. This is tempting, especially when such constraints can be biologically motivated. But as was argued previously, the strength of a purely task-based benchmark is that we avoid being trapped by our own preconceptions about what mechanisms underlie intelligence. If an AGI consisted of a large number of narrow heuristics, wrapped with an elaborate if-then loop, yet was still capable of matching human performance on all conceivable tasks, there would be no reason not to consider it a human-level AI. The intended purpose of a benchmark is to provide a measure of intelligence. Regardless of the approach taken, systems that perform well on them should be considered intelligent. Specifying the mechanism beforehand is getting the process backward.

A more principled way to ensure fitness is to make the AGI battery sufficiently broad, both in the individual tasks and in the aggregate, and to heavily reward breadth over virtuosity. If the task space outside the benchmark tasks is of interest, then the tasks can be enlarged to include it. Tasks should be designed with as many free parameters as possible while remaining feasible for at least some of the systems being evaluated. For instance, a battery task might be to "play a board game." An AI that could learn to play any board game at a 2-year-old level would far outperform Deep Blue: even though it would lose spectacularly at chess, it would win at checkers, Monopoly, Othello, and every other board game that didn't depend on chance. The RoboCup robot soccer league is following this principle as well by gradually removing constraints on its environment. For example, the league recently relaxed the specifications on pitch lighting conditions. Changes like this drive teams away from optimal solutions for a specific environment and toward more robust solutions appropriate for a broader set of environments.
***
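
To make "reward breadth over virtuosity" concrete, here is a minimal Python sketch of one way a battery score could be aggregated. It assumes per-task scores normalized to the range 0 to 1 and caps the credit any single task can contribute; the cap value, the function names, and the example numbers are all hypothetical, and the manuscript does not prescribe this or any other particular formula.

```python
def plain_mean(task_scores):
    """Ordinary average of per-task scores; virtuosity on one task counts fully."""
    return sum(task_scores) / len(task_scores)


def breadth_score(task_scores, cap=0.3):
    """Average of per-task scores after capping each one at `cap`.

    Scores are assumed to lie in [0, 1]. Performance above the cap earns no
    extra credit, so a system gains more from modest competence on many tasks
    than from excellence on a single one. The cap of 0.3 is an arbitrary
    value chosen for illustration.
    """
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    return sum(min(s, cap) / cap for s in task_scores) / len(task_scores)


if __name__ == "__main__":
    # Hypothetical normalized scores on four board games.
    specialist = [1.0, 0.0, 0.0, 0.0]   # superb at one game only
    generalist = [0.2, 0.2, 0.2, 0.2]   # beginner level at every game
    for name, scores in (("specialist", specialist), ("generalist", generalist)):
        print(name, plain_mean(scores), breadth_score(scores))
```

Under the plain mean the specialist ranks first, while under the capped score the generalist does, which mirrors the board game example in the excerpt. Other aggregates, such as a geometric mean, would have a similar effect.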

In addition, several relevant references were added which have recently come to my attention, including papers by Paul Cohen, Robert Wray, Nils Nilsson, Ronald Brachman, John Laird, and Christian Lebiere.
Brandon Rohrer (Mar 2 2010 3:41PM)