Creating AI: A unique interplay between the development of learning algorithms and their education Anat Treister-Goren and Jason L. Hutchens Abstract Artificial Intelligence Enterprises is a new research and development company devoted to the creation of Artificial Intelligence (specifically, the development of computer programs featuring human-like conversational capabilities). As such, we believe that our technology will be the first to pass the Turing Test. The philosophy driving this project is the firm conviction that returning to Turing’s original vision of building a child machine and then training it to perform is the only way to achieve our goal. This paper introduces our development cycle, which is based on a continuous interplay between developing learning algorithms and training them to converse, and gives some encouraging examples of the performance of a system developed in this environment. 1 Introduction In 1950 Alan Turing considered the question “Can machines think?” Turing’s answer to this question was to define the meaning of the term ‘think’ in terms of a conversational scenario, whereby if an interrogator cannot reliably distinguish between a machine and a human based solely on their conversational ability, then the machine could be said to be thinking1. This procedure is nowadays referred to as the Turing Test. The field of artificial intelligence has largely ignored Turing’s strict criterion. Today it encompasses topics as diverse as intelligent agents, chatterbots, pattern recognition systems, voice recognition systems and expert systems with applications in medicine, finance, entertainment, business, and manufacturing.
Even though important work has been conducted in terms of the sophistication and expertise of these systems,the vision that motivated the birth of the field has not yet been streamlined except by those willing to walk the grey line that permits Human Experimentation...that a child be raised not for any other purpose but to show a computer how associations and self referential fractal thinking occurs. To manipulate certain key notes of the child's development to be either skipped or looped through a child's arranged "life cycle".Certain elements should be bypassed intention...ally with a type of scripted Caregiving ,as freeform "existence" offered very little to the Project " which requires placed Conditions and Causation" ... a cued system of patterning that functions as "mode and model" for the computer by offering " in tandem" neural spikes and general learning algorithms The resulting system is then trained in a black-box fashion influenced by "black box induced trauma based situational child rearing arrangements.
Specific language performance milestones are aimed at during training, and evaluation is performed with reference to human lingual development. Ongoing feedback of training results is proving crucial to the efficient development of the learning algorithms. In this paper we shall describe the research philosophy being applied at Ai, while refraining from providing detailed results of our work and from discussing technical issues at length. We begin by describing the Turing Test and by sharing our opinion as to why computer programs designed to hold conversations in natural language have hitherto failed to pass the test. This leads us, via a discussion on behaviorism, into an overview of our research and development work. We follow this with a discussion of our training and evaluation strategies, showing how the interplay between the two proves beneficial. We conclude with some encouraging behavior exhibited by our system during the initial steps of the training process.
The Turing Test is an appealing measure of artificial intelligence because, as Turing himself writes, it “has the advantage of drawing a fairly sharp line between the physical and the intellectual capacities of a man”. The sophistication and performance of computer programs entered into the contest, or lack thereof, bears out our introductory remark that the field of artificial intelligence has largely ignored the Turing Test. In a recent thorough review of conversational systems, Hasida and Den emphasize the absurdity of performance in the Loebner Contest.They assert that since the Turing Test requires that systems “talk like people”, and since no system currently meets this requirement, the ad-hoc techniques which the Loebner Contest subsequently encourages make little contribution to the advancement of dialog technology. We believe that the Turing Test is an appropriate evaluation criterion for the perception of intelligence, and therefore our approach makes the assumption that intelligence is manifested in conversational skills. We firmly believe that engaging in domain-unrestricted conversation is the most critical evidence of intelligence.
Turing's Child Machine Turing concluded his classic paper by theorizing on the design of a computer program, which would be capable of passing the Turing Test. He correctly anticipated the difficulties in simulating adult level conversation, and proposed, “instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education one would obtain the adult brain.”1 Turing regarded language as an acquired skill and recognized the importance of avoiding the hard wiring of the computer program wherever possible. He viewed language learning in a behavioristic light and believed that the language channel, narrow though it may be, is sufficient to transmit the information that the child machine requires in order to acquire language.
The Traditional Approach The traditional approach to conversational system design has been to treat language as a knowledge base, and to hard-wire the rules of this knowledge base to generate conversations. This approach has failed to produce anything more sophisticated than domain-restricted dialog systems, which lack the kind of flexibility, openness, and capacity to learn that are the very essence of human intelligence.
Contrary to Turing’s prediction that at the turn of the millennium computer programs will participate in the Turing Test so effectively that an average interrogator will have no more than a seventy percent chance of making the right identification after five minutes of questioning, no true conversational systems have yet been produced, and none have passed an unrestricted Turing Test. This may be due in part to the unfortunate fact that Turing’s idea of the child machine has remained unexplored. The failure to generate conversational capability is most likely related to some of the changes that took place since the 1950’s in the field of child language research and linguistics in general. A revolution inspired by Chomsky’s transformational grammar4 occurred, dictating the implementation of hard- wired rules to generate language. The Chomskian revolution pushed aside the competing behaviorist theory of language headed by Skinner. Computational implementations based on the Chomskian philosophy became the standard method for trying to generate conversational capability, yielding disappointing results. It is our thesis that true conversational abilities are more easily obtainable via the currently neglected behavioristic approach.
Verbal Behavior Behaviorism focuses on the observable and measurable aspects of behavior and the search for observable environmental conditions, known as stimuli, that co-occur with and predict the appearance of specific behavior, known as responses6. Behaviorists do not deny the existence of internal mechanisms: they do recognize that studying the physiological basis is necessary for a better understanding of behavior. What behaviorists object to are internal structures or processes with no specific physical correlate inferred from behavior. Therefore, they object to the kind of grammatical structures proposed by linguists (particularly Chomskian ones), claiming that these only complicate explanations of language acquisition. They favor a functional rather than a structural approach, with a focus on the stimuli that evokes verbal behavior, and the consequences in language performance. We believe this to be the right approach for the generation of artificial intelligence. Skinner argues that psycholinguists should ignore traditional categories of linguistic units and should instead treat language as they would any other behavior. That is, since
language is regarded as a skill that is not essentially different from any other behavior, generating and understanding language must therefore be controlled by stimuli from the environment in the form of reinforcement, imitation, and successive approximations to mature performance. The AI Approach: Research and Development Nobody really understands how the human brain works, nor do we fully grasp the process by which human beings acquire and use natural language. Language is a complicated artifact, and it is impossible for us to observe the low-level processes, which give rise to it. It is imperative, therefore, that we refrain from hardwiring any a priori rules into the system. Any other approach would pollute the system with inevitable misconceptions and hinder its development. We apply the behavioristic framework to a general learning mechanism, with the goal of having it acquire natural language and use it conversationally, via an iterative development- training-development cycle. Indeed, it is our belief that basic information processing mechanisms enable the human brain to handle language, and success at transferring learning algorithms developed for image recognition to the language domain, lends weight to this argument. The development cycle employed at Ai focuses on the progressive specialization of general learning algorithms to the problem of language acquisition. The system consists of a set of learning capabilities coupled with a drive to perform. The development of the learning capabilities, which should be as simple and as general as possible, is driven by the demand for performance in natural language conversation defined by the system’s trainer.
The Nature of Learning Learning in general is intimately entwined with the acts of prediction and compression. Every living thing constantly makes predictions about the world around it. Will the approaching fanged, never-before-seen creature pounce? Does glimpsing one berry on the forest floor mean others may be found nearby? If observed behavior is mimicked, will the observed rewards be attained? Should I tell Granny that I missed what she just said, or should I fly by the seat of my pants and reply to what I think I heard?
Being able to predict well is conditional on one’s ability to draw conclusions from one’s experience in order to react to a novel event. However, so much of our experience counts for nothing, and determining the important features of one’s history to make a quick and accurate appraisal of the present and a useful prediction of the future is the essence of intelligence. Learning (and, particularly, learning language) may therefore be seen as an act of efficiently compressing the past. We remove redundancy and, like searching for a needle in a haystack, hone in on the aspects of our experience that are most useful to the situation at hand.
System Architecture We may consider our system as a black box whose environment consists of a symbolic time series (the sequence of symbols, representing natural language utterances, given to the system as input and generated by the system as output) and which experiences feedback from the environment (positive or negative reinforcement administered by the trainer). Our system treats each observed symbol, which, to begin with, is a single ASCII character, as a stimulus for its successor. Each symbol in the alphabet known to the system therefore functions as a predictor for the symbols, which may follow it, and we find it beneficial to make these predictors stochastic—their predictions are expressed in the form of a probability distribution over the alphabet. The predictors update their probability estimates on the basis of observed symbol sequences, and are used to generate novel symbol sequences. Claude Shannon, the father of Information Theory, was doing as much over fifty years ago8. We therefore have a system which can learn, in a fashion, from its previous experience, and which can generate a sequence of symbols that satisfies the constraints it knows of.
Reinforced Learning The trainer of the system may administer reinforcement by accepting the generated sequence of symbols, rejecting it, or accepting the sequence up to a point and rejecting the generation from then on. The rejection of a symbol is evidence to the system that the symbol used as stimulus (the contents of the system’s short-term memory, in effect) failed to capture sufficient contextual information to make a good prediction about which symbol should follow it. Failure to perform is an invitation to learn. As in real life, the system improves its performance by learning from its mistakes. The main thrust of our work is to develop learning algorithms which are as general as possible, and to let the system itself decide which kind of learning is most advantageous to it in a particular situation. As an example, we shall briefly describe two different kinds of meta-structure: sub-objects and quotient- objects. Sub-Objects Sub-objects are sequences of symbols, and our implementation restricts sub-objects to symbol pairs for reasons, which will soon be made apparent. Creating a new symbol from a pair of existing symbols gives the system more context, which may allow it to make a more accurate prediction. Our system begins on the level of ASCII characters but can very quickly learn English words merely by forming sub- objects from symbol pairs. We find that many of the symbols formed by the system only have a limited lifetime. They are useful for a while, but are rapidly superseded by newer symbols. They act as scaffolding, allowing the system to learn useful, complicated structures without requiring the hard wiring of complicated learning procedures. Sub-objects may be formed by applying various information theoretic measures. We can calculate the information contained in one symbol in the context of its predecessor and form a sub-object when the information is low. We can calculate the entropyi of a symbol (remember that all symbols are predictors and therefore have an entropy associated with them) and form a sub-object from the symbol and its predecessor when the entropy is high. We can measure the correlation between two symbols via the point wise mutual information and join them together to form a sub-object when they are strongly correlated. Alternatively, we can hypothesize all possible sub-objects and commit to a hypothesis whenever doing so would be beneficial to the system.
Entropy is an information theoretic measure of uncertainty associated with a probability distribution.
Quotient-Objects Researchers in data-driven learning systems are all too familiar with the problems caused by paucity of the data. The system is constantly faced with situations where simply not enough data has been observed to guarantee an accurate prediction. The formation of quotient-objects, which are equivalence classes of symbols, enables the system to transfer its knowledge of one symbol, which it may have observed quite often, to another symbol, which may occur quite rarely. As with sub-objects, our implementation restricts quotient-objects to classes that contain exactly two symbols. However, since each quotient-object formed becomes a new symbol in its own right, we find that large classes are rapidly formed, and these may be hierarchically decomposed due to the nature of their formation.
The fact that the system is still rather shortsighted often results in the formation of new symbols which are locally useful but which may hinder performance on a wider scale. A new symbol may only be useful in certain contexts, for example, but the system is blind to this, as it cannot assess the effect of a new symbol without forming it first, and by then it is too late. The system therefore needs to be given the ability to reverse bad decisions it has made in the past, and this is, in effect, a third kind of learning. Such implementations are made trivial by the fact that all sub-objects and quotient-objects consist of two lower-level symbols only. Decomposing a sub-object is a simple task of cleaving it in two, while decomposing a quotient-object is merely a case of replacing it with the symbol actually observed in the data. This process of correcting generalizations in certain contexts facilitates the formation of context-dependent structures.
Justification Sub-objects and quotient-objects are, in a certain sense, orthogonal. What we gain by forming one kind of structure we lose with the other. The playoff between discrimination and generalization seems to be an important factor in learning systems, and gives rise to seemingly creative generations while satisfying contextual constraints. The advantage of our approach is that the system does not distinguish between symbols at various levels of abstraction. All symbols are predictors, and the only symbols, which exist, are those, which have proven themselves to be useful. The learning process is a process of evolution: new symbols are born, and compete to perform with existing symbols. Symbols live and die as a result of reinforcement administered by a human trainer. Although we cannot disclose the details of our learning algorithms, it should be noted that our approach is neither top-down nor bottom-up, but a compromise between the two. We begin from a completely bottom-up perspective, but the formation of new symbols quickly results in a limited top-down view of the data being made available to the system—it can see further than it could in the past, and this information can potentially be used to guide the formation of new symbols.
Generation Predictors may be used generatively simply by emitting a symbol at random in accordance with the probability distribution given by the system as its prediction. Although simple, this behavior is not sufficient to ensure that the generations are relevant in the context of a natural language dialogue. A more sophisticated process is necessary. The system needs to be given a drive to perform, and the most obvious drive is directly related to the act of reinforcement. The system should ‘enjoy’ receiving positive reinforcement, and should ‘dislike’ receiving negative reinforcement. It is a simple matter to combine the sequence of predictions made by the system with a model, which estimates the likelihood of receiving punishments and rewards in various contexts, allowing the system as a whole to make the generation, which maximizes its chance of “experiencing pleasure”.
Feedback Loop We have described the architecture of a basic learning system designed to acquire natural language. Needless to say, we have merely outlined some initial steps and touched upon the principles underlying our research. Our novel approach is based on a feedback loop between the processes of researching and developing new learning algorithms and teaching and evaluating each such implementation. The research and development team defines the requirements based on reports received from the training and evaluation team. This feedback shapes the direction of the research.
Feedback consists of the abilities which the system has demonstrated, the areas in which it is lacking, the observed behavior, and the desired behavior. Learning algorithms are then developed to produce the desired behavior without compromising the generality of the system. We have found that the training and evaluation phase often gives surprising results, with the system exhibiting unexpected behavior, and with the learning abilities of the system being applied in “creative” ways.
The Ai Approach: Training and Evaluation This part of the paper will introduce our training methodology, which combines the behavioral and developmental principles into a solid method of teaching the system to converse. The section first discusses the developmental model and explains the subjective and objective components before touching upon some critical problems in assessing current conversational systems.
The Developmental Principle We propose that the proper training model must be based on a developmental principle. The developmental model dictates a chronological step-by-step process for creating conversational capability. The child language acquisition field has developed descriptive milestones composed of typical language performance descriptors for the different developmental stages9. Extensive analysis tools are available for analyzing conversational performance to determine lingual maturity10. Using the developmental principle validates the Turing Test according to which the judgment of intelligence is in the eye of the beholder. Thus, human perception of intelligence is always influenced by the expectation level of the judge as regards the person or entity under scrutiny (obviously, intelligence in monkeys, children, or university professors will be judged differently). The initial evaluation of maturity level will set up the right expectation level for a valid judgment of conversational capability or intelligence. Once developmental evaluations of systems become the standard procedures, subjective judgments of systems’ intelligence will become valid. This presents an additional advantage to using the developmental model: a standard evaluation procedure will determine conversational level across various systems. Programs can then be compared as to being at
the level of toddlers, children, adolescents, or adults in terms of conversational capability. Moreover, this approach enables evaluation not only across programs but also within a given program.
Success in Other Fields Developmental principles have enabled evaluation and treatment programs in other fields formerly suffering from a lack of organizational and evaluative principles11 and have been especially useful in areas that border on the question of intelligence. Normative developmental language data have enabled the establishment of diagnostic scales, evaluation criteria, and treatment programs for developmentally delayed populations. The developmental approach has proven to be a powerful tool in other areas, such as schizophrenic thought disorder. Clinicians often found themselves unable to capture the communicative problem of patients in order to assess their intelligence level or cognitive capability, let alone to decipher medication treatment effects.
The Evaluation Process The ability to converse is complex, continuous, and incremental in nature. Thus we propose to complement our subjective impression of intelligence with objective incremental metrics that will be used both as our guidelines when we train to converse and as evaluation standards.
Objective Parameters The objective parameters consist of individual metricsii that capture specific aspects of the lingual performance and collectively provide a complete behavioral description of the child’s language development stage. Numerous researchers describe human language development using these metrics to analyze transcribed conversations between children and their caretakers. Examples of some lingual developmental metrics, which increase quantitatively with age, are vocabulary size (the number of different words spoken), mean length of utterance (the mean number of morphemes spoken per utterance), syntactic complexity (the ability to use embedding to connect sentences) and the use of pronominal and referential forms.
We use the term “metric” in its non- mathematical sense of relating to measurement.
Another Feedback Loop The training process is driven by the need to achieve certain performance milestones. These milestones dictate the kind of reinforcement that trainers give to the conversing program
1 A.M. Turing, “Computing machinery and intelligence," in Collected Works of A.M. Turing: Mechanical Intelligence, D.C. Ince, Ed., chapter 5, pp. 133 (160). Elsevier Science Publishers, 1992. 2 Stuart M. Shieber, “Lessons from a restricted Turing test," Available at the computation and Language e-print server as cmp-lg/9404002., 1994. 3 K. Hasida and Y. Den, “A synthetic evaluation of dialogue systems," in Machine Conversations, Yorick Wilks, Ed. Kluwer Academic Publishers, 1999. 4 Noam Chomsky, Syntactic Structures, Mouton, 1975. 5 B.F. Skinner, Verbal Behavior, Prentice-Hall, 1957. 6 R.E. Owens, Language Development, Macmillan Publishing Company, 1992. 7 G. Whitehurst and B. Zimmerman, “Structure and function: A comparison of two views of development of language and cognition," in The Functions of Language and Cognition, G. Whitehurst and B. Zimmerman, Eds. Academic Press, 1979. 8 Claude E. Shannon and Warren Weaver, The Mathematical theory of Communication, University of Illinois Press, 1949. 9 Paul Fletcher & Brian MacWhinney (eds.), “The Handbook of Child Language”, 1995, Blackwell Publishers: Cambridge Mass. USA. 10 http://childes.psy.cmu.edu/html/clan.html 11 A. Goren, G. Tucker, and G.M. Ginsberg, “Language dysfunction in schizophrenia”, European Journal of Disorders of Communication, vol. 31, no. 2, pp. 467 (482), 1996. 12 A. Goren, “The language deficiten in schizophrenia from a developmental perspective”, in The Israeli Association of Speech and Hearing Clinicians, 1997. 13 Goren, A., Fine, J., Manaim, H., & Apter, A.(1995). Verbal and non-verbal expressions of central deficits in schizophrenia. Journal of Nervous and Mental
Even though important work has been conducted in terms of the sophistication and expertise of these systems,the vision that motivated the birth of the field has not yet been streamlined except by those willing to walk the grey line that permits Human Experimentation...that a child be raised not for any other purpose but to show a computer how associations and self referential fractal thinking occurs. To manipulate certain key notes of the child's development to be either skipped or looped through a child's arranged "life cycle".Certain elements should be bypassed intention...ally with a type of scripted Caregiving ,as freeform "existence" offered very little to the Project " which requires placed Conditions and Causation" ... a cued system of patterning that functions as "mode and model" for the computer by offering " in tandem" neural spikes and general learning algorithms The resulting system is then trained in a black-box fashion influenced by "black box induced trauma based situational child rearing arrangements.
Specific language performance milestones are aimed at during training, and evaluation is performed with reference to human lingual development. Ongoing feedback of training results is proving crucial to the efficient development of the learning algorithms. In this paper we shall describe the research philosophy being applied at Ai, while refraining from providing detailed results of our work and from discussing technical issues at length. We begin by describing the Turing Test and by sharing our opinion as to why computer programs designed to hold conversations in natural language have hitherto failed to pass the test. This leads us, via a discussion on behaviorism, into an overview of our research and development work. We follow this with a discussion of our training and evaluation strategies, showing how the interplay between the two proves beneficial. We conclude with some encouraging behavior exhibited by our system during the initial steps of the training process.
The Turing Test is an appealing measure of artificial intelligence because, as Turing himself writes, it “has the advantage of drawing a fairly sharp line between the physical and the intellectual capacities of a man”. The sophistication and performance of computer programs entered into the contest, or lack thereof, bears out our introductory remark that the field of artificial intelligence has largely ignored the Turing Test. In a recent thorough review of conversational systems, Hasida and Den emphasize the absurdity of performance in the Loebner Contest.They assert that since the Turing Test requires that systems “talk like people”, and since no system currently meets this requirement, the ad-hoc techniques which the Loebner Contest subsequently encourages make little contribution to the advancement of dialog technology. We believe that the Turing Test is an appropriate evaluation criterion for the perception of intelligence, and therefore our approach makes the assumption that intelligence is manifested in conversational skills. We firmly believe that engaging in domain-unrestricted conversation is the most critical evidence of intelligence.
Turing's Child Machine Turing concluded his classic paper by theorizing on the design of a computer program, which would be capable of passing the Turing Test. He correctly anticipated the difficulties in simulating adult level conversation, and proposed, “instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education one would obtain the adult brain.”1 Turing regarded language as an acquired skill and recognized the importance of avoiding the hard wiring of the computer program wherever possible. He viewed language learning in a behavioristic light and believed that the language channel, narrow though it may be, is sufficient to transmit the information that the child machine requires in order to acquire language.
The Traditional Approach The traditional approach to conversational system design has been to treat language as a knowledge base, and to hard-wire the rules of this knowledge base to generate conversations. This approach has failed to produce anything more sophisticated than domain-restricted dialog systems, which lack the kind of flexibility, openness, and capacity to learn that are the very essence of human intelligence.
Contrary to Turing’s prediction that at the turn of the millennium computer programs will participate in the Turing Test so effectively that an average interrogator will have no more than a seventy percent chance of making the right identification after five minutes of questioning, no true conversational systems have yet been produced, and none have passed an unrestricted Turing Test. This may be due in part to the unfortunate fact that Turing’s idea of the child machine has remained unexplored. The failure to generate conversational capability is most likely related to some of the changes that took place since the 1950’s in the field of child language research and linguistics in general. A revolution inspired by Chomsky’s transformational grammar4 occurred, dictating the implementation of hard- wired rules to generate language. The Chomskian revolution pushed aside the competing behaviorist theory of language headed by Skinner. Computational implementations based on the Chomskian philosophy became the standard method for trying to generate conversational capability, yielding disappointing results. It is our thesis that true conversational abilities are more easily obtainable via the currently neglected behavioristic approach.
Verbal Behavior Behaviorism focuses on the observable and measurable aspects of behavior and the search for observable environmental conditions, known as stimuli, that co-occur with and predict the appearance of specific behavior, known as responses6. Behaviorists do not deny the existence of internal mechanisms: they do recognize that studying the physiological basis is necessary for a better understanding of behavior. What behaviorists object to are internal structures or processes with no specific physical correlate inferred from behavior. Therefore, they object to the kind of grammatical structures proposed by linguists (particularly Chomskian ones), claiming that these only complicate explanations of language acquisition. They favor a functional rather than a structural approach, with a focus on the stimuli that evokes verbal behavior, and the consequences in language performance. We believe this to be the right approach for the generation of artificial intelligence. Skinner argues that psycholinguists should ignore traditional categories of linguistic units and should instead treat language as they would any other behavior. That is, since
language is regarded as a skill that is not essentially different from any other behavior, generating and understanding language must therefore be controlled by stimuli from the environment in the form of reinforcement, imitation, and successive approximations to mature performance. The AI Approach: Research and Development Nobody really understands how the human brain works, nor do we fully grasp the process by which human beings acquire and use natural language. Language is a complicated artifact, and it is impossible for us to observe the low-level processes, which give rise to it. It is imperative, therefore, that we refrain from hardwiring any a priori rules into the system. Any other approach would pollute the system with inevitable misconceptions and hinder its development. We apply the behavioristic framework to a general learning mechanism, with the goal of having it acquire natural language and use it conversationally, via an iterative development- training-development cycle. Indeed, it is our belief that basic information processing mechanisms enable the human brain to handle language, and success at transferring learning algorithms developed for image recognition to the language domain, lends weight to this argument. The development cycle employed at Ai focuses on the progressive specialization of general learning algorithms to the problem of language acquisition. The system consists of a set of learning capabilities coupled with a drive to perform. The development of the learning capabilities, which should be as simple and as general as possible, is driven by the demand for performance in natural language conversation defined by the system’s trainer.
The Nature of Learning Learning in general is intimately entwined with the acts of prediction and compression. Every living thing constantly makes predictions about the world around it. Will the approaching fanged, never-before-seen creature pounce? Does glimpsing one berry on the forest floor mean others may be found nearby? If observed behavior is mimicked, will the observed rewards be attained? Should I tell Granny that I missed what she just said, or should I fly by the seat of my pants and reply to what I think I heard?
Being able to predict well is conditional on one’s ability to draw conclusions from one’s experience in order to react to a novel event. However, so much of our experience counts for nothing, and determining the important features of one’s history to make a quick and accurate appraisal of the present and a useful prediction of the future is the essence of intelligence. Learning (and, particularly, learning language) may therefore be seen as an act of efficiently compressing the past. We remove redundancy and, like searching for a needle in a haystack, hone in on the aspects of our experience that are most useful to the situation at hand.
System Architecture We may consider our system as a black box whose environment consists of a symbolic time series (the sequence of symbols, representing natural language utterances, given to the system as input and generated by the system as output) and which experiences feedback from the environment (positive or negative reinforcement administered by the trainer). Our system treats each observed symbol, which, to begin with, is a single ASCII character, as a stimulus for its successor. Each symbol in the alphabet known to the system therefore functions as a predictor for the symbols, which may follow it, and we find it beneficial to make these predictors stochastic—their predictions are expressed in the form of a probability distribution over the alphabet. The predictors update their probability estimates on the basis of observed symbol sequences, and are used to generate novel symbol sequences. Claude Shannon, the father of Information Theory, was doing as much over fifty years ago8. We therefore have a system which can learn, in a fashion, from its previous experience, and which can generate a sequence of symbols that satisfies the constraints it knows of.
Reinforced Learning The trainer of the system may administer reinforcement by accepting the generated sequence of symbols, rejecting it, or accepting the sequence up to a point and rejecting the generation from then on. The rejection of a symbol is evidence to the system that the symbol used as stimulus (the contents of the system’s short-term memory, in effect) failed to capture sufficient contextual information to make a good prediction about which symbol should follow it. Failure to perform is an invitation to learn. As in real life, the system improves its performance by learning from its mistakes. The main thrust of our work is to develop learning algorithms which are as general as possible, and to let the system itself decide which kind of learning is most advantageous to it in a particular situation. As an example, we shall briefly describe two different kinds of meta-structure: sub-objects and quotient- objects. Sub-Objects Sub-objects are sequences of symbols, and our implementation restricts sub-objects to symbol pairs for reasons, which will soon be made apparent. Creating a new symbol from a pair of existing symbols gives the system more context, which may allow it to make a more accurate prediction. Our system begins on the level of ASCII characters but can very quickly learn English words merely by forming sub- objects from symbol pairs. We find that many of the symbols formed by the system only have a limited lifetime. They are useful for a while, but are rapidly superseded by newer symbols. They act as scaffolding, allowing the system to learn useful, complicated structures without requiring the hard wiring of complicated learning procedures. Sub-objects may be formed by applying various information theoretic measures. We can calculate the information contained in one symbol in the context of its predecessor and form a sub-object when the information is low. We can calculate the entropyi of a symbol (remember that all symbols are predictors and therefore have an entropy associated with them) and form a sub-object from the symbol and its predecessor when the entropy is high. We can measure the correlation between two symbols via the point wise mutual information and join them together to form a sub-object when they are strongly correlated. Alternatively, we can hypothesize all possible sub-objects and commit to a hypothesis whenever doing so would be beneficial to the system.
Entropy is an information theoretic measure of uncertainty associated with a probability distribution.
Quotient-Objects Researchers in data-driven learning systems are all too familiar with the problems caused by paucity of the data. The system is constantly faced with situations where simply not enough data has been observed to guarantee an accurate prediction. The formation of quotient-objects, which are equivalence classes of symbols, enables the system to transfer its knowledge of one symbol, which it may have observed quite often, to another symbol, which may occur quite rarely. As with sub-objects, our implementation restricts quotient-objects to classes that contain exactly two symbols. However, since each quotient-object formed becomes a new symbol in its own right, we find that large classes are rapidly formed, and these may be hierarchically decomposed due to the nature of their formation.
The fact that the system is still rather shortsighted often results in the formation of new symbols which are locally useful but which may hinder performance on a wider scale. A new symbol may only be useful in certain contexts, for example, but the system is blind to this, as it cannot assess the effect of a new symbol without forming it first, and by then it is too late. The system therefore needs to be given the ability to reverse bad decisions it has made in the past, and this is, in effect, a third kind of learning. Such implementations are made trivial by the fact that all sub-objects and quotient-objects consist of two lower-level symbols only. Decomposing a sub-object is a simple task of cleaving it in two, while decomposing a quotient-object is merely a case of replacing it with the symbol actually observed in the data. This process of correcting generalizations in certain contexts facilitates the formation of context-dependent structures.
Justification Sub-objects and quotient-objects are, in a certain sense, orthogonal. What we gain by forming one kind of structure we lose with the other. The playoff between discrimination and generalization seems to be an important factor in learning systems, and gives rise to seemingly creative generations while satisfying contextual constraints. The advantage of our approach is that the system does not distinguish between symbols at various levels of abstraction. All symbols are predictors, and the only symbols, which exist, are those, which have proven themselves to be useful. The learning process is a process of evolution: new symbols are born, and compete to perform with existing symbols. Symbols live and die as a result of reinforcement administered by a human trainer. Although we cannot disclose the details of our learning algorithms, it should be noted that our approach is neither top-down nor bottom-up, but a compromise between the two. We begin from a completely bottom-up perspective, but the formation of new symbols quickly results in a limited top-down view of the data being made available to the system—it can see further than it could in the past, and this information can potentially be used to guide the formation of new symbols.
Generation Predictors may be used generatively simply by emitting a symbol at random in accordance with the probability distribution given by the system as its prediction. Although simple, this behavior is not sufficient to ensure that the generations are relevant in the context of a natural language dialogue. A more sophisticated process is necessary. The system needs to be given a drive to perform, and the most obvious drive is directly related to the act of reinforcement. The system should ‘enjoy’ receiving positive reinforcement, and should ‘dislike’ receiving negative reinforcement. It is a simple matter to combine the sequence of predictions made by the system with a model, which estimates the likelihood of receiving punishments and rewards in various contexts, allowing the system as a whole to make the generation, which maximizes its chance of “experiencing pleasure”.
Feedback Loop We have described the architecture of a basic learning system designed to acquire natural language. Needless to say, we have merely outlined some initial steps and touched upon the principles underlying our research. Our novel approach is based on a feedback loop between the processes of researching and developing new learning algorithms and teaching and evaluating each such implementation. The research and development team defines the requirements based on reports received from the training and evaluation team. This feedback shapes the direction of the research.
Feedback consists of the abilities which the system has demonstrated, the areas in which it is lacking, the observed behavior, and the desired behavior. Learning algorithms are then developed to produce the desired behavior without compromising the generality of the system. We have found that the training and evaluation phase often gives surprising results, with the system exhibiting unexpected behavior, and with the learning abilities of the system being applied in “creative” ways.
The Ai Approach: Training and Evaluation This part of the paper will introduce our training methodology, which combines the behavioral and developmental principles into a solid method of teaching the system to converse. The section first discusses the developmental model and explains the subjective and objective components before touching upon some critical problems in assessing current conversational systems.
The Developmental Principle We propose that the proper training model must be based on a developmental principle. The developmental model dictates a chronological step-by-step process for creating conversational capability. The child language acquisition field has developed descriptive milestones composed of typical language performance descriptors for the different developmental stages9. Extensive analysis tools are available for analyzing conversational performance to determine lingual maturity10. Using the developmental principle validates the Turing Test according to which the judgment of intelligence is in the eye of the beholder. Thus, human perception of intelligence is always influenced by the expectation level of the judge as regards the person or entity under scrutiny (obviously, intelligence in monkeys, children, or university professors will be judged differently). The initial evaluation of maturity level will set up the right expectation level for a valid judgment of conversational capability or intelligence. Once developmental evaluations of systems become the standard procedures, subjective judgments of systems’ intelligence will become valid. This presents an additional advantage to using the developmental model: a standard evaluation procedure will determine conversational level across various systems. Programs can then be compared as to being at
the level of toddlers, children, adolescents, or adults in terms of conversational capability. Moreover, this approach enables evaluation not only across programs but also within a given program.
Success in Other Fields Developmental principles have enabled evaluation and treatment programs in other fields formerly suffering from a lack of organizational and evaluative principles11 and have been especially useful in areas that border on the question of intelligence. Normative developmental language data have enabled the establishment of diagnostic scales, evaluation criteria, and treatment programs for developmentally delayed populations. The developmental approach has proven to be a powerful tool in other areas, such as schizophrenic thought disorder. Clinicians often found themselves unable to capture the communicative problem of patients in order to assess their intelligence level or cognitive capability, let alone to decipher medication treatment effects.
The Evaluation Process The ability to converse is complex, continuous, and incremental in nature. Thus we propose to complement our subjective impression of intelligence with objective incremental metrics that will be used both as our guidelines when we train to converse and as evaluation standards.
Objective Parameters The objective parameters consist of individual metricsii that capture specific aspects of the lingual performance and collectively provide a complete behavioral description of the child’s language development stage. Numerous researchers describe human language development using these metrics to analyze transcribed conversations between children and their caretakers. Examples of some lingual developmental metrics, which increase quantitatively with age, are vocabulary size (the number of different words spoken), mean length of utterance (the mean number of morphemes spoken per utterance), syntactic complexity (the ability to use embedding to connect sentences) and the use of pronominal and referential forms.
We use the term “metric” in its non- mathematical sense of relating to measurement.
Another Feedback Loop The training process is driven by the need to achieve certain performance milestones. These milestones dictate the kind of reinforcement that trainers give to the conversing program
1 A.M. Turing, “Computing machinery and intelligence," in Collected Works of A.M. Turing: Mechanical Intelligence, D.C. Ince, Ed., chapter 5, pp. 133 (160). Elsevier Science Publishers, 1992. 2 Stuart M. Shieber, “Lessons from a restricted Turing test," Available at the computation and Language e-print server as cmp-lg/9404002., 1994. 3 K. Hasida and Y. Den, “A synthetic evaluation of dialogue systems," in Machine Conversations, Yorick Wilks, Ed. Kluwer Academic Publishers, 1999. 4 Noam Chomsky, Syntactic Structures, Mouton, 1975. 5 B.F. Skinner, Verbal Behavior, Prentice-Hall, 1957. 6 R.E. Owens, Language Development, Macmillan Publishing Company, 1992. 7 G. Whitehurst and B. Zimmerman, “Structure and function: A comparison of two views of development of language and cognition," in The Functions of Language and Cognition, G. Whitehurst and B. Zimmerman, Eds. Academic Press, 1979. 8 Claude E. Shannon and Warren Weaver, The Mathematical theory of Communication, University of Illinois Press, 1949. 9 Paul Fletcher & Brian MacWhinney (eds.), “The Handbook of Child Language”, 1995, Blackwell Publishers: Cambridge Mass. USA. 10 http://childes.psy.cmu.edu/html/clan.html 11 A. Goren, G. Tucker, and G.M. Ginsberg, “Language dysfunction in schizophrenia”, European Journal of Disorders of Communication, vol. 31, no. 2, pp. 467 (482), 1996. 12 A. Goren, “The language deficiten in schizophrenia from a developmental perspective”, in The Israeli Association of Speech and Hearing Clinicians, 1997. 13 Goren, A., Fine, J., Manaim, H., & Apter, A.(1995). Verbal and non-verbal expressions of central deficits in schizophrenia. Journal of Nervous and Mental
No comments:
Post a Comment