
Computational Brain & Behavior (2018) 1:69–89

Do People Ask Good Questions?

Anselm Rothe1 · Brenden M. Lake1,2 · Todd M. Gureckis1

Published online: 9 July 2018 · © Springer International Publishing 2018

Abstract People ask questions in order to efficiently learn about the world. But do people ask good questions? In this work, we designed an intuitive, game-based task that allowed people to ask natural language questions to resolve their uncertainty. Question quality was measured through Bayesian ideal observer models that considered large spaces of possible game states. During free-form question generation, participants asked a creative variety of useful and goal-directed questions, yet they rarely asked the best questions as identified by the Bayesian ideal observers (Experiment 1). In subsequent experiments, participants strongly preferred the best questions when evaluating questions that they did not generate themselves (Experiments 2 and 3). On one hand, our results show that people can accurately evaluate question quality, even when the set of questions is diverse and an ideal observer analysis has large computational requirements. On the other hand, people have a limited ability to synthesize maximally informative questions from scratch, suggesting a bottleneck in the question asking process.

Keywords Question asking · Question generation · Information search · Active learning · Bayesian modeling

Anselm Rothe, anselm@nyu.edu

1 Department of Psychology, New York University, 6 Washington Place, New York, NY 10003, USA

2 Center for Data Science, New York University, 60 5th Ave, New York, NY 10011, USA

Introduction

Asking questions is a hallmark of human intelligence which enables us to flexibly learn about, navigate in, and adapt to our environment. For example, a simple question to a fellow traveler ("Where is the uptown train platform?") can save us from wandering around until we find our way. Similarly, a doctor asking a patient "Have you traveled internationally in the past 21 days?" might be able to rule out a large number of exotic diseases if the answer is "no." Questions are also important in the course of cognitive development. For example, Nelson (1973) found that most children acquire an utterance for asking questions within their first few words (e.g., "Eh?" or "Doh?" to mean "What is that?"). Such words may actually help bootstrap the process of learning language by coordinating information requests between the child and caregiver (Nelson 1973). There is little doubt, then, that asking questions is a powerful cognitive and linguistic tool, but are people actually effective at asking questions? Specifically, given all the possible questions someone could ask in a given situation, do people generally ask the best or most informative questions?

In recent decades, there has been a growing scientific interest in how children and adults ask questions. Much of the past work on this topic is largely qualitative, essentially cataloging the different types of questions people ask in different situations. The quantitative work on this topic, which has the potential to objectively assess the quality of human questions, has tended to focus on relatively simple scenarios where the range of allowed questions is limited to the features and class membership of each object (e.g., in the "Guess Who?" game, a question might be "Is your person wearing a hat?"). As a result, it is unclear whether people ask objectively good (or maximally informative) questions given their knowledge and goals in more unconstrained scenarios such as those encountered in everyday life.

In this paper, we attempt to explore this issue in a novel way using a relatively unconstrained question asking task which is nonetheless amenable to computational analysis. After reviewing the past literature on question asking and active learning, we describe the task we used to study question asking. Next, we describe alternative models of question evaluation that allow us to objectively measure the quality of the questions people asked in the experiment.


We then report empirical results from three experiments in which people either asked questions in order to resolve some ambiguous situation, or evaluated the questions generated by other people. To foreshadow, our results highlight interesting limitations in the intuitive question asking abilities of humans, which we argue result, in part, from the immense computational demands of optimal question asking.

Past Work on Question Asking

The way people ask information-seeking questions has attracted considerable attention in cognitive science in both laboratory studies and observational designs.

Qualitative Studies of Question Asking

Studies in psychology and education investigating human question asking in classroom settings have focused on rather general distinctions between good and bad questions (see Graesser et al. 1993; Graesser and Person 1994; Davey and McBride 1986; Chin and Brown 2002; Dillon 1988). For instance, Chin and Brown (2002) distinguished basic information questions from "wonderment" questions, characterized as deeper questions used for planning or making predictions. In a reading comprehension study, Davey and McBride (1986) judged participants' questions as better if the questions targeted central ideas, used a "wh" stem, and required more than a yes/no response. Building on a classification scheme of questions that children ask in the classroom (Graesser et al. 1992), Graesser and colleagues defined good questions as having a question type that fits the type of knowledge structure that the questioner wants to learn about. If the questioner was learning about a taxonomic structure of musical instruments, then "What are the types of X?" (e.g., "To which groups does a soprano clarinet belong?") constituted a good question, while "How can you create X?" (e.g., "How do you build a soprano clarinet?") was a bad one because it targeted an answer unlikely to be informative about the taxonomic structure (Graesser et al. 1993). In subsequent work, Graesser and colleagues defined some of their categories as "deep" (e.g., why, why not, how, what-if, what-if-not) in contrast with "shallow" questions (e.g., who, what, when, where) and found that the proportion of deep questions asked in a class correlated with students' exam scores (see Graesser and Person 1994).

Although these studies offer interesting insight into the causes and consequences of question asking, they leave unresolved whether any particular question is informative from the perspective of an individual learner. For example, even "deep" questions can be uninformative if the learner already knows the answer. This is largely because, in observational classroom studies, it is difficult to measure and control the amount of knowledge that different learners have as well as to account for differing individual goals during learning. In this paper, we focus on the quality of questions from an individual's perspective by controlling the background and prior knowledge that participants had about our experimental task as well as their goals.

Quantitative Studies of Question Asking

Although the above studies often focus on question asking in natural settings such as a classroom, quantitative studies often rely more heavily on laboratory tasks. Such tasks typically create a scenario (such as a game) where a learner must ask questions or perform certain actions in order to acquire information (e.g., Coenen et al. 2015; Markant and Gureckis 2014a, b; Meder and Nelson 2012; Nelson et al. 2014; Ruggeri and Feufel 2015; Ruggeri et al. 2016).1 The key concern in this work is whether children and adults ask the "best" or most informative question as measured by computational models.

In order to apply computational models to this data, often these experiments are simplified relative to real-life inquiry. One view of the purpose of question asking is to resolve between alternative world states. For instance, when you ask a question like "Is this the uptown train platform?" you are attempting to resolve between two hypothetical worlds, one where you are standing on the uptown train platform and one where you are not. The answer to the question helps to resolve that ambiguity, enabling you to act accordingly. Good questions are those that, from the perspective of the individual, rule out alternative world states.

In most laboratory tasks that try to mimic these real-life situations, the space of possible hypotheses (or alternative world states) that the learner is trying to discriminate is relatively curtailed, ranging from around 20 hypotheses (Ruggeri and Feufel 2015; Ruggeri et al. 2016; Nelson et al. 2014) to as few as two (Meder and Nelson 2012; Coenen et al. 2015; Markant and Gureckis 2014a). Similarly, many tasks allow only yes/no questions, which constrains the size of the set of answers (e.g., Ruggeri et al. 2016 only allowed yes/no questions but provided "some" as an additional, third answer; Coenen et al. 2015 had as many as four possible nodes or components that could be intervened on one at a time). Finally, a common strategy in the laboratory literature has been to effectively provide people with a predetermined list of questions, allowing them to select the best one (e.g., Nelson et al. 2014; Coenen et al. 2015; Meder and Nelson 2012). Although this approach simplifies data analysis considerably, it relieves learners from the burden of generating interesting and informative questions from scratch. As we highlight in the experiments below, the distinction between question generation and evaluation is a psychologically significant part of the task of asking questions in more complex settings and everyday life.

1 In some studies, participants performed information-seeking actions, such as clicking on a certain part of an object, to obtain information about the object, which is, for our purposes, equivalent to asking information-seeking questions. For instance, participants could click on either the eye or claw of a plankton creature presented on a computer screen, to reveal the eye/claw color and then categorize the plankton based on that information (Meder and Nelson 2012), which is equivalent to asking "What is the color of the eye/claw?" Similarly, in Coenen et al. (2015), participants could click on one of three nodes in a causal network and subsequently observe which of the other nodes would turn on, which is equivalent to asking "Which nodes will turn on when I activate this node?"

One notable exception to this trend is the work by Ruggeri and colleagues, who formally analyzed the question quality of relatively unconstrained yes/no questions that adults and children generated in order to identify a target category (Ruggeri et al. 2016). They reported relatively high performance by adults, who on average asked a first question that was very close to optimal. Our experiments similarly examine open-ended question asking performance, but with a much broader range of questions and a more complex task.

In summary, past work has tended to fall into either observational studies of real-world question asking, where the issue of question quality is addressed relatively qualitatively, or carefully controlled laboratory settings, which are more simplified but allow precise measurement of the information value of different queries. The goal of the present study is to combine elements of both traditions by studying question asking in a rich and unconstrained setting that is nonetheless amenable to formal mathematical modeling.

Studying Question Asking in the Battleship Game

In light of the issues laid out above, we identified a few key features for studying question asking in an experimental setting. First, we wanted to provide participants with ambiguous situations, in which they can ask a variety of questions with the goal of resolving the ambiguity. Second, we wanted participants to share the same understanding of what that ambiguity is. Thus, the situations should be defined by instructions that are easy for people to understand (e.g., as part of an intuitive task). Third, we wanted situations that are amenable to formal modeling, that is, constrained enough such that all possible ways to resolve the ambiguous situation can be represented in a mathematical model.

These features are ideally captured by an active learning task that is called the Battleship game due to its similarity to a single-player version of the popular children's game (Gureckis and Markant 2009; Markant and Gureckis 2012, 2014b). The goal of the game is to determine the location and size of three non-overlapping rectangular "ships" on a 6 × 6 grid (see Fig. 1). The ships have a width of 1 tile, are 2 to 4 tiles long, and are horizontally or vertically oriented.

Fig. 1 Battleship game boards as viewed by participants. Sampling phase: A participant sequentially clicks on tiles to turn them over. The revealed color indicates a ship (blue, red, or purple) or water (dark gray). Painting phase: At a certain point, the sampling phase is stopped and the participant guesses the color of the remaining tiles. For each correctly painted tile, one point is awarded


In past work that has used this information search task, a participant sequentially clicked on tiles to uncover either the color of the underlying ship part or an empty water part (sampling phase, Fig. 1). An efficient active learner seeks out tiles that are expected to reduce uncertainty about the ship locations and avoids tiles that would provide redundant information (e.g., when the hidden color can be inferred from the already revealed tiles). At a certain point, the sampling was stopped and participants were asked to fill in the remaining tiles with the appropriate color, based on their best guess (painting phase, Fig. 1). The score they received was a decreasing function of the number of observations made in the sampling phase and an increasing function of the number of correctly painted tiles.

The task is well suited for the present study because the underlying hypothesis space (i.e., possible ship configurations that define a possible gameboard) is relatively large (1.6 million possible game boards) but is easy to explain to participants prior to the start of the task. In addition, the game is interesting and fun for participants while being amenable to an ideal observer analysis (see below). In previous work using this task, the only means a participant has for acquiring information is by sampling or turning over individual tiles (Settles 2009; Markant and Gureckis 2012, 2014b). In this paper, we allow participants to ask any question they want in natural language (e.g., "Are the ships touching?" or "What is the total area of the ships?"). This modification allows participants to use much more powerful tools to gain information (e.g., general purpose question asking) and allows us to study rich, natural language question asking in the context of a well-understood active learning task. Importantly, the design implied that people were conversing with an English-speaking oracle who knew the true hidden gameboard and would always answer honestly, similar to the assumption that would apply to clicking a tile to uncover it.
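To make the size of this hypothesis space concrete, the brute-force sketch below enumerates non-overlapping placements of three distinguishable ships (lengths 2 to 4, width 1, horizontal or vertical) on a 6 × 6 grid. This is our own illustration rather than code from the paper; the exact total depends on the game details summarized above, but it should land in the neighborhood of the roughly 1.6 million configurations reported.

```python
from itertools import product

GRID = 6                      # 6x6 board
SHIP_LENGTHS = range(2, 5)    # ships are 2 to 4 tiles long

def placements(length):
    """All placements of a 1-tile-wide ship of a given length, as frozensets of (row, col)."""
    spots = []
    for r, c in product(range(GRID), repeat=2):
        if c + length <= GRID:                                    # horizontal placement
            spots.append(frozenset((r, c + k) for k in range(length)))
        if r + length <= GRID:                                    # vertical placement
            spots.append(frozenset((r + k, c) for k in range(length)))
    return spots

# Candidate placements for a single ship of any allowed length (ships are
# distinguishable by color, so we count ordered triples).
per_ship = [p for length in SHIP_LENGTHS for p in placements(length)]

# Brute force over ~3 million ordered triples; keep only non-overlapping ones.
count = sum(
    1
    for a, b, c in product(per_ship, repeat=3)
    if not (a & b) and not (a & c) and not (b & c)
)
print(count)  # on the order of the ~1.6 million configurations cited in the paper
```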

The Battleship domain has also proved useful to study how a machine can generate informative, human-like questions from scratch (Rothe et al. 2017).

Defining the Computational Problem of Asking Good Questions

A computational-level analysis describes, in a formal way, the goals and constraints of computations relevant for the cognitive system to solve a task (Marr 1982; Anderson 1990). Let us begin with the very general notion that any question we might ask has a certain utility with respect to our goals. This idea can be formalized as the expected utility of question x,

EU(x) = \mathbb{E}_{d \in A_x}[U(d; x)]    (1)

where d is an answer from the set of possible answers A_x to question x, and U(d; x) is the utility of that answer for that question. An expectation is required here because, at the time of asking, we do not yet know what the answer to the question is going to be. Under this framework, the task of asking the best question x* can then be cast as a search over the set Q of all possible questions one could ask,

x^* = \arg\max_{x \in Q} EU(x).    (2)
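As a minimal sketch of Eqs. 1 and 2, the snippet below treats a question as a distribution over its possible answers and scores it by a probability-weighted average utility. The helper names (expected_utility, best_question, answer_model, utility_model) are hypothetical and not part of the paper's framework.

```python
from typing import Callable, Dict, Hashable, List

Answer = Hashable

def expected_utility(
    answer_probs: Dict[Answer, float],         # p(d) for each possible answer d in A_x
    utility: Callable[[Answer], float],        # U(d; x), the utility of receiving answer d
) -> float:
    """Eq. 1: EU(x) = E_{d in A_x}[U(d; x)], a probability-weighted average over answers."""
    return sum(p * utility(d) for d, p in answer_probs.items())

def best_question(
    questions: List[str],
    answer_model: Callable[[str], Dict[Answer, float]],   # maps question x to p(d) over its answers
    utility_model: Callable[[str, Answer], float],        # maps (x, d) to U(d; x)
) -> str:
    """Eq. 2: x* = argmax_{x in Q} EU(x)."""
    return max(
        questions,
        key=lambda x: expected_utility(answer_model(x), lambda d: utility_model(x, d)),
    )
```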

The utility of a question U(d; x) can include a range of factors depending on the agent's goals and situation. For example, question asking is often a social activity involving a listener and a speaker, and assumptions about each, along with the limited bandwidth of spoken communication, may enter into the calculation of utility. For instance, research has shown that a learner may consider how difficult a question is to answer or how much information is conveyed to the answerer about one's intentions (e.g., Clark 1979; Hawkins et al. 2015). Any such social and pragmatic considerations can be included as factors in the computation of a question's utility U(d; x).

Another issue concerns how much the utility of a question is influenced by the costs and rewards of the current task. For example, Chater, Crocker, and Pickering distinguish between cost-insensitive (or "disinterested") utilities and cost-sensitive (or "interested") utilities (Chater et al. 1998; Markant and Gureckis 2012). A cost-insensitive, information-maximizing utility values knowledge for its own sake, without reference to task-specific rewards, as reflected, for example, in the spirit of basic research or genuine curiosity. For instance, attempting to reduce one's uncertainty about the position of the ships in the Battleship game as much as possible follows this strategy and is captured by the expected information gain (EIG) model (see below).

In contrast, a cost-sensitive, utility-maximizing strategy values information only to the degree that it will lead to later rewards or avoid costs, making it a more economically oriented strategy. For example, students who want to minimize their study time and decide to ignore information that is unlikely to be tested in an exam engage in such a strategy, where the cost structure is given by the time spent studying as well as points lost in the exam. In the Battleship game, one utility-maximizing strategy is captured by the expected savings (ES) model (see below), which evaluates information with respect to the errors that can be avoided in the painting task.

In the present paper, we compare these two ways of assigning utility to a question (ES versus EIG). The two models provide alternative "yardsticks" for objectively evaluating the quality of people's questions with respect to the goals of the task. To give an intuitive example, EIG assigns a high value to a question such as "How many tiles are occupied by ships?" because every answer allows the learner to rule out many hypothesized ship configurations that are inconsistent with the obtained answer. On the other hand, ES assigns a low value because such abstract information about the number of ship tiles often does not help much with the painting task.

The EIG versus ES contrast is also interesting because past work found that people's strategies were more in line with the cost-insensitive EIG than the cost-sensitive ES model (Markant and Gureckis 2012). Before defining these models formally, we introduce a Bayesian ideal observer of our task, which forms the foundation of both measures of utility for question asking.

Bayesian Ideal Observer Analysis of the Battleship Game

In a given Battleship game context, a player aims to identify a hidden configuration corresponding to a single hypothesis h in the space of possible configurations H. We model her prior belief distribution over the hypothesis space, p(h), as uniform over ship sizes: the prior is specified by first sampling the size of each ship from a uniform distribution and then sampling a configuration uniformly from the space of possible configurations given those sizes. The player can make a query x (uncovering a tile or asking a natural language question) and receives the response d (the answer). The player can then update her posterior probability distribution over the hypothesis space by applying Bayes' rule,

p(h|d; x) = \frac{p(d|h; x)\, p(h)}{\sum_{h' \in H} p(d|h'; x)\, p(h')}.    (3)

The semi-colon notation indicates that x is a parameter rather than a random variable. The posterior p(h|d; x) becomes the next step's prior p(h|D; X), with X representing all past queries and D representing all past responses,

p(h|d, D; x, X) = \frac{p(d|h; x)\, p(h|D; X)}{\sum_{h' \in H} p(d|h'; x)\, p(h'|D; X)}.    (4)

The likelihood function p(d|h; x) models the oracle that provides answer d. The likelihood is zero if d is not a valid response to the question x, and 1/n otherwise, where n is the number of correct answers that the oracle chooses from uniformly. For most questions that we collected, there was a single correct answer, n = 1. But, for example, when asking for the coordinates of any one of the tiles that contain a blue ship, n is defined by the number of blue ship tiles in the true configuration. The posterior predictive value of a new query x resulting in the answer d can be computed as

p(d|D; x, X) = \sum_{h \in H} p(d|h; x)\, p(h|D; X).    (5)
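These belief-updating equations can be implemented directly once the hypothesis space is represented as an indexed array. The sketch below is our own minimal NumPy version (the array names and function signatures are assumptions, not the paper's code): a posterior update for an observed answer (Eqs. 3 and 4) and the posterior predictive over answers (Eq. 5).

```python
import numpy as np

def update_posterior(prior, likelihoods):
    """Eqs. 3/4: posterior over hypotheses after observing the answer to one question.

    prior       : array of shape (H,), the current belief p(h | D; X)
    likelihoods : array of shape (H,), p(d | h; x) for the observed answer d
                  (0 if d is invalid under h, 1/n if d is one of n correct answers)
    """
    unnorm = likelihoods * prior
    total = unnorm.sum()
    if total == 0:
        raise ValueError("Observed answer is impossible under every hypothesis.")
    return unnorm / total

def predictive(prior, likelihoods_per_answer):
    """Eq. 5: p(d | D; x, X) for each possible answer d of a question.

    likelihoods_per_answer : array of shape (num_answers, H), one row of p(d | h; x) per answer
    """
    return likelihoods_per_answer @ prior
```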

Cost-Insensitive Utilities: Expected Information Gain

A primary goal of most question asking is to gain information about the world relevant for the learner's goal. In this case, the utility of a question and its answer is entirely or at least heavily determined by their expected informativeness. More precisely, we can define the utility of a question and its answer as the amount of gained information,

U(d; x) = I[p(h|D; X)] - I[p(h|d, D; x, X)],    (6)

where I[·] denotes the Shannon entropy (uncertainty; Shannon 1948) of the previous belief distribution p(h|D; X) (i.e., before receiving an answer d to the new query x, but with prior knowledge D and X; Eq. 4) and of the posterior belief distribution p(h|d, D; x, X) (i.e., after receiving an answer d to question x) over the hypothesis space.

For illustration, consider the example of asking a friend about the location of your car keys. The hypothesis space, H , spans the space of possible locations you can consider (e.g., locations in your apartment). Assuming there have been no previous questions X and answers D, p(h) represents your subjective belief in these possible locations (e.g., twice as likely to be in the office than the kitchen), and the entropy, I [p(h)], indicates how uncertain you are about the key location (this scalar measure will be zero when you know its location and will be high if you assign equal belief to each possibility). Likewise, I [p(h|d; x)] is the uncertainty about the key location after the answer is revealed. Some answers d can be far more informative than others. For example, imagine asking your friend, "Where are my car keys?". The answer "Somewhere in your apartment" is intuitively less informative than the more precise answer "On your desk, to the left of your laptop."

Of course, when asking a question such as "Where are my car keys?", we do not yet know which answer we will receive, but under a computational analysis, we can simulate how our uncertainty would hypothetically be reduced for each possible answer. Combining Eqs. 1 and 6, we define the expected utility of question x as the expected information gain (EIG),

EU(x) := EIG(x) = \mathbb{E}_{d \in A_x}\left[ I[p(h|D; X)] - I[p(h|d, D; x, X)] \right]
= \sum_{d \in A_x} p(d|D; x, X) \left( I[p(h|D; X)] - I[p(h|d, D; x, X)] \right),    (7)

or the average amount of information we can expect from each of the possible answers to the question (e.g., Oaksford and Chater 1994; Coenen et al. in press). EIG is a commonly used metric in machine learning approaches to active learning (Settles 2012) and has a long history of study as a model of human information gathering (Oaksford and Chater 1994). The value of a query is measured by EIG in units of bits.

Assuming the learner is motivated to quickly gain information, it would make sense for people to optimize information with their questions. That is, out of the possibly infinite set of questions, Q, we want to find the optimal question, x* ∈ Q, that maximizes the expected information (see Eq. 2).
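A minimal sketch of Eq. 7, assuming the same array-based representation as above (our own illustration, not the paper's implementation): the prior over hypotheses and one likelihood row per candidate answer are enough to compute a question's EIG in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution (zero entries contribute nothing)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def expected_information_gain(prior, likelihoods_per_answer):
    """Eq. 7: EIG(x) = sum_d p(d | D; x, X) * (I[prior] - I[posterior after d]).

    prior                  : shape (H,), current belief over hypotheses
    likelihoods_per_answer : shape (num_answers, H), p(d | h; x) for each answer d
    """
    prior_entropy = entropy(prior)
    answer_probs = likelihoods_per_answer @ prior          # Eq. 5
    eig = 0.0
    for p_d, lik in zip(answer_probs, likelihoods_per_answer):
        if p_d == 0:
            continue
        posterior = lik * prior / p_d                      # one-step Bayesian update (Eq. 4)
        eig += p_d * (prior_entropy - entropy(posterior))
    return eig
```

For a question whose answers deterministically partition the hypothesis space, this quantity reduces to the entropy of the answer distribution itself.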

Cost-Sensitive Utilities: Expected Savings

According to ES, a query x is valued according to the rewards the learner expects to accrue, or the costs they expect to avoid, after learning the answer to the question. For example, some questions might have high utility not just because they convey information but because they allow the agent to act effectively in the world (e.g., "Is this mushroom poisonous to eat?"). Cost-sensitive utilities are highly task-dependent in the sense that they depend on what the question asker plans to do with the acquired information (asking about the safety of a poisonous mushroom is not useful if you never planned to eat it, in which case the same question is essentially trivia). As a result, these utilities are defined differently for almost all tasks and goals the learner might have (unlike cost-insensitive utilities, which only depend on what the learner knows).

In the case of the Battleship game, the primary goal might be reducing errors in the painting task (Fig. 1), leading to the following utility function,

U(d; x) = EC[p(h|D; X)] - EC[p(h|d, D; x, X)].    (8)

The function EC[p(h|v)] is used to denote the expected cost when coloring tiles in the painting task according to a particular belief distribution p(h|v), using v as shorthand for past questions and responses. Expected cost is defined as

EC[p(h|v)] = \sum_{i,l} p(l|v; i) \left[ C_{hit}\, p(l|v; i) + C_{miss}\,(1 - p(l|v; i)) \right],    (9)

where the belief that tile i has color l is given by p(l|v; i) = \sum_{h \in H} p(l|h; i)\, p(h|v). The probability of actually painting the tile in that color is again given by p(l|v; i), because we assume that during the painting phase participants use a probability-matching decision strategy to choose the color of each tile. C_{hit} = 0 and C_{miss} = 1 indicate the costs associated with painting a tile correctly or incorrectly, respectively.

As with EIG, the question-asking agent does not know the answer d to question x in advance. We define the expected savings (ES) as the expected reduction of errors in the painting task, averaged across all possible answers A_x of the query,

EU(x) := ES(x) = \sum_{d \in A_x} p(d|D; x, X) \left( EC[p(h|D; X)] - EC[p(h|d, D; x, X)] \right).    (10)

Thus, in the Battleship game, ES measures a query's value in units of the expected (average) number of correctly painted tiles. As above, we want to find the optimal question, x* ∈ Q, that maximizes the expected savings. Note that because of the conjunctive consideration of hypotheses and actions (painting tiles, Eq. 9), ES cannot be recast as a mere weighted variant of EIG, which only concerns hypotheses.
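Under the same assumed array representation, here is a sketch of Eqs. 9 and 10 (again our own illustration, with hypothetical tensor names): each hypothesis maps tiles to colors, the belief induces per-tile color probabilities, and expected savings compares the expected painting cost before and after each possible answer.

```python
import numpy as np

C_HIT, C_MISS = 0.0, 1.0   # costs for painting a tile correctly / incorrectly (as in the paper)

def expected_cost(belief, color_given_hyp):
    """Eq. 9: expected painting cost under probability matching.

    belief          : shape (H,), belief over hypotheses p(h | v)
    color_given_hyp : shape (H, tiles, colors), one-hot p(l | h; i)
    Returns the expected number of incorrectly painted tiles.
    """
    p_color = np.einsum('h,htl->tl', belief, color_given_hyp)      # p(l | v; i)
    return float((p_color * (C_HIT * p_color + C_MISS * (1 - p_color))).sum())

def expected_savings(prior, likelihoods_per_answer, color_given_hyp):
    """Eq. 10: ES(x) = sum_d p(d) * (EC[prior] - EC[posterior after d])."""
    ec_prior = expected_cost(prior, color_given_hyp)
    answer_probs = likelihoods_per_answer @ prior
    es = 0.0
    for p_d, lik in zip(answer_probs, likelihoods_per_answer):
        if p_d == 0:
            continue
        posterior = lik * prior / p_d
        es += p_d * (ec_prior - expected_cost(posterior, color_given_hyp))
    return es
```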

Example Calculation and the Computational Challenge of Question Asking

To make the equations just described more concrete, consider the computations involved in finding your lost keys by asking questions according to EIG. If, based on your prior knowledge p(h), the kitchen is an unlikely place for the keys, then "Are my car keys in the kitchen?" is an intuitively uninformative question. According to the analysis above and Eq. 7, we would evaluate, for each of the two possible answers to this question, A_x = {"yes", "no"}, how much our uncertainty would be reduced. If the answer is d = "no" (which is very likely the answer), then our uncertainty about the location does not change much relative to what we believed already. If the answer is d = "yes," then we would rule out most of the other locations in our apartment, accordingly update our belief p(h|d; x), and we would see that our uncertainty I[p(h|d; x)] is much smaller than before. Overall, since d = "yes" is unlikely given that the kitchen is an unlikely place for the keys, the expected information gain of the question is low, as captured by the weighted expectation \mathbb{E}_{d \in A_x}.
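As a toy numerical version of this example (the numbers are ours: ten candidate locations, with the kitchen given a prior probability of 0.02), the expected information gain of the yes/no kitchen question can be computed directly:

```python
import numpy as np

# Toy hypothesis space: 10 possible key locations; location 0 is the kitchen
# with prior probability 0.02, the rest share the remaining 0.98 uniformly.
prior = np.array([0.02] + [0.98 / 9] * 9)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# "Are my car keys in the kitchen?" partitions the hypotheses into yes/no.
p_yes, p_no = prior[0], prior[1:].sum()
post_yes = np.array([1.0] + [0.0] * 9)                 # answer "yes": kitchen only
post_no = np.concatenate(([0.0], prior[1:] / p_no))    # answer "no": renormalize the rest

eig = entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))
print(round(eig, 3))  # about 0.14 bits, versus roughly 3.2 bits of total prior uncertainty
```

Because the answers partition the hypotheses, the result equals the entropy of the answer distribution, H(0.02, 0.98) ≈ 0.14 bits, which is why asking about an unlikely location is a poor question under EIG.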

Although it might seem straightforward to simply ask "Where are my car keys?" given the goal of finding them, there are a myriad of questions one could ask instead (e.g., "What room did you go to after you parked my car?," "Where do you usually put the keys?," "Are the keys in the living room?," etc.). Each of these has a slightly different semantic meaning and provides additional information or leaves additional ambiguity depending on our goal.

This computational analysis of question asking provides one way to formalize the notion of a "good question." However, it raises interesting computational challenges. For example, the computations under this information-theoretic model grow with the number of hypotheses (computing I[p(h)] and I[p(h|d; x)]), the number of possible answers (computing \mathbb{E}_{d \in A_x}), and the size of the question space
