Some statistical methods for evaluating information extraction systems

Will Lowe
Computer Science Department
Bath University
wlowe@latte.harvard.edu

Gary King
Center for Basic Research in the Social Sciences
Harvard University
king@harvard.edu

Abstract

We present new statistical methods for evaluating information extraction systems. The methods were developed to evaluate a system used by political scientists to extract event information from news leads about international politics. The nature of this data presents two problems for evaluators: 1) the frequency distribution of event types in international event data is strongly skewed, so a random sample of newsleads will typically fail to contain any low frequency events; 2) the manual information extraction necessary to create evaluation sets is costly, and most effort is wasted coding high frequency categories.

We present an evaluation scheme that overcomes these problems with considerably less manual effort than traditional methods, and also allows us to interpret an information extraction system as an estimator (in the statistical sense) and to estimate its bias.

1 Introduction

This paper introduces a statistical approach we developed to evaluate information extraction systems used to study international relations. Event extraction is a form of categorization, but the highly skewed frequency profile of international event categories in real data generates severe problems for evaluators. We discuss these problems in section 3, show how to circumvent them using a novel sampling scheme in section 4, and briefly describe our application. Finally we discuss the advantages and disadvantages of the methods, and their relations to standard evaluation procedure. We start with a brief review of information extraction in international relations.

2 Event Analysis in International Relations

Researchers in quantitative international relations have been performing manual information extraction since the mid-1970s (McClelland, 1978; Azar, 1982). The information extracted has remained fairly simple; a researcher fills a ‘who did what to whom’ template, usually from historical documents, using a list of countries and international organizations to describe the actors, and a more or less articulated ontology of international events to describe what occurred (McClelland, 1978). In the early 1990s automated information extraction tools mostly replaced manual coding efforts (Schrodt et al., 1994). Information extraction systems in international relations perform a similar task to those competing in early Message Understanding Competitions (Sundheim, 1991, 1992). With machine extracted events data it is now possible to do near real-time conflict forecasting with data based on newswire leads, and detailed political analysis afterwards.

3 Event Category Distributions

We wanted to evaluate an information extraction system from Virtual Research Associates [1]. This system bundles extraction and visualization software with a custom event ontology containing, at last count, about 200 categories of international event.

[1] http://www.vranet.com

We found two problems with the nature of international events data. First, the frequency distribution over the system’s ontology, or indeed several other ontologies we considered, is heavily skewed. A handful of mostly diplomatic event types predominate, and the frequency of other event types falls off very sharply: we ran the system over all the newsleads in Reuters’ coverage of the Bosnia conflict, and of the approximately 45,000 events it extracted, 10,605 were in the category of ‘neutral comment’, 4 of ‘apology’ and 35 of ‘threat of force’. Thus the ratio between the frequencies of two event categories in this data can be roughly 2,500 to 1.

Also, as these figures suggest, the more interesting and politically relevant events tend to be of low frequency. This problem is quite general in categorization systems with reasonably articulated category systems, and not specific to international relations. But any dataset with these properties causes an immediate problem for evaluation.

Ideally we would choose a random subset of leads whose events are known with certainty (because we have coded them manually beforehand), run the system over them, and then compute various sample statistics such as precision and recall [2]. However, a small randomly chosen subset is very unlikely to contain instances of most interesting events, and so the system’s performance will not be evaluated on them. Given the possible frequency ratios above, the size of subset necessary to ensure reasonable coverage of lower frequency event categories is enormous. Put more concretely, to construct a test set of news leads the evaluator will on average have to code around 2,500 comments to reach a single apology and about 300 comments to find a single threat of force.

[2] This paper only evaluates extraction performance on event types, though there would seem to be no reason why a similar approach would not work for actors etc.

3.1 Standard Evaluation Methods

The standard evaluation methods developed over the course of the Message Understanding Competitions consist mainly of sample statistics to compute over the evaluation materials, e.g. precision and recall, but do not give any guidance for choosing the materials themselves (Cowie and Lehnert, 1996; Grishman, 1997). This is just done by hand by the judges. Perhaps because the selection question is neglected, it is seldom clear what larger population the test materials are from (save that it is the same one as the training examples), and as a consequence it is unclear what the implications for generalization are when a system obtains a particular set of scores for precision and recall (Lehnert and Sundheim, 1991).

Since this literature did not help us generate a suitable evaluation sample, we approached the problem from scratch, and developed a statistical framework specific to our needs.

4 Method

One reasonable-sounding but wrong way to address the problem of creating a test set without having to code tens of thousands of irrelevant stories is the following:

1. Use the extraction system itself to perform an initial coding,

2. Take a sample of the output that covers all the event types in reasonable quantities,

3. Examine each coding to see whether the system assigned the correct event code.

This looks like it can guarantee a good sample of low frequency events at much lower cost to the manual coder; we can just pick a fixed number of events from each category and evaluate them. However, this method exhibits selection bias. To see this, let M and T be variables indicating, respectively, which event category the Machine (that is, the information extraction system) codes an event into, and the True category to which the event actually belongs. Statistically, the quantity of interest to us is the probability that the machine is correct:

$$ P(M = i \mid T = i) \tag{1} $$
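
To make the quantity in (1) concrete, the following minimal sketch estimates it for a small set of events that have been coded both by hand and by the machine. The category names, the counts, and the helper name `p_machine_given_true` are invented for illustration; none of them come from the system or data described here.

```python
from collections import Counter

# Hypothetical (true_category, machine_category) pairs for hand-coded events.
# "none" stands for the case where the machine assigns no category at all.
coded_events = (
    [("comment", "comment")] * 8
    + [("comment", "none")] * 2
    + [("threat", "threat")] * 3
    + [("threat", "comment")] * 1
)


def p_machine_given_true(pairs, category):
    """Estimate P(M = category | T = category) from (true, machine) pairs."""
    counts = Counter(pairs)
    # Number of events whose true category is `category`, whatever the machine said.
    n_true = sum(n for (true_cat, _machine_cat), n in counts.items()
                 if true_cat == category)
    if n_true == 0:
        raise ValueError(f"no events with true category {category!r}")
    # Share of those events that the machine also placed in `category`.
    return counts[(category, category)] / n_true


for cat in ("comment", "threat"):
    print(cat, p_machine_given_true(coded_events, cat))
```

The denominator counts events by their true category, which is exactly what a sample drawn from the machine's own output does not provide.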

Expression (1) is the probability that the machine classifies an event into category i given that the true event coding is indeed i. A full characterization of the success of the machine requires knowing P(M = i | T = i) for i = 0,...,J, which includes all J event categories and where i = 0 denotes the situation where the machine is unable to classify an event into any category. In short, the quantity of interest is the full probability density P(M | T).

In statistical terms, this distribution is a likelihood function for the information extraction system. This observation allows us to treat the system like any other statistical estimator and offers the interesting possibility of analyzing generalization via its sampling properties, e.g. its bias, variance, mean squared error, or risk.

Unfortunately, the problem with the reasonable-sounding approach described above is that it does not in fact allow us to estimate P(M | T) because it is implicitly conditioning on M, not T. In particular, the proportion of events that are actually in category i among those the machine put in category i gives us instead an estimate of

$$ P(T \mid M) \tag{2} $$

which is not the quantity of interest. (2) is the probability of the truth being in some event category rather than the machine’s response, whereas in fact the true event category is fixed and it is the machine’s response that is uncertain [3]. Worse, P(T | M) is a systematically biased estimate of P(M | T) because these two quantities are related by Bayes theorem:

$$ P(M \mid T) = \frac{P(M, T)}{P(T)} = \frac{P(T \mid M)\, P(M)}{P(T)}, \tag{3} $$

and the only circumstance under which they would be equal is when P(M) is uniform. But the figures in section 3 suggest that P(M) is highly skewed.

[3] This is due to changes in the journalist’s choice of vocabulary and syntactic construction that are uncorrelated with the identity of the event being described.

However, this last observation suggests a better method for unbiased estimation of (1).

1. Estimate P(T | M) as described above

2. Compute P(M) by running the system over the entire data set and normalizing the frequency histogram of event categories

3. Estimate P(M | T) by correcting P(T | M) with P(M) using Bayes theorem

Our implementation of this scheme was to first run the system over 45,000 leads about the Bosnia conflict, and normalize the frequency histogram of events extracted to create P(M). Then, randomly choose 5 leads assigned to each event category, and manually determine which event type they instantiate. Then normalize to estimate P(T | M). And finally, use (3) to create P(M | T). In addition, we chose four times as many uncategorized leads as from each true category. A larger sample here is advisable to see what sort of categories the system misses. These sample sizes are fixed, but it may also be possible to use active learning techniques to tune them (as in e.g. Argamon-Engelson and Dagan, 1999) for even more efficient sampling.

The advantage of this roundabout route to (1) is that it requires many fewer events to be manually coded. We ran the system over 45,000 leads but only manually coded a handful of events for each category. This guaranteed us even coverage of the lowest frequency event categories whilst not biasing the end result. For an ontology with about 200 categories this is a substantial decrease in evaluator effort.

This method works by making use of the extraction system itself to produce one important marginal: P(M). If we assume that the aim is to evaluate the system on the Bosnia conflict, P(M) is not estimated, but is rather an exact population marginal [4]. Then we can guarantee that our estimate of P(M | T) is unbiased because the method for estimating P(T | M) is clearly unbiased, and P(M) adds no error.

[4] We might consider the Bosnian conflict to be a sample point from the larger population of all wars, but that population, if it exists at all, is certainly difficult to quantify.

4.1 Summary Measures

P(M | T) allows the computation of a number of useful summary measures [5]. For example, we can easily compute P(M, T) from quantities already available, so $\sum_{i}^{J} P(M = i, T = i)$ is the proportion of time the system extracts the correct category. Alternatively, if it is more important to extract some categories than others, then various weighted measures can be constructed, e.g. $\sum_{i}^{J} P(M = i \mid T = i)\, w_i$ where the w_i are non-negative and sum to 1, representing the relative importance of extracting each category. Some more graphical methods of evaluation using P(M | T) are presented below.

[5] Detailed discussion of several summary measures for the system we evaluated can be found in King and Lowe (2002).
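
The three-step scheme of section 4 and the two summary measures above can be sketched numerically as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the three categories (index 0 standing for "unclassified"), the machine marginal `p_m`, the hand-coding tallies `hand_counts`, and the weights `w` are all invented for illustration. It relies on the fact that P(T) can be obtained by summing P(T | M) P(M) over the machine's categories.

```python
# Index 0 = unclassified, 1 = a frequent category, 2 = a rare category.
# All numbers below are hypothetical.

# Step 2: P(M), the normalized histogram of machine codes over the whole corpus.
p_m = [0.30, 0.69, 0.01]
K = len(p_m)

# Step 1: hand-code a fixed number of leads per machine category.
# hand_counts[m][t] = sampled leads the machine coded m whose true code is t.
hand_counts = [
    [16, 3, 1],  # 20 leads the machine left unclassified
    [0, 5, 0],   # 5 leads the machine put in category 1
    [0, 1, 4],   # 5 leads the machine put in category 2
]
p_t_given_m = [[c / sum(row) for c in row] for row in hand_counts]

# Step 3: Bayes theorem, equation (3).
# joint[m][t] = P(M = m, T = t) = P(T = t | M = m) P(M = m)
joint = [[p_t_given_m[m][t] * p_m[m] for t in range(K)] for m in range(K)]
# P(T = t) is the sum of the joint over machine categories.
p_t = [sum(joint[m][t] for m in range(K)) for t in range(K)]
# p_m_given_t[m][t] = P(M = m | T = t)
p_m_given_t = [[joint[m][t] / p_t[t] for t in range(K)] for m in range(K)]

# Summary measure: proportion of time the machine's code equals the true code.
accuracy = sum(joint[i][i] for i in range(K))

# Weighted measure: hypothetical importance weights, non-negative, summing to 1.
w = [0.0, 0.5, 0.5]
weighted = sum(w[i] * p_m_given_t[i][i] for i in range(K))

for i in range(K):
    print(i, round(p_t_given_m[i][i], 3), round(p_m_given_t[i][i], 3))
print(round(accuracy, 3), round(weighted, 3))
```

In this toy example the machine's rare-category codes are right 80% of the time (P(T = 2 | M = 2) = 0.8), yet only about 35% of true rare events are recovered (P(M = 2 | T = 2) ≈ 0.35), because most of them sit in the much larger unclassified pool. This is the gap between (2) and (1) that the Bayes correction exposes.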

4.2 Estimator Properties

Given a likelihood function for the extraction system we can investigate its properties as an estimator. It is particularly useful to know the bias of an estimator, defined in this case as the difference between the expected category response from the system when the true event category is i, and i itself, where the expectation is taken over repeated information extraction tasks that instantiate the same event categories. We do not examine the corresponding variance here, and a more complete evaluation might also address the question of consistency.

4.2.1 Conflict and Cooperation

The machine’s response and the true category are each best seen as a set of multinomial probabilities (a unit vector with the value 1 at the index of the system’s extracted category or the true category respectively). Estimator properties are cumbersome to represent in this format, so here we map the system’s response to a single real value corresponding to the level of conflict or cooperation of the event category. This re-representation is usual in international relations and allows standard econometric time series methods to be applied (Schrodt and Gerner, 1994; Goldstein and Freeman, 1990; Goldstein and Pevehouse, 1997). For our purposes it also allows the straightforward graphical presentation of the main ideas. We define the level of conflict or cooperation of an event category i as G_i, a real number between -10 (most conflictual) and 10 (most cooperative) (see Goldstein, 1992, for the full mapping). For example, according to this scheme, when i denotes the event category ‘extending economic aid’, G_i = 7.4, ‘policy endorsement’ maps to 3.6, ‘halt negotiations’ maps to -3.8, and a ‘military engagement’ maps to -10, the maximally conflictual event. The mapping allows a univariate and politically relevant comparison between the true conflict level and that of the event categories the system extracts.

The expected system response when the true category has conflict/cooperation level G_i is:

$$ g_i = \sum_{j}^{J} G_j\, P(M = j \mid T = i, M \neq 0) \tag{4} $$

where

$$ P(M = j \mid T = i, M \neq 0) = \frac{P(M \mid T)\, 1(M \neq 0)}{P(M \neq 0 \mid T)} $$

and 1(M ≠ 0) is an indicator function equaling 1 if M ≠ 0 and 0 otherwise.

A plot of G_i against g_i for each event category is shown in Figure 1. An unbiased estimator would show expected values on the main diagonal. Estimator bias for event category i is simply g_i − G_i. Estimator variance is simply the spread around the diagonal.

Figure 1: Expected (g_i) versus true (G_i) conflict-cooperation level for each event category. [Plot not reproduced; horizontal axis G_i, vertical axis g_i, both running from −10 to 10.]

4.3 Comparison

We also compared the system’s performance to 3 undergraduate coders (U1-3) working on the same data set. To examine undergraduate performance requires first P(U,T), from which we can