Usage
The NEL evaluation tools are invoked using ./nel inside the repository. Usage:

./nel <command> [<args>]

To list available commands:

./nel

To get help for a specific command:

./nel <command> -h

The commands that are relevant to TAC KBP entity linking evaluation and analysis are described below.
The following describes a typical workflow. See also run_tac14_evaluation.sh and run_tac13_evaluation.sh.
For data in [TAC14 format](data format):
./nel prepare-tac \
-q /path/to/gold.xml \ # gold queries/mentions file
/path/to/gold.tab \ # gold KB/NIL annotations file
    > gold.combined.tsv

For data in TAC12 and TAC13 format, remove extra columns first, e.g.:
cat /path/to/gold.tab \
| cut -f1,2,3 \
> gold.tab
./nel prepare-tac \
-q /path/to/gold.xml \
gold.tab \
    > gold.combined.tsv

Next, prepare the system output. For data in [TAC14 format](data format):
./nel prepare-tac \
-q /path/to/system.xml \ # system mentions file
/path/to/system.tab \ # system KB/NIL annotations
    > system.combined.tsv

For data in TAC12 and TAC13 format, add a dummy NE type column first, e.g.:
cat /path/to/system.tab \
| awk 'BEGIN{OFS="\t"} {print $1,$2,"NA",$3}' \
> system.tab
./nel prepare-tac \
-q /path/to/gold.xml \ # gold queries/mentions file
system.tab \ # system KB/NIL annotations
    > system.combined.tsv

To calculate micro-averaged scores for all evaluation measures:
./nel evaluate \
-m all \ # report all evaluation measures
-f tab \ # print results in tab-separated format
-g gold.combined.tsv \ # prepared gold standard annotation
system.combined.tsv \ # prepared system output
    > system.evaluation

To list available evaluation measures:
./nel list-measures

The following describes additional commands for analysis. See also run_tac14_all.sh (TODO) and run_tac13_all.sh.
To calculate confidence intervals using bootstrap resampling:
./nel confidence \
-m strong_typed_link_match \ # report CI for TAC14 wikification measure
-f tab \ # print results in tab-separated format
-g gold.combined.tsv \ # prepared gold standard annotation
system.combined.tsv \ # prepared system output
    > system.confidence

We recommend installing joblib (pip install joblib) and using -j NUM_JOBS to run this in parallel. This is also faster if an individual evaluation measure is specified (e.g., strong_typed_link_match) rather than a group of measures (e.g., tac).
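For example, a minimal sketch of a parallel run, assuming joblib installs cleanly from PyPI; the job count of 8 is arbitrary:

pip install joblib
./nel confidence -m strong_typed_link_match -f tab -j 8 -g gold.combined.tsv system.combined.tsv > system.confidence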
The run_report_confidence.sh script is available to create reports comparing multiple systems.
Note that bootstrap resampling is not appropriate for nil clustering measures. For more detail, see the Significance wiki page.
It is also possible to calculate pairwise differences:
./nel significance \
--permute \ # use permutation method
-f tab \ # print results in tab-separated format
-g gold.combined.tsv \ # prepared gold standard annotation
system1.combined.tsv \ # prepared system1 output
system2.combined.tsv \ # prepared system2 output
    > system1-system2.significance

We recommend calculating significance only for selected system pairs, as it can take a while over all N choose 2 combinations of systems. You can also use -j NUM_JOBS to run this in parallel.
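For instance, a sketch of the same pairwise comparison spread over several jobs (the count of 8 is illustrative):

./nel significance --permute -f tab -j 8 -g gold.combined.tsv system1.combined.tsv system2.combined.tsv > system1-system2.significance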
To create a table of classification errors:
./nel analyze \
-s \ # print summary table
    -g gold.combined.tsv \ # prepared gold standard annotation
system.combined.tsv \ # prepared system output
    > system.analysis

Without the -s flag, the analyze command will list and categorize differences between the gold standard and system output.
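For example, a sketch of that detailed listing; the output filename is just illustrative:

./nel analyze -g gold.combined.tsv system.combined.tsv > system.errors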
The following describes a workflow for evaluation over subsets of mentions. See also run_tac14_filtered.sh (TODO) and run_tac13_filtered.sh.
Prepared data is in a simple tab-separated format with one mention per line and six columns: document_id, start_offset, end_offset, kb_or_nil_id, score, entity_type. It is possible to use command line tools (e.g., grep, awk) to select mentions for evaluation, e.g.:
cat gold.combined.tsv \ # prepared gold standard annotation
| egrep "^eng-(NG|WL)-" \ # select newsgroup and blog (WB) mentions
> gold.WB.tsv # filtered gold standard annotation
cat system.combined.tsv \ # prepared system output
| egrep "^eng-(NG|WL)-" \ # select newsgroup and blog (WB) mentions
    > system.WB.tsv          # filtered system output

After filtering, evaluation is run as before:
./nel evaluate \
-m all \ # report all evaluation measures
-f tab \ # print results in tab-separated format
-g gold.WB.tsv \ # filtered gold standard annotation
system.WB.tsv \ # filtered system output
> system.WB.evaluation
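The other columns can be filtered in the same way. For instance, a sketch that keeps only mentions of a single entity type, assuming PER is one of the labels appearing in the sixth (entity_type) column of your prepared files:

awk -F'\t' '$6 == "PER"' gold.combined.tsv > gold.PER.tsv
awk -F'\t' '$6 == "PER"' system.combined.tsv > system.PER.tsv

The filtered files are then passed to ./nel evaluate exactly as above.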