Leaderboards

We maintain leaderboards for both the temporal and geographical SituatedQA tasks. In addition to overall performance, we report performance on different splits of context values (see Table 4 in our paper for details). All values are exact match (EM) accuracy.
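For readers unfamiliar with the metric, here is a minimal sketch of how exact match accuracy is commonly computed. It is not the official SituatedQA evaluation script. The normalization steps (lowercasing, stripping punctuation and English articles), the function names, and the 0-100 scale are assumptions chosen for illustration.

```python
from __future__ import annotations

import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors a common open-domain QA convention; the
    official evaluation may normalize answers differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def em_accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match a reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

With this sketch, em_accuracy(["The Beatles", "Paris"], [["Beatles"], ["London"]]) counts the first prediction as correct and the second as incorrect, so half of the predictions match.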

Submissions

To add your system to our leaderboards, please see our submission instructions.

Temporal

| Rank | System | Overall | Static | Sampled | Start |
|------|--------|---------|--------|---------|-------|
| - | Human* | 57.0 | - | - | - |
| 1 | Baseline-DPR + Query Mod. + Finetuning | 23.0 | 39.8 | 17.2 | 24.9 |
| 2 | Baseline-DPR | 19.4 | 44.2 | 16.0 | 14.2 |
| 3 | Baseline-DPR + Query Mod. | 18.6 | 28.8 | 15.9 | 18.5 |
| 4 | Baseline-BART + Query Mod. + Finetuning | 18.3 | 26.0 | 16.2 | 18.3 |
| 5 | Baseline-BART | 16.2 | 27.2 | 15.3 | 12.9 |
| 6 | Baseline-BART + Query Mod. | 14.5 | 19.5 | 12.4 | 15.7 |

Geographical

| Rank | System | EM | Common | Uncommon |
|------|--------|----|--------|----------|
| - | Human* | 34.0 | - | - |
| 1 | Baseline-DPR + Query Mod. + Finetuning | 26.5 | 27.9 | 25.0 |
| 2 | Baseline-DPR + Query Mod. | 25.0 | 27.5 | 22.1 |
| 3 | Baseline-BART + Query Mod. + Finetuning | 16.8 | 21.5 | 11.7 |
| 4 | Baseline-BART + Query Mod. | 14.5 | 19.2 | 9.2 |
| 5 | Baseline-BART | 7.1 | 9.4 | 4.6 |
| 6 | Baseline-DPR | 6.1 | 9.1 | 2.9 |