Leaderboards
We maintain leaderboards for both the temporal and geographical SituatedQA tasks. In addition to overall performance, we also report performance on different splits of context values (see Table 4 in our paper for details). All performance values are measured in exact match accuracy.
Submissions
To add your system to our leaderboards, please see our submission instructions.
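As a rough guide when preparing a submission, exact match accuracy can be sketched as follows. This is a minimal illustration, not the official evaluation script: the normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the convention common in open-domain QA and are an assumption here, and the names `normalize_answer`, `exact_match`, and `exact_match_accuracy` are hypothetical.

```python
import re
import string
from typing import List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the normalization commonly used in QA
    evaluation; the official SituatedQA scorer may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """Return True if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[List[str]]
) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Scores are expressed as percentages, matching the scale used in the tables; each question may carry several acceptable gold answers, and a prediction counts as correct if it matches any one of them.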
Temporal
Rank | System | Overall | Static | Sampled | Start |
---|---|---|---|---|---|
- | Human* | 57.0 | - | - | - |
1 | Baseline-DPR + Query Mod. + Finetuning | 23.0 | 39.8 | 17.2 | 24.9 |
2 | Baseline-DPR | 19.4 | 44.2 | 16.0 | 14.2 |
3 | Baseline-DPR + Query Mod. | 18.6 | 28.8 | 15.9 | 18.5 |
4 | Baseline-BART + Query Mod. + Finetuning | 18.3 | 26.0 | 16.2 | 18.3 |
5 | Baseline-BART | 16.2 | 27.2 | 15.3 | 12.9 |
6 | Baseline-BART + Query Mod. | 14.5 | 19.5 | 12.4 | 15.7 |
Geographical
Rank | System | EM | Common | Uncommon |
---|---|---|---|---|
- | Human* | 34.0 | - | - |
1 | Baseline-DPR + Query Mod. + Finetuning | 26.5 | 27.9 | 25.0 |
2 | Baseline-DPR + Query Mod. | 25.0 | 27.5 | 22.1 |
3 | Baseline-BART + Query Mod. + Finetuning | 16.8 | 21.5 | 11.7 |
4 | Baseline-BART + Query Mod. | 14.5 | 19.2 | 9.2 |
5 | Baseline-BART | 7.1 | 9.4 | 4.6 |
6 | Baseline-DPR | 6.1 | 9.1 | 2.9 |