Leaderboards
We maintain leaderboards for both the temporal and geographical SituatedQA tasks. In addition to overall performance, we also report performance on different splits of context values (see Table 4 in our paper for details). All performance values are measured in exact match accuracy.
Submissions
To add your system to our leaderboards, please see our submission instructions.
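Before submitting, it can help to check scores locally. The sketch below shows one common way to compute exact match accuracy in Python. The function names `normalize` and `exact_match_accuracy` are hypothetical. The normalization steps (lowercasing, stripping punctuation and English articles) are assumptions borrowed from common open-domain QA practice, so the official evaluation may differ in detail.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; not taken from the official evaluation script.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions matching any reference answer.

    predictions: list of predicted answer strings.
    references: list of lists of acceptable answer strings, one list per question.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        any(normalize(pred) == normalize(gold) for gold in golds)
        for pred, golds in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


# Illustrative inputs only; these are not taken from the dataset.
preds = ["The United States", "1999"]
golds = [["United States", "USA"], ["2001"]]
print(exact_match_accuracy(preds, golds))  # prints 50.0
```

Scores are returned on a 0 to 100 scale to match the tables on this page.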
Temporal
| Rank | System | Overall | Static | Sampled | Start |
|---|---|---|---|---|---|
| - | Human* | 57.0 | - | - | - |
| 1 | Baseline-DPR + Query Mod. + Finetuning | 23.0 | 39.8 | 17.2 | 24.9 |
| 2 | Baseline-DPR | 19.4 | 44.2 | 16.0 | 14.2 |
| 3 | Baseline-DPR + Query Mod. | 18.6 | 28.8 | 15.9 | 18.5 |
| 4 | Baseline-BART + Query Mod. + Finetuning | 18.3 | 26.0 | 16.2 | 18.3 |
| 5 | Baseline-BART | 16.2 | 27.2 | 15.3 | 12.9 |
| 6 | Baseline-BART + Query Mod. | 14.5 | 19.5 | 12.4 | 15.7 |
Geographical
| Rank | System | EM | Common | Uncommon |
|---|---|---|---|---|
| - | Human* | 34.0 | - | - |
| 1 | Baseline-DPR + Query Mod. + Finetuning | 26.5 | 27.9 | 25.0 |
| 2 | Baseline-DPR + Query Mod. | 25.0 | 27.5 | 22.1 |
| 3 | Baseline-BART + Query Mod. + Finetuning | 16.8 | 21.5 | 11.7 |
| 4 | Baseline-BART + Query Mod. | 14.5 | 19.2 | 9.2 |
| 5 | Baseline-BART | 7.1 | 9.4 | 4.6 |
| 6 | Baseline-DPR | 6.1 | 9.1 | 2.9 |