SituatedQA

Leaderboards

We maintain leaderboards for both the temporal and geographical SituatedQA tasks. In addition to overall performance, we report performance on different splits of context values (see Table 4 in our paper for details). All scores are reported as exact match (EM) accuracy.
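For readers unfamiliar with the metric, the following is a minimal sketch of how exact match accuracy might be computed, both overall and per split. It is not the official evaluation script. The normalization steps (lowercasing, stripping punctuation and English articles) follow a common open-domain QA convention and are an assumption here, as are the function names and the percentage scaling.

```python
import re
import string
from collections import defaultdict


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; the official scorer may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def exact_match_accuracy(predictions: list, references: list) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def accuracy_by_split(predictions: list, references: list, splits: list) -> dict:
    """Exact match accuracy for each split label (hypothetical labels)."""
    grouped = defaultdict(lambda: ([], []))
    for pred, refs, split in zip(predictions, references, splits):
        grouped[split][0].append(pred)
        grouped[split][1].append(refs)
    return {s: exact_match_accuracy(ps, rs) for s, (ps, rs) in grouped.items()}
```

Each reference entry is a list of acceptable answers for one question, and `splits` holds one label per question, so the same predictions can be scored overall and per split.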

Submissions

To add your system to our leaderboards, please see our submission instructions.

Temporal

Rank	System	Overall	Static	Sampled	Start
-	Human*	57.0	-	-	-
1	Baseline-DPR + Query Mod. + Finetuning	23.0	39.8	17.2	24.9
2	Baseline-DPR	19.4	44.2	16.0	14.2
3	Baseline-DPR + Query Mod.	18.6	28.8	15.9	18.5
4	Baseline-BART + Query Mod. + Finetuning	18.3	26.0	16.2	18.3
5	Baseline-BART	16.2	27.2	15.3	12.9
6	Baseline-BART + Query Mod.	14.5	19.5	12.4	15.7

Geographical

Rank	System	Overall	Common	Uncommon
-	Human*	34.0	-	-
1	Baseline-DPR + Query Mod. + Finetuning	26.5	27.9	25.0
2	Baseline-DPR + Query Mod.	25.0	27.5	22.1
3	Baseline-BART + Query Mod. + Finetuning	16.8	21.5	11.7
4	Baseline-BART + Query Mod.	14.5	19.2	9.2
5	Baseline-BART	7.1	9.4	4.6
6	Baseline-DPR	6.1	9.1	2.9