Evidence for Systematic Bias in the Spatial Memory of Large Language Models

Recent studies primarily view Large Language Models (LLMs) in geography as tools for linking natural language to geographic information systems. However, Roberts et al. (2023) demonstrated GPT-4’s inherent ability to perform spatial reasoning tasks without relying on external processing engines. This includes calculating the final destinations of routes based on initial locations, transport modes, directions, and travel durations. Such capabilities could have practical applications like creating personalized travel itineraries. Identifying LLM weaknesses in spatial tasks may guide their development.

This study investigates whether biases in human spatial reasoning manifest in LLMs, focusing on four well-studied biases: hierarchical bias, proximity bias, rotation bias, and alignment bias. Hierarchical bias involves the tendency to infer directions based on dominant geographical orientations, leading to inaccuracies when exceptions occur. Proximity bias involves underestimating distances within the same categorical group. Rotation bias is the tendency to align mental representations of geographical elements with conventional cardinal directions. Alignment bias involves overestimating the alignment of grouped locations, skewing perceptions of their true relationships.

LLMs learn from textual data, which may contain human spatial reasoning biases. This study hypothesizes that LLMs may exhibit similar biases due to their learning mechanisms. To investigate hierarchical bias, ten questions were posed to four models: GPT-3.5, GPT-4, LLaMA 2, and Gemini 1.0 Pro. GPT-4 showed superior performance, leading to its use for further bias analysis. Each question was posed in ‘zero-shot’ mode to reset the model after every question, ensuring unbiased responses.

Results for hierarchical bias indicated consistent errors across models, particularly with directions between cities like Portland and Toronto. GPT-4 achieved the highest accuracy at 75%, followed by Gemini at 55%, GPT-3.5 at 53%, and LLaMA-2 at 47%. For hierarchical bias-specific tasks, GPT-4 scored 50%, while other models scored lower. Absence of hierarchical bias improved model performance, with accuracy rates above 75%.

For proximity bias, GPT-4 often misjudged distances within and between states. For rotation bias, it incorrectly identified relative positions of cities due to oversimplification of geographical curvatures. For alignment bias, GPT-4 struggled with intercardinal directions, such as between Monaco and Chicago, reflecting misconceptions about continental alignments.

Overall, LLMs showed 87% accuracy in straightforward tasks but only 24% in bias-highlighted questions. Future research will focus on proximity bias, querying models with numerous cities to better understand these biases and their origins. Training LLMs with spatially explicit datasets could mitigate these issues, enhancing their spatial reasoning capabilities. Future work includes fine-tuning an open-source LLM to improve its understanding of geographic relationships.

References:

https://ceur-ws.org/Vol-3683/paper8.pdf
J. Roberts, T. Lüddecke, S. Das, K. Han, S. Albanie, Gpt4geo: How a language model sees the world’s geography, arXiv preprint arXiv:2306.00020 (2023).

Photo: Second GeoExT workshop in Glasgow