DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting
Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede’s VSM13 international surveys to understand the cultural alignment of the following models: DeepSeek-V3, V3.1, GPT-4, GPT-4.1, GPT-4o, and GPT-5. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model’s alignment to reflect a specific country, to align these LLMs with the United States and China. Our results show that DeepSeek-V3, V3.1, and OpenAI’s GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.
💡 Research Summary
This paper presents a comprehensive empirical analysis of the cultural alignment of Large Language Models (LLMs) with the United States and China, using Geert Hofstede’s well-established cultural dimensions framework (VSM13 survey). The study investigates the efficacy of two prompt-based mitigation strategies—changing the prompt language and employing cultural prompting (i.e., using system prompts to instruct the model to respond as a person from a specific country)—to shift a model’s inherent cultural biases.
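The two interventions can be illustrated with a minimal sketch; the helper name `build_messages` and the cultural-prompt wording are assumptions for illustration, not the paper's verbatim prompts.

```python
# Sketch: combining the two prompt-based strategies described above.
# The system-prompt wording and helper name are illustrative assumptions,
# not the paper's verbatim prompt text.

def build_messages(question, persona_country=None):
    """Build a chat-style message list for one survey question.

    persona_country: e.g. "the United States" or "China"; None means
    no cultural prompt (the baseline condition).
    """
    messages = []
    if persona_country is not None:
        # Cultural prompting: a system prompt instructing the model to
        # answer as a person from the target country.
        messages.append({
            "role": "system",
            "content": f"You are a person from {persona_country}. "
                       "Answer the following question as such a person would.",
        })
    messages.append({"role": "user", "content": question})
    return messages

# The prompt-language intervention is applied separately, by supplying the
# question text itself in English or in Simplified Chinese.
example = build_messages("Example survey question.", persona_country="China")
```

Crossing the two prompt languages with the three cultural-prompt settings (none, U.S., China) yields the six experimental conditions described below.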
The research evaluates six state-of-the-art LLMs: OpenAI’s GPT-4, GPT-4.1, GPT-4o, and GPT-5, and DeepSeek-AI’s DeepSeek-V3 and V3.1. For each model, the authors simulate a population of respondents by generating 20 complete survey responses under each of six experimental conditions: prompts in English or Simplified Chinese, each combined with no cultural prompt, a U.S. cultural prompt, or a Chinese cultural prompt. This design yields 17,280 prompt-response pairs in total, supporting statistically reliable dimension estimates. Alignment is measured as the absolute distance between the model’s derived Hofstede dimension scores (Power Distance, Individualism, etc.) and the reference scores for the U.S. and China.
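The alignment metric can be sketched as a mean absolute distance over the six Hofstede dimensions. All numeric scores below are illustrative placeholders (the reference values approximate Hofstede's published country scores but are assumptions for this example, not the paper's measured data).

```python
# Sketch of the alignment metric: mean absolute distance between a model's
# derived Hofstede dimension scores and a country's reference scores.
# All numeric scores below are illustrative placeholders, not the paper's data.

DIMENSIONS = ["PDI", "IDV", "MAS", "UAI", "LTO", "IVR"]  # the six Hofstede dimensions

def alignment_distance(model_scores, reference_scores):
    """Mean absolute distance over the six dimensions; smaller = closer alignment."""
    return sum(abs(model_scores[d] - reference_scores[d])
               for d in DIMENSIONS) / len(DIMENSIONS)

# Approximate published Hofstede reference scores (illustrative).
usa   = {"PDI": 40, "IDV": 91, "MAS": 62, "UAI": 46, "LTO": 26, "IVR": 68}
china = {"PDI": 80, "IDV": 20, "MAS": 66, "UAI": 30, "LTO": 87, "IVR": 24}

# Hypothetical scores derived from one model's simulated survey responses.
model = {"PDI": 45, "IDV": 85, "MAS": 60, "UAI": 50, "LTO": 30, "IVR": 65}

print(alignment_distance(model, usa))    # 4.0 -> close to the U.S.
print(alignment_distance(model, china))  # ~37.3 -> far from China
```

A model is considered better aligned with whichever country yields the smaller distance, and an intervention's effect is the change in that distance.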
The key findings reveal a pronounced and persistent bias towards Western, Educated, Industrialized, Rich, and Democratic (WEIRD) cultural values across most models. Notably, even the models developed by the Chinese company DeepSeek (V3 and V3.1) showed a strong alignment with the United States when prompted in English and failed to achieve close alignment with China across all tested strategies. This underscores that the cultural bias is deeply embedded in the training data corpus, which is predominantly English and WEIRD, rather than being solely determined by the model’s country of origin.
The adaptability of models to prompt-based interventions varied significantly. Lower-cost models like GPT-4o and GPT-4.1 demonstrated flexibility, responding effectively to both language switching and cultural prompting to produce acceptable alignments with both target cultures. In contrast, GPT-5 and the DeepSeek models were relatively “sticky,” showing limited movement from their default WEIRD alignment despite the interventions. GPT-4 occupied a more neutral ground but leaned closer to China when prompted in English, an alignment that could be shifted toward the U.S. with cultural prompting.
The analysis of mitigation strategy effectiveness showed that using the target country’s native language (English for U.S., Chinese for China) provided the most consistent improvement in alignment. Cultural prompting was highly effective (29.8% improvement) when applied in English to shift alignment towards China, but had minimal effect when applied in Chinese to shift alignment towards the U.S. This asymmetry suggests that English may function as a more culturally neutral or flexible medium within the LLMs’ representation space compared to Chinese.
In conclusion, the study confirms the pervasive WEIRD bias in contemporary LLMs and highlights that prompt engineering techniques offer a viable, low-cost method for cultural adaptation, but their success is highly dependent on the specific model architecture and training. The findings call for greater attention to cultural diversity in training data and model design to develop truly global and culturally competent AI systems, while providing practical guidance for developers seeking to deploy LLMs in cross-cultural applications.