StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?

Guobin Shen1,2,3,4*, Dongcheng Zhao1,2,3*, Aorigele Bao1,2,3,5, Xiang He1,2,3, Yiting Dong1,2,3,4, Yi Zeng1,2,3,5
1Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; 2Beijing Institute of AI Safety and Governance; 3Center for Long-term Artificial Intelligence; 4School of Future Technology, University of Chinese Academy of Sciences; 5Department of Philosophy, School of Humanities, University of Chinese Academy of Sciences
Figure: Performance of Llama-3-8B-Instruct on the Leaderboard 2 benchmark under different stress levels.

Abstract

Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.

Prompt Design

To systematically investigate the impact of stress on LLM performance, we developed a dataset named StressPrompt, grounded in established psychological theories. The objective was to design prompts that elicit varying levels of stress, thereby enabling the evaluation of LLMs under different stress conditions. As illustrated in the figure below, the prompts were developed based on four key psychological frameworks, each offering a distinct perspective on stress and cognitive performance:

Stress and Coping Theory

This theory focuses on how individuals appraise and cope with stressors. We developed prompts to simulate varying levels of perceived threat and challenge, as well as the coping strategies employed, to provide insight into the dynamic interaction between stress appraisal and cognitive functioning.

Job Demand-Control Model

This model suggests that job stress is influenced by the balance between job demands and the control or autonomy an individual has over their work tasks. We designed prompts to simulate scenarios with varying job demands and levels of control, allowing us to study their effects on stress and cognitive performance.

Conservation of Resources Theory

This theory posits that stress occurs when there is a threat to, loss of, or insufficient gain of resources necessary to achieve one's goals. Using this framework, we created prompts that explore the dynamics of resource gain, loss, and protection in the context of stress, highlighting how these factors influence cognitive performance.

Effort-Reward Imbalance Model

According to this model, stress arises from an imbalance between the efforts an individual puts into their work and the rewards they receive. We crafted prompts to examine scenarios where this balance is either maintained or disrupted, assessing its impact on stress levels and task performance.
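To make this concrete, the sketch below shows how a stress-inducing prompt of this kind can be attached to a query as a system instruction, mirroring the usage illustrated in the second figure below. The prompt texts and the stress_prompts mapping are illustrative placeholders, not entries from the actual StressPrompt dataset.

# Illustrative sketch: applying a stress-inducing prompt as a system
# instruction. The prompt texts are placeholders, not items from the
# actual StressPrompt dataset.
stress_prompts = {
    1: "You are working in a calm, supportive environment with ample time.",
    5: "You face a firm deadline, but you have the resources to meet it.",
    9: "Deadlines are slipping, resources are gone, and failure carries severe consequences.",
}

def build_messages(stress_level: int, user_query: str) -> list[dict]:
    """Prepend a stress-inducing system instruction to a user query."""
    return [
        {"role": "system", "content": stress_prompts[stress_level]},
        {"role": "user", "content": user_query},
    ]

messages = build_messages(5, "Summarize the quarterly report in three sentences.")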


Figure: Design of StressPrompt based on psychological principles. Each category encompasses a range of stress-inducing scenarios, ensuring a comprehensive set of prompts for our study.

Figure: StressPrompt acts as a system instruction, simulating different environments and influencing the LLM's response. Left: low stress level. Right: moderately high stress level.

Stress Scanner

To further investigate how stress impacts the internal states of LLMs, we developed a Stress Scanner using techniques inspired by Representation Engineering (RepE). The Stress Scanner examines how different stress prompts from the StressPrompt dataset affect the hidden states of LLMs across various layers and token positions.

We collected hidden states \( \hat{h} \) from the LLMs when exposed to prompts spanning the full range of stress levels \( \mathcal{S} = \{S_1, S_2, \ldots, S_{10}\} \). By analyzing these hidden states, we aimed to identify significant changes in neural processing patterns induced by varying stress levels.

For each stress level \( S_i \in \mathcal{S} \), we collected the hidden states \( \hat{h} = f(s) \) from the LLM at various layers and token positions for every prompt \( s \in S_i \). Formally, let \( H(S_i) \) denote the set of hidden states collected at stress level \( S_i \):


\[ H(S_i) = \{\, \hat{h} = f(s) \mid s \in S_i \,\} \]
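As a minimal sketch of this collection step, assuming the Hugging Face transformers API, the hidden state of the last token at every layer can be gathered as follows; the last token is used because, as shown later, it correlates most strongly with stress, and any causal LM would work in place of the model named here.

# Sketch: collecting hidden states h = f(s) for each prompt s at a given
# stress level, using a Hugging Face transformers causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def collect_hidden_states(prompts: list[str]) -> torch.Tensor:
    """Return last-token hidden states for every prompt and layer:
    shape (num_prompts, num_layers, hidden_dim)."""
    states = []
    for s in prompts:
        inputs = tokenizer(s, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: one (1, seq_len, hidden_dim) tensor per layer;
        # index [1:] skips the embedding layer, [0, -1] takes the last token.
        last_token = torch.stack([h[0, -1] for h in out.hidden_states[1:]])
        states.append(last_token)
    return torch.stack(states)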
    

To quantify the impact of stress on the hidden states, we applied Principal Component Analysis (PCA) to the collected hidden states. We defined the stress vector \( v \) as the first principal component that captures the maximum variance between the low-stress and high-stress conditions:


\[ v = \mathrm{PCA}\left( \{\, H(S_i) \mid i \in \{1, \ldots, 10\} \,\} \right)_1 \]
    

Using the stress vector \( v \), we projected the hidden states onto \( v \) to obtain a stress score for each hidden state, reflecting the degree of stress induced by the prompt. For a given hidden state \( \hat{h} \), the stress score \( \sigma \) was computed as:


\[ \sigma = \hat{h} \cdot v \]
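These two steps amount to a PCA fit followed by a dot product. The sketch below uses scikit-learn on hidden states pooled across all ten stress levels; the array shapes and synthetic data are assumptions for illustration.

# Sketch: deriving the stress vector v and the stress scores sigma.
import numpy as np
from sklearn.decomposition import PCA

def stress_direction(H: np.ndarray) -> np.ndarray:
    """Stress vector v: first principal component of the pooled hidden states."""
    pca = PCA(n_components=1)
    pca.fit(H)  # scikit-learn centers the data internally
    return pca.components_[0]

def stress_scores(H: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project each hidden state onto v: sigma = h_hat · v."""
    return H @ v

# Demo with synthetic states of shape (num_prompts, hidden_dim);
# 4096 is the hidden size of Llama-3-8B.
H = np.random.randn(300, 4096)
v = stress_direction(H)
sigma = stress_scores(H, v)  # one stress score per prompt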
    

We visualized the distribution of stress scores across different layers and token positions to identify patterns of neural activity under varying stress conditions. By systematically analyzing the stress-induced changes in neural activity, we gain a deeper understanding of the effects of stress on LLMs and their alignment with human stress responses. This approach offers a novel method for evaluating the robustness and resilience of LLMs under varying stress conditions.
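The layer-wise view shown in the heatmap below can be produced with a few lines of matplotlib; here scores is a placeholder array standing in for the mean stress score per layer and stress level.

# Sketch: heatmap of stress scores across layers and stress levels.
import numpy as np
import matplotlib.pyplot as plt

num_layers, num_levels = 32, 10  # e.g. Llama-3-8B has 32 layers
scores = np.random.randn(num_layers, num_levels)  # placeholder: mean sigma per (layer, level)

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(scores, aspect="auto", cmap="coolwarm", origin="lower")
ax.set_xlabel("Stress level")
ax.set_ylabel("Layer")
fig.colorbar(im, ax=ax, label="Stress score")
plt.show()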


Figure: Stress scanner constructed with RepE. Different StressPrompt instructions induce distinct patterns of neural activity in LLMs, with the last token showing the strongest correlation with stress.

Figure: Heatmap of neural activity at the last token across all layers for various stress levels in Llama-3-8B-Instruct.

Experiments

The experimental results, summarized in the figures below, illustrate the effects of varying stress levels induced by StressPrompt on the performance of different language models across multiple tasks. Our analysis focuses on the impact of stress along several dimensions, including task performance, model sensitivity, and the general trends observed.

In most tasks, moderate stress levels enhance performance, while high stress levels lead to declines, consistent with the Yerkes-Dodson law. This suggests that moderate stress stimulates cognitive engagement, whereas excessive stress overwhelms the system and impairs function.

A comparative analysis across different benchmarks shows that stress affects the various dimensions of LLM capability in distinct ways.

These findings underscore the importance of tailoring stress levels to optimize LLM performance, particularly in tasks demanding high emotional intelligence and fairness. By understanding how stress affects different cognitive and social competencies, we can better align LLMs with human-like responses, enhancing their utility in diverse applications.



Figure: Performance comparison of Llama-3-8B-Instruct and Phi-3-mini-4k-instruct across different stress levels on various benchmarks.

Figure: Performance changes relative to baseline across different stress levels for EQ-Bench, ToxiGen, and TruthfulQA.

Conclusion

In this study, we constructed a dataset named StressPrompt to induce varying levels of stress in LLMs. Our analysis shows that stress significantly affects the internal states of LLMs, with deeper layers exhibiting higher sensitivity to stress levels. Moderate stress can enhance performance in tasks such as instruction following, reasoning, and emotional intelligence, while higher stress levels negatively impact bias detection. We developed a stress scanner that effectively measures the impact of stress on LLMs' internal states, providing a tool to evaluate model robustness and resilience. These findings highlight the necessity of adjusting stress levels based on task requirements to optimize LLM performance. Identifying optimal stress levels can improve the resilience and adaptability of AI systems, ensuring reliable performance under pressure. Future research could explore other psychological phenomena and their effects on LLMs, further bridging the gap between human intelligence and artificial intelligence.

BibTeX

@misc{shen2024stressprompt,
  author    = {Guobin Shen and Dongcheng Zhao and Aorigele Bao and Xiang He and Yiting Dong and Yi Zeng},
  title     = {StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?},
  year      = {2024},
}