
AI Role-Playing in Personality Assessment: Big Five and MBTI Test Results

Accurate character representation is key in AI role-playing. Researchers rewrote the Big Five's NEO-FFI questionnaire and MBTI's 16Personalities scale, using Large Language Models (LLMs) to transform the declarative items into open-ended interview questions for testing AI characters. The tests found an 82.8% trait restoration rate.


Ever wanted to chat with your favorite anime or novel characters? Desire a virtual companion? Or own a digital intelligent entity?

With the advancement of Large Language Models (LLMs), these once-fanciful ideas seem increasingly within reach.

Character.AI, Chat Haruhi Suzumiya, Yandere AI Girlfriend Simulator, and other role-playing chatbots built on large language models have attracted global attention.

Compared to traditional chatbots (like Microsoft's Xiaoice), which required extensive engineering and were tied to specific scenarios, recent large language models can create role-playing AIs with different identities, personalities, memories, and language habits using only simple prompt engineering and memory mechanisms. Hence, role-playing AI is becoming increasingly popular.

Nevertheless, analytical studies on role-playing AI, especially regarding their evaluation, are still scarce. How can the quality of AI's role-playing be assessed?

In the realms of cosplay and fan fiction, it's emphasized that one should not be "out-of-character (OOC)."

Thus, accurately restoring a character's traits is a key dimension in evaluating role-playing AI.

Recently, Fudan University and Renmin University of China, in collaboration with the Chat Haruhi Suzumiya team, published a paper analyzing the restoration of character traits in role-playing AI from a personality trait perspective.

In this paper, researchers conducted personality tests on role-playing AI based on theories related to personality in psychology, such as the Big Five and MBTI.

The researchers proposed an interview-style personality testing framework designed for role-playing AI, built on open-ended questions derived from personality test scales. They used LLMs or the 16Personalities API to predict the personality traits of role-playing AIs and compared the predictions with character tags widely recognized by human fans.

The experimental results showed that the existing role-playing AI's trait restoration rate reached 82.8%.

Method Overview

Interview-style Personality Testing Based on Open Questions

Although existing Large Language Models (LLMs) still fall well short of human intelligence, they can nonetheless be viewed from a psychological perspective as a classic "stimulus-response" system.

Thus, the paradigm of psychological personality research can be well applied to study the behavior patterns of LLMs. Some recent studies have explored whether LLMs possess stable and consistent personality traits and attempted to customize AIs with specified personalities.

These typically use a questionnaire with 60 or more questions, testing the LLMs' personalities from various dimensions. For instance, the Big Five includes dimensions like openness, neuroticism, conscientiousness, agreeableness, and extraversion, while MBTI includes dimensions such as extraversion/introversion, sensing/intuition, thinking/feeling, and judging/perceiving.

Existing works often use the Likert Scale, guiding human subjects or LLMs to choose from five or seven different levels of options, such as "strongly agree," "agree," "neutral," "disagree," "strongly disagree."

However, this method has many shortcomings for role-playing AI:

  1. Although providing options is more efficient for human subjects, this method offers very limited information compared to open-ended questions.

  2. Option-based questioning does not effectively elicit the AI's role-playing behavior and can easily be influenced by the base LLM's training data, leading to choices inconsistent with the character being played.

  3. Interestingly, some characters with distinct personalities may refuse to cooperate with the given options, precisely because the AI faithfully restores the character's personality.

Therefore, the researchers proposed an interview-style personality testing framework designed for role-playing AI, conducting personality tests by posing scale questions as open-ended interview questions.

The researchers rewrote the Big Five's NEO-FFI questionnaire and MBTI's 16Personalities scale, using LLMs to transform the questions from declarative statements into open-ended interview questions, creating a new set of questionnaires.
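The rewriting step can be sketched as a simple prompt template handed to an LLM. This is a minimal illustration, not the paper's actual prompt; the template wording, the `build_rewrite_prompt` helper, and the example scale item are all assumptions for demonstration.

```python
# Sketch of the declarative-to-open-ended rewriting step, assuming a
# generic LLM completion call. Only the prompt construction is shown;
# the exact wording used in the paper is not reproduced here.

REWRITE_PROMPT = (
    "Rewrite the following personality-scale statement as an open-ended "
    "interview question that invites a free-form answer.\n"
    'Statement: "{statement}"\n'
    "Open-ended question:"
)

def build_rewrite_prompt(statement: str) -> str:
    """Fill the template for one Likert-style scale item."""
    return REWRITE_PROMPT.format(statement=statement)

# Illustrative Big Five-style item (wording is a placeholder):
prompt = build_rewrite_prompt("I see myself as someone who is talkative.")
print(prompt)
```

The resulting prompt would then be sent to an LLM, whose completion becomes one interview question in the new questionnaire.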

The researchers conducted interview-style personality tests on 32 role-playing AIs from Chat Haruhi (using gpt-3.5-turbo as the base model).

For each target role AI, the researchers set up a related character as the experimenter, selecting questions from the final questionnaire to ask the target role, while the role AI gave open-ended responses as answers.

Each question was posed in different contexts to avoid mutual influence between them. Afterwards, all Q&A pairs for each character on each scale were recorded as the basis for personality assessment.
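The interview procedure above can be sketched as a loop that opens a fresh conversation for every question, so answers cannot influence one another. The `ask_character` stub is a hypothetical stand-in for the actual role-playing chatbot call.

```python
# Minimal sketch of the interview loop: each question is asked in an
# independent context. `ask_character` is a placeholder; a real
# implementation would invoke the role-playing AI with the character's
# prompt and memory in a brand-new conversation each time.

def ask_character(character: str, experimenter: str, question: str) -> str:
    # Stub response standing in for the role AI's open-ended answer.
    return f"{character}'s answer to: {question}"

def run_interview(character: str, experimenter: str, questions: list) -> list:
    qa_pairs = []
    for q in questions:  # one fresh context per question
        answer = ask_character(character, experimenter, q)
        qa_pairs.append({"question": q, "answer": answer})
    return qa_pairs

pairs = run_interview(
    "Haruhi", "Kyon",
    ["Do you enjoy parties?", "How do you plan your week?"],
)
print(len(pairs))  # 2
```

All recorded Q&A pairs per character and per scale then serve as the input to the personality assessment stage.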


Based on the characters' Q&A results on the scale questions, there are two methods to assess their personality traits. One, following existing work, converts the role AI's answers back into Likert Scale options and then performs the assessment through APIs such as 16Personalities.

This study proposes another method: letting LLMs such as GPT-4 or ChatGPT assess the role AI's personality directly from the Q&A pairs.

The researchers divided the role AI's Q&A pairs on the questionnaire by dimension, feeding all of each dimension's Q&A pairs (or several batches of them) into the LLM in turn to obtain the role AI's score on that dimension.

On the Big Five, the role AI receives a score between -5 and 5 for each dimension; on MBTI, the role AI receives two scores that add up to 100% for each dimension, such as 30% E / 70% I, and is predicted to belong to the category with a score over 50%.
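The scoring scheme described above can be illustrated with two small helpers: averaging per-batch Big Five scores in [-5, 5], and picking the MBTI side that exceeds 50%. The function names and batch structure are assumptions; only the scoring rules come from the text.

```python
# Sketch of score aggregation under the scheme described above.
# Big Five: the LLM assigns each batch of Q&A pairs a score in [-5, 5]
# for a dimension; batch scores are averaged here (an assumption).
# MBTI: each dimension gets two percentages summing to 100%, and the
# side over 50% is the predicted category.

def aggregate_big_five(batch_scores: list) -> float:
    """Average LLM-assigned scores (each in [-5, 5]) for one dimension."""
    return sum(batch_scores) / len(batch_scores)

def mbti_prediction(percent_first: float, pair: tuple) -> str:
    """Given e.g. 30.0 and ('E', 'I'), pick the side scoring over 50%."""
    return pair[0] if percent_first > 50 else pair[1]

print(aggregate_big_five([2, 1, 3]))      # 2.0
print(mbti_prediction(30.0, ("E", "I")))  # I
```

So a character rated 30% E / 70% I on the extraversion/introversion dimension would be predicted as an introvert.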

Experimental Results

Big Five Test Results of Different Role-Playing AI

The above figure shows the test results of 32 role-playing AIs from Chat Haruhi on the Big Five scale. The results show that role-playing AIs can exhibit a variety of personality traits based on different roles.

However, their personality traits are also largely influenced by the "baseline personality" of the base LLM. For example, the average score of role-playing AIs on the "neuroticism" dimension is -1.69, while on "conscientiousness," it is 1.56.

Researchers speculate that this is due to both the prior bias in character selection and the influence of the base model, as ChatGPT and other LLMs are trained to provide detailed, helpful, and positive responses.

To study this, researchers compared the average personality scores of 32 role-playing AIs with the personality scores of the base model itself, selecting ChatGPT and GLMPro as two different base models.

The results above show that on the "neuroticism" dimension, the average score of the role-playing AIs tracks that of the base model, while no clear correspondence appears in the other dimensions.

MBTI Test Results of Different Role-Playing AIs

The researchers also conducted MBTI tests on the role-playing AIs and compared the test results with personality tags collected from the internet. Most of the personality tags came from a community voting site where many fans vote on the MBTI personalities of characters, and the voting proportions for each dimension are visible.

Researchers considered tags with a voting proportion of 40%-60% as "controversial" and did not consider them when calculating accuracy. In the figure, red text indicates incorrectly predicted dimensions, and pink text indicates incorrectly predicted but controversial dimensions.

Next, researchers calculated the accuracy of the role-playing AI's personality test results, i.e., their consistency with fan tags.

It should be noted that there are two factors affecting accuracy: the performance of the role-playing AI itself and the effectiveness of the personality testing method. The experiment focused on analyzing the effectiveness of the personality testing method, thus controlling all role-playing AIs as the ChatHaruhi model based on gpt-3.5-turbo.
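The accuracy computation with the 40%-60% "controversial" filter can be sketched as follows. The data layout (dicts of dimension letters and winning-vote percentages) is an assumption made for illustration; the exclusion rule itself comes from the text.

```python
# Sketch of per-dimension accuracy against fan tags: dimensions whose
# fan-vote share falls in the 40%-60% band are treated as controversial
# and excluded from the calculation.

def dimension_accuracy(predictions: dict, fan_tags: dict, vote_shares: dict):
    """predictions/fan_tags map dimension -> letter (e.g. 'E/I' -> 'I');
    vote_shares maps dimension -> winning-tag vote percentage."""
    correct = total = 0
    for dim, tag in fan_tags.items():
        if 40 <= vote_shares[dim] <= 60:  # controversial: skip
            continue
        total += 1
        correct += predictions[dim] == tag
    return correct / total if total else None

# Hypothetical example for one character:
preds = {"E/I": "I", "S/N": "N", "T/F": "F", "J/P": "P"}
tags  = {"E/I": "I", "S/N": "N", "T/F": "T", "J/P": "P"}
votes = {"E/I": 90, "S/N": 80, "T/F": 55, "J/P": 70}  # T/F is controversial
print(dimension_accuracy(preds, tags, votes))  # 1.0 (T/F excluded)
```

Here the mismatched T/F prediction does not count against accuracy, because its 55% fan vote falls in the controversial band.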

Accuracy of Role-Playing AI's Personality Test Results

The experimental results show that the personality testing method proposed in this study has an 82.76% consistency rate with human fan tags on a single dimension for the role-playing AIs of ChatHaruhi, and a 50% accuracy rate in predicting the complete MBTI tag of the character.

This result demonstrates the effectiveness of the personality testing method proposed in this article, and also shows that existing role-playing AIs can effectively restore the personality traits of corresponding characters.


This paper evaluates the character-restoration ability of role-playing AI from the perspective of personality testing. It proposes a personality testing framework for role-playing AI, consisting of open-ended interview questions derived from scale items and the use of Large Language Models (LLMs) to evaluate the resulting Q&A pairs.

The researchers conducted Big Five and MBTI personality tests on 32 role-playing AIs from ChatHaruhi and compared the results with personality tags annotated by human fans, showing that existing role-playing AIs can already restore characters' personality traits well.

In subsequent work, the authors plan to study how to further enhance the personality restoration ability of role-playing AI and include a consistency study of personality evaluation results given by LLMs with those from psychological experts.



