
GPT-4 Outperforms Humans in Posing Questions: Leveraging Large Models to Break Down Barriers in Huma

The quality of human-written prompts significantly affects the response accuracy of large language models (LLMs). OpenAI suggests that precise, detailed, and specific questions are crucial to the performance of these models. But can ordinary users reliably make their queries clear and unambiguous to an LLM?


A notable gap exists between how humans naturally understand certain expressions and how machines interpret them. For instance, humans readily take "even-numbered months" to mean February, April, and so on, yet GPT-4 might misconstrue the phrase as months with an even number of days. This highlights the limitations of AI in comprehending everyday context, and it also prompts us to reconsider how to communicate more effectively with these models. As AI technology evolves, bridging the gap in language comprehension between humans and machines remains an important subject for future research.
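To make the ambiguity concrete, the two readings of "even-numbered months" can be written out as two different computations. This is only an illustrative sketch: the function names are hypothetical, and the year 2023 is an arbitrary non-leap year chosen for the demonstration.

```python
import calendar

def even_numbered_months():
    """Reading 1 (typical human intent): months whose number in the year is even."""
    return [m for m in range(1, 13) if m % 2 == 0]

def months_with_even_day_count(year):
    """Reading 2 (possible model reading): months containing an even number of days."""
    return [
        m for m in range(1, 13)
        # monthrange returns (weekday of the 1st, number of days in the month)
        if calendar.monthrange(year, m)[1] % 2 == 0
    ]

human_reading = even_numbered_months()
model_reading = months_with_even_day_count(2023)  # 2023 is an assumed example year

# Print month names for both readings so the mismatch is visible.
print([calendar.month_name[m] for m in human_reading])
print([calendar.month_name[m] for m in model_reading])
```

The two lists overlap only partly (August, for example, is month 8 but has 31 days), so the same question can have different correct answers depending on which reading the model picks.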


To address this ambiguity in how large models like GPT-4 understand questions, the General Artificial Intelligence Lab, led by Professor Quanquan Gu at the University of California, Los Angeles (UCLA), has proposed a solution. The research was carried out by doctoral students Yihe Deng, Weitong Zhang, and Zixiang Chen.



  • Paper: https://arxiv.org/pdf/2311.04205.pdf

  • Code: https://uclaml.github.io/Rephrase-and-Respond


The core of this solution involves prompting large language models to rephrase and expand the questions posed to them, enhancing the accuracy of their responses. The study discovered that questions rephrased by GPT-4 became more detailed and clearer in format. This method of rephrasing and expanding significantly improved the model's response accuracy. Experiments showed that a well-rephrased question could elevate the accuracy of responses from around 50% to nearly 100%. This improvement not only demonstrates the potential for self-improvement in large language models but also offers a new perspective on how AI can more effectively process and understand human language.



Methodology


Based on these findings, the researchers proposed a simple yet effective prompt: “Rephrase and expand the question, and respond.” The method is called Rephrase and Respond (RaR). Adding this prompt to a question directly improves the quality of responses from LLMs.
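In practice, the one-step prompt amounts to appending a single instruction to the user's question. The following is a minimal sketch under assumptions: the instruction text is the one quoted above, while the helper name `build_rar_prompt`, the newline-separated layout, and the sample question are illustrative choices rather than the authors' exact template.

```python
RAR_INSTRUCTION = "Rephrase and expand the question, and respond."

def build_rar_prompt(question: str) -> str:
    """Build a one-step RaR prompt: the question followed by the RaR instruction.

    The layout (question, newline, instruction) is an assumption for this sketch.
    """
    return f"{question.strip()}\n{RAR_INSTRUCTION}"

# Hypothetical example question, chosen only to illustrate the prompt layout.
prompt = build_rar_prompt("Was Abraham Lincoln born in an even month?")
print(prompt)
```

The resulting string can be sent to any chat or completion endpoint in a single call, which is what makes the one-step variant easy to drop into an existing pipeline.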



Furthermore, the research team introduced a variant of RaR, termed “Two-step RaR,” to fully leverage the capacity of large models like GPT-4 in rephrasing questions. This method follows two steps: first, generating a rephrased question using a specialized Rephrasing LLM for a given query; second, combining the original and rephrased questions to prompt a Responding LLM for answers.
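The two steps can be sketched as a small function that takes two model callables. Everything here is an assumption made for illustration: the `LLM` type, the function name `two_step_rar`, and the exact wording of both prompts are hypothetical, and the stand-in models at the bottom only return canned text so the flow can be run without any API.

```python
from typing import Callable

# Hypothetical interface: a model is any function mapping a prompt to a reply.
LLM = Callable[[str], str]

def two_step_rar(question: str, rephrasing_llm: LLM, responding_llm: LLM) -> str:
    """Two-step RaR sketch: rephrase first, then answer using both versions."""
    # Step 1: ask the rephrasing model for a clearer, expanded question.
    # The instruction wording below is an assumption, not a confirmed template.
    rephrase_prompt = (
        f"{question}\n"
        "Given the above question, rephrase and expand it to help you do "
        "better answering. Maintain all information in the original question."
    )
    rephrased = rephrasing_llm(rephrase_prompt).strip()

    # Step 2: give the responding model the original and rephrased questions together.
    response_prompt = (
        f"(original) {question}\n"
        f"(rephrased) {rephrased}\n"
        "Use your answer for the rephrased question to answer the original question."
    )
    return responding_llm(response_prompt)

# Stand-in models for a dry run; a real setup would call an LLM API here.
def fake_rephraser(prompt: str) -> str:
    return "Was the person born in February, April, June, August, October, or December?"

def fake_responder(prompt: str) -> str:
    return "Responding to:\n" + prompt

print(two_step_rar("Was the person born in an even month?", fake_rephraser, fake_responder))
```

Keeping the two models as separate parameters means the rephrasing model and the responding model do not have to be the same one.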



Results


Experiments across various tasks demonstrated the consistent effectiveness of both One-step and Two-step RaR in improving GPT-4’s response accuracy. Notably, RaR showed significant improvements in tasks that were initially challenging for GPT-4, achieving near 100% accuracy in some cases. Based on this, the team drew two key conclusions:


  1. Rephrase and Respond (RaR) provides a plug-and-play, black-box prompting method that effectively enhances the performance of LLMs across various tasks.

  2. Assessing the quality of questions is crucial in evaluating the performance of LLMs in question-answering (QA) tasks.

This performance enhancement not only exhibits the self-improvement potential of large language models but also provides a new angle on how AI can more effectively handle and comprehend human language.




Further Exploration


The researchers employed Two-step RaR to explore the performance of different models, including GPT-4, GPT-3.5, and Vicuna-13b-v1.5. The findings indicated that for models with more complex architectures and greater processing capabilities, like GPT-4, the RaR method significantly improved accuracy and efficiency in handling questions. Even for simpler models like Vicuna, the effectiveness of the RaR strategy was evident, though the margin of improvement was smaller. The study further examined the quality of rephrased questions from different models, finding that while smaller models sometimes distorted the intent of the questions, advanced models like GPT-4 aligned more closely with human intent and enhanced the response efficacy of other models.


This discovery reveals an important phenomenon: the quality and effectiveness of question rephrasing vary with the capability of the language model. Questions rephrased by advanced models like GPT-4 not only give those models a clearer understanding but also serve as effective inputs that enhance the performance of smaller models.



Differentiating from Chain of Thought (CoT)


To understand the difference between RaR and Chain of Thought (CoT), researchers presented mathematical formulations of both and elucidated how RaR differs mathematically from CoT and how they can be easily combined.



The study also indicates that before delving into enhancing the reasoning capabilities of models, it's essential to improve the quality of questions to ensure accurate assessment of the models' reasoning abilities. For instance, in the “coin flip” problem, it was found that unlike human intent, GPT-4 interpreted the term “flip” as a random tossing action. Even when guided to reason with “Let’s think step by step,” this misunderstanding persisted in the reasoning process. Only after clarifying the question did the large language model respond to the anticipated query.
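Since RaR clarifies the question and CoT guides the reasoning, one plausible way to combine them in a single zero-shot prompt is to ask for the rephrasing first and then add the step-by-step trigger. This is a hedged sketch only: the helper name `build_rar_cot_prompt`, the combined wording, and the sample coin question are assumptions for illustration, not a template taken from the study.

```python
COT_TRIGGER = "Let's think step by step."

def build_rar_cot_prompt(question: str) -> str:
    """Combine a RaR-style instruction with the zero-shot CoT trigger.

    The exact wording is an assumption; only the CoT trigger phrase is standard.
    """
    return (
        f"{question.strip()}\n"
        "Rephrase and expand the question to remove any ambiguity, then answer it. "
        f"{COT_TRIGGER}"
    )

# Hypothetical coin-flip style question used only to show the prompt layout.
question = (
    "A coin is heads up. Alice flips the coin. Bob does not flip the coin. "
    "Is the coin still heads up?"
)
print(build_rar_cot_prompt(question))
```

The intent is that the rephrasing step settles what "flip" means before any step-by-step reasoning begins, so the reasoning is applied to the question the user actually meant.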



Moreover, the researchers noted that besides the question text, the question-answer examples used for few-shot CoT are also written by humans. This raises a question: how do LLMs react when these human-crafted examples are flawed? The study presented an intriguing example, finding that poor few-shot CoT examples can negatively impact LLMs. In the “last letter concatenation” task, the examples used in earlier work had a positive effect on model performance. However, when the task logic changed, for example from taking the last letter of each word to taking the first letter, while the examples stayed the same, GPT-4 gave incorrect answers. This phenomenon highlights the model's sensitivity to human-crafted examples.


Researchers found that by employing RaR, GPT-4 could correct logical flaws in given examples, thereby enhancing the quality and robustness of few-shot CoT.



Conclusion


Communication between humans and large language models (LLMs) is prone to misunderstandings: questions that seem clear to humans might still be interpreted differently by LLMs. To address this issue, the UCLA research team proposed the RaR method, which prompts LLMs to first rephrase and clarify the question before responding.


Experimental evaluations on a series of benchmark datasets confirmed the effectiveness of the RaR method. Further analysis showed that the quality improvement in rephrased questions could be transferred across models.


Looking ahead, methods like RaR are expected to continue evolving, and their integration with other approaches like CoT will pave the way for more accurate and effective interactions between humans and large language models, ultimately expanding the boundaries of AI's interpretative and reasoning capabilities.



