DialSim: A Real-Time Simulator
for Evaluating Long-Term Dialogue Understanding of Conversational Agents

KAIST, SNU

The overall process of DialSim. Gray speech bubbles indicate predetermined utterances from the script, white speech bubbles indicate spontaneous questions asked during the simulation, and colored speech bubbles indicate the agent's responses to those questions. (Left) An unanswerable question. (Center) A question that references a specific time. (Right) A multi-hop question that requires understanding past sessions (i.e., the Left and Center boxes).

DialSim

DialSim places an agent in the role of the main character of a TV show and engages it in an extensive conversation based on the show's scripted content. During this conversation, the agent is evaluated on its ability to respond appropriately to spontaneous questions from other speakers within a predefined time limit (e.g., 1s / 3s / 5s). The agent should answer questions based on information from the past dialogue and acknowledge when it does not know the answer. DialSim simulates three dialogue environments based on scripts from popular TV shows (i.e., Friends, The Big Bang Theory, and The Office) and has the following four main features.
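The simulation loop below is a minimal sketch of this setup. All names (agent.observe, agent.respond, turn.is_question, turn.gold_answer) are illustrative assumptions for the sketch, not DialSim's actual interface.

import time

def run_simulation(agent, sessions, time_limit=1.0):
    # Replay the scripted dialogue; the agent plays the main character.
    correct, total = 0, 0
    for session in sessions:
        for turn in session:
            if turn.is_question:               # spontaneous question from another speaker
                start = time.time()
                answer = agent.respond(turn.text)
                elapsed = time.time() - start
                total += 1
                # An answer counts only if it is correct and arrives within the limit;
                # for unanswerable questions the expected answer is "I don't know."
                if elapsed <= time_limit and answer == turn.gold_answer:
                    correct += 1
            else:
                agent.observe(turn.text)       # scripted utterance: update memory on the fly
    return correct / max(total, 1)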

Four key features of DialSim

Real-Time Simulator: For a conversational agent to function effectively in real time, it must be able to update its memory and generate responses on the fly. To evaluate this, DialSim measures the accuracy of responses given within a predefined time limit. To the best of our knowledge, this is the first work to evaluate a conversational agent in a time-constrained environment.
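The sketch above only measures elapsed time; one simple way to actually enforce a hard wall-clock deadline (e.g., 1s / 3s / 5s) is shown below using Python's standard concurrent.futures module. This is an assumption about how such a limit could be imposed, not necessarily DialSim's exact mechanism.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def timed_answer(generate_fn, question, time_limit):
    # Run the agent's generation in a worker thread and stop waiting at the deadline.
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(generate_fn, question)
    try:
        return future.result(timeout=time_limit)   # answer produced in time
    except TimeoutError:
        return None                                # no answer within the limit: scored as incorrect
    finally:
        executor.shutdown(wait=False)              # do not block on the stray worker

The worker thread is not forcibly killed; the evaluation only needs to know whether an answer arrived before the deadline.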

Evaluating Long-Term Multi-Party Dialogue Understanding: DialSim includes multi-party dialogues with an average length of 350k tokens, making it the longest among existing long-term dialogue datasets. Throughout the simulation, the agent faces complex questions that demand comprehension and reasoning across several multi-party sessions, ensuring a thorough assessment of its understanding capabilities.

Randomized Test: In our simulator, an average of 1,300 conversation sessions take place over a period of about five years, following the timeline of the corresponding TV show. In each session, a randomly selected character asks, at a random point in time, a question sampled from an average of 1,000 candidates. This setup allows us to test whether an agent can respond appropriately in an unpredictable environment, resembling real-world settings. Additionally, since a random set of questions is given for each test, repeating the tests multiple times allows us to measure the agent's performance variability and reliability.
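A rough sketch of how one such randomized probe could be drawn per session is given below; the field names (speaker, agent_name) and the uniform sampling are assumptions made for illustration.

import random

def sample_probe(session, candidate_questions, agent_name):
    # Pick a questioner among the other speakers, a question from the session's
    # candidate pool, and a random point in the session at which to ask it.
    others = [u.speaker for u in session if u.speaker != agent_name]
    asker = random.choice(others)
    question = random.choice(candidate_questions)
    ask_at = random.randrange(len(session) + 1)   # random position within the session
    return asker, question, ask_at

Because each test run draws a fresh sample, repeating the run yields a distribution of scores rather than a single fixed number.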

Adversarial Test: LLM-based agents are likely to have prior knowledge about these TV shows from pre-training. Since such agents could provide answers without referencing the actual dialogue history, it is crucial to ensure that the agent relies strictly on the history for its responses. To achieve this, we developed adversarial tests that modify character names in two ways: swapping names with each other (e.g., Joey ↔ Monica) or assigning new names (e.g., Joey → John). This ensures that the agent's responses depend strictly on the context of the dialogue history.
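Such name perturbations can be implemented as a single-pass substitution over the dialogue text, as in the sketch below; the mappings shown are just the Joey/Monica and Joey/John examples from above.

import re

def perturb_names(text, mapping):
    # mapping = {"Joey": "Monica", "Monica": "Joey"} swaps two characters;
    # mapping = {"Joey": "John"} assigns a new name.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)

# perturb_names("Joey, have you seen Monica?", {"Joey": "Monica", "Monica": "Joey"})
# -> "Monica, have you seen Joey?"

Doing the substitution in one pass lets two names be swapped simultaneously without a temporary placeholder.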

Examples of Simulations

Simulation: ChatGPT-3.5 with a time limit of 1 second

Simulation: Mistral-7B-it with a time limit of 1 second

Question Generation Methods

1) Question Generation Based on Fan Quizzes


The overall process of question generation based on fan quizzes. First, we crawled fan quizzes from the web (1). Then, we filtered and revised the crawled data (2-a, b). From this, we created secondary versions of the questions by adding a date to each (3-a). We then mapped each question to the scenes by determining whether it was answerable in each scene (3-b). Finally, we applied character style transfer to make the questions more natural (4).
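For concreteness, a question emerging from this pipeline might be represented roughly as the record below; the field names are hypothetical and chosen only to mirror the steps in the caption.

from dataclasses import dataclass, field

@dataclass
class FanQuizQuestion:
    text: str                      # revised question after filtering (2-a, b)
    dated_text: str                # secondary version with a date added (3-a)
    answer: str
    answerable_scenes: set = field(default_factory=set)  # scenes where it can be answered (3-b)
    styled_text: str = ""          # character style-transferred phrasing (4)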

2) Question Generation Based on a Temporal Knowledge Graph


The overall process of question generation based on the temporal knowledge graph. We first extracted quadruples and constructed a temporal knowledge graph (1). We then generated questions from the graph and mapped each question to the sessions by determining whether it was answerable in each session, similar to the fan quiz-based questions (2-1, 2-2). Character style transfer was applied afterwards (3).
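A quadruple in such a temporal knowledge graph pairs a fact with the time at which it holds, and a question can be templated from it. The sketch below is a simplified illustration under that assumption, not the generation procedure actually used.

from collections import namedtuple

# A temporal fact: (head entity, relation, tail entity, time of the session/scene).
Quad = namedtuple("Quad", ["head", "relation", "tail", "time"])

def question_from_quad(quad):
    # Naive template whose gold answer is the head entity; the actual questions
    # are generated and style-transferred to sound natural in conversation.
    question = f"Who {quad.relation} {quad.tail} {quad.time}?"
    return question, quad.head

# question_from_quad(Quad("Ross", "married", "Emily", "in that season"))
# -> ("Who married Emily in that season?", "Ross")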

Experimental Results


Findings

1. Overall, the performance of agents remained below 40%, indicating that current LLMs are ineffective as conversational agents in complex, multi-party, long-term dialogues.

2. API-based models generally outperformed open-source models under time constraints, demonstrating superior inference capabilities and faster response times.

3. There was no significant performance difference among different model sizes under time constraints; sometimes smaller models outperformed larger ones. Without time limits, larger models performed better, with the larger open-source models matching the inference capabilities of API models.

4. Storing the entire session yielded better results than the other methods because it retains more context.

Citation

@article{kim2024dialsim,
  author    = {Kim, Jiho and Chay, Woosog and Hwang, Hyeonji and Kyung, Daeun and Chung, Hyunseung and Cho, Eunbyeol and Jo, Yohan and Choi, Edward},
  title     = {DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents},
  journal   = {arXiv preprint arXiv:2406.13144},
  year      = {2024},
}