WOS 1 — LongMemEval-S Report

How it was measured

Retrieval: Every LongMemEval-S question was retrieved with the WOS Memory Engine, which uses semantic embeddings plus neural reranking, with no BM25 or keyword matching. Retrieval is the only component under test, and it is deterministic, so it contributes no run-to-run variation.
Reader & judge: Claude Opus 4.8 (via the Claude API) was held constant across all five runs as the reader, answering from the retrieved memories only. Because retrieval is fixed, all variance between runs comes from the reader. Each answer was then graded by GPT-4o at temperature 0, returning a plain yes / no under the official LongMemEval per-category rules.
Protocol: We ran the full LongMemEval-S set five independent times and report the average. Every run is published, with no best-of selection and no cherry-picking.
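
The protocol above can be sketched as a small loop. This is a minimal illustration, not the actual harness: `retrieve`, `read`, and `judge` are hypothetical callables standing in for the WOS Memory Engine, the reader model, and the grader, and the dictionary keys (`question`, `question_date`, `answer`) are assumed names for the dataset fields.

```python
from statistics import mean
from typing import Callable, Sequence

Item = dict  # assumed shape: {"question": ..., "question_date": ..., "answer": ...}


def run_once(
    questions: Sequence[Item],
    retrieve: Callable[[str], list],
    read: Callable[[str, str, list], str],
    judge: Callable[[Item, str], bool],
) -> float:
    """Score one full pass over the question set, as a percentage."""
    correct = 0
    for item in questions:
        memories = retrieve(item["question"])  # deterministic step
        answer = read(item["question"], item["question_date"], memories)
        correct += int(judge(item, answer))  # judge returns yes/no as a bool
    return 100.0 * correct / len(questions)


def run_protocol(questions, retrieve, read, judge, n_runs: int = 5):
    """Repeat the full pass n_runs times; keep every score and the average."""
    scores = [run_once(questions, retrieve, read, judge) for _ in range(n_runs)]
    return scores, mean(scores)
```

Returning the full list of scores alongside the mean mirrors the reporting choice here: every run is kept, and none is dropped before averaging.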

Every run, every category

The raw scores behind the 85.2%.

Category	Run 1	Run 2	Run 3	Run 4	Run 5	Average
Single-session user	98.6	100.0	98.6	98.6	98.6	98.9
Single-session assistant	94.6	96.4	96.4	96.4	98.2	96.4
Knowledge update	92.3	89.7	93.6	88.5	92.3	91.3
Preference inference	90.0	86.7	83.3	80.0	90.0	86.0
Temporal reasoning	82.0	78.9	81.2	78.9	82.7	80.8
Multi-session	75.9	72.2	72.2	73.7	75.2	73.8
Overall	86.2	84.2	85.0	83.8	86.6	85.2

Mean 85.2% · standard deviation 1.1% · all 5 runs published, 0 cherry-picks.
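
The summary figures can be recomputed from the table with the Python standard library. The sketch below uses the published per-run scores as they appear, already rounded to one decimal, so a row average can differ from the published one in the last digit (temporal reasoning comes out at 80.74 against a published 80.8, presumably because the published averages were taken from unrounded scores; that explanation is an assumption). On the overall row, the population formula gives 1.1 and the sample formula gives about 1.2, so the reported 1.1% appears to match the population form.

```python
from statistics import mean, pstdev, stdev

# Per-run scores copied from the table (percent, rounded to one decimal).
RUNS = {
    "Single-session user": [98.6, 100.0, 98.6, 98.6, 98.6],
    "Single-session assistant": [94.6, 96.4, 96.4, 96.4, 98.2],
    "Knowledge update": [92.3, 89.7, 93.6, 88.5, 92.3],
    "Preference inference": [90.0, 86.7, 83.3, 80.0, 90.0],
    "Temporal reasoning": [82.0, 78.9, 81.2, 78.9, 82.7],
    "Multi-session": [75.9, 72.2, 72.2, 73.7, 75.2],
    "Overall": [86.2, 84.2, 85.0, 83.8, 86.6],
}

# Published "Average" column from the same table.
REPORTED_AVG = {
    "Single-session user": 98.9,
    "Single-session assistant": 96.4,
    "Knowledge update": 91.3,
    "Preference inference": 86.0,
    "Temporal reasoning": 80.8,
    "Multi-session": 73.8,
    "Overall": 85.2,
}


def check_rows(tol: float = 0.1) -> dict:
    """Return, per category, whether the recomputed mean is within tol."""
    return {
        name: abs(mean(scores) - REPORTED_AVG[name]) <= tol
        for name, scores in RUNS.items()
    }


overall = RUNS["Overall"]
overall_mean = round(mean(overall), 1)            # mean of the five runs
overall_pstdev = round(pstdev(overall), 1)        # population standard deviation
overall_sample_stdev = round(stdev(overall), 1)   # sample standard deviation
```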

The exact prompts

Verbatim, nothing paraphrased.

Reader prompt: answer generation

reader · used to generate every answer

Answer the question using ONLY the retrieved memories below (each is
prefixed with its [date]). This question is being asked on: {qdate}.
Apply whichever of these fits the question:
- For any 'how long ago' / 'how many days/weeks/months since' question,
  compute the duration relative to the asking date above (not any other
  today), using the memory dates.
- If the memories give CONFLICTING values for the same fact (different
  values as of different dates), mention BOTH and note which is more recent.
- If the question asks for ADVICE or a RECOMMENDATION, first identify this
  user's relevant preferences, interests, and past choices from the
  memories, then tailor your answer to them (not generic advice).
- Otherwise, answer the factual question concisely and directly.
If the answer is not in the memories, say you don't know. Answer in the
SAME LANGUAGE as the question.

Memories:
{mems}

Question: {q}

Answer:

{qdate}: the date the question is asked
{mems}: retrieved memories, each prefixed with its date
{q}: the question
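
As an illustration of how the three placeholders might be filled, here is a minimal sketch using `str.format`. The template is abridged (the bulleted instructions in the middle of the verbatim prompt are omitted), and the `[date] text` line layout, the date format, and the function names are assumptions made for the example.

```python
# Abridged copy of the reader prompt: the opening, then the three slots.
# The bulleted instructions from the verbatim prompt are left out for brevity.
READER_TEMPLATE = """Answer the question using ONLY the retrieved memories below (each is
prefixed with its [date]). This question is being asked on: {qdate}.

Memories:
{mems}

Question: {q}

Answer:"""


def format_memories(memories):
    """Render (date, text) pairs one per line, each prefixed with its [date]."""
    return "\n".join(f"[{date}] {text}" for date, text in memories)


def build_reader_prompt(q, qdate, memories):
    """Fill {qdate}, {mems} and {q} in the template."""
    return READER_TEMPLATE.format(
        qdate=qdate, mems=format_memories(memories), q=q
    )


prompt = build_reader_prompt(
    q="How long ago did I move?",
    qdate="2023/05/30",
    memories=[("2023/05/01", "I moved to Lisbon.")],
)
```

Passing the question date explicitly matters for the "how long ago" instruction in the prompt, which asks the reader to compute durations relative to the asking date and not to any other notion of today.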

Judge prompt: grading (temp 0)

judge · independent · temperature 0 · yes / no only

I will give you a question, the correct answer, and a model's response.
{RULE} Respond with ONLY 'yes' or 'no'.

Question: {q}
Correct answer: {gt}
Model response: {ans}

Is the model response correct?

{RULE} is the official LongMemEval rule applied per category:

For temporal-reasoning, an answer is correct if it contains the correct answer or an equivalent, and off-by-one errors in the number of days, weeks, or months are not penalized. For knowledge-update, it is correct if it contains the correct updated answer; mentioning previous or outdated information is fine as long as the updated answer is present. For single-session-preference, the response need not cover every point in a rubric; it counts as long as it recalls and uses the user's personal information or preference correctly. For all other categories, it is correct if it contains the correct answer, or an equivalent that includes all the intermediate steps to reach it; if it gives only a subset of the required information, it is marked wrong.
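
The per-category selection of {RULE} and the yes / no reply can be sketched as follows. The rule strings are short stand-ins that summarize the description above, not the official LongMemEval wording, and the helper names and the strict reply parsing are assumptions for the example.

```python
JUDGE_TEMPLATE = """I will give you a question, the correct answer, and a model's response.
{RULE} Respond with ONLY 'yes' or 'no'.

Question: {q}
Correct answer: {gt}
Model response: {ans}

Is the model response correct?"""

# Stand-in summaries only; the official rule text is not reproduced here.
RULES = {
    "temporal-reasoning": "Accept equivalents; do not penalize off-by-one "
                          "errors in days, weeks, or months.",
    "knowledge-update": "Accept if the updated answer is present, even "
                        "alongside outdated information.",
    "single-session-preference": "Accept if the user's personal information "
                                 "or preference is recalled and used correctly.",
}
DEFAULT_RULE = ("Accept the correct answer or a full equivalent; "
                "a subset of the required information is wrong.")


def rule_for(category: str) -> str:
    """Pick the rule for a category, falling back to the general rule."""
    return RULES.get(category, DEFAULT_RULE)


def build_judge_prompt(category: str, q: str, gt: str, ans: str) -> str:
    """Fill {RULE}, {q}, {gt} and {ans} in the judge template."""
    return JUDGE_TEMPLATE.format(RULE=rule_for(category), q=q, gt=gt, ans=ans)


def parse_verdict(reply: str) -> bool:
    """Map a plain yes / no reply to a bool; reject anything else."""
    token = reply.strip().strip("'\".").lower()
    if token == "yes":
        return True
    if token == "no":
        return False
    raise ValueError(f"unexpected judge reply: {reply!r}")
```

Raising on anything other than yes or no, instead of guessing, keeps a malformed judge reply from being silently counted as right or wrong.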

WOS 1 scores 85.2%.

Numbers you can check.