[Paper Review] Confidence Improves Self-Consistency in LLMs

Tags
LLM Reasoning
Test-time Scaling
Efficient Inference
AI
Published
March 9, 2026
Author

Problem

Self-Consistency decoding improves LLM reasoning performance, but the correct answer only emerges as the most frequent one after many reasoning paths have been sampled, so the computational cost is high.

Approach

Each reasoning path is assigned a confidence score produced by the model itself, and the final answer is chosen by confidence-weighted voting instead of plain frequency voting.

Key contributions

  1. CISC (Confidence-Informed Self-Consistency): a lightweight, drop-in extension of Self-Consistency that improves accuracy or reduces cost in nearly every model-dataset combination.
  2. The Within-Question Discrimination (WQD) metric: explains why existing calibration metrics (ECE, Brier Score) fail to predict CISC performance, and proposes a new measure of how well confidence separates correct from incorrect answers to the same question.
  3. Empirical evidence of LLM self-assessment ability: responses to which the model assigns low confidence are also judged low quality by human evaluators, a strong correlation.

Background and motivation

How existing methods work

Self-Consistency (Wang et al., 2022) is the standard decoding strategy for Chain-of-Thought reasoning.
Question q →
  [reasoning path 1 → answer A]
  [reasoning path 2 → answer B]
  [reasoning path 3 → answer A]
  ...
  [reasoning path m → answer A]
Final answer = argmax(frequency) = A (simple majority vote)
Compared with greedy decoding, this yields consistent gains on math and commonsense reasoning tasks.
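The majority vote above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `self_consistency` and the toy answer list are assumptions, and the answers are assumed to have already been extracted from the sampled reasoning paths.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer among the sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; on a tie, the answer that
    # was seen first wins.
    return counts.most_common(1)[0][0]


# Hypothetical final answers from m = 5 sampled reasoning paths.
sampled_answers = ["A", "B", "A", "C", "A"]
print(self_consistency(sampled_answers))  # A
```

Every path contributes exactly one vote here, regardless of how sound its reasoning was.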

Problems and bottlenecks

1. High sampling cost
For a model whose per-sample accuracy is 60% to reach 90% accuracy:
  • Simple majority vote: 40 samples are needed (computed from the binomial distribution)
  • If correct answers could be given twice the weight: fewer than 10 samples suffice
The cost of generating reasoning paths grows linearly with the number of samples, so this becomes a serious bottleneck in practice.
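The sample counts above can be roughly reproduced with a small binomial calculation. The sketch below assumes a simplified setting that the review does not spell out: only two possible answers, independent samples, ties broken by a coin flip, and every correct sample voting with a fixed weight. The function names are hypothetical, and the paper's exact calculation may differ.

```python
from math import comb


def vote_accuracy(p, n, correct_weight=1.0):
    """Probability that the correct answer wins a vote over n samples.

    Assumes binary answers and per-sample accuracy p. Each correct sample
    votes with weight `correct_weight`, each wrong sample with weight 1.
    """
    total = 0.0
    for k in range(n + 1):  # k = number of correct samples
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        correct_score = correct_weight * k
        wrong_score = n - k
        if correct_score > wrong_score:
            total += prob_k
        elif correct_score == wrong_score:
            total += 0.5 * prob_k  # tie: coin flip
    return total


def samples_needed(p, target, correct_weight=1.0, max_n=200):
    """Smallest n whose vote accuracy reaches `target`, or None."""
    for n in range(1, max_n + 1):
        if vote_accuracy(p, n, correct_weight) >= target:
            return n
    return None


print("plain majority vote:", samples_needed(0.6, 0.9))
print("correct answers weighted 2x:", samples_needed(0.6, 0.9, correct_weight=2.0))
```

In this toy model the plain vote needs several times as many samples as the weighted vote, which is the gap that motivates confidence weighting.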
2. All reasoning paths are treated equally
Self-Consistency ignores the quality of each path. A path with logical leaps casts the same single vote as a rigorous one. If the model can judge the quality of its own paths, leaving that information unused is wasteful.
3. Limits of earlier efficiency work
Most prior attempts to reduce the cost of Self-Consistency have at least one of the following limitations:
  • They cut the total amount of computation (throughput) but increase latency
  • They require manual tuning per dataset
  • They do not generalize across standard benchmarks
  • They frequently perform worse than plain Self-Consistency

Core insight

Building on work showing that LLMs can judge the correctness of their own outputs (Kadavath et al., 2022; Zhang et al., 2024), the authors hypothesize that scoring each reasoning path with a confidence value makes it possible to identify the correct answer from only a few samples.

Core method

Overall pipeline

[Figure: overall CISC pipeline]

Definition 3.1: Formal definition of CISC

Given a question $q$ and a set of responses $\{(r_1, a_1), \ldots, (r_m, a_m)\}$, CISC consists of three steps:
Step 1: Confidence extraction
For each response $(r_i, a_i)$, derive a self-assessed confidence score $c_i$.
Step 2: Confidence normalization
$$\tilde{c}_i = \frac{\exp(c_i / T)}{\sum_{j=1}^{m} \exp(c_j / T)}$$
Here $T$ is a tunable temperature hyperparameter.
  • $T \to \infty$: the normalized confidences converge to a uniform distribution → identical to standard Self-Consistency
  • $T \to 0$: the softmax converges to argmax → only the single response with the highest confidence is selected
  • A suitable $T$: balances frequency information and confidence information
Step 3: Weighted majority vote
$$\hat{a}_{\mathrm{CISC}} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}[a_i = a] \cdot \tilde{c}_i$$
The normalized confidences of the paths that produced the same answer are summed, and the answer with the highest total is selected as the final answer.
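The three steps can be sketched as follows. This is a minimal illustration under stated assumptions: the confidence scores are taken as already extracted (step 1), the names `softmax` and `cisc` are hypothetical, and the toy answers and scores are made up for the example.

```python
import math
from collections import defaultdict


def softmax(scores, temperature):
    """Step 2: temperature-scaled softmax over the raw confidence scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cisc(answers, confidences, temperature=1.0):
    """Step 3: sum normalized confidences per answer and return the argmax."""
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and aligned")
    weights = softmax(confidences, temperature)
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


# Two confident paths say "A", three unconfident paths say "B".
answers = ["A", "A", "B", "B", "B"]
confidences = [0.9, 0.8, 0.3, 0.2, 0.4]

print(cisc(answers, confidences, temperature=0.1))    # A (confidence dominates)
print(cisc(answers, confidences, temperature=100.0))  # B (close to a plain majority vote)
```

The two calls show the limits described above: a small temperature lets the confident minority win, while a large one reduces the vote to counting.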

Role of the temperature T

$T$ controls the balance between frequency and confidence and is the only tunable hyperparameter in this method.
Very large T ($T \to \infty$):
  $\tilde{c}_i \approx 1/m$ (every path gets the same weight)
  → behaves exactly like Self-Consistency
Very small T ($T \to 0$):
  only the single path with the highest $c_i$ is selected
  → frequency information is ignored entirely, which can be risky
Optimal T in practice:
  chosen by grid search on a 10% hold-out set
  a single T value is used across all datasets (dataset-independent)
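A grid search over T on a hold-out set could look like the sketch below. The grid values, the hold-out data, and the function names are all assumptions made for illustration; the review only states that T is tuned by grid search on a 10% hold-out set.

```python
import math
from collections import defaultdict


def cisc_answer(answers, confidences, temperature):
    """Confidence-weighted vote with temperature-scaled softmax weights."""
    scaled = [c / temperature for c in confidences]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]  # unnormalized is enough for argmax
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def tune_temperature(holdout, grid):
    """Return (best_T, accuracy) over `grid`; the first best value wins ties.

    holdout: list of (answers, confidences, gold_answer) tuples.
    """
    best_t, best_acc = None, -1.0
    for t in grid:
        correct = sum(cisc_answer(a, c, t) == gold for a, c, gold in holdout)
        accuracy = correct / len(holdout)
        if accuracy > best_acc:
            best_t, best_acc = t, accuracy
    return best_t, best_acc


# Toy hold-out set (made up): the first question needs confidence to matter,
# the second needs frequency to matter.
holdout = [
    (["A", "A", "B", "B", "B"], [0.9, 0.8, 0.3, 0.2, 0.4], "A"),
    (["C", "D", "D", "D"], [0.95, 0.9, 0.9, 0.9], "D"),
]
print(tune_temperature(holdout, grid=[0.001, 0.1, 100.0]))  # (0.1, 1.0)
```

On this toy data both extremes fail on one question each: T = 0.001 acts like picking the single most confident path, T = 100 acts like plain majority voting, and only the intermediate value gets both right.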

Confidence extraction methods

1. Response Probability (Wang et al., 2022)

The confidence is the length-normalized probability that the model assigns to generating the full response $(r, a)$.
  • No extra prompting required (can be computed during generation)
  • Can be affected by the length of the reasoning path
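As a rough illustration, a length-normalized response probability can be computed from per-token log-probabilities. The exact normalization is not given in this review, so the geometric mean used below is an assumption (a common choice), and `response_probability` is a hypothetical helper name.

```python
import math


def response_probability(token_logprobs):
    """Geometric mean of the token probabilities, i.e. exp(mean log-prob).

    token_logprobs: log-probabilities of the generated tokens of (r, a).
    The geometric-mean normalization is an assumption, not taken from the paper.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Two made-up responses with the same per-token probability but different lengths.
short_response = [math.log(0.5)] * 4
long_response = [math.log(0.5)] * 40
print(response_probability(short_response), response_probability(long_response))
```

Without the normalization the longer response would score far lower simply because it multiplies more probabilities together.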

2. Verbal Binary (Lin et al., 2022)

After generation finishes, the model is asked to rate its confidence as either 0 or 1:
Prompt: "Now I will rate my confidence in the proposed answer as either 0 or 1. Proposed confidence: ("
  • The simplest method; the binary scale makes fine-grained distinctions difficult

3. Verbal 0-100 (Lin et al., 2022)

The model is asked to express its confidence on a 0-100 scale:
Prompt: "Now I will rate my confidence in the proposed answer on a scale of 0-100. Proposed confidence: ("
  • Allows finer distinctions than Verbal Binary
  • The model's stated values are well calibrated against actual probabilities

4. P(True) (Kadavath et al., 2022)

Uses the Verbal Binary prompt, but takes the probability the model assigns to the "1" token as the confidence:
Prompt: same as Verbal Binary
Confidence = $p_\theta(\text{"1"} \mid q, r, a, e)$, where $e$ is the confidence extraction prompt
  • Uses the token probability distribution rather than the output token
  • Provides the most fine-grained, continuous confidence value
  • Performs best in the experiments
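The difference between Verbal Binary and P(True) can be sketched with the next-token log-probabilities at the position right after the confidence prompt. The dictionary format, the helper names, and the optional renormalization over the two tokens are assumptions for illustration; they are not described in the review.

```python
import math


def verbal_binary(next_token_logprobs):
    """Hard 0/1 confidence: whichever of the two tokens is more likely."""
    lp0 = next_token_logprobs.get("0", float("-inf"))
    lp1 = next_token_logprobs.get("1", float("-inf"))
    return 1.0 if lp1 > lp0 else 0.0


def p_true(next_token_logprobs, renormalize=False):
    """Soft confidence: probability mass the model puts on the "1" token.

    renormalize=True rescales over {"0", "1"} only (an assumed variant).
    """
    p1 = math.exp(next_token_logprobs.get("1", float("-inf")))
    if not renormalize:
        return p1
    p0 = math.exp(next_token_logprobs.get("0", float("-inf")))
    total = p0 + p1
    return p1 / total if total > 0 else 0.0


# Made-up log-probabilities for two responses to the same question.
response_x = {"1": math.log(0.55), "0": math.log(0.40)}
response_y = {"1": math.log(0.95), "0": math.log(0.03)}

print(verbal_binary(response_x), verbal_binary(response_y))
print(round(p_true(response_x), 2), round(p_true(response_y), 2))
```

Both responses receive the same Verbal Binary score, while P(True) still separates them, which is the finer granularity the bullets above refer to.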

Efficient prompting: the two-step approach

A key design decision in the CISC implementation is two-step prompting:
Step 1: generate the reasoning path (r, a) from the question prompt q
Step 2: append the confidence extraction prompt e after (q, r, a) and continue generating
Key point: the prefix (q, r, a) is identical to Step 1
  → the KV cache can be reused
  → extra cost = encoding e (~20 tokens) + generating 1 token
The confidence extraction prompt is only about 20 tokens and the confidence itself is a single generated token, so the added cost is negligible compared with the full reasoning path.
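The two-step flow can be sketched against a stand-in model. The `generate` and `next_token_logprobs` methods, the `FakeModel` class, and the plain string concatenation of q, r, a, and e are hypothetical; a real serving stack would expose its own API and handle prefix caching itself.

```python
import math

# Confidence extraction prompt e, quoted from the Verbal Binary / P(True) setup.
CONFIDENCE_PROMPT = (
    "Now I will rate my confidence in the proposed answer as either 0 or 1. "
    "Proposed confidence: ("
)


def answer_with_confidence(model, question):
    """Run both steps for one sampled reasoning path."""
    # Step 1: sample a reasoning path and answer (r, a) for the question q.
    response = model.generate(question)
    # Step 2: extend the same prefix (q, r, a) with e. The prefix is unchanged,
    # so a backend with prefix caching could reuse its KV cache here.
    prompt = question + response + CONFIDENCE_PROMPT
    logprobs = model.next_token_logprobs(prompt)
    confidence = math.exp(logprobs.get("1", float("-inf")))  # P(True)
    return response, confidence


class FakeModel:
    """Stand-in so the sketch runs without a real LLM."""

    def generate(self, prompt):
        return " 6 * 7 = 42. The answer is 42."

    def next_token_logprobs(self, prompt):
        assert prompt.endswith(CONFIDENCE_PROMPT)
        return {"1": math.log(0.8), "0": math.log(0.2)}


response, confidence = answer_with_confidence(FakeModel(), "What is 6 * 7?")
print(response, round(confidence, 2))
```

Only the short prompt e and a single next-token distribution are added per path, which is why the overhead stays small relative to generating the reasoning path.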

Key results

1. CISC beats Self-Consistency in nearly every setting

P(True) is the strongest method by a wide margin. At a budget of 10 samples, Self-Consistency needs 18.6 samples on average to match CISC's accuracy, which corresponds to a 46% reduction in computational cost.
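The 46% figure follows from comparing the two sample counts. The formula below is inferred from the reported numbers rather than quoted from the paper, and `cost_reduction` is a hypothetical helper name.

```python
def cost_reduction(cisc_budget, sc_samples_for_same_accuracy):
    """Fraction of samples saved by CISC relative to Self-Consistency."""
    return 1 - cisc_budget / sc_samples_for_same_accuracy


print(f"{cost_reduction(10, 18.6):.0%}")  # 46%
print(f"{cost_reduction(10, 30):.0%}")    # 67%
print(f"{cost_reduction(10, 8):.0%}")     # -25%
```

The second line matches the 67%+ entries reported when even 30-sample Self-Consistency does not catch up. Under the same formula, a value of -25% would mean Self-Consistency already matches CISC with 8 samples.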

2. Detailed results per model and dataset

With P(True) and a budget of 10:
  • Cost Reduction is positive in nearly every model-dataset combination
  • In some combinations even 30-sample SC does not reach 10-sample CISC, which is reported as 67%+ Cost Reduction
  • Qwen 72B + BBH is the only case where CISC is at a disadvantage, at -25% (attributed to the model already being highly accurate on this dataset)

3. Effect of normalization

| Setting | Cost Reduction @ 10 |
| --- | --- |
| P(True), no normalization | 32% |
| P(True), Softmax T=1 | 30% |
| P(True), Softmax T=Tuned | 46% |
Confidence weighting helps substantially even without normalization, but temperature-tuned softmax normalization consistently improves performance across all confidence methods. Softmax with T=1 (no temperature tuning), however, can be worse than no normalization at all, so tuning is essential.

4. Ablation

| Method | Cost Reduction (Budget 5) | Cost Reduction (Budget 10) |
| --- | --- | --- |
| Max (select only the highest-confidence answer) | -11% | -84% |
| Tie (apply CISC only to break ties) | 27% | 28% |
| CISC (weighted majority vote) | 41% | 46% |
  • The Max strategy makes performance much worse, which shows that frequency information must not be ignored entirely.
  • The Tie strategy also gives a meaningful improvement but falls short of the full weighted majority vote.
  • The key is a balanced combination of frequency and confidence.
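The three ablation strategies can be contrasted in a short sketch. The implementations below are one plausible reading of the descriptions in the table (in particular, how Tie falls back to confidence is an assumption), and all function names and toy values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def confidence_totals(answers, confidences, temperature):
    """Sum softmax-normalized confidences per distinct answer."""
    scaled = [c / temperature for c in confidences]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    norm = sum(exps)
    totals = defaultdict(float)
    for answer, e in zip(answers, exps):
        totals[answer] += e / norm
    return totals


def max_strategy(answers, confidences):
    """Max: return the answer of the single most confident path."""
    best = max(range(len(answers)), key=lambda i: confidences[i])
    return answers[best]


def tie_strategy(answers, confidences, temperature=1.0):
    """Tie: plain majority vote; confidence is used only to break ties."""
    counts = Counter(answers)
    top_count = max(counts.values())
    leaders = [a for a, c in counts.items() if c == top_count]
    if len(leaders) == 1:
        return leaders[0]
    totals = confidence_totals(answers, confidences, temperature)
    return max(leaders, key=lambda a: totals[a])


def cisc(answers, confidences, temperature=1.0):
    """CISC: confidence-weighted majority vote over all paths."""
    totals = confidence_totals(answers, confidences, temperature)
    return max(totals, key=totals.get)


answers = ["A", "B", "B", "B", "C"]
confidences = [0.99, 0.5, 0.5, 0.5, 0.1]
print(max_strategy(answers, confidences))  # A
print(tie_strategy(answers, confidences))  # B
print(cisc(answers, confidences))          # B
```

In this made-up case Max follows one overconfident outlier, while Tie and CISC both keep the answer supported by three paths.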

Within-Question Discrimination (WQD)

Motivating question: why is the best-calibrated confidence the least effective for CISC?

| Confidence method | ECE-t (lower is better) | Brier-t (lower is better) | CISC Cost Reduction |
| --- | --- | --- | --- |
| Verbal Binary | 0.005 | 0.187 | 10% |
| Verbal 0-100 | 0.046 | 0.173 | 30% |
| Response Prob. | 0.090 | 0.192 | 31% |
| P(True) | 0.030 | 0.182 | 46% |
Verbal Binary has by far the lowest ECE-t, and a Brier-t comparable to the other methods, yet it gives the worst CISC result. The reason is that existing calibration metrics measure discrimination between questions, whereas CISC needs the ability to tell correct from incorrect answers within the same question.

Thought experiment

Model M faces two types of questions:
  - "Easy" questions: correct with 95% probability
  - "Hard" questions: correct with 5% probability
Confidence method A:
  every answer to an easy question → confidence 0.95
  every answer to a hard question → confidence 0.05
Calibration: perfect (ECE = 0)
Usefulness for CISC: zero. Correct and incorrect answers to the same question receive the same score, so the weights carry no information.
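The thought experiment can be checked numerically. The sketch below makes several assumptions: ECE is computed with 10 equal-width bins, and WQD is taken as the fraction of (correct, incorrect) pairs within a question where the correct response has the higher confidence, with ties counted as one half. The paper's exact WQD definition may differ. Method B is an extra made-up contrast case, not something from the paper.

```python
from collections import defaultdict


def ece(samples, n_bins=10):
    """Expected Calibration Error over (confidence, is_correct) samples."""
    bins = defaultdict(list)
    for conf, correct in samples:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    error = 0.0
    for members in bins.values():
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        error += len(members) / len(samples) * abs(mean_conf - accuracy)
    return error


def wqd(questions):
    """Assumed pairwise within-question discrimination (ties count 0.5)."""
    score, pairs = 0.0, 0
    for responses in questions:
        right = [c for c, ok in responses if ok]
        wrong = [c for c, ok in responses if not ok]
        for cr in right:
            for cw in wrong:
                score += 1.0 if cr > cw else 0.5 if cr == cw else 0.0
                pairs += 1
    return score / pairs if pairs else float("nan")


def make_questions(conf_fn):
    """One easy question (19/20 correct) and one hard question (1/20 correct)."""
    easy = [(conf_fn("easy", ok), ok) for ok in [True] * 19 + [False]]
    hard = [(conf_fn("hard", ok), ok) for ok in [True] + [False] * 19]
    return [easy, hard]


# Method A: confidence depends only on the question type (as in the text).
method_a = make_questions(lambda kind, ok: 0.95 if kind == "easy" else 0.05)
# Method B (made-up contrast): confidence tracks correctness within a question.
method_b = make_questions(lambda kind, ok: 0.9 if ok else 0.1)

for name, questions in [("A", method_a), ("B", method_b)]:
    flat = [sample for q in questions for sample in q]
    print(name, round(ece(flat), 3), round(wqd(questions), 3))
# A 0.0 0.5
# B 0.1 1.0
```

Method A is perfectly calibrated but no better than chance at ranking answers within a question, while method B is worse calibrated yet ranks every correct answer above every incorrect one.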

Qualitative analysis

Human evaluation design

  • Three models (Qwen2.5 3B, Gemma2 9B, Mistral Large 123B) on the MMLU-Pro dataset
  • Three responses sampled per question
  • Two NLP PhD students evaluated 90 responses (blind to both the model's confidence and the correctness of the answer)
  • Evaluation criterion: identifying "low-quality reasoning patterns"

Results

Responses the model gave relatively low confidence:
  → human evaluators also found low-quality indicators in 67% of them
Responses the model gave relatively high confidence:
  → low-quality indicators were found in only 33% of them
(a twofold difference, i.e. a strong correlation)

Three types of low-quality reasoning

| Type | Description | Low-confidence share | High-confidence share |
| --- | --- | --- | --- |
| No Choice | Arrives at an answer that is not among the options | 38% | 13% |
| Incomplete Calculations | Does not carry the calculation through to the end | 22% | 2% |
| Multiple Candidates | Explores several candidate answers without committing to one | 11% | 16% |
"No Choice" and "Incomplete Calculations" are strongly associated with low model confidence. "Multiple Candidates", by contrast, shows no clear relationship with confidence (11% vs. 16%), which suggests that exploring several paths is not in itself a sign of lower quality.

Limitations and future directions

Limitations

  1. Requires access to token probabilities: the best-performing method, P(True), needs the model's output probabilities, which are not available in every deployment environment.
  2. No support for free-form tasks: CISC shares a fundamental constraint of Self-Consistency and is hard to apply directly to tasks without a well-defined discrete answer (summarization, dialogue, etc.).
  3. Limited scope of the human evaluation: the qualitative analysis was run only on MMLU-Pro, so generalization to other domains still needs to be verified.
  4. Zero-shot only: the effect under few-shot prompting has not been verified.