
Format testing results

Marcel Wiechmann edited this page Jun 7, 2022 · 2 revisions

This page is outdated!

The research notes in here concern the Sigurd model. Do not apply anything in here to newer models (e.g. Euterpe or Krake).

For Euterpe and Krake there is a guide by pume_ that describes current best practices (either write in full prose or use the Attribute Method).

Testing Reliability

First empirical results by Basileus suggest no significant difference in consistency between single-paragraph prose, multi-line prose, JSON, single-paragraph caveman, and multi-line caveman (n = 100 per format; all ps > 0.19, not corrected for multiple testing!).

(Note by TravellingRobot: I suspect that differences may show up once we take the type of entry into account. My theory is that the more abstract and relational an entry gets, the more it needs the syntactic sugar of prose. For concrete things you might be better off with something more condensed. Obviously this would need more testing.)

(Methodology: default generation settings with maxed output length; scenario: Aliens-esque sci-fi action horror; moderately large context of ~800 tokens, with developed memory, A/N, preceding paragraphs, and a few other lorebook entries triggered as well, to make it as true to a "real world" use case as possible.)
(Exact results: single-paragraph prose: 78% consistency; JSON formatted: 78% consistency; multi-line prose: 74% consistency; single-paragraph caveman: 74% consistency; multi-line caveman: 70% consistency.)
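The notes do not say which test produced the p-values, so the following is only a sketch under an assumption: it runs a pooled two-proportion z-test on every pair of formats, using the counts listed above (n = 100 each), and adds a Bonferroni correction as one common way to account for multiple testing. The function name `two_proportion_p` and the choice of test are illustrative, not taken from the original analysis.

```python
import math
from itertools import combinations

# Consistent generations out of 100 per format, copied from the results above.
consistent = {
    "single-paragraph prose": 78,
    "JSON": 78,
    "multi-line prose": 74,
    "single-paragraph caveman": 74,
    "multi-line caveman": 70,
}
N = 100  # generations per format


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both samples are all-success or all-failure
    z = abs(x1 / n1 - x2 / n2) / se
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


pairs = list(combinations(consistent, 2))
p_values = {
    (a, b): two_proportion_p(consistent[a], N, consistent[b], N)
    for a, b in pairs
}

for (a, b), p in sorted(p_values.items(), key=lambda item: item[1]):
    # Bonferroni: multiply by the number of comparisons, cap at 1.
    p_bonf = min(1.0, p * len(pairs))
    print(f"{a} vs {b}: p = {p:.3f}, Bonferroni p = {p_bonf:.3f}")
```

Under this assumed test, the smallest uncorrected p-value belongs to the 78% vs 70% pair and should land just under 0.2; after correcting for the ten pairwise comparisons, no pair comes anywhere near significance.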

Override natural bias (for certain hair colors, etc.)

Akaria with pink hair

  • 23/30 correct with pink (CAT and Default)
  • 30/30 correct with pink (CAT and 1.0 Rep Penalty); 28/30 if you don’t count dark pink.
  • 25/30 correct with pink (Caveman and Default); 24/30 if you don’t count dark pink.
  • 29/30 correct with pink (Caveman and 1.0 Rep Penalty); 28/30 if you don’t count dark pink.
  • 22/30 correct with pink (Prose and Default)
  • 30/30 correct with pink (Prose and 1.0 Rep Penalty)

(It should be clear, but the differences between formats are all within the margin of error, ps > 0.34. So do not infer that one format is more effective than another from this. Bigger sample sizes would be interesting though...)
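With only 30 generations per condition and several counts at or near 30/30, an exact test is a reasonable way to sanity-check the comparison between formats. The sketch below is an assumption, not the method behind the reported p-values: it implements a two-sided Fisher's exact test with the standard library and compares the three formats within each setting, counting dark pink as pink. The helper name `fisher_exact_two_sided` and the dictionary layout are illustrative.

```python
from itertools import combinations
from math import comb

RUNS = 30  # generations per condition

# Pink-hair counts from the list above (dark pink counted as pink).
pink = {
    "Default": {"CAT": 23, "Caveman": 25, "Prose": 22},
    "1.0 Rep Penalty": {"CAT": 30, "Caveman": 29, "Prose": 30},
}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


for setting, counts in pink.items():
    for fmt_a, fmt_b in combinations(counts, 2):
        hits_a, hits_b = counts[fmt_a], counts[fmt_b]
        p = fisher_exact_two_sided(hits_a, RUNS - hits_a, hits_b, RUNS - hits_b)
        print(f"{setting}: {fmt_a} vs {fmt_b}: p = {p:.3f}")
```

The comparison is deliberately restricted to formats within the same setting, since that is the question the note above addresses.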