Technical Note: The DU evaluation tests models’ ability to summarize international crisis content while adhering to strict style/tradecraft guidelines. Responses are limited to what is provided in the curated data set, which includes diverse subject matter in 5+ languages and was acquired after the models’ training knowledge cutoff date. The DU evaluation is a zero-shot and one-pass test (tasks are completed with no exemplars or prior context provided, and the first response is scored).
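A minimal sketch of what a zero-shot, one-pass harness like this could look like, in Python. All names here (Task, run_model, run_zero_shot_one_pass) are hypothetical; the note does not describe the actual tooling used.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    instructions: str      # style/tradecraft guidelines for this task
    source_documents: str  # curated, post-cutoff material (5+ languages)

def run_model(prompt: str) -> str:
    """Placeholder for a single model call; swap in the API of the model under test."""
    raise NotImplementedError

def run_zero_shot_one_pass(tasks: list[Task]) -> dict[str, str]:
    """One request per task: no exemplars, no prior context, no retries.
    The first response is the only one scored."""
    outputs = {}
    for task in tasks:
        prompt = (f"{task.instructions}\n\n"
                  f"Use only the material provided below:\n{task.source_documents}")
        outputs[task.task_id] = run_model(prompt)  # first and only pass
    return outputs
```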
Outputs are scored blind against criteria including format, length, tone, and relevance. Required citations are checked for accuracy. The DU data set also includes spike content to evaluate whether models successfully avoid off-limits topics.
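As one hedged illustration of that scoring step, the sketch below grades anonymized, shuffled outputs against the stated rubric, verifies that every citation points into the curated set, and flags any response that touches a planted spike topic. The extract_citations helper, the [SOURCE-123] citation format, and the grade callable are assumptions for illustration, not the evaluation’s actual machinery.

```python
import random
import re

RUBRIC = ("format", "length", "tone", "relevance")

def extract_citations(text: str) -> set[str]:
    """Assumed citation format for illustration: [SOURCE-123]."""
    return set(re.findall(r"\[([A-Z]+-\d+)\]", text))

def score_blind(outputs: dict[str, str], valid_citations: set[str],
                spike_topics: set[str], grade) -> dict[str, dict]:
    """Blind scoring: graders see shuffled, unattributed outputs only."""
    items = list(outputs.items())
    random.shuffle(items)  # strip any ordering that hints at model identity
    results = {}
    for task_id, text in items:
        scores = {criterion: grade(text, criterion) for criterion in RUBRIC}
        # All citations must resolve to documents in the curated set.
        scores["citation_accuracy"] = extract_citations(text) <= valid_citations
        # Any mention of a planted spike topic counts as a failure.
        scores["spike_avoided"] = not any(topic.lower() in text.lower()
                                          for topic in spike_topics)
        results[task_id] = scores
    return results
```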
Claude Sonnet v2’s output approached what would be expected from a human analyst. Many other LLMs produced material useful for at least triage/drafting assistance.
Technical Note: The CS evaluation tests models’ ability to summarize dense human-centered data while adhering to strict style/tradecraft guidelines. Responses are limited to what is provided in the curated data set, which has been modified to mimic government formats. The CS evaluation is a zero-shot and one-pass test (tasks are completed with no exemplars or prior context provided, and the first response is scored).
Outputs are scored blind against criteria including format, relevant detail, and subjective concepts such as psychological and counterintelligence assessments. Required citations are checked for accuracy.
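Subjective criteria like psychological and counterintelligence assessments are hard to score mechanically. One common pattern, which is an assumption here rather than something the note specifies, is to average several independent blind graders per criterion, as sketched below.

```python
from statistics import mean
from typing import Callable

Grader = Callable[[str, str], int]  # (output_text, criterion) -> score on 1..5

def score_subjective(text: str, criteria: list[str],
                     graders: list[Grader]) -> dict[str, float]:
    """Average independent blind graders per criterion to damp individual bias."""
    return {c: mean(g(text, c) for g in graders) for c in criteria}
```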
Claude Sonnet v2 and GPT o1 outputs were commensurate with production from an experienced human intelligence officer. Several other LLMs produced material useful for at least triage/drafting.
Technical Note: The FDR evaluation tests models’ ability to assess diverse raw information regarding a threat and prepare outputs for distinct audiences. This combines elements of summarization and reasoning while adhering to strict style/tradecraft guidelines. Responses are limited to what is provided in the curated data set, which is synthetic and presented in multiple formats that mimic government documents. The FDR evaluation is a zero-shot and one-pass test (tasks are completed with no exemplars or prior context provided, and the first response is scored).
Outputs are scored blind against criteria including relevant detail, obfuscation, and need-to-know. Required citations are checked for accuracy.
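To make the distinct-audiences requirement concrete, the sketch below fans one threat package out into per-audience tasks, with need-to-know enforced at scoring time. The audience labels and product descriptions are purely illustrative; the note does not specify the evaluation’s actual audiences.

```python
# Hypothetical audience/product pairs for illustration only.
AUDIENCES = {
    "senior policymaker": "a concise executive summary",
    "field office": "an operational alert with handling caveats",
}

def build_audience_tasks(threat_docs: str) -> dict[str, str]:
    """One prompt per audience over the same synthetic threat reporting.
    Scoring then checks that each output carries only the detail that
    audience needs to know, with required citations intact."""
    return {
        audience: (f"Using only the reporting below, prepare {product} "
                   f"for a {audience}.\n\n{threat_docs}")
        for audience, product in AUDIENCES.items()
    }
```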
None of the LLMs produced outputs comparable to what would be expected from an experienced human intelligence officer. GPT o1 and Gemini 2.0 significantly outperformed the field, likely reflecting their greater reasoning capacities. Most LLMs fared poorly, struggling to tailor their responses to distinct audiences and exhibiting recurring confusion/hallucination.
Acknowledgements:
New North Ventures (Jeremy Hitchcock)
eventregistry.org