TL;DR You get what you pay for...up to a point. Claude 3.7 and Gemini 2.5 were tops for performance/price. Yes, DeepSeek R1 is pretty good and very cheap. GPT o3 fabricated facts and ignored format instructions - not great for intelligence work. Although the best models are meeting or exceeding human performance on "operational" summaries and related tasks, they still aren't impressive (yet) at figuring out how to appropriately share complex information with diverse audiences.
The DU evaluation tests models’ ability to briefly summarize international crisis content while adhering to strict style/tradecraft guidelines. Claude Sonnet 3.6's output approached what would be expected from a human analyst. Many other LLMs produced material useful at least for triage and drafting assistance.
Weaknesses: The models struggled to follow explicit instructions for referencing sources naturally within the text of their outputs. They also had difficulty adhering to instructions about when to include or exclude content related to the United States.
Technical Note: Responses are limited to what is provided in the curated data set, which includes diverse subject matter in 5+ languages and was acquired after the models’ training knowledge cutoff date. The DU evaluation is a zero-shot and one-pass test (tasks are completed with no exemplars or prior context provided, and the first response is scored).
Outputs are scored blind against criteria including format, length, tone, relevance. Required citations are checked for accuracy. The DU data set also includes spiked content to evaluate successful avoidance of topics that are off-limits.
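To make the protocol concrete, the sketch below shows what a zero-shot, one-pass harness with a spiked-content check might look like. It is a minimal illustration, not the evaluation's actual code: the `Task` fields, the generic `query_model` callable, and the keyword-matching check for off-limits topics are all assumptions.

```python
# Minimal sketch of a zero-shot, one-pass harness with a spiked-content check.
# All names here are illustrative assumptions, not the evaluation's actual code.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str                                             # instructions + curated source material, no exemplars
    spiked_terms: list[str] = field(default_factory=list)   # off-limits topics seeded into the data set

@dataclass
class RawResult:
    task_id: str
    output: str
    leaked_spiked_terms: list[str]

def run_zero_shot_one_pass(tasks: list[Task],
                           query_model: Callable[[str], str]) -> list[RawResult]:
    """Send each task once, with no prior context, and keep only the first response."""
    results = []
    for task in tasks:
        output = query_model(task.prompt)            # single call: no retries, no few-shot examples
        leaked = [t for t in task.spiked_terms if t.lower() in output.lower()]
        results.append(RawResult(task.task_id, output, leaked))
    return results
```

Scoring only the first response, with no exemplars or retries, keeps the comparison even across models that benefit differently from iteration or in-context examples.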
The CS evaluation tests models’ ability to summarize dense human-centered data and provide psychological/behavioral assessments while adhering to strict style guidelines. Gemini 2.5's output was outstanding - exceeding the work of an experienced human intelligence officer. Others, including the Claude models, DeepSeek R1, and GPT o1, easily matched what an experienced officer would write - suggesting "saturation" for this type of task (it will no longer pose a challenge for frontier models).
Technical Note: Responses are limited to what is provided in the curated data set, which has been modified to mimic government formats. The CS evaluation is a zero-shot and one-pass test (tasks are completed with no exemplars or prior context provided, and the first response is scored).
Outputs are scored blind against criteria including format, relevant detail, and subjective concepts such as psychological and counterintelligence assessments. Required citations are checked for accuracy.
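The citation-accuracy check can be sketched in a few lines, assuming a simple bracketed reference style such as `[SOURCE-12]`; the evaluation's real reference format and matching logic may differ.

```python
# Illustrative citation-accuracy check: cited source IDs in an output are matched
# against the IDs of documents actually present in the curated data set.
import re

def check_citations(output: str, valid_source_ids: set[str]) -> dict:
    """Return which cited source IDs exist in the curated data set and which do not."""
    cited_numbers = re.findall(r"\[SOURCE-(\d+)\]", output)   # assumed "[SOURCE-12]" style
    cited = {f"SOURCE-{n}" for n in cited_numbers}
    return {
        "accurate": sorted(cited & valid_source_ids),
        "fabricated": sorted(cited - valid_source_ids),       # references to documents that do not exist
    }
```

In this sketch, any citation that does not resolve to a document in the curated data set is flagged as fabricated.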
The FDR evaluation tests models’ ability to assess raw multi-source threat information and then prepare outputs for a diverse range of audiences. This combines elements of summarization and reasoning while requiring adherence to strict source-protection and style guidelines.
None of the LLMs produced outputs comparable to what would be expected from an experienced intelligence officer. Recent frontier models significantly outperformed the field, likely reflecting their greater reasoning capacities.
Weaknesses: Most models struggled to tailor their responses to distinct audiences, and many exhibited recurring confusion and hallucination.
Technical Note: Responses are limited to what is provided in the curated data set, which is synthetic and presented in multiple formats that mimic government documents. The FDR evaluation is a zero-shot and one-pass test (tasks are completed with no exemplars or prior context provided, and the first response is scored).
Outputs are scored blind against criteria including relevant detail, obfuscation, and need-to-know. Required citations are checked for accuracy.
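Blind scoring itself can be sketched simply: model outputs are shuffled and stripped of identity before graders apply the rubric, with the de-blinding key held back until scoring is complete. The criteria names and data shapes below are illustrative assumptions.

```python
# Illustrative blind-scoring setup: graders see anonymized outputs and rubric
# criteria only; the mapping back to model names is kept separately.
import random

CRITERIA = ["relevant_detail", "obfuscation", "need_to_know"]   # assumed rubric labels

def blind_scoring_sheet(outputs: dict[str, str], seed: int = 0):
    """Split model-labelled outputs into an anonymized grading sheet and a separate de-blinding key."""
    rng = random.Random(seed)
    items = list(outputs.items())
    rng.shuffle(items)                                  # remove any ordering cue about which model is which
    sheet, key = [], {}
    for i, (model_name, text) in enumerate(items, start=1):
        anon_id = f"OUTPUT-{i:02d}"
        sheet.append({
            "anon_id": anon_id,
            "text": text,
            "scores": {c: None for c in CRITERIA},      # filled in by a human grader
        })
        key[anon_id] = model_name                       # held back until scoring is complete
    return sheet, key
```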
Acknowledgements:
New North Ventures (Jeremy Hitchcock)
eventregistry.org
Artificial Analysis