Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary optimization, collecting optimization trajectories for 15 LLMs across 8 optimization problems. While base problem-solving ability, measured via zero-shot performance, correlates with final optimization outcomes, it explains only part of the variance: models with similar zero-shot capability often induce dramatically different search trajectories and final performance. To explain this gap, we analyze breakthrough dynamics and the geometry of optimization trajectories in the semantic space of candidate solutions. We find that effective LLM optimizers behave as strong local refiners, progressively localizing their search while producing frequent, incremental improvements across generations. In contrast, weaker optimizers exhibit large semantic drift, with occasional large breakthroughs followed by prolonged stagnation, reminiscent of behavior observed in classical metaheuristics. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory-level evaluation for understanding and improving LLM-based agentic optimization systems, and provide actionable insights for future work on learning to search.
Curious to see how different LLMs navigate the solution space? Jump straight to exploring the trajectories interactively
LLM-guided evolutionary optimization framework. We study LLM optimization across generations (left) and within individual generations (right). Our evaluation covers four task families: route optimization (TSP), equation discovery (symbolic regression), prompt optimization, and heuristic design (online bin packing).
Fitness performance of LLMs across four task families. Results show averaged normalized fitness scores across route optimization, prompt optimization, equation discovery, and heuristic design tasks. Models are ranked by zero-shot performance, with background shading indicating improvement over the initial populations, which are manually crafted for each task. For clarity, the Initial Generation here denotes the first generation evolved from these task-specific initial populations.
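The "averaged normalized fitness" above can be read as a per-task normalization followed by an average over task families. A minimal sketch of that aggregation (the min-max scheme and all names/values here are our illustrative assumptions, not the paper's exact formula or data):

```python
def averaged_normalized_fitness(per_task_scores):
    """Average a model's min-max-normalized fitness across task families.

    `per_task_scores`: {task_name: (raw_score, task_min, task_max)},
    where higher raw scores are better.
    """
    normed = [(s - lo) / (hi - lo) for s, lo, hi in per_task_scores.values()]
    return sum(normed) / len(normed)

# Toy example with two task families (illustrative numbers only).
tasks = {"tsp": (0.8, 0.0, 1.0), "prompt_opt": (60.0, 0.0, 100.0)}
avg = averaged_normalized_fitness(tasks)  # (0.8 + 0.6) / 2 = 0.7
```

Normalizing per task before averaging keeps one task's raw-score scale from dominating the cross-task comparison.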
Zero-shot performance vs. post-evolution performance across LLMs. While there is a significant correlation (r=0.860, p=3.95e-05) between zero-shot problem-solving ability and final optimization outcomes, being a strong one-shot problem solver does not necessarily imply being an effective evolutionary search operator. Models with similar zero-shot capabilities can exhibit dramatically different optimization trajectories and final performance.
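The reported relationship is a standard Pearson correlation between per-model zero-shot and post-evolution scores. A self-contained sketch of that computation (the score lists below are illustrative placeholders, not the paper's measurements):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-model normalized fitness (illustrative only).
zero_shot = [0.12, 0.25, 0.31, 0.40, 0.55, 0.61, 0.70, 0.82]
post_evo  = [0.20, 0.35, 0.30, 0.55, 0.60, 0.75, 0.72, 0.95]
r = pearson_r(zero_shot, post_evo)  # strong but imperfect correlation
```

A high r with visible per-model residuals is exactly the pattern described above: zero-shot ability predicts much, but not all, of the post-evolution outcome.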
Defining Refinement and Novelty. Left: Parent population and mutation prompts for prompt optimization (summarization task). Right: Offspring distribution in 2D novelty/fitness space (Gen 1, Child 1). Refinement child: an offspring that exceeds the best parent fitness (dotted line). Novelty: the semantic distance of an offspring from the parent population.
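Novelty as defined here is a distance in the semantic embedding space of solutions. A minimal sketch, assuming candidate solutions have already been embedded as vectors and taking nearest-parent cosine distance as the novelty measure (the toy embeddings and the nearest-neighbor choice are our assumptions for illustration):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def novelty(offspring_emb, parent_embs):
    """Semantic novelty: distance from an offspring to its nearest parent."""
    return min(cosine_distance(offspring_emb, p) for p in parent_embs)

# Toy 3-d embeddings for illustration only.
parents = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
child = [0.0, 1.0, 0.0]
nov = novelty(child, parents)  # far from both parents -> high novelty
```

An offspring that paraphrases a parent lands near zero novelty; one drawn from a distant semantic region scores close to the maximum.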
Efficient exploration vs. overly exploratory search behaviors. Gemini-1.5-Pro (left) demonstrates efficient exploration with sustained improvement and progressive localization. In contrast, Mistral-7B-Instruct (right) exhibits consistently high-novelty exploration but fails to convert it into fitness gains, illustrating that novelty without proper exploitation leads to poor optimization outcomes.
Refinement rate across LLMs and task families. This indicator measures the frequency with which a model produces an offspring that is strictly better than its parent. Regression analysis reveals that average refinement rate is by far the strongest predictor of final performance (coefficient = 0.548, z = 4.16, p < 0.001), significantly outperforming zero-shot performance as a predictor. This reinforces our hypothesis that effective LLM optimizers behave as strong local refiners, consistently producing incremental improvements rather than relying on rare breakthrough events.
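Under the definition above, a trajectory's refinement rate is the fraction of offspring that strictly beat the best fitness in their parent population. A minimal sketch of that metric (the data layout and all names here are our own illustrative choices):

```python
def refinement_rate(generations):
    """Fraction of offspring strictly fitter than their best parent.

    `generations` is a list of (parent_fitnesses, offspring_fitnesses)
    pairs, one per generation; higher fitness is better.
    """
    refined = total = 0
    for parent_fits, offspring_fits in generations:
        best_parent = max(parent_fits)
        refined += sum(1 for f in offspring_fits if f > best_parent)
        total += len(offspring_fits)
    return refined / total if total else 0.0

# Toy trajectory: two generations, four offspring each (illustrative only).
traj = [([0.4, 0.5], [0.45, 0.55, 0.6, 0.3]),
        ([0.55, 0.6], [0.58, 0.65, 0.5, 0.62])]
rate = refinement_rate(traj)  # 4 of 8 offspring beat the best parent -> 0.5
```

A strong local refiner keeps this rate high across generations, rather than relying on rare breakthrough offspring.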
Different optimization trajectories for two LLMs with similar zero-shot performance on TSP-60. Each point represents a candidate solution, colored by generation. Gemini-1.5-Pro (left) progressively localizes its search into a smaller semantic region. Mistral-7B-Instruct (right) keeps drifting across distant regions and fails to convert this drift into fitness gains.
Explore evolution trajectories across different tasks and models. Compare up to 4 models side-by-side.
If you find our work useful, please consider citing:
@misc{zhang2026makesllmgoodoptimizer,
  title={What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search},
  author={Xinhao Zhang and Xi Chen and François Portet and Maxime Peyrard},
  year={2026},
  eprint={2604.19440},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.19440},
}