EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

By: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Xuan Zhou, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. (May) Fung, Yalong Li, Pengjun Xie

Published: 2025-12-10

#cs.AI

Abstract

This paper introduces EcomBench, a benchmark designed for the holistic evaluation of foundation agents in e-commerce, addressing the need for comprehensive assessment of AI's performance in this critical real-world application domain.

Impact

practical

Topics

6

💡 Simple Explanation

Imagine a final exam for AI robots that want to run an online store. Most previous tests just checked if the robot could find a pair of shoes. This new test, EcomBench, checks if the robot can also handle customer complaints, manage the warehouse inventory, understand messy photos of products, and navigate the seller's dashboard. The results show that while the smartest AIs (like the one behind ChatGPT) pass with a 'C' or 'B', many others fail, especially when trying to do complicated business tasks like planning sales or fixing order errors.

🎯 Problem Statement

Current evaluations of AI agents in e-commerce are too narrow, focusing primarily on user-facing shopping tasks (search, recommendation). This neglects the broader ecosystem, including complex merchant operations, post-sales service, and the need to process multimodal inputs (text + images) reliably, leaving a gap in understanding how ready these agents are for real-world business deployment.

🔬 Methodology

The authors developed a benchmark consisting of four modules: Purchasing, Customer Service, Operation, and Multimodal. They compiled a dataset of over 1,500 tasks from real-world sources (Amazon, instructional videos) and synthetic generation. They introduced 'EcomScore', a weighted metric combining Success Rate (SR), Efficiency (Eff), and Safety (Saf). They evaluated 12 LLMs, including GPT-4o, Claude 3.5 Sonnet, and Llama 3, using both rule-based checking and LLM-as-a-judge methodologies.
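To make the idea of a weighted composite metric concrete, here is a minimal sketch of how a score like EcomScore could combine Success Rate, Efficiency, and Safety. The summary does not state the weights, the normalization, or the exact formula, so the weights `(0.6, 0.2, 0.2)`, the 0 to 100 scale, and the names `AgentMetrics` and `ecom_score` are all illustrative assumptions, not the authors' definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentMetrics:
    """Per-model metrics, each assumed to be normalized to [0, 1]."""
    success_rate: float  # SR: fraction of tasks completed correctly
    efficiency: float    # Eff: hypothetical normalized efficiency measure
    safety: float        # Saf: hypothetical normalized safety measure


def ecom_score(metrics: AgentMetrics,
               weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted average of SR, Eff, and Saf, scaled to 0-100.

    The default weights are placeholders chosen for illustration;
    the summary above does not report the values used in the paper.
    """
    values = (metrics.success_rate, metrics.efficiency, metrics.safety)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each metric must lie in [0, 1]")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Dividing by the weight sum keeps the result on a 0-100 scale
    # even if the weights do not add up to exactly 1.
    weighted = sum(w * v for w, v in zip(weights, values))
    return 100.0 * weighted / sum(weights)
```

With this form, a model that completes many tasks but scores poorly on safety is penalized in proportion to the safety weight, which is one plausible reason for reporting a composite score rather than Success Rate alone.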

📊 Results

GPT-4o achieved the highest overall EcomScore (58.3), followed by Claude 3.5 Sonnet (55.5). Open-source models like Llama-3-70B lagged significantly (EcomScore ~42-45). The 'Operation' module proved most difficult, with models struggling to plan complex sequences. In multimodal tasks, even top models had trouble correlating text instructions with subtle visual details in product images. A significant drop in performance was observed when moving from simple tasks to those requiring long-term context retention.

✨ Key Takeaways

There is a 'Sim-to-Real' gap in e-commerce AI: models that chat well do not necessarily operate well. The complexity of merchant operations (inventory, pricing) is currently the biggest bottleneck for autonomous agents. Proprietary models currently hold a clear performance lead in this domain, suggesting that competitors built on open-source models face high barriers to entry without specialized fine-tuning.

🔍 Critical Analysis

EcomBench fills a critical void by moving beyond the 'shopping assistant' trope to address the complex reality of e-commerce operations. Its inclusion of the Operation module is particularly laudable, as this is where high-value B2B automation lies. However, the reliance on static datasets limits the evaluation of dynamic, multi-turn error recovery, which is crucial in real deployment. The gap between GPT-4o and open-source models suggests that commoditized, autonomous e-commerce agents are not yet viable without significant fine-tuning.

💰 Practical Applications

  • Certification Service: 'EcomBench Certified' badge for AI tools in the Shopify app store.
  • Data Licensing: Selling the cleaned, high-quality 'Operation' dataset to AI labs.
  • Training Platform: A SaaS platform that fine-tunes corporate LLMs on these specific failure modes.

🏷️ Tags

#E-commerce, #LLM Agents, #Benchmark, #Multimodal AI, #Operations Research, #Merchant Automation

🏢 Relevant Industries

E-commerce, Retail, Customer Service, Logistics, Digital Marketing