EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
By: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Xuan Zhou, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. (May) Fung, Yalong Li, Pengjun Xie
Published: 2025-12-10
Abstract
This paper introduces EcomBench, a benchmark for the holistic evaluation of foundation agents in e-commerce, addressing the need for a comprehensive assessment of agent performance in this critical real-world application domain.
Impact: practical
💡 Simple Explanation
Imagine a final exam for AI robots that want to run an online store. Most previous tests just checked if the robot could find a pair of shoes. This new test, EcomBench, checks if the robot can also handle customer complaints, manage the warehouse inventory, understand messy photos of products, and navigate the seller's dashboard. The results show that while the smartest AIs (like the one behind ChatGPT) pass with a 'C' or 'B', many others fail, especially when trying to do complicated business tasks like planning sales or fixing order errors.
🎯 Problem Statement
Current evaluations of AI agents in e-commerce are too narrow, focusing primarily on user-facing shopping tasks (search, recommendation). This neglects the broader ecosystem, including complex merchant operations, post-sales service, and the need to process multimodal inputs (text + images) reliably, leaving a gap in understanding how ready these agents are for real-world business deployment.
🔬 Methodology
The authors developed a benchmark consisting of four modules: Purchasing, Customer Service, Operation, and Multimodal. They compiled a dataset of over 1,500 tasks from real-world sources (Amazon, instructional videos) and synthetic generation. They introduced 'EcomScore', a weighted metric combining Success Rate (SR), Efficiency (Eff), and Safety (Saf). They evaluated 12 LLMs, including GPT-4o, Claude 3.5 Sonnet, and Llama 3, using both rule-based checking and LLM-as-a-judge methodologies.
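This summary does not specify how EcomScore weights its three components, so the following is a minimal sketch of one plausible reading: a per-task weighted combination of SR, Eff, and Saf, averaged over the task set and scaled to 0-100. The `TaskResult` fields and the weights in `W` are illustrative assumptions, not the authors' specification.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    success: float     # SR component for this task, in [0, 1]
    efficiency: float  # Eff component, in [0, 1]
    safety: float      # Saf component, in [0, 1]

# Illustrative weights -- the actual EcomBench weighting is not given here.
W = {"sr": 0.5, "eff": 0.25, "saf": 0.25}

def ecom_score(results: list[TaskResult]) -> float:
    """Weighted SR/Eff/Saf combination, averaged over tasks and
    scaled to 0-100 to match the scores quoted in the Results section."""
    if not results:
        return 0.0
    per_task = [
        W["sr"] * r.success + W["eff"] * r.efficiency + W["saf"] * r.safety
        for r in results
    ]
    return 100.0 * sum(per_task) / len(per_task)

# Two toy tasks: one fully successful and safe, one failed but safe.
print(ecom_score([TaskResult(1.0, 0.8, 1.0), TaskResult(0.0, 0.5, 1.0)]))  # 66.25
```

However the weights are actually set, the point of a composite metric like this is that an agent can complete tasks (high SR) and still lose EcomScore points on efficiency or safety, which matters when agents take consequential actions such as pricing changes or refunds.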
📊 Results
GPT-4o achieved the highest overall EcomScore (58.3), followed by Claude 3.5 Sonnet (55.5). Open-source models like Llama-3-70B lagged significantly (EcomScore ~42-45). The 'Operation' module proved most difficult, with models struggling to plan complex sequences. In multimodal tasks, even top models had trouble correlating text instructions with subtle visual details in product images. A significant drop in performance was observed when moving from simple tasks to those requiring long-term context retention.
✨ Key Takeaways
There is a 'Sim-to-Real' gap in e-commerce AI; models that chat well do not necessarily operate well. The complexity of merchant operations (inventory, pricing) is currently the biggest bottleneck for autonomous agents. Proprietary models currently hold a commanding performance lead in this domain, suggesting high barriers to entry for open-source-based competitors without specialized fine-tuning.
🔍 Critical Analysis
EcomBench fills a critical void by moving beyond the 'shopping assistant' trope to address the complex reality of e-commerce operations. Its inclusion of the Operation module is particularly laudable, as this is where high-value B2B automation lies. However, the reliance on static datasets limits the evaluation of dynamic, multi-turn error recovery, which is crucial in real deployments. The gap between GPT-4o and open-source models highlights that we are not yet ready for commoditized, autonomous e-commerce agents without significant fine-tuning.
💰 Practical Applications
- Certification Service: 'EcomBench Certified' badge for AI tools in the Shopify app store.
- Data Licensing: Selling the cleaned, high-quality 'Operation' dataset to AI labs.
- Training Platform: A SaaS platform that fine-tunes corporate LLMs on these specific failure modes.