An Unconventional Benchmarking Challenge: LLMs Take on Street Fighter III
In a playful twist on traditional benchmarking, one of my colleagues evaluated large language models (LLMs) by having them play each other. Using Amazon Bedrock, they pitted 14 different LLMs against one another across 314 matches of the classic fighting game Street Fighter III.
To rank the competitors, my coworker applied a chess-style Elo rating system, updating each model's rating after every match based on the outcome. At the time of writing, Claude 3 Haiku leads the standings.
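For readers unfamiliar with Elo, here is a minimal sketch of the standard chess-style update rule. The K-factor of 32 and the function names are illustrative assumptions, not details from my coworker's actual implementation:

```python
# Minimal sketch of a chess-style Elo update.
# Assumes the standard logistic expected-score formula and K = 32;
# these are illustrative choices, not the experiment's exact parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins one match.
a, b = update_elo(1500.0, 1500.0, 1.0)  # → (1516.0, 1484.0)
```

After each match, the winner gains rating points and the loser gives up the same amount, with upsets moving the ratings more than expected results. Repeating this across hundreds of matches converges toward a stable ranking.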
For the full write-up and the results of these face-offs, I recommend checking out the discussion here. It's a fun look at how these LLMs stack up in an unusual intersection of gaming and AI.