19 hours ago
Pytest for LLM Apps is finally here! DeepEval turns LLM evals into a two-line test suite to help you identify the best models, prompts, and architecture for AI workflows (including MCPs). Learn the limitations of G-Eval and an alternative to it in the explainer below:
1 day ago
Most LLM-powered evals are BROKEN!

These evals can easily mislead you into believing that one model is better than another, primarily because of the way they are set up. G-Eval is one popular example.

Here's the core problem with LLM eval techniques and a better alternative to them:

Typical evals like G-Eval score one output at a time in isolation, without seeing the alternative. So when prompt A scores 0.72 and prompt B scores 0.74, you still don't know which one is actually better.

This is unlike scoring, say, classical ML models, where metrics like accuracy, F1, or RMSE give a clear and objective measure of performance. There's no room for subjectivity, and the results are grounded in hard numbers, not opinions.

LLM Arena-as-a-Judge is a new technique that addresses this issue with LLM evals. In short, instead of assigning scores, you just run A vs. B comparisons and pick the better output.

Just like G-Eval, you can define what "better" means (e.g., more helpful, more concise, more polite), and use any LLM to act as the judge.

LLM Arena-as-a-Judge is actually implemented in @deepeval (open-source with 12k stars), and you can use it in just three steps:

- Create an ArenaTestCase, with a list of "contestants" and their respective LLM interactions.
- Next, define your criteria for comparison using the Arena G-Eval metric, which adapts the G-Eval algorithm to a comparison use case.
- Finally, run the evaluation and print the result.

This gives you a direct head-to-head comparison.

Note that LLM Arena-as-a-Judge can be either referenceless (as in the snippet accompanying this post) or reference-based. If needed, you can provide an expected output for the given input in the test case and include it in the evaluation parameters.

Why DeepEval? It's 100% open-source with 12k+ stars and implements everything you need to define metrics, create test cases, and run evals like:

- component-level evals
- multi-turn evals
- LLM Arena-as-a-judge, etc.

Moreover, tracing LLM apps is as simple as adding one Python decorator. And you can run everything 100% locally.

I have shared the repo in the replies.
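To make the three steps above concrete, here is a minimal, library-free sketch of the head-to-head idea. It is not DeepEval's API: `run_arena`, `toy_judge`, and the contestant names are hypothetical, and the judge is a deterministic stand-in (it simply prefers the shorter answer) so the example runs without an LLM. Judging each pair twice with the positions swapped is also an assumption of this sketch, not something the post describes.

```python
from itertools import combinations
from typing import Callable, Dict

# A judge receives (criteria, user_input, output_a, output_b) and returns "A" or "B".
# In a real setup this would be an LLM call; here it is only a type alias.
Judge = Callable[[str, str, str, str], str]


def toy_judge(criteria: str, user_input: str, output_a: str, output_b: str) -> str:
    """Hypothetical stand-in for an LLM judge: always prefers the shorter answer."""
    return "A" if len(output_a) <= len(output_b) else "B"


def run_arena(
    user_input: str,
    contestants: Dict[str, str],
    criteria: str,
    judge: Judge,
) -> Dict[str, int]:
    """Count head-to-head wins for every contestant over all pairs."""
    wins = {name: 0 for name in contestants}
    for first, second in combinations(contestants, 2):
        # Assumption of this sketch: judge each pair twice with positions
        # swapped, so a judge that favors one slot cannot decide the result.
        for a, b in ((first, second), (second, first)):
            verdict = judge(criteria, user_input, contestants[a], contestants[b])
            wins[a if verdict == "A" else b] += 1
    return wins


# Step 1: the "contestants" and their outputs for the same input.
contestants = {
    "prompt_a": "Paris is the capital of France.",
    "prompt_b": "Thank you for asking! The capital city of France is, of course, Paris.",
}

# Step 2: the comparison criteria, stated in plain language.
criteria = "Pick the more concise answer."

# Step 3: run the comparison and report the winner instead of a score.
wins = run_arena("What is the capital of France?", contestants, criteria, toy_judge)
winner = max(wins, key=wins.get)
print(wins, winner)
```

The output is a win count per contestant and a single winner, rather than two nearby scores such as 0.72 and 0.74 that leave the ranking open. Swapping `toy_judge` for a function that prompts an LLM with the criteria and both outputs would keep the rest of the loop unchanged.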