Hacker News Clone

How are you running evals for AI agents?

by aneeqdhk on 1/3/2025, 11:22:31 AM with 0 comments

I have a couple of projects at my company where we are creating AI agents to generate code and/or help people design software. The agents themselves are conversational. The generated code is most often UI code.

How are people evaluating the responses of AI agents these days? Conversational flows seem particularly hard, because judging a single response could require keeping the entire conversation in context.

Any help or resources would be much appreciated!