I'm in a University course related to AI testing and quality assurance. This is something I'll definitely bring up and see how it can be used.
With OpenAI compatibility, I'm hoping it supports OpenRouter out of the box, which would mean it supports Anthropic and Google too, along with a host of open models hosted elsewhere.
Fantastic to hear! Opik should work with OpenRouter out of the box, particularly if you are using the OpenAI Python client to interface with OpenRouter. Opik's integration with OpenAI is implemented via their Python library, so it is agnostic with respect to the actual backend serving the model.
Your course sounds interesting! If you're doing any research around testing and evaluation, particularly regarding applied LLM applications, we have several researchers and engineers on our team who I'm sure would be happy to connect (myself included).
Looks interesting, great to see it specifically calls out supporting LLM servers as first-class citizens!
I see some of the code is Java, which strikes me as an interesting choice. Is there a reason behind that, or is it simply the language the devs were already familiar with?
We went with Java for the backend because we feel Java is a bit more battle-tested than Python for production (dependency management, concurrency, compilation, etc.). Go was another strong contender, but we felt it's a bit easier for OSS contributors to work in a Java codebase. At the end of the day, I'm sure we could have gone with several different options. Our cloud codebase has services in multiple languages (Python, Java, Go, TS), and Opik uses TypeScript for the frontend, Python for the SDKs, and Java for the backend.
It looks very promising! Congratulations, great tool! I can't wait to start experimenting with it. I plan to use it locally, with Ollama.
Awesome, thanks! If you run into any issues, you can open a ticket on the repo or ping me directly at caleb[at]comet.com
Is there a reason you didn't just implement OpenTelemetry (OT) straight away? Curious about the trade-offs of opting for home-grown telemetry inspired by OT instead.
Good question! It mostly came down to implementation speed, as well as some uncertainty about performance/overhead. We will be releasing OpenTelemetry-compatible ingestion endpoints in the near future, but since Opik has so many features that aren't related to OT, we decided to move forward without it for the initial release. It is a great project, though, and something we will be implementing soon; it will be especially useful for building out integrations with frameworks that are OpenTelemetry-compatible.
Have you seen the two "prototype" standards for LLM telemetry? One is OpenLLMetry, maintained by the folks at Traceloop; it seems to be the more popular. The other is OpenInference, IIRC by Arize AI.
Of the two, the only one I've ever personally explored is OpenLLMetry. Extremely cool project. In general, this is one of those areas where the field still needs to "shake out" a bit.
There is a GenAI standard spec from OpenTelemetry for tracing LLM-based applications. Currently there are three library implementations of this spec: Langtrace, OpenLLMetry, and OpenLIT. Microsoft has an implementation for .NET as well. OpenInference, though OpenTelemetry-compatible, does not adhere to the standard spec.
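To make that concrete: the spec largely boils down to standardized span attribute names that any compliant instrumentation library emits. A sketch of what one LLM call's attributes look like (names taken from the GenAI semantic conventions, which are still evolving, so treat the exact keys and values as illustrative):

```python
# Span attributes an instrumentation library would attach to a single LLM call,
# following OpenTelemetry's (still-evolving) GenAI semantic conventions.
genai_span_attributes = {
    "gen_ai.system": "openai",              # which provider served the call
    "gen_ai.request.model": "gpt-4o-mini",  # model requested by the caller
    "gen_ai.request.temperature": 0.2,
    "gen_ai.response.model": "gpt-4o-mini-2024-07-18",  # model that actually answered
    "gen_ai.usage.input_tokens": 128,       # token accounting for cost analysis
    "gen_ai.usage.output_tokens": 56,
}

# Because the names are standardized, any spec-adhering backend can aggregate
# across libraries without per-library mapping logic.
total_tokens = (genai_span_attributes["gen_ai.usage.input_tokens"]
                + genai_span_attributes["gen_ai.usage.output_tokens"])
```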
Hello! How does it compare to DeepEval (open source)?
Great question. First, we have a ton of respect for the work the DeepEval team is doing. That said, we took a fundamentally different approach in building Opik as an open source project. With DeepEval, if you want to log your data or use the UI, you need to use Confident AI's cloud platform (which, as far as I'm aware, has no free plan). So, if you want to visualize traces, do production monitoring, labeling, etc., you can't just use the DeepEval open source library.
All of Opik's functionality, including the UI and logging, is available in the open source version. The only "features" that are inaccessible from the open source version of Opik are things that are actually features of the Comet platform. For example, Comet Artifacts allow you to store your datasets as versioned assets, preserved as an immutable series of snapshots, which automatically track any experiments they've been a part of in order to preserve their full data lineage. You can use Opik with Artifacts, but that will require a free Comet account. Any Opik-specific feature, however, is fully available in the open source version.