Advances in architectures, fine-tuning, RAG, and scaling.
Performance optimization, latency reduction, and cost efficiency.
Retrieval pipelines, indexing, and hybrid search.
Fine-tuning for applications and emerging alternatives.
Feature stores, pipelines, and workflow orchestration.
Governance, adversarial attacks, compliance, and bias mitigation.
Architectures, orchestration, and real-world deployments.
Reliability, observability, and benchmarking.
A diverse community of data and AI professionals explores best practices for ML/GenAI deployment in production.
See a snapshot of the technical AI practitioners and the committee that help drive our initiatives.
9:15 AM to 9:45 AM ET
Beyond Gemini: Using RL to Unlock Reliable AI Agents with Open LLMs
Julien Launay, CEO
9:50 AM to 10:20 AM ET
Workflows Then Agents: The Practical Approach to Enterprise AI
Julian LaNeve, CTO
10:20 AM to 10:50 AM ET
All the Flavors of AI-Powered Search
Kacper Łukawski, Senior Developer Advocate
10:50 AM to 11:20 AM ET
Teaching AI to Reason: Reinforcement Fine-Tuning for Multi-Turn Agentic Workflows
Sameer Reddy, Research Engineer
11:20 AM to 11:50 AM ET
Trading Copilot - Smarter Insights, Confident Trades
Bhaskarjit Sarmah, Head RQA AI Lab
12:20 PM to 12:50 PM ET
Leveraging LLMs to Build Mixed-Data ML Pipelines and Search GitHub with Natural Language
Andrew Moreland, Co-Founder
12:50 PM to 1:20 PM ET
Eval++: Why LLM Evaluation Alone Isn’t Enough
Chinar Movsisyan, Founder & CEO
1:20 PM to 1:50 PM ET
Knowledge Graphs + Semantic Search: Unlocking Smarter LLMs
Alessandro Pireno, Solutions Director
1:50 PM to 2:20 PM ET
Beyond Benchmarks: Rethinking How We Evaluate LLMs in High-Stakes Environments
Rajat Verma, Senior Staff Product Manager
2:20 PM to 2:50 PM ET
Safety Testing of the AI Agent: Vulnerabilities and Attacks Beyond the Chatbot
Alexander Borodetskiy, VP of Growth, AI Safety
2:50 PM to 3:20 PM ET
One Does Not Simply Run Code
Tereza Tizkova, Head of Developer Relations
3:20 PM to 3:50 PM ET
Reward Function As A Service: A (Relatively) Easy Recipe for Training Your Own Reasoning Model
Ville Tuulos, CEO / Co-Founder
3:50 PM to 4:20 PM ET
Closing the Reliability Gap: Practical Strategies with Guarantees for Trustworthy GenAI
Curtis Northcutt, CEO & Co-Founder
4:20 PM to 4:50 PM ET
Building a Deep Research with Open-Source
Stefan Webb, Developer Advocate
We’ll have various slotted times for presentations on April 15th, 2025. The schedule and links will be provided, and the sessions will take place on Google Meet.
Once you register, you will receive access links.
We have space for 15
The MLOps World Community and GenAI World Summit organizers (see the Who We Are section)
This event is free to attend
A laptop or personal computer and a strong, reliable Wi-Fi connection. Google Chrome is recommended.
No. The joint 6th GenAI World / MLOps World Conference and Expo is taking place on Oct. 8th & 9th in Austin, Texas; you can find more info here. This is a free virtual event we are offering.
All sessions will be recorded during the event (with speaker permission) and made available to registered attendees approximately 2-4 weeks afterward.
Please email [email protected]
Yes, you can inquire at [email protected]
Subject to minor changes.
Alon Gubkin, CTO
Duncan Blythe, Co-Founder & CTO
Rahm Hafiz, CTO
Jim Dowling, CEO
Yubo Gao, Research Software Development Engineer
Shreya Rajpal, CEO
Eric O. Korman, Co-Founder / Chief Science Officer
Presenter:
Julien Launay, CEO, Adaptive ML
About the Speaker:
Julien is the CEO and co-founder of Adaptive ML, a company focused on democratizing reinforcement fine-tuning. Prior to founding Adaptive ML, Julien was the technical lead behind Falcon 40B and 180B, the popular open-source LLMs, as well as the RefinedWeb dataset used to train them. Julien was also a key contributor within BigScience to the development of the open-source LLM BLOOM and led the Extreme Scale team at Hugging Face.
Category: Quality Tuning
Abstract:
An EdTech organization faced the ambitious challenge of developing an AI agent for student support superior to Khanmigo, specifically tailored to improve student outcomes. Achieving an agent that met their performance expectations required embedding decades of domain expertise—a task unattainable through prompt engineering alone.
This talk explores how the EdTech organization overcame these limitations by fine-tuning language models using reinforcement learning (RL) techniques with primarily synthetic data. Remarkably, these approaches enabled the fine-tuning of smaller, open-weight models that consistently outperform state-of-the-art models, including specialized variants like Gemini Coach.
Specifically, we examine how RL facilitated sophisticated behaviors such as dynamic retrieval-augmented generation (agentic RAG), adaptive communication strategies refined through iterative feedback, and advanced synthetic data techniques like self-play—eliminating the necessity for extensive real-world data collection.
Presenter:
Julian LaNeve, CTO, Astronomer
About the Speaker:
Julian LaNeve is Chief Technology Officer at Astronomer.
Category: LLMs
Abstract:
We’ve seen countless teams chase AI, only to get lost in its complexity without delivering value. The truth? Most organizations don’t need agents talking to agents to get immediate ROI out of generative AI – they need reliable LLM workflows that solve real business problems today.
Presenter:
Kacper Łukawski, Senior Developer Advocate, Qdrant
About the Speaker:
Software developer and data scientist at heart, with an inclination to teach others. Public speaker, working in DevRel.
Category: Embeddings
Abstract:
Machine Learning has revolutionized how we find relevant documents given a query based on its semantics, not keywords. In its basic form, vector search relies on single vector representations using the same model for queries and documents. Currently, we’re experiencing a second wave of vector search, enabling new modalities to be searchable. Do you want to search over vast amounts of PDFs? OCR is no longer needed, as modern vector search architectures can handle that with no additional parsing.
Why should you care about search? The “R” in RAG stands for Retrieval, which is just a different name for search. The better you can find what’s relevant, the higher the quality of AI responses you may expect.
Let’s review the current state of AI-powered search, including multivector representations such as ColPali.
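As a toy illustration of the dense, single-vector search the abstract describes, here is a minimal cosine-similarity ranker in plain Python. The three-dimensional "embeddings" are made up for the example and stand in for a real embedding model's output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

# Toy 3-dimensional "embeddings" standing in for a real model's output.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}
results = search([1.0, 0.0, 0.0], docs)
print(results[0][0])  # doc_a is closest to the query
```

Multivector approaches such as ColPali generalize this by keeping many vectors per document (e.g. one per page patch) and aggregating per-vector similarities, rather than collapsing each document to a single point.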
Presenter:
Sameer Reddy, Research Engineer, Predibase
About the Speaker:
Sameer Reddy is a Research Engineer at Predibase, where he works on fine-tuning and serving efficient language models for real-world agentic applications. His background spans reinforcement learning, LLM infrastructure, and ML efficiency, with prior research at Cisco and Georgia Tech focused on scalable model training and inference systems.
Category: LLMs
Abstract:
Multi-turn agent workflows—where models must reason across multiple steps, gather context iteratively, and make decisions over time—pose a unique challenge for LLMs fine-tuned only on static, one-shot data. In this talk, I’ll demonstrate how reinforcement fine-tuning (RFT) unlocks more reliable, controllable performance in complex agentic tasks by letting developers define reward functions that shape model behavior across multiple turns.
We’ll share how RFT can be used to train small, specialized models (1B–3B parameters) that act as lightweight decision engines within larger agentic workflows, and demonstrate how to fine-tune a model to select tools accurately using just a reward function—no hand-labeling required—and how this architecture can reduce both latency and cost while improving precision.
While the live demo will focus on a single-turn decision task, we’ll explore how this approach can generalize to multi-turn agent behavior, such as:
– Deferring tool selection to a compact RFT model before invoking a larger orchestrator LLM
– Teaching models to reason (via chain-of-thought) before making decisions
– Building modular, low-latency components that plug into existing agent stacks
This talk is ideal for ML engineers and infra teams building production-grade agents who want to reduce costs, increase reliability, and take greater control over how their models reason and act.
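To make the reward-function idea concrete, here is a hypothetical sketch of a tool-selection reward in Python. The scoring scheme (exact match, partial credit, zero) and the tool names are assumptions made for illustration, not Predibase's actual API:

```python
def tool_selection_reward(completion: str, expected_tool: str) -> float:
    """Score a model completion for a tool-routing task.

    Hypothetical scheme: full credit for naming exactly the right tool,
    partial credit if the right tool appears amid extra text, else zero.
    """
    choice = completion.strip().lower()
    target = expected_tool.lower()
    if choice == target:
        return 1.0  # exact, well-formed answer
    if target in choice:
        return 0.5  # right tool, sloppy formatting
    return 0.0      # wrong tool

# Example rollouts scored against the gold label "web_search".
print(tool_selection_reward("web_search", "web_search"))           # 1.0
print(tool_selection_reward("Use web_search here", "web_search"))  # 0.5
print(tool_selection_reward("calculator", "web_search"))           # 0.0
```

During RFT, scores like these replace hand-labeled supervision: the trainer samples completions, scores each with the function, and updates the policy toward higher-reward behavior.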
Presenter:
Bhaskarjit Sarmah, Head RQA AI Lab, BlackRock
About the Speaker:
Bhaskarjit is a Director and Principal Data Scientist at BlackRock, where he applies machine learning skills and domain knowledge to build innovative solutions for the world’s largest asset manager. He has over 10 years of experience in data science, spanning multiple industries and domains such as retail, airlines, media, entertainment, and BFSI.
At BlackRock, he is responsible for developing and deploying machine learning algorithms to enhance the liquidity risk analytics framework, identify price-making opportunities in the securities lending market, and create an early warning system using network science to detect regime change in markets. He also leverages his expertise in natural language processing and computer vision to extract insights from unstructured data sources and generate actionable reports.
Category: Agent Frameworks
Abstract:
This project aims to reduce the cognitive burden faced by professional traders. By developing a multi-agent framework called FinSage, we can help traders analyze complex market information by efficiently filtering through noise to deliver the most relevant and accurate data within minutes. Our approach seeks to enable traders to quickly uncover actionable insights for informed trading decisions.
FinSage consists of six specialized agents, each with a distinct role:
Supervisor Agent: Orchestrates overall operations and manages which agents to activate for each query.
Financial Metrics Agent: Provides comprehensive technical metrics for analyzing a company’s financial health and stock performance.
News Sentiment Agent: Analyzes sentiment across company news and specific market sectors.
SQL Agent: Accesses a historical database containing data from 2009 to 2022 for 57 NYSE-listed companies.
Synthesizer Agent: Combines data collected by other technical agents to generate concise, actionable answers to user queries.
Users can personalize their experience by entering their trading profile. FinSage then tailors its responses and recommendations based on the user’s risk tolerance, investment style, and expected return time horizon. The application is intuitive: users simply input their trading profile and a specific question like, “Should I buy META stock today?” and FinSage responds.
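The supervisor-style routing described above can be sketched minimally. The keyword rules and agent names below are invented for illustration; FinSage's actual orchestration logic is not public:

```python
# Hypothetical keyword router standing in for a Supervisor Agent;
# the keywords and agent names are invented for this example.
ROUTES = {
    "sentiment": "news_sentiment_agent",
    "news": "news_sentiment_agent",
    "historical": "sql_agent",
    "ratio": "financial_metrics_agent",
}

def route(query: str) -> list[str]:
    """Pick which specialist agents to activate for a user query."""
    q = query.lower()
    agents = []
    for kw, agent in ROUTES.items():
        if kw in q and agent not in agents:
            agents.append(agent)
    # Fall back to general metrics; the synthesizer always runs last
    # to combine the specialist outputs into one answer.
    return (agents or ["financial_metrics_agent"]) + ["synthesizer_agent"]

print(route("What is the news sentiment on META?"))
```

In production such routing is typically done by an LLM rather than keywords, but the contract is the same: the supervisor decides which specialists run, and a synthesizer merges their outputs.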
Presenter:
Claire Longo, Lead AI Researcher, Comet
About the Speaker:
Claire Longo is an AI Leader and Researcher at Comet with over a decade of experience in Data Science, Machine Learning, and GenAI. From coding in R as a Statistician at DOE National Laboratories to building recommender systems at Trunk Club, to leading customer success organizations at AI Startups, she has navigated the evolving AI landscape—teaching herself Python and ML along the way. She has led cross-functional AI teams at Twilio, Opendoor, and Arize AI, focusing on the engineering best practices required to bring AI models from ideation to production at scale. In her current role at Comet, Claire researches AI trends and shares best practices with the developer community. She holds a Bachelor’s in Applied Mathematics and a Master’s in Statistics from The University of New Mexico.
Category: LLM
Abstract:
As AI agents become increasingly autonomous and complex, ensuring these systems can detect and root-cause live issues is more critical than ever for shipping reliable, high-quality AI agents. Observability—the ability to monitor and debug AI systems—is a key component for deploying AI agents we can trust and improve.
This talk will explore the emerging field of AI agent observability, covering best practices and challenges in tracking agent behavior, understanding decision-making processes, and identifying failure points or biases. We will discuss the role of logging traces, monitoring, and LLM eval metrics in improving visibility into your agent operations, as well as novel approaches inspired by reinforcement learning.
Through real-world case studies, we will examine how AI observability enhances AI-driven systems in industry today. You’ll leave with practical strategies for using Comet Opik to implement observability for your AI agents.
Presenter:
Andrew Moreland, Co-Founder, Chalk
About the Speaker:
Andrew is a Co-Founder at Chalk, the data platform for inference. Previously, he built large government data infrastructure projects at Palantir, and co-founded Haven Money, later acquired by Credit Karma. He holds a B.S. and M.S. in Computer Science from Stanford.
Category: LLMs
Abstract:
In this talk, we break down how to integrate LLMs into production-grade machine learning pipelines that blend structured and unstructured data (natural language search queries, contextual embeddings). We’ll cover:
– Using real-time feature engineering to surface relevant results (fraud detection, recommender systems, inference) with minimal latency
– Optimizing for speed and efficiency, dynamically adjusting search complexity based on structured data cues
– Leveraging LLM retrieval at scale and continually conducting evals to improve recommendations over time
Presenter:
Chinar Movsisyan, Founder & CEO, Feedback Intelligence
About the Speaker:
Chinar Movsisyan is the founder and CEO of Feedback Intelligence (formerly Manot), an LLMOps company based in San Francisco that enables enterprises to ensure LLM-based products are reliable and that their output is aligned with end-user expectations and needs. Chinar has extensive experience building AI solutions from 0 to 1 in mission-critical applications including drones, satellites, and healthcare. She has led engineering and research initiatives at venture-backed startups (Amaros, Vineti) and research labs (the LCIS Lab at Grenoble University). She completed her PhD in ML for healthcare at CUNY, supervised by Sos Agaian.
Category: Quality Assessment
Abstract:
You wouldn’t serve a cake without tasting it — so why ship LLM products without real user feedback?
Traditional evaluation stops at internal data science metrics and offline benchmarks. But that’s not enough to ensure your LLM-powered product is aligned with real user needs and expectations.
In this talk, we introduce a new approach that puts users into the center of development and optimization. We call it Eval++.
You’ll learn how to turn implicit feedback into actionable insights, why closing the feedback loop is critical for moving from PoC to ROI, and how enterprises and startups alike are adopting Feedback Intelligence to build LLM-powered products that users love (and trust).
Presenter:
Alessandro Pireno, Solutions Director, SurrealDB
About the Speaker:
Alessandro is a seasoned product development and solutions leader with a proven track record of building and scaling data-driven solutions across diverse industries. He has led product strategy and development at companies like HUMAN and Omnicom Media Group, optimized data collection and distribution at GroupM, and was an early customer success leader at Snowflake. With a deep understanding of the challenges and opportunities facing today’s tech landscape, Alessandro is passionate about empowering organizations to unlock the full potential of their data through innovative database solutions.
Category: Vector DBs
Abstract:
Machine learning teams are increasingly experimenting with AI “agents” powered by large language models (LLMs) to autonomously complete tasks or make decisions. These agentic AI workflows promise dynamic, adaptive behavior, but they also introduce serious challenges for MLOps in production: unpredictable decision paths, novel failure modes (like hallucinated outputs or endless loops), and difficulty in monitoring and debugging. How can we harness the power of LLM-driven agents without losing control of our pipelines?
This talk tackles that question head-on by sharing lessons learned from implementing agent-based workflows at scale and demonstrating a structured approach to keep them reliable. Attendees will learn how to design workflows that delegate work to LLM agents in a controlled manner – defining clear task boundaries, adding guardrails (such as timeouts and retries), and capturing detailed traces of agent decisions. They will also see how an open-source framework (built on Prefect) makes it easier to orchestrate and observe these complex workflows. By the end of this session, you’ll know how to turn the “black box” of an AI agent into a transparent, trustworthy part of your ML operations.
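The guardrails mentioned above (retries plus a captured trace of each attempt) can be sketched framework-agnostically. This is an illustrative pattern, not Prefect's API; function and parameter names are assumptions of this example:

```python
import time

def with_guardrails(fn, max_retries=3, backoff_s=0.01):
    """Run an agent step with retries, recording a trace of every attempt."""
    trace = []  # detailed attempt log, useful for later observability/debugging
    for attempt in range(1, max_retries + 1):
        try:
            result = fn()
            trace.append((attempt, "ok"))
            return result, trace
        except Exception as exc:
            trace.append((attempt, f"error: {exc}"))
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(f"agent step failed after {max_retries} attempts: {trace}")

# A flaky "agent step" that fails once before succeeding.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("LLM call timed out")
    return "answer"

result, trace = with_guardrails(flaky_step)
print(result, len(trace))  # answer 2
```

An orchestrator like Prefect provides the same primitives (task retries, timeouts, run logs) declaratively, so you get the trace and recovery behavior without hand-rolling the loop.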
Presenter:
Rajat Verma, Senior Staff Product Manager, ServiceNow
About the Speaker:
With over 11 years of experience at the intersection of Data Science, Product Management, and Artificial Intelligence in B2B SaaS companies, Rajat Verma specializes in building data and AI/ML products that transform data into measurable business impact. Previously at Autodesk, Rajat led the development of a data platform processing over one million daily transactions and created data products that contributed approximately $10 million in both revenue growth and cost savings. Currently at ServiceNow, Rajat is focused on building and operationalizing AI products that drive business growth and enhance enterprise decision-making. His experience spans structured data pipelines to complex AI systems and models.
Category: Quality Tuning
Abstract:
Evaluating Large Language Models (LLMs) presents challenges far beyond traditional machine learning metrics like accuracy or F1 scores. As LLMs become integral to high-stakes applications—ranging from enterprise automation to medical and legal decision-making—the need for robust, multi-faceted evaluation frameworks has never been greater.
Unlike conventional models, LLMs must be assessed across diverse dimensions, including factual accuracy, reasoning depth, coherence, safety, and ethical alignment.
This talk explores three core challenges in quality tuning: (1) the inherent subjectivity of open-ended responses, making traditional benchmarking difficult; (2) the absence of definitive ground truth in generative tasks, complicating evaluation; and (3) the dynamic, context-dependent nature of correctness, which shifts based on application needs.
Building on existing evaluation benchmarks like HELM and EleutherAI’s Evaluation Harness, we’ll discuss a compositional quality tuning framework that adapts scoring weights dynamically to balance trade-offs between factuality, creativity, and safety. This approach includes novel mechanisms for detecting hallucinations in domain-specific contexts and quantifying the robustness of model outputs under varying prompt conditions.
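A compositional scheme that reweights per-dimension scores by application might look like the following sketch; the dimensions, scores, and weightings are illustrative assumptions, not the speaker's actual framework:

```python
def composite_score(scores: dict, weights: dict) -> float:
    """Weighted aggregate of per-dimension eval scores (each in [0, 1])."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

# Hypothetical per-dimension scores for one model response.
scores = {"factuality": 0.9, "creativity": 0.4, "safety": 1.0}

# A safety-critical application weights factuality and safety heavily...
clinical = composite_score(scores, {"factuality": 0.5, "creativity": 0.1, "safety": 0.4})
# ...while a brainstorming assistant favors creativity instead.
brainstorm = composite_score(scores, {"factuality": 0.2, "creativity": 0.6, "safety": 0.2})
print(round(clinical, 2), round(brainstorm, 2))
```

The same response thus scores differently depending on the deployment context, which is exactly the context-dependent notion of correctness the talk highlights.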
Finally, as LLMs grow more agentic, demonstrating long-horizon reasoning and autonomous decision-making, quality tuning methodologies must evolve in parallel. We will discuss some practical experiences implementing these evaluation approaches across different industries, highlighting how targeted quality tuning dramatically improves model performance in specialized domains.
Presenter:
Alexander Borodetskiy, VP of Growth, AI Safety, Toloka AI
About the Speaker:
TBA
Category: Guardrails, trust, security, Risk Mitigation
Abstract:
AI agents are evolving beyond simple chat, browsing the web and interacting with your computer. But how do we ensure they’re safe? We’ll explore a new safety evaluation framework, revealing how specially crafted web pages, files, and OS environments can be used to expose agent vulnerabilities. See how prompt injections can manipulate agents into leaking sensitive data, how to test an agent for unsafe behavior, and how agents might even be hijacked for malicious activities. This technical deep dive presents an approach to safety testing today’s leading AI agents and offers insights into building more robust and safe AI systems.
Presenter:
Tereza Tizkova, Head of Developer Relations, E2B
About the Speaker:
Tereza has been with E2B from the beginning, focusing on developer relations, explaining the product, and creating content to help developers build the best AI apps. Before that she was a management consultant at McKinsey, and her background is in abstract math. In her free time she loves testing new AI tools, talking to developers, and writing for her blog.
Category: LLMs
Abstract:
We are in the era of LLM-powered software. The AI-powered apps and products being built require very specific tools. We will talk about building infrastructure for AI agents and apps, and what challenges we need to overcome if we want to make agents more reliable. We introduce different types of code execution environments for agents, consider advantages and disadvantages, and talk about important questions like security or scalability.
Presenter:
Ville Tuulos, CEO / Co-Founder, Outerbounds
About the Speaker:
Ville Tuulos has been developing infrastructure for ML and AI for over two decades. He has worked as an ML researcher in academia and as a leader at a number of companies, including Netflix where he led the ML infrastructure team that created Metaflow, a popular open-source framework for ML infrastructure. He is a co-founder and CEO of Outerbounds, a developer-friendly platform for real-world ML/AI systems.
Category: Quality Tuning
Abstract:
In this session, we will walk through a template that allows you to post-train a DeepSeek-style LLM with your own reward functions. In contrast to traditional supervised fine-tuning, this approach allows you to customize high-performing LLMs for domains that are amenable to programmatic evaluation through simple Python functions. This technique is commonly applied to math and coding, but as shown in this session, with some creativity you can apply the template to many other domains as well.
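As a sketch of what "programmatic evaluation through simple Python functions" can mean, here is a hypothetical verifiable-answer reward for a math domain. The "Answer: <value>" output convention is an assumption of this example, not part of any particular training template:

```python
import re

def math_reward(completion: str, correct_answer: str) -> float:
    """Reward a completion only if its final stated answer matches exactly.

    Assumes the model is prompted to end its response with
    'Answer: <value>' -- a convention chosen for this sketch.
    """
    match = re.search(r"Answer:\s*([-\d./]+)\s*$", completion.strip())
    if not match:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1) == correct_answer else 0.0

print(math_reward("The sum is 12.\nAnswer: 12", "12"))  # 1.0
print(math_reward("Answer: 13", "12"))                  # 0.0
print(math_reward("I think it's twelve", "12"))         # 0.0
```

Because the check is a plain function of the model's text, it can be swapped for any domain with a programmatic verifier, such as running unit tests against generated code.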
Presenter:
Curtis Northcutt, CEO & Co-Founder, Cleanlab
About the Speaker:
Curtis Northcutt is an American computer scientist, artist, and entrepreneur focusing on using machine learning and artificial intelligence to empower people. Curtis completed his PhD at MIT where he invented Cleanlab’s algorithms.
He is the CEO and Co-Founder of Cleanlab, used by 80+ of the top Fortune 500 companies to “detect, observe, and immediately resolve RAG and LLM failures—such as hallucinations, retrieval failures, and knowledge gaps—in real time.” Curtis is the recipient of the MIT Morris Levin Thesis Award, the NSF Fellowship, and the Goldwater Scholarship, and has worked at several leading AI research groups, including Google, Oculus, Amazon, Facebook, Microsoft, and NASA.
Category: Quality Tuning
Abstract:
As GenAI becomes embedded in production systems across industries, reliability challenges are surfacing that threaten its long-term value. From simple factual errors—like misidentifying the third month of the year alphabetically—to complex reasoning failures, these issues undermine user trust and limit the effectiveness of AI applications. In this talk, we will survey the current landscape of AI reliability, including the latest research and tools for identifying and classifying misbehaviors of AI systems. We will delve into how these tools fit into cutting-edge architectures, such as RAG and agentic systems. We will conclude by discussing strategies for remediation, including our vision for how AI can be designed to collaborate effectively with humans, delivering optimal outcomes with minimal oversight. Join us to learn how the future of AI hinges not just on capability, but on dependable, trustworthy systems.
Presenter:
Stefan Webb, Developer Advocate, Zilliz
About the Speaker:
Stefan Webb is a Developer Advocate at Zilliz, where he advocates for the open-source vector database, Milvus. Prior to this, he spent three years in industry as an Applied ML Researcher at Twitter and Meta, collaborating with product teams to tackle their most complex challenges. Stefan holds a PhD from the University of Oxford and has published papers at leading machine learning conferences such as NeurIPS, ICLR, and ICML. He is passionate about Generative AI and is eager to leverage his deep technical expertise to contribute to the open-source community.
Category: LLMs
Abstract:
Unless you live under a rock, you will have heard about OpenAI’s release of Deep Research on Feb 2, 2025. This new product promises to revolutionize how we answer questions requiring the synthesis of large amounts of diverse information. But how does this technology work, and why is Deep Research a noticeable improvement over previous attempts? In this webinar, we will examine the concepts underpinning modern agents, especially research agents, using our open-source clone DeepSearcher as an example.