Organized by

Alongside
Agents & GenAI Infrastructure and Tooling Summit

Free for Everyone

Virtual, April 15th, 2025

Explore cutting-edge AI Agent systems and the GenAI infrastructure that powers them.

Join technical experts for 15+ live demos showcasing the full technology stack - from foundational components to advanced agent frameworks.

Whether you're building autonomous AI solutions or strengthening your GenAI infrastructure, this virtual summit delivers practical insights to accelerate your implementation.

At this virtual event you’ll meet innovators at the forefront of these areas:

01

LLMs

Advances in architectures, fine-tuning, RAG, and scaling.

02

Quality Tuning

Performance optimization, latency reduction, and cost efficiency.

03

Vector DBs

Retrieval pipelines, indexing, and hybrid search.

04

Embeddings

Fine-tuning for applications and emerging alternatives.

05

Data Tools

Feature stores, pipelines, and workflow orchestration.

06

Trust, Risk & Security Management

Governance, adversarial attacks, compliance, and bias mitigation.

07

Agent Frameworks

Architectures, orchestration, and real-world deployments.

08

Evaluation & Monitoring

Reliability, observability, and benchmarking.

Who we are

A diverse community of data and AI professionals exploring best practices for ML/GenAI deployment in production.

Read more.

Our committee

See a snapshot of the technical AI practitioners and committee members who help drive our initiatives.

Read more.

Event Presenters

Julien Launay

CEO, Adaptive ML
Talk: Beyond Gemini: Using RL to Unlock Reliable AI Agents with Open LLMs

Julian LaNeve

CTO, Astronomer
Talk: Workflows Then Agents: The Practical Approach to Enterprise AI

Kacper Łukawski

Senior Developer Advocate, Qdrant
Talk: All the Flavors of AI-Powered Search

Sameer Reddy

Research Engineer, Predibase
Talk: Teaching AI to Reason: Reinforcement Fine-Tuning for Multi-Turn Agentic Workflows

Bhaskarjit Sarmah

Head RQA AI Lab, BlackRock
Talk: Trading Copilot - Smarter Insights, Confident Trades

Claire Longo

Lead AI Researcher, Comet
Talk: AI Agent Observability

Andrew Moreland

Co-Founder, Chalk
Talk: Leveraging LLMs to Build Mixed-Data ML Pipelines and Search GitHub with Natural Language

Chinar Movsisyan

Founder & CEO, Feedback Intelligence
Talk: Eval++: Why LLM Evaluation Alone Isn’t Enough

Alessandro Pireno

Solutions Director, SurrealDB
Talk: Knowledge Graphs + Semantic Search: Unlocking Smarter LLMs

Rajat Verma

Senior Staff Product Manager, ServiceNow
Talk: Beyond Benchmarks: Rethinking How We Evaluate LLMs in High-Stakes Environments

Alexander Borodetskiy

VP of Growth, AI Safety, Toloka AI
Talk: Safety Testing of the AI Agent: Vulnerabilities and Attacks Beyond the Chatbot

Tereza Tizkova

Head of Developer Relations, E2B
Talk: One Does Not Simply Run Code

Ville Tuulos

CEO / Co-Founder, Outerbounds
Talk: Reward Function As A Service: A (Relatively) Easy Recipe for Training Your Own Reasoning Model

Curtis Northcutt

CEO & Co-Founder, Cleanlab
Talk: Closing the Reliability Gap: Practical Strategies with Guarantees for Trustworthy GenAI

Stefan Webb

Developer Advocate, Zilliz
Talk: Building a Deep Research with Open-Source

Event Schedule

9:10 AM to 9:15 AM ET

Opening Remarks

9:15 AM to 9:45 AM ET

Beyond Gemini: Using RL to Unlock Reliable AI Agents with Open LLMs

Julien Launay, CEO

9:50 AM to 10:20 AM ET

Workflows Then Agents: The Practical Approach to Enterprise AI

Julian LaNeve, CTO

10:20 AM to 10:50 AM ET

All the Flavors of AI-Powered Search

Kacper Łukawski, Senior Developer Advocate

10:50 AM to 11:20 AM ET

Teaching AI to Reason: Reinforcement Fine-Tuning for Multi-Turn Agentic Workflows

Sameer Reddy, Research Engineer

11:20 AM to 11:50 AM ET

Trading Copilot - Smarter Insights, Confident Trades

Bhaskarjit Sarmah, Head RQA AI Lab

11:50 AM to 12:20 PM ET

AI Agent Observability

Claire Longo, Lead AI Researcher

12:20 PM to 12:50 PM ET

Leveraging LLMs to Build Mixed-Data ML Pipelines and Search GitHub with Natural Language

Andrew Moreland, Co-Founder

12:50 PM to 1:20 PM ET

Eval++: Why LLM Evaluation Alone Isn’t Enough

Chinar Movsisyan, Founder & CEO

1:20 PM to 1:50 PM ET

Knowledge Graphs + Semantic Search: Unlocking Smarter LLMs

Alessandro Pireno, Solutions Director

1:50 PM to 2:20 PM ET

Beyond Benchmarks: Rethinking How We Evaluate LLMs in High-Stakes Environments

Rajat Verma, Senior Staff Product Manager

2:20 PM to 2:50 PM ET

Safety Testing of the AI Agent: Vulnerabilities and Attacks Beyond the Chatbot

Alexander Borodetskiy, VP of Growth, AI Safety

2:50 PM to 3:20 PM ET

One Does Not Simply Run Code

Tereza Tizkova, Head of Developer Relations

3:20 PM to 3:50 PM ET

Reward Function As A Service: A (Relatively) Easy Recipe for Training Your Own Reasoning Model

Ville Tuulos, CEO / Co-Founder

3:50 PM to 4:20 PM ET

Closing the Reliability Gap: Practical Strategies with Guarantees for Trustworthy GenAI

Curtis Northcutt, CEO & Co-Founder

4:20 PM to 4:50 PM ET

Building a Deep Research with Open-Source

Stefan Webb, Developer Advocate

How Does it Work?

  • Explore Advanced AI Agent Systems – Discover the latest developments in autonomous, goal-directed AI systems and their practical applications.
  • Evaluate GenAI Infrastructure Options – Compare different foundational technologies that support all generative AI applications.
  • Learn From Technical Experts – Gain insights from practitioners with hands-on experience in both agent development and infrastructure deployment.
  • Discover the Latest Tooling – See demonstrations of specialized tools for development, deployment, and monitoring of GenAI systems.
  • Understand Security Approaches – Learn best practices for implementing guardrails for AI agents and securing GenAI infrastructure.
  • Make Informed Technology Decisions – Cut through marketing hype with honest demonstrations of what various technologies can actually deliver.
  • Accelerate Your Implementation – Identify the right components for your specific use cases, whether focused on agents or broader GenAI applications.

Who Should Attend

  • AI Engineers & Developers building agent-based systems or working with foundational GenAI technologies
  • ML/MLOps Professionals responsible for implementing and maintaining AI infrastructure
  • Technical Architects designing systems that incorporate AI agents or GenAI capabilities
  • Engineering Leaders making strategic decisions about AI technology investments
  • AI Product Managers seeking to understand the technical capabilities and limitations of current tools
  • Technical Founders building startups in the AI agent or infrastructure space
  • Enterprise Innovation Teams evaluating AI agent implementations or GenAI infrastructure options
  • Open Source Contributors working on AI agent frameworks or GenAI tooling projects
  • Academic Researchers interested in practical applications and industry implementations

We’ll have various slotted times for presentations on April 15th, 2025. The schedule and links will be provided, and the sessions will take place on Google Meet.

  • You can drop in on any company’s or group’s presentation at any time during its allocated slot.
  • Each session will help you better understand the tooling and approaches via demos and Q&A with specialists. You can ask questions and share with others who have similar interests.
  • Move around the space freely, chat 1-1 via the direct messaging feature, or simply watch the host session present. The platform is easy to navigate.
  • Presenters will also have the opportunity to reach out for 1-1 meetings, and vice versa.

Once you register, you will receive access links.

The MLOps World Community and GenAI World Summit organizers (see the Who We Are section).

This event is free to attend

A laptop or personal computer and a strong, reliable Wi-Fi connection. Google Chrome is recommended.

No, the joint 6th GenAI World / MLOps World Conference and Expo is taking place on Oct. 8th & 9th in Austin, Texas. You can find more info here. This is a free virtual event we are offering.

All sessions will be recorded during the event (with speaker permission) and will be made available to registered attendees approximately 2-4 weeks after the event.

Tickets

Gen AI Tools
Infra & Open Source
Demo Days

Subject to minor changes.

May 8th

Registration
Mitigating RAG Hallucinations with Aporia Guardrails

Alon Gubkin, CTO

Customizable RAG workflows with your own Data
Christy Bergman, Developer Advocate
The Secret Sauce for Deploying LLM Applications into Production
Josh Reini, Developer Advocate, OSS Lead
The Who, What, and Why of Data Lake Table Formats
Alex Merced, Developer Advocate
Introducing Arize-Phoenix and OpenInference
Mikyo King, Head of Open Source & Founding Engineer
Integrating AI Directly with Your Existing Databases. Build, Deploy and Manage AI Apps Easily, without Moving Your Data Into Complex Pipelines and Specialized Vector Databases.

Duncan Blythe, Co-Founder & CTO

Keynote Sessions
Making Enterprise GenAI Safe and Effective - Tools and Approaches

Rahm Hafiz, CTO

Running multiple models on the same GPU, on spot instances
Oscar Rovira, Co-Founder
LLMs From Dream to Deployed
Josh Goldstein, Solutions Engineer
Towards Robust GenAI: Techniques for Evaluating Enterprise LLM Applications
Dhruv Singh, Co-Founder & CTO
The Journey of Building a Leading Open Source LLM Security Toolkit
Oleksandr Yaremchuk, Principal Engineer LLMs and Open-Source Initiatives
Private, Local AI
Christian Crowley, Co-Founder
Beyond Benchmarks: Measuring Success for Your AI Initiatives
Salma Mayorquin, Co-Founder
Evaluation Engineering: Iterative Strategies to Testing Prompts
Jared Zoneraich, Founder
Apache Airflow: Where Data Engineers and ML Engineers Meet
Tamara Fingerlin, Developer Advocate
Function Calling for LLMs: RAG Without a Vector Database

Jim Dowling, CEO

Lessons Learned from Scaling Large Language Models in Production
Matt Squire, CTO
Finding Training Inefficiencies with CentML DeepView

Yubo Gao, Research Software Development Engineer

Keynote Sessions
Why AI Apps Don't Work in Prod: AI Reliability Survey

Shreya Rajpal, CEO

Evaluating LLMs and RAG Pipelines at Scale

Eric O. Korman, Co-Founder / Chief Science Officer

Better Chatbots with Advanced RAG Techniques
Zain Hasan, Developer Advocate
Running Prompts at CI Does Not Make Your Gen AI App Enterprise Ready
Jakob Frick, CTO
Data Versioning in Generative AI: A Pathway to Cost-effective ML
Dmitry Petrov, CEO
Building ML and GenAI Systems with Metaflow
Ville Tuulos, CEO
Talk: Beyond Gemini: Using RL to Unlock Reliable AI Agents with Open LLMs

Presenter:
Julien Launay, CEO, Adaptive ML

About the Speaker:
Julien is the CEO and co-founder of Adaptive ML, a company focused on democratizing reinforcement fine-tuning. Prior to founding Adaptive ML, Julien was the technical lead behind Falcon 40B and 180B, the popular open-source LLMs, as well as the RefinedWeb dataset used to train them. Julien was also a key contributor within BigScience to the development of the open-source LLM BLOOM and led the Extreme Scale team at Hugging Face.

Category: Quality Tuning

Abstract:

An EdTech organization faced the ambitious challenge of developing an AI agent for student support superior to Khanmigo, specifically tailored to improve student outcomes. Achieving an agent that met their performance expectations required embedding decades of domain expertise, a task unattainable through prompt engineering alone.

This talk explores how the organization overcame these limitations by fine-tuning language models using reinforcement learning (RL) techniques with primarily synthetic data. Remarkably, these approaches enabled the fine-tuning of smaller, open-weight models that consistently outperform state-of-the-art models, including specialized variants like Gemini Coach.

Specifically, we examine how RL facilitated sophisticated behaviors such as dynamic retrieval-augmented generation (agentic RAG), adaptive communication strategies refined through iterative feedback, and advanced synthetic data techniques like self-play—eliminating the necessity for extensive real-world data collection.

Talk: Workflows Then Agents: The Practical Approach to Enterprise AI

Presenter:
Julian LaNeve, CTO, Astronomer

About the Speaker:
Julian LaNeve is Chief Technology Officer at Astronomer.

Category: LLMs

Abstract:
We’ve seen countless teams chase AI, only to get lost in its complexity without delivering value. The truth? Most organizations don’t need agents talking to agents to get immediate ROI out of generative AI – they need reliable LLM workflows that solve real business problems today.

Related to this blog.

Talk: All the Flavors of AI-Powered Search

Presenter:
Kacper Łukawski, Senior Developer Advocate, Qdrant

About the Speaker:
Software developer and data scientist at heart, with an inclination to teach others. Public speaker, working in DevRel.

Category: Embeddings

Abstract:
Machine Learning has revolutionized how we find relevant documents given a query based on its semantics, not keywords. In its basic form, vector search relies on single vector representations using the same model for queries and documents. Currently, we’re experiencing a second wave of vector search, enabling new modalities to be searchable. Do you want to search over vast amounts of PDFs? OCR is no longer needed, as modern vector search architectures can handle that with no additional parsing.

Why should you care about search? The “R” in RAG stands for Retrieval, which is just a different name for search. The better you can find what’s relevant, the higher the quality of AI responses you may expect.

Let’s review the current state of AI-powered search, including multivector representations such as ColPali.

Talk: Teaching AI to Reason: Reinforcement Fine-Tuning for Multi-Turn Agentic Workflows

Presenter:
Sameer Reddy, Research Engineer, Predibase

About the Speaker:
Sameer Reddy is a Research Engineer at Predibase, where he works on fine-tuning and serving efficient language models for real-world agentic applications. His background spans reinforcement learning, LLM infrastructure, and ML efficiency, with prior research at Cisco and Georgia Tech focused on scalable model training and inference systems.

Category: LLMs

Abstract:
Multi-turn agent workflows—where models must reason across multiple steps, gather context iteratively, and make decisions over time—pose a unique challenge for LLMs fine-tuned only on static, one-shot data. In this talk, I’ll demonstrate how reinforcement fine-tuning (RFT) unlocks more reliable, controllable performance in complex agentic tasks by letting developers define reward functions that shape model behavior across multiple turns.

We’ll share how reinforcement fine-tuning (RFT) can be used to train small, specialized models (1B–3B parameters) that act as lightweight decision engines within larger agentic workflows. We’ll demonstrate how to fine-tune a model to select tools accurately using just a reward function—no hand-labeling required—and how this architecture can reduce both latency and cost while improving precision.

While the live demo will focus on a single-turn decision task, we’ll explore how this approach can generalize to multi-turn agent behavior, such as:

– Deferring tool selection to a compact RFT model before invoking a larger orchestrator LLM
– Teaching models to reason (via chain-of-thought) before making decisions
– Building modular, low-latency components that plug into existing agent stacks

This talk is ideal for ML engineers and infra teams building production-grade agents who want to reduce costs, increase reliability, and take greater control over how their models reason and act.
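The reward-function-only training the abstract describes can be grounded with a small sketch. The snippet below is illustrative only: the output format ("CALL: <tool>") and the reward values are assumptions for this example, not Predibase's actual conventions or API.

```python
# Illustrative sketch: a reward function for single-turn tool selection.
# In RFT setups of this kind, a plain function scores each model
# completion; no hand-labeled preference data is required.

def tool_selection_reward(completion: str, expected_tool: str) -> float:
    """Score a model's tool choice: full credit for the right tool,
    partial credit for a well-formed call to the wrong tool."""
    completion = completion.strip()
    if not completion.startswith("CALL:"):
        return 0.0                      # malformed output, no reward
    chosen = completion.removeprefix("CALL:").strip()
    if chosen == expected_tool:
        return 1.0                      # correct tool selected
    return 0.2                          # valid format, wrong tool

print(tool_selection_reward("CALL: web_search", "web_search"))  # 1.0
print(tool_selection_reward("I think...", "web_search"))        # 0.0
```

The partial credit for well-formed-but-wrong calls is one common way to shape behavior gradually: the model first learns the output format, then learns to pick the right tool.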

Talk: Trading Copilot - Smarter Insights, Confident Trades

Presenter:
Bhaskarjit Sarmah, Head RQA AI Lab, BlackRock

About the Speaker:
Bhaskarjit is a Director and Principal Data Scientist at BlackRock, where he applies machine learning skills and domain knowledge to build innovative solutions for the world’s largest asset manager. He has over 10 years of experience in data science, spanning multiple industries and domains such as retail, airlines, media, entertainment, and BFSI.

At BlackRock, he is responsible for developing and deploying machine learning algorithms to enhance the liquidity risk analytics framework, identify price-making opportunities in the securities lending market, and create an early warning system using network science to detect regime change in markets. He also leverages his expertise in natural language processing and computer vision to extract insights from unstructured data sources and generate actionable reports.

Category: Agent Frameworks

Abstract:
This project aims to reduce the cognitive burden faced by professional traders. By developing a multi-agent framework called FinSage, we can help traders analyze complex market information by efficiently filtering through noise to deliver the most relevant and accurate data within minutes. Our approach seeks to enable traders to quickly uncover actionable insights for informed trading decisions.

FinSage consists of six specialized agents, each with a distinct role:
Supervisor Agent: Orchestrates overall operations and manages which agents to activate for each query.

Financial Metrics Agent: Provides comprehensive technical metrics for analyzing a company’s financial health and stock performance.

News Sentiment Agent: Analyzes sentiment across company news and specific market sectors.

SQL Agent: Accesses a historical database containing data from 2009 to 2022 for 57 NYSE-listed companies.

Synthesizer Agent: Combines data collected by other technical agents to generate concise, actionable answers to user queries.

Users can personalize their experience by entering their trading profile. FinSage then tailors its responses and recommendations based on the user’s risk tolerance, investment style, and expected return time horizon. The application is intuitive: users simply input their trading profile and a specific question like, “Should I buy META stocks today?”, and FinSage will come back with a response.

Talk: AI Agent Observability

Presenter:
Claire Longo, Lead AI Researcher, Comet

About the Speaker:
Claire Longo is an AI Leader and Researcher at Comet with over a decade of experience in Data Science, Machine Learning, and GenAI. From coding in R as a Statistician at DOE National Laboratories to building recommender systems at Trunk Club, to leading customer success organizations at AI Startups, she has navigated the evolving AI landscape—teaching herself Python and ML along the way. She has led cross-functional AI teams at Twilio, Opendoor, and Arize AI, focusing on the engineering best practices required to bring AI models from ideation to production at scale. In her current role at Comet, Claire researches AI trends and shares best practices with the developer community. She holds a Bachelor’s in Applied Mathematics and a Master’s in Statistics from The University of New Mexico.

Category: LLMs

Abstract:
As AI Agents become increasingly autonomous and complex, ensuring these systems can detect and root-cause live issues becomes more critical than ever for shipping reliable, high-quality AI Agents. Observability, the ability to monitor and debug AI systems, is a key component for deploying AI Agents we can trust and improve.

This talk will explore the emerging field of AI Agent Observability, covering best practices and challenges in tracking agent behavior, understanding decision-making processes, and identifying failure points or biases. We will discuss the role of logging traces, monitoring, and LLM eval metrics in improving visibility into your agent operations, as well as novel approaches inspired by Reinforcement Learning.
Through real-world case studies, we will examine how AI Observability enhances AI-driven systems in industry today. This talk will give you strategies for using Comet Opik to implement observability for your AI Agents.

Talk: Leveraging LLMs to Build Mixed-Data ML Pipelines and Search GitHub with Natural Language

Presenter:
Andrew Moreland, Co-Founder, Chalk

About the Speaker:
Andrew is a Co-Founder at Chalk, the data platform for inference. Previously, he built large government data infrastructure projects at Palantir, and co-founded Haven Money, later acquired by Credit Karma. He holds a B.S. and M.S. in Computer Science from Stanford.

Category: LLMs

Abstract:
In this talk, we break down how to integrate LLMs into production-grade machine learning pipelines that blend structured and unstructured data (natural language search queries, contextual embeddings), including:

– Using real-time feature engineering to surface relevant results (fraud detection, recommender systems, inference) with minimal latency
– Optimizing for speed and efficiency, dynamically adjusting search complexity based on structured data cues
– Leveraging LLM retrieval at scale and continually conducting evals to improve recommendations over time

Talk: Eval++: Why LLM Evaluation Alone Isn’t Enough

Presenter:
Chinar Movsisyan, Founder & CEO, Feedback Intelligence

About the Speaker:
Chinar Movsisyan is the founder and CEO of Feedback Intelligence (formerly Manot), an LLMOps company based in San Francisco that enables enterprises to make sure that LLM-based products are reliable and that the output is aligned with end-user expectations and needs. Chinar has extensive experience in building AI solutions from 0 to 1 in different mission-critical applications including drones, satellites, and healthcare. She has led engineering and research initiatives at different venture-backed startups (Amaros, Vineti) and research labs (LCIS Lab at Grenoble University). Her PhD, supervised by Sos Agaian at CUNY, focused on ML in healthcare.

Category: Quality Assessment

Abstract:
You wouldn’t serve a cake without tasting it — so why ship LLM products without real user feedback?

Traditional evaluation stops at internal data science metrics and offline benchmarks. But that’s not enough to ensure your LLM-powered product is aligned with real user needs and expectations.

In this talk, we introduce a new approach that puts users into the center of development and optimization. We call it Eval++.

You’ll learn how to turn implicit feedback into actionable insights, why closing the feedback loop is critical for moving from PoC to ROI, and how enterprises and startups alike are adopting Feedback Intelligence to build LLM-powered products that users love (and trust).

Talk: Knowledge Graphs + Semantic Search: Unlocking Smarter LLMs

Presenter:
Alessandro Pireno, Solutions Director, SurrealDB

About the Speaker:
Alessandro is a seasoned product development and solutions leader with a proven track record of building and scaling data-driven solutions across diverse industries. He has led product strategy and development at companies like HUMAN and Omnicom Media Group, optimized data collection and distribution at GroupM, and was an early leader of success at Snowflake. With a deep understanding of the challenges and opportunities facing today’s tech landscape, Alessandro is passionate about empowering organizations to unlock the full potential of their data through innovative database solutions.

Category: Vector DBs

Abstract:
Machine learning teams are increasingly experimenting with AI “agents” powered by large language models (LLMs) to autonomously complete tasks or make decisions. These agentic AI workflows promise dynamic, adaptive behavior, but they also introduce serious challenges for MLOps in production: unpredictable decision paths, novel failure modes (like hallucinated outputs or endless loops), and difficulty in monitoring and debugging. How can we harness the power of LLM-driven agents without losing control of our pipelines?

This talk tackles that question head-on by sharing lessons learned from implementing agent-based workflows at scale and demonstrating a structured approach to keep them reliable. Attendees will learn how to design workflows that delegate work to LLM agents in a controlled manner – defining clear task boundaries, adding guardrails (such as timeouts and retries), and capturing detailed traces of agent decisions. They will also see how an open-source framework (built on Prefect) makes it easier to orchestrate and observe these complex workflows. By the end of this session, you’ll know how to turn the “black box” of an AI agent into a transparent, trustworthy part of your ML operations.
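The guardrails listed above (clear task boundaries, timeouts, retries, captured traces of agent decisions) can be sketched in plain Python. Here `agent_step` is a hypothetical stand-in for one LLM-agent action; real orchestrators such as Prefect expose retries and timeouts as first-class task options rather than hand-rolled loops.

```python
# Minimal sketch of agent guardrails: a per-attempt timeout, a bounded
# retry loop, and a trace recording every attempt for later debugging.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_guardrails(agent_step, *, retries=3, timeout_s=10.0):
    """Run a single agent action with a timeout and bounded retries,
    returning (result, trace). Raises if every attempt fails."""
    trace = []                                  # record of every attempt
    with ThreadPoolExecutor(max_workers=1) as pool:
        for attempt in range(1, retries + 1):
            future = pool.submit(agent_step)
            try:
                result = future.result(timeout=timeout_s)
                trace.append((attempt, "ok"))
                return result, trace
            except TimeoutError:
                trace.append((attempt, "timeout"))
            except Exception as exc:
                trace.append((attempt, f"error: {exc}"))
    raise RuntimeError(f"agent failed after {retries} attempts: {trace}")
```

The trace is the piece most teams skip: when an agent misbehaves in production, a per-attempt record of what happened is what turns the "black box" into something debuggable.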

Talk: Beyond Benchmarks: Rethinking How We Evaluate LLMs in High-Stakes Environments

Presenter:
Rajat Verma, Senior Staff Product Manager, ServiceNow

About the Speaker:
With over 11 years of experience at the intersection of Data Science, Product Management, and Artificial Intelligence in B2B SaaS companies, Rajat Verma specializes in building data and AI/ML products that transform data into measurable business impact. Previously at Autodesk, Rajat led the development of a data platform processing over one million daily transactions and created data products that contributed approximately $10 million in both revenue growth and cost savings. Currently at ServiceNow, Rajat is focused on building and operationalizing AI products that drive business growth and enhance enterprise decision-making. His experience spans structured data pipelines to complex AI systems and models.

Category: Quality Tuning

Abstract:
Evaluating Large Language Models (LLMs) presents challenges far beyond traditional machine learning metrics like accuracy or F1 scores. As LLMs become integral to high-stakes applications—ranging from enterprise automation to medical and legal decision-making—the need for robust, multi-faceted evaluation frameworks has never been greater.
Unlike conventional models, LLMs must be assessed across diverse dimensions, including factual accuracy, reasoning depth, coherence, safety, and ethical alignment.

This talk explores three core challenges in quality tuning: (1) the inherent subjectivity of open-ended responses, making traditional benchmarking difficult; (2) the absence of definitive ground truth in generative tasks, complicating evaluation; and (3) the dynamic, context-dependent nature of correctness, which shifts based on application needs.

Building on existing evaluation benchmarks like HELM and EleutherAI’s Evaluation Harness, we’ll discuss a compositional quality tuning framework that adapts scoring weights dynamically to balance trade-offs between factuality, creativity, and safety. This approach includes novel mechanisms for detecting hallucinations in domain-specific contexts and quantifying the robustness of model outputs under varying prompt conditions.

Finally, as LLMs grow more agentic, demonstrating long-horizon reasoning and autonomous decision-making, quality tuning methodologies must evolve in parallel. We will discuss some practical experiences implementing these evaluation approaches across different industries, highlighting how targeted quality tuning dramatically improves model performance in specialized domains.
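A compositional score with application-dependent weights, as the abstract describes, might look like the following sketch. The dimension names and weight profiles are illustrative assumptions, not the framework discussed in the talk.

```python
# Illustrative sketch of a compositional quality score: per-dimension
# scores in [0, 1] are combined with weights chosen per application,
# so a medical deployment can value factuality and safety over creativity.

PROFILES = {
    "medical":  {"factuality": 0.6, "creativity": 0.1, "safety": 0.3},
    "creative": {"factuality": 0.2, "creativity": 0.6, "safety": 0.2},
}

def composite_score(scores: dict, profile: str) -> float:
    """Weighted sum of per-dimension scores under a named weight profile."""
    weights = PROFILES[profile]
    return sum(weights[dim] * scores[dim] for dim in weights)

scores = {"factuality": 0.9, "creativity": 0.4, "safety": 1.0}
print(round(composite_score(scores, "medical"), 2))   # 0.88
print(round(composite_score(scores, "creative"), 2))  # 0.62
```

The same raw scores rank very differently under the two profiles, which is the trade-off-balancing behavior the abstract calls dynamic scoring weights.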

Talk: Safety Testing of the AI Agent: Vulnerabilities and Attacks Beyond the Chatbot

Presenter:
Alexander Borodetskiy, VP of Growth, AI Safety, Toloka AI

About the Speaker:
TBA

Category: Trust, Risk & Security Management

Abstract:
AI agents are evolving beyond simple chat, browsing the web and interacting with your computer. But how do we ensure they’re safe? We’ll explore a new safety evaluation framework, revealing how specially crafted web pages, files, and OS environments can be used to expose agent vulnerabilities. See how prompt injections can manipulate agents into leaking sensitive data, how to organize testing of an agent for performing unsafe mistakes – and how they might even be hijacked for malicious activities. This technical deep dive reveals the approach to conduct safety testing of today’s leading AI agents and provides insights into building more robust and safe AI systems.

Talk: One Does Not Simply Run Code

Presenter:
Tereza Tizkova, Head of Developer Relations, E2B

About the Speaker:
Tereza was with E2B from the beginning. She focused on developer relations, explaining the product, and creating content to help developers build the best AI apps. Before that she was a management consultant at McKinsey, and her background is in abstract math. She loves to test new AI tools, talk to developers, and write a blog in her free time.

Category: LLMs

Abstract:
We are in the era of LLM-powered software. The AI-powered apps and products being built require very specific tools. We will talk about building infrastructure for AI agents and apps, and what challenges we need to overcome if we want to make agents more reliable. We introduce different types of code execution environments for agents, consider advantages and disadvantages, and talk about important questions like security or scalability.

Talk: Reward Function As A Service: A (Relatively) Easy Recipe for Training Your Own Reasoning Model

Presenter:
Ville Tuulos, CEO / Co-Founder, Outerbounds

About the Speaker:
Ville Tuulos has been developing infrastructure for ML and AI for over two decades. He has worked as an ML researcher in academia and as a leader at a number of companies, including Netflix where he led the ML infrastructure team that created Metaflow, a popular open-source framework for ML infrastructure. He is a co-founder and CEO of Outerbounds, a developer-friendly platform for real-world ML/AI systems.

Category: Quality Tuning

Abstract:
In this session, we will walk through a template that allows you to post-train a Deepseek-style LLM with your own reward functions. In contrast to traditional supervised fine-tuning, this approach allows you to customize high-performing LLMs for domains that are amenable to programmatic evaluation through simple Python functions. This technique is commonly applied to math and coding, but as shown in this session, with some creativity you can apply the template to many other domains as well.
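A reward function for a programmatically checkable domain can be very small, as in this sketch for math-style answers. The "Answer: <number>" output format is an illustrative assumption here, not a convention from the talk.

```python
# Sketch of a programmatic reward for math-style post-training: the model
# is prompted to end with "Answer: <number>", and the reward compares the
# parsed value against a known solution, with a small format bonus.
import re

def math_reward(response: str, correct_answer: float) -> float:
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    if match is None:
        return 0.0                     # no parseable final answer
    value = float(match.group(1))
    # partial credit for a parseable but wrong answer keeps the
    # format signal alive early in training
    return 1.0 if abs(value - correct_answer) < 1e-6 else 0.1

print(math_reward("Let x = 7 + 5. Answer: 12", 12))  # 1.0
```

Because the check is an ordinary function rather than labeled data, the same recipe extends to any domain where correctness can be verified in code, which is the session's central point.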

Talk: Closing the Reliability Gap: Practical Strategies with Guarantees for Trustworthy GenAI

Presenter:
Curtis Northcutt, CEO & Co-Founder, Cleanlab

About the Speaker:
Curtis Northcutt is an American computer scientist, artist, and entrepreneur focusing on using machine learning and artificial intelligence to empower people. Curtis completed his PhD at MIT where he invented Cleanlab’s algorithms.

He is the CEO and Co-Founder of Cleanlab, used by 80+ of the top Fortune-500 companies to “detect, observe, and immediately resolve RAG and LLM failures—such as hallucinations, retrieval failures, and knowledge gaps—in real time.” Curtis is the recipient of the MIT Morris Levin Thesis Award, the NSF Fellowship, and the Goldwater Scholarship, and has worked at several leading AI research groups, including Google, Oculus, Amazon, Facebook, Microsoft, and NASA.

Category: Quality Tuning

Abstract:
As GenAI becomes embedded in production systems across industries, reliability challenges are surfacing that threaten its long-term value. From simple factual errors—like misidentifying the third month of the year alphabetically—to complex reasoning failures, these issues undermine user trust and limit the effectiveness of AI applications. In this talk, we will survey the current landscape of AI reliability, including the latest research and tools for identifying and classifying misbehaviors of AI systems. We will delve into how these tools fit in to cutting-edge architectures, such as RAG and Agentic systems. We will conclude by discussing strategies for remediation, including our vision for how AI can be designed to collaborate effectively with humans, delivering optimal outcomes with minimal oversight. Join us to learn how the future of AI hinges not just on capability, but on dependable, trustworthy systems.

Talk: Building a Deep Research with Open-Source

Presenter:
Stefan Webb, Developer Advocate, Zilliz

About the Speaker:
Stefan Webb is a Developer Advocate at Zilliz, where he advocates for the open-source vector database, Milvus. Prior to this, he spent three years in industry as an Applied ML Researcher at Twitter and Meta, collaborating with product teams to tackle their most complex challenges. Stefan holds a PhD from the University of Oxford and has published papers at leading machine learning conferences such as NeurIPS, ICLR, and ICML. He is passionate about Generative AI and is eager to leverage his deep technical expertise to contribute to the open-source community.

Category: LLMs

Abstract:
Unless you live under a rock, you will have heard about OpenAI’s release of Deep Research on Feb 2, 2025. This new product promises to revolutionize how we answer questions requiring the synthesis of large amounts of diverse information. But how does this technology work, and why is Deep Research a noticeable improvement over previous attempts? In this webinar, we will examine the concepts underpinning modern agents, especially research agents, using our open-source clone DeepSearcher as an example.

Get Notified!

Registration Opening Soon!