My Google Summer of Code Experience - Francisco Amoros Cubells

Integrating Vision Large Language Models into Anomalib

During my time as a Software Engineer - Machine Learning in the Google Summer of Code program (May 2024 - Aug 2024), I worked on an exciting project with OpenVINO's Anomalib. My main focus was on integrating Vision Large Language Models (VLLMs) into the Anomalib framework to achieve Zero/Few Shot anomaly detection models.

Project Goals

Implement VLLM integration within the Anomalib framework
Develop Zero-Shot and Few-Shot learning capabilities for anomaly detection
Optimize the performance of the integrated models
Create comprehensive documentation and examples for future users

Key Achievements

Successfully integrated state-of-the-art VLLMs into Anomalib
Developed a flexible architecture that supports both Zero-Shot and Few-Shot learning paradigms
Achieved significant improvements in anomaly detection accuracy compared to traditional methods
Contributed to the open-source community by making the integration publicly available
Created extensive documentation, including tutorials and code examples

Challenges and Learning

Throughout the project, I encountered specific challenges related to model performance. The OpenAI ChatGPT model worked effectively out of the box, delivering satisfactory results. However, the open-source models presented significant difficulties—they were not trained to handle multiple images effectively or did not perform well in the tasks required. These challenges highlighted the limitations of current open-source models in comparison to proprietary solutions and underscored the importance of continued development and training for such models.

Code and Documentation

You can find the code I worked on during this project at the following repositories:

Code Not Merged

During the project, several models were explored but ultimately not merged due to performance issues. Below are the details of these models and the reasons they were not integrated into the main branch:

LLaVA Model - This model demonstrated poor performance and was unable to handle multi-shot scenarios effectively.
LLaVA Next Model - Although this model performed better in zero-shot scenarios, it still struggled with multi-shot tasks, leading to its exclusion from the final integration.
Ollama Wrapper - A wrapper for the Ollama model zoo, which similarly lacked the capability to handle multi-shot scenarios, resulting in suboptimal performance.

What's Left to Do

While significant progress has been made, there are still several tasks left to complete:

Real-Time Inference Enhancements: Additional work is required to reduce latency and improve the speed of the anomaly detection models in real-time applications.
Extended Testing and Validation: Comprehensive testing across a broader range of datasets and anomaly types is needed to validate the robustness and generalizability of the models.
User Documentation and Tutorials: Although initial documentation has been created, more in-depth tutorials and guides are needed to help users fully leverage the new features.
Community Feedback and Iteration: Engage with the open-source community to gather feedback, address issues, and iteratively improve the integration based on real-world use cases.

Overall, my Google Summer of Code experience was incredibly rewarding, allowing me to contribute to cutting-edge machine learning technology while collaborating with talented developers from around the world.