Model Card: Scentience-OVLE-Large-v1

We offer a foundational series of larger multimodal joint embedding models trained on olfaction, vision, and language data. These models are built specifically for prototyping and exploratory tasks within AR/VR, robotics, and embodied artificial intelligence. Analogous to how CLIP and SigLIP embeddings capture vision-language relationships, our embedding models capture olfaction-vision-language (OVL) relationships.
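
Because all three modalities are projected into a shared space, the embeddings can be compared with simple cosine similarity, in the same spirit as CLIP-style retrieval. The sketch below is illustrative only: the random vectors are placeholders for outputs of the olfaction, vision, and language encoders (this card does not document a loading API), and the 512-dimensional width is an assumption rather than a documented property of the model.

    # Minimal sketch of CLIP-style cross-modal matching with joint OVL embeddings.
    # The random vectors below are placeholders for encoder outputs; the embedding
    # dimension of 512 is an assumption, not a documented property of this model.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 512  # assumed embedding dimension

    image_emb = rng.normal(size=(4, dim))   # 4 scene images
    odor_emb = rng.normal(size=(4, dim))    # 4 odor/chemical signatures
    text_emb = rng.normal(size=(6, dim))    # 6 odor descriptors, e.g. "smoky", "floral"

    def l2_normalize(x):
        """L2-normalize rows so dot products become cosine similarities."""
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    image_emb, odor_emb, text_emb = map(l2_normalize, (image_emb, odor_emb, text_emb))

    # Pairwise cosine-similarity matrices between modalities.
    image_vs_text = image_emb @ text_emb.T   # match images to odor descriptors
    odor_vs_text = odor_emb @ text_emb.T     # match odor signatures to descriptors
    image_vs_odor = image_emb @ odor_emb.T   # match images to odor signatures

    print("Best descriptor per image:", image_vs_text.argmax(axis=1))
    print("Best descriptor per odor: ", odor_vs_text.argmax(axis=1))
    print("Best odor per image:      ", image_vs_odor.argmax(axis=1))

With real encoder outputs in place of the placeholders, the same argmax calls implement zero-shot odor descriptor classification and cross-modal retrieval.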

Please note that all Scentience machine learning models are for research purposes only. Scentience makes no claims of performance beyond what is reported in the model cards, nor of suitability for any specific application.

For more information on Scentience privacy and data policies, please refer to the Scentience Privacy Policy.

Model Details


Intended Use


Training Data

For more information on how the training data was collected, please see the Hugging Face dataset URL here.


Evaluation


Limitations of Evaluation

To the best of our knowledge, no open-source datasets currently provide aligned olfactory, visual, and linguistic annotations. A "true" multimodal evaluation would require measuring the chemical composition of scenes (e.g., with gas chromatography-mass spectrometry) while simultaneously capturing images and collecting perceptual descriptors from human olfactory judges. Such a benchmark would demand substantial new data collection and instrumentation. Consequently, we evaluate our models indirectly, using surrogate metrics (e.g., cross-modal retrieval performance, odor descriptor classification accuracy, clustering quality). While these evaluations do not provide ground-truth verification of odor presence in images, they offer a first step toward demonstrating alignment between modalities. We draw an analogy to earlier machine learning efforts, such as precursors to CLIP, which also lacked large paired datasets and were evaluated on retrieval-like tasks. We therefore release this model to catalyze further research and encourage the community to help build standardized datasets and evaluation protocols for olfaction-vision-language learning.
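
As a concrete illustration of one of these surrogate metrics, the sketch below computes cross-modal Recall@k from paired embeddings. It assumes the i-th query and the i-th gallery item form a ground-truth pair; the random vectors stand in for OVLE encoder outputs and are not real evaluation data.

    # Minimal sketch of cross-modal retrieval evaluated as Recall@k.
    # Assumes query_emb[i] and gallery_emb[i] form a ground-truth pair.
    import numpy as np

    def recall_at_k(query_emb, gallery_emb, k=5):
        """Fraction of queries whose true match (same index) appears in the top-k."""
        q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
        g = gallery_emb / np.linalg.norm(gallery_emb, axis=-1, keepdims=True)
        sims = q @ g.T                             # cosine-similarity matrix
        topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the k nearest gallery items
        hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
        return float(hits.mean())

    # Placeholder embeddings standing in for OVLE encoder outputs.
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(100, 512))
    odor_emb = image_emb + 0.1 * rng.normal(size=(100, 512))  # noisy "paired" modality
    print("image -> odor Recall@5:", recall_at_k(image_emb, odor_emb, k=5))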


Limitations


Ethical Considerations


Environmental Impact


Citation

If you use this model, please cite:

    @misc{france2025ovlembeddings,
        title = {Scentience-OVLE-Base-v1: Joint Olfaction-Vision-Language Embeddings},
        author = {Kordel Kade France},
        year = {2025},
        howpublished = {Hugging Face},
        url = {https://huggingface.co/kordelfrance/Olfaction-Vision-Language-Embeddings}
    }

    @misc{radford2021clip,
        title = {Learning Transferable Visual Models From Natural Language Supervision},
        author = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
        year = {2021},
        eprint = {2103.00020},
        archivePrefix = {arXiv},
        primaryClass = {cs.CV},
        url = {https://arxiv.org/abs/2103.00020}
    }

    @misc{zhai2023siglip,
        title = {Sigmoid Loss for Language Image Pre-Training},
        author = {Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
        year = {2023},
        eprint = {2303.15343},
        archivePrefix = {arXiv},
        primaryClass = {cs.CV},
        url = {https://arxiv.org/abs/2303.15343}
    }

License

This model is released under the MIT License.