
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Analysis of multimodal cultural heritage
Date of reporting: 20-11-2025
Report author: Ilkka Lähteenmäki (University of Oulu)
Contributor: Ilkka Lähteenmäki (University of Oulu)
Deliverable location: 10.5281/zenodo.17700648
This paper examines, from the point of view of social epistemology, whether historians and cultural heritage researchers can justifiably depend on multimodal AI systems for accessing large visual collections. Building on Inkeri Koskinen’s “necessary trust view” and Jakob Ortmann’s account of task-specific epistemic reliance, it argues that digital history and cultural heritage form an atypical setting for the current social epistemology of AI. In contrast to the physical sciences, where AI tools such as AlphaFold are embedded in long-standing evaluation regimes and well-defined tasks, historical research involves open-ended, exploratory questions, fuzzy and historically shifting concepts, and interpretive practices centred on individual researchers and small teams.
The paper draws on examples from recent proposals for using multimodal AI for text-to-image, image-to-text and image-to-image retrieval, and for AI-assisted metadata generation and “distant viewing” of images. It shows how hopes for a multimodal turn in digital humanities confront the essential epistemic opacity of deep neural networks and the difficulty of evaluating reliability for complex, open-ended retrieval tasks. Three suggested mitigation strategies are discussed: critical analyses of models and training data; historically informed reflection on bias and concept change; and fine-tuning or post-processing of models for specific purposes. From a social epistemology perspective, each strategy encounters limits when generalised to research infrastructure meant to support many corpora, tasks and user communities.
The paper then turns to approaches that argue for using multimodality theory to design metadata schemas and guide AI-based annotation. It shows how this is an attempt to shift epistemic trust, at least partially, from AI systems back to scholars in an effort to make use of the developing technology. However, this reopens old debates between theories of meaning. With image data in particular, the questions of how the meanings of images should be established, and whether such theories can be implemented in computational models, need to be explored. A couple of examples from contemporary photography and medieval manuscript research illustrate both the potential of AI-supported exploration and the need for additional contextual and theoretical work to render outputs historically interpretable.
The central claim is that, given the essential epistemic opacity of AI, justified epistemic dependence in history and cultural heritage research currently appears to need to be organised around situated, task-specific, and accountable uses of multimodal models rather than around general-purpose models. The options open to research infrastructures for establishing trust are therefore either to build mechanisms for task-specific reliability assessment, or to embed trusted, identifiable human agents or institutions between users and models.
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
