BDRC Launches Major Initiative to Build Open Buddhist Datasets for AI

BDRC is working to ensure that artificial intelligence (AI) is better informed about Buddhism. In December 2025, with generous financial support and inspiration from Khyentse Foundation, BDRC launched a major project to create open-source Tibetan Buddhist datasets for AI. For over three years we have shared on our blog a series of AI initiatives undertaken by BDRC, including the development of Tibetan OCR models and public-facing tools designed to benefit the broader scholarly and Buddhist communities, alongside applications that have improved our own internal workflows. Building on those experiences, and on the rapid developments in the wider AI field, we are now leading one of the largest AI-focused initiatives in the Buddhist world.

This project contributes authoritative Buddhist source material directly into the data ecosystems that shape tomorrow's AI systems. By preparing millions of pages of Tibetan Buddhist writings in machine-readable form and publishing them on platforms used by major AI developers, we are helping ensure these systems are trained on authentic primary sources. At the same time, this work expands and strengthens the BDRC archive. In the years ahead, when scholars, practitioners, or curious readers ask AI systems about Buddhist philosophy, historical figures, or Tibetan terminology, they will be far more likely to receive accurate and textually grounded answers. In short, we aim to ensure that everyday AI tools reflect authentic Buddhist sources rather than partial or distorted representations.

Tibetan and international team members train together in cutting-edge techniques that are bringing centuries of Buddhist literature into the AI age.

Khyentse Foundation and BDRC share the conviction that, in a moment when AI is increasingly shaping how knowledge is accessed and interpreted, we must focus on what we can responsibly control: the quality and integrity of the sources that enter these systems. For this reason, the grant centers on the creation of a foundational open-access corpus representing a standardized and cross-validated digital version of each unique Tibetan Buddhist text, fully prepared for integration into modern AI systems. By producing tens of thousands of reliable, standardized e-texts drawn from a comprehensive collection of Tibetan-language Buddhist works, we aim to ensure that AI systems encounter Buddhism through reliable, carefully prepared primary sources.

To achieve this, the project brings together two complementary streams of work: carefully prepared manual transcriptions, which provide the highest level of textual precision and serve as trusted reference points, and large-scale Optical Character Recognition (OCR), which unlocks millions of scanned pages from BDRC's archive by transforming them into searchable digital text. The outputs of these two workflows will be systematically cataloged, compared, cross-validated, and merged (combining manual input, OCR outputs, and multiple editions of the same work) to generate accurate versions of each unique text. The result will be the largest and most reliable open-access corpus of Tibetan Buddhist e-texts ever assembled, fully prepared to ground modern AI systems in authentic primary sources.

Beyond building this foundational corpus, the project also includes:

  • Significant improvements to Tibetan OCR technology, including the development of rigorous evaluation benchmarks and expanded training datasets to steadily increase accuracy across different scripts and print styles.
  • The creation of a detailed text-level catalog, identifying individual works within volumes, merging duplicate versions, and enriching each text with structured bibliographic metadata.
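To give a sense of what an OCR evaluation benchmark measures, the sketch below computes a character error rate (CER): the number of single-character edits needed to turn the OCR output into a trusted reference transcription, divided by the length of the reference. This is only an illustration of a common metric, not BDRC's actual benchmark code; the function names are hypothetical, and counting Unicode code points after NFC normalization is an assumption made here for simplicity.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edits needed to fix the OCR output, per reference character."""
    # Assumption: normalize both sides so equivalent encodings compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    manual = "བཀྲ་ཤིས་བདེ་ལེགས"   # trusted manual transcription
    ocr = "བཀྲ་ཤས་བདེ་ལེགས"       # OCR output missing one vowel sign
    print(f"CER: {character_error_rate(manual, ocr):.3f}")
```

A lower CER means the OCR output is closer to the reference. A real benchmark would likely also report results separately by script and print style, since accuracy can vary widely between them.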

In just the first two months of the grant, substantial progress has already been made. The Gold Standard corpus has grown from 1.9GB to 3.4GB of normalized XML files. More than 26 million Tibetan images have been processed through OCR pipelines, yielding an estimated 10% improvement in accuracy. Nearly 35,000 manuscript and woodblock images have been aligned with high-quality transcriptions to train the next generation of models. We have laid the technical foundation for significant advances in the months ahead.

The project is led by BDRC's Chief Technology Officer, Élie Roux, who serves as Principal Investigator and directs its technical architecture, OCR development, and long-term digital strategy, in close coordination with Khyentse Foundation's Wisdom and AI Committee.

BDRC is working on this project with experts and technicians around the globe. Our central partner is Dharmaduta ("Dharma Messenger"), a Tibetan tech startup in India. Dharmaduta is a Tibetan-founded, mission-driven technology organization dedicated to applying advanced digital tools to the preservation and flourishing of Tibetan Buddhist knowledge. Their team combines Tibetan language expertise with modern software engineering, AI development, and large-scale annotation management. Over the past several years, they have become one of the leading groups working at the intersection of Tibetan language, OCR, and AI.

Training OCR to read scribal abbreviations: The Dharmaduta and BDRC teams follow detailed annotation guidelines to transcribe manuscript contractions exactly as written, preserving the complex letter stacks that have long challenged OCR systems.

The final release of the datasets will be placed in the primary repositories from which commercial AI systems derive their training data, most notably Hugging Face. It is our aspiration that these systems will ingest and "train on" the Buddhist writings we make available, so that in the near future they will be more accurate and informed in conversations and research queries about topics in Buddhism. At the same time, this work will help advance Tibetan-language AI more broadly by providing high-quality training material at a scale never before available.

Importantly, the datasets and OCR models will be released into the public domain and will therefore be freely available to independent, community-led Buddhist and Tibetan AI initiatives developing language models and other innovative tools. Translators, scholars, and publishers will also find many productive uses for the datasets, from search and textual comparison to new forms of large-scale literary and historical analysis.

The grant runs through the summer of 2027, and you can expect many exciting announcements over the next year and a half as major milestones are reached. We are deeply grateful to Dzongsar Khyentse Rinpoche and Khyentse Foundation for bringing BDRC into their vast vision for Buddhist AI, and for empowering us to grow as an organization and contribute meaningfully to this era-defining technology.
