Tibetan Search Enhancements: Part 2

January 26, 2021

BDRC's users are pleased with the accuracy and nimbleness of the BDRC search engine. When querying longer strings of text in either Wylie or Unicode, the search engine is remarkably intuitive. This intelligence comes from the engine being somewhat lenient with transliterations and flexible with grammatical particles, in addition to other nuances. These helpful features were custom-built by the BDRC team, which developed linguistically informed algorithms in-house. A major update to the search engine was released in 2014, and today we are releasing a second major improvement.

This post will assume the reader is at least somewhat familiar with the system described in the 2014 post Tibetan Search Enhancements.

Changes from tbrc.org

One of the more knotty challenges in Tibetan search is to create consistent interchangeability between Unicode and transliteration. Searching for the same term in Unicode and transliteration on tbrc.org, our legacy site, would sometimes return different search results. These inconsistencies were especially noticeable when searching on our corpus of e-texts because the e-texts were created in Unicode while the rest of the database was encoded in Wylie transliteration. For example, compare the search results on tbrc.org for པདྨ་གླིང་པ and padma gling pa. They are generally similar except in the e-texts where:

པདྨ་གླིང་པ has 92 results in e-texts
padma gling pa has 0 results in e-texts
pad+ma gling pa has 92 results in e-texts

This is due to the original data being in Unicode and containing པདྨ (= pad+ma) and not པདམ (= padma). Thus a search on padma (= པདམ) returns no results.

In 2020, BDRC launched its entirely new web platform known as BUDA, Buddhist Digital Archives. BUDA was built from the ground up, and one of its many changes is a new way of processing Tibetan text. We now index everything in Unicode so that we can perform only one search (for the sake of performance and consistency). This creates some issues because we can no longer ignore the + sign as we did on tbrc.org. So "sloppy" transliterations such as padma (= པདམ, not the intended པདྨ) are not always equivalent to their correct form. We have identified a few forms (padma, ratna, pandi, etc.) that we normalize to their intended form, but it's best to mark the + in your search if you want to make sure you get exactly what you need.
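
The normalization of known "sloppy" forms can be pictured as a small lookup applied to the query before it is converted to Unicode. The sketch below is illustrative only: the table, the function name `normalize_sloppy`, and the `ratna` target are assumptions, and the post does not say how BUDA actually implements this step. Only `padma` → `pad+ma` comes directly from the text above.

```python
import re

# Hypothetical lookup table. Only padma -> pad+ma is spelled out in the post;
# the ratna entry is an assumption added for illustration.
SLOPPY_FORMS = {
    "padma": "pad+ma",
    "ratna": "rat+na",
}

# Match the sloppy forms only as whole transliterated syllable groups.
_SLOPPY_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, SLOPPY_FORMS)) + r")\b"
)


def normalize_sloppy(query: str) -> str:
    """Replace known sloppy transliterations with their intended form."""
    return _SLOPPY_RE.sub(lambda m: SLOPPY_FORMS[m.group(0)], query)


print(normalize_sloppy("padma gling pa"))
```

A query that already marks the + is left untouched, so both spellings end up as the same string before indexing or searching.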

The transliteration is also always lower-cased on tbrc.org, but because of the new search engine on BUDA, capitalization is now handled differently. For instance, in order to make Na (= ཎ) and na (= ན) yield the same search results, we have to normalize their Unicode characters: ཎ → ན, which changes the search in Unicode. This also affects the vowels: in order to make A (= ཨཱ) and a (= ཨ) equivalent in the search, we have to ignore the a-chung, even in Unicode.
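
Because everything is indexed in Unicode, this kind of equivalence has to be expressed as a folding of Unicode characters. A minimal sketch, assuming a simple character table applied to both the indexed text and the query (the table and the name `fold_unicode` are hypothetical, and only the two cases named above are covered):

```python
# Hypothetical folding table covering only the two cases described above.
FOLD_TABLE = {
    0x0F4E: 0x0F53,  # ཎ (TIBETAN LETTER NNA) -> ན (TIBETAN LETTER NA)
    0x0F71: None,    # subjoined a-chung (TIBETAN VOWEL SIGN AA) is dropped
}


def fold_unicode(text: str) -> str:
    """Fold characters that the search treats as equivalent."""
    return text.translate(FOLD_TABLE)


# ཨཱ (U+0F68 U+0F71) and ཨ (U+0F68) fold to the same string.
print(fold_unicode("\u0f68\u0f71") == fold_unicode("\u0f68"))
```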

While on tbrc.org lower case is the default, BUDA adds two exceptions for certain forms of M and H because their results are just too different to be considered equivalent:

dMar → དམར (the M is lower-cased and converted to མ), but
aM → ཨཾ (the M is kept upper-case and converted to an anusvara)
lHan → ལྷན (the H is lower-cased and converted to ཧ), but
kaH → ཀཿ (the H is kept upper-case and converted to a visarga)
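
One way to reproduce the four examples above is to lower-case everything except an M or H that directly follows a vowel. This rule is an assumption inferred from the examples, not a description of BUDA's actual logic, and the name `fold_case` is hypothetical:

```python
VOWELS = set("aeiou")


def fold_case(translit: str) -> str:
    """Lower-case a transliteration, keeping M/H when they follow a vowel.

    Assumed rule: an upper-case M or H directly after a vowel stands for
    anusvara or visarga and is preserved; every other letter is lower-cased.
    """
    out = []
    for i, ch in enumerate(translit):
        prev = translit[i - 1].lower() if i > 0 else ""
        if ch in ("M", "H") and prev in VOWELS:
            out.append(ch)
        else:
            out.append(ch.lower())
    return "".join(out)


for form in ("dMar", "aM", "lHan", "kaH"):
    print(form, "->", fold_case(form))
```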

Two other small changes have been implemented:

tbrc.org returned different search results in Unicode depending on whether a two-syllable query ended with a tshek (ex: "སྤྱོད་འཇུག" vs. "སྤྱོད་འཇུག་"); this has been fixed in BUDA
tbrc.org considered kaHthog and kaH thog to be two different forms; BUDA treats them as the same form

Finally, tbrc.org searches on all combinations of syllable order, which leads to more results but causes performance issues and returns irrelevant search results. BUDA only searches on the given syllables, optionally separated by another one (ex: སྤྱོད་པ་འཇུག་པ will find སྤྱོད་པ་ལ་འཇུག་པ). This leads to fewer but more precise results.
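
The matching behavior described here resembles an ordered phrase match with a small allowed gap. The sketch below assumes that at most one extra syllable may appear between any two consecutive query syllables; the post does not specify whether the allowance is per gap or per phrase, and the function names are hypothetical. Splitting on the tshek and dropping empty pieces also makes a trailing tshek irrelevant.

```python
TSHEG = "\u0f0b"  # the tshek, the Tibetan syllable separator


def syllables(text: str) -> list[str]:
    """Split on the tshek, ignoring empty pieces such as a trailing one."""
    return [s for s in text.split(TSHEG) if s]


def phrase_matches(query: str, text: str, max_gap: int = 1) -> bool:
    """True if the query syllables occur in order in the text, with at most
    max_gap extra syllables between consecutive query syllables."""
    q, t = syllables(query), syllables(text)
    if not q:
        return False

    def match_from(qi: int, ti: int) -> bool:
        if qi == len(q):
            return True
        # Try the next query syllable at ti, or after skipping up to max_gap.
        for gap in range(max_gap + 1):
            pos = ti + gap
            if pos < len(t) and t[pos] == q[qi] and match_from(qi + 1, pos + 1):
                return True
        return False

    return any(
        t[start] == q[0] and match_from(1, start + 1) for start in range(len(t))
    )


# The example from the post: the query lacks the particle found in the text.
print(phrase_matches("སྤྱོད་པ་འཇུག་པ", "སྤྱོད་པ་ལ་འཇུག་པ"))
```

Because the syllable order is fixed, a query with the same syllables in a different order does not match, which is the source of the "fewer but more precise" results.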

Old Tibetan / Archaic forms normalization

In order to index more Old Tibetan on BUDA, we added rules to handle it properly. The new search normalizes most of the patterns that have been listed in the context of Faggionato, C. & Garrett, E., Constraint Grammars for Tibetan Language Processing. The list of patterns can be found here.

Here are a few examples of normalization:

གྱུརད → གྱུར (remove second suffix ད)
མདོའ → མདོ (remove archaic use of suffix འ)
གའས → གས (remove archaic medial འ)
མྱེད → མེད (ma yata with gigu or drengbu)
གསྩན → གསན (སྩ variant, in some cases)
གྀ → གི (ignore gigu direction)
སྑ → སྐ (normalize aspiration when possible)
དྲངསྟེ → དྲངས་ཏེ (split syllables)
པགི → པག་གི (split syllables)
བཀུམོ → བཀུམ་མོ (split syllables)
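
A few of these rules can be approximated with regular expressions over Unicode code points. The sketch below is a deliberately conservative simplification, not BUDA's rule set: it covers only four of the patterns, the second-suffix and final-འ rules fire only after an explicit vowel sign, and the name `normalize_archaic` is hypothetical.

```python
import re

# Each rule is (pattern, replacement); the order matters, because the
# reversed gigu is normalized before the rule that looks for a gigu.
ARCHAIC_RULES = [
    # གྀ -> གི : reversed gigu (U+0F80) becomes the regular gigu (U+0F72)
    (re.compile("\u0f80"), "\u0f72"),
    # མྱེད -> མེད : drop ya-btags (U+0FB1) after མ before gigu or drengbu
    (re.compile("(?<=\u0f58)\u0fb1(?=[\u0f72\u0f7a])"), ""),
    # གྱུརད -> གྱུར : drop a second suffix ད after vowel sign + ན/ར/ལ
    (re.compile("(?<=[\u0f72\u0f74\u0f7a\u0f7c][\u0f53\u0f62\u0f63])\u0f51(?=\u0f0b|$)"), ""),
    # མདོའ -> མདོ : drop a syllable-final འ that directly follows a vowel sign
    (re.compile("(?<=[\u0f72\u0f74\u0f7a\u0f7c])\u0f60(?=\u0f0b|$)"), ""),
]


def normalize_archaic(text: str) -> str:
    """Apply the simplified archaic-form rules in order."""
    for pattern, replacement in ARCHAIC_RULES:
        text = pattern.sub(replacement, text)
    return text


# མྱེད (U+0F58 U+0FB1 U+0F7A U+0F51) normalizes to མེད (U+0F58 U+0F7A U+0F51).
print(normalize_archaic("\u0f58\u0fb1\u0f7a\u0f51") == "\u0f58\u0f7a\u0f51")
```

Requiring an explicit vowel sign keeps regular syllables such as གནད and བཀའ untouched, at the cost of missing archaic forms whose root carries the inherent vowel. The syllable-splitting rules need a proper syllable parser and are left out.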

Verb stemming

Verb morphology in Tibetan is interesting to normalize for two reasons:

there is a lot of variation in the tense of verbs for the same phrase, especially between older and newer texts
morphological variants are often homophones, and thus prone to misspellings

The new BUDA search engine normalizes many verb forms, based on the work by Nathan Hill described in "A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition" (2010, Munich: Bayerische Akademie der Wissenschaften, ISBN 978-3-7696-1004-8). The list is derived from this digital version, with very minor adjustments and reformatting. This means that searching for a phrase with སྐྱོབས will yield the same results as searching for the same phrase with སྐྱོབ.
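
Conceptually, this kind of stemming is a per-syllable lookup applied to both the indexed text and the query. A minimal sketch, assuming a dictionary from variant forms to one chosen form: the single entry is the example above, which form is treated as canonical is an assumption, and the names are hypothetical. The real list is derived from Hill's lexicon.

```python
TSHEG = "\u0f0b"  # the tshek, the Tibetan syllable separator

# Hypothetical one-entry excerpt; which form is canonical is an assumption.
VERB_STEMS = {
    "སྐྱོབས": "སྐྱོབ",
}


def stem_phrase(phrase: str) -> str:
    """Replace each syllable that is a known verb variant with its stem."""
    return TSHEG.join(VERB_STEMS.get(s, s) for s in phrase.split(TSHEG))


# Both spellings reduce to the same string, so they return the same results.
print(stem_phrase("སྐྱོབས") == stem_phrase("སྐྱོབ"))
```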

More ideas?

If you want to give us feedback on any aspect of this new search, or on forms that you feel should also be normalized, please contact us at help@bdrc.io.