By Élie Roux
Tibetan Alphabetical Order in Computers
Many of us take for granted that with the click of a single button we can bring order to an unruly list by sorting it in alphabetical order, numerical order, chronological order, or various other sequences. On a spreadsheet, for instance, a long column of names in Romance languages can be instantly alphabetized. Such an operation is not currently possible for Tibetan-language content in applications such as Excel or Google Docs. Those of you who know Tibetan can try this at home: take a spreadsheet containing words in Tibetan and select the alphabetical order display option. The results won't be pretty. This limitation of Tibetan computing will soon change thanks to an effort by BDRC to encode Tibetan alphabetical sorting rules and to have these rules approved by the international standards community. This will encourage widespread adoption by Google, Apple, Microsoft, and others. We'd like to share with you here the research, innovations, and networking behind this breakthrough.
Dr. Tashi Samphel, Director of Songtsen Library in Dehradun, India
The sorting in most software is driven by language-specific algorithms (called collation rules) that are standardized in the Unicode Common Locale Data Repository (CLDR). As of October 2021, CLDR didn't have any rules for Tibetan. To remedy this situation, BDRC is pleased to announce that we have successfully added Tibetan Collation Rules to the CLDR. This step will bring Tibetan closer to being fully supported on websites and smartphones, and will improve the digital experience of Tibetans worldwide.
The CLDR hosts international standards that are shared among all operating systems, apps, websites, etc. The new Tibetan Collation Rules (Tibetan alphabetical ordering, in other words) were released on October 27th, 2021, enabling these rules to become part of the general software environment in the following months and years.
Many websites and apps display lists of Tibetan words (names, titles, dictionary entries, etc.) and need to sort them in a way that users can navigate easily. Unlike with English, which simply sorts character by character, Tibetan alphabetical ordering is complicated by prefixes, superscripts, suffixes, Sanskrit features, etc. In order for software to fully support the Tibetan language, rules representing the widely-used Tibetan alphabetical order found in modern dictionaries needed to be converted into computer code and added to the CLDR, the repository of international standards.
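To make the problem concrete, here is a minimal Python sketch showing what happens when Tibetan strings are sorted character by character, which is what Python's built-in `sorted` does with plain strings. The two word pairs and their expected dictionary order are taken from the examples discussed later in this post; the variable names are only illustrative.

```python
# Each pair is listed in the dictionary order described in this post.
pairs = [
    ["མགྲིན", "མག"],  # dictionary order: མགྲིན < མག (main consonant ག comes before མ)
    ["མགོ", "བསྒ"],   # dictionary order: མགོ < བསྒ (no superscript comes before superscript ས)
]

# Python compares strings code point by code point, so the prefix letter
# (the first character) decides the result instead of the main consonant.
naive_results = [sorted(pair) for pair in pairs]

for dictionary_order, naive_order in zip(pairs, naive_results):
    print("dictionary:", dictionary_order, "| character by character:", naive_order)
```

In both cases the character-by-character result is the reverse of the dictionary order, which is why dedicated collation rules are needed.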
This achievement is the result of many years of documentation, testing, and patient effort, as well as key support from a network of dedicated language hackers. Without the help of these supporters, whom we thank in the acknowledgements, this achievement would not have been possible.
This effort is part of BDRC's practice of contributing back to the Buddhist communities in Asia who produced the precious texts we bring online. Stay tuned for more such contributions to international standards, such as support for Tibetan calendars and resources for Khmer, Burmese, and other languages important to Buddhism.
The rest of this post will dive into the historical origins of the rules we have implemented, as well as a few technical details.
Tibetan Alphabetical Order in History
Since the modern Tibetan alphabetical order is the only one encountered in the dictionaries in use today, it can be tempting to consider it an original feature of the Tibetan language. As it turns out, throughout history there have in fact been various types of lexicographical organization, with alphabetical order starting to be used only in the 15th century. The initial variety of alphabetical orders then settled into the order we use today, which became prevalent in the 20th century.
See A Brief History of the Tibetan Alphabetical Order for more on this topic.
Computer Implementation
The Tibetan alphabetical order requires an analysis of each syllable in order to find its various components, which makes it tricky to implement. The rules emulate a detection of the parts of the syllable and use a multi-weight algorithm to sort in the following order of precedence:
- main consonant
- superscript
- prefix
- subscript
- vowel
- suffix
Here are some examples:
- comparing མགྲིན and མག: in the first one, the main consonant is ག; in the second one, the main consonant is མ; they differ on the first weight, so we have མགྲིན < མག due to the order of the main consonant
- comparing བསྒ and མགོ: main consonants are the same (ག), so we compare the superscripts. The first one has ས, the second has no superscript, so according to the order of superscripts (no superscript < superscript ས) we have མགོ < བསྒ
- comparing འགྲ and འགྱི: main consonants are the same (ག), superscripts are the same (no superscript), prefixes are the same (འ), so we compare the fourth weight (subscripts). The first has subscript ར, the second has subscript ཡ, so according to the order of subscripts, we have འགྱི < འགྲ
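The order of precedence and the three comparisons above can be sketched in Python as a tuple-based sort key. This is only an illustration under stated assumptions, not the CLDR rules themselves: the syllables are decomposed by hand (the real rules have to emulate that analysis), and the `Syllable` class, the `sort_key` function, and the rank tables are hypothetical names and values introduced for this example.

```python
from dataclasses import dataclass

# Illustrative rank tables (assumptions, not copied from the CLDR rules).
# An empty string means "component absent" and sorts first.
ALPHABET = "ཀཁགངཅཆཇཉཏཐདནཔཕབམཙཚཛཝཞཟའཡརལཤསཧཨ"
SUPERSCRIPTS = ["", "ར", "ལ", "ས"]
PREFIXES = ["", "ག", "ད", "བ", "མ", "འ"]
SUBSCRIPTS = ["", "ཝ", "ཡ", "ར", "ལ"]
VOWELS = ["", "\u0f72", "\u0f74", "\u0f7a", "\u0f7c"]  # implicit a, then i, u, e, o
SUFFIXES = [""] + list(ALPHABET)


@dataclass(frozen=True)
class Syllable:
    """A hand-decomposed syllable; `text` is only a display label."""
    text: str
    main: str
    superscript: str = ""
    prefix: str = ""
    subscript: str = ""
    vowel: str = ""
    suffix: str = ""


def sort_key(s: Syllable) -> tuple:
    """One weight per component, in the order of precedence listed above."""
    return (
        ALPHABET.index(s.main),
        SUPERSCRIPTS.index(s.superscript),
        PREFIXES.index(s.prefix),
        SUBSCRIPTS.index(s.subscript),
        VOWELS.index(s.vowel),
        SUFFIXES.index(s.suffix),
    )


# The six syllables from the examples, decomposed as described in the text.
mgrin = Syllable("མགྲིན", main="ག", prefix="མ", subscript="ར", vowel="\u0f72", suffix="ན")
mag = Syllable("མག", main="མ", suffix="ག")
bsga = Syllable("བསྒ", main="ག", superscript="ས", prefix="བ")
mgo = Syllable("མགོ", main="ག", prefix="མ", vowel="\u0f7c")
gra = Syllable("འགྲ", main="ག", prefix="འ", subscript="ར")
gyi = Syllable("འགྱི", main="ག", prefix="འ", subscript="ཡ", vowel="\u0f72")

ordered = sorted([mgrin, mag, bsga, mgo, gra, gyi], key=sort_key)
print([s.text for s in ordered])
```

Because Python compares tuples element by element, the first weight that differs decides the result, which mirrors the multi-weight comparison described above. The hard part that this sketch leaves out is the automatic decomposition of each syllable.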
Among the many details that were implemented and tested in the process, some are interesting from a computational or linguistic perspective, for instance:
- we had to carefully sort the syllables that can have multiple analyses, e.g. དགས, མངས, etc. (see [Roux2018])
- we sorted the anusvara with མ, although according to the context it should be with ན, ང or ཉ (such context-dependent sorting would not have been possible with the way CLDR rules can be designed)
- loan words (e.g. ཀརྨ, པདྨ, etc.) are sorted in the same order as in [Yisun1985]: ཀར < ཀརྨ < ཀལ
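As a small illustration of the anusvara decision, the following Python sketch folds the anusvara sign into the letter མ before any comparison takes place. It is a simplification under our own assumptions: the function name `fold_anusvara` is hypothetical, and the actual CLDR rules may still distinguish the two characters at a lower comparison level, which this sketch does not.

```python
ANUSVARA = "\u0f7e"   # TIBETAN SIGN RJES SU NGA RO (anusvara)
LETTER_MA = "\u0f58"  # TIBETAN LETTER MA


def fold_anusvara(text: str) -> str:
    """Replace every anusvara with the letter ma, regardless of context."""
    return text.replace(ANUSVARA, LETTER_MA)


# A syllable written with the anusvara becomes the same string as the
# corresponding syllable written with a final ma.
folded = fold_anusvara("\u0f66\u0f7e")
print(folded == "\u0f66\u0f58")
```

A context-sensitive version would have to look at the following consonant to choose between ན, ང, ཉ and མ, which is the kind of logic the post says could not be expressed in the rules.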
These new rules will be part of the next version of CLDR data, coming out by the end of 2021. Over the following months, they will be implemented into the global software ecosystem.
See also
- A repository with some references, tests, a copy of the rules, etc.
- The CLDR v40 collation file / The Pull Request to CLDR
- The integration into glibc
- Some code in JavaScript and Python to sort Tibetan alphabetically
Acknowledgements
We would like to thank everyone who has been involved in this very long international effort:
- John Bray, for his help on Tibetan-Italian dictionaries
- Robert Chilton (ACIP), for his support
- Peter Edberg (Apple/CLDR Committee), for his support
- Mike Fabian (Red Hat), for his help on technical aspects and integration in glibc
- Chris Fynn, for his contributions to digital Tibetan
- Lauran Hartley (Columbia University, BDRC board member) for her review
- Peter Lofting (Apple), for advice and support
- Åke Persson (Mimer), for his support and help with the details of the rules
- Ngawang Trinley and Drupchen (BDRC, Esukhia)
- Dorji Wangchuk (University of Hamburg), for advice
We would also like to express gratitude towards our founder: at the end of [Ruegg1998], Ruegg notes that E. Gene Smith pointed him to two important Tibetan-Sanskrit lexicons which Ruegg couldn't access. These are now available on BDRC ([Namgyal] and [Tenzinwangpo]) and allow us to gain more insight into this topic. This is a testament to the immense benefits of Gene's vision and activity.
References (in Alphabetical Order!)
[Chilton2003] Chilton, Robert. Sorting Unicode Tibetan using a Multi-Weight Collation Algorithm. In Proceedings of Tenth Seminar of the International Association for Tibetan Studies, Oxford University, 2003. https://download.mimer.com/pub/developer/charts/Chilton_slides.pdf
[Flanders2020] Flanders, Judith. A Place for Everything: The Curious History of Alphabetical Order. Basic Books, 2020. See also this review in the Guardian.
[Geylek] Geylek, Pema. Collation in Dzongkha. Retrieved from https://www.dit.gov.bt/sites/default/files/Collation_in_Dzongkha.pdf
[Huang2012] Huang, Heming and Da, Feipeng. Collation of Transliterating Tibetan Characters. In Natural Language Processing and Chinese Computing: First CCF Conference, NLPCC 2012, Beijing, China, October 31-November 5, 2012. Proceedings, 2012, Springer Berlin Heidelberg, http://dx.doi.org/10.1007/978-3-642-34456-5_8
[Jiang2004] Jiang, Di, and Kang, Cai-Jun. The Sorting Mathematical Model and Algorithm of Written Tibetan Language. In Chinese Journal of Computers, volume 4, 2004. http://cjc.ict.ac.cn/eng/qwjse/view.asp?id=1502
[Jiang2006] Jiang, Di. The Current Status of Sorting Order of Tibetan Dictionaries and Standardization. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, 2006. https://aclanthology.org/Y06-1030.pdf
[Roux2018] Roux, Élie, & Hildt, H. D. Algorithmic description of the decomposition and checking of a Classical Tibetan syllable. Himalayan Linguistics, 17(1), 2018. doi:10.5070/H917135529. Retrieved from https://escholarship.org/uc/item/70z8069f
[Roux2022] Roux, É. A Brief History of the Tibetan Alphabetical Order. Revue d'Etudes Tibétaines, no. 63, Avril 2022, pp. 49-61. Retrieved from https://himalaya.socanth.cam.ac.uk/collections/journals/ret/pdf/ret_63_02.pdf
[Tashi2018] Nyima Tashi. Research on Tibetan Spelling Formal Language and Automata with Application. Springer, 2018, ISBN 9789811306716
[Walter2006] Walter, Michael. A Bibliography of Tibetan Dictionaries. In Walravens, Hartmut (editor). Bibliographies of Mongolian, Manchu-Tungus, and Tibetan Dictionaries. Orientalistik Bibliographien und Dokumentationen, Band 20. Harrassowitz Verlag, 2006, pp. 174-235.
[Yisun1985] Zhang, Yisun. Bod rgya tshig mdzod chen mo / 藏漢大辭典. 1985. Buddhist Digital Resource Center (BDRC), http://purl.bdrc.io/resource/MW29329. [BDRC bdr:MW29329]. Tseten Shabdrung, cited in this article, was part of the editorial board of this dictionary.