Tibetan Search Enhancements

May 2, 2014

TBRC has just released enhancements for Tibetan-language searching on its online library.

XQuery, Lucene & eXist

TBRC is heavily invested in using and extending open source tools for the benefit of our community. Such technical contributions are essential to support Tibetan language in the digital space.

Most notably, we use the eXist XML database to store and deliver all our Tibetan cultural metadata. Since version 1.4, eXist has integrated Apache Lucene into its native XML framework. This integration makes searching XML collections intuitive and powerful.

Take for instance the following XQuery search:

declare namespace o="http://www.tbrc.org/models/outline#";
let $query-term := "rdo rje snying po"
for $title in collection("/db/tbrc/outlines/")//o:title [ft:query(., $query-term)]
order by $title collation "java:org.tbrc.common.server.TibetanCollator"
return $title

This search returns all the titles in the collection of dkar chags created by TBRC where the title contains the phrase $query-term. For instance,

dus 'khor ba rdo rje snying po la dri ba yi ger bskur

The order by clause sorts the o:title result set in Tibetan alphabetical order, using our collator class org.tbrc.common.server.TibetanCollator.

Technically, the full text query ft:query is a Lucene function integrated into eXist that searches a Lucene index. The index is generated from a configuration file for the XML collection. For instance, see the following section of the old configuration file for outlines,

<analyzer id="ws" class="org.tbrc.lucene.analysis.ChunkAnalyzer"/>
<text qname="o:title" analyzer="ws"/>

With this configuration, org.tbrc.lucene.analysis.ChunkAnalyzer is used to build a Lucene index over every o:title element in the outlines collection.

In the past, for a variety of reasons, all the searches on TBRC treated terms such as rdo rje snying po as a phrase. This meant that you would only find matches where a title contained the exact phrase – a match of all the tokens rdo | rje | snying | po, spelled exactly like that, in that exact order.

It would not find rdo rje snying po'i or rdo rje'i snying po. Nor would the search find rdo rje snying.

Obviously this created a very rigid search.
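
To make the limitation concrete, here is a minimal Python sketch of exact phrase matching over whitespace-separated Wylie tokens. It is only an illustration of the behaviour described above, not the code TBRC ran (the real matching happens inside Lucene), and the function name contains_phrase is made up for this example.

```python
def contains_phrase(title, query):
    """Return True if the query tokens occur in the title contiguously and in order."""
    title_tokens = title.split()
    query_tokens = query.split()
    n = len(query_tokens)
    # Slide a window of n tokens across the title and compare token by token.
    return any(
        title_tokens[i:i + n] == query_tokens
        for i in range(len(title_tokens) - n + 1)
    )


query = "rdo rje snying po"
for title in [
    "dus 'khor ba rdo rje snying po la dri ba yi ger bskur",
    "rdo rje snying po'i",
    "rdo rje'i snying po",
    "rdo rje snying",
]:
    print(title, "->", contains_phrase(title, query))
```

Only the first title matches. The other three differ from the query by a single particle or a missing token, which is enough to defeat an exact phrase comparison.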

Improved Searching

We created a series of improvements designed to analyze the Tibetan language elements in our XML more accurately. The same search:

declare namespace o="http://www.tbrc.org/models/outline#";
let $query-term := "rdo rje snying po"
for $title in collection("/db/tbrc/outlines/")//o:title[ft:query(., $query-term)]
order by $title collation "java:org.tbrc.common.server.TibetanCollator"
return $title

now returns

rdo rje snying po
rdo rje snying po'i
rdo rje'i snying po

Tibetan Indexing

These search enhancements are accomplished by creating more accurate indices based on Tibetan language. The following analyzers are used to generate Lucene indices for Tibetan and Wylie across every collection in the TBRC Library.

<analyzer class="org.tbrc.lucene.analysis.WylieAnalyzer"/>
<analyzer class="org.tbrc.lucene.analysis.TibetanAnalyzer"/>

Note: there are two analyzers – one for Extended Wylie and one for Unicode Tibetan.

Whitespace

We analyze our XML data using these new analyzers. Essentially, an analyzer breaks text into tokens: semantically meaningful chunks of characters separated by whitespace. In our implementation, any character that is not a Tibetan letter or digit is treated as whitespace and therefore marks a token boundary.
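
The rule can be sketched in a few lines of Python for Unicode Tibetan. This is an approximation under stated assumptions, not the TibetanAnalyzer itself: it treats a character as part of a token when it lies in the Unicode Tibetan block (U+0F00 to U+0FFF) and is a letter, a combining mark (vowel signs and subjoined letters) or a decimal digit. Everything else, including the tsheg (U+0F0B), ends the current token. The function names are hypothetical.

```python
import unicodedata


def is_token_char(ch):
    """Assumed rule: Tibetan-block letters, combining marks and decimal digits."""
    if not "\u0F00" <= ch <= "\u0FFF":
        return False
    category = unicodedata.category(ch)
    return category[0] in "LM" or category == "Nd"


def tokenize(text):
    """Split text into runs of token characters; anything else is a boundary."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# rdo rje snying po, with a tsheg (U+0F0B) between syllables
text = (
    "\u0F62\u0FA1\u0F7C\u0F0B"        # rdo
    "\u0F62\u0F97\u0F7A\u0F0B"        # rje
    "\u0F66\u0F99\u0F72\u0F44\u0F0B"  # snying
    "\u0F54\u0F7C"                    # po
)
print(len(tokenize(text)))
```

Because the tsheg is punctuation rather than a letter, it acts as the separator between syllables, and the phrase yields four tokens.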

Tokens

Lucene indices are composed of tokens.

With the example rdo rje snying po (རྡོ་རྗེ་སྙིང་པོ), there are 4 tokens:

rdo
rje
snying
po

A few more instructive examples, each with 5 tokens:

kha cig gis 'dod pa
sangs rgyas kyis chos bstan
las ngan gyis 'phangs pa
'phags pa yis mkhyen pa
nyi ma shar ba na.

Now with proper Tibetan tokens, we can implement a more flexible search environment.

Ignoring "Stop-Tokens" (stop-words)

Search engines commonly deal with stop-words, words that carry little semantic weight, by having the analyzer filter them out. For our Tibetan analyzers, stop-words are really more like "stop-tokens".

The following Tibetan stop-tokens are filtered out (shown in Wylie, followed by their corresponding Unicode values). Note: this list was updated on September 15, 2014.

// "gis", "kyis", "gyis", "yis", "na"
"\u0F42\u0F72\u0F66", "\u0F40\u0FB1\u0F72\u0F66", "\u0F42\u0FB1\u0F72\u0F66", "\u0F61\u0F72\u0F66", "\u0F53"

What this means in practice is that these tokens are ignored, filtered out, during the creation of the index. Searching on sangs rgyas chos bstan will find sangs rgyas kyis chos bstan, etc.
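
A rough Python sketch of the stop-token filter, using the Wylie list above. It assumes that the same filter is applied to the search terms as to the indexed text, and it splits on plain whitespace rather than using the real analyzers. The names STOP_TOKENS and index_tokens are hypothetical.

```python
# Wylie stop-tokens, taken from the list above.
STOP_TOKENS = {"gis", "kyis", "gyis", "yis", "na"}


def index_tokens(text):
    """Tokenize on whitespace and drop stop-tokens."""
    return [token for token in text.split() if token not in STOP_TOKENS]


print(index_tokens("sangs rgyas kyis chos bstan"))
print(index_tokens("sangs rgyas chos bstan"))
```

Both calls produce the same token list, which is why the shorter search finds the longer title.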

Stemming & Additional Filters

We also implemented a series of filters that support stemming.

The འ character (Wylie ') is considered a stem for Tibetan grammatical particles. We therefore filter out the suffixes 'i, 'o and 's. In the Tibetan analyzer these are "\u0F60" (འ) followed by "\u0F72" (i), "\u0F7C" (o) or "\u0F66" (s), respectively.

This means that searching rdo rje snying po will match rdo rje'i snying po.

We also filter "+" and "-" to make Extended Wylie searching more flexible. In practice, searching pad ma will match pad+ma, and searching pad+ma will match pad ma. In the same way, searching pan di ta will match paN+Di ta, and vice versa.
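
The particle stemming can be sketched in Python for Wylie tokens as follows. This is a simplified illustration, not the filter used in the WylieAnalyzer, and the names are hypothetical. The "+" and "-" handling is left out because the text does not say whether those characters split tokens or are simply dropped.

```python
# Particle suffixes built on the 'a stem, as listed above.
PARTICLE_SUFFIXES = ("'i", "'o", "'s")


def stem(token):
    """Strip one trailing particle suffix, leaving other tokens unchanged."""
    for suffix in PARTICLE_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def stemmed_tokens(text):
    """Tokenize on whitespace and stem every token."""
    return [stem(token) for token in text.split()]


print(stemmed_tokens("rdo rje'i snying po"))
print(stemmed_tokens("rdo rje snying po'i"))
```

Both titles reduce to the tokens rdo, rje, snying, po, so either one matches a search for rdo rje snying po.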

Additional filters can be added.

Bringing It All Together

Here are some common searches to help illustrate how this works in practice:

Searching paN+Di ta retrieves paN+Di ta'i, pan di ta and pandi ta. Or vice versa, searching pan di ta finds paN+Di ta and paN+Di ta'i. It won't find pandita, since pandita is not a token in Wylie or Tibetan.
Searching dharma shrI matches d+harma shrI, dhar+ma shrI and d+har+ma shrI.

There are many, many examples that can be explored. Please let us know what you discover!

Future Directions

Supporting Common Short Form Names and Titles

We plan to support common short form names and titles.

Take for example: spyod pa la 'jug pa

This work is often referred to as:

spyod 'jug

but since "pa" is not a stop word we don't find those occurrences. Further,

sbyin 'jug

spyad 'jug

are alternatives used to refer to this famous work. These would be treated as synonyms all mapping to:

byang chub sems dpa'i spyod pa la 'jug pa

There are many examples of this in names and titles.

This is accomplished by developing what amounts to synonym filters – filters that translate a common form to a canonical form. This work will extend the work on Tibetan tokenization.
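
As a rough idea of what such a synonym filter could look like, here is a small Python sketch. It is hypothetical, since this feature is only planned: a lookup table maps each short form from the example above to the canonical title before the search runs. The names SHORT_FORMS and expand_short_form are invented for the illustration.

```python
CANONICAL = "byang chub sems dpa'i spyod pa la 'jug pa"

# Short forms from the example above, each mapped to the canonical title.
SHORT_FORMS = {
    "spyod 'jug": CANONICAL,
    "sbyin 'jug": CANONICAL,
    "spyad 'jug": CANONICAL,
}


def expand_short_form(query):
    """Return the canonical form of a known short form, else the query itself."""
    normalized = " ".join(query.split())  # collapse runs of whitespace
    return SHORT_FORMS.get(normalized, normalized)


print(expand_short_form("spyod 'jug"))
print(expand_short_form("rdo rje snying po"))
```

A real filter would work on the token stream inside the analyzer, but the table lookup is the core of the idea.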

Supporting Phonetics

The same strategy can be used to support phonetics – use our existing phonetic table and map to canonical forms.

Feedback Wanted

The indices are currently deployed on tbrc.org. Please continue to use the site and let us know if you find anything strange or unexpected. You can also drop us a message if you find anything helpful!