International digital Sanskrit library integration

Major activities

The International Digital Sanskrit Library Integration project created a globally distributed, internet-based digital library in Sanskrit from formerly independent projects. It brought together efforts to create Sanskrit digital archives, digital lexica, and linguistic software; to establish text-encoding standards; to enhance access to ancient and medieval manuscripts; and to develop OCR technology, display software, and Unicode-compliant text-editing software for Devanāgarī script. The resulting integrated information system enriches access to digital content in Sanskrit located worldwide and thus enables broad use of this material for research and education. The ready accessibility of web-based materials is especially significant for less commonly taught languages such as Sanskrit.

The project standardized Sanskrit text-encoding, revised the Unicode Standard to include characters necessary for Indic cultural heritage, supplied truthed data for optical character recognition, prepared the major digital Sanskrit-English lexicon for integration with linguistic software, produced several other digital lexical resources, produced a full-form Sanskrit lexicon and morphological analyzer, and fostered international collaboration in the area of Sanskrit computational linguistics. The Sanskrit Computational Linguistics Consortium, founded at the Second International Sanskrit Computational Linguistics Symposium held at Brown under the project, continues to foster progress in the development of OCR of Indic scripts, critical editing software, generative grammars, parsing software, semantic networks, machine translation, tagged corpora, and integrated Sanskrit library software.

Results

Standardization of Sanskrit text-encoding

Scharf and Hyman completed a comprehensive survey of linguistic and theoretical issues related to the encoding of Sanskrit language and Devanāgarī script, entitled Linguistic Issues in Encoding Sanskrit, published by the premier Indian Indological book publisher Motilal Banarsidass. The Unicode 5.2 Character Code Charts page includes the Devanagari Extended and Vedic Extensions code charts under South Asian Scripts, and the Sanskrit Library Vedic Unicode page details the history of the proposal with links to relevant documents.
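As a rough illustration of what the two code charts mean for software, the following Python sketch classifies characters by Unicode block. The block ranges are those published in the Unicode Standard; the function names `block_of` and `uses_vedic_characters` are hypothetical helpers introduced here for illustration and are not part of any project software.

```python
# Unicode block ranges relevant to Sanskrit in Devanagari script.
# The ranges come from the Unicode Standard; the helper names are
# illustrative assumptions, not project code.
BLOCKS = [
    (0x0900, 0x097F, "Devanagari"),
    (0x1CD0, 0x1CFF, "Vedic Extensions"),
    (0xA8E0, 0xA8FF, "Devanagari Extended"),
]


def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None


def uses_vedic_characters(text):
    """Report whether text contains characters from either block added in Unicode 5.2."""
    added_blocks = ("Vedic Extensions", "Devanagari Extended")
    return any(block_of(ch) in added_blocks for ch in text)
```

A check of this kind could be used, for example, to flag texts that need a font covering the Vedic accent marks.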

While investigating phonetic distinctions made in the linguistic treatises proper to various Vedic traditions and collecting Vedic passages that illustrate each distinction, Scharf found evidence of the correspondence of accentuation marks with surface pitch rather than with underlying pitch. These findings support arguments previously put forward by Michael Witzel and George Cardona concerning linguistic variation in Vedic and the value of accentual details for the linguistic and general history of India. Scharf presented a paper on the topic, “Vedic accent: underlying versus surface,” at the Fourth International Vedic Workshop, 24-27 May 2007 in Austin, TX (see the abstract under publications) and another, “L'accent védique dans les traités de phonétique, les manuscrits et la récitation,” at the École des hautes études en sciences sociales, Paris, 1 December 2008.

Devanāgarī OCR

Govindaraju and Kompalli produced a prototype stochastic recognition-driven OCR engine for Devanāgarī that incorporates novel methodologies and promises to serve as the basis for future research in OCR for complex scripts. Scharf and Hyman contributed a chapter, “Enhancing Access to Primary Cultural Heritage Materials of India,” to the Guide to OCR for Indic Scripts: Document Recognition and Retrieval, edited by Venu Govindaraju and Srirangaraj Setlur, published by Springer, 2009.

Digital lexicography

The Cologne Digital Sanskrit Lexicon site now hosts an enhanced interface to three dictionaries, developed in collaboration with project personnel at Brown in the International Digital Sanskrit Library Integration project. These dictionaries include, besides the Monier-Williams Sanskrit-English Dictionary (MW), the Sanskrit-Wörterbuch in kürzerer Fassung by Otto Böhtlingk with the Nachträge by Richard Schmidt, and Apte's English-Sanskrit dictionary. Moreover, the whole collection of Cologne's 10 dictionaries is accessible in the form of page images linked to headwords by the technique Funderburk developed in collaboration with project personnel. These lexical sources are all accessible directly on the site of the Cologne Digital Sanskrit Lexicon project as well as from the Sanskrit Library Reference page.

The canonical indices of the Mādhavīya Dhātuvṛtti that Scharf developed in collaboration with Funderburk are also accessible from the Sanskrit Library Reference page. The nominal form morphological analyzer developed by Scharf and Hyman is accessible on the Sanskrit Library Tools page, as is the general inflectional morphology analyzer Scharf developed in collaboration with Funderburk. The Sanskrit reference works in the list Scharf provided Crane for scanning by the Toronto University Library under the Perseus Project are now part of the Open Content Alliance archive.

Generative grammar and phonology

Scharf and Hyman gave presentations at the First International Sanskrit Computational Linguistics Symposium, 29-31 October 2007 in Paris. Scharf presented a paper on modeling the procedures of linguistic description used in Pāṇinian generative grammar, entitled “Modelling Pāṇinian Grammar,” and Hyman presented a paper on the automated transformation of our XML implementation of Pāṇini's phonetic rules into a finite state transducer, entitled “From Pāṇinian Sandhi to Finite State Calculus.” Both papers appear in the volume of papers of the first two Sanskrit computational linguistics symposia edited by Huet, Kulkarni and Scharf. Scharf's paper, “Levels in Pāṇini’s Aṣṭādhyāyī,” appears in the volume of papers of the Third International Symposium edited by Huet and Kulkarni.
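To give a sense of what treating sandhi as a mapping between strings involves, the following Python sketch applies a few vowel-sandhi rules at a word boundary, using romanized (IAST) input. It is a minimal illustration only and not the project's XML rule set or Hyman's transducer: the rule table `VOWEL_SANDHI` and the function `join` are hypothetical names introduced here, and the table covers only the merging of a/ā with a following a/ā, i/ī, or u/ū (the cases handled by Aṣṭādhyāyī 6.1.87 and 6.1.101), with initial ai and au deliberately left untouched.

```python
import unicodedata

# Hypothetical rule table for illustration: (final vowel, initial vowel) -> result.
# Covers only a/ā followed by a/ā, i/ī, u/ū; all other sandhi is omitted.
VOWEL_SANDHI = {
    ("a", "a"): "ā", ("a", "ā"): "ā", ("ā", "a"): "ā", ("ā", "ā"): "ā",
    ("a", "i"): "e", ("a", "ī"): "e", ("ā", "i"): "e", ("ā", "ī"): "e",
    ("a", "u"): "o", ("a", "ū"): "o", ("ā", "u"): "o", ("ā", "ū"): "o",
}


def join(left, right):
    """Join two IAST words, merging boundary vowels when a listed rule applies."""
    # Normalize so that letters such as ā are single precomposed code points.
    left = unicodedata.normalize("NFC", left)
    right = unicodedata.normalize("NFC", right)
    if not left or not right:
        return left + right
    # Initial diphthongs ai/au are outside this sketch; leave such words unjoined.
    if right[:2] in ("ai", "au"):
        return left + " " + right
    merged = VOWEL_SANDHI.get((left[-1], right[0]))
    if merged is None:
        return left + " " + right  # no listed rule applies
    return left[:-1] + merged + right[1:]
```

For instance, `join("ca", "iti")` yields `ceti` and `join("na", "asti")` yields `nāsti`. A finite state transducer expresses the same kind of context-dependent replacement, but compiled from the full rule set rather than from a hand-written lookup table.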

Another paper by Scharf relates to his work on generating Sanskrit inflected forms to build the full-form lexicon: “Pāṇinian accounts of the class eight presents,” presented at the 216th Meeting of the American Oriental Society, 17-20 March 2006, Seattle, Washington, appeared in Journal of the American Oriental Society 128.3 (2008).

Sanskrit library integration

The integration of texts with grammatical analysis and lexical sources, which was the premier goal of the International Digital Sanskrit Library Integration project, is available on the Texts page of the Sanskrit Library website.

Project personnel

Peter M. Scharf, Principal Investigator
Malcolm D. Hyman, Co-Principal Investigator
Ramaswamy Chandrashekar, Post-Doctoral Research Associate
Susan J. Moore, Post-Doctoral Research Associate

Grant details

period: 1 January 2006 -- 30 December 2009
funding agency: National Science Foundation, Division of Information and Intelligent Systems
funding: $247,350
location: Brown University