Monday, 16 June 2008

GLOBAL INTEGRATION OF INDIANS THROUGH INDIC LANGUAGE COMPUTING

Vijay K. Malhotra

Former Director (Official Languages), Ministry of Railways, Govt. of India, New Delhi (India)
E-mail ID: malhotravk@gmail.com

ABSTRACT

It is no coincidence that the developed countries are predominantly monolingual whereas the developing countries are multilingual. In the so-called developed world, two languages are considered a nuisance and many languages an absurdity. For the multilingual, multi-ethnic and multi-cultural society of India, however, the keyword is Unity in Diversity, and this holds for the languages and scripts of India as well. Some Indian languages and scripts look so different from one another that it is difficult to locate the thread that connects them, and it is almost impossible to believe that all the scripts of the Indian languages (except that of Urdu) originated from the common source of the Brahmi script. The computer scientists of IIT, Kanpur realized this fact in 1971-72 while developing a common keyboard for all Indian languages based on GIST technology. The pan-Indian character of the Indian languages also unites and integrates Indians through this ethos of Unity in Diversity. While developing various language tools for Indian languages, the computer scientists were amazed to see the commonality amongst these languages not only in terminology but also in syntax, morphology and semantics, which paved the way for tools such as a common code, a common keyboard and a common formalism based on the Paninian framework.

INTRODUCTION
Language plays an important role in politics. A cursory glance at world history is sufficient to show that it is no coincidence that the developed countries are predominantly monolingual whereas the developing countries are multilingual, multi-ethnic and multi-cultural. In the so-called developed world, two languages are considered a nuisance and many languages an absurdity. In fact, the management of diversity is the greatest challenge before humanity today. The democratic world stands divided, its nations isolated from one another, because the nation state has been built around a monolingual mindset. In Canada, mutual intolerance between French and English speakers has brought Quebec to the verge of secession. Pakistan and Bangladesh parted company when insensitive politicians imposed Urdu as the sole official language. Recognition of many languages not only permits the free flow of communication but also leads to the empowerment of people at the grassroots. Linguistic diversity is as much in need of protection as bio-diversity. Like Chinese and English, Hindi also subsumes a number of dialects under its generic name.

Discussion/Proposal
Ø All Indic languages are National languages.
If you look at the debate of the Constituent Assembly of India on September 14, 1949, the resolution to declare Hindi the Official Language of the Republic of India was moved by none other than a Tamil leader, Gopalaswami Ayyangar. The resolution was passed unanimously, without any abstention or opposition, but a clause was also added to ensure the development of the other Indian languages along with Hindi. Accordingly, the major Indian languages, including Sanskrit, were recognized as National Languages and listed in the Eighth Schedule of the Constitution of India; the number of such languages has now gone up to 22. Article 351 also envisages developing Hindi as the language representing the composite culture of India by assimilating the forms, style and expressions of the other National languages enumerated in the Eighth Schedule.
Ø Unity in Diversity
Ø India is a Linguistic Zone
Ø Common Code and Common Keyboard for all Indic scripts
Unity in diversity is the keyword for the multilingual, multi-ethnic and multi-cultural society of India. Although over 1650 languages and dialects are prevalent across the country, India is a single Linguistic Zone. This is also true of the languages and scripts of India, even though some of them look so different from one another that it is difficult to locate the thread that connects them. For example, the scripts of the Aryan and Dravidian languages look so different that it is almost impossible to believe that all Indic scripts (except that of Urdu) originated from the common source of the Brahmi script. By the time of Ashoka, we find Brahmi used extensively in both the North and the South of India. Historically, this was revealed in 1837 with the decipherment of Brahmi, but the computer scientists of IIT, Kanpur realized the fact in 1983 while developing a common keyboard for all Indian languages based on GIST (Graphics and Intelligence based Script Technology). It was first demonstrated at the International Hindi Conference held in 1983 at New Delhi. The stupendous task of accommodating all the Indic scripts on a QWERTY keyboard was scientifically based on the phonetic nature of the Indic scripts derived from Brahmi. GIST supported a common phonetic overlay and a common code for all Indian languages.
Ø ISCII code and INSCRIPT keyboard for Indian languages.
Indian scripts are syllabic in nature, but their alphabets are phonetic and, because the scripts evolved from the common source of Brahmi, they share a common heritage. There are a few variations, where certain scripts have additional or fewer letters. This aspect was incorporated into the ISCII (Indian Standard Code for Information Interchange) code, which evolved in 1986-88 and was accepted as the standard by the Bureau of Indian Standards.
When it comes to the use of computers, the options available for data entry are a major concern. For data entry in Indian languages, the default option is the INSCRIPT (INdian SCRIPT) layout. This layout uses the standard 101-key keyboard. The mapping of the characters is such that it remains common for all the Indian languages (written left to right), because the basic character set of the Indian languages is common. We can divide the characters of the Indian language alphabets into Consonants, Vowels, Nasals and Conjuncts. Every consonant represents a combination of a particular sound and a vowel. The vowels are representations of pure sounds. The Nasals are characters representing nasal sounds along with vowels. The conjuncts are combinations of two or more characters. The Indian language alphabet table is divided into Vowels (Swar) and Consonants (Vyanjan). The vowels are divided into long and short vowels, and the consonants are divided into vargs. The INSCRIPT layout takes advantage of these observations, and thus its organization is simple: all the vowels are placed on the left side of the keyboard and the consonants on the right side, and the placement is such that the characters of one varg are split over two keys.
This is how the common keyboard and the common code for all Indian languages evolved from the common alphabetic order of these languages. Because of the common coding, mutual transliteration amongst Indic scripts also became possible. Since ISCII includes the Roman script along with the Indic scripts, transliteration is also possible from Indic scripts to the Roman script.
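The mutual transliteration described above can be illustrated with a small sketch. It is not the ISCII or GIST implementation: it assumes Unicode text and relies on the fact that the Unicode blocks for the major Indic scripts follow the ISCII ordering, so that a letter usually sits at the same offset in every block. The names BLOCK_START and transliterate are hypothetical, and letters that have no counterpart in the target script are not handled.

```python
# Start of the Unicode block for each script (each block spans 128 code points).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def transliterate(text: str, source: str, target: str) -> str:
    """Move each character of the source block to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 128:
            out.append(chr(dst + (cp - src)))
        else:
            out.append(ch)  # spaces, digits, Latin letters etc. pass through unchanged
    return "".join(out)


word = "भारत"  # "Bharat" in Devanagari
print(transliterate(word, "devanagari", "bengali"))
print(transliterate(word, "devanagari", "gujarati"))
```

A production tool would also need a table of exceptions for the letters that exist in one script but not in another.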
Ø Commonality amongst the Lexical, Syntactic and Semantic features of Indian languages
Having looked at the Indic scripts, let us now turn to the commonality amongst the lexical, syntactic and semantic features of Indian languages. It is true that most Indian languages have borrowed extensively from Sanskrit, and these words are commonly used in all of them. This pan-Indian character unites the Indian languages through the common bond and heritage of Sanskrit. It is even more true of the vocabulary of Ayurveda, GaNit and Jyotish, whose shastraas were written primarily in Sanskrit, but the lexical features of the regional languages are also common to a great extent. While finalizing technical terms in various Indian languages, the Govt. of India issued directives that the technical terminology coined in these languages should be derived primarily from Sanskrit, which is why that terminology is common to a great extent. While developing various language tools for Indian languages, the computer scientists were amazed to see the commonality amongst Indian languages not only in lexical features but also in syntactic and semantic features. In fact, they found that although over 1650 languages and dialects are prevalent across the country, India is one Linguistic Zone. The areas of commonality amongst Indian languages and scripts discussed below were observed by computer scientists working in the field of Computational Linguistics while developing various tools in Hindi and other Indian languages.
Ø Language group specific features of Indian languages
If you look at the core of any language of the world, you will find two distinct kinds of features: universal features and language-specific features. Universal features are those that are common to all languages, even languages as different from one another as Hindi, Tamil, English, Chinese and Arabic. For example, 'khaayaa' (ate) is a transitive verb, which requires an object that is eaten and a subject who eats. Similarly, it requires its subject to be animate. The tree structure of this verb, as well as universal features such as transitivity, is common to all languages of the world, but the use of 'ne' with the subject of this verb is a language-specific feature of Hindi. There are also certain features which are specific to a language group. For example, the use of 'ko' (or its equivalent) with the subject in a specific sentence pattern is common to all languages in India: 'Raam Ko Bukhaar Hai' in Hindi, 'Malaa Taap Aahe' in Marathi, 'Ramakku Jwaram' in Tamil, 'Raamannu Paniyaa Nu' in Malayalam, 'Ramanige Jvar Ide' in Kannada and 'Ramer Taap Aachhe' in Bengali, whereas in English the sentence is translated as 'Ram has a fever'. Here you will notice the conspicuous absence of anything corresponding to 'Ko'. This shows that India is one linguistic zone. If we make use of computer technology to analyze these features, we can come up with sophisticated language tools such as a Language Tutor, Auto Correct, a Grammar Checker and even machine translation systems in Indian languages.

Ø Commonality amongst the Lexical, Syntactic and Semantic features of Indian languages
The commonality amongst Indian languages has also paved the way for developing language tools such as machine translation systems in Hindi and other Indian languages. The Anusaaraka machine translation system for five pairs of Indian languages, based on a computational Paninian framework, has been developed under the guidance of Dr. Rajeev Sangal, the Director of the International Institute of Information Technology, Hyderabad. Another formalism, TAG (Tree Adjoining Grammar), has been developed by Prof. Aravind Joshi, Head of the Computer Science Department of the University of Pennsylvania, USA. This formalism is quite suitable for multiple languages having different syntactic structures, such as Hindi and English: English is a positional language, whereas Hindi has relatively free word order. For example, if you change the word order of the sentence "Ram (Subject) killed (verb) Ravan (Object)" so that Subject and Object are exchanged, "Ravan (Subject) killed (verb) Ram (Object)", the meaning changes completely, but in Hindi the meaning remains the same even after the word order is changed: "Raam (Subject) ne RaavaN (Object) ko maaraa (verb)" and "RaavaN (Object) ko Raam (Subject) ne maaraa (verb)". TAG handles both languages on the basis of the verb and picks up the basic word order, SVO in English and SOV in Hindi. The domain selected is that of Officialese (the language used in administration). The features of Officialese are almost the same across languages. For example, the use of past participles is quite common in Officialese: "Mr. Verma has been transferred from Delhi to Mumbai with effect from March 1, 2005 and posted as Director (Operations)". A typical feature of Hindi, however, is the honorific use of Shri, as in Shri Verma. In English, it is enough to use Mr. before the name of the person to show respect, and the verb remains singular. In Hindi, even the verb changes to the plural: Shree Verma nideshak ho gaye (plural).
These examples make it clear that unless language-specific features are addressed while developing the parser, language tools such as a machine translation system cannot be developed successfully. This parser has been found quite useful for analyzing language-specific features in Hindi and in other languages of the world, such as English, French, Marathi, Japanese and Chinese, which belong to different language families.
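The point about free word order can be made concrete with a toy sketch. It is not the TAG or Anusaaraka parser. It is a hypothetical function, assign_roles, that reads the romanized Hindi example above and assigns roles from the postpositions 'ne' and 'ko' rather than from word position, on the simplifying assumption that the verb is the last word of the sentence.

```python
def assign_roles(sentence: str) -> dict:
    """Assign subject and object from the postposition that follows each noun."""
    tokens = sentence.split()
    roles = {"verb": tokens[-1]}  # assumption: the verb is sentence-final
    markers = {"ne": "subject", "ko": "object"}
    # Look at each word together with the word that follows it.
    for word, following in zip(tokens, tokens[1:]):
        if following in markers:
            roles[markers[following]] = word
    return roles


first = assign_roles("Raam ne RaavaN ko maaraa")
second = assign_roles("RaavaN ko Raam ne maaraa")
print(first)
print(first == second)  # both word orders give the same analysis
```

An English analyser cannot work this way, because in English the roles are signalled by position and not by a marker attached to the noun.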

Ø Script specific features may not necessarily be the common features between two languages such as Hindi and Marathi sharing the same Devanagari script.

In spite of the commonality amongst the few Indian languages that share the Devanagari script, one has to keep in mind while developing tools like Auto Correct that spelling is not a script-specific feature but a language-specific one. For example, the Devanagari script is used for both Hindi and Marathi, but their spelling conventions are quite different. Even the words commonly derived from Sanskrit are spelt differently in the two languages. Most of the words ending with short "i" in Hindi are spelt with long "ii" in Marathi. For example, kavi is spelt with short "i" in Hindi, but in Marathi it is spelt kavee, with long "ii". How, then, can there be a common Auto Correct for both Hindi and Marathi? Besides, one can collect samples from various Hindi-speaking as well as non-Hindi-speaking regions to understand the pattern of errors committed by speakers of different languages. The errors committed by Marathi speakers are quite different from those of Punjabi speakers. Since Hindi is used in most parts of India, there is a lot of variation in its pattern of errors. While Marathi and Gujarati speakers confuse long "ii" and short "i", the speakers of South Indian languages make mistakes with aspirated sounds: they write bhaaShaa as baaShaa and khaanaa as kaanaa. This is because of mother tongue interference.
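One way to keep spelling checks tied to the language rather than to the script is sketched below. The word lists, the name LEXICON and the function suggest are all hypothetical, and the only data used is the kavi example from the text, written in romanized notation with "kavii" standing for the long-vowel Marathi spelling. A real Auto Correct would need a full lexicon for each language.

```python
# Hypothetical, tiny per-language word lists in romanized notation.
LEXICON = {
    "hindi": {"kavi"},      # short final i
    "marathi": {"kavii"},   # long final ii
}


def suggest(word: str, language: str) -> str:
    """Return the accepted spelling that differs only in the length of a final i."""
    lexicon = LEXICON[language]
    if word in lexicon:
        return word
    if word.endswith("ii"):
        candidate = word[:-1]      # try the short vowel
    elif word.endswith("i"):
        candidate = word + "i"     # try the long vowel
    else:
        return word                # nothing to suggest
    return candidate if candidate in lexicon else word


print(suggest("kavi", "marathi"))
print(suggest("kavii", "hindi"))
```

Because the lexicon is selected by language, the same Devanagari word can be accepted for one language and corrected for the other.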

Ø UNICODE is the answer to all problems with regards to non-compatibility, multiple fonts and various Operating Systems amongst all languages of the world across the globe.

In the present scenario, most users of Indic languages are still using non-standard fonts and, because of the incompatibility across systems and fonts, they do not attempt to use multiple applications such as e-mail, chat, templates, auto text, thesaurus, spell checkers etc. Very few users attempt data applications such as Excel or Access in Hindi. PowerPoint is also not commonly used in Indic languages. This is because there was no common standard for Indic languages across systems. ISCII was a good beginning in this direction, but in a globalized world we need a global standard under which all the languages of the world can co-exist irrespective of platform, font and system. UNICODE is the answer to all these problems. Hence our endeavor should be to make Indic language users aware of the advantages of language computing in Unicode.
As far as Indian languages are concerned, ISCII has been taken as the basis for encoding in Unicode. All the Indian scripts are grouped together in the Unicode character chart. Each script is given its own code page (block), and each Indic block contains 128 code points. Rendering of Unicode text is handled by OpenType fonts, a joint initiative of Adobe and Microsoft. OpenType is an open standard and not the property of any one company. The sort order is different for each Indian language, even for languages that use the same script. This is an important difference from ISCII, which assumes one sorting order for the whole of India, something that is not true in reality. The sorting order therefore differs slightly amongst Indic scripts because of slight variations, or the addition and deletion of some syllables. Unicode leaves individual applications free to handle the sorting.
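The block layout described above can be inspected directly. The sketch below is illustrative only: INDIC_BLOCKS lists block start values as published in the Unicode code charts, while the function name script_of is hypothetical.

```python
# Start of each 128-code-point Indic block, keyed by its first code point.
INDIC_BLOCKS = {
    0x0900: "Devanagari",
    0x0980: "Bengali",
    0x0A00: "Gurmukhi",
    0x0A80: "Gujarati",
    0x0B00: "Oriya",
    0x0B80: "Tamil",
    0x0C00: "Telugu",
    0x0C80: "Kannada",
    0x0D00: "Malayalam",
}


def script_of(ch: str):
    """Return (script name, offset inside its block), or None for other characters."""
    cp = ord(ch)
    start = cp - (cp % 128)  # round down to the start of a 128-code-point chunk
    name = INDIC_BLOCKS.get(start)
    return (name, cp - start) if name else None


# The letter "ka" in Devanagari, Bengali and Tamil.
for ch in "कকக":
    print(ch, f"U+{ord(ch):04X}", script_of(ch))
```

The letter "ka" has the same offset in each of the three blocks, which reflects the ISCII ordering that Unicode took as its basis.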
Ø If you want to be with the rest of the world, then move to Unicode.
Unicode is increasingly being accepted as a standard for information interchange worldwide, as most of the major IT companies have declared their support for it. The Unicode standard provides the capacity to encode all of the characters used for the written languages of the world, and it provides information about the characters and their use. It is very useful for computer users who deal with multilingual text: business people, linguists, researchers, scientists, mathematicians and technicians.
Unicode was originally designed as a 16-bit encoding that provides code points for more than 65000 characters (65536); the standard has since been extended beyond 16 bits, to code points up to U+10FFFF. The Unicode standard assigns each character a unique numeric value and name. By contrast, ISCII uses an 8-bit code, an extension of the 7-bit ASCII code, containing the basic alphabet required for the 10 Indian scripts which originated from the Brahmi script. Traditionally, computer applications dealt with text in only one language. Subsequently the need to work with multilingual text was felt, and this brought in additional requirements in respect of codes. With such 8-bit codes, letters from different languages cannot normally be distinguished on the basis of their codes, because across different languages the numerical values assigned fall in the same range. Thus one might find that the code assigned to the letter "a" in English is really the same as the code assigned to the Greek letter "alpha" or to an equivalent letter in the Cyrillic alphabet. A multilingual document with text from different languages cannot really be identified as one unless a mechanism is available to mark sections of the text as belonging to a specific language/script.
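The ambiguity of 8-bit codes can be reproduced with Python's built-in codecs. The choice of the ISO 8859 code pages here is only an illustration, since the text does not name specific code pages: the same byte is read as a different letter depending on which code page is assumed, whereas in Unicode each of those letters has its own code point.

```python
raw = bytes([0xE1])  # a single byte, with no record of the language it belongs to

# The same byte decoded under three different 8-bit code pages.
readings = {codec: raw.decode(codec) for codec in ("latin-1", "iso8859_7", "iso8859_5")}
for codec, letter in readings.items():
    print(codec, letter)

# In Unicode the three letters are distinct and cannot be confused.
for letter in readings.values():
    print(letter, f"U+{ord(letter):04X}")
```

Once text is stored as Unicode, a document can mix these letters freely without any extra marker for the language of each section.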
The basic idea in Unicode was to assign codes over a much larger range of numbers, initially from 0 to nearly 65000. This large set includes not only the letters of the alphabets of many different languages of the world but also punctuation and special shapes such as mathematical symbols, currency symbols etc. This large range is apportioned to different languages/scripts by assigning chunks of 128 consecutive numbers to each script, which may also include a group of special symbols. The size of the alphabet in many languages is much less than 50, so this minimal range of 128 is quite adequate even to cover additional symbols, punctuation etc. An important concept in Unicode is that codes are assigned to a language on the basis of linguistic requirements. Thus, for most languages of the world which use the letters of their alphabet in the writing system, the linguistic requirement is basically satisfied if all the letters are covered along with special symbols. Display of text proceeds by identifying the letters through their assigned Unicode values both in the input string and in the displayed string, which for most languages/scripts are identical. Thus a Unicode font for a language need incorporate only the glyphs corresponding to the letters of the alphabet, and the glyphs in the font are identified by the same codes used for the letters they represent. Here is the question "What is Unicode?" stored as Unicode text and displayed in various languages of the world.
ما هي الشفرة الموحدة "يونِكود" ؟ in Arabic
Какво е Unicode ? in Bulgarian
什麽是Unicode(統一碼/標準萬國碼)? in Traditional Chinese
什么是Unicode(统一码)? in Simplified Chinese
Što je Unicode? in Croatian
Co je Unicode? in Czech
Hvad er Unicode? in Danish
Wat is Unicode? in Dutch
Kio estas Unikodo? in Esperanto
Mikä on Unicode? in Finnish
Qu'est ce qu'Unicode? in French
რა არის უნიკოდი? in Georgian
Was ist Unicode? in German
Τι είναι το Unicode; in Greek (Monotonic)
Τί εἶναι τὸ Unicode; in Greek (Polytonic)
מה זה יוניקוד (Unicode)? in Hebrew
यूनिकोड क्या है? in Hindi
Hvað er Unicode? in Icelandic
Que es Unicode? in Interlingua
Cos'è Unicode? in Italian
ユニコードとは何か? in Japanese
유니코드에 대해? in Korean
Kas tai yra Unikodas? in Lithuanian
Што е Unicode? in Macedonian
X'inhu l-Unicode? in Maltese
يونی‌کُد چيست؟ in Persian
Czym jest Unikod? in Polish
O que é Unicode? in Portuguese
Ce este Unicode? in Romanian
Что такое Unicode? in Russian
Kaj je Unicode? in Slovenian
¿Qué es Unicode? in Spanish
Vad är Unicode? in Swedish
Unicode คืออะไร? in Thai
ዩኒኮድ እንታይ ኢዩ? in Tigrigna
Što je Unicode? in Upper Sorbian
Evrensel Kod Nedir? in Turkish
ﻳﯘﻧﯩﻜﻮﺩ ﺩﯨﮕﻪﻥ ﻧﯩﻤﻪ؟ in Uyghur
Unicode dégen néme? in Uyghur (latin)
Unicode là gì? in Vietnamese
Beth yw Unicode? in Welsh
The main purpose of Unicode is to transport information across computer systems.


Conclusions

Unity in Diversity is the key to the Indian ethos. This is also reflected in the scripts and languages of the Indian subcontinent, because all Indic scripts originated from the Brahmi script and share a common phonetic order. Building on the commonality amongst Indic scripts and languages across the country, computer scientists succeeded in developing a common code, ISCII, and a common keyboard, INSCRIPT, for Indian languages. This has also paved the way for developing various language tools in Hindi and other Indian languages, such as mutual transliteration among Indian scripts. No doubt the ISCII code did help to bring about national integration amongst Indians speaking multiple languages across the country, but with the advent of UNICODE all the languages of the world have been covered under a single uniform code, paving the way for the global integration of Indians through Indic language computing. In fact, Unicode has transformed the entire world community into a global village in letter and spirit, thus turning the following Sanskrit saying into reality: vasudhaiv kuTumbakam, the whole world is like a family.




Acknowledgements


Mr. Mahendra K. Verma, University of York (UK)
Dr. Aravind Joshi, University of Pennsylvania (USA)
Dr. U.B. Pavanaja, CEO, Vishva Kannada Softech, Bangalore
Dr. Suraj Bhan Singh, Former Chairman, Commission for Scientific & Technical Terminology in Hindi, New Delhi
Dr. Krishna Kumar, Gitanjali Multilingual Literary Circle, Birmingham, UK











