If the Maharaja Sayajirao University (MSU) has its way, Gujarati literature will be no more than a mouse click away. Experts at MSU are developing Optical Character Recognition (OCR) software that could convert scanned images of Gujarati books into text for convenient storage and speedy retrieval.
It will be no mean feat, as the software would have to recognise more than 1,000 different symbols. Jignesh Dholakia, director of the project, told The Indian Express: “OCR software that can convert scanned images of pages in European and English languages into text are easily available. These are fairly accurate as they have to deal with only around 70-80 different symbols and the writing style is also linear. In the case of Indian language books, there are hundreds of different symbols to be recognised and the modifiers for the basic characters can occur on all four sides of the character. The situation becomes even more complex due to the occurrence of similar-looking characters. Gujarati language OCR has to deal with more than 1,000 different symbols of vowels, consonants and conjunctions with different vowel modifiers.”
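To give a rough sense of how the symbol count quoted above can climb past 1,000, the following minimal Python sketch counts combinations in the Unicode Gujarati block. It is only an illustration under stated assumptions: it uses Unicode code points rather than the printed glyph shapes an OCR system would actually classify, the helper name `named` is hypothetical, and the figures are combinatorial upper bounds, not the MSU project's symbol inventory.

```python
import unicodedata

def named(start, end, prefix):
    """Return the characters in [start, end] whose Unicode name begins with prefix."""
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.name(chr(cp), "").startswith(prefix)
    ]

# Consonants KA..HA and the dependent vowel signs of the Gujarati block.
# Unassigned code points inside these ranges have no name and are skipped.
consonants = named(0x0A95, 0x0AB9, "GUJARATI LETTER")
vowel_signs = named(0x0ABE, 0x0ACC, "GUJARATI VOWEL SIGN")
VIRAMA = "\u0ACD"  # joins two consonants into a conjunct sequence

# Each consonant can appear bare or carry one vowel sign.
base_forms = len(consonants) * (len(vowel_signs) + 1)

# Upper bound on two-consonant sequences (consonant + virama + consonant).
# Assumption: not every such sequence is a conjunct that occurs in print.
two_consonant_sequences = len(consonants) ** 2

print("consonants:", len(consonants))
print("vowel signs:", len(vowel_signs))
print("consonant + vowel forms:", base_forms)
print("two-consonant sequences (upper bound):", two_consonant_sequences)
print("example conjunct sequence:", consonants[0] + VIRAMA + consonants[-3])
```

Even before conjuncts take their own vowel modifiers, the two counts together exceed 1,000, which is consistent in order of magnitude with the figure Dholakia cites.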
MSU is part of a pioneering national-level consortium developing OCR technology for Indian languages, along with institutes such as IIT Delhi, the Indian Institute of Science (IISc) Bangalore, the Indian Statistical Institute (ISI) Kolkata, IIIT Hyderabad and C-DAC. MSU is responsible for developing the technology for the Gujarati script. For the first time, a large corpus of annotated images from 25 printed Gujarati books is being prepared for training and testing the OCR system. The study is part of a major project funded by the Ministry of Communications and Information Technology (MCIT), Government of India.