Natural language processing (NLP) is a key technology for highly computerized communities. To tackle this important technology, we started our collaboration project on NLP research in 1996. The Communications Research Laboratory (CRL) of the Japanese Ministry of Posts and Telecommunications and the National Electronics and Computer Technology Center (NECTEC) of Thailand originally started this joint project, with support from the Electrotechnical Laboratory (ETL) of Japan and Kasetsart University of Thailand.
The "Research and Development Cooperation Project on a Machine Translation System for Japan and Neighboring Countries", known as the Multi-lingual Machine Translation Project (MMT project), began in the 1987 Japanese fiscal year and continued through the 1992 fiscal year, followed by a two-year follow-up program. Five countries took part in the project: Japan, Thailand, China, Indonesia and Malaysia. Our collaboration project is a small successor of this project.
We wanted to continue collaboration on NLP among these countries, and we started with the collaboration between Japan and Thailand. We chose this pair because Thai and Japanese each use a script peculiar to the language, and neither language places delimiters between words. However, the languages themselves, e.g. their grammar rules and other linguistic phenomena, are completely different. We believe we can obtain very interesting insights from this kind of bilingual collaboration.
What is the most significant result of the MMT project for future research on NLP? The MMT project ended having developed a prototype of its multilingual machine translation system. Its tools for NLP and its linguistic data, especially corpora, are reusable by other researchers and in other projects. However, the most significant result of the MMT project was that it trained researchers in these Asian countries, including Japan, in NLP research.
Therefore, we decided to develop a tagged Thai corpus as the starting point of our collaboration project, and we also focus on technological and personnel exchange between Japan and Thailand (see Fig. 1).

The Linguistic and Knowledge Science Laboratory (LINKS) of NECTEC and the Kansai Advanced Research Center (KARC) of CRL are developing a tagged corpus for Thai, named the ORCHID corpus. The corpus is tagged with LINKS' original part-of-speech (POS) tagset, which is an improved version of the tagset used in the MMT system. The ORCHID corpus consists of about 2MB (or about 400K words) of text from the proceedings of the NECTEC annual conference. It is scheduled to be released at the end of 1997.
CRL is doing research on automatic POS tagging technology using neural networks and on automatic extraction of linguistic knowledge from tagged corpora. NECTEC is performing research on natural language processing of Thai and is preparing Thai linguistic resources for the study of language processing, with the aim of developing either a complete machine translation system or applications of natural language processing. ETL is developing a multilingual editor, Mule, while Kasetsart University is working on corpus development tools.
Hitoshi ISAHARA
Kansai Advanced Research Center, Communications Research Laboratory
Ministry of Posts and Telecommunications