Harmonized string-based and unification-based methodology for text mining and processing

  • Keselj, Vlado (PI)

Project: Research project

Project details

Description

The importance of information availability and the Internet is evident in many areas, from the personal level to society as a whole, including the economy, health care, and scientific research. General search engines often provide satisfactory document and site retrieval from short ad-hoc queries, but an enormous amount of time-consuming manual effort is still required to gather data, filter it, organize it, and use it in the tasks a user actually wants to accomplish. We propose to develop three core natural language processing methodologies that will make a strong contribution to solving this information management problem. Besides the theoretical results, we develop several tools that implement the designed solutions, and we apply these methodologies and tools to specific application areas.

Our focus is on methods for (1) Common N-gram analysis, (2) regular-expression-based and finite-state processing, and (3) unification-based processing and matching of typed feature structures, with an effort to harmonize these fairly different techniques. The developed software systems include tools for n-gram-based text mining (Ngrams.pm and Swordfish); Starfish, a text preprocessing and embedded programming tool; Jellyfish, a question answering system; Stefs, a system for typed feature structures in Java; and Shrack, a peer-to-peer system for scientific information dissemination. The application areas include text mining, authorship attribution, web usage and web content mining, detection of dementia of the Alzheimer type from spontaneous speech, phylogenetic tree generation, malicious code detection, biomedical semantic text annotation, and knowledge management in eScience.

The n-gram model, regular expressions, and the unification-based approach are well-known concepts in NLP. The novelty of our approach lies in the specific methodologies developed on top of these models: specific profiling and distance functions used with n-grams, iterative regular expression substitutions, and modifications to classical unification, such as relaxed unification.
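To make the idea of n-gram profiling and distance functions concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from Ngrams.pm or Swordfish. It assumes a profile is the set of the most frequent character n-grams with their normalized frequencies, and it assumes a relative-difference dissimilarity of the kind commonly used in Common N-gram analysis. The function names `ngram_profile` and `profile_dissimilarity` and the parameters `n` and `size` are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, size=5):
    """Return the `size` most frequent character n-grams of `text`,
    mapped to their normalized frequencies (assumed profile definition)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # most_common() on an empty Counter yields nothing, so no division occurs
    return {gram: count / total for gram, count in counts.most_common(size)}


def profile_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over the union of
    two profiles (assumed distance; an n-gram missing from a profile
    counts as frequency 0)."""
    total = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because the n-gram occurs in at least one profile
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


if __name__ == "__main__":
    known = ngram_profile("the quick brown fox jumps over the lazy dog")
    disputed = ngram_profile("the lazy dog sleeps under the quick brown fox")
    print(profile_dissimilarity(known, disputed))
```

In an authorship attribution setting, a disputed text would be assigned to the candidate author whose profile gives the smallest dissimilarity. Each n-gram that appears in only one profile contributes the maximum value of 4 to the sum, so the profile size strongly affects the result.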

Status: Active
Start date/End date: 1/1/10 → …

Funding

  • Natural Sciences and Engineering Research Council of Canada: US$ 20,392.00

ASJC Scopus Subject Areas

  • Clinical Neurology
  • Development