Project Details
Description
This research proposal aims to advance the state of the art in natural language processing at three levels in order to meet demands for better information processing.
At the lowest level, character and word n-gram processing, our objectives are to improve n-gram based text mining through the use of variable-length n-gram profiles; to improve n-gram based visual text analytics through visualization of n-gram profiles and the corresponding Eulerian graphs; to compare the current CNG distance measure with other measures (e.g., Jaccard, Dice) at a deeper model level; to use Google N-grams data to improve the standard language n-gram profiles; and to adapt the Normalized Google Distance into a distance that can be computed off-line.
At the middle level of processing (RegEx based), we will advance the development of regular expression patterns for directed sentiment analysis and parsing of noisy text by examining ways to generate RegEx-based patterns, generating patterns from Google N-grams data, and extending the Starfish system for text-embedded processing.
At the third level, the unification level, our objectives are to transfer the sub-graph isomorphism technique from analysis in the biomedical scientific domain to information gathering from social media, to generate concept semantic relationships from Wikipedia data, and to develop semantic-based visualization of streamed textual data, such as visualization of e-mail streams.
Our approach is based on previous work at these three levels of language processing: (1) Common N-Gram analysis (CNG), where the text data is modelled using character n-gram profiles; (2) Regular Expression based processing of textual data, based on applying RegEx rewriting patterns and matching the data with similar patterns; and (3) at the unification level, information extraction and matching using unification or sub-graph isomorphism, where the structural data itself is generated by parsing with stochastic unification-based grammars.
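As an illustration of the n-gram level, the following is a minimal sketch of character n-gram profiles and three profile comparisons. It assumes the commonly cited form of the CNG dissimilarity (a sum of squared relative frequency differences over the n-grams of both profiles) and set-based Jaccard and Dice distances over the n-grams each profile contains. The function names, the default n-gram length, and the profile size are hypothetical choices for illustration, not settings taken from the project.

```python
from collections import Counter


def ngram_profile(text, n=3, size=1000):
    """Map the `size` most frequent character n-grams to normalized frequencies.

    The defaults n=3 and size=1000 are illustrative assumptions.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).most_common(size)}


def cng_distance(p1, p2):
    """CNG-style dissimilarity: sum of squared relative frequency differences.

    An n-gram missing from a profile is treated as having frequency 0.
    """
    total = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def jaccard_distance(p1, p2):
    """1 - |A ∩ B| / |A ∪ B| over the n-gram sets of the two profiles."""
    a, b = set(p1), set(p2)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


def dice_distance(p1, p2):
    """1 - 2|A ∩ B| / (|A| + |B|) over the n-gram sets of the two profiles."""
    a, b = set(p1), set(p2)
    if not a and not b:
        return 0.0
    return 1 - 2 * len(a & b) / (len(a) + len(b))
```

Under this formulation, identical profiles have distance 0 for all three measures, and each n-gram that appears in only one profile contributes exactly 4 to the CNG sum. The set-based measures ignore frequencies entirely, which is one reason a comparison with the frequency-aware CNG measure is of interest.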
The novelty and expected significance of the approach lie in improving the methodology to support visual text analysis, i.e., visualization and closer interaction with the user, and in adapting and developing the methodology for new kinds of textual data and novel applications arising from the growth of Internet data and social media. This significance is supported by strong interest from industrial partners in summarized analysis of social media data.
Status | Active |
---|---|
Effective start/end date | 1/1/23 → … |
Funding
- Natural Sciences and Engineering Research Council of Canada: US$41,500.00
ASJC Scopus Subject Areas
- Artificial Intelligence
- Information Systems