Graph-based total recall information retrieval on text document corpora

Lopes, Alneu De Andrade (PI)
Milios, Evangelos E. (CoPI)

Dalhousie University

Project: Research project

Description

In the real world, text is the common format for storing information. Automated techniques that help to group documents, extract topics, and classify textual documents, while minimizing the need for human intervention, therefore remain a worthwhile research topic. In this context, the Brazilian and Canadian groups have developed a number of techniques related to network-based text mining to complement the traditional vector space model for representing textual corpora. More specifically, these techniques represent textual collections as networks of terms and documents. Algorithms that use a graph representation have several advantages, since a graph representation: (1) avoids sparsity and ensures low memory consumption; (2) enables an optimal description of the topological structure of a dataset and associated operations; (3) provides local and global statistics of the dataset's structure; and (4) allows extracting patterns that are not extracted by algorithms based on the vector space model (Breve et al., 2012). Using such representations, both groups have developed a number of techniques for supervised, unsupervised, and semi-supervised learning. The Brazilian group's methods are based on information propagation in bipartite networks and can be applied to different domains. In textual domains, in which a collection of documents may be represented by a document-term bipartite network, the proposals range from text classification to soft clustering, including semi-supervised classification and topic extraction. The counterpart Canadian team is involved in a major ongoing project on total recall information retrieval (IR) in large noisy text datasets, funded by NSERC and Boeing Canada. A different project, which received funding from the Digging into Data program until late 2015 and continues under NSERC Discovery grant funding, addresses total recall IR on a large corpus of biodiversity heritage text.
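To make the idea of information propagation in a document-term bipartite network concrete, the following is a minimal sketch only. It is not the algorithm of either group: the function names (`build_bipartite`, `propagate`), the toy documents, and the simple averaging rule are all assumptions chosen for illustration. Class scores flow from labeled documents to the terms they contain, and from those terms back to unlabeled documents.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Build both sides of a document-term bipartite network (hypothetical helper)."""
    doc_terms = {d: set(text.lower().split()) for d, text in docs.items()}
    term_docs = defaultdict(set)
    for d, terms in doc_terms.items():
        for t in terms:
            term_docs[t].add(d)
    return doc_terms, dict(term_docs)


def propagate(doc_terms, term_docs, seeds, classes, iterations=10):
    """Propagate class scores doc -> term -> doc, keeping labeled seeds fixed.

    Assumption: plain averaging over neighbours; real methods use other
    weighting and convergence criteria.
    """
    uniform = {c: 1.0 / len(classes) for c in classes}
    doc_scores = {}
    for d in doc_terms:
        if d in seeds:
            doc_scores[d] = {c: 1.0 if c == seeds[d] else 0.0 for c in classes}
        else:
            doc_scores[d] = dict(uniform)

    for _ in range(iterations):
        # Each term takes the average score of the documents it occurs in.
        term_scores = {
            t: {c: sum(doc_scores[d][c] for d in ds) / len(ds) for c in classes}
            for t, ds in term_docs.items()
        }
        # Each unlabeled document takes the average score of its terms.
        for d, terms in doc_terms.items():
            if d in seeds or not terms:
                continue
            doc_scores[d] = {
                c: sum(term_scores[t][c] for t in terms) / len(terms)
                for c in classes
            }

    return {d: max(classes, key=lambda c: doc_scores[d][c]) for d in doc_terms}


# Toy corpus (invented for illustration): two labeled and two unlabeled documents.
docs = {
    "d1": "graph network node edge",
    "d2": "species plant flower leaf",
    "d3": "graph node",
    "d4": "plant leaf",
}
seeds = {"d1": "graphs", "d2": "biology"}
doc_terms, term_docs = build_bipartite(docs)
labels = propagate(doc_terms, term_docs, seeds, ["graphs", "biology"])
```

In this toy setting the unlabeled documents inherit the class of the labeled document with which they share terms. The same structure extends to soft clustering by reading the per-class scores instead of taking only the top class.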
As a motivating practical problem, this project also aims to expand the functionality and utility of the Biodiversity Heritage Library (BHL) [BHL], a digital library of over 170 thousand volumes and 49 million pages of biodiversity literature, dating from the 16th century, openly available to the global biodiversity community. The collaboration between the two teams will aim for novel approaches so that each team can improve its knowledge and usage of the strategies, techniques and tools employed by the other, in the context of total recall IR for the BHL corpus. These opportunities will extend to the students working on these topics, who will experience international collaboration and internships at the partner institutions as part of their master's or doctoral projects. (AU)

Status	Finished
Effective start/end date	6/1/18 → 5/31/20

Funding

Fundação de Amparo à Pesquisa do Estado de São Paulo

ASJC Scopus Subject Areas

Artificial Intelligence
Physics and Astronomy(all)
Chemistry(all)
Mathematics(all)
Computer Science(all)

Access the project

https://bv.fapesp.br/en/auxilios/100538/graph-based-total-recall-information-retrieval-on-text-document-corpora/
