Project details
Description
Consider the tasks of organizing a collection of research papers for the purpose of writing a thesis; organizing the set of accepted papers at a conference into meaningful and coherent sessions; searching a corpus of customer-service incident reports to locate the cases most relevant to a new case at hand, along with their resolutions; or discovering novel treatments for diseases through implicit connections in the biomedical literature. A core problem underlying such tasks is that of semantic relatedness of documents. Semantic relatedness of documents should not be limited to the sharing of words, as two documents may be about the same topic yet use different vocabulary (for example, a medical document for experts versus a medical document for the layperson). Given a domain-specific corpus, topic models have been fit to documents and terms, leading to the representation of documents as instances of generative probabilistic models of mixtures of topics. Topic models require corpora and documents of sufficient size to be robust. In real life, documents may be short (e.g. titles or abstracts) and document corpora may contain a small number of documents (tens or hundreds instead of thousands), rendering topic models unreliable. The proposed research program will investigate semantic relatedness measures that are applicable to any domain and rely on readily available external knowledge sources, such as the Google n-gram corpus and Wikipedia. Organizing document collections into semantically coherent clusters has typically relied on bag-of-words document representations, with a focus more on mathematical sophistication than on the interpretability of the document representation by the user. In the proposed research program, we will seek algorithms and processes that support the human user in her sense-making process, helping her to interactively steer the document representation and clustering process to fit her objectives.
In collaboration with industrial partners, we will test the proposed methods in different applications of practical significance, such as interactive clustering of corporate document sets, automatic ranking of resumes against job ads, expertise mapping and matchmaking, paper referee assignment, and content-based recommendation of news to digital newspaper subscribers. A long-term objective is to support document-based discovery in the majority of scientific fields, which lack the sophisticated terminological and ontological resources currently available in the biomedical field.
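The limitation of word sharing noted in the description can be illustrated with a minimal sketch. It is not one of the project's methods: it only computes cosine similarity over raw bag-of-words counts, and the function name `bow_cosine` and the three example sentences are made up for illustration.

```python
import math
from collections import Counter


def bow_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split.
    """
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example texts: same topic, different vocabulary.
expert = "myocardial infarction following coronary artery occlusion"
layperson = "heart attack caused by a blocked blood vessel"
# Made-up example text: different topic, overlapping vocabulary.
unrelated = "heart of the city blocked by a parade"

print(bow_cosine(expert, layperson))     # 0.0: no shared words
print(bow_cosine(layperson, unrelated))  # about 0.5: four shared words
```

Under this representation, the two texts on the same medical topic score zero, while the unrelated text scores higher simply because it shares surface words. Relatedness measures that draw on external knowledge sources aim to close this kind of gap.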
Status | Active |
---|---|
Start date/End date | 1/1/17 → … |
Funding
- Natural Sciences and Engineering Research Council of Canada: US$ 33,118.00
ASJC Scopus Subject Areas
- Medicine (all)
- Information Systems
- Information Systems and Management
- Management Information Systems