Much of the work in the information processing and retrieval literature on mining social content focuses on probabilistic, frequency-based techniques that are sensitive to the lexico-syntactic structures used by users. Such methods are agnostic to the semantics of the content and instead look for recurrent discriminative patterns. Our work strongly advocates for methods that are cognizant of the semantics of the content being processed. To this end, we have developed techniques that automatically provide knowledge and semantic grounding for user-generated textual content, including domain-independent semantic entity linking techniques that ground textual content in well-established knowledge graphs such as DBpedia (with Wikipedia's 5+ million entities) and the Unified Medical Language System (UMLS, with 3+ million entities).
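Because entity linking is central to this line of work, the following minimal Python sketch illustrates the general shape of such techniques: candidates are generated from a surface-form index over a knowledge graph, then disambiguated by overlap with the surrounding context. The toy knowledge-graph entries, the `link` function, and the overlap-based scoring are illustrative assumptions, not the specific method described above.

```python
# Minimal sketch of dictionary-based entity linking: candidate generation
# from a surface-form index, then disambiguation by context-word overlap.
# The toy entries and the scoring rule are illustrative stand-ins, not
# the actual technique described in the text.

from collections import namedtuple

Entity = namedtuple("Entity", ["uri", "description"])

# Toy surface-form index: mention string -> candidate entities.
SURFACE_FORMS = {
    "jaguar": [
        Entity("dbpedia:Jaguar", "jaguar big cat felid americas predator"),
        Entity("dbpedia:Jaguar_Cars", "jaguar british car maker luxury vehicle"),
    ],
}

def link(mention, context):
    """Return the candidate whose description best overlaps the context."""
    candidates = SURFACE_FORMS.get(mention.lower(), [])
    if not candidates:
        return None
    context_words = set(context.lower().split())
    return max(
        candidates,
        key=lambda e: len(context_words & set(e.description.split())),
    )

print(link("Jaguar", "saw a Jaguar stalking prey in the Americas"))
# -> the big-cat entity wins: it shares "jaguar" and "americas" with the context
```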
The development of such semantic linking techniques has enabled (a) a collaboration with Women's College Hospital, where we have been investigating how the semantics of biomedical social data can be analyzed in contrast to peer-reviewed literature, and (b) joint work with St. Michael's Hospital, where we are improving knowledge synthesis processes based on the semantic interpretation of the medical literature.
We have also systematically extended a strong recurrent model for mapping UMLS and DBpedia entities onto each other; this is the first work to map these knowledge graphs at such a large scale. Furthermore, with the purpose of integrating semantics within social content, we have explored knowledge base-agnostic entity linking methods. By mining senses from text rather than searching an existing knowledge graph, this type of entity linking reduces the disambiguation search space.

Additionally, we have worked on implicit entity linking techniques within an ad hoc retrieval framework to identify the central concept of a short, informal, user-generated text that lacks an explicit clue; for example, an implicit entity linking model would interpret a tweet saying "I wish my phone wasn't bent" as referencing an iPhone 6. Recently, we have studied how different features within a learning-to-rank framework can be used to perform implicit entity linking effectively (a toy sketch follows below). Implicit entity linking makes information about implied subjects accessible even when an explicit reference is missing; for example, 40% of tweets about books contain implicit references but do not mention the book itself.

Finally, we have built techniques that perform open information extraction for relation identification in textual content, based on both grammatical clause patterns and feature-enhanced matrix factorization. Such work enables the extraction of semantically meaningful relations from textual content, such as relations expressed in social user-generated text.
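To make the implicit entity linking idea concrete, the hedged sketch below ranks candidate entities for the "bent phone" tweet using two simple features combined by a linear scorer. The candidate vocabularies, the feature set, and the hand-set weights are invented stand-ins; an actual learning-to-rank model would learn such weights from training data rather than fixing them by hand.

```python
# Hedged sketch of implicit entity linking as candidate ranking. Real
# systems learn feature weights with a learning-to-rank model; here the
# weights and the entity context vocabularies are made-up stand-ins.

CANDIDATES = {
    "iPhone 6": {"phone", "bent", "bend", "apple", "ios", "bendgate"},
    "Galaxy S5": {"phone", "samsung", "android", "battery"},
}

WEIGHTS = {"overlap": 1.0, "specificity": 0.5}  # stand-ins for learned weights

def features(tweet_words, entity_words):
    """Compute toy ranking features for one (tweet, candidate) pair."""
    overlap = tweet_words & entity_words
    return {
        "overlap": len(overlap),
        # Reward matches against smaller, more entity-specific vocabularies
        # ("bent" for the iPhone 6) over generic ones ("phone").
        "specificity": sum(1.0 / len(entity_words) for _ in overlap),
    }

def rank(tweet):
    words = set(tweet.lower().replace("'", " ").split())
    scored = {
        entity: sum(WEIGHTS[f] * v for f, v in features(words, vocab).items())
        for entity, vocab in CANDIDATES.items()
    }
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(rank("I wish my phone wasn't bent"))
# "iPhone 6" ranks first: it matches both "phone" and the telling word "bent".
```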
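In the same spirit, the toy sketch below illustrates clause-pattern open information extraction: it matches the simplest subject-verb-object clause pattern over POS-tagged tokens and emits a relation triple. The pre-tagged input and the single pattern are simplifying assumptions; real clause-pattern extractors handle many more clause types and rely on an actual tagger or parser.

```python
# Toy sketch of clause-pattern open information extraction: given
# POS-tagged tokens, match the simplest SVO clause pattern
# (NOUN, VERB, NOUN) and emit a relation triple. The pre-tagged
# sentence stands in for a real POS tagger.

TAGGED = [("Aspirin", "NOUN"), ("reduces", "VERB"), ("fever", "NOUN")]

def extract_svo(tagged):
    """Yield (subject, relation, object) triples from NOUN VERB NOUN runs."""
    for i in range(len(tagged) - 2):
        (subj, st), (verb, vt), (obj, ot) = tagged[i:i + 3]
        if st == "NOUN" and vt == "VERB" and ot == "NOUN":
            yield (subj, verb, obj)

print(list(extract_svo(TAGGED)))  # [('Aspirin', 'reduces', 'fever')]
```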