Word Sense Disambiguation


About

Word Sense Disambiguation (WSD) is the task of automatically identifying the intended sense (or concept) of an ambiguous word based on the context in which the word is used. In our work, the set of possible meanings for a word is defined by the Concept Unique Identifiers (CUIs) associated with that term in the Unified Medical Language System (UMLS). Thus, when performing WSD of biomedical terms, our more specific goal is to assign a term one of its possible CUIs based on its surrounding context. For example, the term cold could refer to the temperature (C0009264) or the common cold (C0009443), depending on the context in which it occurs.
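
To make the task concrete, here is a minimal sketch of WSD as a function from a context to a CUI, using the cold example. It relies on a simple word-overlap heuristic chosen for illustration only; it is not the method used in this project. The two CUIs come from the text above, while the signature word lists and the function names (tokenize, disambiguate) are hypothetical and are not taken from the UMLS.

```python
import re

# Candidate CUIs for the ambiguous term "cold" (from the text above).
# The signature words are hand-written assumptions for illustration,
# not content drawn from the UMLS.
SENSE_SIGNATURES = {
    "C0009264": {"temperature", "weather", "exposure", "freezing", "water", "winter"},
    "C0009443": {"cough", "virus", "infection", "symptoms", "nose", "throat", "caught"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(context, signatures=SENSE_SIGNATURES):
    """Return the CUI whose signature shares the most words with the context.

    Ties are broken by taking the first CUI in sorted order.
    """
    words = tokenize(context)
    return max(sorted(signatures), key=lambda cui: len(words & signatures[cui]))


print(disambiguate("The patient caught a cold and has a cough and a runny nose."))
print(disambiguate("Hands were immersed in cold water at a temperature of 4 degrees."))
```

With these toy signatures, the first context is assigned C0009443 (common cold) and the second C0009264 (temperature). A realistic system would replace the hand-written signatures with information learned from data or drawn from a knowledge source.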

Automatically identifying the intended concept of ambiguous words improves the performance of clinical and biomedical applications such as medical coding and indexing for quality assessment, cohort discovery, and other secondary uses of data. These capabilities are becoming essential because of the growing amount of information available to researchers, the transition of health care documentation towards electronic health records, and the push for quality and efficiency in health care.

In this work, we are exploring three types of methods: supervised, unsupervised, and knowledge-based. Supervised methods use machine learning algorithms (e.g., SVMs, Naive Bayes) to learn from manually tagged training data; unsupervised methods rely on the distributional characteristics of the terms in large unannotated corpora; and knowledge-based methods use information from an external knowledge source.
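
As an illustration of the supervised setting, the sketch below trains a small bag-of-words Naive Bayes classifier with add-one smoothing on a handful of manually tagged contexts for the term cold. The training sentences, the function names (train, predict), and the smoothing choice are assumptions made for this example; it is not the implementation found in CuiTools.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens as a list."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count senses and per-sense words from (context, cui) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, cui in examples:
        sense_counts[cui] += 1
        for word in tokenize(context):
            word_counts[cui][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab


def predict(model, context):
    """Return the CUI with the highest Naive Bayes log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_cui, best_score = None, float("-inf")
    for cui in sorted(sense_counts):
        score = math.log(sense_counts[cui] / total)  # log prior
        denom = sum(word_counts[cui].values()) + len(vocab)
        for word in tokenize(context):
            if word in vocab:  # words unseen in training are skipped
                # add-one (Laplace) smoothed log likelihood
                score += math.log((word_counts[cui][word] + 1) / denom)
        if score > best_score:
            best_cui, best_score = cui, score
    return best_cui


# Toy, hand-made training data (hypothetical, for illustration only).
TRAIN = [
    ("she has a cold with cough and sore throat", "C0009443"),
    ("common cold symptoms include a runny nose", "C0009443"),
    ("exposure to cold weather caused numbness", "C0009264"),
    ("samples were stored at cold temperature", "C0009264"),
]

model = train(TRAIN)
print(predict(model, "cold with runny nose and cough"))
print(predict(model, "stored in cold weather"))
```

On this toy data the first context is labelled C0009443 and the second C0009264. Unsupervised and knowledge-based methods differ in where the evidence comes from (unannotated corpora or an external knowledge source) rather than in the final step of choosing the best-scoring CUI.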

Software

  • CuiTools -- A freely available suite of Perl programs for supervised and unsupervised WSD experiments.
  • UMLS-SenseRelate -- A freely available suite of Perl programs for exploring the use of semantic similarity and relatedness between UMLS concepts to disambiguate terms in biomedical text.

Data

  • NLM-WSD dataset
  • MSH-WSD dataset
  • Abbrev dataset
  • Conflate dataset

Publications

  • Challenges and Practical Approaches with Word Sense Disambiguation of Acronyms and Abbreviations in the Clinical Domain. Sungrim Moon, Bridget T. McInnes, and Genevieve B. Melton. Healthcare Informatics Research. 2015; 21(1):35-42.
  • Determining the Difficulty of Word Sense Disambiguation. Bridget T. McInnes and Mark Stevenson. Journal of Biomedical Informatics. 2014 Feb; 47:83-90.
  • Evaluating Measures of Semantic Similarity and Relatedness to Disambiguate Terms in Biomedical Text. Bridget T. McInnes and Ted Pedersen. Journal of Biomedical Informatics. 2013 December; 46(6):1116-24.
  • Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. Bridget T. McInnes, Ted Pedersen, Ying Liu, Serguei Pakhomov, and Genevieve B. Melton. Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA). Oct. 2011, Washington DC.
  • Exploiting MeSH Indexing in MEDLINE to Generate a Data Set for Word Sense Disambiguation. Antonio Jimeno-Yepes, Bridget T. McInnes and Alan R. Aronson. BMC Bioinformatics. 2011 Jun 2;12(1):223.
  • Using Second-order Vectors in a Knowledge-based Method for Acronym Disambiguation. Bridget T. McInnes, Ted Pedersen, Ying Liu, Serguei Pakhomov, and Genevieve B. Melton. Appears in the Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), June 23-24, 2011, pp. 145-153, Portland, Oregon.
  • Collocation Analysis for UMLS Knowledge-based Word Sense Disambiguation. Antonio Jimeno-Yepes, Bridget T. McInnes and Alan R. Aronson. BMC Bioinformatics. 2011, 12(Suppl 3):S4.
  • Supervised and Knowledge-based Methods for Disambiguating Terms in Biomedical Text using the UMLS and MetaMap. Bridget T. McInnes. Doctor of Philosophy Dissertation, Department of Computer Science, University of Minnesota, Twin Cities, September, 2009.
  • An Unsupervised Vector Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline. Bridget T. McInnes. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL-SRW) 2008.
  • Using UMLS Concept Unique Identifiers (CUIs) for Word Sense Disambiguation in the Biomedical Domain. Bridget T. McInnes, Ted Pedersen, and John Carlis. In Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA), pages 533-37, Nov. 2007, Chicago, IL.

Presentations

  • Right Arm and Right Atrium: How to distinguish between the two. Institute of Health Informatics Seminar Series, University of Minnesota, March 2011.
  • Representing Meaning in Unsupervised WSD. Bridget T. McInnes. National Library of Medicine's Brown Bag Series. September 2008.






Last modified 25/08/2014