Collocations are multiword units whose meaning as a whole differs from the meanings of their individual words taken separately. Some examples of collocations are: "a little bit", "United States of America", and "school bus". Each is a string of words that represents a single concept, while its individual components represent different concepts. The goal of this work is to develop methods to identify collocations in raw text. In this work, we explored using measures of association.
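To make the idea of a measure of association concrete, here is a minimal sketch of the standard log-likelihood ratio (G^2) for 2-word sequences, written in Python. It is an illustration only and is not code from the Ngram Statistics Package; the function names `log_likelihood` and `score_bigrams` are hypothetical. The modules listed below apply the measure to 3-, 4-, and 5-grams, which this sketch does not attempt.

```python
import math
from collections import Counter


def log_likelihood(n11, n1p, np1, npp):
    """G^2 score for a bigram (w1, w2) from its 2x2 contingency table.

    n11: number of times the bigram (w1, w2) occurs
    n1p: number of bigrams with w1 in the first position
    np1: number of bigrams with w2 in the second position
    npp: total number of bigrams in the sample
    """
    # Observed counts for the four cells of the table.
    observed = [
        n11,                    # w1 followed by w2
        n1p - n11,              # w1 followed by something else
        np1 - n11,              # something else followed by w2
        npp - n1p - np1 + n11,  # neither w1 first nor w2 second
    ]
    # Counts expected if the two words occurred independently.
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def score_bigrams(tokens):
    """Return a dict mapping each bigram in `tokens` to its G^2 score."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count
    total = sum(bigrams.values())
    return {
        (w1, w2): log_likelihood(count, first[w1], second[w2], total)
        for (w1, w2), count in bigrams.items()
    }
```

Under this sketch, bigrams whose words co-occur far more often than independence would predict receive high scores and can be ranked as collocation candidates. A score near zero means the observed counts are close to what independence predicts.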


  • Ngram Statistics Package (NSP)
  • LogLikelihood Module for 3-grams
  • LogLikelihood Module for 4-grams
  • LogLikelihood Module for 5-grams
  • LogLikelihood Modeling Module for 3-grams
  • LogLikelihood Modeling Module for 4-grams
  • The Ngram Statistics Package (Text::NSP) - A Flexible Tool for Identifying Ngrams, Collocations, and Word Associations. Ted Pedersen, Satanjeev Banerjee, Bridget T. McInnes, Saiyam Kohli, Mahesh Joshi, and Ying Liu. Appears in the Proceedings of Multiword Expressions: from Parsing and Generation to the Real World (MWE), an ACL HLT 2011 Workshop. June 23, 2011, pp. 131-133, Portland, Oregon. (Demonstration System).
  • Extending the Log Likelihood Measure to Improve Collocation Identification. Bridget Thomson McInnes. Master of Science Thesis. Department of Computer Science, University of Minnesota, Duluth, December, 2004.

