The IdiomSearch Project
The IdiomSearch Project (beta) lets linguists, students, translators and other language professionals test a new algorithm for extracting most set phrases from any text. Set phrases are understood here in the broadest possible sense, covering the whole spectrum of phraseology from simple collocations and proper nouns (named entities) to idioms and proverbs. For the theoretical background and a description of the algorithm, see:
Colson, Jean-Pierre (2016). Set phrases around globalization: an experiment in corpus-based computational phraseology. In: F. Alonso Almeida, I. Ortega Barrera, E. Quintana Toledo & M.E. Sánchez Cuervo (eds), Input a Word, Analyze the World. Selected Approaches to Corpus Linguistics. Newcastle, Cambridge Scholars Publishing, p. 141-152.
Colson, Jean-Pierre (2017). The IdiomSearch Experiment: Extracting Phraseology from a Probabilistic Network of Constructions. In: R. Mitkov (ed.), Computational and Corpus-Based Phraseology, Lecture Notes in Artificial Intelligence 10596. Cham, Springer International Publishing, p. 16-28.
Colson, Jean-Pierre (2018). From Chinese Word Segmentation to Extraction of Constructions: Two Sides of the Same Algorithmic Coin. In: A. Savary, C. Ramisch, J. D. Hwang, N. Schneider, M. Andersen, S. Pradhan & M. R. L. Petruck (eds), Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), 25-26 August, Santa Fe (New Mexico, USA). Association for Computational Linguistics, p. 41-50 (ISBN: 978-1-948087-51-3).
Usage and interpretation of the results
Type or paste a text and select its language. Chinese texts should be in Mandarin Chinese written in simplified characters. Click on "Start search" to start the extraction of phrases, which relies on the reference corpora. The processing time may be slightly longer for Chinese.
As explained in Colson (2016), the corpora used for the extraction of phrases are web corpora of about 200 million tokens. Therefore, less common phrases may be absent from the results. The results will inevitably include many lexical phrases (compounds) and grammatical phrases. As always in the automatic extraction of phraseology, the results also contain some noise (false positives), estimated at about 10 percent.
The methodology and the tool presented here should be considered work in progress. There will also be more noise or errors for Chinese, French and Spanish than for English, simply because the reference corpora that were specially assembled for this experiment are web corpora. On the web, text encoding remains an issue for computational linguistics: some web pages are wrongly encoded, declaring for instance that they are in Unicode when this is not (quite) the case. This inevitably causes some errors in the assembled web corpora, and these errors may also appear in the results.
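The encoding problem can be illustrated with a minimal Python sketch. It is not the pipeline used to build the IdiomSearch corpora; the function name `decodes_as_declared` and the sample strings are hypothetical, and the check simply tests whether the raw bytes of a page decode cleanly under the encoding the page declares.

```python
def decodes_as_declared(raw: bytes, declared: str = "utf-8") -> bool:
    """Return True if the bytes decode without error under the declared encoding."""
    try:
        raw.decode(declared, errors="strict")
        return True
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: invalid byte sequence for this encoding.
        # LookupError: the declared encoding name is unknown to Python.
        return False


# Hypothetical pages: both claim to be UTF-8, only the first one really is.
good_page = "phraséologie".encode("utf-8")
bad_page = "phraséologie".encode("latin-1")  # Latin-1 bytes mislabeled as UTF-8

print(decodes_as_declared(good_page))
print(decodes_as_declared(bad_page))
```

A strict decode only catches byte sequences that are invalid in the declared encoding; pages that decode without error but into the wrong characters would need a different kind of check.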
Clicking on the Meter button makes it possible to check (for English only) the association score (cpr-score, see Colson 2016, 2017, 2018) and the frequency (in a 1.4 billion word corpus) of any n-gram, from bigrams to 10-grams. A frequency of 1,000 means that the number of occurrences in the corpus is at least 1,000.
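As a rough illustration of what an n-gram frequency lookup involves, the following Python sketch counts every n-gram from bigrams to 10-grams in a toy token list. The function name `ngram_counts` and the toy corpus are hypothetical. The sketch does not compute the cpr-score, which is described in the publications cited above, and it says nothing about how the Meter itself is implemented.

```python
from collections import Counter


def ngram_counts(tokens, n_min=2, n_max=10):
    """Count all n-grams of length n_min..n_max in a list of tokens."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus (an assumption for illustration only).
toy_corpus = "by and large the plan worked by and large".split()
counts = ngram_counts(toy_corpus)

print(counts[("by", "and", "large")])
print(counts[("the", "plan")])
```

On a corpus of more than a billion words, counts of this kind would have to be served from a precomputed index rather than recomputed for each query.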
The results page displays the set phrases identified in the input text. Some punctuation marks may have been deleted in the process. The colors correspond to the degree of fixedness according to the cpr-score (see Colson 2016), ranging from pale yellow to red. The legend shows an estimate of the relationship between fixedness and frequency in the reference corpus. Under the colored text appear the numbers of partly fixed, fixed and very fixed combinations, as well as the number of words (as tokens; for Chinese, the number of characters, or hanzi). The PW ratio is the number of phrases divided by the number of words (tokens). The PT ratio is a better indication of the proportion of phraseology in the text, because it is the percentage of words (tokens) that fall within the colored zones. In other words, a PT ratio of 45 percent means that only 55 percent of the words are not part of a set phrase.
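The two ratios can be made concrete with a short Python sketch. It assumes, for illustration only, that each extracted phrase is given as a (start, end) pair of token positions with the end excluded, and that a token covered by several overlapping phrases is counted once; the function name `phraseology_ratios` is hypothetical and is not part of IdiomSearch.

```python
def phraseology_ratios(n_tokens, phrase_spans):
    """Return (PW ratio, PT ratio in percent) for a tokenized text.

    phrase_spans: list of (start, end) token positions, end exclusive.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    covered = set()
    for start, end in phrase_spans:
        covered.update(range(start, end))  # tokens inside a colored zone
    pw = len(phrase_spans) / n_tokens       # phrases per token
    pt = 100.0 * len(covered) / n_tokens    # percentage of tokens in phrases
    return pw, pt


# A 20-token text with three phrases covering 3 + 2 + 4 = 9 tokens.
pw, pt = phraseology_ratios(20, [(0, 3), (5, 7), (10, 14)])
print(pw, pt)
```

In this example the PT ratio is 45 percent, so 55 percent of the tokens lie outside any set phrase, while the PW ratio only reflects that there are three phrases for twenty tokens.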
The association scores and frequencies for English, French and German are computed from the WaCKy corpora ukWaC, frWaC and deWaC, respectively. The Dutch scores are computed from the Araneum Nederlandicum Maius corpus. The other corpora were assembled from the web by means of the WebBootCat tool provided by the Sketch Engine.
The indexing system is optimized by a query likelihood model, implemented by The Lemur Project.
The meter is based on the RGraph5 Library, provided by rgraph.
ABOUT THE AUTHOR
Dr. Jean-Pierre Colson
Professor at the University of Louvain
the European Society for Phraseology Europhras