September 29, 2016

WordNet – open access data in linguistics

The biggest multilingual dictionary in existence today is openly available to anyone with an Internet connection. It is also a very important source of open data for linguists. In their guest post, María Álvarez de la Granja, Xosé María Gómez Clemente and Xavier Gómez Guinovart* recount the development of WordNet, a wonderful example of how open data can be useful to scholars, businesses and the general public. For more details on editing WordNets, have a look at a recent paper published in Open Linguistics.

Words and their associated concepts appear to be stored in our brain not randomly and independently, but organized into networks linking some words with others.

WordNet is an open-access lexical-semantic database that was created to reflect how vocabulary is organized in our mind. WordNet is therefore structured as a network in which every node is a concept and the lines connecting nodes are semantic relations between concepts, such as hyponymy, antonymy, part-whole, cause-effect, etc. Each concept in the network is represented by the group of synonyms that can express it. In WordNet terminology, each group of synonyms is a synset, and a synonym that forms part of a synset is a lexical variant of the same concept. For example, in the network above, which is built on information in WordNet, Word of God, Word, Scripture, Holy Writ, Holy Scripture, Good Book, Christian Bible and Bible make up the synset that corresponds to the concept ‘Bible’, and each of these forms is a lexical variant. WordNet also includes, for each synset, a brief definition or gloss of the meaning shared by the synset’s variants, in some cases with examples showing the use of the variants.
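The structure just described can be sketched in a few lines of Python. This is only a toy model, not WordNet's actual storage format or API: the class names `Synset` and `ToyWordNet`, the identifiers `bible` and `sacred_text`, the placeholder glosses and the single hypernym link are all assumptions made for illustration. Only the list of variants for ‘Bible’ comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """One concept node: its lexical variants, a gloss, and typed links."""
    synset_id: str
    variants: List[str]
    gloss: str = ""
    # relation name -> ids of the synsets it points to
    relations: Dict[str, List[str]] = field(default_factory=dict)


class ToyWordNet:
    """A tiny in-memory network of synsets (illustrative only)."""

    def __init__(self) -> None:
        self.synsets: Dict[str, Synset] = {}

    def add(self, synset: Synset) -> None:
        self.synsets[synset.synset_id] = synset

    def link(self, source_id: str, relation: str, target_id: str) -> None:
        """Draw a labelled edge from one concept to another."""
        source = self.synsets[source_id]
        source.relations.setdefault(relation, []).append(target_id)

    def lookup(self, word: str) -> List[Synset]:
        """Return every synset in which `word` appears as a variant."""
        needle = word.lower()
        return [s for s in self.synsets.values()
                if needle in (v.lower() for v in s.variants)]

    def related(self, synset_id: str, relation: str) -> List[Synset]:
        """Follow one kind of semantic relation from a synset."""
        targets = self.synsets[synset_id].relations.get(relation, [])
        return [self.synsets[t] for t in targets]


net = ToyWordNet()
net.add(Synset(
    "bible",
    ["Word of God", "Word", "Scripture", "Holy Writ", "Holy Scripture",
     "Good Book", "Christian Bible", "Bible"],
    gloss="(placeholder gloss, not copied from WordNet)",
))
net.add(Synset("sacred_text", ["sacred text"],
               gloss="(placeholder gloss, not copied from WordNet)"))
# Assumed link, added only to show how a relation is stored.
net.link("bible", "hypernym", "sacred_text")
```

Looking up any variant, for instance `net.lookup("Scripture")`, returns the same synset, which is the sense in which all eight forms are lexical variants of one concept.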

WordNet was originally developed for English, and although there are now versions in many languages, the English WordNet is still the most highly developed version and the point of reference at the present time. Since 1985, work on WordNet for English has been carried out at Princeton University, and was directed by George A. Miller until the time of his death in 2012.

WordNet was initially conceived from a psycholinguistic perspective. The Princeton team who started the project aimed to create a computer model of the way humans process the lexicon in the brain. As time passed, given the use that was being made of the English WordNet, it evolved into a project led by computer scientists whose main focus was on intelligent language processing. WordNet currently constitutes the most important computational resource for lexical semantics, particularly in the domain of natural language processing. It is used, for example, in automatic semantic disambiguation, i.e. automatic selection of the meaning to be attributed to a word or structure within a text; information retrieval, i.e. selection of textual information pertinent to users’ queries; automatic text classification, where documents are assigned to one or more thematic categories depending on their content; automated summarization; and so on.

The biggest dictionary

In addition, WordNet is a very important translation resource, for it represents the biggest multilingual dictionary in existence today in terms of the number of languages and the number of words and concepts covered. It is even utilized by Google Translate as a part of the translation process between languages which possess WordNets. At present there are versions of WordNet in various stages of development for an enormous range of languages. This includes not just the most widespread and best known ones such as French, German, Chinese, Spanish or Portuguese, but less well known and minority languages too, such as Albanian, Kannada, Croatian, Basque, Catalan and Galician. The WordNet for Galician (called Galnet) was created by the TALG group (Tecnoloxías e Aplicacións da Lingua Galega [Galician Language Technologies and Applications]) at the University of Vigo, with which two of the authors of this post are affiliated: Gómez Guinovart and Gómez Clemente.

Most WordNet versions follow the EuroWordNet model, with synsets expressing the same concept in different languages linked to each other through a unique ILI (Interlingual Index) for each concept. These synsets are mainly based on the English WordNet.

The example above shows how the English, Basque, Spanish, Galician, Catalan and Portuguese synsets corresponding to the concept ‘train,’ taken from the English WordNet, are linked. The way this works is by associating the synset in all six languages with the ILI ili-30-04468005-n. At present the Multilingual Central Repository (MCR, where the search shown was performed) only contains WordNets of these six languages. Nevertheless, clicking on this ILI will show that it gives access to translation equivalents in other languages using the same Interlingual Index too.

Thus, WordNet can be used to map the vocabularies of languages onto each other even when they would not otherwise be linked lexicographically. Galician and Polish, say, form a language pair for which there are no bilingual dictionaries to provide translation equivalents. Clearly, then, the usefulness of WordNet goes well beyond the interests of specialists in computational linguistics and natural language processing, by allowing any user to look up such lexicographical information.
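To make the pivoting idea concrete, here is a minimal Python sketch of a lookup that goes from a word in one language to its equivalents in another through the shared ILI. It is a toy under stated assumptions, not the MCR's interface: the `WORDNETS` table, the language codes, the variant lists and the function names `build_index` and `translate` are all invented for illustration, and the Polish entry simply stands in for a WordNet outside the MCR that uses the same index. Only the identifier ili-30-04468005-n is taken from the text.

```python
from collections import defaultdict

# language code -> ILI -> lexical variants (toy entries, assumed for illustration)
WORDNETS = {
    "gl": {"ili-30-04468005-n": ["tren"]},
    "en": {"ili-30-04468005-n": ["train", "railroad train"]},
    "pl": {"ili-30-04468005-n": ["pociąg"]},
}


def build_index(wordnets):
    """Map each (language, word) pair to the set of ILIs it can express."""
    index = defaultdict(set)
    for lang, synsets in wordnets.items():
        for ili, variants in synsets.items():
            for variant in variants:
                index[(lang, variant.lower())].add(ili)
    return index


def translate(word, source, target, wordnets=WORDNETS):
    """Return target-language variants that share an ILI with `word`.

    No bilingual dictionary is consulted: the only bridge between the
    two languages is the concept identifier they both point to.
    """
    index = build_index(wordnets)
    results = []
    for ili in sorted(index.get((source, word.lower()), ())):
        for variant in wordnets.get(target, {}).get(ili, []):
            if variant not in results:
                results.append(variant)
    return results


print(translate("tren", "gl", "pl"))
```

Because every language is linked only to the index, adding one more WordNet makes it reachable from all the others without any pairwise work.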

Challenges in the development of language versions of WordNet

Most synsets in different versions of WordNet are based on the English version and linked together. This offers some definite advantages, since each language version starts out with a network of previously defined and structured concepts (allowing different WordNets to be connected in this way); but it presents some problems too.

An obvious problem has to do with the fact that different languages do not always agree about how to look at and organize elements of reality, so there can be differences in how the corresponding concepts get lexicalized in them. For example, if WordNet were to be developed for the African language Wolof, it would be necessary to create special synsets not existing in the English version to cover the verb gaparou (‘to sit with one’s legs folded towards one side’) and other Wolof verbs designating equally specific ways of sitting, which are not expressed lexically in English.

But the most striking divergences show up where there are “cultural gaps” occasioned by elements of reality that may be unknown in a different culture that consequently possesses no way to designate the associated concept lexically. For instance, where are we to put the Galician word queimada ‘a characteristic Galician drink made by burning pomace spirit with sugar until much of the alcohol has been consumed, to which may be added coffee beans, pieces of fruit etc.’ in the Galician WordNet, unless a new synset is created, given that English (and more specifically the American culture in which WordNet was developed) lacks this concept, and as a result no such conceptual node was established?[1]

Language is “flexible”

Sometimes there are minor interlinguistic differences which, while not impeding the creation of a link between languages’ synsets through a shared ILI, produce results which may seem odd to users and generate some problems for machine translation. For example, if we look up sombrerería (‘shop selling hats’) in the Spanish WordNet and examine its equivalents in English, we shall find the Spanish word associated with hat shop and millinery together with the English gloss ‘shop selling women’s hats.’ Now the Spanish word is gender-neutral; it doesn’t specify the sex of the intended wearer of the hats sold. Again, if we look up the word carnicería ‘establishment that sells meat,’ we shall find it associated with the equivalents butcher shop and meat market along with an English gloss that says ‘a shop in which meat and poultry (and sometimes fish) are sold’,[2] yet in fact fish is never sold at a carnicería.

These examples show that there need not be a total equivalence between lexical variants in different languages that are linked through a common ILI. This “flexibility” may also be found between variants making up a synset in a given language: when several synonyms form part of a synset, they usually tend to be partial synonyms which are only truly interchangeable in a limited number of cases; they may differ conceptually, in register (the level of language in which they are used) or even expressively (such as where there is a contrast between neutral words and pejorative ones). In the English WordNet, for example, we find under the gloss ‘censure severely’ the verbs castigate, chasten, chastise, correct and objurgate, although these differ in their semantic nuances and register. An attempt has been made in some WordNets, including that of Galician, to incorporate markers (such as vulgar, colloquial etc.) to make up for a lack of information about register which may cause problems by leading to contextually inappropriate translations.
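One way to picture how such markers could help is the following Python sketch, in which each variant carries an optional register label and a translation step keeps only the variants whose register suits the context. Everything here is hypothetical: the `Variant` type, the `pick_variants` function and, in particular, the label "formal" on objurgate are assumptions made for illustration and are not taken from Galnet or any other WordNet.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Variant(NamedTuple):
    lemma: str
    register: Optional[str] = None  # None means unmarked (neutral)


# Variants of the synset glossed 'censure severely'.
# The "formal" marker is an invented example, not WordNet data.
CENSURE = [
    Variant("castigate"),
    Variant("chasten"),
    Variant("chastise"),
    Variant("correct"),
    Variant("objurgate", "formal"),
]


def pick_variants(
    variants: List[Variant],
    allowed_registers: FrozenSet[Optional[str]] = frozenset({None}),
) -> List[str]:
    """Keep only the lemmas whose register is acceptable in this context."""
    return [v.lemma for v in variants if v.register in allowed_registers]


print(pick_variants(CENSURE))
print(pick_variants(CENSURE, frozenset({None, "formal"})))
```

By default only unmarked variants are returned, so a marked variant is offered only when the caller explicitly allows its register.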

Shortcomings of the English WordNet

Other issues that need to be dealt with by WordNet creators arise from shortcomings of the English WordNet itself. Many phraseological units have no place in WordNet because the English WordNet includes few of them and in a very unsystematic manner. Thus, the Galician expressions mover ceo e terra (“move heaven and earth”), pechar filas (“close ranks”), matar o mensaxeiro (“kill the messenger”) or poñer a outra meixela (“offer the other cheek”) could not be included in Galnet because their equivalents, move heaven and earth, close ranks, kill the messenger, turn the other cheek, were not included in the English WordNet and the conceptual slots for them were not created.

There are also glosses in WordNet that are flawed in terms of lexicographical technique. Moreover, rather than a uniform pattern followed in all synsets, different types of definition are applied inconsistently. For example, the variants break, cave in, collapse, fall in, founder, give, give way are “defined” using another synonym that is barely explanatory (‘break down, literally or metaphorically’), whereas the variants break away, bunk, escape, fly the coop etc. are glossed by a list of three expressions (‘flee; take to one’s heels; cut and run’) which might just as well be lexical variants themselves of the same synset. Often, definitions do contain an explanation of sorts but one that is not sufficiently helpful or suitable to pin down the meaning, e.g. backbite, bitch ‘say mean things’; feminism ‘a doctrine that advocates equal rights for women’ (where the second term of the comparison has been omitted).

Inconsistency also plagues the examples. Sometimes none are given, and when they are, their number varies. In many cases the examples are not particularly useful (for example, feminist critique is given as an example to illustrate the entry feminist, but this does little to clarify either the meaning or the use of the adjective).

Hence creators of WordNets in different languages must decide whether to translate the glosses and examples from the English WordNet literally, thereby replicating the shortcomings but maintaining a parallel between languages, or to improve them by providing new definitions and examples.

However, such problems should not blind us to the relevance of WordNet as a monumental, open-access, multilingual dictionary that links languages from around the world, many of them minority languages, and so facilitates mutual understanding among many millions of people.

Main image: The result of a search for book on http://wordvis.com/

1 – It is possible to create new synsets besides those in the English WordNet, but not all versions implement this possibility on account of the complexity it involves.

2 – The Spanish WordNet does not provide glosses of its own in either of the above examples.

* – María Álvarez de la Granja works at the Instituto da Lingua Galega, Universidade de Santiago de Compostela; Xosé María Gómez Clemente at the Departamento de Filoloxía Galega e Latina, Universidade de Vigo; and Xavier Gómez Guinovart at the Departamento de Tradución e Lingüística, Universidade de Vigo.

Source: <http://openscience.com/>