Recent progress in applied machine learning has resulted in new methods for efficient induction of high-quality numerical representations of lexical semantics (word vectors) directly from text. These models implicitly learn a vector space representation of lexical relationships from co-occurrence statistics embodied in large volumes of naturally occurring text. Vector representations of semantics are of value to the language sciences in numerous ways: as hypotheses about the structure of human semantic representations (e.g., Chen et al., 2017); as tools to help researchers interpret behavioral (e.g., Pereira et al., 2016) and neurophysiological data (e.g., Pereira et al., 2018) and to predict human lexical judgements of, e.g., word similarity, analogy, and concreteness (see Methods for more detail); and as models that help researchers gain quantitative traction on large-scale linguistic phenomena, such as semantic typology (e.g., Thompson et al., 2018), semantic change (e.g., Hamilton et al., 2016), or linguistic representations of social biases (e.g., Garg et al., 2018), to give just a few examples.
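To illustrate how such representations are induced from co-occurrence statistics, the sketch below trains a skip-gram model with the gensim library. This is a minimal example under stated assumptions, not the pipeline used in the present study: the corpus file name, the probe word, and all hyperparameter values are illustrative.

```python
# Minimal sketch: inducing word vectors from co-occurrence statistics with
# gensim's Word2Vec in skip-gram mode. The corpus path, probe word, and
# hyperparameters are illustrative assumptions, not this study's pipeline.
from gensim.models import Word2Vec

# Hypothetical corpus file: one sentence or utterance per line.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the induced vectors
    window=5,         # size of the co-occurrence context window
    min_count=5,      # discard words attested fewer than 5 times
    sg=1,             # skip-gram: predict context words from the target word
)

# The induced vector space supports the kind of similarity queries used to
# model human judgements of word similarity (assumes "coffee" is in vocabulary).
print(model.wv.most_similar("coffee", topn=5))
```

Cosine similarity between vectors in the learned space is the quantity typically compared against human similarity ratings in evaluations of this kind.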
Progress in these areas is rapid, but nonetheless constrained by the availability of high-quality training corpora and evaluation metrics in multiple languages. To meet this need for large, multilingual training corpora, word embeddings are often trained on Wikipedia, sometimes supplemented with other text scraped from web pages. This has produced steady improvements in embedding quality across the many languages in which Wikipedia is available (see, e.g., Al-Rfou et al., 2013; Bojanowski et al., 2017; Grave et al., 2018).Footnote 1 These are large written corpora meant as repositories of knowledge, which has the benefit that even obscure words and semantic relationships are often relatively well attested. However, from a psychological perspective, these corpora may not represent the kind of linguistic experience from which people learn a language, raising concerns about psychological validity. The linguistic experience over the lifetime of the average person typically does not include extensive reading of encyclopedias. While word embedding algorithms do not necessarily reflect human learning of lexical semantics in a mechanistic sense, the semantic representations induced by any effective (human or machine) learning process should ultimately reflect the latent semantic structure of the corpus they were learned from. In many research contexts, a more appropriate training corpus would be one based on conversational data of the sort that represents the majority of daily linguistic experience.

However, since transcribing conversational speech is labor-intensive, corpora of real conversation transcripts are generally too small to yield high-quality word embeddings. Therefore, instead of actual conversation transcripts, we used television and film subtitles, since these are available in large quantities. That subtitles are a more valid representation of linguistic experience, and thus a better source of distributional statistics, was first suggested by New et al. (2007), who used a subtitle corpus to estimate word frequencies. Such subtitle-derived word frequencies have since been demonstrated to have better predictive validity for human behavior (e.g., lexical decision times) than word frequencies derived from various other sources (e.g., the Google Books corpus and others; Brysbaert & New, 2009; Keuleers et al., 2010; Brysbaert et al., 2011). The SUBTLEX word frequencies use the same OpenSubtitles corpus used in the present study (a minimal sketch of how such frequency norms are computed follows this paragraph). Mandera et al. (2017) previously used this subtitle corpus to train word embeddings in English and Dutch, arguing that the reasons for using subtitle corpora also apply to distributional semantics. While film and television speech could be considered only pseudo-conversational, in that it is often scripted and does not contain many disfluencies and other markers of natural speech, the semantic content of TV and movie subtitles reflects the semantic content of natural speech better than the commonly used corpora of Wikipedia articles or newspaper articles do. Additionally, the current volume of television viewing makes it likely that, for many people, television viewing represents a plurality or even the majority of their daily linguistic experience.
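As a concrete illustration of the subtitle-based frequency estimation discussed above, the following minimal sketch computes raw counts and frequency per million words from a plain-text subtitle corpus. The file name and the crude regular-expression tokenizer are simplifying assumptions; this is an illustration of the general technique, not the SUBTLEX pipeline itself.

```python
# Minimal sketch: SUBTLEX-style word frequencies from a subtitle corpus.
# "subtitles.txt" and the regex tokenizer are simplifying assumptions;
# this illustrates the general technique, not the actual SUBTLEX pipeline.
from collections import Counter
import re

counts = Counter()
total_tokens = 0
with open("subtitles.txt", encoding="utf-8") as f:
    for line in f:
        tokens = re.findall(r"[a-z']+", line.lower())  # crude tokenization
        counts.update(tokens)
        total_tokens += len(tokens)

# Frequency per million words, the unit conventionally reported in
# frequency norms used to predict measures such as lexical decision times.
freq_per_million = {w: n * 1_000_000 / total_tokens for w, n in counts.items()}

for word, fpm in sorted(freq_per_million.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{word}\t{fpm:.1f}")
```

In practice, frequency norms of this kind are usually log-transformed (e.g., log10 of frequency per million) before being used as predictors of behavioral measures.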