CollexRelatedWork
Add an annotated entry here about every tool/project we have found that is close to what we're doing. And add whatever structure you see that seems relevant. (categories of different kinds of systems, etc.)
Applications and Techniques:
____________________________________________________________
FrameNet
http://framenet.icsi.berkeley.edu/
This is a lot like WordNet. There are several articles about this project, and many more that refer to it. It currently covers only English (a Spanish version is coming) and has different aims from our project. It does have a cool way of mapping language. We may want to consider using it in addition to WordNet (it could possibly supplement WordNet's functionality). We might also want to look at their design for inspiration. (see picture)
____________________________________________________________
Markov model of POS tagging:
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
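To make the "Markov model" part concrete, here is a minimal sketch of hidden Markov model tagging with the Viterbi algorithm. The tags are the hidden states and the words are the observations. Every probability, the tiny tag set, and the UNSEEN smoothing constant are invented for illustration; a real tagger would estimate them from a tagged corpus.

```python
import math

# Toy model: all numbers below are made up for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {  # TRANS[previous_tag][tag]
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {  # EMIT[tag][word]
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walk": 0.7, "dog": 0.1, "park": 0.2},
}
UNSEEN = 1e-6  # crude stand-in for smoothing of unseen (tag, word) pairs


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        score = math.log(START[tag]) + math.log(EMIT[tag].get(words[0], UNSEEN))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], UNSEEN))
            best[i][tag] = max(
                (best[i - 1][prev][0] + math.log(TRANS[prev][tag]) + emit, prev)
                for prev in TAGS
            )
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["the", "dog", "walk"]))
print(viterbi(["the", "walk"]))
```

With these toy numbers, "walk" is tagged as a verb after a noun but as a noun directly after a determiner, which is the kind of context-based disambiguation described above.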
____________________________________________________________
EDR Electronic Dictionary
http://www2.nict.go.jp/r/r312/EDR/index.html
"The EDR Electronic Dictionary was developed for advanced processing of natural language by computers, and is composed of eleven sub-dictionaries. Sub-dictionaries include a concept dictionary, word dictionaries, bilingual dictionaries, etc. The EDR Electronic Dictionary is the result of a nine-year project (from fiscal 1986 to fiscal 1994) aimed at establishing an infrastructure for knowledge information processing. The project was funded by the Japan Key Technology Center and eight computer manufacturers."
"The EDR Electronic Dictionary is a machine-tractable dictionary that catalogues the lexical knowledge of Japanese and English (the Word Dictionary, the Bilingual Dictionary, and the Co-occurrence Dictionary), and has unified thesaurus-like concept classifications (the Concept Dictionary) with corpus databases (the EDR Corpus). The Concept Classification Dictionary, a sub-dictionary of the Concept Dictionary, describes the similarity relation among concepts listed in the Word Dictionary. The EDR Corpus is the source for the information described in each of the sub-dictionaries. The basic approach taken during the development of the dictionary was to avoid a particular linguistic theory and to allow for adoptability to various applications."
"The EDR Electronic Dictionary, thus developed, is believed to be useful in R&D of natural language processing and the next generation of knowledge processing systems. In addition, it will become part of an infrastructure that provides new types of activities in information services."
This is an article about it:
The EDR electronic dictionary
Source: Communications of the ACM
Volume 38, Issue 11 (November 1995)
Pages: 42-44
Year of Publication: 1995
ISSN: 0001-0782
Author: Toshio Yokoi, Philippine Software Development Institute, Quezon City, Philippines
Publisher: ACM, New York, NY, USA
"Natural language processing will grow into a vital industrial technology in the next five to 10 years. But this growth depends on the development of large linguistic databases that capture natural language phenomena [1, 2]. Another important theme for future work is development of large knowledge bases that are shared widely by different groups. One promising approach to such knowledge bases draws on natural language processing and linguistic knowledge. This article describes the EDR Electronic Dictionary [3], which seeks to provide a foundation for linguistic databases, and explains the relation of electronic dictionaries to very large knowledge bases."
____________________________________________________________
Articles:
____________________________________________________________
A FEATURE-BASED MODEL FOR LEXICAL DATABASES
Jean Veronis and Nancy Ide, 1992
- Talks about how the "classical" database models (especially relational) are not capable of correctly modeling the complexities of a dictionary
- Describes techniques like "recursive nesting" to model the multiple meanings listed in a dictionary entry for a word.
- Has a lot of pictures and explores some cool problems in modeling a language.
- A bit outdated in some ways
Verdict: Somewhat applicable to what we are doing. It addresses similar issues of how to model relationships between words in a very complex and accurate way, capturing subtleties of the language and of what words mean. It also discusses the limitations of the relational model (which applies to our project).
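A rough sketch of what "recursive nesting" could look like in code: senses contain sub-senses to any depth, and features stated at an outer level are shared by everything nested inside. The Sense class, the flatten helper, and the sample entry are all hypothetical and are not the notation used by Veronis and Ide.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One node of a dictionary entry; sub-senses nest recursively."""
    features: dict
    subsenses: list = field(default_factory=list)


def flatten(sense, inherited=None, label=""):
    """Yield (sense number, features) for each innermost sense.

    Features from outer levels are copied inward unless a nested sense
    overrides them.
    """
    merged = {**(inherited or {}), **sense.features}
    if not sense.subsenses:
        yield label or "1", merged
    for i, sub in enumerate(sense.subsenses, start=1):
        child = f"{label}.{i}" if label else str(i)
        yield from flatten(sub, merged, child)


# Hypothetical entry: the headword is stated once, part of speech once per group.
entry = Sense({"headword": "bank"}, [
    Sense({"pos": "noun"}, [
        Sense({"domain": "finance", "gloss": "institution that handles money"}),
        Sense({"gloss": "sloping land beside a river"}),
    ]),
    Sense({"pos": "verb"}, [
        Sense({"gloss": "to deposit money"}),
    ]),
])

for number, features in flatten(entry):
    print(number, features)
```

The flattened rows are roughly what a single relational table would have to store, with the headword and part of speech repeated on every row. The nested form states each shared feature once.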
____________________________________________________________
Putting FrameNet data into the ISO linguistic annotation framework
Source: Annual Meeting of the ACL
Proceedings of the ACL 2003 workshop on Linguistic annotation: getting the model right - Volume 19
Pages: 22-29
Year of Publication: 2003
Authors:
Srinivas Narayanan, International Computer Science Institute, Berkeley, California
Miriam R. L. Petruck, International Computer Science Institute, Berkeley, California
Collin F. Baker, International Computer Science Institute, Berkeley, California
Charles J. Fillmore, International Computer Science Institute, Berkeley, California
Publisher: Association for Computational Linguistics, Morristown, NJ, USA
"This paper describes FrameNet (Lowe et al., 1997; Baker et al., 1998; Fillmore et al., 2002), an online lexical resource for English based on the principles of frame semantics (Fillmore, 1977a; Fillmore, 1982; Fillmore and Atkins, 1992), and considers the FrameNet database in reference to the proposed ISO model for linguistic annotation of language resources (ISO TC37 SC4 )(ISO, 2002; Ide and Romary, 2001b). We provide a data category specification for frame semantics and FrameNet annotations in an RDF-based language. More specifically, we provide a DAML+OIL markup for lexical units, defined as a relation between a lemma and a semantic frame, and frame-to-frame relations, namely Inheritance and Subframes. The paper includes simple examples of FrameNet annotated sentences in an XML/RDF format that references the project-specific data category specification."
Frame Semantics and the FrameNet Project
FrameNet’s goal is to provide, for a significant portion of the vocabulary of contemporary English, a body of semantically and syntactically annotated sentences from which reliable information can be reported on the valences or combinatorial possibilities of each item included.
A semantic frame is a script-like structure of inferences, which are linked to the meanings of linguistic units (lexical items). Each frame identifies a set of...
Verdict: Might be very applicable and similar to one aspect of our project. This application uses FrameNet just like we want to use WordNet. We should look at FrameNet and this application more closely.
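To get a feel for the data model in the abstract (a lexical unit as a relation between a lemma and a frame, plus frame-to-frame Inheritance), here is a small sketch. The paper itself uses DAML+OIL and XML/RDF, not Python. The class names, the ancestors helper, and the sample frames are assumptions made up for illustration and have not been checked against the FrameNet database.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    name: str
    inherits_from: list = field(default_factory=list)  # parent Frame objects
    subframes: list = field(default_factory=list)      # component Frame objects


@dataclass(frozen=True)
class LexicalUnit:
    """A pairing of a lemma with the frame it evokes."""
    lemma: str
    pos: str
    frame_name: str


def ancestors(frame):
    """Names of all frames reachable through Inheritance, nearest first."""
    found = []
    queue = list(frame.inherits_from)
    while queue:
        parent = queue.pop(0)
        if parent.name not in found:
            found.append(parent.name)
            queue.extend(parent.inherits_from)
    return found


# Illustrative frames only; the names are placeholders.
transfer = Frame("Transfer")
getting = Frame("Getting", inherits_from=[transfer])
buying = Frame("Buying", inherits_from=[getting])
buy = LexicalUnit(lemma="buy", pos="v", frame_name=buying.name)

print(buy, ancestors(buying))
```

Something similar could sit on top of WordNet synsets if we decide to combine the two resources.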
____________________________________________________________
Linguistic Databases
Reviewed by
Jorg Tiedemann
Uppsala University
1995
Verdict: This is a book review. Some of the issues presented are relevant to our project. It also mentions one project in which WordNet was incorporated. It is 13 years old though, so it might not be worth looking into any further.
"Linguistic Databases is an edited collection of papers on the use of databases in linguistics. It comprises a selection of 12 contributions to the conference with the same title, which was held at the University of Groningen on 23-24 March 1995. The need for data management tools in linguistics is evident. Although collections of linguistic data grew rapidly in the past, the development of suitable database structures and management systems is still in an early stage. The articles presented in the book introduce a variety of approaches to several kinds of applications in different fields of linguistics. "
____________________________________________________________
Multilingual Lexical Database Generation from parallel texts in 20 European languages with endogenous resources
Giguet Emmanuel and Muquet Pierre-Sylvain, 2006
Verdict: This paper deals with generating a database from a corpus. It is not very related to what we are doing, and it talks a lot about natural language processing.
"This paper deals with multilingual database generation from parallel corpora. The idea is to contribute to the enrichment of lexical databases for languages with few linguistic resources. Our approach is endogenous: it relies on the raw texts only, it does not require external linguistic resources such as stemmers or taggers. The system produces alignments for the 20 European languages of the 'Acquis Communautaire' Corpus."
____________________________________________________________
Data Oriented Methods for Grapheme-to-Phoneme Conversion
Antal van den Bosch and Walter Daelemans, 1992
This article talks about how to make programs that read out loud (converting graphemes to phonemes).
The point is this: conventional opinion says that you need to give the application knowledge of linguistic rules and structure in order for it to read out loud correctly. This paper argues that the linguistic knowledge is not necessary. Instead, it calls for the application to be loaded with several corpora that have been transcribed into phonemes. Then, when the application is reading out loud, it can look for similar patterns in the corpora and predict how graphemes should be pronounced in different contexts.
Verdict: Not similar or applicable to our work
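For reference, the "look for similar patterns" idea can be sketched as a lookup over letter windows: store every letter of a transcribed corpus together with its neighbours, then pronounce a new letter by copying the phoneme of the most similar stored window. The corpus, the phoneme symbols, the window width, and the overlap measure below are all invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Invented, letter-aligned toy corpus: one phoneme symbol per letter, "-" = silent.
CORPUS = [
    ("cat", ["k", "a", "t"]),
    ("city", ["s", "i", "t", "i"]),
    ("cot", ["k", "o", "t"]),
    ("cell", ["s", "e", "l", "-"]),
]


def windows(word, width=1):
    """One window per letter: the letter plus `width` neighbours on each side."""
    padded = "_" * width + word + "_" * width
    return [padded[i:i + 2 * width + 1] for i in range(len(word))]


# The "memory" is just every (window, phoneme) pair seen in the corpus.
MEMORY = [(w, p) for word, phones in CORPUS for w, p in zip(windows(word), phones)]


def overlap(a, b):
    """Number of positions at which two windows agree."""
    return sum(x == y for x, y in zip(a, b))


def pronounce(word):
    """Predict one phoneme per letter from the most similar stored windows."""
    result = []
    for window in windows(word):
        centre = len(window) // 2
        # Only compare against windows with the same focus letter.
        candidates = [(overlap(window, w), p) for w, p in MEMORY
                      if w[centre] == window[centre]]
        if not candidates:
            result.append("?")  # letter never seen in the corpus
            continue
        top = max(score for score, _ in candidates)
        votes = Counter(p for score, p in candidates if score == top)
        result.append(votes.most_common(1)[0][0])
    return result


print(pronounce("cet"))
```

The made-up word "cet" comes out with a soft "c" because the closest stored window is the start of "cell". No rule about "c before e" was ever written down.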
____________________________________________________________
A Language Oriented Data Modeling Approach (LODM)
Geoffrey Steinberg and Jicheng Lin, 1996
Verdict: Not similar or applicable to our work
This is really cool. It can take natural language and interpret it into a DB schema. A line of text like "I have a CD collection. CD's have titles, artists, songs and genres. Artists are people. People have birthdays" would be interpreted and made into a collection of tables and fields.
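A toy sketch of the idea, to show what the output might look like for the example text. This is only a pattern matcher for "X have A, B and C" and "X are Y" sentences. It is an assumption about the general approach, not the LODM algorithm, and the function name sketch_schema is hypothetical.

```python
import re


def sketch_schema(text):
    """Turn simple 'X have ...' / 'X are Y' sentences into tables and is-a links."""
    tables = {}  # table name -> list of field names
    is_a = {}    # subtype -> supertype
    for sentence in re.split(r"\.\s*", text):
        sentence = sentence.strip()
        match = re.fullmatch(r"(\w+)(?:'s)? have (.+)", sentence)
        if match and match.group(1) != "I":
            # Split the field list on commas and on the word "and".
            fields = re.split(r",\s*|\s+and\s+", match.group(2))
            tables.setdefault(match.group(1).lower(), []).extend(
                f.lower() for f in fields
            )
            continue
        match = re.fullmatch(r"(\w+) are (\w+)", sentence)
        if match:
            is_a[match.group(1).lower()] = match.group(2).lower()
    return tables, is_a


text = ("I have a CD collection. CD's have titles, artists, songs and genres. "
        "Artists are people. People have birthdays")
tables, is_a = sketch_schema(text)
print(tables)
print(is_a)
```

First-person sentences such as "I have a CD collection" are simply skipped here. A real system would need actual parsing to cope with anything beyond these two sentence shapes.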
____________________________________________________________
The Annotation of Temporal Information in Natural Language Sentences
Graham Katz and Fabrizio Arosio
Verdict: not very applicable
This is really complex stuff about storing data about tenses. You have to "tag" the verbs correctly so that the natural language processor can handle them correctly.
____________________________________________________________
Review of "Linguistic exploitation of syntactic databases: the use of the Nijmegen linguistic database program"
by Hans van Halteren, Theo van den Heuvel. Editions Rodopi 1990.
Source: Computational Linguistics
Volume 17, Issue 4 (December 1991)
REVIEWS: Book reviews
Pages: 457-461
Year of Publication: 1991
ISSN: 0891-2017
Publisher: MIT Press, Cambridge, MA, USA
____________________________________________________________
Integration of Corpus Linguistics and Object-Oriented Database Technology for Fine-Grained Analysis of Electronic Documents
Source: ACM SIGOIS Bulletin
Volume 15, Issue 1 (August 1994)
Special issue: the 1993 workshop on digital libraries
Pages: 6-7
Year of Publication: 1994
ISSN: 0894-0819
Author: Robert P. Futrelle, Northeastern University
Publisher: ACM, New York, NY, USA
Verdict: Not similar or applicable to our work
"The availability of full text and graphics in future digital libraries will bring with it demands for more fine-grained and knowledge-intensive analysis of document content. This must, in turn, be founded on Corpus Linguistics methods, which typically have been implemented using the byte stream / pipes technology of UNIX..."
____________________________________________________________
The following articles talk about collaborative knowledge building and how people think, learn, contribute to, and learn from collaborative environments like wikis. They are definitely worth a read, and I think they might influence our interface and display choices. There are more articles on this topic in the journal these came from, but I thought these were the best. I was unable to find any articles specifically about linguistics or building a dictionary.
A systemic and cognitive view on collaborative knowledge building with wikis
Journal International Journal of Computer-Supported Collaborative Learning
Publisher Springer New York
ISSN 1556-1607 (Print) 1556-1615 (Online)
Issue Volume 3, Number 2 / June, 2008
DOI 10.1007/s11412-007-9035-z
Pages 105-122
Subject Collection Humanities, Social Sciences and Law
SpringerLink Date Friday, January 11, 2008
____________________________________________________________
Collaborative knowledge building using the Design Principles Database
Journal International Journal of Computer-Supported Collaborative Learning
Publisher Springer New York
ISSN 1556-1607 (Print) 1556-1615 (Online)
Issue Volume 1, Number 2 / June, 2006
DOI 10.1007/s11412-006-8993-x
Pages 187-201
Subject Collection Humanities, Social Sciences and Law
SpringerLink Date Thursday, June 22, 2006

