Johns Hopkins University Human Language Technology

Publications  

 


Featured Publication!

Congratulations to Dr. Mary Harper!  Her paper, "Multimodal Floor Control Shift Detection," written with Dr. Lei Chen, won the ICMI-MLMI 2009 Outstanding Paper Award!

 


 

Robust Speech Technology

Multimodal Floor Control Shift Detection, Lei Chen, Mary Harper

Floor control is a scheme used by people to organize speaking turns in multi-party conversations. Identifying floor control shifts is important for understanding a conversation’s structure and would be helpful for more natural human-computer interaction systems. Although people tend to use both verbal and nonverbal cues for managing floor control shifts, only audio cues, e.g., lexical and prosodic cues, have been used in most previous investigations on speaking turn prediction. In this paper, we present a statistical model to automatically detect floor control shifts using both verbal and nonverbal cues. Our experimental results show that using a combination of verbal and nonverbal cues provides more accurate detection.

   

Combining LVCSR and Vocabulary-Independent Ranked Utterance Retrieval for Robust Speech Search, Scott Olsson, Douglas Oard

Well-tuned Large-Vocabulary Continuous Speech Recognition (LVCSR) has been shown to generally be more effective than vocabulary-independent techniques for ranked retrieval of spoken content when one or the other approach is used alone. Tuning LVCSR systems to a topic domain can be costly, however, and the experiments in this paper show that Out-Of-Vocabulary (OOV) query terms can significantly reduce retrieval effectiveness when that tuning is not performed. Further experiments demonstrate, however, that retrieval effectiveness for queries with OOV terms can be substantially improved by combining evidence from LVCSR with additional evidence from vocabulary-independent Ranked Utterance Retrieval (RUR). The combination is performed by using relevance judgments from held-out topics to learn generic (i.e., topic-independent), smooth, non-decreasing transformations from LVCSR and RUR system scores to probabilities of topical relevance. Evaluated using a CLEF collection that includes topics, spontaneous conversational speech audio, and relevance judgments, the system recovers 57% of the mean uninterpolated average precision that could have been obtained through LVCSR domain tuning for very short queries (or 41% for longer queries).

Transducing Logical Relations from Automatic and Manual Annotation, Adam Meyers, Michiko Kosaka, Heng Ji, Nianwen Xue, Mary Harper, Ang Sun, Wei Xu and Shasha Liao (submitted to the Third Linguistic Annotation Workshop, 2009)

GLARF relations are generated from treebank and parses for English, Chinese and Japanese. Our evaluation of system output for these input types requires consideration of multiple correct answers.

A Joint Language Model With Fine-grain Syntactic Tags, Denis Filimonov, Mary Harper [2009]

We present a scalable joint language model designed to utilize fine-grain syntactic tags. We discuss challenges such a design faces and describe our solutions that scale well to large tagsets and corpora.  We advocate the use of relatively simple tags that do not require deep linguistic knowledge of the language but provide more structural information than POS tags and can be derived from automatically generated parse trees – a combination of properties that allows easy adoption of this model for new languages. We propose two fine-grain tagsets and evaluate our model using these tags, as well as POS tags and SuperARV tags in a speech recognition task and discuss future directions.

Unsupervised Acoustic and Language Model Training with Small Amounts of Labelled Data,  Scott Novotney, Richard Schwartz and Jeff Ma

We measure the effects of a weak language model, estimated from as little as 100k words of text, on unsupervised acoustic model training and then explore the best method of using word confidences to estimate n-gram counts for unsupervised language model training. Even with 100k words of text and 10 hours of training data, unsupervised acoustic modeling is robust, with 50% of the gain recovered when compared to supervised training. For language model training, multiplying the word confidences together to get a weighted count produces the best reduction in WER by 2% over the baseline language model and 0.5% absolute over using unweighted transcripts. Oracle experiments show that a larger gain is possible, but better confidence estimation techniques are needed to identify correct n-grams.
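As a rough illustration of the weighted-count idea in the abstract above, the following minimal sketch accumulates fractional n-gram counts in which each occurrence contributes the product of its words' confidences. It is not the authors' implementation: the function name `weighted_ngram_counts`, the `(word, confidence)` input format, and the sample hypotheses are all assumptions made for this example.

```python
from collections import defaultdict
from math import prod


def weighted_ngram_counts(utterances, n=2):
    """Accumulate fractional n-gram counts from (word, confidence) sequences.

    Each n-gram occurrence contributes the product of its words'
    confidences instead of a full count of 1.
    """
    counts = defaultdict(float)
    for utterance in utterances:
        for i in range(len(utterance) - n + 1):
            window = utterance[i:i + n]
            ngram = tuple(word for word, _ in window)
            counts[ngram] += prod(conf for _, conf in window)
    return dict(counts)


# Hypothetical recognizer output: (word, confidence) pairs per utterance.
hyps = [
    [("the", 0.9), ("cat", 0.5), ("sat", 0.8)],
    [("the", 1.0), ("cat", 1.0)],
]
counts = weighted_ngram_counts(hyps, n=2)
```

Using unweighted transcripts corresponds to setting every confidence to 1.0, so the same routine covers both conditions compared in the abstract.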

Semi-Automatic Learning of Acoustic Models:  A White Paper on Proposed Research Projects, James Baker [2009]

This paper is an intuitive introduction to some newly proposed techniques for semi-automatic learning.  Some of these proposed techniques are specifically aimed at greater automation of the process of developing acoustic models for speech recognition in a new language.  Such automation is needed in particular to support development of speech recognition in a large number of languages, such as the All the Languages of the World (ATLOTW) Grand Challenge, whose goal is to develop speech recognition, machine-aided translation and other speech and language technologies in all of the world’s languages.  The emphasis in this white paper is on techniques relying primarily on learning from unlabeled data, requiring only small amounts of labeled data.  Developing techniques for such data-rich, label-poor situations has been a primary focus for the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University.

Unsupervised Pronunciation Validation, Christopher M. White, Abhinav Sethy, Bhuvana Ramabhadran, Patrick Wolfe, Erica Cooper, Murat Saraclar, James K. Baker

Abstract: This paper addresses selecting between candidate pronunciations for out-of-vocabulary words in speech processing tasks. We introduce a simple, unsupervised method that outperforms the conventional supervised method of forced alignment with a reference. The success of this method is independently demonstrated using three metrics from large-scale speech tasks: word error rates for large vocabulary continuous speech recognition, decision error tradeoff curves for spoken term detection, and phone error rates compared to a handcrafted pronunciation lexicon. The experiments were conducted using state-of-the-art recognition, indexing, and retrieval systems. The results were compared across many terms, hundreds of hours of speech, and well-known data sets.

Anchored Speech Recognition for Question Answering, Sibel Yaman, Gokhan Tur, Dimitra Vergyri, Dilek Hakkani-Tur, Mary Harper and Wen Wang

Abstract:  In this paper, we propose a novel question answering system that searches for responses from spoken documents such as broadcast news stories and conversations. We propose a novel two-step approach, which we refer to as anchored speech recognition, to improve the speech recognition of the sentence that supports the answer. In the first step, the sentence that is highly likely to contain the answer is retrieved among the spoken data that has been transcribed using a generic automatic speech recognition (ASR) system. This candidate sentence is then re-recognized in the second step by constraining the ASR search space using the lexical information in the question.  Our analysis showed that ASR errors caused a 35% degradation in the performance of the question answering system. Experiments with the proposed anchored recognition approach indicated a significant improvement in the performance of the question answering module, recovering 30% of the answers erroneous due to ASR.

Sequential System Combination for Machine Translation of Speech, Damianos Karakos and Sanjeev Khudanpur

Abstract: System combination is a technique which has been shown to yield significant gains in speech recognition and machine translation. Most combination schemes perform an alignment between different system outputs in order to produce lattices (or confusion networks), from which a composite hypothesis is chosen, possibly with the help of a large language model.  The benefit of this approach is two-fold: (i) whenever many systems agree with each other on a set of words, the combination output contains these words with high confidence; and (ii) whenever the systems disagree, the language model resolves the ambiguity based on the (probably correct) agreed-upon context. The case of machine translation system combination is more challenging because of the different word orders of the translations: the alignment has to incorporate computationally expensive movements of word blocks. In this paper, we show how one can combine translation outputs efficiently, extending the incremental alignment procedure of [1]. A comparison between different system combination design choices is performed on an Arabic speech translation task.

Reconciliation of Human and Machine Speech Recognition Performance, Misha Pavel, Malcolm Slaney and Hynek Hermansky

Abstract: This paper focuses on resolving a number of issues that appear when the performance of human speech recognition is compared to that of automatic speech recognition. In particular, human experimental data suggest that the resulting error is a product of the individual streams. On the other hand, Bayesian combination requires a multiplication of the estimates of prior probabilities and likelihoods. We show that, in principle, there is no discrepancy. The product of errors is a performance measure, and human and machine performance may be consistent with this empirically established regularity. The product of probabilities is a step in an algorithm to achieve the performance that may or may not be consistent with the product of errors. The main problem is that most prior discussions failed to distinguish the performance measures from the estimates of the parameters used in the algorithm.

Chinese Statistical Parsing, Mary P. Harper,  Zhongqiang Huang 

Abstract: This chapter describes several issues that are fundamental to achieving accurate Chinese parsing given available Chinese resources and the challenges of the GALE processing pipeline. For GALE, our parsing algorithm is expected to accurately parse a variety of materials, ranging from newswire text, which tends to be grammatically well formed, to n-best ASR outputs, many of which are poorly formed sentences. To address this challenge, we have re-implemented and enhanced the Berkeley parser to handle unknown Chinese words efficiently, parse difficult sentences robustly, and operate more efficiently. We also address issues related to training the parser for several different genres given a limited number of available training trees, the importance of matching word segmentation to the treebank segmentation standard to support accurate parsing, and the need for standardized tokenization for managing the types of things that will appear as input to the parser.  Understanding and handling these issues is a prerequisite for achieving adequate parsing performance levels. An important tool for enhancing parsing performance given the limited number of trees in the Chinese treebanks is self-training with automatically labeled in-domain data.

Effect of Pronunciations on OOV Queries in Spoken Term Detection, Dogan Can, Erica Cooper, Abhinav Sethy, Bhuvana Ramabhadran, Murat Saraclar, Christopher M. White

This paper focuses on the effect of pronunciations for Out-of-Vocabulary (OOV) query terms on the performance of a spoken term detection (STD) task. OOV terms, typically proper names or foreign language terms, occur infrequently but are rich in information. The STD task returns relevant segments of speech that contain one or more of these OOV query terms. The STD system described in this paper indexes word-level and subword-level lattices produced by an LVCSR system using Weighted Finite State Transducers (WFST).  Experiments comparing pronunciations using n-best variations from letter-to-sound rules, morphing pronunciations using phone confusions for the OOV terms and indexing one-best transcripts, lattices and confusion networks are presented. The following observations are worth mentioning: phone indexes generated from subwords represented OOVs well, and too many variants for the OOV terms degrade performance if pronunciations are not weighted.

Stream Characterization of Data

Random Attributed Graphs for Statistical Inference from Content and Context, Allen Gorin, Carey Priebe, John Grothendieck

Coping with Information Overload is a major challenge of the 21st century. Huge volumes and varieties of multilingual data must be processed to extract salient information. Previous research has addressed automatic characterization of streaming content. However, information includes both content and associated meta-data, which humans deal with as a gestalt but computer systems often treat separately.  Random attributed graphs provide an effective means to characterize and draw inferences from large volumes of language content plus associated meta-data. This paper describes these methods and their utility, with experimental proof-of-concept on the Switchboard and Enron corpora.

Fusion and Inference from Multiple Data Sources, Carey Priebe, Zhiliang Ma, David J. Marchette, Elizabeth Hohman, Glen Coppersmith

Given $K$ matched feature vectors $x_{i,1}, \ldots, x_{i,K}$ for each of $n$ objects, with $x_{i,k}$ belonging to the feature space of condition $k$, and given additional feature vectors $\{y_k\}_{k=1}^{K}$, we consider testing $H_0$: $\{y_k\}_{k=1}^{K}$ are matched feature vectors representing a single object measured under $K$ conditions versus $H_A$: they do not represent a single object. We develop an approach to this problem which uses only the interpoint dissimilarities for each condition separately. We impute the dissimilarities between matched measurements of different conditions to obtain one omnibus dissimilarity matrix, which is then embedded into Euclidean space. Out-of-sample embedding is used to embed the new measurements $\{y_k\}_{k=1}^{K}$ into this same space, and we determine whether a match is present by examining the distance between the corresponding embeddings.  We illustrate our methodology on English and French documents collected from Wikipedia, demonstrating superior performance compared to that obtained via standard Procrustes analysis. [Submitted to ISI 2009]

Statistical Inference on Attributed Random Graphs: Fusion of Graph Features and Content: An Experiment on Time Series of Enron Graphs, C.E. Priebe, Y. Park, D.J. Marchette, J.M. Conroy, J. Grothendieck

Abstract: Fusion of information from graph features and content can provide superior inference for an anomaly detection task, compared to the corresponding content-only or graph feature-only statistics. In this paper, we design and execute an experiment on a time series of attributed graphs extracted from the Enron email corpus which demonstrates the benefit of fusion. The experiment is based on injecting a controlled anomaly into the real data and measuring its detectability.

Statistical Inference on Random Graphs: Comparative Power Analyses via Monte Carlo, Henry Pao, Glen A. Coppersmith, and Carey E. Priebe

Abstract:  We present a comparative power analysis, via Monte Carlo, of various graph invariants used as statistics for testing graph homogeneity versus a “chatter” alternative – the existence of a local region of excessive activity. Our results indicate that statistical inference on random graphs, even in a relatively simple setting, can be decidedly non-trivial. We find that none of the graph invariants considered is uniformly most powerful throughout our space of alternatives.
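To make the idea of a Monte Carlo power analysis concrete, here is a minimal sketch that estimates the power of one simple graph invariant, the total number of edges, for detecting a dense "chatter" group. It is an illustration under stated assumptions rather than the experimental setup of the paper: the independent-edge graph model, the choice of edge count as the statistic, and the names `edge_count` and `estimate_power` with their parameters are all hypothetical.

```python
import random


def edge_count(n, p, m=0, q=0.0, rng=random):
    """Sample an undirected random graph and return its number of edges.

    Edges appear independently with probability p, except that pairs inside
    the first m vertices (the "chatter" group) use probability q.
    """
    edges = 0
    for i in range(n):
        for j in range(i + 1, n):
            prob = q if j < m else p  # i < j, so j < m means both are in the group
            if rng.random() < prob:
                edges += 1
    return edges


def estimate_power(n, p, m, q, alpha=0.05, trials=1000, seed=0):
    """Estimate the power of the edge-count test by simulation."""
    rng = random.Random(seed)
    # Simulate the homogeneous null to obtain an empirical critical value.
    null = sorted(edge_count(n, p, rng=rng) for _ in range(trials))
    critical = null[min(trials - 1, int((1 - alpha) * trials))]
    # Simulate the chatter alternative and count how often the test rejects.
    alt = [edge_count(n, p, m, q, rng=rng) for _ in range(trials)]
    return sum(stat > critical for stat in alt) / trials


strong = estimate_power(n=20, p=0.1, m=10, q=1.0, trials=200)
none = estimate_power(n=20, p=0.1, m=0, q=0.0, trials=200)
```

Swapping `edge_count` for another invariant (for example, maximum degree) and repeating the simulation over a grid of `m` and `q` values is one way to compare invariants across a space of alternatives.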

Natural Language Processing

Multi-Class Confidence Weighted Algorithms, Koby Crammer, Mark Dredze, Alex Kulesza

The recently introduced online confidence-weighted (CW) learning algorithm for binary classification performs well on many binary NLP tasks.  However, for multi-class problems CW learning updates and inference cannot be computed analytically or solved as convex optimization problems as they are in the binary case. We derive learning algorithms for the multi-class CW setting and provide extensive evaluation using nine NLP datasets, including three derived from the recently released New York Times corpus. Our best algorithm outperforms state-of-the-art online and batch methods on eight of the nine tasks. We also show that the confidence information maintained during learning yields useful probabilistic information at test time.

Addressing Morphological Variation in Alphabetic Languages, Paul McNamee, Charles Nicholas, James Mayfield

The selection of indexing terms for representing documents is a key decision that limits how effective subsequent retrieval can be. Often stemming algorithms are used to normalize surface forms, and thereby address the problem of not finding documents that contain words related to query terms through inflectional or derivational morphology. However, rule-based stemmers are not available for every language and it is unclear which methods for coping with morphology are most effective. In this paper we investigate an assortment of techniques for representing text and compare these approaches using data sets in eighteen languages and five different writing systems.  We find character n-gram tokenization to be highly effective.  In half of the languages examined n-grams outperform unnormalized words by more than 25%; in highly inflective languages relative improvements over 50% are obtained. In languages with less morphological richness the choice of tokenization is not as critical and rule-based stemming can be an attractive option, if available. We also conducted an experiment to uncover the source of n-gram power and a causal relationship between the morphological complexity of a language and n-gram effectiveness was demonstrated.
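For readers unfamiliar with character n-gram tokenization, the short sketch below shows one simple variant. It is only an illustration: the function name `char_ngrams`, the lowercasing, and the choice to form n-grams within each whitespace-separated word (keeping shorter words whole) are assumptions of this example, not details taken from the paper.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams for each whitespace-separated word.

    Words no longer than n are kept whole so that they are not lost.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


# Morphological variants share several n-grams even without a stemmer.
shared = set(char_ngrams("juggling")) & set(char_ngrams("jugglers"))
print(sorted(shared))
```

Because the two surface forms overlap on their leading n-grams, a query containing one form can still match documents containing the other, which is the effect a stemmer would otherwise provide.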

Ter-Plus: Paraphrase, Semantic, and Alignment Enhancements to Translation Edit Rate, Matthew Snover, Nitin Madnani, Bonnie Dorr, Richard Schwartz

This paper describes a new evaluation metric, Ter-Plus (Terp) for automatic evaluation of machine translation. Terp is an extension of Translation Edit Rate (Ter). It builds on the success of Ter as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, morphological stemming, synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing Terp to Bleu, Meteor and Ter, and illustrate that Terp can better evaluate translation adequacy.

Translation Corpus Source and Size in Bilingual Retrieval, Paul McNamee, James Mayfield, Charles Nicholas

This paper explores corpus-based bilingual retrieval where the translation corpora used vary by source and size. We find that the quality of translation alignments and the domain of the bitext are important. In some settings these factors are more critical than corpus size. We also show that judicious choice of tokenization can reduce the amount of bitext required to obtain good bilingual retrieval performance.

Cross-Document Coreference Resolution: A Key Technology for Learning by Reading, James Mayfield, David Alexander, Bonnie Dorr, Jason Eisner, Tamer Elsayed, Tim Finin, Clay Fink, Marjorie Freedman, Nikesh Garera, Paul McNamee, Saif Mohammad, Douglas Oard, Christine Piatko, Asad Sayeed, Zareen Syed, Ralph Weischedel, Tan Xu and David Yarowsky

Abstract:  Automatic knowledge base population from text is an important technology for a broad range of approaches to learning by reading. Effective automated knowledge base population depends critically upon coreference resolution of entities across sources. Use of a wide range of features, both those that capture evidence for entity merging and those that argue against merging, can significantly improve machine learning-based cross-document coreference resolution.  Results from the Global Entity Detection and Recognition task of the NIST Automated Content Extraction (ACE) 2008 evaluation support this conclusion.

Using Wikitology for Cross-Document Entity Coreference Resolution, Tim Finin, Zareen Syed, James Mayfield, Paul McNamee, and Christine Piatko

Abstract: We describe the use of the Wikitology knowledge base as a resource for a variety of applications with special focus on a cross-document entity coreference resolution task. This task involves recognizing when entities and relations mentioned in different documents refer to the same object or relation in the world. Wikitology is a knowledge base system constructed with material from Wikipedia, DBpedia and Freebase that includes both unstructured text and semi-structured information.  Wikitology was used to define features that were part of a system implemented by the Johns Hopkins University Human Language Technology Center of Excellence for the 2008 Automatic Content Extraction cross-document coreference resolution evaluation organized by the National Institute of Standards and Technology.

Latent-Variable Modeling of String Transductions with Finite-State Methods, Markus Dreyer and Jason R. Smith and Jason Eisner

Abstract: String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional log-linear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method, reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38–92%.

Structural, Transitive and Latent Models for Biographic Fact Extraction, Nikesh Garera and David Yarowsky

Abstract: This paper presents six novel approaches to biographic fact extraction that model structural, transitive and latent properties of biographical data. The ensemble of these proposed models substantially outperforms standard pattern-based biographic fact extraction methods and performance is further improved by modeling inter-attribute correlations and distributions over functions of attributes,  achieving an average extraction accuracy of 80% over seven types of biographic attributes.

Knowledge Base Evaluation for Semantic Knowledge Discovery, James Mayfield, Bonnie J. Dorr, Tim Finin, Douglas W. Oard and Christine D. Piatko

Machine Learning with Annotator Rationales to Reduce Annotation Cost, Omar F. Zaidan, Jason Eisner, Christine D. Piatko

Abstract:  We review two novel methods for text categorization, based on a new framework that utilizes richer annotations that we call annotator rationales. A human annotator provides hints to a machine learner by highlighting contextual “rationales” in support of each of his or her annotations. We have collected such rationales, in the form of substrings, for an existing document sentiment classification dataset [1].  We have developed two methods, one discriminative [2] and one generative [3], that use these rationales during training to obtain significant accuracy improvements over two strong baselines. Our generative model in particular could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks. Based on a small study of annotation speed, we posit that for some tasks, providing rationales can be a more fruitful use of an annotator’s time than annotating more examples.

Dyna: A Non-Probabilistic Programming Language for Probabilistic AI, Jason Eisner

HLT/COE Research in Cross-document Coreference Resolution, James Mayfield, Paul McNamee, Christine Piatko, Clayton Fink and Tim Finin

N-gram Tokenization for Indian Language Text Retrieval, Paul McNamee

Abstract: Character n-gram tokenization is a language-neutral technique that addresses the problems created by morphological processes that lower IR performance, such as inflection, derivation, and compounding. N-grams have been widely adopted for use in Asian languages, especially languages such as Chinese and Japanese where words are not separated by spaces. Use of n-grams in alphabetic languages is less popular; however, they have been shown to be an effective technique in many European languages using data sets developed at CLEF.  This paper describes monolingual experiments using n-grams as the primary method of tokenization in several Indian languages. Tests are conducted in Bengali, Hindi, and Marathi using benchmarks created in 2008 for the FIRE workshop.

Computing Word-Pair Antonymy, Saif Mohammad, Bonnie Dorr, Graeme Hirst

Abstract: Knowing the degree of antonymy between words has widespread applications in natural language processing. Manually-created lexicons have limited coverage and do not include most semantically contrasting word pairs. We present a new automatic and empirical measure of antonymy that combines corpus statistics with the structure of a published thesaurus.  The approach is evaluated on a set of closest-opposite questions, obtaining a precision of over 80%. Along the way, we discuss what humans consider antonymous and how antonymy manifests itself in utterances.

Dependency Parsing by Belief Propagation, David A. Smith and Jason Eisner

Abstract: We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for approximate learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only $O(n^3)$ time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.

Multiple Alternative Sentence Compressions and Word-Pair Antonymy for Automatic Text Summarization and Recognizing Textual Entailment, Saif Mohammad, Bonnie Dorr, Melissa Egan, Nitin Madnani, David Zajic, & Jimmy Lin

Abstract:  The University of Maryland participated in three tasks organized by the Text Analysis Conference 2008 (TAC 2008): (1) the update task of text summarization; (2) the opinion task of text summarization; and (3) recognizing textual entailment (RTE). At the heart of our summarization system is Trimmer, which generates multiple alternative compressed versions of the source sentences that act as candidate sentences for inclusion in the summary. For the first time, we investigated the use of automatically generated antonym pairs for both text summarization and recognizing textual entailment.  We used an antonymy feature in both the opinion summarization task and for recognizing textual entailment. More coherent summaries resulted when using the antonymy feature as compared to when not using it. However, performance on ROUGE dropped. The RTE system performed almost equally well when using antonyms from WordNet and when using automatically generated antonyms.

Interlingual Annotation of Parallel Text Corpora: A New Framework for Annotation and Evaluation, Bonnie J. Dorr, David Farwell, Rebecca Green, Nizar Habash, Stephen Helmreich, Eduard Hovy, Lori Levin, Keith J. Miller, Teruko Mitamura, Rebecca J. Passonneau, Owen Rambow, Florence Reeder, Advaith Siddharthan (Received 30 April 2004; revised January 2008)

Abstract: This paper focuses on the next step in the creation of a system of meaning representation and the development of semantically-annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to annotate multiple translations of foreign-language texts with interlingual content. The resulting annotated, multilingually-induced, parallel corpora will be useful as an empirical basis for a wide range of research, including the development and evaluation of interlingual NLP systems and paraphrase-extraction systems as well as a host of other research and development efforts in theoretical and applied linguistics, foreign language pedagogy, translation studies, and other related disciplines.

Modeling Latent Speaker Attributes in Conversational Transcripts, Nikesh Garera and David Yarowsky

Abstract:  This paper presents and evaluates several original techniques for the latent classification of speaker gender, age and native language in conversational speech transcripts. It explores a rich variety of novel sociolinguistic and discourse-based features, including mean utterance length, passive/active usage, percentage domination of the conversation, speaking rate and filler word usage. It also shows performance gains from the novel modeling of speaker attributes sensitive to partner speaker attributes, given the differences in lexical usage and discourse style such as observed between same-gender and mixed-gender conversations.  Cumulatively up to 20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005) algorithm, and accuracy for gender detection on the Switchboard Corpus approaches 97% when multiple conversations per speaker are collectively classified.

From Linguistic Annotations to Knowledge Objects, Bonnie J. Dorr, Saif Mohammad

JHU Ad Hoc Experiments at CLEF 2008, Paul McNamee

For CLEF 2008 JHU conducted monolingual and bilingual experiments in the ad hoc TEL and Persian tasks.  The TEL task focused on searching electronic card catalog records in English, French, and German using data from the British Library, the Bibliothèque nationale de France, and the Österreichische Nationalbibliothek (Austrian National Library). The approach we adopted for TEL was to strip out non-content sections of records and to treat the task as ordinary full-text search using character n-grams and stemmed words.  For the Persian task, which is based on the Hamshahri corpus, several different forms of textual normalization were compared. Using the provided training topics we compared character n-grams, n-gram stems, ordinary words, words automatically segmented into morphemes, and a novel form of n-gram indexing based on n-grams with character skips. On the training topics we found that character 5-grams and skipgrams performed the best and this was borne out in our official submissions.  We also did some post hoc experiments using previous CLEF ad hoc test sets in 13 languages.

In all three tasks we explored alternative methods of tokenizing documents including plain words, stemmed words, automatically induced segments, a single selected n-gram for each word, and all n-grams from words (i.e., traditional character n-grams).  Character n-grams demonstrated consistent gains over ordinary words in each of these three diverse sets of experiments. Using mean average precision, relative gains of 50-200% on the TEL task, 5% on the Persian task, and 18% averaged over 13 languages from past CLEF evaluations were observed.

Retrieval Experiments at Morpho Challenge 2008, Paul McNamee

Morpho Challenge 2008 hosted an extrinsic evaluation of morphological analysis that explored whether unsupervised morphology induction could benefit information retrieval. This paper presents results in alternative methods for word normalization using test sets from the Cross-Language Evaluation Forum (CLEF) ad-hoc collections.  Preliminary results for the Morpho Challenge 2008 evaluation are consistent with these data. We found that: (1) rule-based stemming is effective in less morphologically complicated languages; (2) alternative methods for stemming such as unsupervised learning of morphemes and least common n-gram stemming are helpful; and (3) full character n-gram indexing is the most effective form of tokenization in more morphologically complex languages.

Latent-Variable Modeling of String Transductions with Finite-State Methods (2008), Markus Dreyer, Jason Smith, Jason Eisner

Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Honolulu, October 2008.  Link:  http://cs.jhu.edu/~jason/papers/#emnlp08-morph

Resolving Personal Names in Email Using Context Expansion, Tamer Elsayed, Douglas W. Oard, and Galileo Namata

This paper describes a computational approach to resolving the true referent of a named mention of a person in the body of an email. A generative model of mention generation is used to guide mention resolution. Results on three relatively small collections indicate that the accuracy of this approach compares favorably to the best known techniques, and results on the full CMU Enron collection indicate that it scales well to larger collections.

Pairwise Document Similarity in Large Collections with MapReduce, Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.
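The decomposition of inner products into a multiplication stage and a summation stage can be sketched in a few lines. The following is a single-machine simulation intended only to show the data flow, not the authors' MapReduce implementation: the function names `build_postings`, `map_products`, and `reduce_sums`, the dictionary-based term weights, and the toy documents are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Invert {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id, vector in docs.items():
        for term, weight in vector.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: emit one partial product per term and document pair."""
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(pairs):
    """Summation stage: add the partial products for each document pair."""
    sims = defaultdict(float)
    for key, value in pairs:
        sims[key] += value
    return dict(sims)


# Hypothetical term-weight vectors for three small documents.
docs = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"a": 3.0, "b": 1.0, "c": 4.0},
    "d3": {"c": 2.0},
}
sims = reduce_sums(map_products(build_postings(docs)))
```

Only document pairs that share at least one term ever produce a partial product, so pairs with no overlap (here `d1` and `d3`) never appear in the output.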


 © The Johns Hopkins University. All rights reserved.