In January 2007, Johns Hopkins University was awarded a long-term, multi-million-dollar contract to establish and operate a Human Language Technology Center of Excellence (HLTCOE) adjacent to the university’s Homewood campus. The Center’s research focuses on advanced technology for automatically analyzing a wide range of speech, text, and document image data in multiple languages. Other entities involved include the Johns Hopkins Center for Language and Speech Processing, the University of Maryland, BBN Technologies, Carnegie Mellon University, and Columbia University.

The technical program focuses on Automatic Population of Knowledge Bases from Text, Proof-of-Concept Experiments for Robust Speech Technology, and Stream Characterization for Content. These projects address key issues in extracting information from massive sources of text and speech.

Doing research at the HLTCOE is exciting because its mission is to explore highly innovative technologies that could have a significant impact on challenging real-world problems. The HLTCOE’s work is organized around these broadly stated challenge problems:

Producing Structured Knowledge from Unstructured Language Data: Many important applications will become possible when systems can automatically produce language-independent structured representations of knowledge derived from unstructured text, speech, and document image data in a wide variety of languages and genres. The derived knowledge can be aggregated into a cumulative knowledge base, and it can also serve as input to a range of analytic and inference technologies. Although humans could in principle extract the kinds of information needed (such as various classes of entities, relations, events, opinions, and scenarios), the large volumes of data, the complexity, and the required level of detail make manual extraction impractical. Reasonably accurate, fully automatic methods are therefore essential.

Robust Speech Challenge: In recognizing what a speaker is saying, a human listener uses an enormous amount of language knowledge and world knowledge that is difficult even to represent on a computer, much less to learn automatically. Automatic speech recognition has therefore been a challenging and exciting area of research for several decades. Yet even after decades of development, with considerable success in some areas, automatic systems still fall far short of human listeners in their ability to tolerate moderate changes in speech that deviate from the current model. The challenge for the HLTCOE is to develop new methodologies that not only improve overall performance on speech recognition and related tasks, but also maintain that performance across a wide range of changes in speech or language characteristics.

Data Annotation Bottleneck: Many applications, particularly those involving spoken and written language, have large amounts of data available. Many experiments in language technology require that large quantities of this data be manually labeled so that automatic learning algorithms can build sophisticated models of it. But manually annotating a large quantity of data is both expensive and time-consuming. A common challenge in both speech recognition and text-based language analysis is to turn the large quantity of data into a resource rather than a burden. Meeting this challenge requires research at the cutting edge of automatic learning techniques, which are useful not only in many fields within language technology but in many other applications as well.

Automatic Stream Characterization: Some important applications involve large volumes of streaming data whose content must be analyzed quickly to identify:
- Data similar to other data previously judged to be of interest
- Data differing from previous norms or otherwise anomalous
- Overall stream characteristics (e.g., distribution of spoken languages)
The challenge is to develop high-speed algorithms that are reasonably accurate, that can analyze speech or text (either would be useful), and that can deal with data in motion rather than archival data stores.