"The words are just symbols to the computer. How does it know what they really mean?" - David Ferrucci
IBM Watson is IBM's project to create the first computer that can win the TV quiz show Jeopardy when pitted against human contestants, including Ken Jennings, the record holder for the longest championship streak, and Brad Rutter, the current biggest all-time Jeopardy money winner. The resulting computer will be a contestant on Jeopardy next month, February 2011. I will try to give an overview here of what is known to date about IBM Watson from open sources. I'm writing this as much for my own learning as for any other reason, so give me a break if it gets a little fuzzy in the complicated parts; besides, IBM is not showing all of its cards. Of course, I welcome any and all comments, corrections and clarifications. BTW, in the spirit of full disclosure, I am a so-called "IBM brat", having grown up in an IBM family; my father @ljendicott worked for IBM, 1960-1987.
According to the IBM DeepQA FAQ, the history of Watson includes both "Project Halo", the quest for a "digital Aristotle", and AQUAINT, the Advanced Question Answering for Intelligence program. In fact, David Ferrucci, principal investigator for the DeepQA/Watson project, has four publications listed in the AQUAINT Bibliography. The earliest version of Watson was a trial of IBM’s AQUAINT system called PIQUANT, Practical Intelligent Question Answering Technology, adapted for the Jeopardy challenge. Another question answering system, a contemporary of PIQUANT called Ephyra (now available as OpenEphyra), was used with PIQUANT in early trials of Watson, both by IBM and their partners at the Language Technologies Institute at Carnegie Mellon University (who are jointly developing the "Open Advancement of Question Answering Systems" initiative). One of the things OpenEphyra can do that Watson doesn't do at the moment is retrieve the answers to natural language questions from the Web.
IBM Watson is not a conversational intelligence per se, but rather a question answering system (QA system). It is fully self-contained and not connected with the Internet at all. Watson does have an active Twitter account at @IBMWatson, but it is operated by a group of Watson's handlers (using CoTweet). Watson has no speech recognition capability; questions are delivered textually. It does not have autonomous text-to-speech (TTS) capability: TTS must be triggered by an operator (ostensibly to avoid interruptions to the television performance). New York Times readers have tentatively identified the voice of Watson TTS as that of Jeff Woodman. Presumably, an IBM WebSphere Voice product is being used for Watson TTS.
The distinctive "face" or avatar of IBM Watson, about the size of a human head, expresses emotion and reacts to the environment. It was created by Joshua Davis with Adobe Flash Professional CS5 using the ActionScript HYPE visual framework deployed via Adobe Flash Player 10.1. The avatar is connected to Watson via an XML socket server, which sends information about the computer’s current mood or state, such as “I know the answer”, “I won the buzz”, etc. The avatar also receives audio input from Watson’s voice by analyzing audio from the microphone.
IBM Watson is built on a massively parallel supercomputer. The hardware configuration consists of a room-sized system, about the size of 10 refrigerators: 10 racks containing 90 IBM Power 750 servers connected over 10 Gb Ethernet. Each Power 750 contains 4 POWER7 chips with 32 cores in total; the POWER7 is supposedly the world's fastest processor. IBM Watson therefore has a total of 360 computer chips (90 x 4) and 2,880 processor cores (90 x 32). It has 15 terabytes of RAM, and a total data repository of 4 terabytes, consisting of two 2 terabyte (TB) I/O nodes. IBM Watson operates at some 80 teraFLOPS, or 80 trillion floating-point operations per second. (For comparison, both IBM Blue Gene and the AFRL Condor Cluster operate at some 500 teraFLOPS.)
Many sources, including the Wall Street Journal, are claiming Watson's 4 terabytes (TB) of storage contains some 200 million "pages" of content. Wired claimed only 2 million pages of data for Watson. 1 TB (or 1,024 GB) is roughly equivalent to the contents of a large library (or about 1,610 CDs). Large municipal libraries may contain an average of 10,000 volumes. So, if a book averaged say 200 pages, then Watson's 4 TB should contain something closer to 8 million pages of content (4 x 10,000 x 200). Content sources include unstructured text, semistructured text, and triplestores.
Watson's software configuration consists basically of SUSE Linux Enterprise Server 11, Apache Hadoop, and UIMA-AS. SUSE Linux Enterprise Server 11 is a Linux distribution supplied by Novell and targeted at the business market. Apache Hadoop is a software framework that supports data-intensive distributed applications, including an open source version of MapReduce, enabling applications to work with thousands of nodes and petabytes of data. UIMA-AS (Unstructured Information Management Architecture - Asynchronous Scaleout) is an add-on scaleout framework supporting flexible scaleout with Java Message Service. Hadoop facilitates Watson's massively parallel probabilistic evidence-based architecture by distributing it over the thousands of processor cores.
The DeepQA architecture has three layers: natural language processing (NLP), knowledge representation and reasoning (KRR), and machine learning (ML). The IBM Watson team used every trick in the book for DeepQA; apparently they couldn't decide which natural language processing techniques to use, so they just used them all. Each one of Watson's 2,880 processor cores can be used like an individual computer, enabling Watson to run hundreds if not thousands of processes simultaneously. For instance, each processor thread could host a separate search. All the hundreds of components in DeepQA are implemented as UIMA annotators. Internal communication among processes is handled in UIMA by OpenJMS, an open source version of Java Message Service. The IBM Content Analytics product LanguageWare is used in Watson for natural language processing. According to David Ferrucci, Watson contains "about a million lines of new code".
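To make the "one search per thread" idea a little more concrete, here is a minimal Python sketch. It is not Watson's code (which runs as UIMA annotators over Hadoop and UIMA-AS); the `SOURCES` dictionary, the `search` function and the `parallel_search` helper are all hypothetical names invented for illustration, and a thread pool simply stands in for Watson's many cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "sources"; stand-ins for Watson's content stores.
SOURCES = {
    "encyclopedia": [
        "Ken Jennings holds the longest Jeopardy championship streak",
        "WordNet groups English words into synsets",
    ],
    "news": ["Brad Rutter is the biggest all-time Jeopardy money winner"],
    "dictionary": ["A synset is a set of synonyms"],
}


def search(source_name, keyword):
    """Return (source, passage) pairs whose text contains the keyword."""
    keyword = keyword.lower()
    return [(source_name, p) for p in SOURCES[source_name] if keyword in p.lower()]


def parallel_search(keyword, max_workers=4):
    """Run one search per source concurrently and gather all hits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per source, mirroring "each thread could host a separate search".
        futures = [pool.submit(search, name, keyword) for name in SOURCES]
        hits = []
        for future in futures:
            hits.extend(future.result())
    return hits


if __name__ == "__main__":
    for source, passage in parallel_search("jeopardy"):
        print(source, "->", passage)
```

The point of the design is only that each search is independent, so the searches can be farmed out and their results gathered afterwards.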
Processing steps: (1) Question Analysis -> (2) Query decomposition -> (3) Hypothesis generation -> (4) Soft filtering -> (5) Evidence scoring -> (6) Synthesis -> (7) Merging and ranking -> (8) Answer and confidence
(1) Question Analysis:
In the UIMA architecture, the collection processing engine consists of the collection reader, the analysis engine and the common analysis structure. Collection-level processing includes the entity registrar with its event, entity and relation coreferencers, and ultimately produces a semantic search index, a store of feature structures (common analysis structures) in XML, and the extracted knowledge database. The UIMA analysis engine consists of programs that analyze documents and infer information about them. The extracted knowledge base resides in an IBM DB2 database. Data in the common analysis structure can only be retrieved using indexes. These are analogous to the indexes specified on the tables of a database, and are used to retrieve instances of a type and its subtypes. In addition to a base common analysis structure index, there are additional indexes for annotated views, created by natural language processing techniques such as tokenization and named entity recognition.
In the Jeopardy game show, contestants are presented with clues in the form of answers, and must phrase their responses in question form. Watson receives questions or "clues" textually and then breaks them down into subclues. Question clues often consist of relations, such as syntactic subject-verb-object predicates and semantic relationships between subclues such as entities. A semantic search is one in which the intent of the query is specified using one or more entity or relation specifiers. Triplestore queries in the primary search are based on named entities in the clue. Watson can use detected relations to query a triplestore and directly generate candidate answers. Triplestore sources in Watson include dbpedia.org, wordnet.princeton.edu and YAGO (which itself is a combination of dbpedia, WordNet and geonames.org). Triplestore and reverse dictionary lookup can produce candidate answers directly as search results. Reverse dictionary lookup means looking up a word by its meaning, rather than a meaning by its word.
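As a rough illustration of how a detected relation plus a named entity can yield candidate answers straight from a triplestore, here is a toy Python sketch. The `TRIPLES` list, the predicate names and the `match` and `candidate_answers` functions are all assumptions made up for this example; a real triplestore such as DBpedia would be queried very differently.

```python
# Toy (subject, predicate, object) triples; predicate names are illustrative only.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Madrid", "capital_of", "Spain"),
    ("France", "borders", "Spain"),
    ("France", "borders", "Italy"),
]


def match(pattern, triples=TRIPLES):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


def candidate_answers(relation, known_entity, triples=TRIPLES):
    """Given a relation and a named entity from the clue, return the
    entities found on the other side of that relation."""
    as_object = [s for s, _, _ in match((None, relation, known_entity), triples)]
    as_subject = [o for _, _, o in match((known_entity, relation, None), triples)]
    return as_object + as_subject


if __name__ == "__main__":
    print(candidate_answers("capital_of", "France"))
    print(candidate_answers("borders", "France"))
```

Every entity returned this way is only a candidate; in the pipeline described here it would still have to survive filtering and evidence scoring.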
(2) Query decomposition:
DeepQA supports nested decomposition, or query decomposition, a kind of stochastic programming, where questions are broken down into more easily answered subclues. Nesting means that an inner subclue is nested in the outer clue, so the subclue can be replaced with an answer to form a new question that can be answered more easily.
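The substitution step can be sketched in a few lines of Python. This is only a toy under loud assumptions: the inner subclue is marked by hand with parentheses (Watson has to find it on its own), and `KNOWN_ANSWERS` and `answer_simple` are hypothetical stand-ins for a real QA subsystem.

```python
import re

# Hypothetical lookup table standing in for a real question answering subsystem.
KNOWN_ANSWERS = {
    "the country that borders Portugal": "Spain",
    "the capital of Spain": "Madrid",
}

# Matches an innermost parenthesized subclue (one with no nested parentheses).
INNER_SUBCLUE = re.compile(r"\(([^()]*)\)")


def answer_simple(question):
    """Answer a question that needs no further decomposition."""
    return KNOWN_ANSWERS[question]


def answer_nested(clue):
    """Answer inner subclues first, substituting each answer into the outer clue."""
    steps = []
    while True:
        found = INNER_SUBCLUE.search(clue)
        if found is None:
            break
        inner_answer = answer_simple(found.group(1))
        steps.append((found.group(1), inner_answer))
        # Replace the subclue (and its parentheses) with its answer.
        clue = clue[:found.start()] + inner_answer + clue[found.end():]
    final = answer_simple(clue)
    steps.append((clue, final))
    return final, steps


if __name__ == "__main__":
    final, steps = answer_nested("the capital of (the country that borders Portugal)")
    for question, answer in steps:
        print(question, "->", answer)
```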
(3) Hypothesis generation:
In constructing hypotheses, Watson creates candidate answers and intermediate hypotheses, and then checks hypotheses against WordNet for "evidence", dealing with hundreds of thousands of evidence pairs. Watson uses the offline version of WordNet, a lexical database that groups English words into synsets, or sets of synonyms, provides definitions and records semantic relationships. Chris Welty, David Gondek, JW (Bill) Murdock and Chang Wang are the IBM Watson Algorithms Team machine learning experts. Wang in particular is an expert in "Manifold Alignment". In engineering, manifolds typically bring one into many or many into one. According to Wang, "Manifold alignment builds connections between two or more disparate data sets by aligning their underlying manifolds and provides knowledge transfer across the data sets". Watson uses logical form alignment to score on grammatical relationships, deep semantic relationships or both. Inverse document frequency is used as a statistical measure of word importance: the fewer documents a word appears in, the more weight it carries. And the Smith-Waterman algorithm, a local sequence alignment method, compares the word sequences of questions and candidate answers for evidence.
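For the curious, here is a minimal Python sketch of Smith-Waterman local alignment applied to word tokens rather than to the biological sequences it was designed for. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the example sentences are my own assumptions; how Watson actually tokenizes and scores is not public.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 is what makes the alignment local rather than global.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    clue = "this author wrote moby dick".split()
    passage = "herman melville wrote moby dick in 1851".split()
    print(smith_waterman(clue, passage))
```

A passage sharing a long run of words with the clue, in the same order, scores higher than one that merely contains the same words scattered about.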
(4) Soft filtering:
Soft filtering may consist of a lightweight scorer computing the likelihood of a candidate answer simply being an instance of the lexical answer type, or LAT. A LAT is a word in the clue that categorizes the type of answer required, independent of assigned semantics. Watson uses lexical answer type for deferred type evaluation. Interestingly, Ferrucci's name is on an IBM patent application (System And Method For Providing Question And Answers With Deferred Type Evaluation), which includes lexical answer type. The method described there defers type checking: it waits until both a descriptor (the type) has been determined and a candidate answer has been provided. Then, a search is conducted to look for evidence that the candidate answer has the required lexical answer type. Or, it may attempt to match the LAT to a known ontological type (OT). The evidence from the different ways of determining that the candidate answer has the expected lexical answer type is combined, and one or more answers are delivered to a user. The IBM Watson team found 2,500 distinct and explicit LATs in the 20,000 Jeopardy Challenge question sample; the most frequent 200 explicit LATs covered less than 50 percent of the questions.
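A lightweight scorer of this kind might look something like the following Python sketch. Everything in it is an assumption for illustration: the `TYPE_LEXICON` table, the three score values, and the `threshold` are invented, and what Watson does with the candidates that fall below the threshold is not shown.

```python
# Hypothetical type lexicon: candidate answer -> type words it is known by.
TYPE_LEXICON = {
    "Herman Melville": {"author", "novelist", "writer"},
    "Moby Dick": {"novel", "book", "whale"},
    "Nantucket": {"island", "town"},
}


def lat_score(candidate, lat, lexicon=TYPE_LEXICON):
    """Cheap likelihood that the candidate is an instance of the LAT."""
    types = lexicon.get(candidate)
    if types is None:
        return 0.5  # nothing known either way, so stay neutral
    return 1.0 if lat in types else 0.0


def soft_filter(candidates, lat, threshold=0.5):
    """Split candidates into those that pass the threshold and those set aside."""
    passed, set_aside = [], []
    for candidate in candidates:
        bucket = passed if lat_score(candidate, lat) >= threshold else set_aside
        bucket.append(candidate)
    return passed, set_aside


if __name__ == "__main__":
    print(soft_filter(["Herman Melville", "Moby Dick", "Ishmael"], "author"))
```

The filter is "soft" in the sense that an unknown candidate is not thrown away just because the lexicon has never heard of it.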
(5) Evidence scoring:
There are two layers of machine learning on top of the many NLP processes. Learners located at the bottom layer are called base learners, and their predictions are combined by metalearners in the upper layer. On top of the first learning layer is a reasoning layer, which includes temporal reasoning, statistical paraphrasing, and geospatial reasoning, in order to gather and weigh evidence over both the unstructured and structured content to determine an answer with the most confidence. Watson uses about 100 algorithms for rating each of up to some 10,000 sets of possible answers for every question. Trained classifiers score each of the hundreds of NLP processes.
One type of scorer uses knowledge in triplestores for simple reasoning, such as subsumption and disjointness in type taxonomies, geospatial and temporal reasoning. Temporal reasoning is used in Watson to detect inconsistencies between dates in the clue and those associated with a candidate answer. Paraphrasing is the expression of the same message in different words. Statistical paraphrasing is the use of a statistical sentence generation technique that recombines words probabilistically to create new sentences. Geospatial reasoning is used in Watson to detect the presence or absence of spatial relations, such as directionality, borders and containment between geoentities.
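The temporal check in particular is easy to picture. Below is a small Python sketch, assuming the clue carries a single year and each candidate is a person with a known lifespan; the function names and the 1.0/0.0 scores are hypothetical, and a real scorer would handle much messier date evidence.

```python
# Lifespans as (birth_year, death_year); None for death_year means still living.
LIFESPANS = {
    "Herman Melville": (1819, 1891),
    "Mark Twain": (1835, 1910),
    "Ernest Hemingway": (1899, 1961),
}


def temporally_consistent(clue_year, lifespan):
    """True if the clue's year falls within the candidate's lifetime."""
    birth, death = lifespan
    if clue_year < birth:
        return False
    return death is None or clue_year <= death


def temporal_scores(clue_year, lifespans=LIFESPANS):
    """Score 1.0 for candidates consistent with the clue's date, else 0.0."""
    return {
        name: 1.0 if temporally_consistent(clue_year, span) else 0.0
        for name, span in lifespans.items()
    }


if __name__ == "__main__":
    # A clue mentioning 1851 cannot be about someone born in 1899.
    print(temporal_scores(1851))
```

Note that the check only rules candidates out; passing it is weak evidence at best, which is why it is one scorer among many.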
(6) Synthesis:
Each subclue of every nested decomposable question is processed by a dedicated QA subsystem, in a parallel process. DeepQA then synthesizes final answers using a custom answer combination component. This custom synthesis component allows special synthesis algorithms to be easily plugged into the common framework.
Aditya Kalyanpur, Siddharth Patwardhan and James Fan are the IBM Watson Algorithms Team reasoning experts. In their 2010 paper, titled "PRISMATIC: Inducing Knowledge from a Large Scale Lexicalized Relation Resource", Fan, Ferrucci, Gondek and Kalyanpur present a system for the statistical aggregation of syntactic frames. A syntactic frame is the position in which a word occurs relative to other classes of words, such as subject, verb, and object. In contrast, a semantic frame can be thought of as a concept with a script used to describe an object, state or event.
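To give a feel for what "statistical aggregation of syntactic frames" could mean in practice, here is a toy Python sketch that counts hand-written subject-verb-object frames. It is not the PRISMATIC system: the `FRAMES` list and the `aggregate` and `likely_objects` functions are assumptions for illustration, and real frames would come from parsing a large corpus.

```python
from collections import Counter

# Hand-written (subject, verb, object) frames; a real system would extract
# these from parsed text.
FRAMES = [
    ("author", "write", "novel"),
    ("author", "write", "novel"),
    ("author", "write", "poem"),
    ("composer", "write", "symphony"),
    ("author", "publish", "novel"),
]


def aggregate(frames, slots=(0, 1, 2)):
    """Count how often each projection of the frames onto the given slots occurs."""
    return Counter(tuple(frame[i] for i in slots) for frame in frames)


def likely_objects(frames, subject, verb):
    """Relative frequency of each object seen with this subject and verb."""
    counts = Counter(o for s, v, o in frames if s == subject and v == verb)
    total = sum(counts.values())
    return {obj: count / total for obj, count in counts.items()}


if __name__ == "__main__":
    print(aggregate(FRAMES, slots=(0, 1)))
    print(likely_objects(FRAMES, "author", "write"))
```

Counts like these let a system guess, for example, that something an author writes is more likely to be a novel than a symphony.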
(7) Merging and ranking:
Watson uses hierarchical machine learning, a learning methodology inspired by human intelligence, to combine and weigh evidence in order to compute the confidence score, and through training it learns to be predictive. Watson merges answer scores prior to ranking and probabilistic confidence estimation, using a variety of matching, normalization, and coreference resolution algorithms. In this second level of machine learning, metalearner classification systems take classifiers and turn them into more powerful learners, using multiple trained models. Final ranking and merging evaluates hundreds of hypotheses based on hundreds of thousands of scores to identify the best one based on the likelihood it is correct.
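Here is a minimal Python sketch of the merge-then-rank idea, assuming that answers differing only in case, punctuation or a leading article are the same answer. The `normalize` rules and the choice to keep the best score in each group are my own simplifications; Watson's matching, normalization and coreference algorithms are far richer and are not public.

```python
import re
from collections import defaultdict


def normalize(answer):
    """Lowercase, drop punctuation, and strip a leading article."""
    text = re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()
    text = re.sub(r"^(the|a|an) ", "", text)
    return " ".join(text.split())


def merge_and_rank(scored_answers):
    """Group surface variants of an answer, keep the best-scoring variant
    of each group, and rank the groups by score."""
    groups = defaultdict(list)
    for answer, score in scored_answers:
        groups[normalize(answer)].append((answer, score))
    merged = [max(variants, key=lambda v: v[1]) for variants in groups.values()]
    return sorted(merged, key=lambda m: m[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("The Beatles", 0.4),
        ("beatles", 0.7),
        ("Rolling Stones", 0.5),
        ("The Rolling Stones!", 0.2),
    ]
    for answer, score in merge_and_rank(candidates):
        print(answer, score)
```

Without the merge step, two spellings of the same answer would compete with each other and split their evidence.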
(8) Answer and confidence:
Once the system has been trained on more or less the entire history of the Jeopardy game, the second level of machine learning kicks in to rank the merged scores, using one or more metalearners that have learned to evaluate the results of the first-level classifiers. The metalearner combines the base learners' predictions by multiplying each probability by the weight assigned to its base learner and taking the average; through training it learns how to stack and combine the scores. The ultimate answer is chosen on the basis of this statistical confidence.
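The weighted combination described above can be sketched in Python as follows. The weights and probabilities are made-up numbers, fixed by hand here, whereas in a trained system the weights would be learned; the function names are likewise hypothetical.

```python
def combine(probabilities, weights):
    """Weighted average of base learner probabilities for one candidate."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per base learner")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(p * w for p, w in zip(probabilities, weights))
    return weighted_sum / total_weight


def best_answer(candidate_probabilities, weights):
    """Return (answer, confidence) for the candidate with the highest combined score."""
    confidences = {
        answer: combine(probabilities, weights)
        for answer, probabilities in candidate_probabilities.items()
    }
    answer = max(confidences, key=confidences.get)
    return answer, confidences[answer]


if __name__ == "__main__":
    # Three hypothetical base learners, each giving a probability per candidate.
    candidates = {
        "Madrid": [0.9, 0.8, 0.4],
        "Lisbon": [0.2, 0.3, 0.9],
    }
    weights = [0.5, 0.3, 0.2]
    print(best_answer(candidates, weights))
```

In the example the third learner strongly prefers the wrong answer, but because it carries the smallest weight the combined confidence still favors the other candidate.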
So, how many PlayStations (PS3) would it take to make an IBM Watson? By my calculation, 320. The AFRL Condor Cluster took about 2,000 PS3s to make and does some 500 teraFLOPS. IBM Watson does 80 teraFLOPS. [500/80=6.25 & 2000/6.25=320] The cost of 320 PlayStations would be about $128,000 (at $400 each), or a little over a third of the retail price for one IBM Power 750 32-core server at around $350,000. (In comparison, as of 2007 PCWorld put IBM's Blue Gene/P system cost at $1.3M per rack, and the Blue Gene/L at $800K.) Deep Blue was a $100 million project. I'm estimating the cost of IBM Watson at up to $50 million, including at least $18 million labor and potentially up to $31.5 million in material costs (90 x $350,000). It should be noted that "Jeopardy! And IBM Announce Charities To Benefit From Watson Competition".
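For anyone who wants to check my back-of-the-envelope numbers, here they are as a few lines of Python. The figures are the rough ones quoted above; the $400 PS3 price is simply what $128,000 divided by 320 implies, and the 90-server count comes from the hardware description earlier in this post.

```python
# Rough figures quoted in this post; all are approximations.
CONDOR_PS3_COUNT = 2000
CONDOR_TFLOPS = 500
WATSON_TFLOPS = 80
PS3_PRICE_USD = 400            # implied by $128,000 / 320
POWER_750_PRICE_USD = 350_000  # approximate retail price of one Power 750
POWER_750_COUNT = 90

# Condor is this many times faster than Watson.
speed_ratio = CONDOR_TFLOPS / WATSON_TFLOPS

# Scale Condor's PS3 count down by the same ratio.
ps3_needed = CONDOR_PS3_COUNT / speed_ratio
ps3_cost = ps3_needed * PS3_PRICE_USD

# PS3 cost as a fraction of one Power 750, and the total hardware bill.
share_of_one_server = ps3_cost / POWER_750_PRICE_USD
material_cost = POWER_750_COUNT * POWER_750_PRICE_USD

if __name__ == "__main__":
    print("speed ratio:", speed_ratio)
    print("PS3s needed:", ps3_needed)
    print("PS3 cost (USD):", ps3_cost)
    print("share of one Power 750:", round(share_of_one_server, 2))
    print("material cost (USD):", material_cost)
```

Of course this treats peak teraFLOPS as the only thing that matters, which ignores memory, interconnect and everything else that makes Watson what it is.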
= = =
Appendix 1: Chronological bibliography of David Angelo Ferrucci (David A. Ferrucci, David Ferrucci, D.A. Ferrucci, D. Ferrucci):
Fan J, Ferrucci D, Gondek D, Kalyanpur A. PRISMATIC: Inducing Knowledge from a Large Scale Lexicalized Relation Resource. In: First International Workshop on Formalisms and Methodology for Learning by Reading (FAM-LbR).; 2010:122.
Ferrucci D. Build Watson: an overview of DeepQA for the Jeopardy! challenge. In: Proceedings of the 19th international conference on Parallel architectures and compilation techniques.; 2010:1-2.
Ferrucci D, Brown E, Chu-Carroll J, et al. Building Watson: An Overview of the DeepQA Project. AI Magazine. 2010;31(3):59.
Ferrucci D, Lally A, Verspoor K, Nyberg E. Unstructured Information Management Architecture (UIMA) Version 1.0. OASIS Standard. 2009.
Ferrucci D, Lally A. Building an example application with the unstructured information management architecture. IBM Systems Journal. 2004;43(3):455-475.
Drissi Y, Boguraev B, Ferrucci D, Keyser P, Levas A. A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Architecture. In: LREC.; 2008.
Chu-Carroll J, Prager J, Czuba K, Ferrucci D, Duboue P. Semantic search via XML fragments: a high-precision approach to IR. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.; 2006:445-452.
Ferrucci D, Grossman RL, Levas A. PMML and UIMA based frameworks for deploying analytic applications and services. In: Proceedings of the 4th international workshop on Data mining standards, services and platforms.; 2006:14-26.
Ferrucci D, Lally A, Gruhl D, et al. Towards an interoperability standard for text and multi-modal analytics. IBM Res. Rep. 2006.
Ferrucci D, Murdock JW, Welty C. Overview of Component Services for Knowledge Integration in UIMA (aka SUKI). IBM Research Report RC24074. 2006.
Ferrucci DA. Putting the Semantics in the Semantic Web: An overview of UIMA and its role in Accelerating the Semantic Revolution. In: ; 2006.
Fikes R, Ferrucci D, Thurman D. Knowledge associates for novel intelligence (kani). In: 2005 International Conference on Intelligence Analysis.; 2005.
Levas A, Brown E, Murdock JW, Ferrucci D. The Semantic Analysis Workbench (SAW): Towards a framework for knowledge gathering and synthesis. In: Proc. Int’l Conf. in Intelligence Analysis.; 2005.
McGuinness DL, Pinheiro P, William SJ, Ferrucci MD. Exposing Extracted Knowledge Supporting Answers. Stanford Knowledge Systems Laboratory Technical 12. 2005.
Murdock JW, Silva PPD, Ferrucci D, Welty C, McGuinness D. Encoding Extraction as Inferences. In: Stanford University. AAAI Press; 2005:92-97.
Ferrucci D, Lally A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering. 2004;10(3-4):327-348.
Chu-Carroll J, Ferrucci D, Prager J, Welty C. Hybridization in question answering systems. In: Working Notes of the AAAI Spring Symposium on New Directions in Question Answering.; 2003:116-121.
Nyberg E, Burger JD, Mardis S, Ferrucci DA. Software Architectures for Advanced QA. In: New Directions in Question Answering.; 2004:19-30.
Chu-Carroll J, Prager J, Welty C, et al. A multi-strategy and multi-source approach to question answering. NIST SPECIAL PUBLICATION SP. 2003:281-288.
Bringsjord S, Ferrucci D. Artificial Intelligence and Literary Creativity: Inside the Mind of Brutus, A Storytelling Machine. Lawrence Erlbaum; 1999.
Welty CA, Ferrucci DA. A formal ontology for re-use of software architecture documents. In: Automated Software Engineering, 1999. 14th IEEE International Conference on.; 1999:259-262.
Welty CA, Ferrucci DA. Instances and classes in software engineering. intelligence. 1999;10(2):24-28.
= = =
[APPLICATION] Method For Processing Natural Language Questions And Apparatus Thereof
US Pat. 12765990 - Filed Apr 23, 2010 - International Business Machines Corporation
[APPLICATION] System And Method For Providing Question And Answers With Deferred Type Evaluation
US Pat. 12126642 - Filed May 23, 2008 - International Business Machines Corporation
[APPLICATION] System and method for providing answers to questions
US Pat. 12152411 - Filed May 14, 2008 - International Business Machines Corporation
[APPLICATION] Method and system for characterizing unknown annotator and its type system with respect to reference annotation types and associated reference taxonomy nodes
US Pat. 11620189 - Filed Jan 5, 2007
Method and system for characterizing unknown annotator and its type system with respect to reference annotation types and associated reference taxonomy nodes
US Pat. 7757163 - Filed Jan 5, 2007 - International Business Machines Corporation.
[APPLICATION] Method And Apparatus For Managing Instant Messaging
US Pat. 11459694 - Filed Jul 25, 2006
[APPLICATION] Autonomous system and method for creating readable scripts for concatenative text-to-speech synthesis (TTS) corpora
US Pat. 11332292 - Filed Jan 17, 2006 - International Business Machines Corporation
Question answering system, data search method, and computer program
US Pat. 7844598 - Filed Sep 22, 2005 - Fuji Xerox Co., Ltd.
[APPLICATION] System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system
US Pat. 10449264 - Filed May 30, 2003 - International Business Machines Corporation
[APPLICATION] System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching
US Pat. 10449398 - Filed May 30, 2003 - International Business Machines Corporation
[APPLICATION] System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US Pat. 10449409 - Filed May 30, 2003 - International Business Machines Corporation
System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US Pat. 7139752 - Filed May 30, 2003 - International Business Machines Corporation
[APPLICATION] System, Method and Computer Program Product for Performing Unstructured Information Management and Automatic Text Analysis
US Pat. 10448859 - Filed May 30, 2003 - International Business Machines Corporation
Method and system for loose coupling of document and domain knowledge in interactive document configuration
US Pat. 7131057 - Filed Feb 4, 2000 - International Business Machines Corporation
Method and system for document component importation and reconciliation
US Pat. 7178105 - Filed Feb 4, 2000 - International Business Machines Corporation
Method and system for automatic computation creativity and specifically for story generation
US Pat. 7333967 - Filed Dec 23, 1999 - International Business Machines Corporation