Interest in question answering systems has been revived since the birth of the World Wide Web in 1989 (Cailliau, 2000) and the launch of Apple’s Siri in 2011. Using the web as a knowledgebase for such a system allows answers to be retrieved from a potentially limitless number of sources. However, the organisation and scale of the web make this an extremely difficult task, which is why comparing the suitability of existing search engine algorithms for this purpose is a worthwhile area of research. Although question answering systems have been around for some time, utilising the web as the knowledgebase is a relatively new concept, and one that raises many considerations.
The aim of this post is to establish the most effective way for an Internet based question answering system to use the World Wide Web as its knowledgebase. Three major factors are investigated: the search algorithm used to obtain candidate documents for answer extraction, the number of candidate documents used, and the importance of classifying answer types.
Some key conclusions from my research include: 1) increasing the number of candidate documents improves the accuracy of potential answers, and 2) Google’s PageRank has a positive effect on obtaining candidate documents for a web based question answering system compared to other algorithms.
In addition to this, further primary research was conducted on the Google Application Programming Interface (API) due to reliability issues which led to rogue results and anomalies appearing throughout the data acquisition phase. Although these issues have been resolved, the problems with the Google API are fully documented in Chapter 5, section 5.3.
Using the World Wide Web as the knowledgebase for a question answering system extends beyond information retrieval. It results in the entire world participating in creating a repository of information ready to be interfaced with question answering technology. In a webcast by Sun Microsystems and eBay, it was said that the web is “…switching from a market where the value is in access, to a market where the value is in participation.” (Schwartz, 2005). Instead of the traditional approach of obtaining pre-keyed answers from one location, the public nature of the web results in potential answers coming from any individual or business that has published information online. Although the accuracy of unverified sources (factoids) is a concern (Economist Article, 2004), the value of incorporating user participation greatly extends the scope of question answering systems.
"The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things before they are suffocated. Too many facts are as bad as none at all.“ (W.H. Auden, 2000)
Google’s search engine is one of the largest and most widely used resources on the web. Its index has grown from just fifty-five million pages in 1999 (SEO Journal, 2004) to over one trillion in 2008: a growth of several orders of magnitude in under a decade. The evident popularity of the Internet as a medium for research makes it an extremely attractive resource for seeking quick answers to simple, fact based questions such as ‘What is the tallest mountain in Scotland?’. For many users “inexperienced with the art of web research” (Brin & Page, 1998), getting answers to questions can sometimes be very frustrating. Instead of receiving a direct answer to a question, a list of websites is returned from which the answer must be sourced manually.
Despite having indexed a large proportion of the World Wide Web, the major search providers have not yet found solutions for answering questions posed in natural language form. Google has been described as the “default command-line interface for the Web” (The Linux Journal), an apt analogy: it is an extremely powerful resource, yet it requires some knowledge to use effectively, especially when seeking specific answers to questions.
In a recent experiment (Toms et al., 2001), ninety of two hundred participants (45%) asked to answer a question using the Google search engine entered it in natural language form, while the remainder entered only keywords to locate their desired page. The former resulted in query strings containing more irrelevant stop words, such as ‘it’, ‘the’ and ‘how’. In addition, questions posed in natural language form carry words with suffixes such as ‘ing’, ’s’ and ‘ed’. Both factors affect the quality of results returned to the end user. This is just one example of such a study, yet it gives a valuable insight into the scope for research in this area, and into how traditional question answering systems could be adapted to utilise the World Wide Web as a knowledgebase. This document will investigate areas of previous research and comment on their outcomes.
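As a toy illustration of the difference between the two query styles, the sketch below strips stop words from a natural language question to leave a keyword-only query. The stop-word list and the question are my own illustrative examples, not taken from Toms et al.

```python
# Illustrative only: a toy keyword extractor showing how a question posed in
# natural language carries stop words that a keyword-style query would omit.
STOP_WORDS = {"what", "is", "the", "it", "how", "a", "an", "of", "in", "who", "where"}

def to_keywords(question: str) -> list[str]:
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOP_WORDS]

print(to_keywords("What is the tallest mountain in Scotland?"))
# -> ['tallest', 'mountain', 'scotland']
```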
1.3 Research aims and objectives
The overall aim of the study is to investigate the factors that influence the accuracy of answers for question answering systems using the World Wide Web as a knowledgebase.
In order to accomplish this aim, the following research objectives were formulated:
1. To review and analyse question answering systems and components. This is answered through the review of literature.
2. To establish the most effective way to present a question to a document retriever to obtain optimum results. This is answered through the review of literature.
3. To determine if the search algorithm used affects the quality of results retrieved. This is answered through primary research.
4. To determine whether increasing the number of candidate documents increases the likelihood of obtaining potential answers through repetition. This is answered through primary research.
5. To identify the extent of the problems that exist with the Google API. This is answered through primary research.
6. To identify and recommend ways in which future web question answering systems can be improved. This is addressed in the conclusion.
1.4 Research methodology
Various methods of research were reviewed and considered for this post, and both secondary and primary research were deemed necessary in order to address all research objectives.
Secondary research entailed looking into how existing systems operate, their strengths and weaknesses, and why there is not currently a commercial-scale question answering system. It was important to understand each area of search engine technology, and to determine what strategies are already in place with regard to question answering.
The primary research was conducted through the development of a question keyword analysis system, in order to gather quantifiable results about the optimum method of utilising the web as a knowledgebase for a question answering system.
1.5 Points to prove!
Some areas of interest in this document are:
1. Google’s search algorithm has a negative impact on obtaining accurate candidate documents for a web based question answering system.
2. Increasing the number of candidate documents increases the likelihood of obtaining potential answers through pattern repetition.
3. The Google application programming interface is not currently suitable for commercial use.
4. Stemming is the optimum way to prepare a question for presentation to the document retrieval component of a question answering system.
2.1 Introduction
This chapter contains the literature analysis and information on key technologies that are relevant to the project. It aims to critically analyse various question answering techniques, and will ultimately result in areas of further research becoming clearer. Moreover, it should help define the objectives of the project from the perspective of the literature. Literature has been sourced from journal archives, books and the Internet. Additional resources have also been provided by project supervisor Katrin Hartmann. In addition, experimental requirements will be established by analysing data obtained from other researchers, culminating in the development of a prototype keyword analysis system to aid in addressing the primary research topics.
2.2 Question answering background
Obtaining the answer to a question using a search engine can sometimes be very frustrating. Instead of getting a direct response to a question, a list of websites is provided for the user to view and locate relevant information manually. Question answering is a task that “aims beyond document retrieval and towards natural language understanding.” (Aunimo & Kuuskoski, 2001). Systems which use the World Wide Web as a knowledgebase aim to parse documents retrieved by a search engine for scrapes of relevant information, and to identify and return correct answers directly to the user, effectively removing an entire step from the search process.
A pre-requisite to the success of a question answering system is a solid understanding of the English language by the researcher. It has been said that “…ambiguity is an essential part of language; and it is often an obstacle ignored” (Quiroga-Clare, 2002). It would clearly be a mistake not to take characteristics of the English language such as this into consideration when creating this type of system; however, many tools and databases exist to assist developers with refining questions and establishing the expected answer type.
Examples of resources available to developers of question answering systems include algorithms to break words down to their simplest form (Porter, M.F., 2002), as well as thesaurus-style databases of words and their alternatives (WordNet, 2005). This work analyses existing techniques such as these, in addition to conducting primary research in the field, in an attempt to improve the quality of answers obtained for fact based questions using the World Wide Web as a knowledgebase.
A notable resource ranked highly among developers is the Text REtrieval Conference (TREC), which was set up as a forum to support research in all areas of Information Retrieval. TREC provides the infrastructure and funding necessary for the “large-scale evaluation of text retrieval methodologies” (NIST, 2004) and, in addition to its core purposes, acts as an open forum for the international information retrieval research community. TREC is extremely relevant to question answering developers and encourages development in the field by setting challenges and tasks. Organisers of the conference provide test data for participants to analyse and use to test their systems (Wikipedia, 2006). Question answering tasks take the form of questions, while other areas of Information Retrieval are tested using topics or other features (TIPSTER, 2002). A scoring system is implemented so that participants’ systems can be evaluated fairly. After evaluation of the results, a workshop provides a place for participants to gather thoughts and ideas and present current and future research work. The practical aspect of this work makes use of the TREC sample data by running many of the stemmed questions through the keyword analysis system developed for this project.
2.3 The history of question answering
Variants of question answering systems date back to the early days of computing (Witten, 1994), and despite many advances in the field of online information retrieval, major search engine companies such as Google and Microsoft (MSN) have yet to unveil a publicly available, automated question answering system.
The earliest evidence of such research dates back to the 1950s, when Turing considered the question of whether or not machines were capable of rational thought (Turing, A; 1957). He proposed a task he called ‘The Turing Test’, which originated from an earlier exercise called ‘The Imitation Game’, in which a user must identify the difference between a man and a woman via an instant messaging style interface. The Turing Test, however, involved a human judge communicating via an instant messaging style interface with another human and a machine (Copeland, 2004). The machine was said to have passed the test if the judge was unable to tell which was the human candidate and which was the machine. This can be related to question answering in that the computer relied on identifying patterns in the user’s dialogue and tried to match them to pre-programmed data.
As interest in mainstream computing heightened in the mid-60s, a system called Baseball was implemented on top of a database (Green, B.F.; 1963). The system was able to answer questions about baseball scores recorded in the USA and used parsing to identify the teams and statistics. It produced more accurate results than Turing-style pattern matching because it relied on natural language processing, and it could handle more complex queries that involved locating multiple answers from different tables within the database. From the 1970s onwards, attempts were made to create systems capable of understanding and learning language in the same way a human being does. One system, Margie (Schank et al.), could read a document and answer simple questions about it. It worked by parsing the text and organising it in the way the human brain organises data, and was the first attempt to emulate what a human does when reading a document.
The first implementations of such utilities in mainstream computing emerged in expert systems of the 1960s and 1970s, such as Baseball and Lunar (Green, B.F. et al., 1963). The 1990s saw Murax, a system that answered questions using an online encyclopedia, and the year 2000 saw the launch of BrainBoost.com, one of the first fully fledged question answering systems to utilise the World Wide Web as its knowledgebase.
2.4 Components of a typical web based question answering system
The majority of web based question answering systems can be broken down into four components: question analysis, document retrieval, passage retrieval and answer extraction (Hirschman and Gaizauskas, 2001). These techniques are illustrated below and analysed in depth in the following paragraphs:
Figure 1: Typical Question Answering System Components (Aunimo & Kuuskoski , 2002)
The first component, the question analyser, identifies the type of response expected from the question. For example, “Where is Kilmarnock?” would expect a location as its answer. Questions can be classed into categories and then further refined to determine the answer type expected; detailed information on question types can be found in the next section (Section 2.5). Once the question has been parsed by either stemming or query expansion, defined later in this section, the document retrieval component prepares a list of candidate documents. Subsequently, the passage retrieval component selects passages of text which may be relevant and indexes them according to relevancy. The final component, the answer extractor, searches and ranks the passages in more detail and produces a list of candidate answers to the question. The scope of this project extends to altering the traditional four-step model and introducing another stage to further refine the document retrieval process.
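As a rough illustration of how these four components fit together, the toy sketch below wires them up over a tiny in-memory corpus. The component logic (keyword overlap and a hard-coded answer-type table) is deliberately simplistic and of my own devising; it shows the data flow of the model above, not the workings of any particular system.

```python
# A self-contained toy of the four-component model: question analysis,
# document retrieval, passage retrieval and answer extraction.
import re

STOP_WORDS = {"what", "is", "the", "where", "who", "when", "in", "of", "a"}
ANSWER_TYPES = {"where": "LOCATION", "who": "PERSON", "when": "DATE"}

def analyse_question(question):
    tokens = re.findall(r"[a-z']+", question.lower())
    answer_type = ANSWER_TYPES.get(tokens[0], "OTHER")
    keywords = [t for t in tokens if t not in STOP_WORDS]
    return answer_type, keywords

def retrieve_documents(keywords, corpus):
    return [doc for doc in corpus if any(k in doc.lower() for k in keywords)]

def retrieve_passages(documents, keywords):
    passages = [s for doc in documents for s in doc.split(". ")]
    return sorted(passages, key=lambda s: -sum(k in s.lower() for k in keywords))

def extract_answers(passages, answer_type, top_n=3):
    # A real system would apply named-entity recognition filtered by answer_type;
    # here the top-ranked passages simply stand in for candidate answers.
    return passages[:top_n]

corpus = ["Kilmarnock is a town in East Ayrshire, Scotland. It lies on the River Irvine."]
answer_type, keywords = analyse_question("Where is Kilmarnock?")
documents = retrieve_documents(keywords, corpus)
print(answer_type, extract_answers(retrieve_passages(documents, keywords), answer_type))
```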
One factor which affects the quality of results retrieved by a question answering system is the keyword preparation technique. Stemming (Porter, M.F., 1980) is a process in which words are broken down into their simplest form, with suffixes stripped. The Porter stemming algorithm applies a sequence of rules that remove common word endings; it is rule-based rather than dictionary-based. Conversely, query expansion, as the name indicates, does the opposite: it aims to increase the accuracy of results by expanding the query using words or phrases with a similar meaning.
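A quick demonstration of Porter's algorithm, using the implementation in NLTK (assumed to be installed):

```python
# Porter's algorithm via NLTK: rule-based suffix stripping, no dictionary lookup.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["mountains", "climbing", "climbed", "answering"]:
    print(word, "->", stemmer.stem(word))
# mountains -> mountain, climbing -> climb, climbed -> climb, answering -> answer
```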
In a series of recent experiments, these two techniques were compared (Bilotti and Katz, 2004). The first of these tests indexed variations of words during the document retrieval process using query expansion, while the other broke words down into their simplest forms at retrieval time using stemming. The results were contrary to the researchers’ initial assumptions: Porter’s stemming algorithm positively affected the accuracy of results, whilst generating a full spectrum of word variants at indexing time through query expansion decreased the accuracy of the documents returned. As a result of Bilotti and Katz’s findings, stemming was adopted as part of the keyword analyser component of the question answering system developed for this project. As the results in Chapter 5 indicate, phrasing questions as one would expect them to be answered and stripping the suffixes dramatically improves the quality of candidate documents returned by the search engine.
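For contrast, the sketch below shows one simple form of query expansion in the sense defined above, widening each keyword with WordNet synonyms before a query is issued. It uses NLTK's WordNet interface (the wordnet corpus must be downloaded) and is purely illustrative; it is not a reconstruction of Bilotti and Katz's experimental setup.

```python
# Synonym-based query expansion: each keyword is widened with its WordNet synonyms.
from nltk.corpus import wordnet

def expand(keywords):
    expanded = {}
    for word in keywords:
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(word)
                    for lemma in synset.lemmas()}
        expanded[word] = sorted(synonyms | {word})
    return expanded

print(expand(["tallest", "mountain"]))
# e.g. 'mountain' expands to include 'mount' among other senses
```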
Many systems, however, utilise both query expansion and stemming in two separate queries to maximise the number of candidate documents. One such system is Brainboost (Brainboost.com, 2006), which uses this method with some success. Although its results appear relatively accurate, the system is let down by extremely slow execution times. Speed is a major consideration when designing a question answering system, and research indicates that users will only wait a maximum of ten seconds for a page to load (Webmasterworld Article, 2004). As a result, only one technique, stemming, is used in the practical implementation of the prototype system.
2.5 Determining question types
It is logical to assume that if the type of answer expected can be determined, it will be easier to search for that answer within a set of documents. Unfortunately, however, knowing the type of answer expected is not on its own enough to locate a suitable answer (Moldovan et al., 2000), which is why attempts must be made to further refine questions into particular groups (Figure 2).
Figure 2: Types of questions and corresponding answer types (Data from Lampert 2004)
The above table provides a simplified overview of how answer types can be determined from the input question (Lampert, 2004). For categories that are more structured, such as people’s names and place names, it is possible to further refine the list of scrapes for analysis using systems such as WordNet, a lexical database of word associations (Section 2.7). For instance, if the question ‘Who was the first UK prime minister?’ is posed to a question answering system, the answer type, according to Lampert, is ‘Person / Organisation’. Once the type of answer expected has been established, a database of names can be cross-checked, and dictionary words and items established as non-matching can be removed from the candidate answer list, further narrowing down the list of potential answers to the question (Lampert, 2003).
It has also been established that the focus of the question, “a word or number that indicates what is being asked” (Moldovan et al., 2002), is another important factor in reaching the correct answer. For example, the question “Who was the first prime minister of Great Britain?” has the focus “first prime minister”. If both the question type and the focus are known, the system can more easily reach a conclusion.
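The sketch below combines a rule-based answer-type mapping in the spirit of Lampert's table with a crude heuristic for the focus described by Moldovan et al. The rules, categories and focus heuristic are a toy of my own for illustration; they are not taken from either source.

```python
# Toy answer-type classifier plus focus extraction.
import re

QUESTION_WORD_TYPES = {
    "who": "PERSON / ORGANISATION",
    "where": "LOCATION",
    "when": "DATE / TIME",
    "how many": "NUMBER",
    "how much": "QUANTITY",
}

def classify(question):
    q = question.lower().rstrip("?")
    for question_word, answer_type in QUESTION_WORD_TYPES.items():
        if q.startswith(question_word):
            # Crude focus heuristic: take the noun phrase after the question
            # word and auxiliary verb, stopping at 'of' or 'in'.
            rest = q[len(question_word):].strip()
            focus = re.sub(r"^(was|is|are|were)\s+(the\s+)?", "", rest)
            focus = focus.split(" of ")[0].split(" in ")[0]
            return answer_type, focus
    return "OTHER", None

print(classify("Who was the first prime minister of Great Britain?"))
# -> ('PERSON / ORGANISATION', 'first prime minister')
```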
2.6 Analysis of existing systems
As discussed, many systems have been developed to address the problem of question answering for the web, several of which are publicly available via the Internet. The best known public question answering system is Ask Jeeves (Roussinov, Chau & Filatova). Contrary to popular belief, this is not a fully automated question answering system; instead it attempts to improve the quality of search results obtained for questions formed in “natural language style” (AskJeeves, Last Accessed 2006). The system no longer operates the way it did when it was launched, and has recently been rebranded as Ask.com in an attempt to follow the success of the Google empire. Although it was recently said that “AskJeeves is still a place for questions and answers” (Information Week Article, 2005), it appears to have lost many of the qualities for which it was originally admired. AskJeeves, therefore, cannot be classed as a full question answering system, as it still returns a set of documents for users to peruse instead of direct answers, or even fragments of potential answers.
An example of an accurate and well-structured system for fact based questions is START (START, 2006), which claims to be the first system of its kind. It takes the same approach as most question answering systems by utilising the World Wide Web as its knowledgebase, analysing passages retrieved from candidate documents and parsing them into meaningful sentences. Developed at the Massachusetts Institute of Technology, it claims to supply users with “just the right information” instead of “merely providing a list of hits” (MIT, 2001). A technique called “natural language annotation” is integral to the success of START. This technique uses natural language phrases as descriptions of content that are associated with “information segments”; an information segment is retrieved when its phrase matches an input question. This method allows the system to deal with a wide range of question types, in addition to being able to display media such as graphics and sounds. The result is a system that is extremely accurate for fact based questions, yet ineffective at answering a broader range of them.
Mulder, another legacy system, is believed to be the first question answering system made publicly available on the World Wide Web. The system works similarly to the model described previously. The user enters a question on a web based form, and the system constructs a table of the structure and classification of the question. Next, the query formulator prepares a list of search engine queries, which are issued to the likes of Google. The answer extractor then obtains fragments of the returned documents, which are scored and ranked, and the resulting list of fragments is displayed to the user. An experiment in 2000 showed that each component of the Mulder system contributed equally to its success, and that user effort is reduced by a factor of 6.6 compared to Google (Kwok and Etzioni, 2000). The system also performed three times better than AskJeeves according to this study.
A more recent question answering system which uses the web as its knowledgebase is AnswerBus, developed in 2001 (Zheng et al., 2003). It is very similar in structure to START, but incorporates a multi-lingual element. Zheng states that the “system is only designed for short, fact based questions” (Zheng, 2001), and it appears to enjoy relatively high accuracy rates: it answered over two thirds of a sample of TREC-8 questions accurately (Zheng et al., 2003) using extremely modest resources. The primary difference between AnswerBus and other question answering systems lies in its document retrieval component. Instead of interfacing with just one search engine, it uses multiple sources to obtain its candidate documents. In addition, Zheng’s algorithm takes into account each search engine’s ranking of a page, which is relevant to several of the hypotheses in this project. It can be concluded that AnswerBus is one of the most accurate question answering systems that uses only the web as its knowledgebase.
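The sketch below illustrates the general idea of pooling candidate documents from several search engines and letting each engine's own ranking contribute to a combined score. The engines here are stand-ins returning canned results, and the scoring scheme (reciprocal rank summed across engines) is my own choice for illustration, not Zheng's actual formula.

```python
# Pool candidate documents from multiple engines, weighting by each engine's ranking.
from collections import defaultdict

def pool_candidates(query, engines):
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank   # higher-ranked results count for more
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical canned engines for demonstration.
engine_a = lambda q: ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
engine_b = lambda q: ["http://b.example/2", "http://d.example/4"]

print(pool_candidates("tallest mountain Scotland", [engine_a, engine_b]))
# b.example/2 comes first: it is ranked by both engines (1/2 + 1/1 = 1.5)
```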