How to answer a question: a simple system

Reference: How to answer a question: a simple system

My Master's thesis^1 was closely related to this topic, so I thought I would share a little anecdote.

Michael Nielsen:

As I describe in detail below, their approach was to take the question asked, to rewrite it in the form of a search engine query, or perhaps several queries, and then extract the answer by analysing the Google results for those queries.

While researching my thesis, I came across a paper by Sergey Brin (cofounder of Google) describing a system called DIPRE, which tried to do something similar using the Google index, back when Google was still a research project. My favorite quote from the paper was "the Google search engine and other research projects". Brin has been aware of the power of huge sets of redundant documents for a long time.

The system Nielsen describes is interesting and worth a look. I will just nitpick a little.

  1. Why do so much query rewriting? Google now does a lot of query augmentation itself. Also, with so many documents indexed, it is very likely that some page contains the question exactly as it was posed. Try asking Google a question and see whether you can find the answer on the results page.
  2. The system does not actually get rid of domain knowledge: it replaces part of the algorithm with a Google search, but extracting the data from the text still relies on a lot of domain knowledge about the English language. An implementation of the system would use a simple statistical model of answers to find the text to extract.
  3. The system of weighting queries is very hard to justify mathematically. A probabilistic system, such as Naive Bayes, would be much better. Naive Bayes is very simple and, modulo its naive assumption that the features are conditionally independent given the class, is mathematically correct.
  4. Be careful with questions like "Who shot JFK?" They are difficult for humans to answer, let alone computers.
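
To make point 3 a little more concrete, here is a minimal sketch of how Naive Bayes could stand in for hand-tuned query weights. It is only an illustration under my own assumptions, not the system Nielsen describes: each rewritten query is treated as a binary feature ("did this rewrite return the candidate answer?"), and the names `train`, `log_odds`, `exact_phrase` and `bag_of_words` are hypothetical. The training data is assumed to be a small set of candidates already labelled correct or incorrect.

```python
import math
from collections import defaultdict


def train(examples, rewrites, alpha=1.0):
    """Estimate a Bernoulli Naive Bayes model.

    examples: list of (fired, is_correct) pairs, where `fired` is the set of
              rewrite names whose search results contained the candidate.
    rewrites: every rewrite name the (hypothetical) system can generate.
    alpha:    Laplace smoothing constant, so no probability is exactly 0 or 1.
    """
    n = {True: 0, False: 0}
    hits = {True: defaultdict(int), False: defaultdict(int)}
    for fired, label in examples:
        n[label] += 1
        for r in fired:
            hits[label][r] += 1

    # Smoothed P(rewrite returns the candidate | candidate is correct/incorrect)
    p_fire = {
        label: {r: (hits[label][r] + alpha) / (n[label] + 2 * alpha)
                for r in rewrites}
        for label in (True, False)
    }
    total = n[True] + n[False]
    prior = {label: (n[label] + alpha) / (total + 2 * alpha)
             for label in (True, False)}
    return prior, p_fire


def log_odds(fired, prior, p_fire):
    """Log posterior odds that a candidate is correct.

    The naive assumption: rewrites fire independently given the label, so the
    per-rewrite log likelihood ratios simply add up.
    """
    score = math.log(prior[True]) - math.log(prior[False])
    for r, pt in p_fire[True].items():
        pf = p_fire[False][r]
        if r in fired:
            score += math.log(pt) - math.log(pf)
        else:
            score += math.log(1 - pt) - math.log(1 - pf)
    return score


# Toy, made-up training data: the exact-phrase rewrite only ever returned
# correct answers, while the bag-of-words rewrite returned both kinds.
REWRITES = ["exact_phrase", "bag_of_words"]
EXAMPLES = [
    ({"exact_phrase", "bag_of_words"}, True),
    ({"exact_phrase"}, True),
    ({"bag_of_words"}, True),
    ({"bag_of_words"}, False),
    ({"bag_of_words"}, False),
    (set(), False),
]

prior, p_fire = train(EXAMPLES, REWRITES)
for fired in ({"exact_phrase"}, {"bag_of_words"}):
    print(sorted(fired), round(log_odds(fired, prior, p_fire), 3))
```

The "weight" of each rewrite falls out of the data as a log likelihood ratio rather than being chosen by hand: on this toy data a candidate found only by the exact-phrase rewrite gets positive log odds, and one found only by the bag-of-words rewrite gets negative log odds.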

The introduction to my thesis is a pretty good introduction to the
techniques of information extraction in general.