Monday, April 26, 2010

Downloading Wikipedia

So I'm headed to Italy for vacation. Internet access is not assured, but I would like to have access to Wikipedia articles on things like "Michelangelo's David" and "The Roman Empire." What to do?

It turns out that Wikipedia makes all of its text not just available, but easy to download. Getting a browser for that text is a little harder, but there are some nice out-of-the-box wiki browsing apps for Windows.

Here are the parts you'd need to download and browse the full text (no pictures, unfortunately) of English Wikipedia.

1. Download the compressed 5.7 GB Wikipedia archive here: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

2. Get the WikiTaxi application here: http://www.wikitaxi.org/delphi/doku.php/products/wikitaxi/index

I haven't actually tried this since WikiTaxi is a Windows app and I'm running Linux, but I've seen several favorable reviews.

Alternatively, you could try Pocket Wikipedia, a much smaller text-and-pictures version that contains many of the most popular Wikipedia pages: http://www.free-soft.ro/pocket-wikipedia/pocket-wikipedia.html

More info on downloading from Wikipedia is available here: http://en.wikipedia.org/wiki/Wikipedia_database. This page is useful, but pretty technical.
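
For anyone not on Windows, the dump can also be read directly. Here is a rough Python sketch that streams the compressed XML and pulls out pages by exact title, without unpacking the whole archive. It assumes the dump follows the usual MediaWiki export layout, with each <page> holding a <title> and a <text> element. The helper names iter_pages and find_articles are made up for illustration, and this is not how WikiTaxi does it.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> in a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the page's children to keep memory use down


def find_articles(path, wanted_titles):
    """Scan a .xml.bz2 dump and return {title: raw wiki markup} for wanted titles."""
    wanted = set(wanted_titles)
    found = {}
    # bz2.open decompresses on the fly; path may be a filename or a file object.
    with bz2.open(path, "rb") as stream:
        for title, text in iter_pages(stream):
            if title in wanted:
                found[title] = text
                if len(found) == len(wanted):
                    break  # stop early once everything has been found
    return found
```

Something like find_articles("enwiki-latest-pages-articles.xml.bz2", ["Roman Empire"]) would be the intended use. Expect a single pass over the full dump to be slow, titles have to match exactly, and what comes back is raw wiki markup rather than rendered pages.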

Monday, April 19, 2010

Active learning

I've been cranking out a term paper on active learning in my natural language processing class. It's worth a mention here as well. Here's an excerpt.

Active learning is an approach to machine learning and artificial intelligence in which the computer guides selection of training examples. Instead of being a passive recipient of information provided by a programmer, the computer sends queries for the kinds of information that will be most useful and informative for the task at hand. In other words, asking questions for better information is part of the computer algorithm!
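
To make the idea concrete, here is a toy Python sketch of one common query strategy, uncertainty sampling, on a deliberately simple problem: learning a cutoff on the interval [0, 1]. Everything in it (the function learn_threshold, the pool of 101 points, the hidden cutoff of 0.615) is invented for illustration and is not taken from the term paper. The learner always asks about the unlabeled point it is least sure of, which here is the one nearest the middle of the range where the cutoff could still be.

```python
def learn_threshold(pool, ask_label, budget):
    """Estimate a 1-D decision threshold by querying the most uncertain points.

    pool:      unlabeled numbers in [0, 1]
    ask_label: function x -> bool, True when x is at or above the hidden threshold
               (stands in for an expensive human annotator)
    budget:    maximum number of label queries allowed
    """
    lo, hi = 0.0, 1.0  # the threshold is assumed to lie somewhere in this range
    queries = 0
    while queries < budget:
        # Only points strictly between lo and hi are still uncertain.
        candidates = [x for x in pool if lo < x < hi]
        if not candidates:
            break
        guess = (lo + hi) / 2
        # Uncertainty sampling: ask about the point closest to the current boundary.
        query = min(candidates, key=lambda x: abs(x - guess))
        queries += 1
        if ask_label(query):
            hi = query  # threshold is at or below this point
        else:
            lo = query  # threshold is above this point
    return (lo + hi) / 2, queries


pool = [i / 100 for i in range(101)]
estimate, used = learn_threshold(pool, lambda x: x >= 0.615, budget=20)
print(estimate, used)
```

A passive learner would need labels for most of the 101 points to pin the cutoff down this tightly. The active learner needs only a handful, because each answer roughly halves the region it is still unsure about.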


"Active learning" is also a hot topic in education research. Many of the ideas carry over surprisingly well (What types of questions are the most effective? How should demands on the teacher's and learner's time be balanced? What sorts of learning curves arise in practice?). Of course, in education research, the learners are young human students, not computers.

Here are two good reviews on active learning in machine learning (aka AI), and NLP:

(The figure above is from Settles' survey.)

Saturday, April 17, 2010

Computation in political science

I was talking to Mike B the other day, wondering how to best proselytize social scientists to the wonders of computational methods. We ended up brainstorming a list of topics for an Introductory Computation 501 class for grad students.

Here's what I remember from the list. What should we add?
  • Basic architecture of computers and the Internet
  • Enough Python to understand pseudocode (if, else, for, while...)
  • Big O notation
  • Regular expressions, probably in tandem with HTML
  • Spiders and crawlers
  • NLP text classification (probably Naive Bayes, latent semantic analysis, and SVM)
  • Database architecture (keys, tables, queries, etc.)
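
To give a flavor of the "regular expressions in tandem with HTML" item, here is the kind of small Python exercise such a class might start with. It is only a sketch: the pattern and the function name extract_links are made up for this example, and a regex like this is fragile on real-world HTML, which is itself a useful lesson before moving on to spiders and crawlers.

```python
import re

# Deliberately simple pattern: an <a ...> tag with a quoted href attribute.
# It ignores unquoted attributes, comments, and other messy real-world cases.
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the href values of all anchor tags found in an HTML string."""
    return LINK_RE.findall(html)


page = '<p>See <a href="http://example.org/a">A</a> and <A class="x" HREF=\'/b\'>B</A>.</p>'
links = extract_links(page)
print(links)
```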

Thursday, April 15, 2010

On insomnia and separating hyperplane classifiers...


The other day I woke up at three in the morning. Couldn't sleep. No good reason.

So I went to the living room and tried to work through a statistical problem that had been bugging me: how are support vector machines different from logistic regression? I know -- I'm sure it's been on your mind too.

Anyway, after much math and Googling, I discovered this paper, which clarified the whole thing. SVM and logistic regression are basically the same, except that they optimize slightly different loss functions. SVMs typically do better on smaller training sets, but logistic regression has better asymptotic properties.
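
One standard way to see how close the two are is to compare the per-example loss each one minimizes. The Python sketch below is the textbook formulation, not something taken from the paper. It assumes labels y in {-1, +1} and writes the margin as m = y * (w·x + b). A linear SVM uses the hinge loss and logistic regression uses the logistic (log) loss. Regularization terms are left out, and the function names are just for illustration.

```python
import math


def hinge_loss(margin):
    """SVM loss: zero once an example is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression loss: shrinks toward zero but never quite reaches it."""
    return math.log1p(math.exp(-margin))


# Compare the two losses across a range of margins.
for m in (-2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

Both losses penalize badly misclassified points roughly linearly. The difference shows up on the well-classified side: the hinge loss ignores points beyond the margin entirely, which is why an SVM's solution depends only on its support vectors, while the logistic loss keeps nudging every point a little.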

And now you know.

PS - This is all very handy if you are trying to train computers to read blogs for you.