• This page describes fairly old work. Just refer to it to get some ideas :)

Old work

I have been working with the Wikipedia database dumps for several years. Below is a list of my previous work. Regretfully, Wikipedia lost their old database dumps, so the source data is no longer available.

Have any questions?

Please read the FAQ page, which collects the actual troubleshooting procedures I have walked many people around the world through. Still have a question? Send me an email: pilhokim AT gatech DOT edu

Download Wikipedia database

Since my last update, I heard the bad news that Wikipedia lost all of their old database dumps.

So I decided to update this page to work with a more recent Wikipedia archive that is still available for download. Again, I am working with the English Wikipedia only. For other languages, use some creativity to adapt the work here to your needs.

As of this writing, the most recent complete dump is the 20110901 version.

Detailed information on downloading the Wikipedia database is available at Wikipedia database download. For efficient parallel downloads with reconnection support, I once used VisualWGet.

The Wikimedia team has now released Python code to help download the split 7z dumps. Yes, the big files are finally chopped into pieces. I usually do not download the entire data set, but this time I went with the complete one. The script also embeds MD5 checksum verification. One thing you have to modify in its code is to change

#projects = [['enwiki', '20110405']]

to

#projects = [['enwiki', '20110901']]

Be aware that the download script internally uses wget and md5sum, which you must have installed on your system and available on your PATH. Visit [1] to get both and set the path to the binaries.
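
If the script's built-in check ever fails (as it did for me in the log below), you can verify a downloaded file by hand. Here is a rough Python sketch of that check, assuming you also saved the dump's md5 listing next to the file; the file names are placeholders from this 20110901 run (the md5sums file name in particular is my assumption), so adjust them to whatever you actually downloaded.

 # Verify a downloaded dump against the dump's md5 listing by hand.
 # File names below are examples from the 20110901 run; adjust to your download.
 import hashlib
 import re
 
 DUMP_FILE = "enwiki-20110901-pages-meta-current.xml.bz2"
 MD5_LIST = "enwiki-20110901-md5sums.txt"   # assumed name of the checksum listing
 
 # Expected checksums: each line is a 32-hex-digit hash followed by a file name.
 with open(MD5_LIST) as f:
     pairs = re.findall(r"([a-f0-9]{32})\s+(\S+)", f.read())
 expected = dict((name, digest) for digest, name in pairs)
 
 # Actual checksum: hash in 1 MB chunks so the multi-GB file never sits in memory.
 md5 = hashlib.md5()
 with open(DUMP_FILE, "rb") as f:
     for chunk in iter(lambda: f.read(1 << 20), b""):
         md5.update(chunk)
 
 ok = expected.get(DUMP_FILE) == md5.hexdigest()
 print("%s: %s" % (DUMP_FILE, "OK" if ok else "MISMATCH"))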

Downloading started on Sep. 23rd at 3:25 PM. Let's see when it completes. It finished downloading the first file at 17:17:12 and then died.

C:\Users\pilhokim\Documents\Research\wikipedia\downloads>python wikipediadownloader.py
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\GnuWin32/etc/wgetrc
--2011-09-23 15:25:08--  http://download.wikimedia.org/ptwiki/20110910/ptwiki-20110910-pages-meta-history.xml.7z
Resolving download.wikimedia.org... 208.80.152.185
Connecting to download.wikimedia.org|208.80.152.185|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://dumps.wikimedia.org/ptwiki/20110910/ptwiki-20110910-pages-meta-
history.xml.7z [following]
--2011-09-23 15:25:09--  http://dumps.wikimedia.org/ptwiki/20110910/ptwiki-20110910-pages-meta-history.xml.7z
Resolving dumps.wikimedia.org... 208.80.152.185
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2331775864 (2.2G) [application/x-7z-compressed]
Saving to: `p/ptwiki/ptwiki-20110910-pages-meta-history.xml.7z'

100%[====================================>] 2,331,775,864  865K/s   in 1h 52m

2011-09-23 17:17:12 (339 KB/s) - `p/ptwiki/ptwiki-20110910-pages-meta-history.xml.7z' saved [2331775864/2331775864]

Traceback (most recent call last):
  File "wikipediadownloader.py", line 77, in <module>
    md51 = re.findall(r'(?P<md5>[a-f0-9]{32})\s+%s/%s' % (path, dumpfilename), raw)[0]
IndexError: list index out of range

So let's fall back to the original plan: VisualWGet.

Downloads started at 09:38:43 PM on Sep. 23rd and finished at 12:04:20 AM on Sep. 24th.

http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-flaggedrevs.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-flaggedpages.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-pages-logging.xml.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-pages-meta-current.xml.bz2
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-abstract.xml
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-all-titles-in-ns0.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-iwlinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-redirect.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-protected_titles.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-page_props.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-page_restrictions.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-page.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-category.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-user_groups.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-interwiki.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-langlinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-externallinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-templatelinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-imagelinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-categorylinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-pagelinks.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-oldimage.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-image.sql.gz
http://download.wikimedia.org/enwiki/20110901/enwiki-20110901-site_stats.sql.gz

The items below have not been updated yet.

Set up the server

To install MediaWiki, you need a web server stack that supports PHP and MySQL, such as Apache+PHP+MySQL. If you are not comfortable configuring all three to work together, English-speaking MS-Windows users may visit EasyPHP, which provides a pre-configured APM stack for free. For Korean users, I recommend APMSETUP.

  • Note! When you install MySQL, I would stick with MySQL 5.0.22, since it passed this test, while MySQL 5.1 crashed during the test; it did not handle the enormous Wikipedia text table properly.

Configure MySQL

Importing the Wikipedia database requires a large amount of memory for fast insert operations. A sample my.ini is below.

 [mysqld]
 
 basedir = "C:/APM_Setup/Server/MySQL5/"
 datadir = "C:/APM_Setup/Server/MySQLData"
 
 innodb_file_per_table
 
 connect_timeout = 10
 max_connections = 6000
 wait_timeout = 28800
 
 key_buffer = 1024M
 max_allowed_packet = 16M
 table_cache = 6000
 sort_buffer_size = 512M
 read_buffer_size = 64M
 myisam_sort_buffer_size = 512M
 thread_cache = 8
 query_cache_size= 16M
 bulk_insert_buffer_size=64M  
 
 thread_concurrency = 10
 max_sp_recursion_depth = 255
 tmp_table_size=512M
 max_heap_table_size=512M
  • Note! If you are importing a Wikipedia database other than English, you should check the MySQL settings for the character set of that language. You may need to set the default character set of the Wikipedia database to UTF-8 or similar. This experiment, however, only covers the English version, and all MySQL configurations are set to latin1 (ISO 8859-1 West European).

Install MediaWiki

MediaWiki, though the name is easily confused with Wikipedia, is the PHP-based back-end engine that runs Wikipedia. MediaWiki has several incompatible changes between major releases, such as different table definitions and relations. Hence, when you download a Wikipedia database dump, you should use the MediaWiki release published around that time, to be safe.

  • Download the MediaWiki engine from MediaWiki. Check for a release date that matches the Wikipedia database you downloaded, and find and read the MediaWiki installation instructions. For this experiment, I used MediaWiki 1.15.1.
  • Unpack it under the HTML document root directory. The document root (e.g. http, html, or htdocs) depends on your web server configuration.
  • Do not run config/index.php at this time. You will configure Wikipedia at the end of this process and scream Ta-da. If you run the config at this step, it will put initial setup data into the database, which will conflict with the Wikipedia data that we will import from the dump.
  • Copy tables.sql from wikipedia/maintenance/tables.sql to your mysql/bin directory, or add the MySQL bin path to your PATH.
  • Run the mysql client.
  • Run the following SQL commands:
    1. CREATE DATABASE enwiki_20080724; -- you may change the database name as you wish
    2. USE enwiki_20080724;
    3. SOURCE tables.sql; -- if you changed the file name above, use that name here

Convert the Wikipedia database into the SQL dump

There is only one big compressed XML file to convert into a SQL dump. The XML file name may vary but looks similar to enwiki-date-pages-articles.xml.bz2. There is a tool to convert this XML dump into MySQL data files: use Xml2sql. After unpacking the package, run xml2sql-fe.exe, set the output format to mysqlimport, and apply it to the big XML file. You will get 3 files: page.txt, revision.txt, and text.txt. This process takes about 1.5 to 2 hours.

  • Note! You do not need to unzip that big file; Xml2sql handles the compressed file by itself. Do not even try to unzip it, which would consume over a hundred GB of your hard disk. Use a pipelined process instead. If you are using the console version of Xml2sql, the command would be something like bunzip2 -c pages-meta-current.xml.bz2 | xml2sql. The GUI version of Xml2sql handles the compressed file natively. (A Python version of the pipeline is sketched below.)
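
For the console route, the pipelined decompression can also be scripted. Below is a minimal Python sketch, assuming the console xml2sql binary is on your PATH and reads the XML from standard input as in the command above; add whatever output-format options the Xml2sql documentation describes for mysqlimport output.

 # Stream-decompress the .bz2 dump and pipe it into xml2sql without an intermediate file.
 import bz2
 import subprocess
 
 DUMP = "enwiki-20110901-pages-meta-current.xml.bz2"  # adjust to your dump file
 
 # xml2sql reads the XML from stdin when used in a pipeline, as in the bunzip2 example above.
 proc = subprocess.Popen(["xml2sql"], stdin=subprocess.PIPE)
 
 with bz2.BZ2File(DUMP, "rb") as f:
     while True:
         chunk = f.read(1 << 20)  # 1 MB at a time keeps memory use low
         if not chunk:
             break
         proc.stdin.write(chunk)
 
 proc.stdin.close()
 proc.wait()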

Process to import local data dump files into the database

To improve the speed of importing a very large data set, please read the MySQL insert speed article. Note that myisamchk only works on MyISAM tables, as its name suggests.

  • Disable indexes: myisamchk --keys-used=0 -rq ..\data\enwiki_20080724\page
  • Change to MySQL shell
  • Use database: mysql> use enwiki_20080724;
  • Flush tables: mysql> FLUSH TABLES;
  • Raise the MyISAM table size limit. This is needed when a single MyISAM data file grows beyond 4 GB: mysql> ALTER TABLE page MAX_ROWS=1000000000;
  • Load local data file: mysql> LOAD DATA LOCAL INFILE 'page.txt' INTO TABLE page;
  • Rebuild the indexes: read myisamchk and myisamchk memory usage to configure the parameters, and stop the MySQL server before running: myisamchk --fast --force --update-state --key_buffer_size=1024M --sort_buffer_size=1024M --read_buffer_size=64M --write_buffer_size=64M -rq ..\data\enwiki_20080724\page

Repeat the above steps for revision.txt and text.txt. A small automation sketch follows.
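
Since the three big tables go through exactly the same commands, here is a minimal Python sketch that loops the MySQL side of the procedure, assuming the mysql client is on your PATH with local infiles enabled and the .txt files from Xml2sql sit in the current directory; run the myisamchk disable/rebuild steps by hand around it, as described above.

 # Repeat the FLUSH / ALTER / LOAD steps for the three big tables (page, revision, text).
 # The myisamchk index steps stay manual: rebuild indexes with the MySQL server stopped.
 import subprocess
 
 DB = "enwiki_20080724"              # the database created earlier; change to match yours
 TABLES = ["page", "revision", "text"]
 
 for table in TABLES:
     sql = ("FLUSH TABLES; "
            "ALTER TABLE %s MAX_ROWS=1000000000; "
            "LOAD DATA LOCAL INFILE '%s.txt' INTO TABLE %s;" % (table, table, table))
     # --local-infile=1 is required for LOAD DATA LOCAL; -p prompts for the password.
     subprocess.call(["mysql", "--local-infile=1", "-u", "root", "-p", DB, "-e", sql])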

Process to import other data SQL files

Besides that big zipped XML file, you have tens of other files that fill in major portions of Wikipedia. Ironically, this step takes the most time when importing the Wikipedia database. This is because these other source files consist of millions of INSERT statements, which means you have to SOURCE each data file and execute millions of INSERT commands. On a personal computer this will take a few days (or less than a day on a supercomputer). I endured that last year: I started the process before going to sleep, and it completed around the end of the next day. This year Wikipedia has grown so much that I could not put up with this slow process. So here is a magic bullet to make it much faster.

The idea is to convert the SQL source files to raw-data text files, just like we did above for the XML case, where we prepared the raw text data using Xml2sql. Here I wrote a simple utility that I named Insert2Txt. For details, please download File:Insert2txt.zip and take a look at the Readme.txt file. The zip includes the complete project and source files plus a compiled binary for Windows XP and above.

The syntax to run Insert2Txt could not be simpler. Check the example below and just run it. If you omit the 2nd argument, Insert2Txt dumps the output to the screen. (A rough Python sketch of the same conversion idea follows the command list below.)

  • Extract data sets from Wikipedia INSERT sources.

This whole process should take you less than 30 minutes on a decent computer, or over a day on a 386SX -- I guess.

 insert2txt enwiki-20080724-category.sql enwiki-20080724-category.txt
 insert2txt enwiki-20080724-categorylinks.sql enwiki-20080724-categorylinks.txt
 insert2txt enwiki-20080724-externallinks.sql enwiki-20080724-externallinks.txt
 insert2txt enwiki-20080724-image.sql enwiki-20080724-image.txt
 insert2txt enwiki-20080724-imagelinks.sql enwiki-20080724-imagelinks.txt
 insert2txt enwiki-20080724-interwiki.sql enwiki-20080724-interwiki.txt
 insert2txt enwiki-20080724-langlinks.sql enwiki-20080724-langlinks.txt
 insert2txt enwiki-20080724-logging.sql enwiki-20080724-logging.txt
 insert2txt enwiki-20080724-oldimage.sql enwiki-20080724-oldimage.txt
 insert2txt enwiki-20080724-pagelinks.sql enwiki-20080724-pagelinks.txt
 insert2txt enwiki-20080724-page_props.sql enwiki-20080724-page_props.txt
 insert2txt enwiki-20080724-page_restrictions.sql enwiki-20080724-page_restrictions.txt
 insert2txt enwiki-20080724-protected_titles.sql enwiki-20080724-protected_titles.txt
 insert2txt enwiki-20080724-redirect.sql enwiki-20080724-redirect.txt
 insert2txt enwiki-20080724-site_stats.sql enwiki-20080724-site_stats.txt
 insert2txt enwiki-20080724-templatelinks.sql enwiki-20080724-templatelinks.txt
 insert2txt enwiki-20080724-user_groups.sql enwiki-20080724-user_groups.txt
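
If you cannot run the Windows binary, below is a rough Python sketch of the same conversion idea; it is not the actual Insert2Txt source. It pulls the value tuples out of each INSERT statement and writes them one per line with \r\n endings, which is the format the LOAD DATA statements in the next step expect. The split on "),(" is naive and will break on values that happen to contain that sequence, so treat it only as a starting point.

 # Rough Python equivalent of the Insert2Txt idea (not the actual tool's source).
 # Usage: python insert2txt.py enwiki-20080724-category.sql enwiki-20080724-category.txt
 import sys
 
 def convert(src_path, dst_path):
     with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
         for line in fin:
             line = line.strip()
             if not line.startswith(b"INSERT INTO"):
                 continue
             # Keep only the value list between "VALUES (" and the trailing ");".
             body = line.split(b"VALUES (", 1)[1].rstrip(b";").rstrip(b")")
             # Naive split: breaks if a quoted value itself contains "),(".
             for row in body.split(b"),("):
                 fout.write(row + b"\r\n")
 
 if __name__ == "__main__":
     convert(sys.argv[1], sys.argv[2])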
  • Import the local files.

The output format of Insert2Txt is exactly what MySQL can import as a local infile. The output files end every line with "\r\n", so Linux or UNIX users may run into problems when importing the data into the MySQL server (thanks to Palmer for the reports). Open a console (cmd.exe) where you put the output files and run mysql to use the enwiki database. Then run the statements below to import the local files; this will complete your Wikipedia database. Importing local files into tables with unique keys can slow down significantly, since MySQL cannot use a bulk sort to build the keys. Here is a workaround you may find very useful (a small generator sketch follows the statements).

LOAD DATA LOCAL INFILE "enwiki-20080724-category.txt" INTO TABLE category FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-categorylinks.txt" INTO TABLE categorylinks FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-externallinks.txt" INTO TABLE externallinks FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-image.txt" INTO TABLE image FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-imagelinks.txt" INTO TABLE imagelinks FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-interwiki.txt" INTO TABLE interwiki FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-langlinks.txt" INTO TABLE langlinks FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-logging.txt" INTO TABLE logging FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-oldimage.txt" INTO TABLE oldimage FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-pagelinks.txt" INTO TABLE pagelinks FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-page_props.txt" INTO TABLE page_props FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-page_restrictions.txt" INTO TABLE page_restrictions FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-protected_titles.txt" INTO TABLE protected_titles FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-redirect.txt" INTO TABLE redirect FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-site_stats.txt" INTO TABLE site_stats FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-templatelinks.txt" INTO TABLE templatelinks FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
LOAD DATA LOCAL INFILE "enwiki-20080724-user_groups.txt" INTO TABLE user_groups FIELDS TERMINATED BY "," ENCLOSED BY "'" LINES TERMINATED BY "\r\n";
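
The statements above differ only in the file and table names, so here is a small Python sketch (my own convenience helper, not part of Insert2Txt) that prints the whole batch; redirect its output into a .sql file and SOURCE it, or paste it into the mysql client.

 # Generate the repetitive LOAD DATA statements for every converted table.
 TABLES = ["category", "categorylinks", "externallinks", "image", "imagelinks",
           "interwiki", "langlinks", "logging", "oldimage", "pagelinks",
           "page_props", "page_restrictions", "protected_titles", "redirect",
           "site_stats", "templatelinks", "user_groups"]
 DUMP_DATE = "20080724"
 
 for table in TABLES:
     print('LOAD DATA LOCAL INFILE "enwiki-%s-%s.txt" INTO TABLE %s '
           'FIELDS TERMINATED BY "," ENCLOSED BY "\'" LINES TERMINATED BY "\\r\\n";'
           % (DUMP_DATE, table, table))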

Configure Wikipedia

Open a web browser and run config/index.php under the Wikipedia directory.

  • Note! The screen you will see will not look like the current Wikipedia. The page will complain about many, many missing templates and plug-ins that are not included in the public Wikipedia database download. To mimic the current Wikipedia page exactly, you would need to install and configure numerous templates and plug-ins, and of course bear with a long debugging process. That is not what I intended to do. The purpose of this experiment is to import Wikipedia into a local (i.e. my) database and search the content in the way I want, not to reproduce the Wikipedia USER INTERFACE. If you really want to go through this, check the two links below.
    1. [2] to see what extensions Wikipedia is currently using.
    2. [3] to see what templates are in there.

Query information

After successfully building the database, you need to know the database layout in order to query the data. Look at Database Layout. Here is one example that selects the latest text of an article in the main namespace with a given page title (a Python version follows the query):

SELECT 
   old_text
FROM 
   page
   INNER JOIN revision ON page_latest = rev_id 
   INNER JOIN text ON rev_text_id = old_id
WHERE    
   page_title = 'Your page title' 
      AND
   page_namespace=0;
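
If you would rather run the query from a script, here is a minimal Python sketch using the MySQLdb module (my choice of client library, not something the setup above requires); adjust the credentials, database name, and page title to your own setup.

 # Fetch the latest wikitext of one article from the locally imported database.
 import MySQLdb
 
 conn = MySQLdb.connect(host="localhost", user="root", passwd="your_password",
                        db="enwiki_20080724")
 cur = conn.cursor()
 cur.execute("""
     SELECT old_text
     FROM page
     INNER JOIN revision ON page_latest = rev_id
     INNER JOIN text ON rev_text_id = old_id
     WHERE page_title = %s AND page_namespace = 0
 """, ("Atlanta",))
 row = cur.fetchone()
 if row:
     print(row[0])   # raw wikitext of the article's latest revision
 conn.close()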

Now it is your turn to figure out how to play with your locally imported Wikipedia database. Enjoy!

FAQ

Many people around the world have sent me emails on this topic. All of them succeeded in importing the Wikipedia database into their local databases.

Though I still do not feel the need to run a forum/BBS for this article, it seemed like a good idea to have a separate FAQ page.

The FAQ lists the actual troubles I have helped with. So please be sure to visit the FAQ page and check the Q&A when you run into trouble. If you cannot find anything relevant, send me an email.