EChronicle:Wikipedia database FAQ
From Pilho Kim
just wanted to ask: how long do these operations take (and on which hardware)? I want to know when it's taking way too long...
Such time factors depend heavily on your system configuration. For your reference, it took me less than one day with a computer having 2.4GHz Intel P4 CPU, 1G memory, 250GB 5400rpm hard disk under Windows XP SP2, and MySQL MyISAM configurations. Since the job can not be done in a fully automated way, I had to check the progress time to time and sure spent some times to bite something to eat. Anyway the full process took me less than one day. I mean my case. Good luck.
Dear Mr kim,
I am trying to set up my wikipedia website using Mediawiki and having read your article I wanted to find out if you can set up my website for a fee. I have some basic web coding knowledge but not enough to set up the site even with your instructions. If you have the time and can let me know what your terms are I would be most grateful.
I intend to get a shared host with unlimited capacities, etc, such as with hostmonster.com.
I look forward to hearing from you soon.
What I descried at my Wiki is about importing the database to your local MySQL database, not about visual interface part.
I did not handle the topic on installing plug-ins and templates that will make the local Wikipedia just like Wikipedia.org. I am saying that even if you successfully imported Wikipedia databases by following my instructions, what you will have is the naive MySQL database filled up with Wikipedia data, not the Web site that behaves and represents Wiki texts just like Wikipedia.org.
To accomplish what you are thinking, we have to know Wikipedia.org server and their MediaWiki configurations to figure out what kinds of templates and plug-ins are installed at their system. And this is not available on the Web as far as I know.
Importing character set problem
Hi Pilho, I came across your webpage describing how to import a wikipedia dump into MySQL on Windows. Thank you for putting together such great information. I am trying to import the 10/8/08 dump right now and have some questions for you.
When I import the enwiki-20081008-pagelinks.sql file (for both the original .sql file or from the output of your insert2txt program), I get this error:
ERROR 1366 (HY000): Incorrect string value: '\xBD_char...' for column 'pl_title'
I found the offending line, which has:
I think the "½" character is causing the problem. I had created the database as utf8, which I think is correct.
There are actually many occurrences of this type of error when I import the dump. I was wondering if you came across this problem? I am using Vista64, Mysql 5.1.30, and the command prompt (both the Windows command shell as well as cygwin).
That looks like some MySQL field definition problem which I am not that familiar with. I would recommend you to check the pagelink table and the pl_title field definition whether their char set and field char set are both set to be utf8. If this does not work, then consult to MySQL forum or DB expert.
Thanks for the info. Did you not encounter the same problem when you imported the Wikipedia dump into your MySql database on Windows?
I am not quite sure whether I experienced that problem. The MySQL error message complains about incorrect string values and I also agree with your thought that the code page could be the source of problems. In my case, ALL tables are set to have code page LATIN1 (not utf8) since I only handle English Wikipedia. I mentioned this at my article and you should check it again.
Based on my work, that data "582, 0, Ranma_blahblah" was successfully imported in. So let me give you the pagelink table definition DDL. If you think that you want to do the same process to import English Wikipedia, then I would recommend to update or change the table definition like below.
CREATE TABLE `pagelinks` ( `pl_from` int(10) unsigned NOT NULL DEFAULT '0', `pl_namespace` int(11) NOT NULL DEFAULT '0', `pl_title` varchar(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL DEFAULT '', UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`), KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 MAX_ROWS=1000000000;
Wish this works for you.
Solutions by anonymous
I found the problem. There is a character set mismatch between the mysql server and mysql client. See:
If I enter "SET NAMES latin1;" first, and then do all the INSERT... statements, then everything works fine and the weird characters like in 'Ranma_½_characters' are retained, which is what I want. I used the table definition of pagelinks from the Wikipedia website, which keeps everything in UTF8. http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup
By the way, do you know what the "namespace" field means in the page and pagelinks tables? I found the same page title (e.g. 'Sarrasine') with two entries in the page table, differentiated by namespace (0 and 1).
Yes. I do know. the "namespace" represents the purpose of the page. If you want to perform full-text search on main articles, then select the value "0".
For other values, see http://www.mediawiki.org/wiki/Help:Namespaces
Wikipedia database importing takes too much time
What did you do exactly to load the enwiki-20080724-pagelinks.txt file (and all the other data) in less than a day?
My pagelinks.txt is 8GB and it takes forever to import on my well equipped machine. I've tried tweaking Mysql parameters, tried disabling keys on the table, tried loading smaller 1GB chunks at a time, and still I'm quitting after a day or so. Did you load the entire table in one go, with indexes and all?
If you have read my old article EChronicle::Importing Wikipedia February 6th 2007, then the answer is YES. That may take a day or more.
If you have read my recent article at EChronicle:Wikipedia database, then the answer is NO. It should take less then 30 minutes.
In the new article, I wrote the simple C++ program to accelerate the importing process. So let me first check whether we are on the same page. And if you still experience the trouble, then ask me back with more details.
BTW, you should set all table engines to MyISAM.
Yes, sorry I wasn't clear on which of your pages I was following. I was reading the latest one: I converted the SQL files to lists of comma-separated values (though I wrote a Java program to do it because I didn't have a Windows-pc handy for your C++ program). Basically it looks like this:
0,0,'2008' 0,0,'Adam_Karrillon' 0,0,'Ambitionz_of_a_ridah'
And so I am doing a LOAD LOCAL DATA INFILE of this file into a pagelinks table I have created using the MyISAM engine.
Am I missing something?
If you already have pagelinks.txt file, then the problem is equal to the case of page, revision or text.txt case. I guess I had no problem importing that file but if your process hangs on, then apply the same techniques used for page.txt. Since both are same big files and include millions records. So modify below steps for your pagelinks case.
- Disable indexes: myisamchk --keys-used=0 -rq ..\data\enwiki_20080724\pagelinks
- Change to MySQL shell
- Use database: mysql> use enwiki_20080724;
- Flush tables: mysql> FLUSH TABLES;
- Change the MyISAM file size limit. Check to change the limitation. This problem happens when a single MyISAM data file size bigger than 4GB: mysql> ALTER TABLE pagelinks MAX_ROWS=1000000000;
- Load local data file: mysql> LOAD DATA LOCAL INFILE "enwiki-20080724-pagelinks.txt" INTO TABLE pagelinks FIELDS TERMINATED BY "," ENCLOSED BY "'";
- Reconstruct indexes: Read myisamchk and myisamchk memory usage to configure the parameters. Stop the mysql server before running: myisamchk --fast --force --update-state --key_buffer_size=1024M --sort_buffer_size=1024M --read_buffer_size=64M --write_buffer_size=64M -rq ..\data\enwiki_20080724\pagelinks
If you still suffers the problem, then checks the original data text file whether it is formatted correctly. Or kill the mysqld server process when it hangs, then restart the server and check the last inserted data, then you will see where MySQL complains. Also checks the MySQL error log file.
If this does not work, there are many experts on MySQL out there including MySQL forum. They may help you better. :)
Sorry to bother you again, but I have to ask: I'm using MySQL 5.1.28, and I just read your warning. However, I had no trouble loading the text table, that you seemed to have trouble with, but maybe you could tell me if you got so far as to try loading the pagelinks table on 5.1.28?
Right now, the table is loading around 50MB/hour, which is unreasonably slow...
Yes. I experienced some problems with MySQL 5.1 last year. But I have not tested the procedure for the latest 5.1 version. Anyway setting up MySQL 5.0 won't take you over 30 minutes. So why not try if you do not need 5.1 specific features? :)
Such a slow importing rate obviously does not make a sense. Importing 50MB should take less than 1 second if :
(1) the target table is MyISAM,
(2) you turned off the index using the comand:
myisamchk --keys-used=0 -rq ..\data\enwiki_20080724\pagelinks
(3) your source data file format conforms to the table schema
(4) your table files are not corrupted.
Problems you are experiencing are mostly related with MySQL. I recommend you to read http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html . For expert's opinions, MySQL forum (http://forums.mysql.com/) would be the place to visit. I got help from there several times.
FYI, I just realized the similar case that a guy from Microsoft once questioned to me. He also suffered in importing the pagelinks table.
I am not so sure whether you both are suffering the same problem. However, he also used the recent MySQL server and Wikipedia database. And he succeeded in importing Wikipedia. So see the FAQ section and check the solution he found for himself at EChronicle:Wikipedia database FAQ#Importing character set problem
Thank you, you have been an excellent help!
That last tip was what did it, I think: I dropped the table and used the create statement in the answer you linked to. It's either the "COLLATE latin1_bin" part (default was latin1_swedish_ci), or the MAX_ROWS (less likely, because I think that is for older MySQL versions).
Anyway, I started the LOAD before going to work and when I got home, the client was still busy on the LOAD statement. However, the table files looked a decent size, so I interrupted the client, then rebuilt the index, which took a few hours, I believe. Now, I'm running a custom script for building a link summary table (that I need for my project), which uses the pagelinks, and it runs smoothly!
Glad to hear your success.
Some time later when you want to share your works on Wikipedia and write on the web, please share the link with me. I am getting more emails on Wikipedia things from everywhere. And your tips will help many out there.
Ah.. if that LOAD thing works slow, you may use an external mysqlimport command next time. That's much faster. MAX_ROWS limit exists in all MySQL version though it depends on your operating system file size limit.