Wikipedia is a large database: the database tables alone occupy over 56 GB, and the media files add more than 70 GB. Wikipedia text is written in Wiki syntax, so its contents must be converted to raw text before they can be used for information processing. This section illustrates the steps for converting Wiki text into raw text.

The regular expressions below were tested with the Tropicsoft regular expression library.

Regular expressions in C++ format (most recent)

m_sWikiRegExps[WIKI_ELEMENT_REDIRECT] = _T("^\\#(REDIRECT|redirect).?\\[{2}(.+)\\]{2}");
m_sWikiRegExps[WIKI_ELEMENT_GALLERY] = _T("<(gallery|GALLERY)\\b[^>]*>(.*?)</\\1\\b.*?>");
m_sWikiRegExps[WIKI_ELEMENT_HEADING] = _T("^[=]+([^=]+)?[=]+");
m_sWikiRegExps[WIKI_ELEMENT_PARAGRAPH] = _T("^([^$]+?)(?:\\s+)?(?:(?:\\r)?\\n){2,}");
m_sWikiRegExps[WIKI_ELEMENT_EXTENSION] = _T("^\\{{2}([a-zA-Z]{1,100})(.+)^\\}{2}");
m_sWikiRegExps[WIKI_ELEMENT_LINKS] = _T("[\\[]{2}([^\\|\\]]+)\\|?([^\\]]+)?[\\]]{2}");
m_sWikiRegExps[WIKI_ELEMENT_CATEGORY] = _T("[\\[]{2}Category[:]([^\\|\\]]+)\\|?([^\\]]+)?[\\]]{2}");
m_sWikiRegExps[WIKI_ELEMENT_FOREIGN] = _T("[\\[]{2}([a-z]{2})[:]([^\\]]*)[\\]]{2}");
m_sWikiRegExps[WIKI_ELEMENT_INTERWIKI] = _T("[\\[]{2}(commons|mediazilla|meta|mw|wikibooks|wikimedia|wikinews|wikiquote|wikisource|wikispecies|wiktionary):(.*?)[\\]]{2}");
m_sWikiRegExps[WIKI_ELEMENT_TEMPLATE] = _T("[\\{]{2}([^\\}{]{1,})[\\}]{2}");
m_sWikiRegExps[WIKI_ELEMENT_EXTERNAL] = _T("[\\[]{1}((https?|ftp|mailto|file|HTTPS?|FTP|MAILTO|FILE)://([^\\|\\]]+))\\|?([^\\]]+)?]{1}");
m_sWikiRegExps[WIKI_ELEMENT_BOLDITALIC] = _T("[\\']{2,5}([^']+)[\\']{2,5}");
m_sWikiRegExps[WIKI_ELEMENT_HTMLCODE] = _T("<(center|b|i|p|br|hr|tt|pre|nowiki|math|strike|u|table|caption|tr|td|th|li|ul|ol|dl|dd|dt|div|h1|h2|h3|h4|h5|h6|h7|h8|h9|small|blockquote)\\s?>(.*?)</(\\1)\\s?>");
m_sWikiRegExps[WIKI_ELEMENT_HTMLLINEFEED] = _T("(<br\\s?/>)");
m_sWikiRegExps[WIKI_ELEMENT_SPECIALHTML] = _T("&mdash;");
m_sWikiRegExps[WIKI_ELEMENT_LISTDEF] = _T("^((\\*|#|:|;){1,10}).?([^\\n]+)$");
m_sWikiRegExps[WIKI_ELEMENT_COMMAND] = _T("<(nowiki|pre|blockquote)\\s?>(.*)</(\\1)\\s?>");
m_sWikiRegExps[WIKI_ELEMENT_REMOVEALL] = _T("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>(.*?)</\\1>");
m_sWikiRegExps[WIKI_ELEMENT_HASTEXT] = _T("(\\w+)");
m_sWikiRegExps[WIKI_ELEMENT_DATEYEAR1] = _T("((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\.\\s{1}(?:0[1-9]|[12][0-9]|3[01])\\,?\\s{1}(?:19|20)\\d{2})"); // Feb. 21, 2001
m_sWikiRegExps[WIKI_ELEMENT_DATEYEAR2] = _T("((19|20)\\d\\d([- /.])(0[1-9]|1[012])\\3(0[1-9]|[12][0-9]|3[01]))"); // 1999-01-20
m_sWikiRegExps[WIKI_ELEMENT_DATEYEAR3] = _T("((0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01])\\3(19|20)\\d\\d)"); // 01-20-1999
m_sWikiRegExps[WIKI_ELEMENT_DATEYEAR4] = _T("((0[1-9]|[12][0-9]|3[01])([- /.])(0[1-9]|1[012])\\3(19|20)\\d\\d)"); // 20-01-1999
m_sWikiRegExps[WIKI_ELEMENT_DATETIME1] = _T("(?:(?:[0-1]?\\d)|(?:2[0-3])):[0-5]\\d(?::[0-5]\\d)?\\s?([APap][.]?[Mm][.]?)"); // 10:30 PM, 10:30:15 p.m.
m_sWikiRegExps[WIKI_ELEMENT_SENTENCE] = _T("(?:^|\\s|\\/)?([A-Z0-9\\'\\\"\\(\\{].+?[\\.\\?\\!]{1}(?:\\]|\\\"|\\'|\\}|^\\s)?)(?:$)?");
m_sWikiRegExps[WIKI_ELEMENT_REMOVEDUPSPACE] = _T("([ ]+)");
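
As a rough illustration of how such patterns might be chained, the following sketch applies a handful of them in a fixed order. It is not the original implementation: it assumes std::regex (ECMAScript grammar) in place of the Tropicsoft library and narrow std::string in place of _T strings, the function name StripInlineMarkup is hypothetical, and the link pattern is split into two passes (piped and plain) so that a simple replacement string can keep the visible label.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original code): applies a few of the
// patterns above in sequence. std::regex stands in for the Tropicsoft library.
std::string StripInlineMarkup(std::string text) {
    // Templates without nested braces: {{...}} is removed entirely.
    static const std::regex kTemplate(R"(\{\{[^{}]+\}\})");
    // Piped internal links: [[target|label]] keeps only the label.
    static const std::regex kPipedLink(R"(\[\[[^|\]]+\|([^\]]+)\]\])");
    // Plain internal links: [[target]] keeps the target text.
    static const std::regex kPlainLink(R"(\[\[([^|\]]+)\]\])");
    // Runs of 2 to 5 apostrophes around text (italic, bold, bold italic).
    static const std::regex kBoldItalic(R"('{2,5}([^']+)'{2,5})");
    // HTML line feeds such as <br/> or <br />.
    static const std::regex kLineFeed(R"(<br\s?/>)");
    // Runs of spaces.
    static const std::regex kDupSpace(R"([ ]+)");

    text = std::regex_replace(text, kTemplate, "");
    text = std::regex_replace(text, kPipedLink, "$1");
    text = std::regex_replace(text, kPlainLink, "$1");
    text = std::regex_replace(text, kBoldItalic, "$1");
    text = std::regex_replace(text, kLineFeed, "\n");
    text = std::regex_replace(text, kDupSpace, " ");
    return text;
}
```

Under these assumptions, StripInlineMarkup("'''Paris''' is the [[capital city|capital]] of [[France]].") would return "Paris is the capital of France.". Because the template pattern excludes nested braces, nested templates would need the template pass repeated until the text stops changing.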

Regular expressions in general format to strip off Wiki syntax

  • Redirect
^\#(REDIRECT|redirect).?\[{2}(.+)\]{2}
  • Gallery
<(gallery|GALLERY)\b[^>]*>(.*?)</\1\b.*?>
  • Headings
^[=]+[^=]*?[=]+
  • Paragraphs in the text under a heading
^([^\n]+)$
  • MediaWiki extensions
^\{{2}([a-zA-Z]{1,100})(.+)^\}{2}
  • Links
[\[]{2}([^\]]{1,})[\]]{2}
    • Categories
[\[]{2}Category[:]([^\]]*)[\]]{2}
    • Foreign languages
[\[]{2}[a-z]{2}[:]([^\]]*)[\]]{2}
    • Interwiki links
[\[]{2}(commons|mediazilla|meta|mw|wikibooks|wikimedia|wikinews|wikiquote|wikisource|wikispecies|wiktionary):(.*?)[\]]{2}
  • Templates
[\{]{2}([^\}{]{1,})[\}]{2}
  • External links
[\[]{1}(https?|ftp|mailto|file|HTTPS?|FTP|MAILTO|FILE):([^\]]{1,})[\]]{1}
  • Italic
[']{2}([^\n']{1,})[']{2}
  • Bold
[']{3}([^\n']{1,})[']{3}
  • Bold and italic
[']{5}([^\n']{1,})[']{5}
  • Remove HTML codes
<(center|b|i|p|br|hr|tt|pre|nowiki|math|strike|u|table|caption|tr|td|th|li|ul|ol|dl|dd|dt|div|h1|h2|h3|h4|h5|h6|h7|h8|h9|small|blockquote)\s?>(.*?)</(\1)\s?>
  • Replace HTML line feeds
(<br\s?/>)
  • Other special characters
&mdash;
  • Lists and definitions
^((\*|#|:|;){1,10}).?([^\n]+)$
  • Wiki commands
<(gallery|nowiki|pre|blockquote)\s?>(.*)</(\1)\s?>
  • Final code removal
<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>(.*?)</\1>
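
The line-oriented patterns in this list (headings, lists and definitions) can be tried out one line at a time, which avoids depending on how a particular engine treats ^ and $ in multi-line input. The sketch below is only an illustration under stated assumptions: it uses std::regex rather than the Tropicsoft library, the names TrimBlanks and StripLineMarkup are hypothetical, the caller is assumed to have split the text on line breaks already, and the list pattern uses \s? where the original uses .? so that the first letter of an item written without a space (such as "#item") is kept.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes leading and trailing spaces and tabs.
static std::string TrimBlanks(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: strips heading or list markup from ONE line of Wiki
// text. The whole line must match, so no ^ or $ anchors are needed.
std::string StripLineMarkup(const std::string& line) {
    // Heading such as "== Title ==" (adapted from ^[=]+[^=]*?[=]+).
    static const std::regex kHeading(R"(=+([^=]+)=+\s*)");
    // List or definition item: 1 to 10 of the markers * # : ; then the text.
    static const std::regex kListItem(R"([*#:;]{1,10}\s?(.+))");

    std::smatch m;
    if (std::regex_match(line, m, kHeading)) return TrimBlanks(m[1].str());
    if (std::regex_match(line, m, kListItem)) return TrimBlanks(m[1].str());
    return line;  // plain text line: returned unchanged
}
```

With these assumptions, StripLineMarkup("== History ==") yields "History" and StripLineMarkup("** Second level item") yields "Second level item", while a line without leading markup is passed through untouched.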

Paragraphs to sentences

After stripping the Wiki syntax, split each paragraph into sentences. As is well known, extracting sentences from a paragraph is an NLP (Natural Language Processing) task, not a simple pattern-matching job. Hence, the regular expressions below are suitable only for tests or quick tricks.

  • Sentences in the paragraph
[.?!][\]\"')}]*($|\t|\s)[\n]*
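
To show how the boundary pattern above might be used, here is a minimal sketch that cuts a paragraph at every match. It assumes std::regex instead of the Tropicsoft library, and the names TrimWhitespace and SplitSentences are hypothetical. The pattern is the one above with the capturing group made non-capturing. As the text warns, this is only a trick: an abbreviation such as "Dr. Smith" would still be split in two.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: removes leading and trailing whitespace.
static std::string TrimWhitespace(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    return s.substr(first, s.find_last_not_of(ws) - first + 1);
}

// Hypothetical helper: splits a paragraph at sentence boundaries, where a
// boundary is ., ? or ! plus optional closing quotes or brackets, followed
// by whitespace or the end of the text.
std::vector<std::string> SplitSentences(const std::string& paragraph) {
    static const std::regex kBoundary(R"re([.?!][\]"')}]*(?:$|\s)\n*)re");

    std::vector<std::string> sentences;
    auto start = paragraph.begin();
    const std::sregex_iterator end;
    for (std::sregex_iterator it(paragraph.begin(), paragraph.end(), kBoundary);
         it != end; ++it) {
        const auto matchEnd = (*it)[0].second;  // one past the boundary
        const std::string sentence = TrimWhitespace(std::string(start, matchEnd));
        if (!sentence.empty()) sentences.push_back(sentence);
        start = matchEnd;
    }
    // Keep any trailing text that has no terminating punctuation.
    const std::string rest = TrimWhitespace(std::string(start, paragraph.end()));
    if (!rest.empty()) sentences.push_back(rest);
    return sentences;
}
```

For instance, SplitSentences("Hello world. How are you? Fine!") would produce the three sentences "Hello world.", "How are you?" and "Fine!".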

External links