Wikipedia pre-processing for NLP

I have been playing around with several ideas about grounding symbolic text into meaningful SDRs or some other representation, whatever proves to be useful. Cortical.IO provides an API and even an explanation of how their SDRs are generated. Nevertheless, I wanted to go much deeper and build my own implementation for the sake of deeper understanding, control, and experimentation.

Wikipedia is large. The version I downloaded is a 20 GB XML dump from 2014 that contains a lot of meta information. Most of it is useful, or at least versatile, but for my purposes it was redundant and a bit hard to get rid of.

List of filters

I began by processing the first 10,000 lines of text to see how my system worked. I quickly piled up a number of regular expressions that dealt with different problems. The end result I wanted required spaces between tokens, no numbers, no special characters, and only simple punctuation. I did not try to optimize the filters at first, only to run into a problem later: it takes a lot of time to process 20 GB of text.

public static string StripHTML(string input) {
    // Remove markup tags such as <ref> and <div>.
    return Regex.Replace(input, "<.*?>", string.Empty);
}

public static string StripSpecialSymbols(string input) {
    // Remove HTML entities such as &amp; and &quot;.
    return Regex.Replace(input, "&.*?;", string.Empty);
}

public static string StripAllBadWords(string input) {
    // Drop whitespace-delimited runs of characters other than letters,
    // hyphens and simple punctuation (numbers, markup leftovers, etc.).
    input = Regex.Replace(input, @"\s([^A-Za-z-\.\,\!\?]+)\s", string.Empty);
    return input;
}

public static string StripBrackets(string input) {
    // Remove parenthesized and quoted spans.
    input = Regex.Replace(input, "\\(.*?\\)", string.Empty);
    input = Regex.Replace(input, "\\\".*?\\\"", string.Empty);
    input = Regex.Replace(input, "\\'.*?\\'", string.Empty);
    return input;
}

public static string KeepOnlySentences(string input) {
    // Keep only lines that end with a full stop, i.e. complete sentences.
    if (input.EndsWith(".")) return input;
    return string.Empty;
}

public static string FixPunctuation(string input) {
    // Pad punctuation with spaces so it becomes its own token,
    // then collapse the resulting double spaces.
    input = input.Replace(",", " , ")
                 .Replace(".", " . ")
                 .Replace(";", " ; ")
                 .Replace("!", " ! ")
                 .Replace("?", " ? ")
                 .Replace("  ", " ");
    return input;
}

public static string FinalCleanup(string line) {
    // Keep alphabetic (and hyphenated) tokens longer than one character,
    // plus punctuation and "I"/"a"/"A"; drop lines shorter than 20 tokens.
    var data = line.Split(' ')
        .Where(l => (l.Length > 1 && Regex.IsMatch(l, @"^[a-zA-Z\-]+$"))
                    || l == "." || l == "," || l == "!" || l == "?"
                    || l == "I" || l == "a" || l == "A")
        .ToList();
    if (data.Count < 20) return string.Empty;
    return string.Join(" ", data);
}
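
The filters are applied line by line. The snippet below is a minimal sketch of what a ProcessWikiLines-style method could look like; the order of the calls and the early skip of empty results are my assumptions, not a verbatim copy of the repository code.

// A sketch only: the call order is an assumption, not the repository code.
private static List<string> ProcessWikiLines(List<string> lines, int bufferNumber) {
    var result = new List<string>();
    foreach (var line in lines) {
        var text = StripHTML(line);
        text = StripSpecialSymbols(text);
        text = StripBrackets(text);
        text = StripAllBadWords(text);
        text = KeepOnlySentences(text.Trim());
        if (text.Length == 0) continue;         // not a complete sentence
        text = FixPunctuation(text);
        text = FinalCleanup(text);
        if (text.Length > 0) result.Add(text);  // FinalCleanup returns "" for short lines
    }
    return result;
}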

Once all the preprocessing was in place, I had to face my biggest enemy: the code was slow. I only had to run it once, so any thoughtful optimization was out of the question. Parallel.ForEach ended up being a marvelous solution. I loaded chunks into memory (about 30000 articles per thread), processed the chunks in parallel (16 threads), and then wrote them to disk in sequence with the code below.

// One result buffer per thread, filled in parallel.
var bufferNumbers = new List<int>();
for (var i = 0; i < threadCount; i++) bufferNumbers.Add(i);
var resultBuffers = new List<List<string>>();
for (var i = 0; i < threadCount; i++) resultBuffers.Add(new List<string>());

// Each thread cleans its own chunk of articles.
Parallel.ForEach(bufferNumbers, (bufferNumber) => {
    resultBuffers[bufferNumber] = ProcessWikiLines(perThreadBuffers[bufferNumber], bufferNumber);
});

// Write the results in buffer order so the output stays sequential.
using (StreamWriter sw = new StreamWriter(cleanedFilePath, true, Encoding.Default)) {
    foreach (var buffer in resultBuffers) {
        foreach (var bufferLine in buffer) sw.WriteLine(bufferLine);
    }
}
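
The perThreadBuffers variable holds the chunks that were read into memory beforehand. Filling one batch of buffers could look roughly like the sketch below; wikiDumpPath and articlesPerThread are hypothetical names, and the real code loops over the whole dump batch by batch rather than reading it only once.

// A sketch of filling one batch of per-thread buffers (hypothetical names).
var perThreadBuffers = new List<List<string>>();
for (var i = 0; i < threadCount; i++) perThreadBuffers.Add(new List<string>());

using (var reader = new StreamReader(wikiDumpPath, Encoding.Default)) {
    for (var i = 0; i < threadCount; i++) {
        string line;
        while (perThreadBuffers[i].Count < articlesPerThread
               && (line = reader.ReadLine()) != null) {
            perThreadBuffers[i].Add(line);
        }
    }
}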

In the batch file available at my GitHub Repo you can see all the settings on the first line:

-cleanWiki data\enwiki-20140707-corpus.xml data\enwiki-text-only.txt 6 30000

  • data\enwiki-20140707-corpus.xml is the input dump and data\enwiki-text-only.txt is the output file
  • 6 is the number of threads
  • 30000 is the number of articles per thread.
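
For illustration, these arguments could be picked up in Main roughly as sketched below. The argument layout follows the batch line above, but the variable names and the Main signature are my assumptions, not the repository's actual parsing code.

// A sketch of mapping the batch-file arguments onto the settings used above.
static void Main(string[] args) {
    if (args.Length >= 5 && args[0] == "-cleanWiki") {
        var wikiDumpPath = args[1];                   // data\enwiki-20140707-corpus.xml
        var cleanedFilePath = args[2];                // data\enwiki-text-only.txt
        var threadCount = int.Parse(args[3]);         // 6
        var articlesPerThread = int.Parse(args[4]);   // 30000
        // ... run the cleaning pipeline with these settings
    }
}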

One could recycle the threads and skip the synchronization between batches, but since I only had to run this code once, I opted for the simpler solution. For the curious, a sketch of that alternative follows.
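
The sketch below shows roughly what the thread-recycling variant could look like: long-lived workers fed through a BlockingCollection and a single writer that streams results to disk as they arrive, so output order is no longer preserved. It needs System.Collections.Concurrent, System.Linq and System.Threading.Tasks, and it is my own illustration rather than code from the repository.

// Workers stay alive across batches; results are written as they arrive.
var work = new BlockingCollection<List<string>>(boundedCapacity: threadCount * 2);
var output = new BlockingCollection<string>();

var workers = Enumerable.Range(0, threadCount).Select(n => Task.Run(() => {
    foreach (var chunk in work.GetConsumingEnumerable())
        foreach (var line in ProcessWikiLines(chunk, n))
            output.Add(line);
})).ToArray();

var writer = Task.Run(() => {
    using (var sw = new StreamWriter(cleanedFilePath, true, Encoding.Default))
        foreach (var line in output.GetConsumingEnumerable())
            sw.WriteLine(line);
});

foreach (var chunk in perThreadBuffers) work.Add(chunk);  // producer
work.CompleteAdding();
Task.WaitAll(workers);
output.CompleteAdding();
writer.Wait();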

Source Code

All the source code is available on my GitHub Repo.
Once compiled, you can use the provided batch file. Make sure to place the enwiki dump inside the /data folder.