Here I am assuming that our input file contains text with one sentence per line. We will use this file for tokenization.
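For example, a small raw_corpus.txt (the file name the script below expects; the sentences themselves are made up purely for illustration) might look like this:

Hello, World.
Is this a question?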
#!/usr/bin/python
# Open a raw corpus file, clean its contents, and print one token per line.
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")  # lines now holds the entire file, one sentence per entry

i = 1  # sentence counter
for line in lines:
    # skip empty lines instead of stopping at the first one
    if line == "":
        continue
    # convert to lowercase
    line = line.lower()
    # cleaning: pad each punctuation mark with a leading space so it becomes its own token
    line = re.sub(r'\.', " .", line)  # substitute . with space .
    line = re.sub(r',', " ,", line)   # substitute , with space ,
    line = re.sub(r'\?', " ?", line)  # substitute ? with space ?
    line = re.sub(r'!', " !", line)   # substitute ! with space !
    # collapse runs of whitespace into a single space and trim the ends
    line = re.sub(r'\s+', " ", line).strip()
    # get the words in the current line
    words = line.split(' ')
    print("<Sentence Id='", i, "'>", sep='')  # use sep='' to suppress whitespace while printing
    j = 1  # token counter
    for word in words:
        print(j, "\t", word)
        j = j + 1
    print("</Sentence>")
    i = i + 1  # move on to the next sentence
This script will open 'raw_corpus.txt' and clean out the junk: it lowercases each line, separates punctuation marks from the words they touch, and collapses extra whitespace. It will also print sentence boundaries and tokenize each sentence into words, numbering the tokens as it goes.
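To make the expected behaviour concrete, here is the output the script would produce for the hypothetical two-line raw_corpus.txt shown above. Each token line is the counter, a tab, and the token (print inserts its default single-space separator on either side of the tab):

<Sentence Id='1'>
1 	 hello
2 	 ,
3 	 world
4 	 .
</Sentence>
<Sentence Id='2'>
1 	 is
2 	 this
3 	 a
4 	 question
5 	 ?
</Sentence>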