In NLP, one of the most common tasks is to find the frequency of words in a text file or a large corpus. Below is a script that does exactly that using a Python dictionary. To learn more about dictionary usage in Python, click here.
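Since the counting step hinges on just a few basic dictionary operations, here is a minimal sketch of those operations on their own (the key "cat" and the values are placeholders chosen purely for illustration):

# Basic dictionary operations the word-count script relies on (placeholder data).
wordcount = {}               # create an empty dictionary
wordcount["cat"] = 1         # add a key with an initial value
if "cat" in wordcount:       # membership test checks the keys
    wordcount["cat"] += 1    # update the value of an existing key
print(wordcount)             # {'cat': 2}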
#!/usr/bin/python
# Program to read a file (corpus) and find the frequency of each token

import re

# Read the corpus file
file = open("raw_corpus.txt", "r")

# Dictionary to store tokens as keys and their frequencies as values
wordcount = {}

for word in file.read().split():
    # split() splits on any whitespace, including ' ', '\t' and '\n',
    # and returns all the tokens in one list.
    # print(word)

    # Clean the corpus
    word = word.lower()              # convert to lowercase
    word = re.sub(r'\.', "", word)   # remove full stops (the dot must be escaped,
                                     # since a bare '.' matches any character)

    # Check if the current token already exists in the dictionary
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

file.close()

# Print the dictionary with keys and values
# for k, v in wordcount.items():
#     print(k, v)

# Print the dictionary with sorted keys (tokens) and their values
for k in sorted(wordcount):
    print(k, wordcount[k])
This script opens the file 'raw_corpus.txt', reads its contents, and splits the text on whitespace into words. Each word is stored in the dictionary, with the word as the key and its frequency as the value. When the same word (key) is encountered again, its value is incremented by 1.
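To see the cleaning and counting steps without needing a raw_corpus.txt on disk, here is a small self-contained variant that runs the same logic over a hardcoded string; the sample sentence is made up for illustration, and it uses dict.get() as an equivalent way to write the if/else increment:

import re

sample = "The cat sat. the cat ran."                 # made-up sample text
wordcount = {}
for word in sample.split():
    word = word.lower()                              # normalise case so "The" and "the" match
    word = re.sub(r'\.', "", word)                   # strip the full stops
    wordcount[word] = wordcount.get(word, 0) + 1     # dict.get() covers both the new-key and existing-key cases
for k in sorted(wordcount):
    print(k, wordcount[k])

# Output:
# cat 2
# ran 1
# sat 1
# the 2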