As with word frequency, it is sometimes useful to find the character frequency of a text file containing a corpus. The script below does just that.
# open a file and find its character frequency
with open('raw_corpus.txt') as fp:
    # split on the newline character so that lines holds the file contents, one entry per line
    lines = fp.read().split("\n")

# line counter, only used by the optional debug print below
i = 1

# dictionary to save characters as keys and their frequency as values
charcount = {}

# access the file contents line by line
for line in lines:
    # convert to lowercase so that 'A' and 'a' are counted together
    lower_line = line.lower()
    chars = lower_line
    # loop over the characters of the current line
    for char in chars:
        if char not in charcount:
            charcount[char] = 1
        else:
            charcount[char] += 1
    # print(i, "\t", lower_line)
    i = i + 1  # increment i

# print the dictionary with sorted keys (characters) and their counts
for k in sorted(charcount):
    print(k, charcount[k])
This script opens the file 'raw_corpus.txt', reads its contents line by line, converts each line to lowercase, counts how often each character occurs, and stores the counts in a dictionary.
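The same counting can also be sketched with `collections.Counter` from the standard library. This is only an alternative illustration, not the script used in this post: the helper name `char_frequency` and the sample string are made up for the example, and it assumes that newline characters should be left out of the counts.

```python
from collections import Counter

def char_frequency(text):
    """Count lowercased characters in text, skipping newline characters."""
    # hypothetical helper, named here only for illustration
    return Counter(text.lower().replace("\n", ""))

# sample text standing in for the contents of a corpus file
sample = "Hello\nWorld"
counts = char_frequency(sample)

# print characters in sorted order with their counts
for ch in sorted(counts):
    print(ch, counts[ch])
```

To run it on a real file, the text could be read with `open(...).read()` and passed to the helper in place of `sample`.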
A dictionary in Python is similar to a hash in Perl. It stores one value for each key; if a value is stored under a key that already exists, the old value is overwritten.
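A small sketch of that overwrite behaviour, using made-up keys and values purely for illustration:

```python
charcount = {}

charcount["a"] = 1   # new key: an entry is created
charcount["a"] = 5   # same key again: the old value 1 is replaced by 5
charcount["b"] = 2   # a different key gets its own entry

print(charcount)
print(len(charcount))  # number of distinct keys
```

This is why the script checks `if char not in charcount` before writing: a plain assignment would reset the count, while `+= 1` updates the value already stored under that key.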