基于COCA词频表的文本词汇分布测试工具v0.1

美国语言协会对美国人日常使用的英语单词做了一份详细的统计，按照日常使用的频率做成了一张表，称为COCA词频表。排名越低的单词使用频率越高，该表可以用来统计词汇量。

如果你的词汇量约为6000，那么这张表频率6000以下的单词你应该基本都认识。（不过国内教育平时学的单词未必就是他们常用的，只能说大部分重合）

我一直有个想法，要是能用COCA词频表统计一本小说中所有的词汇都是什么等级的，然后根据自己的词汇量，就能大致确定这本小说是什么难度，自己能不能读了。

学习了C++的容器和标准库算法后，我发现这个想法很容易实现，于是写了个能够统计文本词频的小工具。现在只实现了基本功能，写得比较糙。还有很多要改进的地方，随着以后的学习，感觉这个工具还是可以再深挖的。

v0.1

1.可逐个分析文本中的每一个单词并得到其COCA词频，生成统计报告。
2.仅支持txt。

实现的重点主要分三块。

（1）COCA词典的建立

首先建立出COCA词典。我使用了恶魔的奶爸公众号提供的txt格式的COCA词频表，这张表是经过处理的，每行只有一个单词和对应的频率，所以读取起来很方便。

读取文本后，用map容器来建立“单词——频率”的一对一映射。

要注意的是有些单词因为使用意思和词性的不同，统计表里会出现两个频率，也就是熟词的偏僻意思。我只取了较低的那个为准，因为你读文本时，虽然可能不是真正的意思但起码这个单词是认识的，而且这种情况总体来说比较少见，所以影响不大。

存储时我用了26组map，一个首字母存一组，单词查询时按首字母查找。可能有更有效率的办法。

（2）单词的获取和处理

使用get()方法逐个获取文本的字符。

使用vector容器存储读到的文本内容。如果是字母或连字符就push进去，如果不是，就意味着一个单词读完了，接着开始处理，把读到的文本内容assign到一个字符串中组成完整的单词，然后根据其首字母查找词频。

每个单词查到词频后，使用自定义的frequency_classify()区分它在什么范围，再用word_frequency_analyze()统计在不同等级的分布。

最后报告结果，显示出每个等级的百分比。

（3）单词的修正

首先是分析过的单词不会再分析第二遍，因为一般原文中像I，he，she，it，a，this，that这种特别常见的词还是很多的，如果每个都统计进去，会导致不同文章看到的结果都是高频词占到80%，90%。应该以单词的种类为准，而不是数量。

其次有很多单词的变形COCA词频表是不统计的，像许多单词的过去式，复数，进行时等，还有was，were，has，had等。所以我单独加了word_usual_recheck()和word_form_recheck()两个函数来检查是否有这些情况，如果有变形的话，就变为原型再尝试查找一次。但是这样做相当于手动修正，不能覆盖到所有情况，比如有些单词的变形并不遵循传统。

我先用了乔布斯斯坦福大学的演讲稿测试。测试结果：

总共分析了869个单词，77.45%的单词都找到了，且分布在4000以下，可见演讲稿使用的绝大多数的单词都是极其常用的单词，有4000以上的词汇量就基本能读懂。

我再测试了一下《小王子》的原文。

约70%的单词都找到了，且分布在4000以下，再加上4000-8000之间的，总共有接近80%的单词是六七千以下的级别。

为了对比，我再随便找了一篇纽约客上的文章测试。

可以看到，高频词的比例明显降低了一些，约有10%的单词是8000以上的级别，还有更多的单词找不到了，说明出现了许多单词的变形，当然也可能包括人名，毕竟是讨论政治的文章，还有小部分可能是超出20000。

以下是代码。

TextVolcabularyAnalyzer.cpp

  1 #include 
  2 #include 
  3 #include <string>
  4 #include 
  5 #include 
  6 #include 
  7 #include 
  8 #include 
  9 #include 
 10 #include 
 11 #include 
 12 
 13 #define COCA_WORDS_NUM       20201U
 14 #define WORDS_HEAD_NUM       26U
 15 
 16 #define WORDS_HEAD_A         0U
 17 #define WORDS_HEAD_B         1U
 18 #define WORDS_HEAD_C         2U
 19 #define WORDS_HEAD_D         3U
 20 #define WORDS_HEAD_E         4U
 21 #define WORDS_HEAD_F         5U
 22 #define WORDS_HEAD_G         6U
 23 #define WORDS_HEAD_H         7U
 24 #define WORDS_HEAD_I         8U
 25 #define WORDS_HEAD_J         9U
 26 #define WORDS_HEAD_K         10U
 27 #define WORDS_HEAD_L         11U
 28 #define WORDS_HEAD_M         12U
 29 #define WORDS_HEAD_N         13U
 30 #define WORDS_HEAD_O         14U
 31 #define WORDS_HEAD_P         15U
 32 #define WORDS_HEAD_Q         16U
 33 #define WORDS_HEAD_R         17U
 34 #define WORDS_HEAD_S         18U
 35 #define WORDS_HEAD_T         19U
 36 #define WORDS_HEAD_U         20U
 37 #define WORDS_HEAD_V         21U
 38 #define WORDS_HEAD_W         22U
 39 #define WORDS_HEAD_X         23U
 40 #define WORDS_HEAD_Y         24U
 41 #define WORDS_HEAD_Z         25U
 42 
 43 #define USUAL_WORD_NUM       17U
 44 
 45 using namespace std;
 46 
 47 typedef enum WordFrequencyType
 48 {
 49     WORD_UNDER_4000 = 0,
 50     WORD_4000_6000,
 51     WORD_6000_8000,
 52     WORD_8000_10000,
 53     WORD_10000_12000,
 54     WORD_12000_14000,
 55     WORD_14000_16000,
 56     WORD_OVER_16000,
 57     WORD_NOT_FOUND_COCA,
 58     WORD_LEVEL_NUM
 59 }TagWordFrequencyType;
 60 
 61 const string alphabet_str = "abcdefghijklmnopqrstuvwxyz";
 62 
 63 const string report_str[WORD_LEVEL_NUM] = {
 64     "UNDER 4000: ",
 65     "4000-6000: ",
 66     "6000-8000: ",
 67     "8000-10000: ",
 68     "10000-12000: ",
 69     "12000-14000: ",
 70     "14000-16000: ",
 71     "16000-20000+: ",
 72     "\nNot found in COCA:"
 73 };
 74 
 75 //for usual words not included in COCA
 76 const string usual_w_out_of_COCA_str[USUAL_WORD_NUM] =
 77 {
 78     "s","is","are","re","was","were",
 79     "an","won","t","has","had","been",
 80     "did","does","cannot","got","men"
 81 };
 82 
 83 bool word_usual_recheck(const string &ws)
 84 {
 85     bool RetVal = false;
 86     for (int i = 0; i < USUAL_WORD_NUM;i++)
 87     {
 88         if (ws == usual_w_out_of_COCA_str[i])
 89         {
 90             RetVal = true;
 91         }
 92         else
 93         {
 94             //do nothing
 95         }
 96     }
 97     return RetVal;
 98 }
 99 
100 bool word_form_recheck(string& ws)
101 {
102     bool RetVal = false;
103     if (ws.length() > 3)
104     {
105         char e1, e2,e3;
106         e3 = ws[ws.length() - 3];    //last but two letter
107         e2 = ws[ws.length() - 2];    //last but one letter
108         e1 = ws[ws.length() - 1];    //last letter
109 
110         if (e1 == 's')
111         {
112             ws.erase(ws.length() - 1);
113             RetVal = true;
114         }
115         else if (e2 == 'e' && e1 == 'd')
116         {
117             ws.erase(ws.length() - 1);
118             ws.erase(ws.length() - 1);
119             RetVal = true;
120         }
121         else if (e3 == 'i' && e2 == 'n' && e1 == 'g')
122         {
123             ws.erase(ws.length() - 1);
124             ws.erase(ws.length() - 1);
125             ws.erase(ws.length() - 1);
126             RetVal = true;
127         }
128         else
129         {
130             //do nothing
131         }
132     }
133     
134     return RetVal;
135 }
136 
137 TagWordFrequencyType frequency_classify(const int wfrq)
138 {
139     if (wfrq == 0)
140     {
141         return WORD_NOT_FOUND_COCA;
142     }
143     else if (wfrq > 0 && wfrq <= 4000)
144     {
145         return WORD_UNDER_4000;
146     }
147     else if (wfrq > 4000 && wfrq <= 6000)
148     {
149         return WORD_4000_6000;
150     }
151     else if (wfrq > 6000 && wfrq <= 8000)
152     {
153         return WORD_6000_8000;
154     }
155     else if (wfrq > 8000 && wfrq <= 10000)
156     {
157         return WORD_8000_10000;
158     }
159     else if (wfrq > 10000 && wfrq <= 12000)
160     {
161         return WORD_10000_12000;
162     }
163     else if (wfrq > 12000 && wfrq <= 14000)
164     {
165         return WORD_12000_14000;
166     }
167     else if (wfrq > 14000 && wfrq <= 16000)
168     {
169         return WORD_14000_16000;
170     }
171     else
172     {
173         return WORD_OVER_16000;
174     }
175 }
176 
177 void word_frequency_analyze(array<int, WORD_LEVEL_NUM> & wfrq_array, TagWordFrequencyType wfrq_tag)
178 {
179     switch (wfrq_tag)
180     {
181         case WORD_UNDER_4000:
182         {
183             wfrq_array[WORD_UNDER_4000] += 1;
184             break;
185         }
186         case WORD_4000_6000:
187         {
188             wfrq_array[WORD_4000_6000] += 1;
189             break;
190         }
191         case WORD_6000_8000:
192         {
193             wfrq_array[WORD_6000_8000] += 1;
194             break;
195         }
196         case WORD_8000_10000:
197         {
198             wfrq_array[WORD_8000_10000] += 1;
199             break;
200         }
201         case WORD_10000_12000:
202         {
203             wfrq_array[WORD_10000_12000] += 1;
204             break;
205         }
206         case WORD_12000_14000:
207         {
208             wfrq_array[WORD_12000_14000] += 1;
209             break;
210         }
211         case WORD_14000_16000:
212         {
213             wfrq_array[WORD_14000_16000] += 1;
214             break;
215         }
216         case WORD_OVER_16000:
217         {
218             wfrq_array[WORD_OVER_16000] += 1;
219             break;
220         }
221         default:
222         {
223             wfrq_array[WORD_NOT_FOUND_COCA] += 1;
224             break;
225         }
226     }
227 }
228 
229 bool isaletter(const char& c)
230 {
231     if ((c >= 'a' && c <= 'z') || (c >= 'A' && c<= 'Z'))
232     {
233         return true;
234     }
235     else
236     {
237         return false;
238     }
239 }
240 
241 int main()
242 {
243     //file init
244     ifstream COCA_txt("D:\\COCA.txt");
245     ifstream USER_txt("D:\\test.txt");
246 
247     //time init
248     clock_t startTime, endTime;
249     double build_map_time = 0;
250     double process_time = 0;
251 
252     startTime = clock();    //build time start
253 
254     //build COCA words map
255     map<string, int> COCA_WordsList[WORDS_HEAD_NUM];
256     int readlines = 0;
257 
258     while (readlines < COCA_WORDS_NUM)
259     {
260         int frequency = 0; string word = "";
261         COCA_txt >> frequency;
262         COCA_txt >> word;
263 
264         //transform to lower uniformly 
265         transform(word.begin(), word.end(), word.begin(), tolower);
266 
267         //import every word
268         for (int whead = WORDS_HEAD_A; whead < WORDS_HEAD_NUM; whead++)
269         {
270             //check word head 
271             if (word[0] == alphabet_str[whead])
272             {
273                 //if a word already exists, only load its lower frequency
274                 if (COCA_WordsList[whead].find(word) == COCA_WordsList[whead].end())
275                 {
276                     COCA_WordsList[whead].insert(make_pair(word, frequency));
277                 }
278                 else
279                 {
280                     COCA_WordsList[whead][word] = frequency < COCA_WordsList[whead][word] ? frequency : COCA_WordsList[whead][word];
281                 }
282             }
283             else
284             {
285                 // do nothing
286             }
287         }
288         readlines++;
289     }
290 
291     endTime = clock();    //build time stop
292     build_map_time = (double)(endTime - startTime) / CLOCKS_PER_SEC;
293 
294     //user prompt
295     cout << "COCA words list imported.\nPress any key to start frequency analysis...\n";
296     cin.get();
297 
298     startTime = clock();    //process time start
299 
300     //find text words
301     vector<char> content_read;
302     string word_readed;
303     vector<int> frequecy_processed = { 0 };
304     array<int, WORD_LEVEL_NUM> words_analysis_array{ 0 };
305     char char_read = ' ';
306 
307     //get text char one by one
308     while (USER_txt.get(char_read))
309     {
310         //only letters and '-' between letters will be received
311         if (isaletter(char_read) || char_read == '-')
312         {
313             content_read.push_back(char_read);
314         }
315         else
316         {
317             //char which is not a letter marks the end of a word
318             if (!content_read.empty())    //skip single letter 
319             {
320                 int current_word_frequency = 0;
321 
322                 //assign letters to make the word
323                 word_readed.assign(content_read.begin(), content_read.end());
324                 transform(word_readed.begin(), word_readed.end(), word_readed.begin(), tolower);
325 
326                 cout << "Finding word \"" << word_readed << "\" \t";
327                 cout << "Frequency:";
328 
329                 //check the word's head and find its frequency in COCA list
330                 for (int whead = WORDS_HEAD_A; whead < WORDS_HEAD_NUM; whead++)
331                 {
332                     if (word_readed[0] == alphabet_str[whead])
333                     {
334                         cout << COCA_WordsList[whead][word_readed];
335                         current_word_frequency = COCA_WordsList[whead][word_readed];
336 
337                         //check if the word has been processed
338                         if (current_word_frequency == 0)
339                         {
340                             //addtional check
341                             if (word_usual_recheck(word_readed))
342                             {
343                                 word_frequency_analyze(words_analysis_array, WORD_UNDER_4000);
344                             }
345                             else if (word_form_recheck(word_readed))
346                             {
347                                 current_word_frequency = COCA_WordsList[whead][word_readed];    //try again
348                                 if (current_word_frequency > 0)
349                                 {
350                                     frequecy_processed.push_back(current_word_frequency);
351                                     word_frequency_analyze(words_analysis_array, frequency_classify(current_word_frequency));
352                                 }
353                                 else
354                                 {
355                                     // do nothing
356                                 }
357                             }
358                             else
359                             {
360                                 word_frequency_analyze(words_analysis_array, WORD_NOT_FOUND_COCA);
361                             }
362                         }
363                         else if (find(frequecy_processed.begin(), frequecy_processed.end(), current_word_frequency)
364                             == frequecy_processed.end())
365                         {
366                             //classify this word and make statistics
367                             frequecy_processed.push_back(current_word_frequency);
368                             word_frequency_analyze(words_analysis_array, frequency_classify(current_word_frequency));
369                         }
370                         else
371                         {
372                             // do nothing
373                         }
374                     }
375                     else
376                     {
377                         //do nothing
378                     }
379                 }
380                 cout << endl;
381 
382                 //classify this word and make statistics
383                 //word_frequency_analyze(words_analysis_array, frequency_classify(current_word_frequency));
384                 content_read.clear();
385             }
386             else
387             {
388                 //do nothing
389             }
390         }
391     }
392 
393     endTime = clock();    //process time stop
394     process_time = (double)(endTime - startTime) / CLOCKS_PER_SEC;
395 
396     //calc whole words processed
397     int whole_words_analyzed = 0;
398     whole_words_analyzed = accumulate(words_analysis_array.begin(), words_analysis_array.end(), 0);
399 
400     //report result
401     cout << "\n////////// Report ////////// \n";
402     for (int i = 0;i< words_analysis_array.size();i++)
403     {
404         cout << report_str[i] <<"\t"<< words_analysis_array[i] << " (";
405         cout<<fixed<2)<<(float)words_analysis_array[i] * 100 / whole_words_analyzed << "%)" << endl;
406     }
407     cout << "\nWords totally analyzed: " << whole_words_analyzed << endl;
408 
409     //show run time
410     cout << "Map build time: " << build_map_time*1000 << "ms.\n";
411     cout << "Process time: " << process_time*1000 << "ms.\n";
412     cout << "////////////////////////////" << endl;
413 
414     //close file
415     COCA_txt.close();
416     USER_txt.close();
417 
418     return 0;
419 }

CC++编程

基于COCA词频表的文本词汇分布测试工具v0.1

相关

浅谈C/C++编程前景

标签