Advanced String Operations: Counting Word Frequencies in Text with a Linked List

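Given a short English passage, the task is to count its word frequencies in C: read the text from a file, extract each word, count how often it occurs, and write the result to another file.
The passage is as follows: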

Of all the changes that have taken place in English-language newspapers during the past quarter-century, perhaps the most far-reaching has been the inexorable decline in the scope and seriousness of their arts coverage.
It is difficult to the point of impossibility for the average reader under the age of forty to imagine a time when high-quality arts criticism could be found in most big-city newspapers. Yet a considerable number of the most significant collections of criticism published in the 20th century consisted in large part of newspaper reviews. To read such books today is to marvel at the fact that their learned contents were once deemed suitable for publication in general-circulation dailies.
We are even farther removed from the unfocused newspaper reviews published in England between the turn of the 20th century and the eve of World War II, at a time when newsprint was dirt-cheap and stylish arts criticism was considered an ornament to the publications in which it appeared. In those far-off days, it was taken for granted that the critics of major papers would write in detail and at length about the events they covered. Theirs was a serious business, and even those reviewers who wore their learning lightly, like George Bernard Shaw and Ernest Newman, could be trusted to know what they were about. These men believed in journalism as a calling, and were proud to be published in the daily press. “So few authors have brains enough or literary gift enough to keep their own end up in journalism,” Newman wrote, “that I am tempted to define ‘journalism’ as ‘a term of contempt applied by writers who are not read to writers who are.’”
Unfortunately, these critics are virtually forgotten. Neville Cardus, who wrote for the Manchester Guardian from 1917 until shortly before his death in 1975, is now known solely as a writer of essays on the game of cricket. During his lifetime, though, he was also one of England’s foremost classical-music critics, a stylist so widely admired that his Autobiography (1947) became a best-seller. He was knighted in 1967, the first music critic to be so honored. Yet only one of his books is now in print, and his vast body of writings on music is unknown save to specialists.
Is there any chance that Cardus’s criticism will enjoy a revival? The prospect seems remote. Journalistic tastes had changed long before his death, and postmodern readers have little use for the richly upholstered Vicwardian prose in which he specialized. Moreover, the amateur tradition in music criticism has been in headlong retreat.

(It is actually a reading-comprehension passage from the postgraduate entrance English exam.)

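The first question is which data structure should hold the words and their frequencies. An array is the most obvious candidate, but it brings several problems: the number of distinct words is not known in advance, so a fixed-size array must either be over-allocated or risk overflowing, and a plain array of character buffers has no natural place to keep a count next to each word.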

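So a linked list comes to mind. In C, a struct can define a composite type, loosely similar to a class in C++, for storing more complex data. Here each node must hold both a scanned word and its frequency, and no word may be stored twice: when a word repeats, the count in its existing node is incremented instead. That requirement is hard to meet with a bare array.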
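The find-or-increment behaviour described above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the article's final program: the names `WordNode` and `count_word` are hypothetical, and new nodes are pushed at the head of the list purely to keep the sketch short.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type for this sketch: one word plus its count. */
typedef struct WordNode {
    char *word;
    int count;
    struct WordNode *next;
} WordNode;

/* If `word` is already in the list, increment its count.
 * Otherwise prepend a new node with count 1.
 * Returns the (possibly new) head; on allocation failure the
 * list is returned unchanged. */
WordNode *count_word(WordNode *head, const char *word) {
    for (WordNode *p = head; p != NULL; p = p->next) {
        if (strcmp(p->word, word) == 0) {
            p->count++;
            return head;
        }
    }
    WordNode *node = malloc(sizeof *node);
    if (node == NULL) {
        return head;
    }
    node->word = malloc(strlen(word) + 1);  /* +1 for the '\0' */
    if (node->word == NULL) {
        free(node);
        return head;
    }
    strcpy(node->word, word);
    node->count = 1;
    node->next = head;
    return node;
}
```

Each lookup walks the whole list, so counting n words costs O(n × distinct words) comparisons. That is acceptable for a passage of this size.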

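So the functionality to implement is clear:

  • First, read the file and extract the words (the text contains a good deal of punctuation, which has to be stripped).
  • Next, build the linked list: store each extracted word in a node and keep its count.
  • Finally, traverse every node in the list and output each word with its frequency.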
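The first step can also be tried on an in-memory string before any file handling is involved. The helper below is a hypothetical sketch (`next_word` is not part of the article's program), and it makes one deliberate assumption that differs from simply deleting punctuation: it treats every non-letter as a separator, so a hyphenated token such as "big-city" yields two words.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy the next run of letters starting at *cursor into out
 * (at most cap - 1 characters, always NUL-terminated when cap > 0)
 * and advance *cursor past that run.
 * Returns 1 if a word was found, 0 once the text is exhausted.
 * In the default "C" locale, isalpha accepts only A-Z and a-z. */
int next_word(const char **cursor, char *out, size_t cap) {
    const char *s = *cursor;
    size_t n = 0;
    int found = 0;

    /* Skip digits, punctuation and whitespace. */
    while (*s != '\0' && !isalpha((unsigned char)*s)) {
        s++;
    }
    /* Consume the word; characters beyond the buffer are dropped. */
    while (isalpha((unsigned char)*s)) {
        if (n + 1 < cap) {
            out[n++] = *s;
        }
        s++;
        found = 1;
    }
    if (cap > 0) {
        out[n] = '\0';
    }
    *cursor = s;
    return found;
}
```

Called repeatedly on "big-city newspapers (1947)." it produces big, city and newspapers, then reports that no words remain. Whether hyphenated compounds should count as one word or two is a design choice, and the result file will differ accordingly.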

The three steps map to three functions. The code is as follows:

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
typedef struct Data{
    char *c; // the word
    int t;   // how many times it has occurred
}Data;

typedef struct Node{
    Data data;         // data field
    struct Node *next; // pointer to the next node
}Node, *pNode;

typedef struct HeadNode{  // head node
    int total_num;        // total number of words read
    pNode next;
}HeadNode, *pHeadNode;
// Function declarations
void deleteNotA(char str[]);
void insertNoes(pHeadNode head, char str[]);
void showItems(pHeadNode head, FILE *out);

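int main(){
    FILE *f_in, *f_out;
    char str[50] = "";
    f_in = fopen("f1.in", "r");   // the input is only read
    f_out = fopen("f1.out", "w"); // the output is only written
    if(!f_in || !f_out){
        perror("fopen");
        return 1;
    }
    pHeadNode head = (pHeadNode) malloc(sizeof(HeadNode));
    if(!head){
        perror("malloc");
        return 1;
    }
    head->total_num = 0;
    head->next = NULL;
    // %49s leaves room for the terminating '\0' in str[50]
    while(fscanf(f_in, "%49s", str) == 1){
        deleteNotA(str);
        if(str[0] == '\0'){ // the token had no letters, e.g. "(1947)"
            continue;
        }
        insertNoes(head, str);
        head->total_num++;
    }

    showItems(head, f_out);
    printf("Total words: %d\n", head->total_num);
    fclose(f_in);
    fclose(f_out);
    return 0;
}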

void deleteNotA(char str[]){
    // Remove every character that is not an ASCII letter, in place.
    // index never runs ahead of i, so no temporary buffer is needed.
    int i, index = 0;
    for(i = 0; str[i] != '\0'; i++){
        if((str[i] >= 'A' && str[i] <= 'Z') || (str[i] >= 'a' && str[i] <= 'z')){
            str[index++] = str[i];
        }
    }
    str[index] = '\0';
    return ;
}

void insertNoes(pHeadNode head, char str[]){
    size_t length = strlen(str);
    static pNode r = NULL; // tail pointer, so a new node can be appended directly
    pNode p = head->next;
    while(p){ // check whether the word is already in the list
        if(strcmp(p->data.c, str) == 0){
            p->data.t++;
            return ;
        }
        p = p->next;
    }
    // Create a new node
    pNode node = (pNode)malloc(sizeof(Node));
    if(!node){
        return ;
    }
    node->data.c = (char*)malloc(length + 1); // +1 for the terminating '\0'
    if(!node->data.c){
        free(node);
        return ;
    }
    node->data.t = 1;
    node->next = NULL;
    strcpy(node->data.c, str);
    if(!r){
        head->next = node;
    }
    else{
        r->next = node;
    }
    r = node;
    return ;
}

void showItems(pHeadNode head, FILE *out){
    // Write every word and its count to the output file and to stdout
    pNode p = head->next;
    while(p){
        fprintf(out, "%s:%d\n", p->data.c, p->data.t);
        printf("%s:%d\n", p->data.c, p->data.t);
        p = p->next;
    }
    fprintf(out, "Total words: %d\n", head->total_num);
}
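
The program above never releases the list. For a short-lived program that is harmless, since the operating system reclaims the memory at exit, but a cleanup routine is easy to add. The helper below is a hypothetical sketch (`freeList` is not in the original code); the type declarations are repeated only so that the sketch compiles on its own.

```c
#include <stdlib.h>

/* Same declarations as in the listing above, repeated so that this
 * sketch is self-contained. */
typedef struct Data { char *c; int t; } Data;
typedef struct Node { Data data; struct Node *next; } Node, *pNode;
typedef struct HeadNode { int total_num; pNode next; } HeadNode, *pHeadNode;

/* Hypothetical helper: free every node, each node's word string,
 * and finally the head itself. Returns the number of nodes freed. */
int freeList(pHeadNode head) {
    int freed = 0;
    if (head == NULL) {
        return 0;
    }
    pNode p = head->next;
    while (p != NULL) {
        pNode next = p->next; /* save the link before freeing p */
        free(p->data.c);
        free(p);
        p = next;
        freed++;
    }
    free(head);
    return freed;
}
```

It would be called in main after showItems. One caveat: insertNoes keeps its tail pointer in a static variable, so after the list is freed that pointer dangles, and a second list could not safely be built in the same run without changing that design.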