NLP Text-Processing Tools

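1. Chinese corpora often run into encoding problems; convert a file in any character set to UTF-8.

2. Merge all .txt files in the unlabel folder, leaving one blank line between files.

3. Randomly sample 60%, 20%, and 5% of a .txt file.

4. Remove the spaces from an already word-segmented file (with a regular expression), restoring the file to its original form.

5. Read an Excel file and convert it to .json files.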
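
For item 1, a minimal sketch of converting a file to UTF-8 might look like the following. It uses only the standard library and guesses the source encoding by trial decoding. The candidate list (`utf-8-sig`, `gb18030`, `big5`) and the function names are assumptions made for illustration, not part of the original post. Trial decoding is a heuristic; a dedicated detector such as the third-party `chardet` package could be swapped in.

```python
from pathlib import Path

# Assumed candidate encodings, tried in order; adjust the list for your corpus.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030", "big5")


def decode_bytes(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def convert_to_utf8(src_path, dst_path):
    """Re-encode src_path as UTF-8 into dst_path; return the encoding that was guessed."""
    raw = Path(src_path).read_bytes()
    text, enc = decode_bytes(raw)
    # Write bytes directly so line endings are preserved exactly.
    Path(dst_path).write_bytes(text.encode("utf-8"))
    return enc
```

Because `utf-8-sig` is tried first, files that are already UTF-8 (with or without a byte-order mark) pass through unchanged apart from the mark being dropped.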

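For item 2, merging the .txt files of a folder could be sketched as below. The function name `merge_txt_files`, the alphabetical file order, and the assumption that the inputs are already UTF-8 are illustrative choices, not taken from the original post.

```python
from pathlib import Path


def merge_txt_files(folder, out_path):
    """Concatenate every .txt file in folder into out_path, separated by one blank line."""
    out_path = Path(out_path)
    # Sort for a reproducible order; skip the output file if it lives in the same folder.
    paths = [
        p for p in sorted(Path(folder).glob("*.txt"))
        if p.resolve() != out_path.resolve()
    ]
    # Drop trailing newlines so exactly one blank line separates consecutive files.
    parts = [p.read_text(encoding="utf-8").rstrip("\n") for p in paths]
    out_path.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(paths)
```

A call such as `merge_txt_files("unlabel", "merged.txt")` would return the number of files that were merged.
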
https://blog.csdn.net/qq_35751770/article/details/103664496
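
For item 3, the post does not say what unit is sampled, so the sketch below assumes line-level sampling and draws each fraction independently. The function name `sample_lines` and the optional `seed` parameter are hypothetical.

```python
import random


def sample_lines(lines, fraction, seed=None):
    """Return round(len(lines) * fraction) randomly chosen lines, kept in original order."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    k = round(len(lines) * fraction)
    rng = random.Random(seed)  # a private generator, so the global random state is untouched
    chosen = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in chosen]
```

To produce the three subsets, one could read the file with `readlines()` and call `sample_lines` once for each of `0.6`, `0.2`, and `0.05`, writing each result to its own file.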

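For item 4, removing segmentation spaces with a regular expression could be sketched as follows. The function name and the exact set of characters treated as spaces are assumptions. Note that this simple version also deletes spaces between Latin-script words, so it suits text that is purely Chinese.

```python
import re

# Horizontal whitespace only: ASCII space, tab, and the full-width space U+3000.
_SPACES = re.compile(r"[ \t\u3000]+")


def remove_segmentation_spaces(text):
    """Delete the spaces inserted by word segmentation, keeping line breaks intact."""
    return _SPACES.sub("", text)


print(remove_segmentation_spaces("我 爱 自然 语言 处理"))  # prints 我爱自然语言处理
```
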
5. Read an Excel file and convert it to .json files

# coding=utf-8
"""
Open an Excel workbook and write each sheet out as a JSON file,
one {"filepath": ..., "text": ..., "label": ...} object per line.
data.xls becomes train.json, val.json, and test.json.
"""
import json
import os

import xlrd  # reads Excel files (xlrd 2.x supports only the legacy .xls format)


def deal_data(filename, outpath):
    """filename: path to the .xls workbook; outpath: directory for the .json files."""
    wb = xlrd.open_workbook(filename)     # open the workbook for reading
    data_file = ["train", "test", "val"]  # one sheet per data split

    for excel_name in data_file:
        output_file = os.path.join(outpath, excel_name + ".json")  # output file for this sheet
        excel = wb.sheet_by_name(excel_name)                       # look up the sheet by name

        with open(output_file, "w", encoding="utf-8") as output:
            for i in range(excel.nrows):  # read columns 0, 1, and 2 of every row
                data_dic = {
                    "filepath": excel.cell_value(i, 0),
                    "text": excel.cell_value(i, 1).strip(),
                    "label": excel.cell_value(i, 2).split(),  # whitespace-separated labels
                }
                # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
                output.write(json.dumps(data_dic, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    deal_data("data01.xls", "corpus/class/origin_corpus/")
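
Because the script writes one JSON object per line rather than a single JSON array, the output cannot be loaded with one `json.load` call. A minimal sketch for reading such a file back, with the hypothetical helper name `read_json_lines`, might be:

```python
import json


def read_json_lines(path):
    """Load a file holding one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `read_json_lines("corpus/class/origin_corpus/train.json")` would return the training rows as dictionaries with the keys `filepath`, `text`, and `label`.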
