Python中文分词 jieba 十五分钟入门与进阶

大家好，欢迎来到IT知识分享网。

文章目录

整体介绍

jieba 基于Python的中文分词工具,安装使用非常方便,直接pip即可,2/3都可以,功能强悍,博主十分推荐
github:https://github.com/fxsjy/jieba
开源中国地址:http://www.oschina.net/p/jieba/?fromerr=LRXZzk9z
写这篇文章花费两个小时小时,阅读需要十五分钟,读完本篇文章后您将能上手jieba

下篇博文将介绍将任意中文文本生成中文词云

同时如果你希望使用其它分词工具,那么你可以留意我之后的博客,我会在接下来的日子里发布其他有关内容.

三种分词模式与一个参数

以下代码主要来自于jieba的github,你可以在github下载该源码

import jieba seg_list = jieba.cut("我来到北京清华大学", cut_all=True, HMM=False) print("Full Mode: " + "/ ".join(seg_list)) # 全模式 seg_list = jieba.cut("我来到北京清华大学", cut_all=False, HMM=True) print("Default Mode: " + "/ ".join(seg_list)) # 默认模式 seg_list = jieba.cut("他来到了网易杭研大厦", HMM=False) print(", ".join(seg_list)) seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造", HMM=False) # 搜索引擎模式 print(", ".join(seg_list)) # jieba.cut的默认参数只有三个,jieba源码如下 # cut(self, sentence, cut_all=False, HMM=True) # 分别为:输入文本 是否为全模式分词 与是否开启HMM进行中文分词

关键词提取

from os import path import jieba.analyse as analyse d = path.dirname(__file__) text_path = 'txt/lz.txt' #设置要分析的文本路径 text = open(path.join(d, text_path)).read() for key in analyse.extract_tags(text,50, withWeight=False): # 使用jieba.analyse.extract_tags()参数提取关键字,默认参数为50 print key.encode('utf-8') # 设置输出编码为utf-8不然在因为win下控制台默认中文字符集为gbk,所以会出现乱码 # 当withWeight=True时,将会返回number类型的一个权重值(TF-IDF)

运行结果如图所示,但是同样的我们也发现了一些问题,比如:
问题一:
分词错误,在运行结果中中”路明非”(龙族男主)被分成了”路明”和”明非”啷个中文词语,这是因为jieba的词库中并不含有该词的原因,同样的原因以及jieba词库比较老,因而在许多文本分词时都会产生这种情况,而这个问题我们将在第五个模块”三种可以让分词更准确的方法”解决
问题二:
出现非实意词语,无论在哪种语言中,都会存在大量的非实意单词,这一类词云我们需要在进行中文分词时进行去除停用词,这个问题将在下一个模块中解决

中文歧义测试与去除停用词

本段代码主要来自于《机器学习实践指南（第二版））》，其作者为麦好，ps：这是一本好书

找停用词点这里：多版本中文停用词词表 + 多版本英文停用词词表 + python词表合并程序

import jieba TestStr = "2010年底部队友谊篮球赛结束" # 因为在汉语中没有空格进行词语的分隔，所以经常会出现中文歧义，比如年底-底部-部队-队友 # jieba 默认启用了HMM（隐马尔科夫模型）进行中文分词，实际效果不错 seg_list = jieba.cut(TestStr, cut_all=True) print "Full Mode:", "/ ".join(seg_list) # 全模式 seg_list = jieba.cut(TestStr, cut_all=False) print "Default Mode:", "/ ".join(seg_list) # 默认模式 # 在默认模式下有对中文歧义有较好的分类方式 seg_list = jieba.cut_for_search(TestStr) # 搜索引擎模式 print "cut for Search","/".join(seg_list)

去除文本中的停用词

# - * - coding: utf - 8 -*- # # 作者：田丰(FontTian) # 创建时间:'2017/5/27' # 邮箱： # CSDN：http://blog.csdn.net/fontthrone import sys import jieba from os import path d = path.dirname(__file__) stopwords_path = 'stopwords\stopwords1893.txt' # 停用词词表 text_path = 'txt/lz.txt' #设置要分析的文本路径 text = open(path.join(d, text_path)).read() def jiebaclearText(text): mywordlist = [] seg_list = jieba.cut(text, cut_all=False) liststr="/ ".join(seg_list) f_stop = open(stopwords_path) try: f_stop_text = f_stop.read( ) f_stop_text=unicode(f_stop_text,'utf-8') finally: f_stop.close( ) f_stop_seg_list=f_stop_text.split('\n') for myword in liststr.split('/'): if not(myword.strip() in f_stop_seg_list) and len(myword.strip())>1: mywordlist.append(myword) return ''.join(mywordlist) text1 = jiebaclearText(text) print text1

三种可以让分词更准确的方法

#这个只需要在源代码中加入一个语句即可 import sys import jieba from os import path d = path.dirname(__file__) stopwords_path = 'stopwords\stopwords1893.txt' # 停用词词表 jieba.add_word('路明非') # 添加的自定义中文语句的代码在这里 # 添加的自定义中文语句的代码在这里 # 添加的自定义中文语句的代码在这里 text_path = 'txt/lz.txt' #设置要分析的文本路径 text = open(path.join(d, text_path)).read() def jiebaclearText(text): mywordlist = [] seg_list = jieba.cut(text, cut_all=False) liststr="/ ".join(seg_list) f_stop = open(stopwords_path) try: f_stop_text = f_stop.read( ) f_stop_text=unicode(f_stop_text,'utf-8') finally: f_stop.close( ) f_stop_seg_list=f_stop_text.split('\n') for myword in liststr.split('/'): if not(myword.strip() in f_stop_seg_list) and len(myword.strip())>1: mywordlist.append(myword) return ''.join(mywordlist) text1 = jiebaclearText(text) print text1

运行效果如下:

#encoding=utf-8 from __future__ import print_function, unicode_literals import sys sys.path.append("../") import jieba jieba.load_userdict("userdict.txt") # jieba采用延迟加载，"import jieba"不会立即触发词典的加载，一旦有必要才开始加载词典构建trie。如果你想手工初始jieba，也可以手动初始化。示例如下: # import jieba # jieba.initialize() #手动初始化（可选） # 在0.28之前的版本是不能指定主词典的路径的，有了延迟加载机制后，你可以改变主词典的路径: # 注意用户词典为主词典即优先考虑的词典,原词典此时变为非主词典 # jieba.set_dictionary('data/dict.txt.big') import jieba.posseg as pseg test_sent = ( "李小福是创新办主任也是云计算方面的专家; 什么是八一双鹿\n" "例如我输入一个带“韩玉赏鉴”的标题，在自定义词库中也增加了此词为N类\n" "「台中」正確應該不會被切開。mac上可分出「石墨烯」；此時又可以分出來凱特琳了。" ) words = jieba.cut(test_sent) print('/'.join(words)) print("="*40) result = pseg.cut(test_sent) # pseg.cut 切分,并显示词性 # 下面是userdict.txt的内容,如果不加入这个词库,那么在运行结果中,云计算,创新办等词都将无法识别 ''' 云计算 5 李小福 2 nr 创新办 3 i easy_install 3 eng 好用 300 韩玉赏鉴 3 nz 八一双鹿 3 nz 台中 凱特琳 nz Edu Trust认证 2000 '''

下面这段代码主要来自于jieba的github,你可以在github下载该源码

print('='*40) print('添加自定义词典/调整词典') print('-'*40) print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False))) #如果/放到/post/中将/出错/。 # 调整词典使 中将 变为中/将 print(jieba.suggest_freq(('中', '将'), True)) #494 print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False))) #如果/放到/post/中/将/出错/。 print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False))) #「/台/中/」/正确/应该/不会/被/切开 print(jieba.suggest_freq('台中', True)) print(jieba.suggest_freq('台中', True)) #69 # 调整词典使 台中 不被分词为台/中 print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False))) #「/台中/」/正确/应该/不会/被/切开

并行计算

下面这段代码主要来自于jieba的github,你可以在github下载该源码

原理：将目标文本按行分隔后，把各行文本分配到多个python进程并行分词，然后归并结果，从而获得分词速度的可观提升

基于python自带的multiprocessing模块，目前暂不支持windows

import sys import time sys.path.append("../../") import jieba jieba.enable_parallel() # 关闭并行分词 jieba.enable_parallel(4) # 开启并行分词模式，参数为并行进程数  url = sys.argv[1] content = open(url,"rb").read() t1 = time.time() words = "/ ".join(jieba.cut(content)) t2 = time.time() tm_cost = t2-t1 log_f = open("1.log","wb") log_f.write(words.encode('utf-8')) print('speed %s bytes/second' % (len(content)/tm_cost))

实验结果：在4核3.4GHz Linux机器上，对金庸全集进行精确分词，获得了1MB/s的速度，是单进程版的3.3倍。

免责声明：本站所有文章内容,图片，视频等均是来源于用户投稿和互联网及文摘转载整编而成，不代表本站观点，不承担相关法律责任。其著作权各归其原作者或其出版社所有。如发现本站有涉嫌抄袭侵权/违法违规的内容,侵犯到您的权益，请在线联系站长,一经查实,本站将立刻删除。本文来自网络,若有侵权，请联系删除，如若转载，请注明出处：https://haidsoft.com/149245.html