
Comparing TF-IDF Implementations (gensim, jieba, sklearn, and by hand)

Published on 2019-08-20 10:45


1. Overview

        TF-IDF (term frequency-inverse document frequency) is, in the words of Baidu Baike, "a common weighting technique used in information retrieval and data mining."

        TF means term frequency: how often a character or word occurs in a text of a sentence-level corpus.

                 A common formula is: TF = (number of times the term occurs in a sentence) / (total number of terms in that sentence)

        IDF means inverse document frequency, computed from the number of sentences that contain the term:

                 A common formula is: IDF = log( total number of sentences in the corpus / (number of sentences containing the term + 1) )

        TF-IDF = TF * IDF
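
To make the formulas concrete, here is a minimal sketch in plain Python that applies them to a toy corpus (the corpus and numbers are illustrative only). Note that on a corpus this small, the +1 in the IDF denominator can even push the score negative, which is one reason real libraries smooth differently:

import math

# toy corpus: two tokenized sentences (illustrative only)
corpus = [['大漠', '帝国'], ['大漠', '大漠', '紫色']]

def tf(term, sentence):
    # occurrences of the term in the sentence / total terms in the sentence
    return sentence.count(term) / len(sentence)

def idf(term, corpus):
    # log( total sentences / (sentences containing the term + 1) )
    contains = sum(1 for s in corpus if term in s)
    return math.log(len(corpus) / (contains + 1))

# '大漠' in the second sentence: tf = 2/3, idf = log(2/3) ≈ -0.405,
# so tf-idf ≈ -0.270
print(tf('大漠', corpus[1]) * idf('大漠', corpus))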

         That is how an earlier post of mine described TF-IDF, and as it happens I also hand-wrote the algorithm in an interview. Is that really how it works in real implementations? Let's find out.

          GitHub repository:

                           https://github.com/yongzhuo/Tookit-Sihui/tree/master/tookit_sample/tf_idf_compare

         This post presents four ways to implement tf-idf, each with its own strengths:

  1. gensim
  2. jieba
  3. sklearn
  4. by_hand (manual)

2. Pros and Cons (sklearn recommended)

            a. gensim: corpora generates the tokens, doc2bow builds the bag-of-words model, TfidfModel computes tf-idf, and the idf values can be read from idfs. Words never seen in training get no idf and are excluded from the output. A middle-of-the-road model; input can be a list or a file path.

            b. jieba: ships with idf.txt, i.e. pre-computed idf values; words missing from the table fall back to the median idf.

            c. sklearn: CountVectorizer counts term frequencies, TfidfTransformer computes tf-idf, and results are stored compressed as a csr_matrix. You can opt into n-gram features, idf smoothing, and feature selection (max_features); with this many options it is the one I recommend. A quick numeric comparison of gensim's and sklearn's idf formulas follows this list.

            d. by_hand: the hand-rolled version is configurable; it computes term-frequency dictionaries in batches and merges them, so large corpora (say, a wiki corpus) can be processed in limited memory.
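
A minimal sketch of that idf difference, using made-up counts (the exact formulas are derived from each library's source in section 3 below):

import math

totaldocs, docfreq = 10, 3  # illustrative counts

# gensim, no smoothing: log2(totaldocs / docfreq)
idf_gensim = math.log(totaldocs / docfreq, 2)                # ≈ 1.737

# sklearn, smooth_idf=True: ln((1 + n_samples) / (1 + df)) + 1
idf_sklearn = math.log((1 + totaldocs) / (1 + docfreq)) + 1  # ≈ 2.012

print(idf_gensim, idf_sklearn)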

 

3. Implementation and Code Notes

    3.1 gensim

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time    : 2019/7/31 21:20
# @author  : Mo
# @function:

from gensim import corpora, models
import jieba


def tfidf_from_questions(corpora_documents):
    """
    Build a tf-idf model from an in-memory list of tokenized documents.
    :param corpora_documents: list of token lists
    :return: dictionary, tfidf_model
    """
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model


def tfidf_from_corpora(sources_path):
    """
    Read a corpus from a file and compute tf-idf.
    :param sources_path: path of the corpus file
    :return: dictionary, tfidf_model
    """
    from tookit_sihui.utils.file_utils import txt_read, txt_write
    questions = txt_read(sources_path)
    corpora_documents = []
    for item_text in questions:
        item_seg = list(jieba.cut(str(item_text).strip()))
        corpora_documents.append(item_seg)
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model


if __name__ == '__main__':
    # test 1: from an in-memory list of questions
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this']]
    dictionary, tfidf_model = tfidf_from_questions(corpora_documents)
    sentence = '大漠 大漠 大漠'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    bow = dictionary.doc2bow(['i', 'i', '大漠', '大漠', '大漠'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)

    # test 2: from a text file
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    dictionary, tfidf_model = tfidf_from_corpora(path_tf_idf_corpus)
    sentence = '大漠帝国'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    bow = dictionary.doc2bow(['sihui'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    gg = 0

# Results:
# [(12, 1)]
# [(12, 1.0)]
# []
# []
# [(172, 1), (173, 1)]
# [(172, 0.7071067811865475), (173, 0.7071067811865475)]
# []
# []

# Notes:
# 1. The left number is the dictionary id, the right one the word's tf-idf.
# 2. Stopwords (such as 'the') and single letters (such as 'i') are not removed.
# 3. Words never seen in training, such as 'sihui', are dropped and not scored.
# 4. Computation details:
# 4.1 idf = add + log_{log_base}(totaldocs / docfreq), as below;
#     eps = 1e-12, and only idf values greater than eps are kept.
def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    import numpy as np
    # np.log() with no base is the natural log; by the change-of-base rule
    # log_a(b) = log_c(b) / log_c(a), this computes log_2(totaldocs / docfreq).
    # Stepping through with a debugger shows there is no smoothing, i.e. it is
    # log_2(number of documents / number of documents containing the word).
    # That is reasonable: an empty corpus produces no model, and every word in
    # the model occurs in at least one document, so there is no need to add 1.
    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
# See also the self.initialize(corpus) function in gensim.
# 4.2 tf: from the gensim snippet below (and from debugging), tf is the raw
#     count: for the sentence '大漠 大漠 大漠', the tf of '大漠' is 3.
# termid_array, tf_array = [], []
# for termid, tf in bow:
#     termid_array.append(termid)
#     tf_array.append(tf)
#
# tf_array = self.wlocal(np.array(tf_array))
#
# vector = [
#     (termid, tf * self.idfs.get(termid))
#     for termid, tf in zip(termid_array, tf_array)
#     if abs(self.idfs.get(termid, 0.0)) > self.eps
# ]
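
One detail the notes above do not mention: the reason both words of '大漠帝国' score exactly 0.7071067811865475 is that gensim's TfidfModel L2-normalizes each output vector by default (normalize=True), so two terms with equal raw weight each end up at 1/√2. A one-line check:

import math
# two equal weights, L2-normalized: each becomes 1/sqrt(2)
print(1 / math.sqrt(2))  # 0.7071067811865475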

 

    3.2 jieba

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time    : 2019/7/31 21:21
# @author  : Mo
# @function:

import jieba.analyse
import jieba

sentence = '大漠 帝国 和 紫色 Angle'
seg = jieba.cut(sentence)
print(seg)  # jieba.cut returns a generator
tf_idf = jieba.analyse.extract_tags(sentence, withWeight=True)
print(tf_idf)

# Result:
# [('Angle', 2.988691875725), ('大漠', 2.36158258893), ('紫色', 2.10190405216), ('帝国', 1.605909794915)]

# Notes:
# 1.1 idf: jieba's idf values come from the bundled file idf.txt, where each
#     passage was treated as one document; words missing from the table fall
#     back to the median idf, about 11.x (self.median_idf below).
# 1.2 tf: tf is the count of the word in the current sentence divided by the
#     total number of words, so for '大漠 帝国 和 紫色 Angle' the tf of '大漠' is 1/5.
#     The stopword '和' is removed by extract_tags.
# tf computation code in jieba:
# freq[w] = freq.get(w, 0.0) + 1.0
# total = sum(freq.values())
# for k in freq:
#     kw = k.word if allowPOS and withFlag else k
#     freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
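
To see the idf table and the median fallback for yourself, the default extractor behind extract_tags can be inspected. The attribute names below (default_tfidf, idf_freq, median_idf) come from jieba's internals and may change between versions:

import jieba.analyse

extractor = jieba.analyse.default_tfidf  # the TFIDF instance behind extract_tags
print(len(extractor.idf_freq))           # entries loaded from idf.txt
print(extractor.median_idf)              # fallback idf for unseen words (~11.x)
print(extractor.idf_freq.get('帝国'))     # idf of a word, if it is in the table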

 

    3.3 sklearn

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time    : 2019/7/31 21:21
# @author  : Mo
# @function:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


def tfidf_from_ngram(questions):
    """
    Compute n-gram tf-idf with TfidfVectorizer.
    :param questions: list, like ['孩子气', '大漠帝国']
    :return: fitted TfidfVectorizer
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    import jieba

    def jieba_cut(x):
        x = list(jieba.cut(x))
        return ' '.join(x)

    questions = [jieba_cut(''.join(ques)) for ques in questions]
    tfidf_model = TfidfVectorizer(ngram_range=(1, 2),  # n-gram features, default (1, 1)
                                  max_features=10000,
                                  token_pattern=r"(?u)\b\w+\b",  # also keep single-character tokens
                                  min_df=1,
                                  max_df=0.9,
                                  use_idf=1,
                                  smooth_idf=1,
                                  sublinear_tf=1)
    tfidf_model.fit(questions)
    print(tfidf_model.transform(['紫色 ANGEL 是 虾米 回事']))
    return tfidf_model


if __name__ == "__main__":
    # test 1
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this'], ['大漠', '大漠']]
    corpora_documents = [''.join(ques) for ques in corpora_documents]
    # count term frequencies
    vectorizer = CountVectorizer()
    # initialize, then fit and transform to tf-idf
    transformer = TfidfTransformer()
    # the inner fit_transform turns text into a term-count matrix,
    # the outer one computes tf-idf from it
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpora_documents))
    print(tfidf)
    # all words known to the model
    word = vectorizer.get_feature_names()
    print(word)
    weight = tfidf.toarray()
    print(weight)
    # test 2: n-gram features
    tf_idf_model = tfidf_from_ngram(corpora_documents)
    print(tf_idf_model.transform(['你 谁 呀, 小老弟']))

# sklearn's TfidfVectorizer, which inherits from CountVectorizer, extracts
# n-gram features and can be used directly for feature computation.
# idf smoothing in sklearn:
# df += int(self.smooth_idf)         # smoothing
# n_samples += int(self.smooth_idf)  # smoothing
# idf = np.log(n_samples / df) + 1   # note the extra + 1
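
The smoothed-idf formula quoted in the last comment block can be checked against the fitted idf_ attribute. A minimal sketch on a toy corpus (the corpus and tokens are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['大漠 帝国', '大漠 大漠', '紫色 Angle']
vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", smooth_idf=True, lowercase=False)
vec.fit(docs)

# reproduce idf by hand: ln((1 + n_samples) / (1 + df)) + 1
n_samples = len(docs)
for word, idx in vec.vocabulary_.items():
    df = sum(1 for d in docs if word in d.split())
    idf_manual = np.log((1 + n_samples) / (1 + df)) + 1
    assert np.isclose(vec.idf_[idx], idf_manual)
print(vec.idf_)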

    3.4 by_hand

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time    : 2019/6/19 21:32
# @author  : Mo
# @function: tf-idf

from tookit_sihui.utils.file_utils import save_json
from tookit_sihui.utils.file_utils import load_json
from tookit_sihui.utils.file_utils import txt_write
from tookit_sihui.utils.file_utils import txt_read
import jieba
import json
import math
import os

from tookit_sihui.conf.logger_config import get_logger_root
logger = get_logger_root()


def count_tf(questions):
    """
    Count raw character or word frequencies (for tf).
    :param questions: list, corpus; char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict, counts such as {'我': 1, '爱': 2}
    """
    tf_char = {}
    for question in questions:
        for char in question:
            if char.strip():  # '' is not counted
                tf_char[char] = tf_char.get(char, 0) + 1
    tf_char['[LENS]'] = sum([v for k, v in tf_char.items()])  # total token count
    return tf_char


def count_idf(questions):
    """
    Count document frequencies (for idf).
    :param questions: list, corpus; char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict, counts such as {'我': 1, '爱': 2}
    """
    idf_char = {}
    for question in questions:
        question_set = set(question)  # within one sentence, duplicates count once
        for char in question_set:
            if char.strip():  # '' is not counted
                idf_char[char] = idf_char.get(char, 0) + 1
    idf_char['[LENS]'] = len(questions)  # store the total number of sentences
    return idf_char


def count_tf_idf(freq_char, freq_document, ndigits=12, smooth=0):
    """
    Compute tf, idf and tf-idf.
    :param freq_char: dict, raw term counts
    :param freq_document: dict, document frequencies
    :return: dict, dict, dict: tf, idf, tf-idf
    """
    len_tf = freq_char['[LENS]']
    len_tf_mid = int(len(freq_char) / 2)
    len_idf = freq_document['[LENS]']
    len_idf_mid = int(len(freq_document) / 2)
    # tf
    tf_char = {}
    for k2, v2 in freq_char.items():
        tf_char[k2] = round((v2 + smooth) / (len_tf + smooth), ndigits)
    # idf
    idf_char = {}
    for ki, vi in freq_document.items():
        idf_char[ki] = round(math.log((len_idf + smooth) / (vi + smooth), 2), ndigits)
    # tf-idf
    tf_idf_char = {}
    for kti, vti in freq_char.items():
        tf_idf_char[kti] = round(tf_char[kti] * idf_char[kti], ndigits)
    # drop the document-count entries
    tf_char.pop('[LENS]')
    idf_char.pop('[LENS]')
    tf_idf_char.pop('[LENS]')
    # average / max / min / median; materialize the value views first so they
    # are not polluted by the summary keys added below
    tf_char_values = list(tf_char.values())
    idf_char_values = list(idf_char.values())
    tf_idf_char_values = list(tf_idf_char.values())
    tf_char['[AVG]'] = round(sum(tf_char_values) / len_tf, ndigits)
    idf_char['[AVG]'] = round(sum(idf_char_values) / len_idf, ndigits)
    tf_idf_char['[AVG]'] = round(sum(tf_idf_char_values) / len_idf, ndigits)
    tf_char['[MAX]'] = max(tf_char_values)
    idf_char['[MAX]'] = max(idf_char_values)
    tf_idf_char['[MAX]'] = max(tf_idf_char_values)
    tf_char['[MIN]'] = min(tf_char_values)
    idf_char['[MIN]'] = min(idf_char_values)
    tf_idf_char['[MIN]'] = min(tf_idf_char_values)
    tf_char['[MID]'] = sorted(tf_char_values)[len_tf_mid]
    idf_char['[MID]'] = sorted(idf_char_values)[len_idf_mid]
    tf_idf_char['[MID]'] = sorted(tf_idf_char_values)[len_idf_mid]
    return tf_char, idf_char, tf_idf_char


def save_tf_idf_dict(path_dir, tf_char, idf_char, tf_idf_char):
    """
    Sort and save as text files.
    :param path_dir: str, output directory
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # sort and save
    tf_char_sorted = sorted(tf_char.items(), key=lambda d: d[1], reverse=True)
    tf_char_sorted = [tf[0] + '\t' + str(tf[1]) + '\n' for tf in tf_char_sorted]
    txt_write(tf_char_sorted, path_dir + 'tf.txt')
    idf_char_sorted = sorted(idf_char.items(), key=lambda d: d[1], reverse=True)
    idf_char_sorted = [idf[0] + '\t' + str(idf[1]) + '\n' for idf in idf_char_sorted]
    txt_write(idf_char_sorted, path_dir + 'idf.txt')
    tf_idf_char_sorted = sorted(tf_idf_char.items(), key=lambda d: d[1], reverse=True)
    tf_idf_char_sorted = [tf_idf[0] + '\t' + str(tf_idf[1]) + '\n' for tf_idf in tf_idf_char_sorted]
    txt_write(tf_idf_char_sorted, path_dir + 'tf_idf.txt')


def save_tf_idf_json(path_dir, tf_freq, idf_freq, tf_char, idf_char, tf_idf_char):
    """
    Sort and save as json.
    :param path_dir: str, output directory
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # raw frequencies
    save_json([tf_freq], path_dir + '/tf_freq.json')
    save_json([idf_freq], path_dir + '/idf_freq.json')
    save_json([tf_char], path_dir + '/tf.json')
    save_json([idf_char], path_dir + '/idf.json')
    save_json([tf_idf_char], path_dir + '/tf_idf.json')


def load_tf_idf_json(path_tf_freq=None, path_idf_freq=None, path_tf=None, path_idf=None, path_tf_idf=None):
    """
    Load tf, idf and tf_idf from json files.
    :param path_tf:
    :param path_idf:
    :param path_tf_idf:
    :return:
    """
    json_tf_freq = load_json(path_tf_freq)
    json_idf_freq = load_json(path_idf_freq)
    json_tf = load_json(path_tf)
    json_idf = load_json(path_idf)
    json_tf_idf = load_json(path_tf_idf)
    return json_tf_freq[0], json_idf_freq[0], json_tf[0], json_idf[0], json_tf_idf[0]


def dict_add(dict1, dict2):
    """
    Merge two dicts, summing the values of shared keys.
    :param dict1:
    :param dict2:
    :return:
    """
    for i, j in dict2.items():
        if i in dict1.keys():
            dict1[i] += j
        else:
            dict1.update({f'{i}': dict2[i]})
    return dict1


class TFIDF:
    def __init__(self, questions=None, path_tf=None,
                 path_idf=None, path_tf_idf=None,
                 path_tf_freq=None, path_idf_freq=None,
                 ndigits=12, smooth=0):
        """
        Build the statistics from a corpus, or load pre-computed ones from json.
        :param questions: list, corpus; char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
        """
        self.esplion = 1e-16
        self.questions = questions
        self.path_tf_freq = path_tf_freq
        self.path_idf_freq = path_idf_freq
        self.path_tf = path_tf
        self.path_idf = path_idf
        self.path_tf_idf = path_tf_idf
        self.ndigits = ndigits
        self.smooth = smooth
        self.create_tfidf()

    def create_tfidf(self):
        if self.questions is not None:  # a questions list, i.e. a corpus, was passed in
            self.tf_freq = count_tf(self.questions)
            self.idf_freq = count_idf(self.questions)
            self.tf, self.idf, self.tfidf = count_tf_idf(self.tf_freq,
                                                         self.idf_freq,
                                                         ndigits=self.ndigits,
                                                         smooth=self.smooth)
        else:  # load statistics trained earlier
            self.tf_freq, self.idf_freq, \
            self.tf, self.idf, self.tfidf = load_tf_idf_json(path_tf_freq=self.path_tf_freq,
                                                             path_idf_freq=self.path_idf_freq,
                                                             path_tf=self.path_tf,
                                                             path_idf=self.path_idf,
                                                             path_tf_idf=self.path_tf_idf)
        self.chars = [idf for idf in self.idf.keys()]

    def extract_tfidf_of_sentence(self, ques):
        """
        Average tf-idf of a sentence.
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tfidf[char]
                score_list[char] = self.tfidf[char]
            else:  # unseen words get a tiny epsilon
                score = score + self.esplion
                score_list[char] = self.esplion
        score = score / len(ques_list)  # average, so sentences of different lengths stay comparable
        logger.info(score_list)
        logger.info({ques: score})
        return score

    def extract_tf_of_sentence(self, ques):
        """
        Average tf of a sentence.
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tf[char]
                score_list[char] = self.tf[char]
            else:  # unseen words get a tiny epsilon
                score = score + self.esplion
                score_list[char] = self.esplion
        score = score / len(ques_list)  # average, so sentences of different lengths stay comparable
        logger.info(score_list)
        logger.info({ques: score})
        return score

    def extract_idf_of_sentence(self, ques):
        """
        Average idf of a sentence.
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.idf[char]
                score_list[char] = self.idf[char]
            else:  # unseen words get a tiny epsilon
                score = score + self.esplion
                score_list[char] = self.esplion
        score = score / len(ques_list)  # average, so sentences of different lengths stay comparable
        logger.info(score_list)
        logger.info({ques: score})
        return score


def create_TFIDF(path):
    # test 1: build from a corpus
    import time
    time_start = time.time()
    # first feed in the whole corpus to build tf-idf, then use it
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    from tookit_sihui.utils.file_utils import txt_write, txt_read
    path_wiki = path if path else path_tf_idf_corpus
    path_dir = 'tf_idf_freq/'
    # ques = ['大漠帝国最强', '花落惊飞羽最漂亮', '紫色Angle最有气质', '孩子气最活泼', '口袋巧克力和过路蜻蜓最好最可爱啦', '历历在目最烦恼']
    # questions = [list(q.strip()) for q in ques]
    # questions = [list(jieba.cut(que)) for que in ques]
    questions = txt_read(path_wiki)
    len_questions = len(questions)
    batch_size = 1000000
    size_trade = len_questions // batch_size
    print(size_trade)
    size_end = size_trade * batch_size
    # count tf-freq and idf-freq batch by batch
    ques_tf_all, ques_idf_all = {}, {}
    for i, start in enumerate(range(0, size_end, batch_size)):
        end = start + batch_size  # each full batch is [start, start + batch_size)
        print("batch {}".format(i))
        question = questions[start: end]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_idf = count_idf(questionss)
        ques_tf = count_tf(questionss)
        print('tf_idf_{}: '.format(i) + str(time.time() - time_start))
        # merge the dicts, summing values
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('dict_add_{}: '.format(i) + str(time.time() - time_start))
        print("tf of '的': {}".format(ques_tf_all['的']))
        print("idf of '的': {}".format(ques_idf_all['的']))
    # the remainder that does not fill a whole batch
    if len_questions - size_end > 0:
        print("batch {}".format('last'))
        question = questions[size_end: len_questions]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_tf = count_tf(questionss)
        ques_idf = count_idf(questionss)
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('{}: '.format('last') + str(time.time() - time_start))
        print("tf of '的': {}".format(ques_tf_all['的']))
        print("idf of '的': {}".format(ques_idf_all['的']))
    # compute tf-idf
    tf_char, idf_char, tf_idf_char = count_tf_idf(ques_tf_all, ques_idf_all)
    print(len(tf_char))
    print('tf-idf ' + str(time.time() - time_start))
    print('tf-idf ok!')
    # save tf, idf and tf-idf
    save_tf_idf_json(path_dir, ques_tf_all, ques_idf_all, tf_char, idf_char, tf_idf_char)
    gg = 0


if __name__ == "__main__":
    # test 1
    path = None  # corpus path; one pre-segmented sentence per line, e.g. '孩子 气 和 紫色 angle'
    create_TFIDF(path)

    # # test 2: load from json via the class and score user input
    # path_dir = 'tf_idf_freq/'
    # path_tf = path_dir + '/tf.json'
    # path_idf = path_dir + '/idf.json'
    # path_tf_idf = path_dir + '/tf_idf.json'
    #
    # tfidf = TFIDF(path_tf=path_tf, path_idf=path_idf, path_tf_idf=path_tf_idf)
    # score1 = tfidf.extract_tf_of_sentence('大漠帝国')
    # score2 = tfidf.extract_idf_of_sentence('大漠帝国')
    # score3 = tfidf.extract_tfidf_of_sentence('大漠帝国')
    # print('tf: ' + str(score1))
    # print('idf: ' + str(score2))
    # print('tfidf: ' + str(score3))
    # while True:
    #     print("请输入: ")
    #     ques = input()
    #     tfidf_score = tfidf.extract_tfidf_of_sentence(ques)
    #     print('tfidf:' + str(tfidf_score))
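
Finally, a minimal in-memory usage sketch of the TFIDF class above, skipping the json round-trip (the corpus here is illustrative):

# assumes the TFIDF class and the imports defined above
corpus = [list(jieba.cut(q)) for q in ['大漠帝国最强', '紫色Angle最有气质', '花落惊飞羽最漂亮']]
tfidf = TFIDF(questions=corpus)
print(tfidf.extract_tf_of_sentence('大漠帝国'))     # average tf of its words
print(tfidf.extract_idf_of_sentence('大漠帝国'))    # average idf
print(tfidf.extract_tfidf_of_sentence('大漠帝国'))  # average tf-idf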

 

Hope this helps!


 




Author: 外星人入侵

Link: https://www.pythonheidong.com/blog/article/49007/a700600c03dc4e5931af/

Source: python黑洞网

