Implementing the TF-IDF Algorithm by Hand

Published 2020-08-01 16:23


Implementing TF-IDF in Python

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2020/7/30 20:38
# @Author : fuGuoWen
# @Site :
# @File : test05.py
# @Software: PyCharm
import math
from collections import Counter

corpus = [
    'hello world hello ',
    'hello go',
]

# Tokenize each sentence. split() with no argument drops the empty token
# that split(' ') would produce from the trailing space in the first sentence.
word_list = [sentence.split() for sentence in corpus]
print(word_list)

# One Counter (word -> number of occurrences) per sentence.
countlist = [Counter(words) for words in word_list]


# Each word comes from a count, and each count comes from countlist.
# count[word] is the number of occurrences of the word in the sentence;
# sum(count.values()) is the total number of words in the sentence.
def tf(word, count):
    return count[word] / sum(count.values())


# Number of sentences that contain the word.
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)


# len(count_list) is the total number of sentences and
# n_containing(word, count_list) is the number of sentences containing the word.
# The commented-out variant adds 1 to the denominator to avoid division by zero.
# The active line does not, so it raises ZeroDivisionError for a word that
# appears in no sentence.
def idf(word, count_list):
    # return math.log(len(count_list) / (1 + n_containing(word, count_list)))
    return math.log(len(count_list) / n_containing(word, count_list))


# Multiply tf by idf.
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)


# for i, count in enumerate(countlist):
#     print("Top words in document {}".format(i + 1))
#     scores = {word: tfidf(word, count, countlist) for word in count}
#     sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
#     for word, score in sorted_words[:]:
#         # print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
#         print("\tWord: {}, TF-IDF: {}".format(word, score))

# def simall(self, doc):
#     """
#     Compute the similarity score of every sentence in the training data
#     :param doc: one sentence as a list of tokens
#     :return:
#     """
#     scores = []
#     for index in range(self.D):
#         score = self.sim(doc, index)
#         scores.append(score)
#     return scores


# Similarity computation: the tf-idf score of `word` in each sentence.
def simall(countlist, word):
    scores = []
    for count in countlist:
        score = tfidf(word, count, countlist)
        print(score)
        scores.append(score)
    return scores


if __name__ == '__main__':
    scores = simall(countlist, "hello")
    print(scores)
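
The three functions tf, idf and tfidf in the listing correspond to the formulas below, where N is the number of sentences and the logarithm is natural (math.log). The worked values are a sketch that assumes whitespace tokenization with empty tokens dropped, so the first sentence has three tokens.

```latex
\mathrm{tf}(w, d) = \frac{\mathrm{count}(w, d)}{\sum_{w'} \mathrm{count}(w', d)}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{\lvert\{d : w \in d\}\rvert}, \qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d)\cdot\mathrm{idf}(w)

% "hello" occurs in both sentences, so its idf (and tf-idf) is zero:
\text{tf-idf}(\text{hello}, d_1) = \tfrac{2}{3}\cdot\ln\tfrac{2}{2} = 0

% "world" occurs only in the first sentence:
\text{tf-idf}(\text{world}, d_1) = \tfrac{1}{3}\cdot\ln\tfrac{2}{1} \approx 0.231
```

This is why the script prints a score of 0.0 for "hello" in both sentences: a word that appears in every sentence carries no distinguishing weight.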

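The unsmoothed idf in the listing divides by the number of sentences containing the word, so a word that appears in no sentence raises ZeroDivisionError. A minimal sketch of one common smoothed variant follows. It adds 1 to both the numerator and the denominator inside the logarithm, then adds 1 to the result. The name smoothed_idf is hypothetical and not part of the original script.

```python
import math
from collections import Counter


def smoothed_idf(word, count_list):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, which never divides by zero."""
    n_docs = len(count_list)
    doc_freq = sum(1 for count in count_list if word in count)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Same two sentences as the listing, tokenized on whitespace.
counts = [Counter("hello world hello".split()), Counter("hello go".split())]

print(smoothed_idf("hello", counts))   # word present in every sentence
print(smoothed_idf("python", counts))  # word present in no sentence
```

With this variant a word found in every sentence still gets a positive weight instead of zero, which changes the ranking behavior, so it is a design choice rather than a drop-in replacement.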
 
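The simall function in the listing scores a single word against each sentence. As a sketch of how that could extend to a multi-word query, the example below sums the tf-idf of every query word per sentence and picks the best match. The function score_query and the zero-score guard for unseen words are assumptions added for illustration, not the author's method.

```python
import math
from collections import Counter


def tfidf(word, count, count_list):
    """tf-idf with the unsmoothed idf; returns 0.0 for a word in no sentence."""
    doc_freq = sum(1 for c in count_list if word in c)
    if doc_freq == 0:
        return 0.0
    tf = count[word] / sum(count.values())
    return tf * math.log(len(count_list) / doc_freq)


def score_query(query, count_list):
    """Sum the tf-idf of each query word, producing one score per sentence."""
    words = query.split()
    return [sum(tfidf(w, count, count_list) for w in words) for count in count_list]


corpus = ["hello world hello", "hello go"]
counts = [Counter(sentence.split()) for sentence in corpus]

scores = score_query("hello go", counts)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)
print("Best match:", corpus[best])
```

Because "hello" appears in both sentences it contributes nothing, so the query is decided entirely by "go" and the second sentence wins.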

Original link: https://blog.csdn.net/u011243684/article/details/107720917



Author: 智慧星辰

Link: https://www.pythonheidong.com/blog/article/468419/

Source: python黑洞网

