
Generating BERT sentence vectors for short Chinese text and computing similarity (GPU version; Windows/Win10, Linux, Django and Flask)

Published 2019-08-07 14:40


Serving BERT sentence vectors on GPU in production, and running into "Floating point exception" and "SystemError: error return without exception set".

I recently needed BERT in production and stepped into plenty of pitfalls; some of them I still can't explain, and fixing one problem often just led straight to the next. Wrapping BERT as a standalone service, as https://github.com/hanxiao/bert-as-service does, actually works quite well, but the service-style deployment just isn't for me.

Another option is https://github.com/terrifyzhao/bert-utils, which exposes callable functions for sentence vectors and similarity computation. However, possibly because of its use of yield, a queue, or the GPU/CUDA/cuDNN stack, it sometimes crashes with "Floating point exception" on a Linux GPU, and debugging on Win10 and Linux raises "SystemError: error return without exception set", so I didn't dare rely on it.

1. The approach: Keras plus a few modifications (project: https://github.com/yongzhuo/nlp_xiaojiang/tree/master/FeatureProject/bert)

After weighing the options, I quietly settled on the Keras route, which I have never been especially fond of; but since Google's TensorFlow is moving in that direction anyway, one might as well follow the trend. The Keras implementations of BERT and GPT-2 at https://github.com/CyberZHG/keras-bert are actually quite good.

Enough talk; straight to the code.

2. Code

Since this simply calls Google's pre-trained model without any fine-tuning, it can also run on an ordinary CPU without using much memory; it is just slower.
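If there is no GPU available, the only change needed is to hide the CUDA devices before TensorFlow creates its session; a minimal sketch (in extract_keras_bert_feature.py below the same variable is set to '0' to pick the first GPU):

# Run on CPU only: hide all CUDA devices before TensorFlow initializes.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'   # '-1' (or an empty string) makes TensorFlow fall back to the CPU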

2.1 First, the model. You need to download Google's pre-trained Chinese model, either from the official repository or from my Baidu Pan link: https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q (extraction code: rket).
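If you prefer the official source, the chinese_L-12_H-768_A-12 checkpoint can also be fetched directly. A small download sketch (the URL is the one listed in the google-research/bert README at the time of writing, so please verify it; the Data/ target directory matches the path expected by feature_config.py below):

# Download and unpack the official BERT Chinese checkpoint into Data/.
import os
import urllib.request
import zipfile

url = "https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip"
target_dir = "Data"
os.makedirs(target_dir, exist_ok=True)
zip_path = os.path.join(target_dir, "chinese_L-12_H-768_A-12.zip")
urllib.request.urlretrieve(url, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    # yields Data/chinese_L-12_H-768_A-12/{bert_config.json, bert_model.ckpt*, vocab.txt}
    zf.extractall(target_dir)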

2.2 Then the main code, extract_keras_bert_feature.py:

               

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/5/8 20:04
# @author   :Mo
# @function :extract feature of bert and keras

import codecs
import os

import keras.backend.tensorflow_backend as ktf_keras
import numpy as np
import tensorflow as tf
from keras.layers import Add
from keras.models import Model
from keras_bert import load_trained_model_from_checkpoint, Tokenizer

from FeatureProject.bert.layers_keras import NonMaskingLayer
from conf.feature_config import gpu_memory_fraction, config_name, ckpt_name, vocab_file, max_seq_len, layer_indexes

# module-level globals so that django, flask, tornado etc. can call this
graph = None
model = None

# GPU configuration and memory-fraction setting
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = gpu_memory_fraction
sess = tf.Session(config=config)
ktf_keras.set_session(sess)


class KerasBertVector():
    def __init__(self):
        self.config_path, self.checkpoint_path, self.dict_path, self.max_seq_len = config_name, ckpt_name, vocab_file, max_seq_len
        # keep graph and model global so that django, flask, tornado etc. can call this
        global graph
        graph = tf.get_default_graph()
        global model
        model = load_trained_model_from_checkpoint(self.config_path, self.checkpoint_path,
                                                   seq_len=self.max_seq_len)
        print(model.output)
        print(len(model.layers))
        # lay = model.layers
        # 104 layers in total; the first 8 cover the token, position and other embeddings,
        # after that every 4 layers are (MultiHeadAttention, Dropout, Add, LayerNormalization),
        # repeated across the transformer blocks
        layer_dict = [7]
        layer_0 = 7
        for i in range(12):
            layer_0 = layer_0 + 4
            layer_dict.append(layer_0)
        # if no layer index is given, output the model's own output
        if len(layer_indexes) == 0:
            encoder_layer = model.output
        # if only one layer index is given, take just that layer's weights;
        # an invalid index falls back to the second-to-last block
        elif len(layer_indexes) == 1:
            if layer_indexes[0] in [i+1 for i in range(12)]:
                encoder_layer = model.get_layer(index=layer_dict[layer_indexes[0]]).output
            else:
                encoder_layer = model.get_layer(index=layer_dict[-2]).output
        # otherwise iterate over the requested layers, take each layer's weights
        # and merge them (shape: 768 * number of layers)
        else:
            # layer_indexes must be [1,2,3,......12...24]
            # all_layers = [model.get_layer(index=lay).output if lay is not 1 else model.get_layer(index=lay).output[0] for lay in layer_indexes]
            all_layers = [model.get_layer(index=layer_dict[lay-1]).output if lay in [i+1 for i in range(12)]
                          else model.get_layer(index=layer_dict[-1]).output  # if an index is invalid, default to the last layer
                          for lay in layer_indexes]
            print(layer_indexes)
            print(all_layers)
            # the output of layer==1 has the wrong format; the input of the second layer is a list
            all_layers_select = []
            for all_layers_one in all_layers:
                all_layers_select.append(all_layers_one)
            encoder_layer = Add()(all_layers_select)
            print(encoder_layer.shape)
        print("KerasBertEmbedding:")
        print(encoder_layer.shape)
        output_layer = NonMaskingLayer()(encoder_layer)
        model = Model(model.inputs, output_layer)
        # model.summary(120)

        # read the vocabulary and build the tokenizer
        self.token_dict = {}
        with codecs.open(self.dict_path, 'r', 'utf8') as reader:
            for line in reader:
                token = line.strip()
                self.token_dict[token] = len(self.token_dict)
        self.tokenizer = Tokenizer(self.token_dict)

    def bert_encode(self, texts):
        # text preprocessing
        input_ids = []
        input_masks = []
        input_type_ids = []
        for text in texts:
            print(text)
            tokens_text = self.tokenizer.tokenize(text)
            print('Tokens:', tokens_text)
            input_id, input_type_id = self.tokenizer.encode(first=text, max_len=self.max_seq_len)
            input_mask = [0 if ids == 0 else 1 for ids in input_id]
            input_ids.append(input_id)
            input_type_ids.append(input_type_id)
            input_masks.append(input_mask)

        input_ids = np.array(input_ids)
        input_masks = np.array(input_masks)
        input_type_ids = np.array(input_type_ids)

        # use the global graph so django, flask, tornado etc. can call this
        with graph.as_default():
            predicts = model.predict([input_ids, input_type_ids], batch_size=1)
        print(predicts.shape)
        for i, token in enumerate(tokens_text):
            print(token, [len(predicts[0][i].tolist())], predicts[0][i].tolist())

        # pooling step, following https://github.com/terrifyzhao/bert-utils/blob/master/graph.py
        mul_mask = lambda x, m: x * np.expand_dims(m, axis=-1)
        masked_reduce_mean = lambda x, m: np.sum(mul_mask(x, m), axis=1) / (np.sum(m, axis=1, keepdims=True) + 1e-9)

        pools = []
        for i in range(len(predicts)):
            pred = predicts[i]
            masks = input_masks.tolist()
            mask_np = np.array([masks[i]])
            pooled = masked_reduce_mean(pred, mask_np)
            pooled = pooled.tolist()
            pools.append(pooled[0])
        print('bert:', pools)
        return pools


if __name__ == "__main__":
    bert_vector = KerasBertVector()
    pooled = bert_vector.bert_encode(['你是谁呀', '小老弟'])
    print(pooled)
    while True:
        print("input:")
        ques = input()
        print(bert_vector.bert_encode([ques]))
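The pooled vectors that bert_encode returns can be compared directly. For example, a cosine-similarity sketch (the helper function below is my own illustration, not part of the project; it assumes the KerasBertVector class defined above):

import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sentence vectors returned by bert_encode."""
    a, b = np.array(vec_a), np.array(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

bert_vector = KerasBertVector()
v1, v2 = bert_vector.bert_encode(['你是谁呀', '小老弟'])
print('similarity:', cosine_similarity(v1, v2))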

2.3 Next, layers_keras.py:

            

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/5/10 10:49
# @author   :Mo
# @function :create model of keras-bert for get [-2] layers

from keras.engine import Layer


class NonMaskingLayer(Layer):
    """
    fix convolutional 1D can't receive masked input, detail: https://github.com/keras-team/keras/issues/4978
    thanks for https://github.com/jacoxu
    """
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(NonMaskingLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        pass

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        return x

    def compute_output_shape(self, input_shape):
        return input_shape
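NonMaskingLayer exists because some Keras layers (Conv1D in particular) refuse masked input, while keras_bert's embedding layers emit a mask. A tiny illustration of the idea, independent of BERT (a sketch assuming a standard Keras 2.x install and the NonMaskingLayer defined above; the Embedding/Conv1D setup is mine, not the project's):

import numpy as np
from keras.layers import Input, Embedding, Conv1D
from keras.models import Model

# Embedding with mask_zero=True attaches a mask that Conv1D cannot consume;
# NonMaskingLayer simply drops that mask so the downstream layer accepts the tensor.
inp = Input(shape=(26,))
emb = Embedding(input_dim=100, output_dim=8, mask_zero=True)(inp)
unmasked = NonMaskingLayer()(emb)
out = Conv1D(filters=4, kernel_size=3)(unmasked)
demo = Model(inp, out)
print(demo.predict(np.random.randint(1, 100, size=(2, 26))).shape)   # (2, 24, 4)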

2.4 Finally, the configuration file, conf/feature_config.py:

           

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/5/10 9:13
# @author   :Mo
# @function :path of FeatureProject

import os

# path of BERT model
file_path = os.path.dirname(__file__)
file_path = file_path.replace('conf', '') + 'Data'
model_dir = os.path.join(file_path, 'chinese_L-12_H-768_A-12/')
config_name = os.path.join(model_dir, 'bert_config.json')
ckpt_name = os.path.join(model_dir, 'bert_model.ckpt')
vocab_file = os.path.join(model_dir, 'vocab.txt')

# fraction of GPU memory to use
gpu_memory_fraction = 0.2

# by default, take the output of the second-to-last layer as the sentence vector
layer_indexes = [-2]

# maximum sequence length; for single short texts it is advisable to lower this value
max_seq_len = 26
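Since graph and model are kept as module-level globals precisely so that web frameworks can call the encoder, a minimal Flask wrapper could look like the following (route name, JSON schema and import path are my own assumptions for illustration, not part of the project):

from flask import Flask, jsonify, request
from extract_keras_bert_feature import KerasBertVector   # adjust to your package path, e.g. FeatureProject.bert.extract_keras_bert_feature

app = Flask(__name__)
bert_vector = KerasBertVector()   # load the BERT model once, at startup

@app.route('/encode', methods=['POST'])
def encode():
    # expects a JSON body like {"texts": ["sentence 1", "sentence 2"]}
    texts = request.get_json(force=True).get('texts', [])
    return jsonify({'vectors': bert_vector.bert_encode(texts)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)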

Hope this helps!

       

       


