Building a Semantic Search System Locally on Windows with PaddleNLP

Jin· 2024-09-08 00:01:03

I. Environment

Software:

python >= 3.8.16
paddlenlp == 2.5.2
paddlepaddle-gpu == 2.4.2.post112
paddleocr == 2.6.1.3
numpy == 1.24.3
opencv-contrib-python == 4.6.0.66
CUDA 11.2
cuDNN 8.2
Windows 11
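
Version mismatches are a common source of setup problems, so it can help to compare the installed packages against the list above before going further. The following is a minimal sketch using only the standard library; the `EXPECTED` table simply copies the pins from this article, and the helper names are made up for illustration.

```python
from importlib import metadata

# Pins copied from the list above; adjust them to your own setup.
EXPECTED = {
    "paddlenlp": "2.5.2",
    "paddlepaddle-gpu": "2.4.2.post112",
    "paddleocr": "2.6.1.3",
    "numpy": "1.24.3",
    "opencv-contrib-python": "4.6.0.66",
}


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected):
    """Map each package name to a tuple (expected, installed, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in compare_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {wanted}, found {found} [{status}]")
```

A mismatch is not necessarily an error; the script only reports where the local environment differs from the one used here.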

Hardware:

NVIDIA RTX 3050 4GB
11th Gen Intel® Core™ i7-11800H @ 2.30GHz

Installing dependencies:

First install PaddlePaddle by following the official PaddlePaddle installation guide.

Then install htbuilder from source:

git clone https://github.com/tvst/htbuilder.git
cd htbuilder/
python setup.py install

Download and install pipelines:

# Download the pipelines source code (the later steps run scripts from this directory)
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/pipelines

# Option 1: one-step install with pip
pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple

# Option 2: install the latest version from source (run inside PaddleNLP/pipelines)
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
python setup.py install

[Note] All of the following steps are run from the pipelines root directory; there is no need to change directories.

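II. Data Description

PaddleNLP ships a code example that builds a semantic search system on the DuReader-Robust dataset. You can try it out quickly with the following command (a GPU is recommended):

python examples/semantic-search/semantic_search_example.py --device gpu

On a CPU-only machine, install the CPU version of Paddle and pass cpu to the --device parameter. Expect a much longer run time:

python examples/semantic-search/semantic_search_example.py --device cpu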

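III. Building a Web-Based Semantic Search System

Starting the ANN service
1. Download elasticsearch-8.3.2 as described in the official documentation and unzip it.
2. Before starting the ES service, set xpack.security.enabled to false:
```
xpack.security.enabled: false
```
  If the setting is not already present, add it manually to elasticsearch.yml in the elasticsearch/config directory.
  
  Then double-click elasticsearch.bat in the bin directory to start the service.
3. Check that the ES service started successfully:
```
curl "http://localhost:9200/_aliases?pretty=true"
```
  Note: the ES service listens on port 9200 by default.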
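
If curl is not available in your terminal, the same check can be done from Python. This is a minimal sketch using only the standard library; it assumes security is disabled as configured in step 2, so plain HTTP without credentials works. The function names are illustrative.

```python
import json
import urllib.error
import urllib.request


def es_url(host="localhost", port=9200, path="/"):
    """Build a plain-HTTP URL for the local Elasticsearch instance."""
    return f"http://{host}:{port}{path}"


def es_is_up(host="localhost", port=9200, timeout=3.0):
    """Return True if a GET on the ES root endpoint answers with valid JSON."""
    try:
        with urllib.request.urlopen(es_url(host, port), timeout=timeout) as resp:
            json.load(resp)
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON answer all count as "not up".
        return False


if __name__ == "__main__":
    print("Elasticsearch reachable:", es_is_up())
```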
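Writing document data into the ANN index
```
# Build an ANN index, using the DuReader-Robust dataset as an example
python utils/offline_ann.py --index_name dureader_robust_query_encoder --doc_dir data/dureader_dev
```
Parameters:
- index_name: name of the index
- doc_dir: path to the txt data
- host: IP address of Elasticsearch
- port: port number of Elasticsearch
- delete_index: whether to delete the existing index and its data, used to clear the data in ES; defaults to false
Use the following command to inspect the data:
```
# Print a few records
curl http://localhost:9200/dureader_robust_query_encoder/_search
```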
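
The raw `_search` response is verbose because every record carries its stored fields. As a rough sketch, the snippet below fetches a few records and prints only each document ID with the names of its stored fields. It relies on the standard Elasticsearch response layout (`hits.hits[*]._id` and `_source`); the helper names are made up for illustration, and the stored field names depend on how the index was built.

```python
import json
import urllib.request


def summarize_hits(search_response, limit=3):
    """Return (document id, sorted field names) for the first few hits."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        (hit.get("_id"), sorted(hit.get("_source", {}).keys()))
        for hit in hits[:limit]
    ]


def fetch_sample(index, host="localhost", port=9200, size=3):
    """Fetch a handful of records from the index via the _search endpoint."""
    url = f"http://{host}:{port}/{index}/_search?size={size}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    response = fetch_sample("dureader_robust_query_encoder")
    for doc_id, fields in summarize_hits(response):
        print(doc_id, fields)
```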

Starting the REST API model service

# Set the YAML config file of the semantic search system, Linux/macOS
export PIPELINE_YAML_PATH=rest_api/pipeline/semantic_search.yaml
# Set the YAML config file of the semantic search system, Windows PowerShell
$env:PIPELINE_YAML_PATH='rest_api/pipeline/semantic_search.yaml'
# Start the model service on port 8891
python rest_api/application.py 8891

Once the service is running, verify it with curl:

curl -X POST -k http://localhost:8891/query -H 'Content-Type: application/json' -d '{"query": "衡量酒水的价格的因素有哪些?","params": {"Retriever": {"top_k": 5}, "Ranker":{"top_k": 5}}}'
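
Quoting a JSON body on the Windows command line can be awkward (cmd.exe, for example, does not treat single quotes as quoting characters), so sending the same request from Python may be easier. The sketch below mirrors the curl call above using only the standard library; the endpoint path and payload shape are taken from that command, the function names are illustrative, and the structure of the returned JSON is not assumed.

```python
import json
import urllib.request


def build_query_payload(query, retriever_top_k=5, ranker_top_k=5):
    """Build the JSON body used by the curl example above."""
    return {
        "query": query,
        "params": {
            "Retriever": {"top_k": retriever_top_k},
            "Ranker": {"top_k": ranker_top_k},
        },
    }


def post_query(payload, endpoint="http://localhost:8891"):
    """POST the payload to the /query route and return the decoded JSON reply."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    request = urllib.request.Request(
        f"{endpoint}/query",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = post_query(build_query_payload("衡量酒水的价格的因素有哪些?"))
    print(json.dumps(reply, ensure_ascii=False, indent=2))
```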

Starting the WebUI

# Set the model service address, Linux/macOS
export API_ENDPOINT=http://127.0.0.1:8891
# Set the model service address, Windows PowerShell
$env:API_ENDPOINT='http://127.0.0.1:8891'
# Start the WebUI on port 8502
python -m streamlit run ui/webapp_semantic_search.py --server.port 8502

You can then open http://127.0.0.1:8502 in your browser.

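Pitfalls encountered during setup

The semantic search system runs, but the terminal output is garbled. How do I fix it?

Set the operating system's default encoding to zh_CN.UTF-8 with the following command:
```
export LANG=zh_CN.UTF-8
```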
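
Note that `export` is a Unix shell command and is not available in PowerShell. As an alternative on Windows, Python's own output encoding can be switched to UTF-8, either by setting the `PYTHONUTF8=1` environment variable (Python 3.7+) before launching, or from inside a script. The sketch below shows the second option; `force_utf8` is a hypothetical helper, not part of pipelines, and whether it removes the garbling also depends on the terminal's own code page.

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure(); return its encoding."""
    if hasattr(stream, "reconfigure"):
        # TextIOWrapper.reconfigure() is available from Python 3.7 onwards.
        stream.reconfigure(encoding="utf-8")
    return getattr(stream, "encoding", None)


if __name__ == "__main__":
    force_utf8(sys.stderr)
    print("stdout encoding:", force_utf8(sys.stdout))
```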

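faiss is installed, but it still cannot be found. What should I do?

We recommend installing it separately with anaconda; see the faiss installation instructions.

# CPU-only version
conda install -c pytorch faiss-cpu

# GPU(+CPU) version
conda install -c pytorch faiss-gpu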

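How do I replace the preset models in pipelines?

After replacing the preset models, the index must be rebuilt and the related configuration updated, because the models are different. For semantic indexing, two places change: the arguments passed to utils/offline_ann.py and the file rest_api/pipeline/semantic_search.yaml. The service is then restarted.

First rebuild the index with utils/offline_ann.py, passing the new models (in PowerShell, replace each trailing \ with a backtick, or put the whole command on one line):

python utils/offline_ann.py --index_name dureader_robust_base_encoder \
                            --doc_dir data/dureader_dev \
                            --query_embedding_model rocketqa-zh-base-query-encoder \
                            --passage_embedding_model rocketqa-zh-base-para-encoder \
                            --embedding_dim 768 \
                            --delete_index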

Then edit rest_api/pipeline/semantic_search.yaml:

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore  # consider using MilvusDocumentStore or WeaviateDocumentStore for scaling to large number of documents
    params:
      host: localhost
      port: 9200
      index: dureader_robust_base_encoder # change the index name
      embedding_dim: 768   # change the embedding dimension
  - name: Retriever
    type: DensePassageRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 10
      query_embedding_model: rocketqa-zh-base-query-encoder  # change the Retriever's query model name
      passage_embedding_model: rocketqa-zh-base-para-encoder # change the Retriever's passage model name
      embed_title: False
  - name: Ranker       # custom-name for the component; helpful for visualization & debugging
    type: ErnieRanker    # pipelines Class name for the component
    params:
      model_name_or_path: rocketqa-base-cross-encoder  # change the ErnieRanker model name
      top_k: 3

Then restart the service:

# Set the YAML config file of the semantic search system
export PIPELINE_YAML_PATH=rest_api/pipeline/semantic_search.yaml
# Start the model service on port 8891
python rest_api/application.py 8891

Elasticsearch logs show the error `exception during geoip databases update`

Edit config/elasticsearch.yml and add the following line at the end:

ingest.geoip.downloader.enabled: false

On Windows, the front end fails with `requests.exceptions.MissingSchema: Invalid URL 'None/query': No scheme supplied. Perhaps you meant http://None/query?`

The environment variables did not take effect. Check that both PIPELINE_YAML_PATH and API_ENDPOINT are set.

To set them, open a PowerShell terminal and enter:

$env:PIPELINE_YAML_PATH='rest_api/pipeline/semantic_search.yaml'
$env:API_ENDPOINT='http://127.0.0.1:8891'
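
Variables set with `$env:` only apply to the current PowerShell session, so they must be set in the same window that launches the service or the WebUI. A quick way to confirm what a Python process actually sees is a small check like the one below; it is a sketch, and `missing_env_vars` is a made-up helper name.

```python
import os

# The two variables this article relies on.
REQUIRED = ("PIPELINE_YAML_PATH", "API_ENDPOINT")


def missing_env_vars(environ=None, required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set in this session:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run it from the same terminal session in which you intend to start `rest_api/application.py` or streamlit.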

On Windows, running the application produces the following error: `RuntimeError: (NotFound) Cannot open file C:\Users\my_name/.paddleocr/whl\det\ch\ch_PP-OCRv3_det_infer/inference.pdmodel, please confirm whether the file is normal.`

This happens when the Windows user name contains Chinese characters. For a detailed fix, see the issue at https://github.com/PaddlePaddle/PaddleNLP/issues/3242

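The back-end program fails with: `Exception: Failed loading pipeline component 'DocumentStore': RequestError(400, 'illegal_argument_exception', 'Mapper for [embedding] conflicts with existing mapper: Cannot update parameter [dims] from [312] to [768]')`

Taking semantic search as an example, this error is caused by a mismatch in embedding dimensions. Check whether the dimension of the vectors stored in Elasticsearch matches the embedding_dim configured for DocumentStore in semantic_search.yaml. If they differ, rebuild the index with utils/offline_ann.py. In short, make sure the models used to build the index are the same as the models configured in semantic_search.yaml.

Note: before rebuilding the index after such a change, delete the old index completely, otherwise the rebuild has no effect. You can run the following command:

curl -XDELETE http://localhost:9200/dureader_robust_query_encoder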
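
To see which dimension an existing index was created with, its mapping can be read from Elasticsearch and compared with the configured embedding_dim. The sketch below assumes the vector field is called `embedding`, as the error message above suggests, and relies on the standard `_mapping` response layout; the helper names are illustrative.

```python
import json
import urllib.request


def embedding_dims(mapping_response, index, field="embedding"):
    """Extract the dims of a vector field from a _mapping response, or None."""
    properties = (
        mapping_response.get(index, {}).get("mappings", {}).get("properties", {})
    )
    return properties.get(field, {}).get("dims")


def fetch_mapping(index, host="localhost", port=9200):
    """Read the mapping of an index from a local Elasticsearch instance."""
    url = f"http://{host}:{port}/{index}/_mapping"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    index_name = "dureader_robust_query_encoder"
    configured_dim = 768  # the embedding_dim value from semantic_search.yaml
    stored_dim = embedding_dims(fetch_mapping(index_name), index_name)
    print(f"index stores {stored_dim} dims, config expects {configured_dim}")
```

If the two numbers differ, the index has to be deleted and rebuilt as described above.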
