2023-07-21


Internship at China Construction Bank (CCB)

7-17

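Morning: checked in, held the election for cohort representatives, and signed the internship agreement.
Afternoon: sessions on HR policy and career development, an overview of the Beijing Branch and CCB's history, and introductions to the rural finance business and the CCB Lifestyle (建行生活) business.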

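CCB's three major strategies: housing rental, inclusive finance, and fintech.

The CCB Lifestyle platform works without middlemen. There is also a research center, which serves as a learning platform.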

7-18

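Morning: sessions on personal banking, corporate banking, and inclusive finance.

Personal banking: performance is assessed on metrics such as new daily-average balances, new point-in-time balances, and deposit balances. A bank's income comes mainly from the interest spread between deposits and loans. Products covered: funds (managed by professional investors, so the odds of a profit are higher); treasury bonds (presented as a wealth-management product that does not guarantee principal); precious metals (gold and silver jewelry, gold banknotes given as New Year gifts); digital RMB (works much like Alipay or WeChat Pay, but is state-regulated); pensions (social security coverage and pension accounts).

Housing rental: CCB can provide housing for employees at below the average market price, and partnerships with many companies are already in place.

Inclusive finance: built on the principle of equal opportunity, it provides appropriate and effective services to all social groups.
Afternoon: introductions to urban renewal and housing rental, sci-tech innovation finance, and green finance.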

7-19

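Morning: reported to the Fintech Department, collected meal vouchers, and toured the canteen and gym.
Afternoon: went on a rural field trip to Shuiyuzui Village in Mentougou District, near the Jingxi Ancient Road. We saw the financial services and financial training CCB provides there and learned about the village's history and development path: it first grew through coal mining, and after the state began promoting environmental protection it revitalized itself through tourism, building up the Jingxi Ancient Road site, a military-themed bar, and more.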

7-20

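Morning: reported to the department and listened to the manager's briefing. The Fintech Department consists mainly of the Transaction Development Section and the Management Development Section; the latter is headed by He Tao. I was assigned a task: get familiar with ChatGLM and the Wenda (闻达) framework, which later development work will build on.
Afternoon: the head of the Transaction Development Section gave a talk. According to the talk, state-owned enterprises use domestically produced Loongson CPUs and matching operating systems, and bank technology departments focus on several areas: artificial intelligence, big data, blockchain, cloud computing, mobile internet, and the Internet of Things.

CCB spun off part of its technology department into a separate company, CCB Fintech (建信金科). The Transaction Development Section mainly handles development for fund transactions and for deposits and withdrawals. It acts mostly as the client side (甲方), responsible for overall planning and coordination; the department has only a dozen or so developers, and many modules are outsourced. Projects still follow a complete process, similar to a standard software engineering workflow.
Evening: installed ChatGLM, mainly following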

手把手教你本地部署清华大学的ChatGLM-6B模型——Windows+6GB显卡本地部署 | 数据学习者官方网站(Datalearner)

For installing CUDA and PyTorch, I followed

Pytorch环境配置（anaconda安装+独显+CUDA+cuDNN)_anaconda换源安装cuda_睡不醒的凉白开的博客-CSDN博客

The model used is chatglm-6b-int4.

After installing it I tried it out; the answer quality felt fairly mediocre.

7-21

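Morning: kept trying the ChatGLM demos. The two image-detection demos did not run; the rest worked. The mentor then walked us through the 百发百中 bidding platform project for the Beijing Branch. The main requirements are:
1. Statistics: number of bid documents generated, with details; project search click counts per branch and per person.
2. Configuration: support uploading intelligent Q&A statements from the admin backend.
3. Bid document generation: sort the first-, second-, and third-level categories the user selects.
4. Standard Q&A: some inputs have standard outputs. As I understand it, the system searches the database first; if matches are found it returns the five best ones, and otherwise it calls the model, which is the intelligent Q&A path.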
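
Requirement 4 amounts to a lookup with a fallback. The following is only a minimal sketch of that idea, not the platform's code: an in-memory dict stands in for the database, `difflib.SequenceMatcher` stands in for whatever matching the real system uses, `ask_model` is a hypothetical placeholder for the model call, and the 0.6 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def answer_question(
    question: str,
    standard_qa: Dict[str, str],
    ask_model: Callable[[str], str],
    top_k: int = 5,
    threshold: float = 0.6,
) -> List[str]:
    """Return up to top_k stored answers, or fall back to the model."""
    scored = []
    for stored_question, stored_answer in standard_qa.items():
        # Similarity in [0, 1] between the user question and a stored question.
        score = SequenceMatcher(None, question, stored_question).ratio()
        if score >= threshold:
            scored.append((score, stored_answer))
    if scored:
        # Best matches first; keep at most top_k of them.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [answer for _, answer in scored[:top_k]]
    # Nothing stored is close enough: hand the question to the model.
    return [ask_model(question)]
```

Calling `answer_question(q, qa, ask_model)` with a question close to a stored one returns the stored answers ranked by similarity; an unrelated question returns a one-element list holding the model's reply.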

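Afternoon: modified the code to implement requirement 3, sorting the first-, second-, and third-level categories the user selects.

Main implementation steps:

Added an @click handler on top of the existing code. The handler calls the sort function, which takes two arguments: the first is the item that was clicked, and the second, items[20], only holds the running count, i.e. how many clicks have happened so far.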

<v-row v-for="i in 5">
	<v-col cols="12" sm="2">
		<v-switch hide-details color="purple" inset v-model="items[(i-1)*4].val" @click="sort(items[(i-1)*4], items[20])" :label="items[(i-1)*4].lab">
		</v-switch>
	</v-col>
	<v-col cols="12" sm="2">
		<v-switch hide-details color="purple" inset v-model="items[(i-1)*4+1].val" @click="sort(items[(i-1)*4+1], items[20])" :label="items[(i-1)*4+1].lab">
		</v-switch>
	</v-col>
	<v-col cols="12" sm="2">
		<v-switch hide-details color="purple" inset v-model="items[(i-1)*4+2].val" @click="sort(items[(i-1)*4+2], items[20])" :label="items[(i-1)*4+2].lab">
		</v-switch>
	</v-col>
	<v-col cols="12" sm="2">
		<v-switch hide-details color="purple" inset v-model="items[(i-1)*4+3].val" @click="sort(items[(i-1)*4+3], items[20])" :label="items[(i-1)*4+3].lab">
		</v-switch>
	</v-col>
</v-row>

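The sort function is really just a counter: the later an item is clicked, the larger its id.

// item: the switch that was clicked; counter: items[20], which only holds the running count
sort = (item, counter) => {
	item.id = counter.id         // stamp the item with the current click number
	console.log(item.id)
	counter.id = counter.id + 1  // advance the counter for the next click
	console.log(counter.id)
}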

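Then a bubble sort swaps the items by id, so the printed order follows the click order. The last element (the counter) is left out of the sort, hence the `- 2` bounds.

for(let i = 0; i < items.length - 2; i++) {
	for(let j = 0; j < items.length - i - 2; j++) {
		if(items[j].id > items[j + 1].id) {
			let tmp = items[j]  // declare tmp locally instead of leaking a global
			items[j] = items[j + 1]
			items[j + 1] = tmp
		}
	}
}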
for(let i = 0; i < items.length; i++) {
	console.log(items[i].id)
}
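
The same idea, stamping each item with a click counter and then ordering by that stamp, can be sketched in Python. This is an illustration under assumptions rather than the project code: `Item`, `ClickOrderTracker`, and their fields are hypothetical names, and the built-in `sorted` replaces the hand-written bubble sort.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    label: str
    selected: bool = False
    order: int = 0  # click stamp; 0 means "never clicked"


class ClickOrderTracker:
    def __init__(self, labels: List[str]) -> None:
        self.items = [Item(label) for label in labels]
        self._next_order = 1  # plays the role of the counter item

    def click(self, label: str) -> None:
        """Toggle the switch and stamp it with the current click number."""
        for item in self.items:
            if item.label == label:
                item.selected = not item.selected
                item.order = self._next_order
                self._next_order += 1
                return
        raise KeyError(label)

    def selected_in_click_order(self) -> List[str]:
        """Labels of the switched-on items, earliest click first."""
        chosen = [item for item in self.items if item.selected]
        return [item.label for item in sorted(chosen, key=lambda it: it.order)]
```

Keeping the counter outside the item list avoids the special case of skipping the last element during sorting.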

7.24

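Morning: modified the code and added a sort button. The updated code sorts the selections in the order they were chosen and then generates the file.

The new sort button calls sort_all, which sorts all the elements.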

<v-btn color="purple" dark @click="sort_all()">排序</v-btn>
<v-btn color="purple" dark @click="report_1_all_v2(report_1_bankname,test_zsk_step)">生成</v-btn>
<v-btn color="purple" dark @click="clear_1_all()">清空</v-btn>
<v-btn color="purple" dark @click="report_1_export_v2(report_1_bankname)">导出</v-btn>

Sorting implementation:

// Bubble sort by id, in place; the last element is skipped, as in the earlier version.
const bubble_sort_by_id = (items) => {
	for(let i = 0; i < items.length - 2; i++) {
		for(let j = 0; j < items.length - i - 2; j++) {
			if(items[j].id > items[j + 1].id) {
				const tmp = items[j]
				items[j] = items[j + 1]
				items[j + 1] = tmp
			}
		}
	}
}

sort_all = async () => {
	bubble_sort_by_id(app.items1)  // first-level headings
	bubble_sort_by_id(app.items2)  // second-level headings
	app.report_1_yd = []
}

items is now split into items1 and items2, which hold the first-level and second-level headings respectively.

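Afternoon: the mentor introduced the SMP 2023 ChatGLM Financial LLM Challenge. Contestants must build a question-answering system centered on the ChatGLM2-6B model that answers users' finance-related questions; no other large language model may be used. Contestants may fine-tune the model on other publicly accessible external data and may use techniques such as vector databases.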

7.25

Morning: added file upload.

First, added an upload button:

<v-col>
<v-btn color="purple" dark @click="sort_all()">排序</v-btn>
<v-btn color="purple" dark @click="report_1_all_v2(report_1_bankname,test_zsk_step)">生成</v-btn>
<v-btn color="purple" dark @click="clear_1_all()">清空</v-btn>
<v-btn color="purple" dark @click="report_1_export_v2(report_1_bankname)">导出</v-btn>
<v-btn color="purple" dark @click="upload_data()">上传</v-btn>
</v-col>

Then defined the upload_data() method:

upload_data = async () => {
	const filePromise = new Promise((resolve, reject) => {
	  const fileInput = document.createElement('input');
	  fileInput.type = 'file';
	  fileInput.style.display = 'none';
	  fileInput.addEventListener('change', () => {
		if (fileInput.files.length > 0) {
		  resolve(fileInput.files[0]);
		} else {
		  reject(new Error('No file selected.'));
		}
	  });
  
	  // Trigger the file input dialog
	  document.body.appendChild(fileInput);
	  fileInput.click();
  
	  // Clean up the file input after the user selects a file or cancels the dialog
	  fileInput.addEventListener('focusout', () => {
		document.body.removeChild(fileInput);
	  });
	});
  
	try {
	  // Wait for the user to select a file
	  const selectedFile = await filePromise;
  
	  // Get the original file name and type
	  //const originalFileName = selectedFile.name;
	  const originalFileName = encodeURIComponent(selectedFile.name);
	  const originalFileType = selectedFile.type;
  
	  // Upload endpoint of the local Node.js server (port 8888).
	  const uploadEndpoint = 'http://localhost:8888/upload';
  
	  // Create a FormData object and append the selected file to it with the original name and type
	  const formData = new FormData();
	  formData.append('file', selectedFile, originalFileName);
  
	  // Use fetch or any other method to perform the file upload.
	  const response = await fetch(uploadEndpoint, {
		method: 'POST',
		body: formData,
	  });
  
	  // Handle the response from the server if needed
	  if (response.ok) {
		console.log('File uploaded successfully.');
	  } else {
		console.error('File upload failed:', response.status, response.statusText);
	  }
	} catch (error) {
	  // Handle any errors that occurred during the file selection or upload process
	  console.error('Error during file upload:', error.message);
	}
}

Set up a local server with Node.js on port 8888 and defined a POST route and a GET route:

const express = require('express');
const multer = require('multer'); // npm install multer
const path = require('path');
const cors = require('cors'); // npm install cors
const fs = require('fs');

const app = express();
const storage = multer.diskStorage({
  destination: 'uploads/',
  filename: (req, file, cb) => {
    //const uniqueSuffix = Date.now() + '-' + Math.round(Math.random() * 1E9);
    //const originalExtension = path.extname(file.originalname);
    //const fileName = file.fieldname + '-' + uniqueSuffix + originalExtension;
    //const utf8FileName = file.originalname;
    // basename() drops any directory part of the client-supplied name
    const utf8FileName = path.basename(decodeURIComponent(file.originalname));
    cb(null, utf8FileName);
    //cb(null, file.originalname);
  },
});

//const upload = multer({ dest: 'uploads/' }); // set the upload directory
const upload = multer({ storage });
// Use the cors middleware to allow cross-origin requests
app.use(cors());

// Handle file upload requests
app.post('/upload', upload.single('file'), (req, res) => {
  if (!req.file) {
    return res.status(400).send('No file uploaded.');
  }

  // Post-process the uploaded file here if needed.
  // With diskStorage, multer has already written it to uploads/;
  // req.file.path is its path on the server, and fs can move it elsewhere.

  res.status(200).send('File uploaded successfully');
});

// GET route that lists all files in the upload directory
app.get('/listFiles', (req, res) => {
  const directoryPath = 'uploads/'; // directory to list

  fs.readdir(directoryPath, (err, files) => {
    if (err) {
      return res.status(500).send('Error listing files.');
    }

    // Return the list of file names in the response
    res.status(200).json({ files });
  });
});

const port = 8888;
app.listen(port, () => {
  console.log(`Server running on http://localhost:${port}`);
});
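
The client percent-encodes the file name and the server decodes it, which lets non-ASCII names survive the multipart upload. Below is a small Python sketch of that round trip. It is an illustration only: `encode_filename` and `decode_filename` are hypothetical helpers, `urllib.parse.quote` is used as a stand-in for JavaScript's `encodeURIComponent`, and stripping the directory part is an added safeguard rather than something the original server did.

```python
import posixpath
from urllib.parse import quote, unquote

# Punctuation that encodeURIComponent leaves unescaped
# (ASCII letters and digits are always kept as well).
_UNESCAPED = "-_.!~*'()"


def encode_filename(name: str) -> str:
    """Percent-encode a file name roughly the way encodeURIComponent does."""
    return quote(name, safe=_UNESCAPED)


def decode_filename(encoded: str) -> str:
    """Decode the name and drop any directory part before saving."""
    decoded = unquote(encoded).replace("\\", "/")
    return posixpath.basename(decoded)
```

Decoding followed by `basename` means a crafted name such as `../../etc/passwd` is reduced to `passwd` instead of escaping the upload directory.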

Afternoon: tested the pdf2txt tool and started installing the Proxima database.

7.26

Morning: continued installing the Proxima database and tested it. For the installation I followed the GitHub repository and the install guide:

GitHub - alibaba/proxima

安装指南 | Proxima (proximabilin.github.io)

First, start the database server with Docker:

sudo docker run -d --name proxima_be -p 16000:16000 -p 16001:16001 -v $HOME/proxima-be/conf:/var/lib/proxima-be/conf -v $HOME/proxima-be/data:/var/lib/proxima-be/data -v $HOME/proxima-be/log:/var/lib/proxima-be/log ghcr.io/proximabilin/proxima-be /var/lib/proxima-be/bin/mysql_repository --config /var/lib/proxima-be/conf/mysql_repo.conf

Then interact with the database through the Python pyproximabe library:

from pyproximabe import *

# Init client
client = Client('127.0.0.1', 16000)

# Init index column: define the vector index column
index_column = IndexColumnParam(name='ImageVector',
                                dimension=8,
                                data_type=DataType.VECTOR_FP32,
                                index_type=IndexType.PROXIMA_GRAPH_INDEX)
# Init collection config: create a collection named Plants with an ImageVector index column and Price and Description forward columns
collection_config = CollectionConfig('Plants',
                                     index_column_params=[index_column],
                                     max_docs_per_segment=0,
                                     forward_column_names=['Price','Description'])
# Create collection
status = client.create_collection(collection_config)

# Check Return
# print(status)

# Get collection info
status, collection_info = client.describe_collection('Plants')

# print(status)
print(collection_info)

# Set record data format: ImageVector is an 8-dimensional FP32 vector, Price is a float, Description is a string
index_column_meta = WriteRequest.IndexColumnMeta(name='ImageVector',
                                                 data_type=DataType.VECTOR_FP32,
                                                 dimension=8)
row_meta = WriteRequest.RowMeta(index_column_metas=[index_column_meta],
                                forward_column_names=['Price','Description'],
                                forward_column_types=[DataType.FLOAT, DataType.STRING])
# Send 100 records: insert 100 rows into the collection
rows = []
for i in range(0, 100):
  vector = [i+0.1, i+0.2, i+0.3, i+0.4, i+0.5, i+0.6, i+0.7, i+0.8]
  price = i + 0.1
  description = "ginkgo tree with number " + str(i)
  row = WriteRequest.Row(primary_key=i,
                         operation_type=WriteRequest.OperationType.INSERT,
                         index_column_values=[vector],
                         forward_column_values=[price, description])
  rows.append(row)

write_request = WriteRequest(collection_name='Plants',
                             rows=rows,
                             row_meta = row_meta)
status = client.write(write_request)
print(status)

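# Query: compare the query vector with the stored vectors and return the 5 nearest;
# the distance is the sum of squared per-dimension differences (squared Euclidean distance)
query_vector = [5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8]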
status, knn_res = client.query(collection_name='Plants',
                               column_name='ImageVector',
                               features=query_vector,
                               data_type=DataType.VECTOR_FP32,
                               topk = 5)
print(status)
print(knn_res)

# Get collection statistics
status, collection_stats = client.stats_collection('Plants')

print(status)
# print(collection_stats)

# Drop collection
# status = client.drop_collection('Plants')

# print(status)
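
To sanity-check what the query should return, the same 100 vectors can be ranked by brute force. This is a plain-Python sketch that assumes squared Euclidean distance, as the query comment describes; `squared_l2` and `brute_force_topk` are hypothetical helper names, and a graph index may return approximate results, so small differences from the database output are possible.

```python
from typing import Dict, List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Sum of squared per-dimension differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_topk(
    vectors: Dict[int, List[float]], query: List[float], k: int = 5
) -> List[Tuple[int, float]]:
    """Return the k (primary_key, distance) pairs closest to the query."""
    scored = [(key, squared_l2(vec, query)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Rebuild the same 100 records that the insert loop writes.
offsets = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
vectors = {i: [i + off for off in offsets] for i in range(100)}
query = [5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8]

top5 = brute_force_topk(vectors, query)
print([key for key, _ in top5])
```

The query equals record 5, so that key comes first, followed by its neighbours 4 and 6 and then 3 and 7.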

Afternoon: studied a similarity search tool. LangChain can build a vector store from text, and a query can then be matched against it by similarity.

import glob

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS

# Local path of the Chinese text embedding model
embedding_model_name = "E:\\ChatGLM-6B-main\\ChatGLM-6B-main\\THUDM\\text2vec-large-chinese"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Read every line of every txt file
rows = []
txt_files = glob.glob("E:\\chatglm_llm_fintech_raw_dataset\\alltxt_inside\\*.txt")
for file_path in txt_files:
    with open(file_path, "r", encoding="utf-8") as f:
        rows.extend(f.read().split("\n"))

# Deduplicate the lines and wrap each one in a Document
docs = []
for idx, row in enumerate(set(rows)):
    docs.append(Document(page_content=row, metadata={"source": f"doc_id_{idx}"}))

# Build the vector store and save it to disk
vector_store = FAISS.from_documents(docs, embeddings)
vector_store.save_local('cache/')

# Similarity query: "Who is the general manager of Shanghai Construction Group?"
query_text = "上海建工集团的总经理是谁"
print(vector_store.similarity_search(query_text))

This example loads a vector store that has already been built.

import faiss
import pickle

# # Inspect the index.faiss file
# index_faiss = faiss.read_index("E:\\chatglm_llm_fintech_raw_dataset\\cache\\index.faiss")
# print(index_faiss)

# Inspect the index.pkl file
# with open("E:\\chatglm_llm_fintech_raw_dataset\\cache\\index.pkl", 'rb') as f:
#     index_metadata = pickle.load(f)
# print(index_metadata)
from langchain.vectorstores import FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
embedding_model_name = "E:\\ChatGLM-6B-main\\ChatGLM-6B-main\\THUDM\\text2vec-large-chinese"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Load the vector store
vector_store = FAISS.load_local('cache/', embeddings)

# Query: "What was China Construction Bank's profit in 2019?"
query_text = "中国建设银行2019年利润是多少？"
print(vector_store.similarity_search(query_text))

7.27

Morning: wrote a program that splits the extracted text by section and writes one txt file per section.

import pdfplumber
import os
import glob

# Extract the section titles from the PDF (words that start with '第' and end with '节')
def extract_table_of_contents(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        table_of_contents = []
        for page in pdf.pages:
            for element in page.extract_words():
                if element['text'].startswith('第') and element['text'].endswith("节"):
                    table_of_contents.append(element['text'])
    return table_of_contents

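# Split the PDF content by section and save each section to a txt file
def split_content(pdf_path, table_of_contents):
    if not table_of_contents:
        return  # no section titles found, nothing to split
    with pdfplumber.open(pdf_path) as pdf:
        content_sections = {}
        for title in table_of_contents:
            content_sections[title] = ""
        current_title = table_of_contents[0]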

        pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
        output_folder = os.path.join('output', pdf_filename)
        os.makedirs(output_folder, exist_ok=True)
        
        for page in pdf.pages:
            page_text = page.extract_text() or ""  # extract_text() can return None for pages without text
            for title in table_of_contents:
                if title in page_text:
                    current_title = title
                    break
            content_sections[current_title] += page_text
        
        # Write each section to its own txt file
        for title, content in content_sections.items():
            output_file_path = os.path.join(output_folder, f'{title}.txt')
            with open(output_file_path, 'w', encoding='utf-8') as output_file:
                output_file.write(content)

if __name__ == "__main__":
    pdf_files = glob.glob("E:\\chatglm_llm_fintech_raw_dataset\\allpdf2\\*.pdf")
    for pdf_file in pdf_files:
        example_chapter_titles = extract_table_of_contents(pdf_file)
        split_content(pdf_file, example_chapter_titles)

A pdf2table program that extracts every table from a PDF and writes it to an Excel workbook.

import glob
import pdfplumber
import re
import xlwt
import pandas as pd
from multiprocessing import  Process

def check_lines(page, top, buttom):
    lines = page.extract_words()[::]
    text = ''
    last_top = 0
    last_check = 0
    for each_line in lines:
        if top == '' and buttom == '':
            if abs(last_top - each_line['top']) <= 2:
                text = text + each_line['text']
            elif last_check > 0 and not re.search('(?:。|；|\d|报告全文)$', text):
                text = text + each_line['text']
            else:
                text = text + '\n' + each_line['text']

        elif top == '':
            if each_line['top'] > buttom:
                if abs(last_top - each_line['top']) <= 2:
                    text = text + each_line['text']
                elif last_check > 0 and not re.search('(?:。|；|\d|报告全文)$', text):
                    text = text + each_line['text']
                else:
                    text = text + '\n' + each_line['text']
        else:
            if each_line['top'] < top and each_line['top'] > buttom:
                if abs(last_top - each_line['top']) <= 2:
                    text = text + each_line['text']
                elif last_check > 0 and not re.search('(?:。|；|\d|报告全文)$', text):
                    text = text + each_line['text']
                else:
                    text = text + '\n' + each_line['text']
        last_top = each_line['top']
        last_check = each_line['x1'] - page.width * 0.85

    return text

def change_pdf_to_txt(name, save_path):
    pdf = pdfplumber.open(name)

    save_path_table = save_path + '\\table_' + name.split('\\')[-1].replace('.pdf', '_table.xlsx')

    writer = pd.ExcelWriter(save_path_table,engine='xlsxwriter')

    all_text = {}
    allrow = 0
    
    for i in range(len(pdf.pages)):
        page = pdf.pages[i]
        buttom = 0
        tables = page.find_tables()
        if len(tables) >= 1:
            id = 0
            # worksheet = workbook.add_sheet("Sheet %d" % (i+1))
            count = len(tables)
            row = 0
            for table in tables:
                if table.bbox[3] < buttom:
                    pass
                else:
                    count = count - 1
                
                table_content = table.extract()
                
                table_df = pd.DataFrame(table_content[1:], columns=table_content[0])
                
                table_df.to_excel(writer, sheet_name="Sheet_%d_%d" % ((i+1),id), index=False)
                id = id + 1
    
    writer.close()  # close() also saves the workbook; ExcelWriter.save() was removed in pandas 2.0
    

def loop_file(name_list, file_names, save_path):
    for file_name in file_names:
        print(file_name)
        name_list.append(file_name)
        allname = file_name.replace('\\', '/').split('/')[-1]  # file name without its directory, for either path separator
        date = allname.split('__')[0]
        name = allname.split('__')[1]
        year = allname.split('__')[4]
        
        print(date, name, year)
        change_pdf_to_txt(file_name, save_path)


if __name__ == '__main__': 
    # Folder paths
    # folder_path = '../data'
    folder_path = "E:\\chatglm_llm_fintech_raw_dataset\\allpdf2"
    # save_path = '../txt'
    save_path = 'E:\\chatglm_llm_fintech_raw_dataset\\table'
    # Collect every file in the folder
    file_names = glob.glob(folder_path + '\\*')
    file_names = sorted(file_names, reverse=True)
    # Print the file names
    name_list = []
    print(file_names)

    np = 2
    # print("total = %d" % len(file_names))
    num_per_process = (int)(len(file_names) / np)
    res = len(file_names) % np
    
    print( "num_per_process = %d" % num_per_process)
    print ("res = %d" % res)
    
    # print(file_names[0:num_per_process+res])
    
    start = 0
    
    process_list = []
    for i in range(np):
        if i < res:
            # print("process %d" % i+"[ %d, %d ]" % (i*(num_per_process+1), min(len(file_names), (i+1)*(num_per_process+1))))
            p = Process(target=loop_file, args=(name_list, file_names[i*(num_per_process +1)+start:min(len(file_names), (i+1)*(num_per_process+1))], save_path))
        else : 
            # print("process %d" % i+"[ %d, %d ]" % (i*num_per_process+res, min(len(file_names), (i+1)*num_per_process+res)))
            p = Process(target=loop_file, args=(name_list, file_names[i*num_per_process+res+start:min(len(file_names), (i+1)*num_per_process+res)], save_path))
        
        p.start()
        process_list.append(p)

    # Wait for all worker processes to finish
    for p in process_list:
        p.join()

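Some tables come out correctly, but in others the text of a single cell that wraps over several lines ends up split across different cells. This is not solved yet.

Afternoon: spent a long time on handling line breaks in the txt output and the tables, without success.

I also tried extracting the most frequent Chinese words from the text, as a basis for building a financial database. The current code splits words too finely and needs improvement; a custom dictionary could help.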

  import json
  import re
  import jieba
  from collections import Counter
  
  def extract_chinese_words(text):
      # Extract runs of Chinese characters with a regular expression
      chinese_words = re.findall(r'[\u4e00-\u9fff]+', text)
      return chinese_words
  
  def load_custom_dictionary():
      # Load the custom dictionary
      jieba.load_userdict("E:\\chatglm_llm_fintech_raw_dataset\\custom_dict.txt")
  
  def segment_chinese_text(text):
      # Segment Chinese text with jieba
      seg_list = jieba.cut(text)
      return " ".join(seg_list)
  
  def get_most_common_chinese_words(json_path, top_n=1000):
      all_text = ""
      with open(json_path, 'r', encoding='utf-8') as file:
          for line in file:
              data = json.loads(line)
              all_text += segment_chinese_text(data['question']) + " "
  
      # Extract the Chinese words
      chinese_words = extract_chinese_words(all_text)
  
      # Count how often each word occurs
      word_count = Counter(chinese_words)
  
      # Take the most frequent words
      most_common_words = word_count.most_common(top_n)
  
      return most_common_words
  
  if __name__ == "__main__":
      json_file_path = "E:\\chatglm_llm_fintech_raw_dataset\\test_questions.jsonl"
      load_custom_dictionary()  # load the custom dictionary
      most_common_words = get_most_common_chinese_words(json_file_path)
  
      print("Most frequent Chinese words:")
      with open("topk.txt", "w", encoding='utf-8') as outfile:
          for word, frequency in most_common_words:
              print(f"{word}: {frequency}")
              outfile.write(f"{word}: {frequency}\n")
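
Why a custom dictionary produces coarser tokens can be shown with a much simpler segmenter than jieba. The sketch below uses forward maximum matching, a swapped-in technique chosen for illustration only; jieba's real algorithm is more sophisticated, and `forward_max_match` is a hypothetical helper.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)  # single characters are the fallback
                i += size
                break
    return tokens


base_dict = {"营业", "成本"}
finance_dict = base_dict | {"营业成本"}  # one added domain term

print(forward_max_match("营业成本是多少", base_dict))
print(forward_max_match("营业成本是多少", finance_dict))
```

With only the base dictionary the term 营业成本 is split into 营业 and 成本; once the domain term is added it stays whole, which is the effect a financial word list is meant to have on the frequency counts.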
  



### 7.28

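- Morning: searched for financial word lists and put their entries into `custom_dict.txt`, which makes word matching more accurate; then maintained the financial statements.

- Afternoon: classified the questions into those that mention a year and those that do not (open questions), then used ChatGLM to answer the open questions.

  First, split the questions into 2019, 2020, 2021, and open questions.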

  ```python
  import json
  import glob
  
  def json_output(file_name, json_dict):
      with open(file_name, 'w', encoding='utf-8', newline='') as fw:
          for item in json_dict:
              json.dump(item, fw, ensure_ascii=False, indent=None)
              print("", file=fw)
      
      
  if __name__ == '__main__': 
      
      # Folder paths
      # folder_path = '../data'
      # folder_path = "../chatglm_llm_fintech_raw_dataset/allpdf"
      
      save_path = "E:\\chatglm_llm_fintech_raw_dataset\\dataset"
      
      
      test_questions = open(save_path+"\\test_questions.jsonl", "r", encoding="utf-8").readlines()
      
      question_open = []
      question_2019 = []
      question_2020 = []
      question_2021 = []
      
      
      for test_question in test_questions:
          id = json.loads(test_question)["id"]
          question = json.loads(test_question)["question"]
          item = {"id":id, "question":question}
          
          if "2019" in question:
              question_2019.append(item)
          elif  "2020" in question:
              question_2020.append(item)
          elif  "2021" in question:
              question_2021.append(item)
          else:
              question_open.append(item)
          
      
      # print(question_2019)
      print("question include 2019 : %d" % len(question_2019))
      json_output(save_path+"\\question_2019.jsonl", question_2019)
  
      # print(question_2020)
      print("question include 2020 : %d" % len(question_2020))
      json_output(save_path+"\\question_2020.jsonl", question_2020)
      
      # print(question_2021)
      print("question include 2021 : %d" % len(question_2021))
      json_output(save_path+"\\question_2021.jsonl", question_2021)
      
      # print(question_open)
      print("open question : %d" % len(question_open))
      json_output(save_path+"\\question_open.jsonl", question_open)
      
      print("total questions %d" % (len(question_2019)+len(question_2020)+len(question_2021)+len(question_open)))
  ```

Then, for the open questions, call the model directly to generate the answers:

from transformers import AutoTokenizer, AutoModel
import json

# Load the model and tokenizer (raw strings keep the backslash in the Windows path intact)
tokenizer = AutoTokenizer.from_pretrained(r"THUDM\chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained(r"THUDM\chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

# Read the questions from a JSON Lines file
def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions


# Save the questions and answers, one JSON object per line
def save_answers_to_json(answers, json_file_path):
    with open(json_file_path, 'w', encoding='utf-8') as file:
        for answer in answers:
            json.dump(answer, file, ensure_ascii=False, indent=None)
            print("", file=file)

if __name__ == '__main__':
    # Read the open questions
    test_questions = read_questions_from_json("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\question_open.jsonl")

    # Answer each question with the model
    answers = []
    for item in test_questions:

        question_id = item["id"]
        question_text = item["question"]

        # Ask the model
        response, _ = model.chat(tokenizer, question_text, history=[])
        response = response.replace("\n\n", "")

        # Store the question and answer as a dict
        answer_dict = {
            "id": question_id,
            "question": question_text,
            "answer": response
        }
        answers.append(answer_dict)
        # Rewrite answers.json after every answer so progress is kept if the run is interrupted
        save_answers_to_json(answers, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\answers.json")

For the remaining questions, only convert the format for now and leave the answer field empty:

from transformers import AutoTokenizer, AutoModel
import json

# Load the model and tokenizer (disabled: no model call is needed for this step)
# tokenizer = AutoTokenizer.from_pretrained(r"THUDM\chatglm-6b-int4", trust_remote_code=True)
# model = AutoModel.from_pretrained(r"THUDM\chatglm-6b-int4", trust_remote_code=True).half().cuda()
# model = model.eval()

# Read the questions from a JSON Lines file
def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions


# Save the questions and answers, one JSON object per line
def save_answers_to_json(answers, json_file_path):
    with open(json_file_path, 'w', encoding='utf-8') as file:
        for answer in answers:
            json.dump(answer, file, ensure_ascii=False, indent=None)
            print("", file=file)

if __name__ == '__main__':
    # Read the 2019 questions
    test_questions = read_questions_from_json("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\question_2019.jsonl")

    # Build one record per question with an empty answer
    answers = []
    for item in test_questions:
        question_id = item["id"]
        question_text = item["question"]

        # Model call (disabled)
        # response, _ = model.chat(tokenizer, question_text, history=[])
        # response = response.replace("\n\n", "")

        # Store the question and the empty answer as a dict
        answer_dict = {
            "id": question_id,
            "question": question_text,
            "answer": ''
        }
        answers.append(answer_dict)
        # Rewrite answers_2019.json after every record
        save_answers_to_json(answers, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\answers_2019.json")

Finally, merge all the split JSON files, ordered by id:

import json
import os

def merge_answers_to_json(path, answer_files, output_file):
    all_answers = []
    for file in answer_files:
        with open(os.path.join(path, file), 'r', encoding='utf-8') as fr:
            for line in fr:
                data = json.loads(line)  # parse each line once
                item = {"id": data["id"], "question": data["question"], "answer": data["answer"]}
                all_answers.append(item)

    # Sort by "id"
    all_answers_sorted = sorted(all_answers, key=lambda x: x["id"])

    # Save to a new file, one JSON object per line
    with open(output_file, 'w', encoding='utf-8') as fw:
        for i in all_answers_sorted:
            json.dump(i, fw, ensure_ascii=False, indent=None)
            print("", file=fw)

if __name__ == '__main__':
    path = "E:\\chatglm_llm_fintech_raw_dataset\\dataset"
    # Collect every file whose name starts with "answers" (answers.json, answers_2019.json, ...);
    # the original "answers_" prefix would have skipped answers.json with the open-question answers
    answer_files = [file for file in os.listdir(path) if file.startswith("answers")]
    print(answer_files)
    # Write the merged result to answer_all.json
    merge_answers_to_json(path, answer_files, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\answer_all.json")

7.31

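Added a feature that extracts the company name and year from each question and writes them to a JSON file.

# Locate the file that matches a question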

import os
import jieba
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from multiprocessing import  Process

import re
# Extract the company name from a file name
def extract_text(full_text, start_text, end_text):
    pattern = re.escape(start_text) + r"(.*?)" + re.escape(end_text)  # build the regex pattern
    result = re.search(pattern, full_text)  # search the full text for a match

    if result:
        extracted_text = result.group(1)  # the captured text
        return extracted_text.strip()  # strip leading and trailing whitespace
    else:
        return ""  # no match: return an empty string

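# Compute the TF-IDF similarity of two sentences
def calculate_similarity(sentence1, sentence2):
    # Segment with jieba and join with spaces so TfidfVectorizer sees one token per word
    seg_list1 = jieba.lcut(sentence1)
    seg_list2 = jieba.lcut(sentence2)

    # The default token pattern drops single-character tokens, so allow them explicitly
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    vectors = vectorizer.fit_transform([" ".join(seg_list1), " ".join(seg_list2)])

    # Rows are L2-normalized by default, so the dot product is the cosine similarity
    similarity = vectors[0] @ vectors[1].T
    similarity_score = similarity.toarray()[0][0]

    return similarity_score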

def find_most_similar_filename(sentence, folder_path):
    highest_similarity = 0
    most_similar_filename = ""

    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)

        if os.path.isfile(file_path):
            similarity_score = calculate_similarity(sentence, extract_text(filename,'__','股份有限公司'))

            if similarity_score > highest_similarity:
                highest_similarity = similarity_score
                most_similar_filename = filename
    
    highest_similarity = 0
    
    # print(most_similar_filename)
    
    if most_similar_filename == "":
        for filename in os.listdir(folder_path):
            file_path = os.path.join(folder_path, filename)

            if os.path.isfile(file_path):
                similarity_score = calculate_similarity(sentence, filename.split('/')[-1].split('__')[3])

                if similarity_score > highest_similarity:
                    highest_similarity = similarity_score
                    most_similar_filename = filename
            
    return most_similar_filename

if __name__ == '__main__':
    
    ## company name
    jieba.load_userdict("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\company_names.txt")
    jieba.load_userdict("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\company_short_names.txt")
    
    
    folder_path = "E:\\chatglm_llm_fintech_raw_dataset\\alltxt"
    
    save_path = "E:\\chatglm_llm_fintech_raw_dataset\\dataset"
    test_questions = open(save_path+"\\question_2019.jsonl","r", encoding="utf-8").readlines()
    
    match_dict = []
    
    for test_question in test_questions:
        id = json.loads(test_question)["id"]
        question = json.loads(test_question)["question"]
        
    # question = "2021年浙江东晶电子股份有限公司的营业成本是多少元?"
    
        seg = jieba.lcut(question)
        seg = [seg[i].replace('2019','') for i in range(len(seg))]
        seg = [seg[i].replace('2020','') for i in range(len(seg))]
        seg = [seg[i].replace('2021','') for i in range(len(seg))]
        # print(seg)
        sentence_a = max(seg, key=len, default='')
        print(sentence_a)
    

        most_similar = find_most_similar_filename(sentence_a, folder_path)
        # match_dict.append({"id":id,"question":question,"filename":most_similar})
        with open(save_path + "/match_2019.jsonl", 'a', encoding='utf-8', newline='') as fw:
            json.dump({"id": id, "question": question, "filename": most_similar, "year": '2019'}, fw, ensure_ascii=False, indent=None)
            print("", file=fw) # Add a newline after each JSON object
        
        # print(f"最相似的文件名是：{most_similar}")
    
    # match_dict = sorted(match_dict, key=lambda x: x['id'])
    
    # with open(save_path+"/match_2019.jsonl", 'w', encoding='utf-8', newline='') as fw:
    #     for item in match_dict:
    #         json.dump(item, fw, ensure_ascii=False, indent=None)
    #         print("", file=fw)
    
    
    # sentence_a = '浙江东晶电子'
    
    # sentence_b = '常熟汽饰'
    
    # most_similar = find_most_similar_filename(sentence_b, folder_path)
    # print(f"最相似的文件名是：{most_similar}")

8.1

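Extracted the company name for each question and merged the results. The idea: use the company names from the source file names as a jieba dictionary, segment the question, and take the longest token as the company name. The problem is that a short company name may fail to match because a longer, unrelated token wins; possible fixes are to add some checks or to drop long non-name words during segmentation.

Later I made some improvements. First, the company name is extracted by comparing jieba's tokens against the dictionary. Then the year is extracted; the rule is that if the question contains a later year, the later one is used. Finally, the file search looks in the folder for that year first and falls back to the other folders if nothing is found.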
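
The three rules can be sketched compactly with the standard library. This is an illustration under assumptions, not the code that follows: plain substring matching replaces the jieba-based comparison, "later year" is read as the larger year number, the data is assumed to sit in one folder per year, and `extract_year`, `extract_company`, and `locate_file` are hypothetical names.

```python
import os
import re
from typing import Iterable, Optional


def extract_year(question: str, years=("2019", "2020", "2021")) -> Optional[str]:
    """Pick the latest known year mentioned in the question, if any."""
    found = [y for y in re.findall(r"\d{4}", question) if y in years]
    return max(found) if found else None


def extract_company(question: str, company_names: Iterable[str]) -> str:
    """Return the longest dictionary name that occurs in the question."""
    hits = [name for name in company_names if name in question]
    return max(hits, key=len, default="")


def locate_file(root: str, company: str, year: Optional[str]) -> Optional[str]:
    """Search the folder of the given year first, then the other folders."""
    folders = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    if year in folders:
        folders.remove(year)
        folders.insert(0, year)  # preferred folder goes first
    for folder in folders:
        for filename in sorted(os.listdir(os.path.join(root, folder))):
            if company and company in filename:
                return os.path.join(root, folder, filename)
    return None
```

Matching against the dictionary directly sidesteps the short-name problem described above, because the result no longer depends on which token happens to be longest.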

# Locate the file that matches a question

import os
import jieba
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from multiprocessing import  Process

import re
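# Extract the company name from a file name
def extract_text(full_text, start_text, end_text):
    pattern = re.escape(start_text) + r"(.*?)" + re.escape(end_text)  # build the regex pattern
    result = re.search(pattern, full_text)  # search the full text for the pattern

    if result:
        extracted_text = result.group(1)  # text between the two markers
        return extracted_text.strip()  # strip leading and trailing whitespace
    else:
        return ""  # nothing matched: return an empty string

# Compute the similarity of two strings
def calculate_similarity(sentence1, sentence2):
    seg_list1 = jieba.lcut(sentence1)
    seg_list2 = jieba.lcut(sentence2)
    if not seg_list1 or not seg_list2:
        return 0.0

    # Feed the space-joined jieba tokens to TF-IDF. Unsegmented Chinese text
    # would otherwise be treated as one single token per sentence.
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    try:
        vectors = vectorizer.fit_transform([" ".join(seg_list1), " ".join(seg_list2)])
    except ValueError:  # empty vocabulary, e.g. punctuation-only input
        return 0.0

    similarity = vectors[0] * vectors[1].T  # rows are L2-normalised, so this is the cosine similarity
    similarity_score = similarity.toarray()[0][0]

    return similarity_score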

def find_most_similar_filename(sentence, folder_path):
    highest_similarity = 0
    most_similar_filename = ""

    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)

        if os.path.isfile(file_path):
            similarity_score = calculate_similarity(sentence, extract_text(filename,'__','股份有限公司'))

            if similarity_score > highest_similarity:
                highest_similarity = similarity_score
                most_similar_filename = filename
    
    highest_similarity = 0
    
    # print(most_similar_filename)
    
    if most_similar_filename == "":
        for filename in os.listdir(folder_path):
            file_path = os.path.join(folder_path, filename)

            if os.path.isfile(file_path):
                similarity_score = calculate_similarity(sentence, filename.split('/')[-1].split('__')[3])

                if similarity_score > highest_similarity:
                    highest_similarity = similarity_score
                    most_similar_filename = filename
            
    return most_similar_filename

if __name__ == '__main__':
    
    ## company name
    jieba.load_userdict("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\company_names.txt")
    jieba.load_userdict("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\company_short_names.txt")
    # Load the entries of company_names.txt and company_short_names.txt
    with open("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\company_names.txt", "r", encoding="utf-8") as file:
        company_names = set(file.read().splitlines())

    with open("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\company_short_names.txt", "r", encoding="utf-8") as file:
        company_short_names = set(file.read().splitlines())

    # Merge the two sets into one custom dictionary
    custom_dictionary = company_names.union(company_short_names)
    
    folder_path = "E:\\chatglm_llm_fintech_raw_dataset\\alltxt"
    
    save_path = "E:\\chatglm_llm_fintech_raw_dataset\\dataset"
    test_questions = open(save_path+"\\question_2019.jsonl","r", encoding="utf-8").readlines()
    
    match_dict = []
    id_list = []
    for test_question in test_questions:
        record = json.loads(test_question)  # parse each JSONL line once
        id = record["id"]
        question = record["question"]
    
    # question = "2021年浙江东晶电子股份有限公司的营业成本是多少元?"

        seg = jieba.lcut(question)
        # seg = [seg[i].replace('2019','') for i in range(len(seg))]
        # seg = [seg[i].replace('2020','') for i in range(len(seg))]
        # seg = [seg[i].replace('2021','') for i in range(len(seg))]
        year = ''
        for i in range(len(seg)):
            if(seg[i] == '2019' and year == ''):
                year = '2019'
            if(seg[i] == '2020' and (year == '' or year == '2019')):
                year = '2020'
            if(seg[i] == '2021' and (year == '' or year == '2019' or year == '2020')):
                year = '2021'
        # print(seg)
        # sentence_a = max(seg, key=len, default='')
        sentence_a = ""
        most_similar = ""
        for word in seg:
            if(word in custom_dictionary):
                sentence_a = word
                print(sentence_a)
                most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + year)
                if(most_similar == "" and year == '2019'):
                    most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + '2020')
                    if(most_similar == ""):
                        most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + '2021')
                if(most_similar == "" and year == '2020'):
                    most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + '2021')
                    if(most_similar == ""):
                        most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + '2019')
                if(most_similar == "" and year == '2021'):
                    most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + '2020')
                    if(most_similar == ""):
                        most_similar = find_most_similar_filename(sentence_a, folder_path + '\\' + '2019')
                match_dict.append({"id":id,"question":question,"filename":most_similar})
                # with open(save_path + "/test.jsonl", 'a', encoding='utf-8', newline='') as fw:
                #     json.dump({"id": id, "question": question, "companany_name": sentence_a, "year": year}, fw, ensure_ascii=False, indent=None)
                #     print("", file=fw) # Add a newline after each JSON object
                with open(save_path + "/match2_2019.jsonl", 'a', encoding='utf-8', newline='') as fw:
                    json.dump({"id": id, "question": question, "companany_name": sentence_a, "filename": most_similar, "year": year}, fw, ensure_ascii=False, indent=None)
                    print("", file=fw) # Add a newline after each JSON object
        if(sentence_a == ""):
            with open(save_path + "/match2_2019.jsonl", 'a', encoding='utf-8', newline='') as fw:
                json.dump({"id": id, "question": question, "companany_name": sentence_a, "filename": most_similar, "year": year}, fw, ensure_ascii=False, indent=None)
                print("", file=fw) # Add a newline after each JSON object
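The file matching in the script above depends on jieba and scikit-learn. As a rough, dependency-free stand-in for experimenting with the same idea, the sketch below scores two names by the Jaccard overlap of their character bigrams. This is a swapped-in technique, not the TF-IDF method of the script, and the names `char_bigrams`, `bigram_similarity` and `best_match` are hypothetical.

```python
def char_bigrams(text):
    """Set of overlapping 2-character substrings; a 1-character text yields itself."""
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_similarity(a, b):
    """Jaccard similarity of the two bigram sets, in the range [0, 1]."""
    set_a, set_b = char_bigrams(a), char_bigrams(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match(query, candidates):
    """Return the candidate most similar to the query, or "" if nothing overlaps."""
    best, best_score = "", 0.0
    for candidate in candidates:
        score = bigram_similarity(query, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


if __name__ == "__main__":
    print(best_match("东晶电子", ["浙江东晶电子", "常熟汽饰"]))
```

Like `find_most_similar_filename`, `best_match` returns an empty string when no candidate scores above zero, so the caller can still fall back to another folder.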

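Next, using the JSON file that maps each question to its file name and company name, all the securities-code questions can be answered first.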

from transformers import AutoTokenizer, AutoModel
import json
import re

# Load the model and tokenizer (disabled: the answers below are parsed from file names)
# tokenizer = AutoTokenizer.from_pretrained("THUDM\chatglm-6b-int4", trust_remote_code=True)
# model = AutoModel.from_pretrained("THUDM\chatglm-6b-int4", trust_remote_code=True).half().cuda()
# model = model.eval()

# Read the question records from a JSON Lines file
def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions


# Save the questions and answers, one JSON object per line
def save_answers_to_json(answers, json_file_path):
    with open(json_file_path, 'w', encoding='utf-8') as file:
        for answer in answers:    
            json.dump(answer, file, ensure_ascii=False, indent=None)
            print("", file=file)

if __name__ == '__main__':
    # Read the matched questions from the JSONL file
    test_questions = read_questions_from_json("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\match_2021.jsonl")

    # Answer the securities-code questions from the file names and collect the answers
    answers = []
    for item in test_questions:
        question_id = item["id"]
        question_text = item["question"]
        filename = item["filename"]
        if "证券代码" in question_text:
            # Model call (disabled); the answer is parsed from the file name instead
            # response, _ = model.chat(tokenizer, question_text, history=[])
            # response = response.replace("\n\n", "")
            year_match = re.search(r'(\d{4})年', filename)
            company_name_match = re.search(r'__(.*?)__', filename)
            security_code_match = re.search(r'__(\d{6})__', filename)

            answer = ""
            if year_match and company_name_match and security_code_match:
                year = year_match.group(1)
                company_name = company_name_match.group(1)
                security_code = security_code_match.group(1)

                # Store the extracted information in the answer string
                answer = f"{year}年{company_name}的证券代码是{security_code}"
            # Store the question and answer as a dict
            answer_dict = {
            "id": question_id,
            "question": question_text,
            "answer": answer
            }
        else:
            answer_dict = {
            "id": question_id,
            "question": question_text,
            "answer": ""
            }
        answers.append(answer_dict)
    # Write the answers file once, after all questions have been processed
    save_answers_to_json(answers, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\answers_2021.json")
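The script above pulls the year, company name and securities code out of the file name with three separate regular expressions. A minimal alternative sketch, assuming every report file name has exactly six fields separated by `__` (date, company name, code, short name, year, report type), is to split once and name the fields. `ReportName`, `parse_report_filename` and `security_code_answer` are hypothetical names, and the sample file name is only illustrative.

```python
import os
from typing import NamedTuple, Optional


class ReportName(NamedTuple):
    date: str
    company: str
    code: str
    short_name: str
    year: str
    report_type: str


def parse_report_filename(filename: str) -> Optional[ReportName]:
    """Split a report file name into its six fields, or return None if it does not fit."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split("__")
    if len(parts) != 6:
        return None
    return ReportName(*parts)


def security_code_answer(filename: str) -> str:
    """Build the securities-code answer sentence, or "" if the name cannot be parsed."""
    info = parse_report_filename(filename)
    if info is None or not info.code.isdigit():
        return ""
    return f"{info.year}{info.company}的证券代码是{info.code}"


if __name__ == "__main__":
    sample = "2020-03-12__江阴江化微电子材料股份有限公司__603078__江化微__2019年__年度报告.txt"
    print(security_code_answer(sample))
```

Returning `None` for names that do not have six fields keeps malformed file names from silently producing a wrong answer.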

The next step is to use the JSON file to find the txt file for each question and build the corresponding vector store.

import os
import nltk
import json
from pypinyin import pinyin, Style
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS
from tqdm import tqdm
import os

# Show reply with source text from input document
REPLY_WITH_SOURCE = False

def chinese_to_english(input_text):
    pinyin_list = pinyin(input_text, style=Style.NORMAL)
    english_text = ''.join([p[0] for p in pinyin_list])
    return english_text.lower()  # pinyin as lowercase ASCII, usable as a folder name

# Read the question records from a JSON Lines file
def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions


# Save the questions and answers, one JSON object per line
def save_answers_to_json(answers, json_file_path):
    with open(json_file_path, 'w', encoding='utf-8') as file:
        for answer in answers:    
            json.dump(answer, file, ensure_ascii=False, indent=None)
            print("", file=file)

def main():
    # Load the embedding model from a local text2vec-large-chinese checkpoint
    embedding_model_name = "E:\\ChatGLM-6B-main\\ChatGLM-6B-main\\THUDM\\text2vec-large-chinese"
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
    test_questions = read_questions_from_json("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\match_2019.jsonl")
    # For each matched txt file, build and save a vector store
    # txt_folder = "E:\\chatglm_llm_fintech_raw_dataset\\alltxt\\"
    answers = []
    test_questions = test_questions[501:]
    id_list_2019 = [1, 32, 39, 65, 96, 153, 171, 371, 379, 415, 607, 616, 630, 721, 757, 793, 831, 835, 917, 996, 1001, 1153, 1253, 1257, 1294, 1472, 1638, 1644, 1657, 1711, 1840, 1945, 1954, 2038, 2059, 2068, 2095, 2182, 2250, 2283, 2287, 2326, 2395, 2399, 2541, 2654, 2730, 2783, 2799, 2876, 2921, 2980, 2982, 3043, 3049, 3061, 3082, 3115, 3119, 3195, 3227, 3265, 3275, 3277, 3307, 3492, 3595, 3667, 3946, 3975, 4000, 4003, 4027, 4030, 4099, 4113, 4199, 4202, 4342, 4345, 4437, 4449, 4506, 4518, 4524, 4765, 4861]
    for item in test_questions:
        question_id = item["id"]
        question_text = item["question"]
        file_name = item["filename"]
        file_path = ("E:\\chatglm_llm_fintech_raw_dataset\\alltxt\\" + file_name)
        vector_path = ("E:\\chatglm_llm_fintech_raw_dataset\\vector_files\\" + chinese_to_english(file_name[:-4]) + "\\")
        # print(vector_path)
        if os.path.exists(vector_path):
            print(vector_path + "exist!!")
            continue
        if(question_id in id_list_2019):
            rows = []
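            if os.path.exists(file_path):
                with open(file_path, "r", encoding="utf-8") as f:
                    rows = f.read().split("\n")
            else:
                print("File does not exist. Skipping this question.")
                continue  # nothing to embed; FAISS cannot build an index from zero documents

            # Deduplicate the lines and create Document objects
            rows = list(set(rows))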
            docs = []
            idx = 0
            for row in rows:
                metadata = {"source": f'doc_id_{idx}'}
                idx += 1
                if isinstance(row, str):
                    docs.append(Document(page_content=row, metadata=metadata))

            # Compute the embeddings and save the vector store
            vector_store = FAISS.from_documents(docs, embeddings)
            output_folder = "vector_files"  # folder that holds the saved vector stores
            if not os.path.exists(output_folder):
                os.makedirs(output_folder)
            txt_file_name = os.path.splitext(os.path.basename(file_name))[0]
            txt_file_name = chinese_to_english(txt_file_name)
            vector_store.save_local(f'vector_files/{txt_file_name}/')
            print(file_name, "向量文件生成完成")


if __name__ == "__main__":
    main()
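The script above deduplicates lines with `set()`, which discards the original line order, so the `doc_id_N` labels no longer follow the order of the report. A minimal sketch of an order-preserving variant is shown below. It is an assumption-laden illustration: `unique_lines` and `build_docs` are hypothetical helpers, blank lines are dropped (the script keeps one), and plain dicts stand in for langchain `Document` objects.

```python
def unique_lines(text):
    """Split text into lines, keeping only the first occurrence of each non-blank line."""
    lines = text.split("\n")
    # dict.fromkeys preserves insertion order, unlike set()
    return [line for line in dict.fromkeys(lines) if line.strip()]


def build_docs(lines):
    """Pair each line with a doc_id_N source label that follows the file order."""
    return [
        {"page_content": line, "metadata": {"source": f"doc_id_{idx}"}}
        for idx, line in enumerate(lines)
    ]


if __name__ == "__main__":
    docs = build_docs(unique_lines("营业收入\n营业成本\n\n营业收入\n净利润"))
    for doc in docs:
        print(doc)
```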

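Based on the langchain-chatglm project, I wrote a script that loads the vector stores from a local directory, generates the answers directly, and saves them to a JSON file.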

from configs.model_config import *
from chains.local_doc_qa import LocalDocQA
import os
import nltk
from models.loader.args import parser
import models.shared as shared
from models.loader import LoaderCheckPoint
import json
from pypinyin import pinyin, Style
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS
from tqdm import tqdm
import torch  # used by torch.cuda.empty_cache() in main()
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path

# Show reply with source text from input document
REPLY_WITH_SOURCE = False

def chinese_to_english(input_text):
    pinyin_list = pinyin(input_text, style=Style.NORMAL)
    english_text = ''.join([p[0] for p in pinyin_list])
    return english_text.lower()  # pinyin as lowercase ASCII, usable as a folder name

# Read the question records from a JSON Lines file
def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions


# Append one answer to the output file as a single JSON line
def save_answers_to_json(answers, json_file_path):
    with open(json_file_path, 'a', encoding='utf-8') as file:    
        json.dump(answers, file, ensure_ascii=False, indent=None)
        print("", file=file)

def main():

    llm_model_ins = shared.loaderLLM()
    llm_model_ins.history_len = LLM_HISTORY_LEN

    local_doc_qa = LocalDocQA()
    local_doc_qa.init_cfg(llm_model=llm_model_ins,
                          embedding_model=EMBEDDING_MODEL,
                          embedding_device=EMBEDDING_DEVICE,
                          top_k=VECTOR_SEARCH_TOP_K)
    # Load the embedding model from a local text2vec-large-chinese checkpoint
    embedding_model_name = "E:\\ChatGLM-6B-main\\ChatGLM-6B-main\\THUDM\\text2vec-large-chinese"
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
    test_questions = read_questions_from_json("E:\\chatglm_llm_fintech_raw_dataset\\dataset\\match_2021.jsonl")
    # For each matched question, load its saved vector store and generate an answer
    # txt_folder = "E:\\chatglm_llm_fintech_raw_dataset\\alltxt\\"
    answers = []
    id_list_2019 = [1, 32, 39, 65, 96, 153, 171, 371, 379, 415, 607, 616, 630, 721, 757, 793, 831, 835, 917, 996, 1001, 1153, 1253, 1257, 1294, 1472, 1638, 1644, 1657, 1711, 1840, 1945, 1954, 2038, 2059, 2068, 2095, 2182, 2250, 2283, 2287, 2326, 2395, 2399, 2541, 2654, 2730, 2783, 2799, 2876, 2921, 2980, 2982, 3043, 3049, 3061, 3082, 3115, 3119, 3195, 3227, 3265, 3275, 3277, 3307, 3492, 3595, 3667, 3946, 3975, 4000, 4003, 4027, 4030, 4099, 4113, 4199, 4202, 4342, 4345, 4437, 4449, 4506, 4518, 4524, 4765, 4861]
    # id_list_2020 = [0, 118, 159, 185, 203, 256, 266, 293, 360, 410, 536, 640, 810, 820, 1075, 1095, 1101, 1111, 1170, 1268, 1455, 1469, 1554, 1625, 1679, 1742, 1849, 2098, 2258, 2327, 2379, 2433, 2500, 2551, 2580, 2611, 2801, 2809, 2858, 2945, 2959, 3042, 3068, 3336, 3373, 3375, 3393, 3535, 3546, 3610, 3656, 3665, 3717, 3726, 3746, 3843, 4029, 4044, 4255, 4260, 4363, 4502, 4600, 4795, 4833]
    # id_list_2020 = [4600, 4795, 4833]
    #id_list_2021 = [56, 190, 201, 208, 257, 344, 361, 376, 420, 656, 685, 720, 778, 787, 796, 817, 870, 899, 1261, 1264, 1314, 1368, 1378, 1492, 1550, 1712, 1716, 1744, 1810, 1826, 1829, 2049, 2066, 2084, 2203, 2603, 2803, 2807, 2871, 2915, 2978, 2981, 3343, 3353, 3459, 3524, 3597, 3653, 3794, 4005, 4074, 4103, 4115, 4308, 4314, 4526, 4659]
    id_list_2021 = [1378, 1492, 1550, 1712, 1716, 1744, 1810, 1826, 1829, 2049, 2066, 2084, 2203, 2603, 2803, 2807, 2871, 2915, 2978, 2981, 3343, 3353, 3459, 3524, 3597, 3653, 3794, 4005, 4074, 4103, 4115, 4308, 4314, 4526, 4659]
    # test_questions = test_questions[:501]
    for item in test_questions:
        torch.cuda.empty_cache()
        question_id = item["id"]
        question_text = item["question"]
        file_name = item["filename"]
        file_path = ("E:\\chatglm_llm_fintech_raw_dataset\\alltxt\\" + file_name)
        if(question_id in id_list_2021):
            if file_name == "":
                answer_dict = {
                    "id": question_id,
                    "question": question_text,
                    "answer": "很抱歉，我的知识库中不包含问题中提到的公司"
                }
                # answers.append(answer_dict)
                save_answers_to_json(answer_dict, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\train_2021.json")
                print("答案保存完成")
                continue
            # rows = []
            # if os.path.exists(file_path):
            #     for line in open(file_path, "r", encoding="utf-8").read().split("\n"):
            #         rows.append(line)
            # else:
            #     print("File does not exist. Skipping the loop.")

            # #去重并创建文档对象
            # rows = list(set(rows))
            # docs = []
            # idx = 0
            # for row in rows:
            #     metadata = {"source": f'doc_id_{idx}'}
            #     idx += 1
            #     if isinstance(row, str):
            #         docs.append(Document(page_content=row, metadata=metadata))

            # # 计算并保存向量文件
            # vector_store = FAISS.from_documents(docs, embeddings)
            # output_folder = "vector_files"  # 定义保存向量文件的文件夹路径
            # if not os.path.exists(output_folder):
            #     os.makedirs(output_folder)
            txt_file_name = os.path.splitext(os.path.basename(file_name))[0]
            txt_file_name = chinese_to_english(txt_file_name)
            #vector_store = FAISS.load_local(f'vector_files/{txt_file_name}/')
            #print(file_name, "向量文件生成完成")
            vs_path = ("E:\\chatglm_llm_fintech_raw_dataset\\vector_files\\" + txt_file_name + "\\")
            # vs_path = ("E:\\chatglm_llm_fintech_raw_dataset\\vector_files\\2020-03-12__jiangyinjianghuaweidianzicailiaogufenyouxiangongsi__603078__jianghuawei__2019nian__niandubaogao\\")
            print(vs_path) 
            history = []
            query = question_text
            output_text = ""  # reset per question so a failed query never reuses the previous answer
            for resp, history in local_doc_qa.get_knowledge_based_answer(query=query,
                                                                        vs_path=vs_path,
                                                                        chat_history=history,
                                                                        ):
                # Keep the latest (most complete) response with newlines removed,
                # whether or not STREAMING is enabled
                output_text = resp["result"].strip().replace("\n", "")
            answer_dict = {
                "id": question_id,
                "question": question_text,
                "answer": output_text
            }
            # answers.append(answer_dict)
            save_answers_to_json(answer_dict, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\train_2021.json")
            print("答案保存完成")
            # if REPLY_WITH_SOURCE:
            #     source_text = [f"""出处 [{inum + 1}] {os.path.split(doc.metadata['source'])[-1]}：{doc.page_content}"""
            #                     # f"""相关度：{doc.metadata['score']}\n\n"""
            #                     for inum, doc in
            #                     enumerate(resp["source_documents"])]
            #     print("".join(source_text))


if __name__ == "__main__":
    # When cli_demo is called through cli.py, the model has to be initialised in cli.py, otherwise this error is raised:
    # langchain-ChatGLM: error: unrecognized arguments: start cli
    # To avoid it, first move the statements
    # args = None
    # args = parser.parse_args()
    # args_dict = vars(args)
    # shared.loaderCheckPoint = LoaderCheckPoint(args_dict)
    # out of the main function,
    # then run the initialisation in cli.py
    args = None
    args = parser.parse_args()
    args_dict = vars(args)
    shared.loaderCheckPoint = LoaderCheckPoint(args_dict)
    main()
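Because answer generation is slow and the script appends one JSON line per answer, an interrupted run can be resumed by skipping the ids that are already in the output file, instead of trimming the id lists by hand. The sketch below shows one way to do that under stated assumptions: `load_done_ids`, `append_answer` and `pending` are hypothetical helpers and are not part of langchain-chatglm.

```python
import json
import os


def load_done_ids(jsonl_path):
    """Return the set of question ids already written to the JSONL output file."""
    if not os.path.exists(jsonl_path):
        return set()
    done = set()
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                done.add(json.loads(line)["id"])
    return done


def append_answer(jsonl_path, answer):
    """Append one answer dict as a single JSON line."""
    with open(jsonl_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(answer, ensure_ascii=False) + "\n")


def pending(questions, done_ids):
    """Questions that still need an answer, in their original order."""
    return [q for q in questions if q["id"] not in done_ids]
```

A run would call `pending(test_questions, load_done_ids(path))` once at start-up and `append_answer` after each generated answer, so a crash loses at most the question in progress.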

8.2

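Wrote a program that matches keywords in the question text and collects the ids of the matching questions in a list, so that answers can be generated for just one category of question.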

import json

# Path of the JSONL file to read (raw string so the backslashes are kept literally)
file_path = r"E:\chatglm_llm_fintech_raw_dataset\dataset\question_2019.jsonl"

def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions

data = read_questions_from_json(file_path)

# Collect the "id" of every question whose text contains "情况" or "简要"
ids_with_keyword = []

for item in data:
    if "question" in item and ("情况" in item["question"] or "简要" in item["question"]):
        ids_with_keyword.append(item["id"])

print(ids_with_keyword)
print(len(ids_with_keyword))
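The keyword filter above hard-codes two keywords. A small generalisation, sketched here with a hypothetical helper name `ids_with_keywords` and made-up sample questions, takes the keyword list as a parameter so each question category can reuse the same function.

```python
def ids_with_keywords(questions, keywords):
    """Return the ids of questions whose text contains at least one keyword."""
    return [
        q["id"]
        for q in questions
        if any(keyword in q.get("question", "") for keyword in keywords)
    ]


if __name__ == "__main__":
    sample = [
        {"id": 0, "question": "请简要介绍公司的主营业务"},
        {"id": 1, "question": "2019年的营业成本是多少元?"},
        {"id": 2, "question": "公司研发投入情况如何?"},
    ]
    print(ids_with_keywords(sample, ["情况", "简要"]))
```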

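Wrote a merge2json script that copies the answer field from one JSON file into a target JSON file: an entry whose answer in the target is empty is filled in from the source, and an entry that already has an answer is left unchanged.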

import json

# Source and destination files (raw strings so the backslashes are kept literally)
src_path = r"E:\chatglm_llm_fintech_raw_dataset\dataset\train_2020.json"
des_path = r"E:\chatglm_llm_fintech_raw_dataset\dataset\answer_all.json"

def read_questions_from_json(json_file_path):
    questions = []
    with open(json_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            question = json.loads(line)
            questions.append(question)
    return questions

# Save the questions and answers, one JSON object per line
def save_answers_to_json(answers, json_file_path):
    with open(json_file_path, 'w', encoding='utf-8') as file:
        for answer in answers:    
            json.dump(answer, file, ensure_ascii=False, indent=None)
            print("", file=file)

srcs = read_questions_from_json(src_path)
dests = read_questions_from_json(des_path)

# Fill each empty answer in the destination from the source entry with the same id
answers = []
for dest in dests:
    dest_id = dest["id"]
    dest_answer = dest["answer"]
    if dest_answer == "":
        for src in srcs:
            src_id = src["id"]
            src_answer = src["answer"]
            if src_id == dest_id and src_answer != "":
                dest["answer"] = src_answer
    answers.append(dest)
# Write the merged file once, after every entry has been processed
save_answers_to_json(answers, "E:\\chatglm_llm_fintech_raw_dataset\\dataset\\answer_all.json")
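The merge rule can also be written as a pure function that indexes the source answers by id, which avoids rescanning the whole source list for every destination entry. This is a minimal sketch; `merge_answers` is a hypothetical name and the sample records are made up.

```python
def merge_answers(dests, srcs):
    """Fill empty destination answers from source entries that share the same id."""
    # Index the non-empty source answers by id (a later duplicate id wins)
    src_by_id = {s["id"]: s["answer"] for s in srcs if s.get("answer")}
    merged = []
    for dest in dests:
        item = dict(dest)  # copy so the input list is left untouched
        if item.get("answer", "") == "" and item["id"] in src_by_id:
            item["answer"] = src_by_id[item["id"]]
        merged.append(item)
    return merged


if __name__ == "__main__":
    dests = [
        {"id": 1, "question": "q1", "answer": ""},
        {"id": 2, "question": "q2", "answer": "kept"},
    ]
    srcs = [
        {"id": 1, "question": "q1", "answer": "filled"},
        {"id": 2, "question": "q2", "answer": "ignored"},
    ]
    print(merge_answers(dests, srcs))
```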

After that came the long wait while the vector databases were built and the answers were generated.

8.3

Answered the securities-name questions, then reviewed for the 超新星 exam.

8.4

Reviewed for the 超新星 exam.

8.7

Added a login window, login.html; after logging in, the page now redirects to the original interface.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Login</title>
</head>
<body>
    <center>
    <div id="login-app">
        <h2>登录</h2>
        <p><input type="text" v-model="username" placeholder="用户名"></p>
        <p><input type="password" v-model="password" placeholder="密码"></p>
        <button @click="login">登录</button>
    </div>

    <script src="https://cdn.jsdelivr.net/npm/vue@2.6.14/dist/vue.min.js"></script>
    <script>
        new Vue({
            el: '#login-app',
            data: {
                username: '',
                password: '',
            },
            methods: {
                login() {
                    // Read the contents of the credentials file
                    fetch('credentials.txt')
                        .then(response => response.text())
                        .then(text => {
                            const credentials = text.split('\n');
                            const enteredUsername = this.username.trim();
                            const enteredPassword = this.password.trim();
                            let validCredentials = false;

                            // Walk through the stored credentials and compare
                            for (const credential of credentials) {
                                const [storedUsername, storedPassword] = credential.split(':').map(item => item.trim());
                                if (storedUsername === enteredUsername && storedPassword === enteredPassword) {
                                    validCredentials = true;
                                    break;
                                }
                            }
                            if (validCredentials) {
                                localStorage.setItem('loggedIn', 'true');
                                window.location.href = 'index.html';
                                alert('登录成功');
                            } else {
                                alert('无效的用户名或密码');
                            }
                        })
                        .catch(error => {
                            console.error('读取凭据时出错：', error);
                            alert('登录失败');
                        });
                }
            }
        });
    </script>
    </center>
</body>
</html>

In addition, the file that wenda opens by default has to be changed to login.html.
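The login page compares the entered values against `username:password` lines fetched into the browser. If the same check were moved to the server side, it could look roughly like the sketch below. This is only an illustration in Python under assumptions: `parse_credentials` and `check_login` are hypothetical helpers, the line format is taken from the page's `split(':')` logic, and nothing here is wenda's own API.

```python
import hmac


def parse_credentials(text):
    """Parse 'username:password' lines into a dict, ignoring malformed lines."""
    credentials = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        username, password = line.split(":", 1)
        credentials[username.strip()] = password.strip()
    return credentials


def check_login(credentials, username, password):
    """Return True only if the user exists and the password matches."""
    stored = credentials.get(username.strip())
    if stored is None:
        return False
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(stored.encode("utf-8"), password.strip().encode("utf-8"))


if __name__ == "__main__":
    creds = parse_credentials("alice:secret\nbob: 1234 \n")
    print(check_login(creds, "alice", "secret"), check_login(creds, "alice", "wrong"))
```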

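Extracting tables from PDFs. The main idea is to find the PDF file that corresponds to each question, extract every table in it with the extract_all_table function, and store the tables in df_tables, where each table is followed by its identifier (page number and table index). extract_table_data then pulls out all the items that need to be extracted and uses them as the column names of the data; which items are needed depends on the question. Finally, save_table_data saves the table data.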

import glob
import pdfplumber
import re
import xlwt
import pandas as pd
from tqdm import tqdm
import json
from multiprocessing import  Process

def extract_table_from_txt(table_header, file_name):
    with open(file_name, 'r', encoding='utf-8') as txt_file:
        lines = txt_file.readlines()

    temp_table = None
    df_tables = []
    table_name = []

    for line in lines:
        line = line.strip()
        if line:  # Non-empty line
            row_data = line.split(',')
            if temp_table is None:
                temp_table = [table_header]  # Assuming the header is provided
            temp_table.append(row_data)
        else:  # Empty line, start a new table
            if temp_table is not None:
                df = pd.DataFrame(temp_table[1:], columns=temp_table[0])
                df_tables.append([df, table_name])
                temp_table = None
                table_name = []

    if temp_table is not None:
        df = pd.DataFrame(temp_table[1:], columns=temp_table[0])
        df_tables.append([df, table_name])
    print(len(df_tables))
    return df_tables

def extract_all_table(table_header, file_name, save_path):
    pdf = pdfplumber.open(file_name)
    # print("done")
    temp_table = None
    df_tables = []
    table_name = []
    cnt = 0
    for page in pdf.pages[1:]:
        text = page.extract_text()
        # print(text)
        try:
            page_num = int(text.strip().split("\n")[-1].split('/')[0])  # the last line of the page text is the page number
        except (ValueError, IndexError):
            # print(cnt, " no page num")
            page_num = cnt
            # continue
        cnt += 1
        if len(page.extract_tables()) == 0:
            # print(cnt, " no table")
            continue
        else:
            # print("--------------- 第{}页 --------------".format(page_num))
            num_table = len(page.extract_tables())
            for table_id in range(num_table):
                table = page.extract_tables()[table_id]
                if temp_table is not None and table_id == 0:  # 上页有可能没结束的表
                    if page.bbox[3]-page.find_tables()[table_id].bbox[1] + 30 >= page.chars[0].get('y1'):  # 该表是页首（-30代表去掉空隙，是针对招股书pdf设计的特定值）
                        # df = pd.DataFrame(table[1:], columns=table[0])  # TODO: 判断是否有表头，有的话可能要合并表头
                        # 与temp拼合
                        df = pd.DataFrame(table)
                        temp_table = pd.concat([temp_table, df], axis=0)
                        table_name.append(str(page_num) + "-" + str(table_id + 1))
                        #print(table_name)
                        if page.chars[-1].get('y0') < page.bbox[3] - page.find_tables()[table_id].bbox[3]:  # 该表不是页尾，结束拼接，加入table_list，temp置空
                            df_tables.append([temp_table, table_name])
                            temp_table = None
                            table_name = []
                        else:  # 该表是页尾，继续拼接下一页表格
                            break
                        
                    else:  # 该表不是页首，上页页尾表格结束，加入table_list
                        df_tables.append([temp_table, table_name])
                        temp_table = None
                        table_name = []
                        if page.chars[-1].get('y0') < page.bbox[3] - page.find_tables()[table_id].bbox[3]:  # 该表不是页尾，直接加入table_list
                            df = pd.DataFrame(table)
                            table_name = [str(page_num) + "-" + str(table_id + 1)]
                            df_tables.append([df, table_name])
                            table_name = []
                        else:  # 该表是页尾，存入temp
                            temp_table = pd.DataFrame(table)
                            table_name.append(str(page_num) + "-" + str(table_id + 1))
                else:  # temp无值（上个表页尾不是表）
                    if page.chars[-1].get('y0') < page.bbox[3] - page.find_tables()[table_id].bbox[3]:  # 该表不是页尾，直接加入table_list
                        df = pd.DataFrame(table)
                        table_name = [str(page_num) + "-" + str(table_id + 1)]
                        df_tables.append([df, table_name])
                        table_name = []
                    else:  # 该表是页尾，存入temp
                        temp_table = pd.DataFrame(table)
                        table_name.append(str(page_num) + "-" + str(table_id + 1))

    if False:
        table_path = save_path + '/' + file_name.split('/')[-1].replace('.pdf', '_table.xlsx')
        print(table_path)
        writer = pd.ExcelWriter(table_path, engine='xlsxwriter')
        
        
        for table_ele in df_tables:
            name = "_".join(table_ele[1])
            # print(name)
            table_tmp = pd.DataFrame(table_ele[0])
            table_tmp.to_excel(writer, sheet_name="%s" % (name[:30]), index=False)
            
        writer.close() # 关闭
    print(df_tables[0])
    return df_tables
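The core of extract_all_table is stitching a table that runs to the bottom of one page onto the table that starts at the top of the next. The sketch below isolates that control flow from pdfplumber so it can be tested on plain lists. It rests on assumptions: `stitch_tables` is a hypothetical name, the two booleans per table stand in for the bounding-box comparisons in the function above, and a table still pending after the last page is flushed at the end, which is a choice made for this sketch.

```python
def stitch_tables(pages):
    """Merge tables that continue across page breaks.

    pages: list of pages; each page is a list of (rows, at_top, at_bottom)
    tuples, where rows is a list of table rows and the two flags say whether
    the table touches the top / bottom of the page.
    """
    finished = []
    pending = None  # rows of a table that ran to the bottom of an earlier page
    for tables in pages:
        for index, (rows, at_top, at_bottom) in enumerate(tables):
            if pending is not None and index == 0 and at_top:
                rows = pending + rows     # continuation of the earlier table
            elif pending is not None:
                finished.append(pending)  # the earlier table ended at the page break
            pending = None
            if at_bottom:
                pending = rows            # may continue on the next page
            else:
                finished.append(rows)
    if pending is not None:
        finished.append(pending)          # flush a table left open on the last page
    return finished


if __name__ == "__main__":
    pages = [
        [([["项目", "金额"]], False, True)],
        [([["营业收入", "100"]], True, False)],
    ]
    print(stitch_tables(pages))
```

Keeping the page-geometry test outside this function means the stitching rules can be checked without opening a PDF.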

def find_item(name, item_list):
    for item in item_list:
        # Skip NaN/None cells; only string cells can be matched
        if not isinstance(item, str):
            continue
        if item.startswith(name) or item.endswith(name):
            return name, item
    return name, None

def extract_table_data(file_name, table_header, df_tables, save_path):
    allname = file_name.split('/')[-1]
    company_name = allname.split('__')[1]
    company_short_name = allname.split('__')[3]
    company_code = allname.split('__')[2]
    year = allname.split('__')[4]
    
    table_data = {}
    
    for id in range(len(df_tables)):
        table_ele = df_tables[id][0]
        
        if table_ele.shape[1] == 1:
            continue
         
        col_pos = 0   
        value_pos = 1
        
        row0 = table_ele.loc[0]
        # print(row0.shape)
        # if row0.shape[0]>1:
        #     row0 = row0.iloc[0]
        
        if not isinstance(row0, str):
            if row0.ndim <= 1:
                # print(row0[1])
                if (row0[1] == "附注"):
                    value_pos = 2
        
        column0 = table_ele.loc[:, col_pos]
        
        cnt = 0
        total = len(column0)
        ## If the first column is entirely empty, shift to the next column
        for i in column0[:]:
            if i=='':
                cnt += 1
        
        if (cnt == total):
            col_pos += 1
            value_pos += 1
            column0 = table_ele.loc[:, col_pos]
        
        for i in table_header:
            name, item = find_item(str(i), column0)
            if item != None : 
                if len(item)<20:
                    # print(item)
                    row = (table_ele.values[:,col_pos]==item).tolist().index(True)
                    # print(row)
                    if value_pos < len(table_ele.values[row]):
                        item_value = table_ele.values[row][value_pos]
                        if item_value == '' or item_value == None:
                            tmp_pos = value_pos
                            while tmp_pos+1 < len(table_ele.values[row]):
                                item_value = table_ele.values[row][tmp_pos+1]
                                if item_value != None and item_value != '':
                                    break
                                tmp_pos += 1
                    else : 
                        continue
                    # print(item_value)
                    
                    if item_value == '' or item_value == None:
                        # print(row)
                        # Guard row > 0: index row-1 would otherwise wrap around to the last row.
                        if row > 0 and (table_ele.values[row-1][col_pos] == None or table_ele.values[row-1][col_pos] == ''):
                            if row+1 < table_ele.shape[0]:
                                if table_ele.values[row+1][col_pos] == None or table_ele.values[row+1][col_pos] == '':
                                    if table_ele.values[row+1][value_pos] != None and table_ele.values[row+1][value_pos] != '' and table_ele.values[row-1][value_pos] != None and table_ele.values[row-1][value_pos] != '' :
                                        item_value = str(table_ele.values[row-1][value_pos]) + str(table_ele.values[row+1][value_pos])
                                else :
                                    if table_ele.values[row-1][value_pos] != None and table_ele.values[row-1][value_pos] != '' :
                                        item_value = table_ele.values[row-1][value_pos]
                            else :
                                if table_ele.values[row-1][value_pos] != None and table_ele.values[row-1][value_pos] != '' :
                                    item_value = table_ele.values[row-1][value_pos]
                        elif row+1 < table_ele.shape[0]:
                            if table_ele.values[row+1][col_pos] == None or table_ele.values[row+1][col_pos] == '':
                                if table_ele.values[row+1][value_pos] != None and table_ele.values[row+1][value_pos] != '' :
                                    item_value = str(table_ele.values[row+1][value_pos])
                            
                    
                    if item_value != None and item_value != '' and type(item_value) != float:
                        if '、' in item_value or '附注' in item_value or '，' in item_value:
                            # if value_pos+1 < len(table_ele.values[row]):
                            #     item_value = table_ele.values[row][value_pos+1]
                            tmp_pos = value_pos
                            while tmp_pos+1 < len(table_ele.values[row]):
                                item_value = table_ele.values[row][tmp_pos+1]
                                if item_value != None and item_value != '':
                                    break
                                tmp_pos += 1
                            
                    if item_value != None and item_value != '':
                        if isinstance(item_value, str):
                            # str.replace returns a new string, so assign the result back
                            item_value = item_value.replace("\n", '')
                        # print(item, item_value)
                        if name not in table_data.keys():
                            # print("yes", name, item_value)
                            table_data[name] = item_value
                        # table_data[name] = item_value
            else:
                continue
    
    
    table_data['公司名称'] = company_name
    table_data['年份'] = year
    table_data['证券代码'] = company_code
    
    return table_data
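
# --- Illustrative sketch (not part of the original script) ---
# The '__'-separated fields of a report file name are read by position in several
# places. A single hypothetical parser, `parse_report_name`, would keep those
# positions in one spot. The field order (date, company name, stock code, short
# name, year) is taken from the indices used in this script; the name and the
# returned keys are assumptions made for this sketch.
def parse_report_name(path):
    """Split a report path into its named '__'-separated fields."""
    # Accept both Windows and POSIX separators before taking the base name.
    base = path.replace('\\', '/').split('/')[-1]
    parts = base.split('__')
    if len(parts) < 5:
        raise ValueError(f"unexpected report file name: {base!r}")
    return {
        'date': parts[0],
        'company_name': parts[1],
        'company_code': parts[2],
        'company_short_name': parts[3],
        'year': parts[4],
    }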

def save_table_data(table_path, table_data):
    # Save the extracted table data to an Excel file
    writer = pd.ExcelWriter(table_path, engine='xlsxwriter')
    
    table_template={}
    table_template['id']  = ""
    table_template['公司名称'] = ""
    table_template['年份'] = ""
    table_template['证券代码'] = ""
    for item in table_header:
        table_template[item] = ""
        
    table_data = [table_template] + table_data
    
    # Build the sheet in one step; keys missing from a record become empty cells.
    table = pd.DataFrame(table_data)
                        
    table.to_excel(writer, index=False)
    
    writer.close()
    

def loop_file(table_header, name_list, folder_path, questions, save_path):
    table_data = []
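    id_list = [18]  # only the question ids listed here are processed
    # print(id_list)
    for question in questions:
        # Each line of the questions file is one JSON object; parse it once.
        record = json.loads(question)
        id = record['id']
        file = record['filename']
        if file == "" or id not in id_list:
            continue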
        
        # file_name = folder_path + "\\" + file
        file_name = folder_path + "\\" + file.split('.txt')[0] + '.pdf'
        print(file_name)
        
        name_list.append(file_name)
        
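        # file_name was joined with '\\', so normalise separators before taking the base name.
        allname = file_name.replace('\\', '/').split('/')[-1]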
        # print(allname)
        date = allname.split('__')[0]
        company_name = allname.split('__')[1]
        year = allname.split('__')[4]
        
        # print(date, company_name, year)
        
        df_tables = extract_all_table(table_header, file_name, save_path)
        # df_tables = extract_table_from_txt(table_header, file_name)
        
        table_data_loc = extract_table_data(file_name,table_header, df_tables, save_path)
        # print(table_data_loc)
        table_data_loc["id"] = id
        
        table_data.append(table_data_loc)

        table_path = save_path + '\\question_type-1_table_data9.xlsx'
        save_table_data(table_path, table_data)
        
    return table_data
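
# --- Illustrative sketch (not part of the original script) ---
# loop_file reads one JSON object per line and keeps only records that have a
# file name and a wanted id. Assuming each line carries 'id' and 'filename' keys
# as above, the hypothetical generator `iter_selected_questions` isolates that
# filtering so it can be checked without opening any PDF.
import json


def iter_selected_questions(lines, wanted_ids):
    """Yield (id, filename) for each JSON line whose id is wanted and whose filename is set."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the questions file
        record = json.loads(line)
        if record.get('filename') and record.get('id') in wanted_ids:
            yield record['id'], record['filename']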



def loop_file_folder(table_header, name_list, file_names, save_path):
    table_data = []
    
    for idx, file_name in enumerate(file_names):
        print(file_name)
        name_list.append(file_name)
        allname = file_name.replace('\\', '/').split('/')[-1]
        date = allname.split('__')[0]
        name = allname.split('__')[1]
        year = allname.split('__')[4]
        
        print(date, name, year)
        
        df_tables = extract_all_table(table_header, file_name, save_path)
        
        table_data_loc = extract_table_data(file_name,table_header, df_tables, save_path)
        # print(table_data_loc)
        # No question id exists in folder mode, so use the file's position in the list.
        table_data_loc["id"] = idx
        
        table_data.append(table_data_loc)

        table_path = save_path + '/test_table_data.xlsx'
        save_table_data(table_path, table_data)
        
    return table_data

if __name__ == '__main__': 
    # Folder paths
    # folder_path = '../data'
    # folder_path = '../chatglm_llm_fintech_raw_dataset/allpdf'
    dataset_path = 'E:\\chatglm_llm_fintech_raw_dataset\\dataset'
    save_path = 'E:\\chatglm_llm_fintech_raw_dataset\\table'
    
    table_header = []
    with open(dataset_path+"\\table_new.txt", "r", encoding="utf-8") as f:
        for line in f.readlines():
            if line.strip() != "":
                table_header.append(line.strip())
    
    # print(table_header)
    
    # Collects the path of every processed file
    name_list = []

    folder = False
    
    if folder:
        # Collect the names of all files in the folder
        folder_path = '..\\data'
        file_names = glob.glob(folder_path + '\\*')
        file_names = sorted(file_names, reverse=True)
        table_data = loop_file_folder(table_header, name_list, file_names, save_path)
    
        table_path = save_path + '\\test_table_data.xlsx'
    
    else:
        # Load the JSON file of type-1 questions (one JSON object per line)
        # folder_path = 'E:\\chatglm_llm_fintech_raw_dataset\\newtxt_v2'
        folder_path = 'E:\\chatglm_llm_fintech_raw_dataset\\allpdf'
        filename = dataset_path + '\\match2_all.json'
        with open(filename, "r", encoding="utf-8") as f:
            test_questions = f.readlines()
        
        table_data = loop_file(table_header, name_list, folder_path, test_questions, save_path)

        
        table_path = save_path + '\\question_type-1_table_data.xlsx'