Semi-automatically generating content that is detected as 100% human-written

深圳王鲲鹏

Tags: chatgpt, 文心一言

Article link: https://blog.csdn.net/2301_78010628/article/details/132763170

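Recently I have been working on a large website project with more than ten million product records. To meet the operations and marketing requirement that product pages be indexed by overseas search engines such as Google, each product needs some original content. With that many products, writing it by hand is not realistic, so I turned to GPT-style tools: ChatGPT abroad and 文心一言 (Baidu's ERNIE Bot) in China.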

I started by using ChatGPT and 文心一言 to fetch product information. The code fragments are as follows:

Code calling ChatGPT:

	# Needs at module level: import datetime, inspect, openai
	# (openai<1.0, the versions that provide openai.ChatCompletion)
	def from_chatgpt(self, api_key, key_text, question_text, is_detail, bh=None):
		# When is_detail == 1, time the request and log it.
		if 1 == is_detail:
			t1 = datetime.datetime.now().timestamp()

		openai.api_key = api_key
		completion = openai.ChatCompletion.create(
			model="gpt-3.5-turbo",
			messages=[
				# {"role": "user", "content": "Simulate human beings to write an article about the technical background and specific application scenarios of " + prd_name + ", no more than 600 characters, and output in English."}
				{"role": "user", "content": question_text}
			],
			max_tokens=100,
			top_p=0.1,
			temperature=0.5
		)

		if 1 == is_detail:
			t2 = datetime.datetime.now().timestamp()
			t_use = int(t2 - t1)

			# Note: this log line contains the API key in plain text.
			log_text = "from_chatgpt KEY_TEXT[" + key_text + "] API KEY[" + api_key + "] USE TIME[" + str(t_use) + "]"
			print(log_text)
			if bh is not None:
				bh.log(log_text, __file__, inspect.currentframe().f_lineno, "info")

		# The generated text is the content of the first choice.
		return completion.choices[0].message.content

Code calling 文心一言:

	# Needs at module level: import datetime, inspect, json, requests
	def from_wenxinyiyan(self, api_key, secret_key, key_text, question_text, is_detail, bh=None):
		# When is_detail == 1, time the request and log it.
		if 1 == is_detail:
			t1 = datetime.datetime.now().timestamp()

		# Step 1: exchange the API key and secret key for an access token.
		url = "https://aip.baidubce.com/oauth/2.0/token"
		params = {"grant_type": "client_credentials", "client_id": api_key, "client_secret": secret_key}
		access_token = str(requests.post(url, params=params).json().get("access_token"))

		# Step 2: send the question to the chat completions endpoint.
		url = "https://aip.baidubce.com/rpc/2.0/ai_custom/v1/wenxinworkshop/chat/completions?access_token=" + access_token
		# Earlier fixed prompt (originally in Chinese): "Write an article about the technical background and application scenarios of the ZXTR2105FF-7 product, 300 characters, output in Chinese"
		payload = json.dumps({"messages": [{"role": "user", "content": question_text}]})
		headers = {'Content-Type': 'application/json'}

		response = requests.request("POST", url, headers=headers, data=payload)

		if 1 == is_detail:
			t2 = datetime.datetime.now().timestamp()
			t_use = int(t2 - t1)
			# Note: this log line contains the API key and secret key in plain text.
			log_text = "from_wenxinyiyan KEY[" + key_text + "] ok API KEY[" + api_key + "] SECRET_KEY[" + secret_key + "] USE TIME[" + str(t_use) + "]"
			print(log_text)
			if bh is not None:
				bh.log(log_text, __file__, inspect.currentframe().f_lineno, "info")

		# The generated text is in the "result" field of the JSON response.
		temp = json.loads(response.text)
		return temp["result"]

The question sent in the call is:

Simulate human beings to write an article about the technical background and specific application scenarios of " + product_name + ", no more than 600 characters, and output in English.

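In other words, the model is asked to write like a human about the technical background and specific application scenarios of the product, in English, in no more than 600 characters.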

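The goal was for ChatGPT to generate a reasonably complete article to serve as the product description. In practice, however, the output was either much shorter (around 300 characters, for example), as below: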

The SST511 Apple Product is a small, surface-mountable transistor that is widely used in electronic devices. Its technical background includes a low voltage drop and high current capability, making it ideal for power management applications. Specific application scenarios include battery charging, LED lighting, and motor control.

or far too long (more than 2,000 characters), as below:

The Apple product is a versatile and compact surface-mount transistor that finds its application in various technical fields. With its small form factor and RoHS compliance, it is widely used in electronic devices, such as smartphones, tablets, and wearables.

This transistor is specifically designed for low-power applications, making it ideal for battery-operated devices. Its technical background includes a low voltage drop and high current capability, ensuring efficient power management. The product package allows for easy integration onto circuit boards, saving valuable space in compact designs.

The SST507 transistor is commonly used in amplification and switching circuits. Its high gain and low noise characteristics make it suitable for audio amplifiers and signal processing applications. Additionally, it can be employed in voltage regulators, motor control circuits, and LED drivers, thanks to its ability to handle moderate power levels.

In the automotive industry, the SST507 is utilized in various control modules, such as engine management systems and lighting controls. Its robustness and reliability make it suitable for harsh environments. Moreover, it is employed in industrial automation, where it aids in controlling sensors, actuators, and other electronic components.

Overall, the Apple product transistor offers a compact and efficient solution for low-power applications across multiple industries. Its technical features and versatility make it an essential component in modern electronic devices and systems.

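There is a further problem: the output does not pass AI-detection tools. Pasting the two samples above directly into GPTZero fails, as shown in the screenshot below:

and in this one: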

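I repeatedly adjusted the ChatGPT parameters, for example:

completion = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k", messages=param["message"], max_tokens=100, temperature=0.1, top_p=0.8, n=1, stop=None, presence_penalty=0.0, frequency_penalty=0.0, stream=False)

This included changing model, max_tokens, temperature and top_p, and rewording messages (the question itself) many times: requiring the answer to be at least 600 and at most 1,000 characters, forbidding a closing summary, and so on. None of it worked. ChatGPT's answers felt unstable, and roughly 90% of them were judged AI-written when checked with GPTZero.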
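One way to cope with the unstable length is to check each answer and ask again when it falls outside the wanted range. The sketch below is only an illustration, not something the project is known to have done: `generate_within_length` and its parameters are hypothetical names, and `generate` stands for any function that sends the question and returns the text, such as a thin wrapper around `from_chatgpt` above.

```python
def generate_within_length(generate, question_text,
                           min_chars=600, max_chars=1000, max_attempts=3):
    """Call generate(question_text) until the answer length is in range.

    Returns (text, ok). If no attempt fits, the last answer is returned
    with ok set to False so the caller can decide what to do with it.
    """
    text = ""
    for _ in range(max_attempts):
        text = generate(question_text)
        if min_chars <= len(text) <= max_chars:
            return text, True
    return text, False
```

This only addresses the length problem. It does nothing about whether the text passes a detector, which is the harder issue discussed next.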

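文心一言 had the same problem. It generated the following content (in Chinese, translated here):

The Apple product is a high-performance, low-power chip that integrates many functions and is widely used in all kinds of electronic devices. The chip has strong computing power and excellent processing performance, and can meet the needs of many different application scenarios.

In terms of technical background, Apple uses advanced semiconductor processes and design techniques and integrates multiple processor cores, memory, communication modules, interfaces and other units, giving it a high degree of integration and flexibility. The chip also supports multiple operating systems and development environments, which makes application development and debugging convenient.

In terms of application scenarios, Apple is widely used in embedded devices in fields such as industrial control, smart home, intelligent transportation and healthcare. In industrial control, the chip can be used in automated production lines, robots, drones and similar equipment, providing high-performance computing and control. In smart home, it can be used in smart speakers, smart TVs, smart door locks and similar devices, providing rich multimedia and intelligent interaction. In intelligent transportation, it can be used in in-vehicle infotainment systems, traffic monitoring equipment and similar devices, providing efficient information processing and data communication. In healthcare, it can be used in medical instruments, health monitoring devices and similar equipment, providing accurate data acquisition and processing.

In short, Apple is a high-performance chip with broad application prospects. Its excellent computing performance and flexibility can meet the needs of many different application scenarios and provide strong support for the development of all kinds of electronic devices.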

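After translating this with Sogou Translate and running it through GPTZero, it fails every time, as shown below:

After many rounds of adjustment and testing, I still could not guarantee that the generated content would pass GPTZero. Along the way I noticed something, though: content from Baidu Baike, once translated, is also almost always judged AI-written by GPTZero. The original entry is shown below:

After translation, the result in GPTZero is shown below:

However, I found that in some product PDFs, the text covering product features, typical applications and mechanical data can pass GPTZero after translation, as shown below:

I copied the following text: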

FEATURES
• UL recognition, file number E54214
• Saves space on printed circuit boards
• Ideal for automated placement
• Middle surge current capability
• Meets MSL level 1, per J-STD-020, LF maximum
peak of 260 °C
• Material categorization: For definitions of compliance
please see www.vishay.com/doc?99912
TYPICAL APPLICATIONS
General purpose use in AC/DC bridge full wave rectification
for power supply, lighting ballaster, battery charger, home
appliances, office equipment, and telecommunication
applications.
MECHANICAL DATA
Case: TO-269AA (MBS)
Molding compound meets UL 94 V-0 flammability rating
Base P/N-E3 - RoHS-compliant, commercial grade
Terminals: Matte tin plated leads, solderable per
J-STD-002 and JESD22-B102
E3 suffix meets JESD 201 class 1A whisker test

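Pasted directly into GPTZero, it was, to my surprise, judged 100% human-written, as shown below:

Great: finally there was a reference example that passes GPTZero as 100% human. So I started analysing why. Content produced entirely by AI basically never passes. Could AI instead produce only part of the content, such as words or numbers, which are then assembled into text that passes detection? In other words, could content judged 100% human-written be generated semi-automatically?

After some analysis it seemed feasible, and I reached the following conclusions:

The vocabulary must not range too widely, or the text is judged AI-written. Human writing stays around one topic, while AI output jumps around a lot; you can tell at a glance, because it always feels slightly off;
It must not be an article or a set of long descriptive sentences, which are easily judged AI-written; ideally list only keywords and write no sentences;
It must contain numbers, otherwise it is judged part human, part AI;
It must not be too short: text under 200 characters cannot be checked;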
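These observations can be turned into a rough pre-check that runs before a text is sent to the detector. The sketch below encodes the author's rules of thumb only; it says nothing about how GPTZero actually works. The function name `precheck` and the 12-word limit used to spot a "long sentence" are assumptions, while the 200-character minimum comes from the list above. The first rule (vocabulary spread) is not checked.

```python
import re

def precheck(text, min_chars=200, max_words_per_line=12):
    """Return a list of rule-of-thumb problems found in text (empty if none)."""
    problems = []
    # Rule: text under 200 characters cannot be checked at all.
    if len(text) < min_chars:
        problems.append("too short to be checked")
    # Rule: the content should contain numbers.
    if not re.search(r"\d", text):
        problems.append("contains no numbers")
    # Rule: keywords rather than sentences, approximated by words per line.
    lines = [line for line in text.splitlines() if line.strip()]
    if any(len(line.split()) > max_words_per_line for line in lines):
        problems.append("contains long sentences instead of keywords")
    return problems
```

An empty list does not mean the text will pass; it only means none of these simple warning signs were found.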

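With these points roughly summarised, I thought about a solution: define a fixed format for the description, that is, a template; collect keywords and numeric content from the AI; and fill them into the template to produce an article in a fixed form that can pass GPTZero. The hope was to get something like the following: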

Introduction of Apple product

PRODUCT FEATURES:

High accuracy
Durable material
Robust construction
Easy to use
Versatile application
Low maintenance
Longevity
Repeatability
Resistance to harsh environments
Ease of calibration

MECHANICAL DATA:

Dimensions - 120 x 60 x 30 mm
Weight - 40 g
Material - Aluminum or stainless steel
Thread Diameter - M10 x 1
Hole Diameter - ∅8 mm
Clearance - 3/8" (9.525 mm)

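The hoped-for detection result looks like this, as shown below:

With the approach clear, the rest was straightforward. First, following the earlier analysis, I wrote a text template: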

/NIntroduction of [product_name] /N/NPRODUCT FEATURES:/N/N[product_features]/N/NMECHANICAL DATA:/N/N[product_mechanical_data]/N

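Some notes:

The template has three sections: the first is the introduction of the product, the second is the product features, and the third is the mechanical data;
[product_name], [product_features] and [product_mechanical_data] are the placeholders to be filled with the product name, the product features and the mechanical data;
/N is a special marker that the program later replaces with the \n newline character.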
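The filling step can be sketched with plain `str.replace`. This is a standalone illustration, not the project's code: the function name `fill_template` is hypothetical, and the template string is copied from the one above.

```python
TEMPLATE = ("/NIntroduction of [product_name] /N/NPRODUCT FEATURES:/N/N"
            "[product_features]/N/NMECHANICAL DATA:/N/N[product_mechanical_data]/N")

def fill_template(template, product_name, features, mechanical_data):
    """Turn /N markers into newlines, then fill the three placeholders."""
    content = template.replace("/N", "\n")
    content = content.replace("[product_name]", product_name)
    # Each list becomes one item per line.
    content = content.replace("[product_features]", "\n".join(features))
    content = content.replace("[product_mechanical_data]", "\n".join(mechanical_data))
    return content

print(fill_template(TEMPLATE, "B6S-E3/80",
                    ["High accuracy", "Easy to use"],
                    ["Weight - 40 g"]))
```

In this sketch the /N markers are converted before the placeholders are filled, so a part number that happens to contain "/N" is left untouched. Replacing /N last would also rewrite such values.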

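With the template ready, the next step is to have ChatGPT or 文心一言 produce formatted data. My question (written in Chinese, translated here) is:

List 10 product features and 10 mechanical data items for [content], as words with specific numbers, and output in English

Note that when the question is actually sent, [content] is replaced with the specific product.

文心一言 returned: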

Product features:

1. High accuracy
2. Durable material
3. Robust construction
4. Easy to use
5. Versatile application
6. Low maintenance
7. Longevity
8. Stable performance
9. Shockproof design
10. Water-resistant

Mechanical data:

1. Diameter (D) - 30mm
2. Length (L) - 60mm
3. Weight (W) - 80g
4. Thread pitch (P) - 1.25mm
5. Thread diameter (d) - 2mm
6. Hole depth (H) - 5mm
7. Lubrication requirement (Grease Type) - 3A
8. Lubrication volume (V) - 0.02L
9. Shear stress (τ) - 17N/mm²
10. Tensile strength (σ) - 25N/mm²

Next, the product features and mechanical data are extracted from the answer and filled into the template. The code is as follows:

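	# Note: the parameter named re is the result dict; inside this method it shadows the standard re module.
	def _assembly_content(self,param,re):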
		if ("" == param["content"]) or ("" == param["product_name"]):
			re["error"] = "param empty"
			return 0

		print(param["content"])

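		# Parse the content. Blank lines separate the blocks and pi counts the blank lines seen so far,
		# so the lines after the 1st blank line are the features and those after the 3rd are the mechanical data.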

		contents = param["content"].split('\n')  
		#print("_assembly_content content[" + param["content"]+ "]")

		ri = 0
		pi = 0
		product_features = []
		product_mechanical_data = []
		for text in contents:
			print(str(ri) + ":" + text)			
			ri = ri + 1

			if "" == text:
				pi = pi + 1
				continue

			text = self.m_help.regular_sub_string(r'^\d+','',text)
			text = self.m_help.regular_sub_string(r'^\.','',text)
			text = self.m_help.regular_sub_string(r'^\s+','',text)			

			if 1 == pi:
				product_features.append(text)
			elif 3 == pi:
				product_mechanical_data.append(text)

		#print("product_features:")
		#print(product_features)

		#print("product_mechanical_data:")
		#print(product_mechanical_data)

		product_features = self.m_help.array_to_string(product_features,"\n")
		product_mechanical_data = self.m_help.array_to_string(product_mechanical_data,"\n")		

		# Assemble the content by filling the template placeholders
		content = self.m_host_info["template"]
		content = self.m_help.string_replace(content,"[product_name]",param["product_name"])				
		content = self.m_help.string_replace(content,"[product_features]",product_features)		
		content = self.m_help.string_replace(content,"[product_mechanical_data]",product_mechanical_data)				
		content = self.m_help.string_replace(content,"/N","\n")						

		re["data"] = content
		return 1
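For readers without the `m_help` helper class, the parsing half of this method can be approximated with the standard library alone. The sketch below is a rewrite under assumptions, not the project's code: `parse_answer` is a hypothetical name, and it relies on the same layout as the answer shown earlier (a heading, a blank line, the feature list, a blank line, a heading, a blank line, the mechanical data).

```python
import re

def parse_answer(answer):
    """Split a model answer into (features, mechanical_data) lists.

    Blank lines separate blocks; block 1 holds the features and block 3
    the mechanical data, mirroring the pi counter in _assembly_content.
    """
    features, mechanical_data = [], []
    block = 0
    for line in answer.split("\n"):
        # Unlike the original, whitespace-only lines also count as blank.
        if line.strip() == "":
            block += 1
            continue
        # Drop a leading list number, period and spaces, e.g. "10. ".
        item = re.sub(r"^\d*\.?\s*", "", line)
        if block == 1:
            features.append(item)
        elif block == 3:
            mechanical_data.append(item)
    return features, mechanical_data
```

Counting blank lines is fragile: if the model adds an introductory sentence or an extra blank line, the blocks shift and the lists come back empty or swapped, so the result is worth validating before it is filled into the template.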

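One point to note: why use regular expressions to strip the number, the period and the spaces from the start of each generated line?

For example, what comes back is: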

1. High accuracy
2. Durable material
3. Robust construction
4. Easy to use
5. Versatile application
6. Low maintenance
7. Longevity
8. Stable performance
9. Shockproof design
10. Water-resistant

We want to remove the leading number, period and space, so that it becomes:

High accuracy
Durable material
Robust construction
Easy to use
Versatile application
Low maintenance
Longevity
Stable performance
Shockproof design
Water-resistant

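The reason is that when the lines keep the number, period and space prefix, GPTZero judges the text AI-written, and when the prefix is removed it does not. The effect is very good.

So the final generated content looks like this: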

Introduction of B6S-E3/80

PRODUCT FEATURES:

High accuracy
Durable material
Robust construction
Easy to use
Versatile application
Low maintenance
Longevity
Stable performance
Shockproof design
Water-resistant

MECHANICAL DATA:

Diameter (D) - 30mm
Length (L) - 60mm
Weight (W) - 80g
Thread pitch (P) - 1.25mm
Thread diameter (d) - 2mm
Hole depth (H) - 5mm
Lubrication requirement (Grease Type) - 3A
Lubrication volume (V) - 0.02L
Shear stress (τ) - 17N/mm²
Tensile strength (σ) - 25N/mm²