What is the best way to check for duplicate TEXT fields in MySQL/PHP?


My code pulls ~1000 HTML files, extracts the relevant information, and then stores that information in a MySQL TEXT field (as it is usually quite long). I am looking for a way to prevent duplicate entries in the DB.

My first idea is to add a hash field to the table (probably MD5), pull the hash list at the beginning of each run, and check for duplicates before inserting into the DB.
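A minimal sketch of that first idea, assuming the hash list has already been loaded into a PHP array keyed by hash. The helper names `contentHash` and `isDuplicate` are hypothetical, and the hard-coded array stands in for whatever the DB query returns:

```php
<?php
// Hypothetical helper: MD5 of the extracted text, as suggested in the question.
function contentHash($text)
{
    return md5($text);
}

// Hypothetical helper: $knownHashes is an array keyed by hash, built once per
// run from the hash column, so each check is a single array-key lookup.
function isDuplicate($text, array $knownHashes)
{
    return isset($knownHashes[contentHash($text)]);
}

// Stand-in for the hash list pulled from the DB at the start of a run.
$knownHashes = array(contentHash('<p>first page</p>') => true);

var_dump(isDuplicate('<p>first page</p>', $knownHashes));  // bool(true)
var_dump(isDuplicate('<p>second page</p>', $knownHashes)); // bool(false)
```

Keying the array by hash avoids scanning the whole list with `in_array()` for every file. Remember to add each newly inserted hash to the array during the run, otherwise duplicates within the same batch slip through.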

My second idea is to store the file length (in bytes or characters), index it, and check for duplicate file lengths, double-checking the content if a duplicate length is found.

I have no idea which is the best solution performance-wise. Perhaps there is a better way?

If there is an efficient way to check whether files are >95% similar, that would be ideal, but I doubt there is one.
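For reference, PHP's built-in `similar_text()` can report a similarity percentage for two strings. The sketch below wraps it in a hypothetical helper, `isNearDuplicate`, using the 95% threshold from the question. It is probably not the efficient method asked for: `similar_text()` gets slow on long inputs, and comparing every new file against every stored one multiplies that cost.

```php
<?php
// Hypothetical helper: true when two texts are at least $threshold percent
// similar according to PHP's built-in similar_text().
function isNearDuplicate($a, $b, $threshold = 95.0)
{
    if ($a === $b) {
        return true; // cheap exact-match shortcut, also covers two empty strings
    }
    $percent = 0.0;
    similar_text($a, $b, $percent); // third argument receives a 0..100 score
    return $percent >= $threshold;
}

// Two 100-character texts that differ only in the last character.
$original = str_repeat('a', 100);
$edited   = str_repeat('a', 99) . 'b';

var_dump(isNearDuplicate($original, $edited)); // bool(true)
var_dump(isNearDuplicate('abc', 'xyz'));       // bool(false)
```

In practice this would only be affordable on a small set of candidates that some cheaper check has already narrowed down.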

Thanks for any help!

BTW I am using PHP5/Kohana

EDIT:

Just had an idea for checking similarity: I could count all alphanumeric characters and log the number of occurrences of each.

eg: 17aB... = 1a,7b,10c,27c,...

potential problem would be the upper limit for a char count (around 61?)

I imagine false positives would still be rare...

good idea/bad idea?
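A rough sketch of the character-count idea, assuming plain byte counts from `count_chars()` rather than the compact encoding shown above. The names `alnumHistogram` and `histogramDistance` are hypothetical, and the distance measure (sum of count differences divided by total count) is just one possible choice:

```php
<?php
// Hypothetical helper: byte value => occurrences, alphanumeric bytes only.
function alnumHistogram($text)
{
    $histogram = array();
    // Mode 1 returns only the bytes that actually occur in $text.
    foreach (count_chars($text, 1) as $byte => $count) {
        if (ctype_alnum(chr($byte))) {
            $histogram[$byte] = $count;
        }
    }
    return $histogram;
}

// Hypothetical measure: 0.0 for identical histograms, 1.0 when the two texts
// share no alphanumeric characters at all.
function histogramDistance(array $a, array $b)
{
    $diff = 0;
    $total = 0;
    foreach (array_unique(array_merge(array_keys($a), array_keys($b))) as $key) {
        $countA = isset($a[$key]) ? $a[$key] : 0;
        $countB = isset($b[$key]) ? $b[$key] : 0;
        $diff += abs($countA - $countB);
        $total += $countA + $countB;
    }
    return $total === 0 ? 0.0 : (float) $diff / $total;
}

// Anagrams have identical histograms, so this can only pre-filter candidates.
var_dump(histogramDistance(alnumHistogram('abc'), alnumHistogram('cab'))); // float(0)
var_dump(histogramDistance(alnumHistogram('abc'), alnumHistogram('xyz'))); // float(1)
```

Because texts with the same characters in a different order look identical here, a low distance is a hint to compare the contents properly, not proof of a duplicate.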

Solution

The hash idea is probably the best. You might have collisions, but they would be exceedingly rare.

Make the hash field a unique key for the table, and catch the duplicate-key error code. Or use INSERT IGNORE or REPLACE.
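As an illustration, here is how the INSERT IGNORE variant might look with plain PDO (not Kohana's own database layer). The table and column names (`pages`, `content`, `content_hash`) and the helper `insertIfNew` are assumptions made for this sketch:

```php
<?php
// Assumed schema (hypothetical names):
//   CREATE TABLE pages (
//       id INT AUTO_INCREMENT PRIMARY KEY,
//       content TEXT NOT NULL,
//       content_hash CHAR(32) NOT NULL,
//       UNIQUE KEY uniq_content_hash (content_hash)
//   );

// Hypothetical helper: returns true if a row was inserted, false if a row
// with the same hash already existed and MySQL skipped the insert.
function insertIfNew(PDO $pdo, $content)
{
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO pages (content, content_hash) VALUES (:content, :hash)'
    );
    $stmt->execute(array(':content' => $content, ':hash' => md5($content)));
    return $stmt->rowCount() === 1; // 0 affected rows means the row was ignored
}
```

With this approach the database enforces uniqueness, so there is no need to pull the hash list at the start of each run. If you use a plain INSERT instead, a duplicate shows up as MySQL error 1062 (SQLSTATE 23000), which you can catch and treat as "already stored".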
