MySQL获取最多重复记录,获取MySQL数据库中重复次数最多的相似字段

Let's assume we have a database like:

Actions_tbl:

--------------------------------------------------------

id | Action_name | user_id|

--------------------------------------------------------

1 | John reads one book | 1

2 | reading the book by john | 1

3 | Joe is jumping over fire | 2

4 | reading another book | 2

5 | John reads the book in library | 1

6 | Joe read a book | 2

7 | read a book | 3

8 | jumping with no reason is Ronald's habit| 3

Users_tbl:

-----------------------

user_id | user_name |

-----------------------

1 | John

2 | Joe

3 | Ronald

4 | Araz

-----------------------

Wondering if I can choose the most repeated similar action regardless of it's user and replace my own user_name with its current user!

Read one book, reading the book, reading another book, read the book in library, read a book and read a book are the ones who have most common WORDS so the staffs related to reading the book is repeated 6 times, my system should show one of those six sentences randomly and replace Araz with user_name

Like: Araz reads the book

My Idea was to

select replace(a.action_name , b.user_name) from actions_tbl a, user_tble b where a.user_id = b.user_id group_by

and then check the similarities one by one in php using

levenshtein()

But this one doesn't have performance at all!

Assume that I want to do the same thing for a big db and for few different tables.

This will destroy my server!!!

Any better IDEA?

in http://www.artfulsoftware.com/infotree/queries.php#552 the

levenshtein() function is implemented as a MySQL function but firstly, do u think it has enough performance? and then, how to use it in my case?

Maybe a self-join van fix this issue but I'm not that good with sql!

* similar action, are the actions that have more than X% common words

** More information and notes:**

I'm limited to PHP and MySQL.

This is just an example, in my real project the actions are long paragraphs. That's why the performance is a matter. The real scenario is: user inputted the description of its project for several projects, those data may be too similar(users would have the same area of work), I want to fill automatically(base on previous fillings) the description of next project, to save time.

I would appreciate if you can have any pragmatical Solution. I checked the NLP related solutions, although they r interesting, but I don't think many of them can be accurate and can be implemented using PHP.

The output should make sense and be a proper paragraph like all other projects. That's why I was thinking of choosing from previous ones.

Thanks for your intellectual answers, its really appreciated if you could shed some light on the situations

解决方案

What you are talking about is a text clustering process. You are trying to find similar pieces of text, and arbitrarily choosing one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and vector (this is the first principal component of the correlation matrix). The similar activities should cluster along this line.

If you have a limited vocabulary and have the terms in a table, you could measure distance between two actions by the proportion of words that overlap. Do you have a list of all words in the actions?

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值