Let's assume we have a database like this:
Project_tbl:
-----------------
id | Project_name
-----------------
1 | A
2 | B
3 | C
-----------------
personel_project_tbl:
--------------------
user_id | Project_id
--------------------
1 | 1
2 | 2
3 | 1
3 | 2
2 | 3
--------------------
instrument_project_tbl:
--------------------------
instrument_id | Project_id
--------------------------
1 | 1
1 | 2
2 | 2
2 | 1
1 | 3
--------------------------
Now, I need to sort the list of projects and rank them with regard to their similarity to the project A.
For example:
A and B have 1 users in common over the 3 users and 2 instruments over the 2 instrument so their similarity ranking is (1/2 + 2/2) / 2 = 75%
A and C have no user in common but have 1 over 2 instruments so it will be (1/2)/2 = 25%
So B is more similar than be and output should be
--------------
Project | Rank
--------------
2 | 75
3 | 25
That's the first solution came to my mind...
If I did it in PHP and MySQL, it would be something like:
for all tables as table_x
for all projects (except A) as prj_y
unique = (Select distinct count(items) from table_x where project is A)
count += (Select distinct count(items) from table_x
where project is prj_x and items are in
(select distinct items from table_x where project is a)
)/unique
So the complexity would be O(n2) and with indexing the select also would cost O(log n) which wouldn't be affordable.
Do you have any idea to do it totally in MySQL or do it in a better and faster way?
******** More information and notes:**
I'm limited to PHP and MySQL.
This is just an example, in my real project the tables are more than 20 tables so the solution should have high performance.
this question is the supplementary question for this one : Get the most repeated similar fields in MySQL database if yr solution can be used or applied in a way for both of them (somehow) It would be more than great.
I want to multiply the value of related projects with the similarity of items to get the best option...
In conclusion, these two questions will : get the most related projects, get the similar items of all projects and find the most similar item for current project where the project is also similar to the current one! yo
Thanks for your intellectual answers, its really appreciated if you could shed some light on the situations
解决方案
You could do it this way:
SET @Aid = (SELECT id
FROM Project_tbl
WHERE Project_name = 'A');
SELECT P.id
, (IFNULL(personel.prop, 0) +
IFNULL(instrument.prop, 0)
)/2*100 Rank
, personel.prop AS personell
, instrument.prop AS instrument
FROM Project_tbl P
LEFT JOIN
( SELECT B.Project_id pid, COUNT(*)/C.ref prop
FROM personel_project_tbl A,
personel_project_tbl B,
(SELECT COUNT(*) AS ref
FROM personel_project_tbl
WHERE Project_id = @Aid
) AS C
WHERE A.user_id = B.user_id
AND A.Project_id = @Aid
GROUP BY B.Project_id
) personel ON P.id = personel.pid
LEFT JOIN
( SELECT B.Project_id pid, COUNT(*)/C.ref prop
FROM instrument_project_tbl A,
instrument_project_tbl B,
(SELECT COUNT(*) AS ref
FROM instrument_project_tbl
WHERE Project_id = @Aid
) AS C
WHERE A.instrument_id = B.instrument_id
AND A.Project_id = @Aid
GROUP BY B.Project_id
) instrument ON P.id = instrument.pid
WHERE P.id <> @Aid
ORDER BY Rank DESC
The idea is to have one subquery for each table, and each of these subqueries maps project id to correspondence ratio for a given table.
I'm saying nothing at all about performance. You'll have to try and see whether it is fast enough for your needs, but as I see it there is no way to beat the O(n2) complexity you mention, as you have to inspect all the data.