I am have a table named 'LINK_INFO' with URLs in a field called 'URL'. The problem is, many duplicates URLs exist EXCEPT some have used a trailing / to get around the unique field requirement.
Example:
What is the statement I can use to select these cases of near duplicates, so I can delete one of them? Many thanks if you can help.
解决方案
You can just use TRIM to find all unique values;
SELECT DISTINCT TRIM(TRAILING '/' FROM url) url
FROM link_info
To delete the duplicates right away, just do a delete join;
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE TRIM(TRAILING '/' FROM li1.url) =
TRIM(TRAILING '/' FROM li2.url)
AND li1.id
Always back up your tables before running arbitrary SQL found on the net, even mine :)
EDIT: If your database machine is limited, you may want to do it using indexes and avoid loading more into memory than necessary;
-- remove all trailing slashes
UPDATE link_info
SET url=TRIM(TRAILING '/' FROM url);
-- create an index on the resulting strings (if there isn't already one)
CREATE INDEX url_index ON link_info(url);
-- delete all duplicates
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE li1.url = li2.url
AND li1.id
-- drop the index if not needed anymore
DROP INDEX url_index ON link_info;
Yet another SQLfiddle.