A many to many relationship table – solving the exclude relation problem
A MySQL many to many relationship is a relationship that is multi-valued in both directions. Given that, how can you select rows that don’t have a specific relationship (the exclude relation problem)?
I will start by defining the MySQL Many to Many relationship (Experts can skip to the next paragraph)
What is A MySQL many to many relationship
A MySQL many to many relationship is a relationship that is multi-valued in both directions.
This type of relationship is helped by the use of a linking table.
For example, a Question can have more than one Category, and a Category can have more than one Question.
CREATE TABLE link ( Question_id int unsigned NOT NULL, Category varchar(20), PRIMARY KEY (`Question_id `,` Category`), KEY `category_question` (`Category`,`Question_id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 +-------------+---------------+ | Question_id | Category | +-------------+---------------+ | 1 | Animals | | 1 | Business | | 1 | Vehicles | | 2 | Animals | | 3 | Business | | 3 | Vehicles | | 4 | Animals | | 4 | Vehicles | | 5 | Business | +-------------+---------------+ |
This is a linking table called link which help us join relationships between the two tables, Questions and Categories.
As we can see, this table is reflecting a many to many relationship (Question 1 has categories Animals, Business and Vehicles. In addition, Category Animals has questions 1, 2 and 4).
What is the exclude relation problem?
Given a many to many relationship table, how can you select rows that don’t have a specific relationship? In terms of the question-category example, this question is translated to the following:
How can you select (say, the first 1,000) questions that do not have a category of Business. In our example, we only want Question_id 2 and 4 returned.
A bad solution
A bad solution would be:
SELECT DISTINCT Question_id FROM link WHERE Category <> "Business" ORDER BY Question_id DESC LIMIT 1000; |
This query will work if a question is only allowed one category. In our case, since a question has at least one category, as long as a question is associated with another category other than Business, it will return. Therefore, the result from this query is Question_id: 1, 2, 3, and 4 which is not satisfied the exclude relation.
First solution
SELECT DISTINCT Question_id FROM link WHERE Question_id NOT IN (SELECT Question_id FROM link WHERE Category="Business") ORDER BY Question_id DESC LIMIT 1000; |
The dependent inner query (SELECT Question_id FROM link WHERE Category<>”Business”) has a performance issue – it will be executed at least 1000 times, and if it is not fast, the delays are multiplied by a number that is at least 1000. When the access method (the type field at the EXPLAIN statement output) is index_subquery, an index lookup optimization is used and a lot of overhead is avoided. When it is range, the sub query is being re-executed as is for each row. You must check your EXPLAIN details.
Second solution
CREATE TEMPORARY TABLE not_wanted_Question_id ( UNIQUE KEY(Question_id) ) ENGINE=MEMORY SELECT DISTINCT Question_id FROM link WHERE Category="Business"; SELECT DISTINCT link.Question_id FROM link LEFT JOIN not_wanted_Question_id nw ON (link.Question_id = nw.Question_id) WHERE nw.Question_id is NULL ORDER BY link.Question_id DESC LIMIT 1000; DROP TABLE not_wanted_Question_id; |
The multiple inner query execution problem is reduced to a lookup on a unique key inside an in-memory table, which is basically a hash lookup and is very fast. However, the building of the unwanted Questions can be a bit expensive for large tables. Note that the creating table, selecting and then dropping should be run together for every time you need the results.
Third solution (the best performed)
This is my best solution for that problem and the most performed for heavy tables:
SELECT Question_id , SUM(IF (Category="Business", 1, 0)) AS exclude FROM link GROUP BY Question_id DESC HAVING exclude=0 LIMIT 10000; |
This works great in our case because we have an index on the Question_id and Category columns. The SUM function is counting the number of the unwanted category occurrences for every question_id in the order and the Having statement is filtering them.
The LIMIT terminates the calculations at the points it gets to the limit number.
I will be happy to receive your comments
How does the independent subquery perform for your data set?
SELECT
DISTINCT Question_id
FROM
link
LEFT JOIN (SELECT Question_id FROM link WHERE Category=”Business”) dt USING (Question_id)
WHERE
dt.Question_id IS NULL
LIMIT
1000;
whenever a standard sql construct exists, always use that instead of the proprietary alternative
in this instance, use CASE instead of IF
makes the sql a lot more portable
;o)
Hi Scott,
This is actually the same query as in the Second solution.
The Third solution gives much faster results.
What’s about a question_id that exist on questions table but NO have category related?
For example if exist a question_id = 6 without relations on link table
This question_id no have a category of Bussiness and it will not appear on result, however, match the condition.
Pablo,
The best to do with a question that don’t have a related category, is to attach it a dummy category (can be called Uncategorized).
That way this question will be in the link table.
Please note that the link table is the best place to select lists of questions related to a given category.
This query may solve this problem including the case of questions without category, but I’m not compared if this is a fastest solution, but is well
Pablo,
Your query will not solve the Exclude Relation problem.
For example, in the link table described in the post example, your query will return the question ids: 1, 2, 3 and 4.
First off, your “NOT IN” solution (First Solution, that has a performance issue) is silly, you should be using NOT EXISTS, ala:
SELECT
DISTINCT Question_id
FROM
Question
WHERE
NOT EXISTS (SELECT 1 FROM link WHERE (link.Question_id = Question.Question_ID) AND (Category = “Business”))
ORDER BY
Question_id DESC
LIMIT
1000;
NOT EXISTS is far cheaper than NOT IN. ‘course, this is still not ideal, as it is an RBAR solution.
As for Pablo, he’s close. What you want is a left outer join for matches, and a null check. eg:
SELECT
DISTINCT Questions.Question_id
FROM
Questions LEFT OUTER
JOIN link ON (Questions.Question_id = link.Question_id and Category = “Business” )
WHERE
link.Question_ID IS NULL
LIMIT
1000;
That said, I haven’t tested the performance of this, and left outer joins aren’t as cheap as inner joins. This is also very similar, semantically, to the sub-select approach.
Brett,
Your solution is good. You have been using the Question table for it.
I haven’t tested your query performance yet.
Thanks for your comment.
CREATE TABLE link (
Question_id int unsigned NOT NULL,
Category varchar(20),
PRIMARY KEY (`Question_id`,`Category`),
KEY `category_question` (`Category`,`Question_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO link (Question_id,Category)
VALUES (1,’Animals’),(1,’Business’),(1,’Vehicles’),(2,’Animals’),(3,’Business’),(3,’Vehicles’),(4,’Animals’),(4,’Vehicles’),(5,’Business’);
CREATE TABLE Question (
Question_id int unsigned NOT NULL,
PRIMARY KEY (`Question_id`));
INSERT INTO Question
VALUES (1),(2),(3),(4),(5);
mysql> SELECT DISTINCT Question.Question_id
FROM Question LEFT OUTER JOIN link ON (Question.Question_id = link.Question_id and Category = “Business” )
WHERE link.Question_ID IS NULL
LIMIT 1000;
+—————-+
| Question_id |
+—————-+
| 2 |
| 4 |
+—————-+
2 rows in set (0.00 sec)