Optimizing the MySQL IN() Comparison Operations Which Include the Indexed Field

The MySQL IN() Comparison Operator is said to be very quick if all the values are constants (the values are then evaluated and sorted first, and the search is done using a binary search).
However, what if the field which the IN clause refers to, is part of the index used to execute the query?
For example, lets create a table t with columns a to e

Create table t (
a int unsigned NOT NULL,
b int unsigned NOT NULL,
c int unsigned NOT NULL,
d int unsigned NOT NULL,
e int unsigned NOT NULL,
KEY `test` (`b`,`c`,`a`));

Now, let’s evaluate the following query:

WHERE b=4 AND c IN (1,3,5,7,9)

In order to execute this query MySQL will need to use the ‘test’ index.

At this case mysql will first evaluate a full and unlimited sub query for each value in the IN clause, then, at completion, will merge the results and limit the returned result set.
This full and unlimited evaluation can be very expensive for large tables.
In order to work around this performance caveat we can force a limited sub query evaluation for each value in the IN clause before the merge. This can be done by using a “union all” clause instead of the IN clause, as in the example below.

This will run much faster:

(SELECT a from t FORCE INDEX (test) where b=4 AND c =1 ORDER BY a DESC LIMIT 100)
(SELECT a from t FORCE INDEX (test) where b=4 AND c =3 ORDER BY a DESC LIMIT 100)
(SELECT a from t FORCE INDEX (test) where b=4 AND c =5 ORDER BY a DESC LIMIT 100)
(SELECT a from t FORCE INDEX (test) where b=4 AND c =7 ORDER BY a DESC LIMIT 100)
(SELECT a from t FORCE INDEX (test) where b=4 AND c =9 ORDER BY a DESC LIMIT 100)
) temp

Some notes:
1. For selecting more than a very few fields, it is much faster to select only the primary field and then to join the result with the same table in order to retrieve the rest of the fields (since it is much faster to sort the primary field without the extra selected fields).
2. Substituting the IN clause with OR clauses does not improve the performance of the query.

Thank you Jeffrey Doak for your comments!

MySQL Quiz
  1. Jeffrey Doak says:

    Are you testing to see if someone is reading this? I created the table and started inserting data, then realized only one row will ever be returned in the query since a is the primary key and you’re specifying a=4.
    Better example would be to take a out of the “test” key or us a > 0 or something.

  2. Jeffrey Doak says:

    I have other issues with your example:
    You can’t order by c in the combined result because c isn’t selected in the query. If you want to order by c, you will need to select it in all the subqueries.

    From my understanding, your statement isn’t correct which is why I was going to check it out. If I use the query analyser, it doesn’t look like the unions would be more efficient at all.

  3. Hazan Ilan says:

    Hi Jeffrey,
    Thank you very much for your messages.
    The problem in my example is that I included the primary a in the second key and in the query.
    I have tested it on my tables and some how mixed thing when I wrote in at the doc.
    I have fixed the example in the post..
    I have checked my fixed example as it is with 3128429 records. As I state, the union query is working much faster.

  4. Jeffrey Doak says:

    You are correct about your observation. I tried it out myself. Using the query profiler, I see the extra time is spent on “Sending data”. This leads me to believe there is an issue with the code in MySQL. Somehow using subqueries and/or unions the slower code is being avoided.

  5. Jeffrey Doak says:

    This is a possible 2 to 10X speed-up if the bug is fixed in the code. I did some quick looking and the closest I found were bug reports that were “explained” away as fragmented data in the table file. That doesn’t explain your example though that’s faster simply by changing the syntax.

  6. rajesh says:

    SELECT twitter_posts.writer_id AS writer_id,twitter_posts.id AS post_id,
    twitter_posts.created_at AS created_at,
    twitter_posts.twitter_id AS tweet_id,
    twitter_posts.text AS text,
    twitter_writers.profile_image_url AS profile_image_url,
    twitter_writers.screen_name AS screen_name
    twitter_writers ON twitter_writers.writer_id=twitter_posts.writer_id
    WHERE twitter_posts.writer_id IN (46,5,6,7,8,10) GROUP BY text ORDER BY twitter_posts.created_at DESC LIMIT 50

    I have used this query it takes more than 40 sec to load data and have 3 billion record in twitter_posts table.I have used indexing on these fileds in search.Please let me know if have better idea about this to optimize this query.

    Thanks In Advance

  7. Hazan Ilan says:

    Please reply with:
    1. The “Show create table” output of the two tables.
    2. The “extended explain with “G” output of this query.

Leave a Reply