A Hint for Entity Framework Multi Join Performance Problem

How I have started my adventure with Entity framework Some of you may refer to one of my previous entries, when I was discussing tool I would use for my new web project. Look for (2. Database and ORM). I made a “brave” decision to choose Entity framework as my Object Relational Mapping tool. Well, I must say, i slightly regret this. To be honest, I would consider choosing other tools - or at least wait a little until more mature version of EF arrive (looking forward to play with it in .NET 4.0). Why I don’t like it anymore There are few things that annoys me with that. Tools that enable to quickly generate model objects from database schema (ADO.NET Entity Data Model Designer) is sort of an “error prone”. Each time i make modifications to db, and try to update model - it fails. And each time it is for a different reason. I completely stopped using it. And how i almost ended up using it But there are far serious problems. Or maybe ,better say, issues. One of them - one that really makes me feel scary was t-sql code generated by entity framework. In so many cases not really efficient one. But really strange when it comes to multiple joins. Have you ever looked at those gigantic “select” statement when you try to “Include” multiple entity collections to the result. I deliberately choose to name it “an issue”, cause i understand that is not an error. Moreover such queries cause some serious performance problems. In my project, I ended up with following code:

var product = _productRepository.GetProductSet()
    .Include("Tags")
    .Include("Attachments")
    .Include("MarkAsDuplicatedBy")
    .Include("MarkAsDuplicatedBy.UserDetails")
    .Include(";FavouriteMarks")
    .Include("FavouriteMarks.User")
    .Include("FavoutiteMarks.User.UserDetails")
    .Include(";Orders")
    .Include("Orders.OrderDetails")
    .Include("Orders.User")
    .Include("Orders.Comments")
    .Include("Orders.Comments.User")
    .Include("Comments")
    .Include("Comments.User")
    .Include("Comments.User.UserDetails")
    .FirstOrDefault(p => p.ProductId == productId);

It is pretty straightforward. You have a product, products have a list of orders. Users can leave comments for each product, order or mark one as their favourite. Moreover, each product can be marked as a duplicate - notifying that there is another exactly the same product in in the store. Easy? Easy. And Now it is lesson I recently learned. Such query is a KILLER! Believe me or not, it produces query of 3500 lines, where word “join” occurs 713 times. Unbelievable! It takes 4 seconds for this query to execute on my machine. Now, playing with it a little I have managed to optimize it a lot. Don’t know if i am such a genius or my first approach was so bad… Rather second option, anyway… First thing we may do is to reduce this code by separating “MarkAsDuplicate” property. That is something that will be used pretty rarely (le us hope so) so we can move it outside main query.

    if(product.MarkAsDuplicatedByReference.EntityKey != null){
        var user = product.MarkAsDuplicatedByReference
                                                .CreateSourceQuery()
                                                .Include("UserDetails").FirstOrDefault();
        product.MarkAsDuplicatedBy = user;
}

We first check if there is an assigned entity key for navigation property. Thanks to that we load more information only under condition there is a reason for that. Such a object is furhere assigned to Secondly we may separate favourite marks, orders and comments.

var favs = product.FavouriteMarks.CreateSourceQuery()
                                      .Include("User")
                                      .Include("User.UserDetails")
                                      .ToList();

foreach (var fav in favs){
    product.FavouriteMarks.Attach(fav);
}

var orders = product.Orders.CreateSourceQuery()
                                      .Include("OrderDetails")
                                      .Include("User")
                                      .Include("Comments")
                                      .Include("Comments.User")

foreach (var order in orders){
    product.Orders.Attach(order);
}

And similarly with comments…. Although such an approach produces four db calls instead of one (like previously) achieved performance improvement is significant. In my case I have managed to drop it from 4 seconds to 1 second. Look at the table below to compare performance:

No of sql statements No of JOINs Time[sec]
Before splitting 1 713 4
After splitting 4 360 0.8

As you can see, number of joins was reduced twice and now our query is 5 times faster. Another problem that arrives with such an approach is that it makes writing unit tests slightly harder. both Include and CreateSourceQuery methods are specific to Entity Framework. One way to overcome this is to prepare extension methods that would work differently depending on the context. Similar to example below:

public static class RepositoryExtensions {

        public static IQueryable<TEntity> LoadWith(this IQueryable<TEntity> obj, string path) {
             var query = obj as ObjectQuery<TEntity>;
             if (query != null)
                 return query.Include(path);
            return obj;
        }

        public static IQueryable<TEntity> CreateQuery(this EntityCollection<TEntity> entityCollection) where TEntity : EntityObject {
             var query = entityCollection.CreateSourceQuery();
             if (query == null)
                 return entityCollection.AsQueryable();
             return entityCollection.CreateSourceQuery();
        }
}

All that is left is to replace all calls to Include with LoadWith, and all calls to CreateSourceQuery with CreateQuery. Basic aim for those extension methods is to enable them to behave differently in case when we deal with ObjectQuery in entity framework environment, and differently when we have just a plain objects. Summary I hope some of You - especially those that are new to Entity Framework, like I am - will find this hint helpful. I have shown that sometimes, especially when we try to load entity with plenty of relations we made end up in a sql hell. In such a case it can be useful to split single query into few smaller. Despite the fact that we loose some performance due to those additional calls we end up with much higher gain. I have also demonstrated how further on such code may be improved for better unit testing. Thanks and happy coding

comments powered by Disqus