Joining data

Abstract

A join combines records from two or more tables in the database.

A join – or a join statement – combines records from two or more tables in the database. This topic describes when to use joins and what you must consider when you use them.

The LINQ layer supports the following four joins: Join, Self Join, Group Join, and Select Many. These four methods use subqueries that are evaluated as users page through data. Lucene.Net does not support these join methods natively, and Solr only supports certain aspects.

Important

You must try to avoid using these methods when you can. They are all bad for performance, both for processing time, amount of I/O, and for memory usage, and this is true for both Lucene and Solr.

For example, you use a Join query with LINQ to join Contacts with the Engagement States they have been in. When users page through the data, the join or subquery is only executed for the first page. That is: given an ID, a subquery is run to look up another document via that ID.

When you execute a Join query in LINQ, it effectively returns a union set of one or many documents. You must consider performance when you write LINQ queries, in the same way as when you write SQL queries with joins. You must also store and index your data properly. Sitecore does not support many-to-one or many-to-many queries in LINQ (for example, a Contact that stores multiple references to Engagement States and then joins those on the Engagement States documents).

You can attach the “foreign-key” to the Engagement State, so that you can join using something like the ContactId instead:

public class Contact {
public string Name { get; set; }
public Guid ContactId { get; set; }
}
public class EngagementState {
public string Name { get; set; }
public Guid ContactId { get; set; }
public Guid Id { get; set; }
}

Given the classes above, and even though it results in duplication of data, this lets you join Contacts to Engagement States via the ContactId. The LINQ layer supports this, for example:

var repo = this.CreateVisitors();
var repoPlans = this.GetStates();
var result = from t in repo
join x in repoPlans on t.ContactId equals x.ContactId
where x.Id == new Guid("E1B604F1-EE0E-408E-A344-869CC45D25D9")
select t;

You can test joins on large amounts of data in the LinqScratchPad.aspx, like this:

using (var context = ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
{
    using (var context2 = ContentSearchManager.GetIndex("sitecore_web_index").CreateSearchContext())
        {
            return context.GetQueryable<SearchResultItem>().Join(context2.GetQueryable<SearchResultItem>()
            .Where(i => i.Name.StartsWith("S")), i => i.ItemId, o => o.ItemId, (o, i) => o).Take(10).ToList();
        }
}

This opens two separate search contexts and then runs a join on them, based on the ItemId. In this example, this query checks which items where the name field starts with an S are in both the web and the master index, and then returns the results of the web index (outer).

When you use the Solr provider, Self Join is the only join that runs a real Solr join. The other methods (such as Join and Group Join) use the enumeration technique that the Lucene.net provider uses: at enumeration time, subqueries are executed to get the other documents.

You need to use a join in the LINQ layer when you have a document that contains a reference to another document. The reference is typically an ID reference.

Sitecore does not flatten the objects it crawls. The crawler implementation has to tell Sitecore how to store data. The Join, Self Join, or Group Join methods run a subquery for the join to get another document and evaluate it based on the ID/Key.

You must make sure that you store your data properly. You can prepare your data so that it is ready for joining, or you can prepare it for a completely join-free retrieval. Both approaches have advantages as well as disadvantages.

Follow these rules to ensure that you solution scales well:

  • Limit the amount of joins.

  • Consider flattening data instead of using a join. This results in multivalued fields and potentially many columns.

  • If you use a join, keep your paging of the data small – 10 or 20 items at a time. The reason is that if you have a query that takes 100 milliseconds to run, it could take 1.1 seconds to return the subqueries and the initial query.

  • Only store or index what is 100% necessary (unless this has performance implications).

For example, if you have the following two tables of data:

ID

Value

1

A

2

B

3

C

ID

ChildValue

1

a1

1

a2

1

a3

2

b1

2

b2

When you use the Join method on the two lists on the Id field, the result is:

Value

ChildValue

A

a1

A

a2

A

a3

B

b1

B

b2

If you use the Group Join method on the two lists on the Id field, the result is:

Value

ChildValue

A

[a1, a2, a3]

B

[b1, b2]

C

[]

If you use the Self Join method on the two lists on the Id field, the result is:

Value

ChildValue

A

a1

A

a2

A

a3

B

b1

B

b2

If you use the Self Join method, you do not need to provide a resultSelector parameter. It is inferred that it returns the outer result. This is essentially the difference between an inner and outer join.

Note

When you run join queries, your inner or outer joins should have a filter that limits the results. Otherwise, the join queries will try to join on every single document in your index.