Jul 26, 2009

Data virtualization

52DataVirtualization

My last post covered the current UI virtualization support on Silverlight and WPF. In this post, I will discuss another technique that helps improve performance when dealing with large data sets: data virtualization.

A control that uses data virtualization knows to keep in memory no more data items than the ones needed for display. For example, if a data-virtualized ListBox is data bound to 1,000,000 data items and only displays 100 at a single point in time, it only keeps about 100 data items in memory. If the user scrolls, the data-virtualized ListBox knows to swap the items in memory in a way that optimizes performance and memory consumption.

I talked about a possible solution that provides partial data virtualization for TreeView in a previous post. I discussed that solution in the context of WPF, but it could just as easily be used for Silverlight. The discussion in this post addresses data virtualization for non-hierarchical scenarios.

Data virtualization in WPF

WPF has no built-in generic support for data virtualization.

Fortunately, Paul McClean and Vincent Van Den Berghe have recently come up with approaches to work around this limitation (the link to Vincent’s code in the pdf is incorrect, so I’ve uploaded his code here). These two solutions haven’t been advertised enough, in my opinion! They’re both great solutions and easy to adapt to your specific scenario.

Paul and Vincent’s solutions are very similar to each other – they both implement a layer that is capable of managing data items that are kept in memory. This layer knows when to fetch more data items and when to let old data items be garbage collected, based on the user’s scrolling patterns.

Both solutions take advantage of the fact that when a WPF ListBox is data bound to an IList, it only accesses the list’s Count property and the data items it needs to display on the screen. Both Paul and Vincent implemented a “fake” collection that knows the overall count of the real collection and knows how to retrieve a particular data item, even if that data item is not in memory. This collection essentially manages a cache of items and, when asked for new data items, it’s capable of retrieving a new set of data items from the database, discarding data items that are no longer in use.

The implementation details are quite different, though (this is expected – neither of them was aware of the other one’s work).

Paul implemented his own collection called VirtualizingCollection<T>. This collection has a handle to an IItemsProvider, which is a class that knows how to fetch the overall item count and a consecutive list of items from the database (or web service, XML file or whatever data store you’re using). VirtualizingCollection<T> keeps the list of non-virtualized data items and decides when new items should be retrieved. Its indexer immediately returns a data item if it’s already in memory, or loads more data from the database if it’s not. Paul also implemented an extension to this collection called AsyncVirtualizingCollection<T> that provides the ability to do expensive query operations without blocking the UI thread.

lang=en-US>Vincent implemented not only his own collection, but also a custom view type for that collection. His collection (VirtualList<T>) implements lang=pt>ICollectionViewFactory, which allows the collection to say that it wants WPF to create a VirtualListCollectionView lang=en-US><T> when binding to it. His view has a handle to the collection and knows to forward any data loading operations to the collection, as well as any sorting and filtering the user may have specified. The data item caching code is in VirtualListBase<T>, which is the base class for both the view and the collection. Vincent also implements a DataRefBase<T> class that wraps each data item, exposing all of its properties using the ICustomTypeDescriptor interface, and adding property change notifications if not present in the original data type.

At the end of this post, I link to an xbap that shows both solutions running side-by-side. Below I compare different aspects of these solutions, and discuss pros and cons of each solution when applicable.

Using these solutions in your project

If you want to use Paul’s solution, you need to provide your own implementation of IItemsProvider. This is where you would add code that retrieves the overall count and a page of items from your database. Then you just need to bind your control to a collection of type AsyncVirtualizingCollection<T>. Here is the example included in Paul’s sample code:

    public class DemoCustomerProvider : IItemsProvider<PaulsCustomer>
    {
        private readonly int _count;
        private readonly int _fetchDelay;

        public DemoCustomerProvider(int count, int fetchDelay)
        {
            _count = count;
            _fetchDelay = fetchDelay;
        }

        public int FetchCount()
        {
            Thread.Sleep(_fetchDelay);
            return _count;
        }

        public IList<PaulsCustomer> FetchRange(int startIndex, int count)
        {
            Thread.Sleep(_fetchDelay);

            List<PaulsCustomer> list = new List<PaulsCustomer>();
            for (int i = startIndex; i < startIndex + count; i++)
            {
                PaulsCustomer customer = new PaulsCustomer { Id = i, Name = "Customer " + i };
                list.Add(customer);
            }
            return list;
        }
    }

    DemoCustomerProvider customerProvider = new DemoCustomerProvider(numItems, fetchDelay);
    sample1.DataContext = new AsyncVirtualizingCollection<PaulsCustomer>(customerProvider, pageSize, pageTimeout);

In Vincent’s solution, you need to define a “Load” function somewhere in your ViewModel and pass that as a parameter to the collection. For example:

    private int Load(SortDescriptionCollection sortDescriptions, Predicate<object> filter, VincentsCustomer[] customers, int startIndex)
    {
        Thread.Sleep(fetchDelay);

        // Sorting
        bool isDescending = sortDescriptions != null && sortDescriptions.Count > 0 && sortDescriptions[0].Direction == ListSortDirection.Descending;
        int customerIndex;

        for (int i = 0; startIndex < numItems && i < customers.Length; ++i, ++startIndex)
        {
            customerIndex = isDescending ? numItems – startIndex – 1 : startIndex;
            customers[i] = new VincentsCustomer { Id = customerIndex, Name = "Customer " + customerIndex };
        }

        return numItems;
    }

    sample2.DataContext = new VirtualList<VincentsCustomer>(Load, 6 /* # of pages in memory at the same time */, pageSize);

Caching

Paul’s solution divides the original collection in pre-defined pages. When an item is accessed, Paul’s solution brings into memory the page that includes it, as well as the previous or next page, depending on whether the item belongs to the first or second half of its page. Paul also keeps track of the time each page is accessed. Every time a new item is retrieved, pages that haven’t been touched for a certain amount of time are discarded.

Vincent keeps a linked list of pages in memory. When a new page is brought into memory or an existing page is accessed, it is moved to the beginning of the list. If the number of pages exceeds a specified limit, the pages at the end of the list are discarded.

Sorting and filtering

Paul’s solution doesn’t address sorting and filtering. One way to add this functionality to his code is to change the FetchRange method of IItemsProvider to accept sorting and filtering information. In your implementation of IItemsProvider, you would implement FetchRange taking that information into account.

Because Vincent implements his own view, you can add SortDescriptions and a filter delegate as usual. Vincent takes care of passing this information to the Load method. It’s up to you to create a database query that incorporates that information in your implementation of Load.

Threading

Paul’s solution fetches the data in a second thread, therefore it doesn’t block the UI. The code that enables this scenario is in AsyncVirtualizingCollection<T>.

In Vincent’s solution, everything happens synchronously. This means that the UI will be unresponsive while the database is being queried.

DataTemplates

In Paul’s solution, we can use implicit DataTemplates as usual (a DataTemplate is “implicit” when it specifies a DataType but no key).

In Vincent’s solution, the items in the collection are of type DataRefBase<T>, which wraps your original type and provides property change notifications. Unfortunately, you can’t use an implicit template in this scenario because the type is generic. So, in this post’s project, I opted to use the DataTemplate explicitly when using Vincent’s solution (I refer to the DataTemplate using an explicit key).

(XAML 2009 adds support for generic types, but this feature is not yet available for compiled XAML scenarios. You can read more about this in Rob Relyea’s blog.)

Master-detail Scenario

Vincent’s solution works well when I implement a master-detail scenario by synchronizing the current and selected items:

    <ListView ItemsSource="{Binding}" IsSynchronizedWithCurrentItem="True" …>
    <ContentControl Content="{Binding Path=/}" … />

Paul’s solution doesn’t work with the XAML above because WPF’s ListCollectionView calls IndexOf to track the current item, and Paul’s collection doesn’t support IndexOf. As an alternative, I implemented the master-detail scenario by binding the ContentControl’s Content to the ListView’s SelectedItem:

    <ListView ItemsSource="{Binding}" Name="lv1" …>
    <ContentControl Content="{Binding ElementName=lv1, Path=SelectedItem}" … />

This second master-detail implementation works just as well as the first, and in fact, many people find the second solution easier to understand.

Selection

One shortcoming of Paul’s solution is the fact that a “collection reset” notification is raised every time a new page is loaded. As a result, selection is lost every time a new page is loaded. For example, if a user holds the down-arrow key to scroll quickly through a list, selection will jump back to the top of the list when a new page of data items is loaded.

This problem might be addressed by providing more granular collection change notifications when new data is available (look at the FireCollectionReset method in AsyncVirtualizingCollection<T>).

Summary

When comparing Paul’s solution to Vincent’s, you may prefer one over the other, depending which features you find valuable. An ideal solution would combine the best parts of both. Let me know if you come up with a good hybrid!

Data virtualization in Silverlight

Like WPF, Silverlight does not have built-in support for data virtualization based on scrolling. Unlike WPF, Silverlight has support for data virtualization through paging using PagedCollectionView. I won’t talk about PagedCollectionView here – instead I’ll focus the discussion on the scrolling based solution.

Vincent’s solution requires several features that are not present in the subset of .NET supported by Silverlight, but Paul’s solution compiles in Silverlight with very minor changes. However, the implementation of views in Silverlight is less optimized than in WPF, preventing this approach to data virtualization from working.

The internal implementation of views in WPF accesses only the items that it needs for display (as well as the count of the collection). This is what makes custom data virtualization solutions conceptually easy – we only need to keep in memory the data items that are accessed, and can discard all others.

The implementation of views in Silverlight, on the other hand, accesses all data items in the original collection at load time, independent of how many items it displays on the screen. This affects performance at load time in all scenarios, but because the delay is proportional to the count of the collection, it especially affects scenarios with large data sets. There is no way around this, other than re-implementing large portions of Silverlight’s collection views code. If you use .NET Reflector, you can look at this code in the InitializeSnapshot method of EnumerableCollectionView. This is, in my opinion, a big limitation of Silverlight that I am hoping will be addressed soon.

This fact increases the load time significantly, but by itself it shouldn’t cause the UI to hang for too long (depending on the number of items in the collection) because Paul’s solution loads pages asynchronously. However, because the collection is reset every time a new page is brought into memory, the effects of looping through all items are magnified. For example, if your collection has 100,000 items and the page size is 100 items, InitializeSnapshot’s loop will bring 1,000 pages into memory at load time, asynchronously. As each of those pages is ready to be loaded, a collection reset is triggered, causing Silverlight to loop through all 100,000 items again, causing all pages to be loaded into memory again. At the end of this post, I provide a link to a Silverlight project with Paul’s data virtualization implementation so you can see it hang and debug it in case you’re curious.

I am not aware of any custom data virtualization solution based on scrolling that works with Silverlight at this point, so here’s a fun challenge for you! Please send me email if you come up with a solution for this problem.


You can run an xbap showing both data virtualization solutions side by side, and you can download the corresponding WPF project.

You can also download a Silverlight project that shows data virtualization *not* working.

Update: If you’re looking for a data virtualization solution for WPF, make sure you also read my more recent post.

30 Comments
    • Bea

      Thanks Vlad, this is good info for DB4 users.

      Although, by reading the description in the link you sent, I see that it only supports grouping for the page in memory in paging scenarios, not for the whole list. Also, I’m unclear about the grouping support for non-paging scenarios. If you (or someone) can expand a bit on the information in that link, I’m sure others would appreciate.

      Bea

  1. sacha

    Great post as usual Bea. The Binding Queen is back, and with a new Blog to boot. And I still owe you that beer for letting me use (uh steal) your planet listbox in one my codeproject articles.

    Hopefully at next years MVP summit, Ill buy, hold me to it.

    Thanks for sharing this

    • Bea

      Nice to hear from you Sacha. I will certainly hold you to your promise :)

      Bea

  2. Teusje

    nice, tnx for sharing this!

  3. jean-paul

    I have been testing the virtualization tricks by binding to a dumb IList implementation that just returns the requested item index as a string. I’ve verified that even for huge virtualized lists only the visible items are requested (as expected).. however, when you scroll towards the end of the list the UI updates become really slow, even though for every update it only fetches the 10 or so visible items from the list. On my machine the delays become obvious at around 15 million items, and >1 second delays at around 50 million.

    Just a heads up for the 2 people in the galaxy that have to display billion item lists :)

    • Bea

      Hi Jean-Paul,

      I’m actually glad to hear that this data virtualization solution can handle billions of data items (something I expected but never tested) :)
      However, I do find it a bit strange that it would slow down with *any* amount of data, even with huge data sets like the ones you’re using. Are you sure that the delays are in the rendering of the UI, and not in the fetching of the data? Do you have a way to profile your app and know exactly where the delay is?

      Thanks,
      Bea

  4. dan

    Hi Bea,

    I am looking for a fix to the selection problem with Pauls AsyncVirtualizationCollection but have yet to get a working solution. Have you had any ideas on this?

    Many thanks for the excellent blog..
    Dan

  5. Richard

    Bea, I implemented Vincents solution in my WPF app, however I came accross an issue that I can’t get around. I need to programmatically set the selected value of my listbox and scroll the selected item into view. This works the first time I do it but causes an exception on each subsequent time that I try to call ListBox.ScrollIntoView(ListBox.SelectedItem). Do you have any idea why this is and even better, what I can do to get past it?

    • Bea

      Hi Richard,

      I’ll put you in contact with Vincent privately. He may be able to help.

      Thanks,
      Bea

    • Bea

      Hi Andrus,

      Yes, that is the best solution for data virtualization if you’re using Silverlight. PagedCollectionView makes it very easy to add paging to Silverlight, and it enables large data set scenarios that weren’t possibly previously. I added a short reference to it in the blog post.
      (Although in my opinion, data virtualization through scrolling offers a much better user experience than paging.)

      Thanks for your comment!

      Bea

  6. Smitha R Mangalore

    Hi Bea,

    I am great fan of your WPF blogs and was recently looking for UI and Data Virtualization in WPF. I read your article where you talked about Paul and Vincent approaches.
    Paul article seems to be good. but needs work around to make it work with WPF Toolkit DataGrid which Paul promised to get it done in future.

    There is one more article on the same topic…
    http://beta.blogs.microsoft.co.il/blogs/tomershamam/archive/2009/10/01/ui-virtualization-vs-data-virtualization-part-2.aspx

    May be you can also add this into your article and compare all the three approaches and would be helpful for WPF newbie like me  

    Thanks and Regards
    Smitha

    • Bea

      Hi Smitha,

      My data virtualization implementation which merges Paul’s and Vincent’s solutions works with DataGrid. Check out this post for a sample with DataGrid.

      I read the article you link to and it seems to be another very good data virtualization solution. I like the fact that he uses proxies for each data item, and I like his clever solution to get a chunk of items using GetItemAt (the code snippet he posted). I also like the fact that his approach to the problem is very simple. I haven’t run his code though, so I can’t speak to that.

      Thanks for pointing out this article.

      Bea

  7. Smitha

    Thanks Bea. :) You are simply superb.

    Regards,
    Smitha

  8. Thomas Lebrun

    Hi Bea,

    Any news for a solution to data virtualization that works, in Silverlight, on scrolling ?

    I would be very interesting in having a such solution.

    Thanks !

    Thomas

    • Bea

      Hi Thomas,

      I haven’t yet heard of anyone who has implemented a data virtualization solution for Silverlight. If you find someone who has, please post it here, as I’m sure others would benefit from that knowledge.

      Thanks,
      Bea

  9. steve buddy

    In Vincent’s solution, everything happens synchronously. This means that the UI will be unresponsive while the database is being queried.

    Has anyone modifies Vincent’s solution so there is threading like there is in Pauls code. If so could you email me your source code or solution at stevensrf1@inbox.com
    Thanks

    • Bea

      Hi Steve,

      Please take a look at my other blog post that combines advantages of both solutions.

      Bea

    • Bea

      Hi,

      Yes, it appears that the code was removed from Vincent’s server. I’ve uploaded his source code to my server, and added the link to the blog post. You can find it here.

      Also, please keep in mind that I have another blog post that combines advantages of both solutions compared in this post.

      Thanks,
      Bea

      • steve

        Bea thanks for the uploading of the source code. Bea I read about everything you post and I did read your posting where you combined the two solutions. The stuff you post is so for advance of my knowledge it takes me about a month to understand what you are posting. Thanks for all you posting

  10. Gary

    Hi,
    If binding with grouping in ListView, Data Virtualization still possible?

    Regards
    Gary

    • Bea

      Hi Gary,

      The filtering, sorting and grouping built-in features in WPF need to access all items of the collection on the client side, which negates all the benefits of data virtualization. The only way to perform these operations together with data virtualization is to implement them yourself on the server side. I show in this post how to do that for filtering, and I have a post coming that shows the same thing for sorting. You could use those as inspiration to implement grouping (or maybe I’ll do that for a future post, if time permits).

      Hope this helps.

      Bea

  11. steve fred

    How do I apply this slice quick sorting Vincent’s talks about when information is
    retrieved from a database table. In both Bea’s download source code and Vincent’s source
    code there is never any real sorting done per say. New VincentCustomer objects are create
    on the fly and with the information just be create base on the an index number.

    I query a database table for information and the information is stored in a DataTable
    object. There are about 100000 rows in the DataTable object. The information in the
    DataTable will be to displayed in a listview. Each row in the DataTable object represents
    an object that will be displayed in the listview. So in Bea’s download code when the Load
    method is called startindex would reference a row position in the DataTable from which a
    new VincentCustomer is created. Now the startindex to row position indexing is no problem
    when no sorting is not involved. When sorting is involved I have to sort the whole
    DataTable and now just section of the Virtual list as appears in the download code and
    Vincent talks about. Am I missing a connection here? Since my DataTable creates the
    VincentCustomers how do I sort a slice of the DataTable? The VirtualList never ever holds
    all 100000 object that the DataTable object contains so in reality does this quick sorting
    routine Vincent’s talk about actually work in reality?

    If anyone has some source code that downloads information from a database and uses
    Vincent’s data virtualization and has the quick sorting working could you post a link to
    the source code so other can download it? All I can see is having to sort the how
    DataTable using the default View of the DataTable since the VirtualList never contains all
    1000 objects. If I am missing something could some post what I am missing?
    steve_44@inbox.com is my email address if you would like to email a link to some source or email me some source code.

    • Bea

      Hi Steve,

      I wrote a blog post that covers the scenario you describe – sorting with data virtualization.

      Thanks,
      Bea

    • Bea

      Thanks Kent, your solution is a good compromise for the lack of support of data virtualization in Silverlight. Thanks for letting me know.

      Bea

Comments are closed.