HTML Agility Pack – UWP

You may have tried to build an application that consumes content from the internet. If you’re lucky, a server provides that content through a RESTful endpoint (as JSON, XML or even plain text). But what if the server is not ready yet and you want to start building the app? What if the company won’t invest in building server-side APIs and wants you to use the website it already has?

The first thing you might think of is to download the HTML page and start “extracting” the content out of it using regular expressions. The problem with this solution is that it is neither easy nor clean, and it might take days to make it work for a single website.

HTML Agility Pack to the rescue!

HTML Agility Pack is an HTML parser that can extract data out of HTML pages, either by using the XPath language or by using simple search queries.
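
To make the “simple search queries” idea concrete, here is a minimal sketch that parses an inline HTML string and filters nodes with LINQ. The class name QuickParseSketch, the method GetMovieNames and the “movie” class in the sample markup are made up for illustration; only HtmlDocument, LoadHtml, Descendants, GetAttributeValue and InnerText come from the library. The XPath route would use SelectNodes with an expression such as //div[@class='movie'], but XPath support has varied between package builds, so this sketch sticks to LINQ.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class QuickParseSketch
{
    // Hypothetical helper: returns the trimmed text of every <div class="movie">.
    public static List<string> GetMovieNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the string into a tree of HtmlNode objects

        return doc.DocumentNode
            .Descendants("div")                                        // every div, at any depth
            .Where(d => d.GetAttributeValue("class", "") == "movie")  // keep only the ones we want
            .Select(d => d.InnerText.Trim())
            .ToList();
    }
}
```

Calling GetMovieNames with `<div class="movie"> First </div><div>Ad</div><div class="movie">Second</div>` should return the two entries “First” and “Second” and skip the div that has no class.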

DISCLAIMER: Before we go through this, you need to read the website’s usage policy. Some websites don’t allow parsing their content, so be careful.

In this tutorial I’m going to try to get content out of the IMDb website. I picked this website because it doesn’t have open APIs and it has rich content. (But again, the usage policy doesn’t allow us to use it in production, so we will use it for learning purposes only.)

So the web page I’m going to parse is the “movies in theaters” page, which shows the latest movies:
http://www.imdb.com/movies-in-theaters/

Let’s start by creating a new project: start Visual Studio and create a new blank UWP app.

Now we need to create the model of our project, so we can hold the data we get back from the IMDb website. Create a class called Movie, and add three properties like this:

public class Movie
{
    public string Title { get; set; }
    public string Cover { get; set; }
    public string Summary { get; set; }
}

Now we need to create a simple view. In MainPage.xaml, we will add a GridView and edit its ItemTemplate so we can show the movies there. Your XAML should look like this:

<Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
    <GridView x:Name="lstMovies">
        <GridView.ItemTemplate>
            <DataTemplate>
                <Grid Width="300" Height="200" Margin="5">
                    <Grid.ColumnDefinitions>
                        <ColumnDefinition />
                        <ColumnDefinition />
                    </Grid.ColumnDefinitions>
                    <Image Source="{Binding Cover}" />
                    <Grid Grid.Column="1">
                        <Grid.RowDefinitions>
                            <RowDefinition Height="Auto" />
                            <RowDefinition Height="*" />
                        </Grid.RowDefinitions>
                        <TextBlock Text="{Binding Title}" FontSize="14" FontWeight="Bold" />
                        <TextBlock Grid.Row="1" TextWrapping="Wrap" TextTrimming="WordEllipsis" Text="{Binding Summary}" />
                    </Grid>
                </Grid>
            </DataTemplate>
        </GridView.ItemTemplate>
    </GridView>
</Grid>

So far, we have simply created a model and designed the view. Now we need to do the real work: download the HTML page, parse it, and create the list of movies. Let’s install the HTML Agility Pack from the NuGet Package Manager: right-click the project and click Manage NuGet Packages, then search for “Html Agility Pack” and install it.

To understand the structure of our page, we’re going to use the developer tools in Microsoft Edge (you can use Chrome, Internet Explorer or even Firefox). In Microsoft Edge, navigate to the IMDb page, right-click anywhere, and click “Inspect element”. (If you can’t find the “Inspect element” option, press F12 on your keyboard to open the developer tools.)

On the “Elements” tab, notice that when you hover over the HTML source code, Microsoft Edge highlights the part of the page represented by that code. In our case, we’re trying to get the list of movies. Looking through the code, we can see that we want all the divs whose class starts with “list_item”.

For each movie, we need to extract the title, cover and the summary.

For the image, I kept digging and found that I need a div with the class “image”; inside it there’s an anchor (an a tag), then another div, and finally an img tag.

Doing the same for the title and the summary of the movie, I found this:

Title: h4 tag with the itemprop “name”
Summary: div with the class “outline”

Notice that I didn’t have to walk through every tag all the way down to the content, because with HTML Agility Pack I can skip the nested tags and read the “InnerText” directly.
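
A small sketch of that InnerText shortcut, using a made-up fragment that mimics the structure described above (the class InnerTextSketch, the method GetTitle and the sample markup are assumptions for illustration, not IMDb’s actual HTML):

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class InnerTextSketch
{
    // Hypothetical helper: finds the h4 with itemprop="name" and returns its text,
    // ignoring any tags nested inside it. Returns null when no such h4 exists.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var heading = doc.DocumentNode
            .Descendants("h4")
            .FirstOrDefault(n => n.GetAttributeValue("itemprop", "") == "name");

        // InnerText concatenates the text of all descendant nodes.
        return heading?.InnerText.Trim();
    }
}
```

For the input `<h4 itemprop="name"><a href="/title"> Some Movie <span>(2016)</span></a></h4>`, GetTitle should return “Some Movie (2016)”: the a and span tags are skipped without any extra code.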

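Let’s start implementing that in code. We first need to load the HTML source we downloaded into an HTML document (held in a variable named htmlDocument here), then loop over the “list_item” divs and build a Movie from each one: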

// Requires using System.Collections.Generic, System.Linq and HtmlAgilityPack.
List<Movie> movies = new List<Movie>();
// Every movie lives in a div whose class starts with "list_item".
foreach (var div in htmlDocument.DocumentNode.Descendants("div").Where(i => i.GetAttributeValue("class", "").StartsWith("list_item")))
{
    Movie newMovie = new Movie();
    // ?. keeps a missing node from throwing; the property is simply left null.
    newMovie.Cover = div.Descendants("div").FirstOrDefault(i => i.GetAttributeValue("class", "") == "image")?.Descendants("img").FirstOrDefault()?.GetAttributeValue("src", "");
    newMovie.Title = div.Descendants("h4").FirstOrDefault(i => i.GetAttributeValue("itemprop", "") == "name")?.InnerText.Trim();
    newMovie.Summary = div.Descendants("div").FirstOrDefault(i => i.GetAttributeValue("class", "") == "outline")?.InnerText.Trim();
    movies.Add(newMovie);
}
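
The loop above assumes that htmlDocument is already populated. A minimal sketch of that download-and-load step could look like the following; it assumes HttpClient from System.Net.Http for the download, and the class PageLoaderSketch and its two method names are hypothetical, not taken from the original project.

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PageLoaderSketch
{
    // Parses an HTML string into an HtmlDocument.
    public static HtmlDocument LoadDocument(string html)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);
        return htmlDocument;
    }

    // Downloads the page at the given URL and parses the response body.
    // GetStringAsync throws HttpRequestException on a failed request,
    // so callers may want to wrap this call in a try/catch.
    public static async Task<HtmlDocument> DownloadDocumentAsync(string url)
    {
        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(url);
            return LoadDocument(html);
        }
    }
}
```

From an async method, such as a handler for the page’s Loaded event, this would be called as `var htmlDocument = await PageLoaderSketch.DownloadDocumentAsync("http://www.imdb.com/movies-in-theaters/");` before running the loop.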

Finally, we set the ItemsSource of the GridView in the view to the list of movies:

lstMovies.ItemsSource = movies; 

Running the app will get you this view:

You can get the complete source code from this GitHub repository:
https://github.com/TareqAteik/SampleUWPHtmlAgilityPack
If you have any questions, please reach out to me on Twitter @tareqateik
