Scraping structured data with PowerShell and HTML Agility Pack

Last update: 10/10/2025
Author: Isaac
  • Html Agility Pack enables robust parsing and querying of HTML with XPath/LINQ.
  • PowerShell and C# cover static scraping; Selenium handles pages that require JavaScript rendering.
  • CsvHelper and IronPDF make it easy to export data to CSV or generate reports in PDF.
  • Using proxies reduces blockages and enables regionally focused scraping.

Scraping with PowerShell and HTML Agility Pack

If you are looking for a practical way to extract structured data without going crazy with regular expressions, combining PowerShell with the HTML Agility Pack is one of those solutions that saves you time and trouble. This stack lets you navigate the DOM, locate nodes using XPath or LINQ, and reliably read text, attributes, or HTML, even when the markup isn't perfect.

In the following lines we combine the best of several approaches: PowerShell, C#, and Selenium to cover both static and dynamic content, real-world examples (such as extracting the body of Craigslist ads), CSV export, and even the ability to convert results to PDF with IronPDF. All with useful tricks, such as using proxies to avoid blocks, and recommendations to keep your selectors robust over time.

What is HTML Agility Pack and why is it so useful?

HTML Agility Pack (HAP) is a .NET library that parses HTML into a node tree that you can navigate, query, and manipulate. Unlike other, more fragile approaches, HAP tolerates poorly formatted HTML and lets you navigate the DOM with XPath or LINQ using a simple API.

Among its strengths are lenient parsing (it tolerates imperfect HTML), DOM manipulation (adding, deleting, or modifying nodes and attributes), support for XPath and LINQ, and good performance even on large documents. Plus, its design is extensible, so you can implement custom filters or handlers whenever needed.
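
As a quick illustration of the DOM-manipulation point, here is a minimal sketch (not part of the original flow) that strips every <script> node before extracting text:

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><p>Hello</p><script>alert('x');</script></body></html>");

// Remove every <script> node before working with the text
var scripts = doc.DocumentNode.SelectNodes("//script");
if (scripts != null)
{
    foreach (var s in scripts) s.Remove();
}

Console.WriteLine(doc.DocumentNode.InnerText);   // prints the text without the script content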

Load and parse HTML with HAP: file, string, or web

To start, you can load HTML content from a local file, from a string in memory, or directly from a URL. The key class is HtmlDocument, and for the web it is convenient to use HtmlWeb and its Load() method.

// From a file
var doc = new HtmlDocument();
doc.Load(filePath);

// From a string
var doc2 = new HtmlDocument();
doc2.LoadHtml(htmlString);

// From the web
var web = new HtmlWeb();
var doc3 = web.Load("http://example.com/");

Once the document is loaded, you access the root node with DocumentNode. From there, you can select nodes by XPath or with LINQ, and read properties such as OuterHtml, InnerText, Name, or the Attributes collection with ease.

Selecting and reading nodes: XPath, attributes, and text cleaning

With XPath you can locate specific elements without struggling with HTML. The library offers SelectSingleNode() for a single result and SelectNodes() when you expect several.

// A single node (for example, the page <title>)
var titleNode = doc.DocumentNode.SelectSingleNode("//head/title");

// Several nodes (for example, all <article> elements)
var articles = doc.DocumentNode.SelectNodes("//article");

// Reading useful information
var name = titleNode.Name;            // tag name of the node
var html = titleNode.OuterHtml;       // full HTML of the node
var text = titleNode.InnerText;       // plain text of the node
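
Attributes work the same way. Here is a small sketch, assuming the page contains at least one <a> element and reusing the doc object loaded earlier, that lists a node's attributes and reads one with a default value:

// Iterate over a node's attributes and read one with a fallback value
var link = doc.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
    foreach (var attr in link.Attributes)
    {
        Console.WriteLine($"{attr.Name} = {attr.Value}");
    }
    var href = link.GetAttributeValue("href", string.Empty);
}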

When your text contains HTML entities, you can “clean” the content using utilities like HtmlEntity.DeEntitize() or, if you prefer the BCL, System.Net.WebUtility.HtmlDecode(). This gives you more natural text, ready to be processed as data.

// Cleaning HTML entities from extracted text
var limpio = HtmlEntity.DeEntitize(titleNode.InnerText);
// o
var limpio2 = System.Net.WebUtility.HtmlDecode(titleNode.InnerText);

PowerShell + HTML Agility Pack: Inspection, Methods, and Real-World Extraction

Many teams prefer PowerShell because it allows for very rapid scraping prototyping. You can load the HAP DLL (e.g., version 1.11.59) and use its classes from scripts. If you've worked with modules like PSParseHTML, you've actually been using HAP underneath.

# Load the DLL (adjust the path to your environment)
$hapPath = 'C:\path\to\HtmlAgilityPack.dll'
[Reflection.Assembly]::LoadFile($hapPath) | Out-Null

# Download a page and load it into HtmlDocument
$dest = "$env:TEMP\page.htm"
$wc = New-Object System.Net.WebClient
$wc.Credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
$wc.DownloadFile('http://localhost/mihtml.html', $dest)

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load($dest)
$root = $doc.DocumentNode

# For example, iterate over the rows of a table
$rows = $root.Descendants('tr')
foreach ($row in $rows) {
    $cells = @($row.Descendants('td'))   # force an array so Count and indexing work
    if ($cells.Count -ge 2) {
        Write-Host ($cells[0].InnerText + ' - ' + $cells[1].InnerText)
    }
}

A common question when exploring these objects in PowerShell is where GetAttributeValue() comes from and why multiple signatures appear. In HtmlAgilityPack, nodes expose that method with overloads that accept the attribute name and a default value typed as string, int, bool, etc. PowerShell may display it as a generic method (T) or as several overloads depending on how it resolves the types, but for practical purposes you'll use it like this:

# Get an attribute (with a default value if it does not exist)
$href = $node.GetAttributeValue('href', $null)
$tabIndex = $node.GetAttributeValue('tabindex', -1)
$esActivo = $node.GetAttributeValue('data-active', $false)

Regex for HTML? Better not. If you want to extract the body of an ad wrapped in a <section> with a specific id, it is more stable to use XPath than to fight with fragile patterns. For example, for a case like … :

# Select the <section> by id (even if the id contains spaces)
$section = $root.SelectSingleNode("//section[@id='posting body']")
if ($section) {
    $texto = $section.InnerText
}

This approach is clean and maintainable: if the structure changes, you adjust the XPath and that's it. You avoid typical HTML-regex errors (nesting, spaces, attributes in different orders, etc.).


A quick look with VB.NET and another sample in PowerShell

The same idea applies in VB.NET or C#: we download the HTML, load it into an HtmlDocument, locate rows and cells, and extract their text with very little code.

' VB.NET: iterate over a simple table
Using client As New Net.WebClient()
    Dim tmp = IO.Path.GetTempFileName()
    client.Credentials = CredentialCache.DefaultNetworkCredentials
    client.DownloadFile(_URL, tmp)

    Dim doc = New HtmlAgilityPack.HtmlDocument()
    doc.Load(tmp)

    Dim root = doc.DocumentNode
    Dim filas = root.Descendants("tr").ToList()

    For Each fila In filas
        Dim tds = fila.Descendants("td").ToList()
        If tds.Count >= 2 Then
            Console.WriteLine(tds(0).InnerText & ": " & tds(1).InnerText)
        End If
    Next
End Using

As you can see, HAP offers a solid and versatile parsing engine. The difference between the languages is only the syntax; the workflow is identical: load, select nodes, and read content.

Static scraping in C# step by step: from XPath to CSV

For sites with static content (the HTML already contains the data), simply download the page and parse its nodes. Let's look at the complete flow: install HAP, load a page, select rows by XPath, map to objects, and export to CSV with CsvHelper.

1) Install HtmlAgilityPack from NuGet. 2) Load the URL with HtmlWeb.Load(). 3) Get the nodes using XPath. 4) Extract text from each cell. 5) Export the objects to CSV with CsvHelper.

using HtmlAgilityPack;
using System.Collections.Generic;

// Example URL (Wikipedia)
var url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";
var web = new HtmlWeb();
var document = web.Load(url);

// XPath that selects the rows of the episode tables
var nodes = document.DocumentNode.SelectNodes(
    "//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]");

var episodes = new List<Episode>();
foreach (var node in nodes) {
    episodes.Add(new Episode {
        OverallNumber = HtmlEntity.DeEntitize(node.SelectSingleNode("th[1]").InnerText),
        Title        = HtmlEntity.DeEntitize(node.SelectSingleNode("td[2]").InnerText),
        Directors    = HtmlEntity.DeEntitize(node.SelectSingleNode("td[3]").InnerText),
        WrittenBy    = HtmlEntity.DeEntitize(node.SelectSingleNode("td[4]").InnerText),
        Released     = HtmlEntity.DeEntitize(node.SelectSingleNode("td[5]").InnerText)
    });
}

// Class to map the results (type declarations go after the top-level statements)
public class Episode {
    public string OverallNumber { get; set; }
    public string Title { get; set; }
    public string Directors { get; set; }
    public string WrittenBy { get; set; }
    public string Released { get; set; }
}

CsvHelper makes writing the CSV much simpler. You just create a StreamWriter and call WriteRecords() with your strongly typed list.

using CsvHelper;
using System.Globalization;
using System.IO;

using (var writer = new StreamWriter("output.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) {
    csv.WriteRecords(episodes);
}

This way, anyone can open the CSV in Excel and work with the structured data without touching code. It's a simple, reliable, and easy-to-maintain flow if the page structure changes: just update your XPath and you're done.


When HTML doesn't bring the data: dynamic, AJAX and Selenium

On dynamic sites, the initial HTML may be empty, and JavaScript renders the data after XHR requests. Since HAP doesn't execute JavaScript, you need a headless browser like Selenium to render first and extract later.

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");
var driver = new ChromeDriver(chromeOptions);

driver.Navigate().GoToUrl(url);

var rows = driver.FindElements(By.XPath(
    "//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]"));

foreach (var row in rows) {
    var title = row.FindElement(By.XPath("td[2]")).Text;
    // ...
}

In scenarios with lazy loading or slow requests, add a WebDriverWait to wait for nodes to appear or for Ajax to complete. It's heavier than HAP, but for dynamic pages, it's the right way to go.
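
For reference, a minimal sketch of such a wait, assuming the Selenium.Support package (which provides WebDriverWait) and reusing the driver from the previous snippet:

using OpenQA.Selenium.Support.UI;
using System;

// Wait (up to 10 seconds) until at least one data row is present in the DOM
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => d.FindElements(By.XPath(
    "//*[@id='mw-content-text']/div[1]/table[2]/tbody/tr")).Count > 0);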

Key HAP Limitation and WebView Alternative

HAP parses the DOM as it arrives from the server, i.e. it does not execute JavaScript. If your target site requires scripts to render its content, besides Selenium you can load the page in a WebView/WebBrowser control that does run JavaScript and, once it is ready, pass the resulting HTML to HtmlAgilityPack. This way, you combine true rendering with robust parsing.
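
A minimal sketch of that idea, assuming a WinForms form with a classic WebBrowser control named browser (WebView2 works similarly but with its own API):

// Once the control has finished running the page's scripts, hand the HTML to HAP
browser.DocumentCompleted += (s, e) =>
{
    var rendered = new HtmlAgilityPack.HtmlDocument();
    rendered.LoadHtml(browser.DocumentText);   // HTML after rendering
    var node = rendered.DocumentNode.SelectSingleNode("//section[@id='posting body']");
    if (node != null)
    {
        Console.WriteLine(node.InnerText);
    }
};
browser.Navigate("http://example.com/");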

Use cases: what to do with the data

Once you have your objects in memory, the limit is your imagination: save them in a database, transform them into JSON to call APIs, generate CSVs for the business team, or feed them into periodic reports. The key is to translate results into formats your organization already uses.
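
For example, the JSON route with the BCL serializer could look like this (a sketch reusing the episodes list built in the CSV section):

using System.IO;
using System.Text.Json;

// Serialize the scraped objects to JSON, e.g. to feed an internal API or a data lake
var json = JsonSerializer.Serialize(episodes, new JsonSerializerOptions { WriteIndented = true });
File.WriteAllText("episodes.json", json);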

Privacy, blocking, and regional scraping using proxies

When scraping at scale, sites can detect patterns and block your IP. Using proxies (preferably rotating ones) helps avoid bans, distribute the load, and access regional versions of the same website. A good provider lets you choose the exit location, which is ideal for market research or international pricing.

Rotating proxies assign a different IP address to each request, making it difficult for anti-bot systems to track you. Additionally, if you need to see catalogs or prices that vary by country, choose the proxy location to get the exact view a real user in that region would see.
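
One common approach is to download through an HttpClient configured with a proxy and then parse the HTML with HAP. A minimal sketch (the proxy host, port, and credentials are placeholders for your provider's values):

using System.Net;
using System.Net.Http;
using HtmlAgilityPack;

// Placeholder proxy endpoint and credentials; use the values from your provider
var handler = new HttpClientHandler
{
    Proxy = new WebProxy("http://proxy.example.com:8080")
    {
        Credentials = new NetworkCredential("user", "password")
    },
    UseProxy = true
};

using var client = new HttpClient(handler);
var html = await client.GetStringAsync("https://example.com/");

var doc = new HtmlDocument();
doc.LoadHtml(html);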

Integrate HtmlAgilityPack with IronPDF: From HTML to PDF

There are scenarios where you need to package results into a document. That's where IronPDF comes in: with HAP you extract and compose the desired HTML, and with IronPDF you convert it to PDF while maintaining styles and layout. It's perfect for reports or deliverables shared outside of the technical team.


Installing IronPDF is as simple as adding the NuGet package; if you prefer, you can also reference the DLL manually. Once referenced, you create an HtmlToPdf renderer and render an HTML string that you build yourself from the scraped content.

using HtmlAgilityPack;
using System.Text;
// using IronPdf;  // Make sure IronPDF is referenced

var web = new HtmlWeb();
var doc = web.Load("https://ironpdf.com/");

var nodes = doc.DocumentNode.SelectNodes(
    "//h1[@class='product-homepage-header product-homepage-header--ironpdf']");

var htmlContent = new StringBuilder();
foreach (var n in nodes) {
    htmlContent.Append(n.OuterHtml);
}

var renderer = new IronPdf.HtmlToPdf();
var pdf = renderer.RenderHtmlAsPdf(htmlContent.ToString());
pdf.SaveAs("output.pdf");

If you need to add headers, footers, page numbering, or compose pages with sections extracted from different URLs, you can customize the HTML before passing it to the PDF engine for a more polished result.

Good practices for selectors and maintenance

  • Prefer stable attributes (meaningful ids or classes) over fragile indexes like div[3]/span[2].
  • Avoid regex on HTML when there is a DOM/XPath alternative.
  • Use HtmlEntity.DeEntitize/HtmlDecode to clean up entities.
  • Centralize XPaths into constants and document their intent (see the sketch after this list).
  • Implement error controls: null nodes, timeouts, structural changes.
  • For dynamic pages, add explicit waits and element-presence conditions.
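
A small sketch of those last points, centralizing a documented XPath in a constant and failing loudly when it no longer matches (names are illustrative):

using System;
using HtmlAgilityPack;

// Centralized, documented selectors (illustrative names)
static class Selectors
{
    // Body of the ad; update here if the site changes its markup
    public const string PostingBody = "//section[@id='posting body']";
}

static class Extraction
{
    public static string GetPostingBody(HtmlDocument doc)
    {
        var node = doc.DocumentNode.SelectSingleNode(Selectors.PostingBody);
        if (node == null)
        {
            throw new InvalidOperationException(
                "Selector did not match: the page structure may have changed.");
        }
        return HtmlEntity.DeEntitize(node.InnerText);
    }
}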

It's also a good idea to record real HTML samples in tests to detect breakages when the website changes. Maintaining a small test suite for your XPaths and mappings is inexpensive and prevents production issues, especially when there are several data sources.
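
A sketch of such a test with xUnit, assuming a saved copy of a real page at samples/posting.html committed alongside the test project:

using HtmlAgilityPack;
using Xunit;

public class SelectorTests
{
    [Fact]
    public void PostingBodySelectorStillMatches()
    {
        // samples/posting.html is a recorded copy of a real page
        var doc = new HtmlDocument();
        doc.Load("samples/posting.html");

        var node = doc.DocumentNode.SelectSingleNode("//section[@id='posting body']");

        Assert.NotNull(node);
        Assert.False(string.IsNullOrWhiteSpace(node.InnerText));
    }
}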

Performance and Extensibility Tips

HAP is optimized to handle large documents with reasonable memory usage. However, if you're processing many pages, consider parallelizing downloads with a concurrency limit (e.g., SemaphoreSlim) and normalizing the HTML just before extraction. If you need special rules, you can extend the pipeline with your own filters before building your objects.
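
A sketch of that throttled parallel download (the URLs are placeholders and the limit of four concurrent requests is arbitrary):

using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

var urls = new[] { "https://example.com/a", "https://example.com/b", "https://example.com/c" };
using var client = new HttpClient();
var gate = new SemaphoreSlim(4);   // at most 4 downloads in flight

var tasks = urls.Select(async url =>
{
    await gate.WaitAsync();
    try
    {
        var html = await client.GetStringAsync(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
    }
    finally
    {
        gate.Release();
    }
});

var titles = await Task.WhenAll(tasks);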

In mixed environments, PowerShell is ideal for orchestrating tasks (downloads, proxy rotation, running compiled C# parsers) and consolidating results. Combining scripts with .NET utilities gives you agility without sacrificing performance.

If you're coming from regular expressions, you'll notice that DOM/XPath requires less maintenance and is much more readable. It's common for a well-thought-out selector to survive for months, even as the target website makes minor tweaks to its markup.

This entire ecosystem (HtmlAgilityPack for parsing, Selenium for rendering if needed, CsvHelper for exporting, and IronPDF for presenting) fits perfectly with real-world extraction and reporting workflows. With PowerShell, C#, or VB.NET, you can build scalable solutions, and with the support of proxies, operate more resiliently in the face of blocks, regional variations, or traffic peaks.