- HTML Agility Pack enables robust parsing and querying of HTML with XPath/LINQ.
- PowerShell and C# cover static scraping; Selenium handles pages rendered with JavaScript.
- CsvHelper and IronPDF make it easy to export data to CSV or generate PDF reports.
- Using proxies reduces blocks and enables region-targeted scraping.
If you're looking for a practical way to extract structured data without going crazy with regular expressions, combining PowerShell with HTML Agility Pack is one of the solutions that saves the most time and trouble. This stack lets you traverse the DOM, locate nodes using XPath or LINQ, and reliably read text, attributes, or HTML, even when the markup is imperfect.
In the following lines we bring together the best approaches: PowerShell, C#, and Selenium to cover both static and dynamic content, real-world examples (such as extracting the body of Craigslist ads), exporting to CSV, and even converting results to PDF with IronPDF. All of this comes with useful tricks, such as using proxies to avoid blocks, and tips for keeping your selectors robust over time.
What is HTML Agility Pack and why is it so useful?
HTML Agility Pack (HAP) is a .NET library that parses HTML into a node tree that you can traverse, query, and manipulate. Unlike other, more fragile approaches, HAP tolerates badly formed HTML and lets you navigate the DOM with XPath or LINQ through a simple API.
Its strengths include tolerant parsing (it swallows imperfect HTML), DOM manipulation (adding, removing, or modifying nodes and attributes), XPath and LINQ support, and good performance even on large documents. In addition, its design is extensible, so you can implement custom filters or handlers whenever needed.
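As a rough illustration of tolerant parsing plus DOM manipulation, the sketch below feeds HAP a fragment whose closing `</div>` is missing, then edits and removes nodes. The markup and attribute values are made up for the example.

```csharp
using System;
using HtmlAgilityPack;

// Illustrative fragment: the closing </div> is deliberately missing.
var doc = new HtmlDocument();
doc.LoadHtml("<div id='box'><p>First</p><p>Second <a href='/x'>link</a></p>");

// Modify an existing attribute and add a new one.
var link = doc.DocumentNode.SelectSingleNode("//a");
link.SetAttributeValue("href", "https://example.com/x");
link.SetAttributeValue("rel", "nofollow");

// Remove the first paragraph from the tree.
var firstParagraph = doc.DocumentNode.SelectSingleNode("//p");
firstParagraph.Remove();

// Print the resulting markup.
Console.WriteLine(doc.DocumentNode.OuterHtml);
```

The document still loads despite the unclosed element, and the edits are reflected in the serialized output.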
Load and parse HTML with HAP: file, string, or web
To get started, you can load HTML content from a local file, from an in-memory string, or directly from a URL. The core class is HtmlDocument, and for the web it is very convenient to use HtmlWeb and its Load() method.
// From a file
var doc = new HtmlDocument();
doc.Load(filePath);
// From a string
var doc2 = new HtmlDocument();
doc2.LoadHtml(htmlString);
// From the web
var web = new HtmlWeb();
var doc3 = web.Load("http://example.com/");
Once the document is loaded, you reach the root node through DocumentNode. From there, you can select nodes with XPath or LINQ and comfortably read properties such as OuterHtml, InnerText, Name, or the attribute collection.
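For the LINQ route, a minimal sketch might look like this. The list markup and the `item` class are invented for the example; `Descendants` and `GetAttributeValue` are the HAP calls doing the work.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<ul><li class='item'>One</li><li>Two</li><li class='item'>Three</li></ul>");

// Walk every <li> under the root and keep only those with class="item".
var items = doc.DocumentNode
    .Descendants("li")
    .Where(li => li.GetAttributeValue("class", "") == "item")
    .Select(li => li.InnerText.Trim())
    .ToList();

Console.WriteLine(string.Join(", ", items)); // prints "One, Three"
```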
Selecting and reading nodes: XPath, attributes, and text cleanup
With XPath you can locate specific elements without wrestling with the HTML. The library provides SelectSingleNode() for a single result and SelectNodes() when you expect several.
// A single node (for example, the page's <title>); null if nothing matches
var titleNode = doc.DocumentNode.SelectSingleNode("//head/title");
// Several nodes (for example, every <article>); null, not an empty list, if nothing matches
var articles = doc.DocumentNode.SelectNodes("//article");
// Reading useful information
var name = titleNode.Name; // tag name of the node
var html = titleNode.OuterHtml; // full HTML of the node
var text = titleNode.InnerText; // plain text of the node
When your text contains HTML entities, you can "clean" the content with utilities such as HtmlEntity.DeEntitize() or, if you prefer the BCL, System.Net.WebUtility.HtmlDecode(). This gives you more natural text, ready to be processed as data.
// Cleaning up HTML entities in extracted text
var limpio = HtmlEntity.DeEntitize(titleNode.InnerText);
// or
var limpio2 = System.Net.WebUtility.HtmlDecode(titleNode.InnerText);
PowerShell + HTML Agility Pack: inspection, methods, and real-world extraction
Many teams prefer PowerShell because it allows quick scraping prototypes. You can load the HAP DLL (for example, version 1.11.59) and use its classes from your scripts. If you have worked with modules such as PSParseHTML, you are already using HAP under the hood.
# Load the DLL (adjust the path to your environment)
$hapPath = 'C:\path\to\HtmlAgilityPack.dll'
[Reflection.Assembly]::LoadFile($hapPath) | Out-Null
# Download a page and load it into an HtmlDocument
# Join-Path is used because a single-quoted '$env:TEMP\page.htm' would not expand the variable
$dest = Join-Path $env:TEMP 'page.htm'
$wc = New-Object System.Net.WebClient
$wc.Credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
$wc.DownloadFile('http://localhost/mihtml.html', $dest)
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load($dest)
$root = $doc.DocumentNode
# For example, walk the rows of a table
$rows = $root.Descendants('tr')
foreach ($row in $rows) {
    # @(...) materializes the lazy sequence so .Count and indexing behave as expected
    $cells = @($row.Descendants('td'))
    if ($cells.Count -ge 2) {
        Write-Host ($cells[0].InnerText + ' - ' + $cells[1].InnerText)
    }
}
A common question when inspecting objects in PowerShell is where GetAttributeValue() comes from and why several signatures show up. In HtmlAgilityPack, nodes expose an overloaded method that takes the attribute name and a default value, and the type of that default (string, int, bool, and so on) determines the type the attribute is converted to.
# Get an attribute (with a default value if it does not exist); $node is any HtmlNode
# A typed default ('' rather than $null) keeps the choice of overload unambiguous
$href = $node.GetAttributeValue('href', '')
$tabIndex = $node.GetAttributeValue('tabindex', -1)
$esActivo = $node.GetAttributeValue('data-active', $false)
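The same overloads can be exercised from C#. This sketch uses an invented anchor element to show how the type of the default value drives the conversion, and how the default comes back when the attribute is absent.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<a href='/home' tabindex='3' data-active='true'>Home</a>");
var node = doc.DocumentNode.SelectSingleNode("//a");

string href = node.GetAttributeValue("href", "");          // string overload
int tabIndex = node.GetAttributeValue("tabindex", -1);     // int overload
bool active = node.GetAttributeValue("data-active", false); // bool overload
string title = node.GetAttributeValue("title", "(none)");  // attribute missing: default is returned

Console.WriteLine($"{href} {tabIndex} {active} {title}");
```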
Regex for HTML? Better not. If you want to extract the body of an ad wrapped in a <section> with a specific id, it is more robust to use XPath than to fight with fragile patterns. For example, in a case like … :
# Select the <section> by id (even if the id contains spaces)
$section = $root.SelectSingleNode("//section[@id='posting body']")
if ($section) {
    $texto = $section.InnerText
}
This approach is clean and maintainable: if the structure changes, you adjust the XPath and that's it. You avoid the typical pitfalls of regex on HTML (nesting, whitespace, attributes in a different order, and so on).
A quick look at the same idea in VB.NET
The same idea works in VB.NET or C#: we download the HTML, load it into an HtmlDocument, locate the rows and cells, and extract their text with very little code.
' VB.NET: walk a simple table (_URL is assumed to be defined elsewhere)
Using client As New Net.WebClient()
    Dim tmp = IO.Path.GetTempFileName()
    client.Credentials = Net.CredentialCache.DefaultNetworkCredentials
    client.DownloadFile(_URL, tmp)
    Dim doc = New HtmlAgilityPack.HtmlDocument()
    doc.Load(tmp)
    Dim root = doc.DocumentNode
    Dim filas = root.Descendants("tr").ToList()
    For Each fila In filas
        Dim tds = fila.Descendants("td").ToList()
        If tds.Count >= 2 Then
            Console.WriteLine(tds(0).InnerText & ": " & tds(1).InnerText)
        End If
    Next
End Using
As you can see, HAP offers a robust and versatile parsing engine. The difference between languages lies in the syntax; the workflow is the same: load, select nodes, and read content.
Static scraping in C# step by step: from XPath to CSV
For sites with static content (the HTML already contains the data), simply download the page and parse its nodes. Let's walk through a complete flow: install HAP, load the page, select rows with XPath, map them to objects, and export to CSV with CsvHelper.
1) Install HtmlAgilityPack from NuGet. 2) Load the URL with HtmlWeb.Load(). 3) Locate the nodes using XPath. 4) Extract the text of each cell. 5) Export the objects to CSV with CsvHelper.
using HtmlAgilityPack;
using System.Collections.Generic;

// Example URL (Wikipedia)
var url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";
var web = new HtmlWeb();
var document = web.Load(url);

// XPath that selects the rows of the episode tables
var nodes = document.DocumentNode.SelectNodes(
    "//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]");

var episodes = new List<Episode>();
// SelectNodes returns null when nothing matches, so guard before iterating
if (nodes != null) {
    foreach (var node in nodes) {
        episodes.Add(new Episode {
            OverallNumber = Cell(node, "th[1]"),
            Title = Cell(node, "td[2]"),
            Directors = Cell(node, "td[3]"),
            WrittenBy = Cell(node, "td[4]"),
            Released = Cell(node, "td[5]")
        });
    }
}

// Reads one cell as decoded text; returns "" when the row lacks that cell
static string Cell(HtmlNode row, string xpath) =>
    HtmlEntity.DeEntitize(row.SelectSingleNode(xpath)?.InnerText ?? "");

// Class used to map the results (type declarations must follow top-level statements)
public class Episode {
    public string OverallNumber { get; set; }
    public string Title { get; set; }
    public string Directors { get; set; }
    public string WrittenBy { get; set; }
    public string Released { get; set; }
}
To write the CSV, CsvHelper keeps the output step very simple. You just create a StreamWriter and call WriteRecords() with your strongly typed list.
using CsvHelper;
using System.Globalization;
using System.IO;

using (var writer = new StreamWriter("output.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) {
    csv.WriteRecords(episodes);
}
This way, anyone can open the CSV in Excel and work with the structured data without touching code. It is a simple, reliable approach that is easy to maintain if the page structure changes: just update your XPath and you're done.
When the HTML doesn't carry the data: dynamic content, AJAX, and Selenium
On dynamic sites, the initial HTML may be empty, with JavaScript filling in the data after XHR requests. Because HAP does not execute JavaScript, you need a headless browser driven by a tool such as Selenium to render first and extract afterwards.
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");
// "using" disposes the driver, which closes the browser when the scope ends
using var driver = new ChromeDriver(chromeOptions);
driver.Navigate().GoToUrl(url);
var rows = driver.FindElements(By.XPath(
    "//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]"));
foreach (var row in rows) {
    var title = row.FindElement(By.XPath("td[2]")).Text;
    // ...
}
For cases with lazy loading or slow requests, add a WebDriverWait to wait for the nodes to appear or for the AJAX calls to finish. It is heavier than HAP, but for dynamic pages it is the right way to go.
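A minimal sketch of such an explicit wait follows. It assumes the Selenium.Support NuGet package (which provides WebDriverWait), a local Chrome installation, and a placeholder URL and XPath; the 10-second timeout is arbitrary.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--headless");
using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("https://example.com/"); // placeholder URL

// Poll until at least one matching row exists, or time out after 10 seconds.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
var rows = wait.Until(d =>
{
    var found = d.FindElements(By.XPath("//table//tr")); // placeholder XPath
    return found.Count > 0 ? found : null; // null tells Until to keep waiting
});

Console.WriteLine($"Rows found: {rows.Count}");
```

If the rows never appear, `Until` throws a `WebDriverTimeoutException`, which is a natural place to log the failure and move on.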
HAP limitations and a WebView alternative
HAP parses the DOM as it arrives from the server; in other words, it does not run JavaScript. If your target site needs scripts to render its content, then besides Selenium you can load the page in a WebView/WebBrowser control that executes the JavaScript and, once it is ready, pass the resulting HTML to HtmlAgilityPack. This way, you combine real rendering with robust parsing.
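The render-then-parse handoff can be sketched with Selenium as the renderer; a WebView/WebBrowser control would hand over its HTML in the same way. The URL and the `//h1` selector are placeholders, and a local Chrome installation is assumed.

```csharp
using System;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless");
using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("https://example.com/"); // placeholder URL

// PageSource holds the DOM after scripts have run; hand it to HAP for parsing.
var doc = new HtmlDocument();
doc.LoadHtml(driver.PageSource);

var headings = doc.DocumentNode.SelectNodes("//h1"); // null when nothing matches
if (headings != null)
{
    foreach (var h in headings)
    {
        Console.WriteLine(HtmlEntity.DeEntitize(h.InnerText).Trim());
    }
}
```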
Use cases: what to do with the data
Once you have your objects in memory, the limit is your imagination: store them in a database, convert them to JSON to feed APIs, generate CSVs for the business team, or load them into periodic reports. The key is to translate the results into formats your organization already uses.
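As one example, converting extracted objects to JSON needs nothing beyond the BCL. The rows below are placeholders standing in for whatever your extraction step produced.

```csharp
using System;
using System.Text.Json;

// Placeholder rows; in practice these come from the extraction step.
var episodes = new[]
{
    new { OverallNumber = "1", Title = "Example title A", Released = "2000-01-01" },
    new { OverallNumber = "2", Title = "Example title B", Released = "2000-01-08" }
};

var json = JsonSerializer.Serialize(
    episodes, new JsonSerializerOptions { WriteIndented = true });

Console.WriteLine(json);
```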
Privacy, blocking, and regional scraping using proxies
When you scrape at scale, sites can detect patterns and block your IP. Using proxies (especially with rotating addresses) helps you avoid bans, spread the load, and reach regional versions of the same website. A good provider lets you choose the exit location, which is ideal for market research or for comparing prices across countries.
Rotating proxies supply a different IP address per request, making it harder for anti-bot systems to track you. In addition, if you want to check catalogs or prices that vary by country, choose the proxy location to get exactly the view a real user in that region would see.
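A simple round-robin rotation could be sketched as follows. The proxy endpoints and target URLs are hypothetical, and a real provider will usually also require credentials. One client per request keeps the example short; a production version would likely reuse one handler per proxy.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical proxy endpoints and target pages.
var proxies = new[] { "http://proxy1.example.com:8080", "http://proxy2.example.com:8080" };
var urls = new[] { "https://example.com/", "https://example.org/" };

// Builds an HttpClient that routes its traffic through the given proxy.
static HttpClient CreateClient(string proxyUrl) =>
    new HttpClient(new HttpClientHandler { Proxy = new WebProxy(proxyUrl), UseProxy = true })
    {
        Timeout = TimeSpan.FromSeconds(20)
    };

for (int i = 0; i < urls.Length; i++)
{
    // Round-robin: request i goes through proxy i modulo the pool size.
    using var client = CreateClient(proxies[i % proxies.Length]);
    try
    {
        var html = await client.GetStringAsync(urls[i]);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title")?.InnerText);
    }
    catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
    {
        Console.WriteLine($"Request via {proxies[i % proxies.Length]} failed: {ex.Message}");
    }
}
```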
Combining HtmlAgilityPack with IronPDF: from HTML to PDF
There are cases where you want to package the results in a document. That is where IronPDF comes in: with HAP you extract and compose the HTML you need, and with IronPDF you convert it to PDF while preserving styles and layout. It is ideal for reports or deliverables shared outside the technical team.
Installing IronPDF is as simple as adding the NuGet package. If you prefer, you can also reference the DLL manually. Once it is referenced, you create an HtmlToPdf and pass it the HTML string you generated from the extracted content.
using HtmlAgilityPack;
using System.Text;
// using IronPdf; // Make sure IronPDF is referenced

var web = new HtmlWeb();
var doc = web.Load("https://ironpdf.com/");
var nodes = doc.DocumentNode.SelectNodes(
    "//h1[@class='product-homepage-header product-homepage-header--ironpdf']");
var htmlContent = new StringBuilder();
// SelectNodes returns null when nothing matches, so guard before iterating
if (nodes != null) {
    foreach (var n in nodes) {
        htmlContent.Append(n.OuterHtml);
    }
}
var renderer = new IronPdf.HtmlToPdf();
var pdf = renderer.RenderHtmlAsPdf(htmlContent.ToString());
pdf.SaveAs("output.pdf");
If you want to add headers, footers, page numbers, or compose pages from sections taken from different URLs, you can customize the output before handing it to the PDF engine for a more polished result.
Good practices for selectors and maintenance
- Prefer stable attributes (meaningful ids or classes) over fragile indexes such as div[3]/span[2].
- Avoid regex on HTML when a DOM/XPath alternative exists.
- Use HtmlEntity.DeEntitize/HtmlDecode to clean up entities.
- Centralize your XPaths as constants and document their purpose.
- Implement error handling: null nodes, timeouts, structural changes.
- For dynamic pages, add explicit waits and element-presence conditions.
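Several of these practices fit in a few lines. The sketch below keeps XPaths as documented constants and reads them through a null-safe helper; the selectors, the sample markup, and the `TextOrDefault` helper are all invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

// XPaths kept in one place, each with its purpose documented.
const string TitleXPath = "//h1[@id='title']";      // main heading of the page
const string PriceXPath = "//span[@class='price']"; // displayed price

var doc = new HtmlDocument();
doc.LoadHtml("<h1 id='title'>Sample</h1>"); // note: no price element present

// Hypothetical helper: decoded, trimmed text, or a fallback when the node is missing.
static string TextOrDefault(HtmlNode root, string xpath, string fallback = "")
{
    var node = root.SelectSingleNode(xpath);
    return node == null ? fallback : HtmlEntity.DeEntitize(node.InnerText).Trim();
}

var title = TextOrDefault(doc.DocumentNode, TitleXPath);
var price = TextOrDefault(doc.DocumentNode, PriceXPath, "n/a");

Console.WriteLine($"{title} / {price}"); // prints "Sample / n/a"
```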
It is also a good idea to record real HTML samples in tests so you can detect breakage when the website changes. Maintaining a small test suite for your XPaths and mappings is cheap and prevents production issues, especially when there are many data sources.
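Such a check can be as small as the sketch below. The inline table stands in for a recorded sample; in a real suite you would read a saved copy of the live page from disk and run the checks inside your test framework of choice.

```csharp
using System;
using HtmlAgilityPack;

// Stand-in for a recorded sample of the target page.
var sampleHtml =
    "<table><tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr></table>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

// Check that the row selector still matches what the mapping code expects.
var rows = doc.DocumentNode.SelectNodes("//table//tr");
if (rows == null || rows.Count != 2)
    throw new Exception("Row XPath no longer matches the recorded sample");

// Check that the cell mapping still lands on the right column.
var firstCell = rows[0].SelectSingleNode("td[1]")?.InnerText;
if (firstCell != "A")
    throw new Exception("First-cell XPath no longer matches the recorded sample");

Console.WriteLine("Selectors OK");
```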
Performance and scaling tips
HAP is optimized to handle large documents with reasonable memory usage. Even so, if you process many pages, consider parallelizing downloads with limits (for example, SemaphoreSlim) and normalizing the HTML just before extraction. If you need special rules, you can extend the pipeline with your own filters before building your objects.
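A bounded-concurrency download loop might look like this. The URLs are placeholders and the limit of two concurrent requests is arbitrary; tune it to what the target site tolerates.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

var urls = new[] { "https://example.com/", "https://example.org/" }; // placeholder URLs
using var http = new HttpClient();
using var gate = new SemaphoreSlim(2); // at most 2 downloads in flight

// Downloads one page under the semaphore and returns its <title> text.
async Task<string> FetchTitleAsync(string url)
{
    await gate.WaitAsync();
    try
    {
        var html = await http.GetStringAsync(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//title")?.InnerText.Trim() ?? "(no title)";
    }
    catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
    {
        return $"(failed: {ex.Message})";
    }
    finally
    {
        gate.Release(); // always free the slot, even on failure
    }
}

var titles = await Task.WhenAll(urls.Select(FetchTitleAsync));
foreach (var t in titles) Console.WriteLine(t);
```

All tasks start immediately, but only as many as the semaphore allows are downloading at any moment.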
In mixed scenarios, PowerShell is ideal for orchestrating tasks (downloads, proxy rotation, running compiled C# parsers from scripts) and consolidating the results. Combining scripts with .NET utilities gives you agility without sacrificing performance.
If you come from regular expressions, you will notice that DOM/XPath requires less maintenance and is more readable. It is common for a well-thought-out selector to survive for months even when small changes are made to the target website.
This whole ecosystem (HtmlAgilityPack for parsing, Selenium for rendering when needed, CsvHelper for export, and IronPDF for presentation) fits real-world extraction and reporting workflows well. With PowerShell, C#, or VB.NET you can build scalable solutions and, with the support of proxies, operate more resiliently in the face of blocks, regional variations, or high loads.