HTML parsing methods
This page contains information relevant to KDE 3.x or older versions.
This page contains archived content from various sources pertaining to KDE 3.x. It may be outdated or only partly useful and should probably be updated. Please treat the information on this page with extra caution.
For HTML parsing, you have the following possibilities:
- QXML
- QDOM
- Perl
- XHTML
QXML and QDOM need XML-compliant HTML pages, and very few HTML pages are XML-compliant. Perl is outside the scope of this site. This tutorial therefore takes the XHTML approach.
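To make the XML-compliance problem concrete, here is a minimal sketch in standard C++ (no Qt or KDE needed). It checks only one XML rule: every attribute must be written as name="value" or name='value'. The function names attributesAreXmlStyle and allStartTagsXmlStyle are made up for this illustration, and the check is far from a real well-formedness test (it assumes, for example, that attribute values contain no '>').

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: locale-safe whitespace test.
static bool isSpace(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Returns true if every attribute in one start tag is written as
// name="value" or name='value'. tagBody is the text between '<' and '>'.
bool attributesAreXmlStyle(const std::string& tagBody)
{
    const std::size_t n = tagBody.size();
    std::size_t i = 0;

    // Skip the element name.
    while (i < n && !isSpace(tagBody[i])) ++i;

    while (true) {
        while (i < n && isSpace(tagBody[i])) ++i;
        if (i >= n) return true;                    // no more attributes
        if (tagBody[i] == '/') { ++i; continue; }   // slash of a self-closing tag

        // Skip the attribute name.
        while (i < n && !isSpace(tagBody[i]) && tagBody[i] != '=') ++i;
        while (i < n && isSpace(tagBody[i])) ++i;
        if (i >= n || tagBody[i] != '=') return false;   // attribute without a value

        ++i;                                             // skip '='
        while (i < n && isSpace(tagBody[i])) ++i;
        if (i >= n || (tagBody[i] != '"' && tagBody[i] != '\'')) return false;  // unquoted value

        const char quote = tagBody[i++];
        const std::size_t close = tagBody.find(quote, i);
        if (close == std::string::npos) return false;    // unterminated value
        i = close + 1;
    }
}

// Applies the attribute check to every start tag in the markup.
// End tags, comments and processing instructions are skipped.
bool allStartTagsXmlStyle(const std::string& html)
{
    std::size_t pos = 0;
    while ((pos = html.find('<', pos)) != std::string::npos) {
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos) return false;      // unterminated tag
        const std::string body = html.substr(pos + 1, end - pos - 1);
        if (!body.empty() && body[0] != '/' && body[0] != '!' && body[0] != '?') {
            if (!attributesAreXmlStyle(body)) return false;
        }
        pos = end + 1;
    }
    return true;
}
```

Markup containing a tag such as <a href="/index.php" nowrap> fails this check because nowrap has no value, while the same tag written with nowrap="nowrap" passes. That one attribute is already enough for a strict XML parser to reject an otherwise tidy page.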
First step
As we remember from http://developernew.kde.org/Development/Tutorials/Programming_Tutorial_KDE_4/How_to_write_an_HTML_parser, the most important requirement is to be able to parse syntax that is not XML-conformant. The following program does that.
tags.cpp
#include <kapplication.h>
#include <kaboutdata.h>
#include <kcmdlineargs.h>
#include <kdebug.h>
#include <dom/html_document.h>

int main (int argc, char *argv[])
{
    KAboutData aboutData( "test", "test",
                          "1.0", "test", KAboutData::License_GPL,
                          "(c) 2006" );
    KCmdLineArgs::init( argc, argv, &aboutData );
    KApplication khello;

    DOM::HTMLDocument doc;
    DOM::DOMString tag("*");   // wildcard: match every element
    // Despite its name, uri holds the markup itself. The bare "nowrap"
    // attribute means this is not well-formed XML.
    DOM::DOMString uri("<html><body><a href=\"http://www.kde.org/\"></a><a href=\"/index.php\" nowrap>Log in</a><a href=\"http://www.gmx.de\"></a></body></html>");

    doc.loadXML(uri);
    kdDebug() << "Does this doc have child elements ? " << doc.hasChildNodes() << endl;
    // Fetch the element list once instead of on every loop iteration.
    DOM::NodeList elements = doc.getElementsByTagName(tag);
    for (unsigned long i=0; i<elements.length(); i++)
        kdDebug() << elements.item(i).nodeName().string() << endl;
    // sizeof yields the compile-time size of the DOM::Node handle,
    // not the size of the parsed document.
    kdDebug() << "Size of your doc " << sizeof(doc.firstChild()) << endl;
    kdDebug() << doc.isHTMLDocument() << endl;
    kdDebug() << doc.toString().string() << endl;
    return 0;
}
Compile it like this:
g++ -I/usr/lib/qt3/include -I/opt/kde3/include \
    -L/opt/kde3/lib -lkdeui -lkhtml -o tags tags.cpp
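For readers without a KDE 3 build environment, the idea of listing every element of markup that is not valid XML can be approximated in standard C++. The sketch below is a toy scanner, not KHTML's parser: listStartTagNames is a hypothetical name, it only collects the names of start tags in document order, and it is not a prediction of what tags.cpp prints (KHTML may normalise names or add elements).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Collects the names of all start tags in document order.
// End tags, comments and text are ignored; attributes are never
// inspected, so a bare attribute such as "nowrap" is no obstacle.
std::vector<std::string> listStartTagNames(const std::string& html)
{
    std::vector<std::string> names;
    std::size_t pos = 0;
    while ((pos = html.find('<', pos)) != std::string::npos) {
        std::size_t i = pos + 1;
        std::string name;
        // Read the run of letters and digits that follows '<'.
        while (i < html.size() &&
               std::isalnum(static_cast<unsigned char>(html[i]))) {
            name += html[i];
            ++i;
        }
        // An empty name means '<' was followed by '/', '!' or similar.
        if (!name.empty()) names.push_back(name);
        pos = i;
    }
    return names;
}
```

Applied to the markup string used in tags.cpp, this returns html, body, a, a, a. Because the scanner never tries to validate attributes, the XML-incompatible nowrap attribute does not stop it, which is the kind of leniency an HTML parser needs.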
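Second step
The second program lists the elements again, prints the in-memory HTML, and then moves a node with insertBefore().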
#include <kapplication.h>
#include <kaboutdata.h>
#include <kcmdlineargs.h>
#include <kdebug.h>
#include <dom/html_document.h>
#include <dom/html_element.h>
#include <dom/dom_node.h>

int main (int argc, char *argv[])
{
    KAboutData aboutData( "test", "test",
                          "1.0", "test", KAboutData::License_GPL,
                          "(c) 2006" );
    KCmdLineArgs::init( argc, argv, &aboutData );
    KApplication khello;

    DOM::HTMLDocument doc;
    DOM::DOMString tag("*");   // wildcard: match every element
    DOM::DOMString uri("<html><body><a href=\"http://www.kde.org/\"><b>fat</b></a><a href=\"/index.php\" nowrap>Log in</a><a href=\"http://www.gmx.de\"></a></body></html>");
    doc.loadXML(uri);

    kdDebug() << "Here's a list of the document elements" << endl;
    // Fetch the element list once instead of on every loop iteration.
    DOM::NodeList elements = doc.getElementsByTagName(tag);
    for (unsigned long i=0; i<elements.length(); i++)
        kdDebug() << elements.item(i).nodeName().string() << endl;

    // doc2 is parsed here but not used in the rest of the program.
    DOM::HTMLDocument doc2;
    DOM::DOMString uri2("<html><body>this is html<b>fat</b></body></html>");
    doc2.loadXML(uri2);

    kdDebug() << "This is the in-memory html:" << endl;
    kdDebug() << doc.toString().string() << endl;

    // Move the first child of body's first child so that it becomes
    // a direct child of body, placed before its former parent.
    doc.body().insertBefore(doc.body().firstChild().firstChild(),doc.body().firstChild());
    kdDebug() << "Moving around nodes" << endl;
    kdDebug() << doc.toString().string() << endl;
    return 0;
}
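The insertBefore() call above moves a node rather than copying it: in the DOM, a node that is already in the tree is removed from its old position before it is inserted at the new one. The following sketch models that behaviour in standard C++. ToyNode, ToyTree and the free functions are hypothetical stand-ins, not KHTML classes, and the model leaves out text nodes, attributes and ancestor checks.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a DOM node (not DOM::Node from KHTML).
struct ToyNode {
    std::string name;
    ToyNode* parent = nullptr;
    std::vector<ToyNode*> children;   // non-owning; ToyTree owns the nodes
    explicit ToyNode(const std::string& n) : name(n) {}
};

// Owns all nodes so the raw pointers above stay valid.
struct ToyTree {
    std::vector<std::unique_ptr<ToyNode>> storage;
    ToyNode* create(const std::string& name) {
        storage.emplace_back(new ToyNode(name));
        return storage.back().get();
    }
};

// Removes node from its current parent, if it has one.
void detach(ToyNode* node)
{
    if (!node->parent) return;
    std::vector<ToyNode*>& siblings = node->parent->children;
    siblings.erase(std::remove(siblings.begin(), siblings.end(), node),
                   siblings.end());
    node->parent = nullptr;
}

void appendChild(ToyNode* parent, ToyNode* child)
{
    detach(child);
    parent->children.push_back(child);
    child->parent = parent;
}

// Inserts newChild into parent directly before ref. A node that is
// already in the tree is detached first, so it is moved, not copied.
// Returns false if ref is not a child of parent (toy error handling).
bool insertBefore(ToyNode* parent, ToyNode* newChild, ToyNode* ref)
{
    if (ref->parent != parent || newChild == ref) return false;
    detach(newChild);
    std::vector<ToyNode*>::iterator it =
        std::find(parent->children.begin(), parent->children.end(), ref);
    parent->children.insert(it, newChild);
    newChild->parent = parent;
    return true;
}

// Space-separated names of node's direct children.
std::string childNames(const ToyNode* node)
{
    std::string out;
    for (const ToyNode* c : node->children) {
        if (!out.empty()) out += ' ';
        out += c->name;
    }
    return out;
}

// Rebuilds the element structure of the second program's body
// (a containing b, then two more a elements) and repeats its move.
std::string tutorialMoveDemo()
{
    ToyTree tree;
    ToyNode* body = tree.create("body");
    ToyNode* firstLink = tree.create("a");
    appendChild(body, firstLink);
    appendChild(firstLink, tree.create("b"));
    appendChild(body, tree.create("a"));
    appendChild(body, tree.create("a"));

    insertBefore(body, firstLink->children.front(), body->children.front());
    return childNames(body);
}
```

In this model, tutorialMoveDemo() returns "b a a a": the b element leaves the first link and becomes the first child of body, and the first link ends up with no children.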