Parsing HTML pages

Muetdhiver
Super wx Problem Solver
Posts: 323
Joined: Sun Jun 08, 2008 11:59 am
Location: Bordeaux, France

Parsing HTML pages

Post by Muetdhiver » Mon Nov 28, 2011 2:51 pm

Hi all,

Is there a sample that illustrates the wxHtmlParser or wxHtmlWinParser class, showing how to simply parse an HTML page given as an argument?

I can't find such documentation anywhere.

Thanks for your help.
OS: Ubuntu 11.10
Compiler: g++ 4.6.1 (Eclipse CDT Indigo)
wxWidgets: 2.9.3

Re: Parsing HTML pages

Post by Muetdhiver » Mon Nov 28, 2011 3:47 pm

As an example, I've got a web page in which I can find a tag like this:

<div class="thumbinner" ...>

What I want to achieve is just to get some information that follows the div tag.
I guess I must use the parser:

wxHtmlParser* parser = new wxHtmlWinParser( "mypage.html" );
wxHtmlCell* top_level_object = parser->Parse();

top_level_object->Find( wxHTML_COND_ISANCHOR, ???????????????????? );

What should I pass to the Find method to get what I want?
Thanks a lot.

Bye.

doublemax
Moderator
Posts: 15507
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Parsing HTML pages

Post by doublemax » Mon Nov 28, 2011 5:03 pm

I don't think this class can be used that easily (I'm not sure, though).

Depending on what you need, maybe it can be done with wxRegEx or wxXmlDocument.
Use the source, Luke!
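
For what it's worth, the regular-expression route suggested above could look roughly like the sketch below. It uses std::regex from the standard library as a stand-in so that it compiles without wxWidgets; wxRegEx offers comparable Matches() and GetMatch() calls. The function name FindTextAfterThumbinner is made up for illustration, and the pattern assumes that class="thumbinner" is the first attribute of the div and that the wanted information is plain text directly after the opening tag. Real-world pages may well break those assumptions.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: looks for the first <div class="thumbinner" ...> tag
// and copies the text that directly follows it (up to the next '<') into out.
// Returns false if no such tag is found.
bool FindTextAfterThumbinner(const std::string& html, std::string& out)
{
    // [^>]* skips any further attributes of the div;
    // ([^<]*) captures the text up to the start of the next tag.
    static const std::regex pattern(
        R"re(<div\s+class="thumbinner"[^>]*>([^<]*))re",
        std::regex::ECMAScript | std::regex::icase);

    std::smatch match;
    if (!std::regex_search(html, match, pattern))
        return false;

    out = match[1].str();
    return true;
}
```

A regular expression like this is fragile: it only sees the text up to the next tag, so anything nested deeper inside the div needs a different pattern or a real parser.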

evstevemd
Part Of The Furniture
Posts: 2293
Joined: Wed Jan 28, 2009 11:57 am
Location: United Republic of Tanzania

Re: Parsing HTML pages

Post by evstevemd » Wed Nov 30, 2011 10:05 am

With wxWebView (in development on trunk) you can get the page source and, as DM said, analyze it to get what you want!
Chief Justice: We have trouble dear citizens!
Citizens: What it is his honor?
Chief Justice:Our president is an atheist, who will he swear to?
[Ubuntu 19.04/Windows 10 Pro/MacOS 10.13 - GCC/MinGW/Clang, CodeLite IDE]

Auria
Site Admin
Posts: 6695
Joined: Thu Sep 28, 2006 12:23 am

Re: Parsing HTML pages

Post by Auria » Wed Nov 30, 2011 2:44 pm

evstevemd wrote: with wxWebview (in development on trunk) you can get source code and as DM said you can analyze to get what you want!
wxWebView is only used for displaying the page; AFAIK it doesn't have an API to expose the DOM.
"Keyboard not detected. Press F1 to continue"
-- Windows

Re: Parsing HTML pages

Post by evstevemd » Thu Dec 01, 2011 6:34 pm

Auria wrote:
evstevemd wrote: with wxWebview (in development on trunk) you can get source code and as DM said you can analyze to get what you want!
webview is only used for displaying the page, AFAIK it doesn't have an API to expose the DOM
Maybe, but you can get the page source and analyze it:
http://docs.wxwidgets.org/2.9.2/classwx ... f26764f6d9

Re: Parsing HTML pages

Post by Muetdhiver » Fri Dec 02, 2011 8:18 am

Hello,

thanks.
Wow, I don't have the time to get the sources and analyse them. I would have thought a solution had already been developed in the wxWidgets API.

In fact, after posting these messages, I read more about the wxXmlDocument class, since HTML is just an implementation of the XML standards. So I think it could do the trick, but I haven't tested it yet.

Again thanks, bye.

Re: Parsing HTML pages

Post by Auria » Fri Dec 02, 2011 6:53 pm

Actually, it depends: if you need to parse XHTML, then wxXmlDocument will indeed do, but if you parse non-XHTML, it won't work.
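
To illustrate the difference, here is a rough sketch of one rule that XML (and therefore XHTML) imposes and ordinary HTML does not: every opened element must be closed, in the right order. This is not what wxXmlDocument does internally; the function TagsAreBalanced is a made-up, heavily simplified checker that ignores comments, CDATA sections, doctype declarations and '>' characters inside attribute values.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, simplified check: returns true if every opening tag in
// markup is closed again in the correct order (self-closing tags such as
// <br/> count as already closed).
bool TagsAreBalanced(const std::string& markup)
{
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;

    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return false;  // tag is never terminated

        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty())
            return false;

        const bool closing = tag.front() == '/';
        const bool selfClosing = tag.back() == '/';

        // Element name: characters after an optional leading '/',
        // up to the first whitespace or '/'.
        std::size_t i = closing ? 1 : 0;
        std::string name;
        while (i < tag.size() && tag[i] != '/' &&
               !std::isspace(static_cast<unsigned char>(tag[i])))
            name += tag[i++];

        if (closing) {
            if (open.empty() || open.back() != name)
                return false;  // closes something that is not open
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }
    return open.empty();
}
```

With this check, "<p>a<br/>b</p>" passes, while the perfectly valid HTML "<p>a<br>b</p>" fails because br is never closed. A strict XML parser would reject the second form for the same reason.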
