Best way to parse HTML data

RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Best way to parse HTML data

Post by RP__ »

Hello,

I want to download the full source of a web page, parse it, and keep only the data I need.
This is the HTML code:

Code:

<div class="ui-hide-label" id="P6_ZOEK1_CONTAINER"><label id="P6_ZOEK1_LABEL" for="P6_ZOEK1"></label><div class="ui-input-text ui-body-inherit ui-corner-all ui-shadow-inset"><input name="P6_ZOEK1" class="text_field apex-item-text" id="P6_ZOEK1" type="text" size="30" maxlength="4000" placeholder="Search:" value=""></div></div><input name="P6_GED_ID" id="P6_GED_ID" type="hidden" value=""><input type="hidden" value="0" data-for="P6_GED_ID"><div class="ui-hide-label" id="P6_GED_ACTIVITEIT_CONTAINER"><label id="P6_GED_ACTIVITEIT_LABEL" for="P6_GED_ACTIVITEIT">Blahtext</label><div class="ui-select"><a class="ui-btn ui-icon-carat-d ui-btn-icon-right ui-corner-all ui-shadow" id="P6_GED_ACTIVITEIT-button" role="button" aria-haspopup="true" href="#"><span class="selectlist apex-item-select">AA text</span></a><select name="P6_GED_ACTIVITEIT" tabindex="-1" class="selectlist apex-item-select" id="P6_GED_ACTIVITEIT" size="1" data-native-menu="false">
<option value="AA">AA text</option>
<option value="BB">BB text</option>

I want to get all the "<option value="AA">AA text</option>" type entries from P6_ZOEK1_CONTAINER.
The number of entries varies, so there can be zero or more of them (0..*).
I could not find anything in the samples so I decided to create a topic.
Should I be looking at wxHtmlParser to do this? If so, are there any samples I can look at?
coderrc
Earned some good credits
Posts: 141
Joined: Tue Nov 01, 2016 2:46 pm

Re: Best way to parse HTML data

Post by coderrc »

Despite the fact that HTML is not a regular language and thus can't be parsed in general with regular expressions, since you only want one specific piece of data I would just use a regular expression and move on with my life.
RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Re: Best way to parse HTML data

Post by RP__ »

Okay, that is suitable for this situation.

When I use this regex, it doesn't compile:

Code:

"<option value=\".{10, 20}\">(.{9,15}) : (.{0,255})<\\/option>"
On a website like http://regexr.com/, <option value=".{10,20}">(.{9,15}) : (.{0,255})<\/option> does return results for the data I feed it.
It always returns false in my code, which is like this:

Code:

wxRegEx reText;
// Compile() returns false when the pattern is not valid for the chosen syntax.
if (!reText.Compile(HTML_CARD_REPRESENTATION, wxRE_ADVANCED)) {
	return;
}
I use the same setup with different regex strings in other places, which do succeed in compiling.
How different is wxWidgets' regex implementation from the one used on that website? Are there any special cases I should take into consideration?
doublemax
Moderator
Posts: 19160
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Best way to parse HTML data

Post by doublemax »

Code:

"<option value=\".{10, 20}\">(.{9,15}) : (.{0,255})<\\/option>"
I don't see how this could match the text you posted. E.g. in the html code the text between the first set of quotes has only two chars, but the regex expects 10-20 chars.

Also, you need the outer separators.

Code:

"/<option value=\".{1, 20}\">(.{1,15})\w+(.{1,255})<\\/option>/g"
Try this (untested).
Use the source, Luke!
RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Re: Best way to parse HTML data

Post by RP__ »

It won't work for the example provided, no.
But it does for this:

Code:

<option value="T1512 053">T1512 053 : TTS Text to Speech</option>
That's the actual representation of something I will be using.

Because the HTML also contains entries like the following, I want to make the regex more restrictive:

Code:

<option value="112">Scherpenzeel</option>
doublemax wrote:Also, you need the outer separators.

Code:

"/<option value=\".{1, 20}\">(.{1,15})\w+(.{1,255})<\\/option>/g"
Try this (untested).
This regex does not compile.

It's kinda confusing because the regexes I use elsewhere do not have the outer separators.
doublemax
Moderator
Posts: 19160
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Best way to parse HTML data

Post by doublemax »

Code:

wxRegEx reg("<option value=\".{1,20}\">(.{1,15})\\s+(.{1,20})<\\/option>", wxRE_ADVANCED );
if( reg.Matches( html ) )
{
  wxLogMessage("'%s' '%s'", reg.GetMatch( html, 1), reg.GetMatch( html, 2) );
}
This one works for the text you posted.
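
A note on why this version compiles while the earlier attempts did not: the likely culprits are the space inside the bounds ({10, 20} instead of {10,20}), the /.../g wrapper (that is JavaScript-style literal syntax, not part of the pattern itself), and the single backslash in \w inside a C++ string literal. A cheap way to catch such problems is to test each candidate pattern before using it. The sketch below does that with std::regex, which is an assumption made so that it builds without wxWidgets; it is a different engine than the one behind wxRegEx, so it only illustrates the idea. PatternCompiles is a hypothetical helper name, and with wxRegEx the equivalent check is the return value of Compile().

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of wxWidgets): reports whether a pattern
// compiles, using std::regex as a stand-in for wxRegEx::Compile().
bool PatternCompiles(const std::string& pattern)
{
    try {
        std::regex re(pattern);  // throws std::regex_error on a malformed pattern
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling PatternCompiles on the pattern from the post above, and then on the same pattern with a space added inside the braces, shows which of the two the engine accepts.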
RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Re: Best way to parse HTML data

Post by RP__ »

Great, that one works!

I noticed that the /g flag does not work.
I would like to apply the regex once and get all occurrences of it in a string,
ideally without having to advance an iterator myself for each match.

Right now it only finds the first match and then it stops.
eranon
Can't get richer than this
Posts: 867
Joined: Sun May 13, 2012 11:42 pm
Location: France

Re: Best way to parse HTML data

Post by eranon »

In my opinion, it's a way to parse HTML data, but not the best way to parse HTML data (in reference to your thread's title). Of course, it's enough if you have well-known and limited cases, but for intensive scraping/parsing, the best way would be to go with a dedicated library. For my part, I usually use libxml2 (which supports XML and HTML) with XPath, but there are plenty of other libraries around...
[Ind. dev. - wxWidgets 3.0/3.1 under "Win 7 64-bit, TDM64-GCC" + "OS X 10.9, LLVM Clang"]
RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Re: Best way to parse HTML data

Post by RP__ »

eranon wrote:In my opinion, it's a way to parse HTML data, but not the best way to parse HTML data (in reference to your thread's title). Of course, it's enough if you have well-known and limited cases, but for intensive scraping/parsing, the best way would be to go with a dedicated library. For my part, I usually use libxml2 (which supports XML and HTML) with XPath, but there are plenty of other libraries around...
So what's up with the wxHtmlParser then? Isn't that a lib that can parse HTML data?
I agree with you that I probably will need a library to do this stuff, also because I will most likely use more HTML data in the future.
eranon
Can't get richer than this
Posts: 867
Joined: Sun May 13, 2012 11:42 pm
Location: France

Re: Best way to parse HTML data

Post by eranon »

Yes, it is, "Captain" :), but frankly I have never used it. I was already happy reading, writing and traversing the XML DOM in wxWidgets (I have an app that uses a lot of XML), so when the time came to scrape/parse complex HTML (in another app), I immediately thought of XPath... and skipped wxHtmlParser. I'm not perfect ;)
RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Re: Best way to parse HTML data

Post by RP__ »

Okay, I would use wxHtmlParser if it had more documentation.
I looked for tutorials already but I could not find any. I don't really know how to get started.
iwbnwif
Super wx Problem Solver
Posts: 282
Joined: Tue Mar 19, 2013 8:52 pm

Re: Best way to parse HTML data

Post by iwbnwif »

There is some documentation here, although it is for a derived class.

http://docs.wxwidgets.org/trunk/overvie ... html_cells
wxWidgets 3.1.2, MinGW64 8.1.0, g++ 8.1.0, Ubuntu 19.04, Windows 10, CodeLite + wxCrafter
Some people, when confronted with a GUI problem, think "I know, I'll use Eclipse RCP". Now they have two problems.
RP__
Earned some good credits
Posts: 124
Joined: Tue Jan 20, 2015 5:53 pm

Re: Best way to parse HTML data

Post by RP__ »

I have decided to go with wxRegEx after all.

How can I make it capture all occurrences in a big string, instead of just the first?
doublemax
Moderator
Posts: 19160
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Best way to parse HTML data

Post by doublemax »

RP__ wrote:How can I make it capture all occurrences in a big string, instead of just the first?
I don't think that's possible. You have to loop until there are no more matches, as shown in the docs.
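
The loop can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses std::regex instead of wxRegEx so that it builds without wxWidgets, CollectOptions is a hypothetical helper name, and the pattern is a tightened variant of the one posted earlier (it requires the " : " separator, so entries such as <option value="112">Scherpenzeel</option> are skipped). With wxRegEx the shape is the same: call Matches() on the remaining text, read the groups with GetMatch(), then use GetMatch(&start, &len, 0) to cut off everything up to the end of the match and repeat.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (code, description) pairs for every
// <option value="...">CODE : DESCRIPTION</option> entry found in html.
std::vector<std::pair<std::string, std::string>>
CollectOptions(const std::string& html)
{
    // [^<] keeps each capture inside a single tag; the literal " : " is
    // what separates the code from the description.
    static const std::regex re(
        "<option value=\"[^\"]{1,20}\">([^<:]{1,15}) : ([^<]{0,255})</option>");

    std::vector<std::pair<std::string, std::string>> result;
    std::smatch m;
    auto searchFrom = html.cbegin();

    // Search, record the match, then continue right after it until
    // no further match is found.
    while (std::regex_search(searchFrom, html.cend(), m, re)) {
        result.emplace_back(m[1].str(), m[2].str());
        searchFrom = m[0].second;  // end of the whole match
    }
    return result;
}
```

Each match is at least a full option tag long, so the search position always moves forward and the loop terminates.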