Best way to parse HTML data

Postby RP__ » Fri Apr 21, 2017 10:24 am

Hello,

To retrieve data from a webpage, I want to get the full page source, then parse it and extract only the data I need.
This is the HTML code:

Code:

<div class="ui-hide-label" id="P6_ZOEK1_CONTAINER"><label id="P6_ZOEK1_LABEL" for="P6_ZOEK1"></label><div class="ui-input-text ui-body-inherit ui-corner-all ui-shadow-inset"><input name="P6_ZOEK1" class="text_field apex-item-text" id="P6_ZOEK1" type="text" size="30" maxlength="4000" placeholder="Search:" value=""></div></div><input name="P6_GED_ID" id="P6_GED_ID" type="hidden" value=""><input type="hidden" value="0" data-for="P6_GED_ID"><div class="ui-hide-label" id="P6_GED_ACTIVITEIT_CONTAINER"><label id="P6_GED_ACTIVITEIT_LABEL" for="P6_GED_ACTIVITEIT">Blahtext</label><div class="ui-select"><a class="ui-btn ui-icon-carat-d ui-btn-icon-right ui-corner-all ui-shadow" id="P6_GED_ACTIVITEIT-button" role="button" aria-haspopup="true" href="#"><span class="selectlist apex-item-select">AA text</span></a><select name="P6_GED_ACTIVITEIT" tabindex="-1" class="selectlist apex-item-select" id="P6_GED_ACTIVITEIT" size="1" data-native-menu="false">
<option value="AA">AA text</option>
<option value="BB">BB text</option>



I want to get all the entries of the form "<option value="AA">AA text</option>" from P6_ZOEK1_CONTAINER.
There can be any number of them (0..*).
I could not find anything in the samples so I decided to create a topic.
Should I be looking at wxHtmlParser to do this? If so, are there any samples I can look at?

Postby coderrc » Fri Apr 21, 2017 11:36 am

Even though HTML is not a regular language and thus can't be parsed in general with regular expressions, since you only want one specific piece of data I would just use a regular expression and move on with my life.

Postby RP__ » Fri Apr 21, 2017 7:19 pm

Okay, that is suitable for this situation.

When I use this regex, it doesn't compile:

Code:

"<option value=\".{10, 20}\">(.{9,15}) : (.{0,255})<\\/option>"


On a website like http://regexr.com/, <option value=".{10,20}">(.{9,15}) : (.{0,255})<\/option> does return results for the data I feed it.
In my code, Compile() always returns false. The code looks like this:

Code:

wxRegEx reText;
if (reText.Compile(HTML_CARD_REPRESENTATION, wxRE_ADVANCED) == false) {
   return;
}


I use the same setup with different regex strings in other places, which do succeed in compiling.
How different is wxWidgets' regex implementation from the one used by that website? Are there any special cases I should take into consideration?
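
One visible difference between the two patterns above: the C++ string literal has a space inside the bound ({10, 20}), while the version pasted into regexr does not ({10,20}). Many regex engines reject whitespace inside a {m,n} bound, so that is a plausible reason for the compile failure. The sketch below uses std::regex as a stand-in, not wxRegEx, purely to check both spellings side by side; PatternCompiles is a hypothetical helper name, and wxRegEx may behave differently in detail.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if std::regex (default ECMAScript
// grammar) accepts the pattern. std::regex is only a stand-in here;
// with wxRegEx you would check the return value of Compile() instead.
bool PatternCompiles(const std::string& pattern)
{
    try {
        std::regex re(pattern);   // throws std::regex_error on a bad pattern
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// The two spellings from the post, differing only in the space after the comma.
const std::string kWithSpace =
    "<option value=\".{10, 20}\">(.{9,15}) : (.{0,255})</option>";
const std::string kNoSpace =
    "<option value=\".{10,20}\">(.{9,15}) : (.{0,255})</option>";
```

Calling PatternCompiles(kWithSpace) and PatternCompiles(kNoSpace) shows whether the space alone changes the result on a given standard library. If the same holds for wxRegEx, removing the space should be enough to make Compile() succeed.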

Postby doublemax » Fri Apr 21, 2017 8:32 pm

Code:

"<option value=\".{10, 20}\">(.{9,15}) : (.{0,255})<\\/option>"
I don't see how this could match the text you posted. E.g. in the html code the text between the first set of quotes has only two chars, but the regex expects 10-20 chars.

Also, you need the outer separators.

Code:

"/<option value=\".{1, 20}\">(.{1,15})\w+(.{1,255})<\\/option>/g"
Try this (untested).
Use the source, Luke!

Postby RP__ » Sat Apr 22, 2017 8:46 am

It won't work for the example provided, no.
But it does for this:

Code:

<option value="T1512 053">T1512 053 : TTS Text to Speech</option>

That's the actual representation of something I will be using.

Because the HTML also contains entries like the following, I want to make the regex more restrictive:

Code:

<option value="112">Scherpenzeel</option>


doublemax wrote:Also, you need the outer separators.

Code:

"/<option value=\".{1, 20}\">(.{1,15})\w+(.{1,255})<\\/option>/g"
Try this (untested).

This regex does not compile.

It's kinda confusing because the regexes I use elsewhere do not have the outer separators.
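
One way to accept the "T1512 053 : TTS Text to Speech" entry while rejecting the "Scherpenzeel" one is to require the " : " separator and to use negated character classes instead of ".". This is a minimal sketch using std::regex as a stand-in for wxRegEx; ParseCardOption and the exact length limits are assumptions made for illustration, not something taken from the real page.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: splits an <option> line of the form
//   <option value="...">CODE : DESCRIPTION</option>
// into its code and description. Returns false if the line does not
// have that shape (for example, if the " : " separator is missing).
bool ParseCardOption(const std::string& line,
                     std::string& code,
                     std::string& description)
{
    // [^\"] keeps the value inside its quotes, [^<:] stops the code at the
    // separator, and [^<] stops the description at the closing tag.
    static const std::regex re(
        "<option value=\"[^\"]{1,20}\">([^<:]{1,15}) : ([^<]{1,255})</option>");

    std::smatch m;
    if (!std::regex_search(line, m, re))
        return false;

    code = m[1].str();
    description = m[2].str();
    return true;
}
```

Because the separator is mandatory, an entry without " : " does not match at all, so no extra filtering is needed afterwards.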

Postby doublemax » Sat Apr 22, 2017 11:50 am

Code:

wxRegEx reg("<option value=\".{1,20}\">(.{1,15})\\s+(.{1,20})<\\/option>", wxRE_ADVANCED );
if( reg.Matches( html ) )
{
  wxLogMessage("'%s' '%s'", reg.GetMatch( html, 1), reg.GetMatch( html, 2) );
}
This one works for the text you posted.

Postby RP__ » Mon Apr 24, 2017 6:26 am

Great, that one works!

I noticed that the /g flag does not work well.
I would like to use the regex once to match all occurrences of it in a string.
Something where I would not have to move an iterator for each match, for example.

Right now it only finds the first match and then it stops.
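
As far as I know, wxRegEx has no "global" flag, and the /.../g notation belongs to JavaScript-style regex literals rather than to the pattern itself, so collecting every match needs some kind of loop. The sketch below shows the idea with std::regex, used here as a stand-in for wxRegEx; FindAllOptions is a hypothetical helper and the pattern is deliberately loose. With wxRegEx, the equivalent would presumably be to call Matches() on the remaining text and use GetMatch(&start, &len) to advance past each hit.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (value, text) for every
// <option value="...">...</option> found in the input.
std::vector<std::pair<std::string, std::string>>
FindAllOptions(const std::string& html)
{
    static const std::regex re("<option value=\"([^\"]*)\">([^<]*)</option>");

    std::vector<std::pair<std::string, std::string>> result;

    // sregex_iterator restarts the search after each match, which is
    // what a "global" flag does in other regex flavours.
    for (std::sregex_iterator it(html.begin(), html.end(), re), end;
         it != end; ++it)
    {
        result.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return result;
}
```

The loop is hidden inside the helper, so the calling code gets all matches from a single call and an empty vector when there are none.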

Postby eranon » Mon Apr 24, 2017 11:02 am

In my opinion, it's a way to parse HTML data, but not the best way to parse HTML data (in reference to your thread's title). Of course, it's enough if you have well-known and limited cases, but for intensive scraping/parsing, the best way would be to go with a dedicated lib. For my part, I usually use libxml2 (which supports XML and HTML) with XPath, but there are a lot of other libs around...
[Ind. dev. - wxWidgets 3.0/3.1 under "Win 7 64-bit, TDM64-GCC" + "OS X 10.9, LLVM Clang"]

Postby RP__ » Mon Apr 24, 2017 11:43 am

eranon wrote:In my opinion, it's a way to parse HTML data, but not the best way to parse HTML data (in reference to your thread's title). Of course, it's enough if you have well-known and limited cases, but for intensive scraping/parsing, the best way would be to go with a dedicated lib. For my part, I usually use libxml2 (which supports XML and HTML) with XPath, but there are a lot of other libs around...


So what's up with the wxHtmlParser then? Isn't that a lib that can parse HTML data?
I agree with you that I probably will need a library to do this stuff, also because I will most likely use more HTML data in the future.

Postby eranon » Mon Apr 24, 2017 11:53 am

Yes, it is, "Captain" :), but, frankly, I've never used it. I was already happy reading, writing and traversing XML DOM in wxWidgets (and I have an app using a lot of XML stuff), so when the time came to scrape/parse complex HTML (in another app), I immediately thought of XPath... and skipped wxHtmlParser. I'm not perfect ;)

Postby RP__ » Mon Apr 24, 2017 12:19 pm

Okay, I would use wxHtmlParser if it had more documentation.
I looked for tutorials already but I could not find any. I don't really know how to get started.

Postby iwbnwif » Mon Apr 24, 2017 12:28 pm

There is some documentation here, although it is for a derived class.

http://docs.wxwidgets.org/trunk/overvie ... html_cells
wxWidgets 3.10, TDM MinGW64 5.1.0, g++ 4.8.2, Ubuntu 16.04LTS, Windows 10, CodeLite + wxCrafter
Some people, when confronted with a GUI problem, think "I know, I'll use Eclipse RCP". Now they have two problems.

