*How to get the source code of a page?

Posted: Tue Aug 11, 2015 11:04 am
by whoops

For example, I want to get the source code of http://www.w3.org/1999/xhtml/
Here is the source code (in Opera, right-click and choose "View page source", or press CTRL + U):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="http://www.w3.org/StyleSheets/TR/base"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head profile="http://www.w3.org/2003/g/data-view">
<meta http-equiv="content-type"
content="application/xhtml+xml; charset=UTF-8" />
<title>XHTML namespace</title>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/base" />
<link rel="transformation" href="http://www.w3.org/2008/07/rdfa-xslt" />
<link xmlns:data-view="http://www.w3.org/2003/g/data-view#"
about="http://www.w3.org/1999/xhtml" rel="data-view:namespaceTransformation"
href="http://www.w3.org/2008/07/rdfa-xslt" />
</head>

<body>

<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/WWW/w3c_home" alt="W3C" /></a></p>
</div>

<h1><abbr title="Extensible HyperText Markup Language">XHTML</abbr>
namespace</h1>

<p>The namespace name <tt>http://www.w3.org/1999/xhtml</tt> is intended for use
in various specifications such as:</p>

<p>Recommendations:</p>

... (omitted)
So, how do I get it? Which classes are needed to fetch the source code? :?:
For testing, you can put the result into a wxTextCtrl to display it.

Re: *How to get the source code of a page?

Posted: Tue Aug 11, 2015 12:27 pm
by doublemax
Here's some old code that reads a web page and saves it to a local file. That should contain everything you need.
viewtopic.php?p=42101#p42101

Beware that the wxWidgets HTTP classes don't support HTTP redirects; if a website uses one, you'll get an empty file.
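Since the library will not follow a redirect for you, one possible workaround is to look at the response yourself: if the status code is in the 3xx redirect range, take the URL from the `Location` header and request that instead. The sketch below covers only the parsing step, using the standard library. `RedirectInfo` and `ParseRedirect` are hypothetical names made up for this illustration, and how you obtain the raw header text is left open (with wxHTTP, `GetResponse()` and `GetHeader()` should give you the status code and a single header without any manual parsing).

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper type: what we learned from a raw HTTP response header block.
struct RedirectInfo {
    int status;            // numeric status code, 0 if the status line was unreadable
    bool isRedirect;       // true for 301, 302, 303, 307 and 308
    std::string location;  // value of the Location header, empty if absent
};

// Strip leading and trailing whitespace, including the '\r' left by "\r\n" line ends.
static std::string Trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return std::string();
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Header names are case-insensitive, so compare them in lower case.
static std::string ToLower(std::string s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

RedirectInfo ParseRedirect(const std::string& rawHeaders)
{
    RedirectInfo info;
    info.status = 0;
    info.isRedirect = false;

    std::istringstream in(rawHeaders);
    std::string line;

    // Status line, e.g. "HTTP/1.1 301 Moved Permanently"
    if (std::getline(in, line)) {
        std::istringstream statusLine(line);
        std::string version;
        statusLine >> version >> info.status;
        if (!statusLine)
            info.status = 0;
    }
    info.isRedirect = info.status == 301 || info.status == 302 ||
                      info.status == 303 || info.status == 307 ||
                      info.status == 308;

    // Header lines run until the first blank line; the body follows and is ignored.
    while (std::getline(in, line)) {
        if (Trim(line).empty())
            break;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        if (ToLower(Trim(line.substr(0, colon))) == "location")
            info.location = Trim(line.substr(colon + 1));
    }
    return info;
}
```

If `isRedirect` is true and `location` is not empty, you would issue a second request to `location`, ideally with a small limit on the number of hops so that a redirect loop cannot hang the program.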

Re: *How to get the source code of a page?

Posted: Tue Aug 11, 2015 1:56 pm
by whoops

Here is the test code, but it threw a runtime exception:

#include <wx/wx.h>
#include <wx/url.h>
#define BUFSIZE 16384
int main(void)
{
	wxURL url(wxT("http://www.w3.org/1999/xhtml"));
	if(url.IsOk()) {
		// in_stream threw a runtime exception :-(
		wxInputStream *in_stream = url.GetInputStream();
		unsigned char buffer[BUFSIZE];
		do {
			in_stream->Read(buffer, BUFSIZE);
		} while (!in_stream->Eof());
		delete in_stream;
		wxString src;
		src.Printf(wxT("%s"), buffer);
	}
    return 0;
}
How do I deal with it (the comment in line 8)?
I'm using wxWidgets 2.8.12.

Re: *How to get the source code of a page?

Posted: Tue Aug 11, 2015 2:33 pm
by doublemax
If you use wxWidgets from main() you need to initialize it manually. And you need to check the returned pointer from GetInputStream() for NULL.

#include <wx/wx.h>
#include <wx/url.h>
#define BUFSIZE 16384
int main(void)
{
   wxInitializer wx;
   
   if( !wx.IsOk() )
    return -1;
   
   wxURL url(wxT("http://www.w3.org/1999/xhtml"));
   if(url.IsOk())
   {
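      wxString src;
      wxInputStream *in_stream = url.GetInputStream();

      // in_stream can be NULL if something went wrong
      if( in_stream!=NULL )
      {
        // collect every chunk so pages bigger than BUFSIZE are not truncated
        wxMemoryBuffer raw;
        char buffer[BUFSIZE];
        while (!in_stream->Eof()) {
           in_stream->Read(buffer, BUFSIZE);
           size_t count = in_stream->LastRead();
           if (count == 0)
              break;   // nothing was read: end of data or an error
           raw.AppendData(buffer, count);
        }
        delete in_stream;
        // convert once at the end; assumes the page is UTF-8, as this one declares
        src = wxString((const char *)raw.GetData(), wxConvUTF8, raw.GetDataLen());
      }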
   }
    return 0;
}
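The key point about the read loop is independent of wxWidgets: each read may deliver fewer bytes than the buffer holds, so every chunk has to be appended using the count that was actually read, otherwise a page bigger than one buffer is truncated or padded with stale data. Here is a minimal sketch of that pattern with a plain `std::istream`; `ReadAll` is a hypothetical helper name, and `gcount()` plays the role that `LastRead()` plays for a wxInputStream.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Read a stream to the end in fixed-size chunks, appending only the bytes
// that each read actually delivered.
std::string ReadAll(std::istream& in, std::size_t chunkSize = 16384)
{
    assert(chunkSize > 0);  // a zero-sized chunk would never make progress

    std::string result;
    std::string chunk(chunkSize, '\0');

    // read() fails on the last, partial chunk, but gcount() still reports how
    // many bytes arrived, so that chunk is appended before the loop ends.
    while (in.read(&chunk[0], static_cast<std::streamsize>(chunk.size())) ||
           in.gcount() > 0)
    {
        result.append(chunk.data(), static_cast<std::size_t>(in.gcount()));
    }
    return result;
}
```

For example, wrapping a string in a `std::istringstream` and calling `ReadAll(stream, 8)` returns the whole string even though it is read eight bytes at a time. Converting the bytes to text once, after the loop, also avoids cutting a multi-byte UTF-8 character in half at a chunk boundary.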

Re: *How to get the source code of a page?

Posted: Wed Aug 12, 2015 2:06 am
by whoops
Are there alternative ways to deal with the status "301 Moved Permanently" (redirect)?

Re: *How to get the source code of a page?

Posted: Wed Aug 12, 2015 8:17 am
by doublemax

Re: *How to get the source code of a page?

Posted: Wed Aug 12, 2015 9:22 am
by whoops
thanks so much :)