*How to get the source code of a page? Topic is solved

whoops
Earned a small fee
Posts: 23
Joined: Sat Jun 27, 2015 5:53 am
Location: China

*How to get the source code of a page?

Post by whoops » Tue Aug 11, 2015 11:04 am


For example, I want to get the source code of http://www.w3.org/1999/xhtml/
Here is the source code (in Opera, right-click and choose "View page source", or press Ctrl+U):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="http://www.w3.org/StyleSheets/TR/base"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head profile="http://www.w3.org/2003/g/data-view">
<meta http-equiv="content-type"
content="application/xhtml+xml; charset=UTF-8" />
<title>XHTML namespace</title>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/base" />
<link rel="transformation" href="http://www.w3.org/2008/07/rdfa-xslt" />
<link xmlns:data-view="http://www.w3.org/2003/g/data-view#"
about="http://www.w3.org/1999/xhtml" rel="data-view:namespaceTransformation"
href="http://www.w3.org/2008/07/rdfa-xslt" />
</head>

<body>

<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/WWW/w3c_home" alt="W3C" /></a></p>
</div>

<h1><abbr title="Extensible HyperText Markup Language">XHTML</abbr>
namespace</h1>

<p>The namespace name <tt>http://www.w3.org/1999/xhtml</tt> is intended for use
in various specifications such as:</p>

<p>Recommendations:</p>

... (omitted)
So, how do I get it? Which classes are needed to fetch the source code?
For testing, you can put the result into a wxTextCtrl to display it.

doublemax
Moderator
Posts: 15262
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: *How to get the source code of a page?

Post by doublemax » Tue Aug 11, 2015 12:27 pm

Here's some old code that reads a web page and saves it to a local file. That should contain everything you need.
viewtopic.php?p=42101#p42101

Beware that the wxWidgets HTTP classes don't support HTTP redirects. If a website uses one, you'll get an empty file.
Use the source, Luke!

Re: *How to get the source code of a page?

Post by whoops » Tue Aug 11, 2015 1:56 pm


Here is the test code, but it threw a runtime exception:

Code:

#include <wx/wx.h>
#include <wx/url.h>
#define BUFSIZE 16384
int main(void)
{
	wxURL url(wxT("http://www.w3.org/1999/xhtml"));
	if(url.IsOk()) {
		// in_stream threw a runtime exception :-(
		wxInputStream *in_stream = url.GetInputStream();
		unsigned char buffer[BUFSIZE];
		do {
			in_stream->Read(buffer, BUFSIZE);
		} while (!in_stream->Eof());
		delete in_stream;
		wxString src;
		src.Printf(wxT("%s"), buffer);
	}
    return 0;
}
How do I deal with it (see the comment on line 8 of the code)?
I'm using wxWidgets 2.8.12.
 Things being equal, the simplest explanation tends to be the right.

 [ Windows 7 Ultimate x64 | wxWidgets 3.0.2 | Microsoft Visual C++ 2010 Express ]

Re: *How to get the source code of a page?

Post by doublemax » Tue Aug 11, 2015 2:33 pm

If you use wxWidgets from main(), you need to initialize it manually. You also need to check the pointer returned by GetInputStream() for NULL.

Code:

#include <wx/wx.h>
#include <wx/url.h>
#include <string>
#define BUFSIZE 16384
int main(void)
{
   // Initialize wxWidgets manually, since there is no wxApp here
   wxInitializer wx;

   if( !wx.IsOk() )
      return -1;

   wxURL url(wxT("http://www.w3.org/1999/xhtml"));
   if(url.IsOk())
   {
      wxInputStream *in_stream = url.GetInputStream();

      // in_stream can be NULL if something went wrong
      if( in_stream!=NULL )
      {
         // Append every chunk, so pages bigger than BUFSIZE are read completely
         std::string data;
         char buffer[BUFSIZE];
         do {
            in_stream->Read(buffer, BUFSIZE);
            // LastRead() is the number of bytes the last Read() delivered
            data.append(buffer, in_stream->LastRead());
         } while (!in_stream->Eof());
         delete in_stream;

         // Assumes the page is UTF-8, as this one declares in its header
         wxString src(data.c_str(), wxConvUTF8, data.length());
      }
   }
   return 0;
}

Re: *How to get the source code of a page?

Post by whoops » Wed Aug 12, 2015 2:06 am

Are there alternative ways to deal with the "301 Moved Permanently" status (redirect)?

Re: *How to get the source code of a page?

Post by doublemax » Wed Aug 12, 2015 8:17 am

Use the source, Luke!

Re: *How to get the source code of a page?

Post by whoops » Wed Aug 12, 2015 9:22 am

Thanks so much :)
