
in this part we are tying up all the loose ends.
by the end of this article you will have a single page application that supports seo.
writing the sitemap.xml and adding index.html to your paths
before we reach the final part of hooking it all together, there are 2 seo things we should do.
the first one is to add index.html to your path.
it will make your life easier when handling redirects, serving a default index page and so on.
it is not a requirement, but i recommend it, and i assume you applied it in the rest of the post.
plus - developers are not usually aware of this, but not specifying index.html can cause problems when dealing with iframes.
i will not dwell on this here, but only mention that i had 2 iframes in my application that did not work until i added index.html to the src attribute.
the other thing is adding a sitemap.
a sitemap tells the crawlers which pages they should crawl, which improves your search results.
i strongly recommend using a sitemap and not relying on crawlers discovering pages by following links.
index.html
adding index.html is done on your front server and is quite easy.
all you need to do is add a redirect rule from / to /index.html.
in nginx it looks like so:

rewrite ^/$ $scheme://$host/index.html break;
sitemap
sitemaps are xml files that are returned when crawlers request the /sitemap.xml path.
you can also expose them on a different path and then submit the sitemaps to the search engines - this is usually what i do.
you could maintain a static file called sitemap.xml, but what is the fun in that?
if you have public content generated at runtime, you should use an auto-generated sitemap.
simply use the sitemap module from npm.
it has a pretty straightforward api.
even though you can omit the index.html from the path, as crawlers will now be redirected automatically, i recommend you specify it anyway.
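a minimal sketch of such an endpoint, assuming the createSitemap api of the sitemap module and a hypothetical getRecentItems function that loads your public content:

var sm = require('sitemap');

app.get('/sitemap.xml', function (req, res) {
    // getRecentItems is hypothetical - replace it with however you load your content
    getRecentItems(function (err, items) {
        if (err) {
            return res.send(500);
        }
        var sitemap = sm.createSitemap({
            hostname: 'http://www.example.org',
            urls: items.map(function (item) {
                // note the explicit index.html in the url, as recommended above
                return {
                    url: '/index.html#!/public/item/' + item._id + '/intro',
                    changefreq: 'hourly',
                    priority: 0.5
                };
            })
        });
        res.header('Content-Type', 'application/xml');
        res.send(sitemap.toString());
    });
});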
your output should look like so:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/index.html#!/public/item/53f0f7b250dab2f71901abf8/intro</loc>
    <lastmod>2014-09-03</lastmod>
    <changefreq>hourly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
note that a sitemap has a size limit - it can contain up to 50K entries.
when i helped implement seo for a site, i preferred to publish only the 10K most recently updated records.
my reasons were:
- i don’t want the crawlers to crawl the entire site every time
- i don’t want to construct a huge sitemap. it will consume a lot of memory and might crash the system
i might be wrong about these 2 assumptions, but i preferred to take the safe road.
hooking it all up
to hook everything up, you need to tell your front server to redirect all requests containing _escaped_fragment_ to your backend.
since we are dealing with a single page application, these requests will actually be for the index.html file - as there is no other route.
in nginx you can add the following:

location ~ /index.html {
    if ($args ~ "_escaped_fragment_") {
        rewrite ^(.*)$ /backend/crawler;
    }
}
just change /backend/crawler to your path.
in your express code, map this url to the code that uses phantom:
var phantom = require('phantom');

app.get('/backend/crawler', function (req, res) {
    // _escaped_fragment_ holds the part of the url that came after the #!
    var url = req.param('_escaped_fragment_');
    url = req.absoluteUrl('/index.html#!' + decodeURIComponent(url));
    logger.info('prerendering url : ' + url);
    phantom.create(function (ph) {
        // safety valve - make sure the phantomjs process dies eventually
        setTimeout(function () {
            try {
                ph.exit();
            } catch (e) {
                logger.debug('unable to close phantom', e);
            }
        }, 30000);
        return ph.createPage(function (page) {
            page.open(url, function (status) {
                if (status === 'fail') {
                    res.send(500, 'unable to open url');
                    ph.exit();
                } else {
                    // grab the rendered html and return it to the crawler
                    page.evaluate(function () {
                        return document.documentElement.innerHTML;
                    }, function (result) {
                        res.send(result);
                        ph.exit();
                    });
                }
            });
        });
    });
});
there are 2 things in this script that might seem weird.
the first is the method absoluteUrl. i assign this method to the request in a middleware.
this is the implementation:
exports.origin = function origin(req, res, next) {
    var _origin = req.protocol + '://' + req.get('Host');
    req.origin = _origin;
    // expects a url from root ("/some/page") which will result in "protocol://host:port/some/page"
    req.absoluteUrl = function (relativeUrl) {
        return _origin + relativeUrl;
    };
    next();
};
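
wiring it up is just a matter of registering the middleware before your routes. a minimal sketch, assuming the function above is exported from a hypothetical middleware.js:

var middleware = require('./middleware');
app.use(middleware.origin);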
the other thing to note is the 30 second timeout, in which i invoke exit on phantom.
this is a safety valve.
since this code spawns new phantomjs processes, i want to make sure these processes die eventually.
i had the unfortunate opportunity to see it go haywire and bring the machine to 100% cpu.
on the same note, i suggest you add killall phantomjs to your start commands, so that every time you stop/start your application it will make sure no orphan phantomjs processes are left behind.
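
for example, if you start your application with npm, something like this in your package.json would do (assuming a hypothetical server.js entry point; the ; makes the command proceed even when there is no phantomjs process to kill):

"scripts": {
    "start": "killall phantomjs; node server.js"
}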
so now there is only 1 thing left.
take a url from the sitemap, replace the #! with ?_escaped_fragment_=, and use wget or curl on it to see if you get the entire html.
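for example, taking the url from the sitemap above (the fragment value should be url-encoded):

curl 'http://www.example.org/index.html?_escaped_fragment_=%2Fpublic%2Fitem%2F53f0f7b250dab2f71901abf8%2Fintro'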
if this worked, you can also go to facebook and try to share the url.
The end
thank you for reading this series. i hope it helped you. please feel free to comment and give feedback.