Wednesday, October 29, 2014

json parse is insane

every project i've worked on so far has had the same silly bug around JSON.parse.
at some point someone got a string instead of an object and used JSON.parse to convert it to an object.
then someone fixed it upstream, so JSON.parse now receives an object and fails with a very cryptic error message.
i always make sure my code is safe against this, so i write it like so:
var item = result.data;
if ( typeof item === 'string' ) {
   item = JSON.parse(item);
}
    
however, each time i write this piece of code, it feels like i am working around a bug in javascript.
it seems to me like one of 2 things should happen:
  • javascript should throw an error saying you cannot parse an object
  • javascript should simply return the object, as there is nothing to parse
i think the latter is better and more aligned with the rest of javascript's behavior.
but what happens now is simply insane - JSON.parse implicitly converts the object with toString and tries to parse that.
which is insane, because nowhere in the world of javascript is toString meant to return JSON.
this is why you have JSON.stringify to begin with.
so if anything, JSON.parse should use JSON.stringify instead of toString
- but what would be the point of that? simply return the object you got.
another reason why this is insane is that toString on a plain object returns [object Object],
which ironically enough starts like an array (which is a valid input for JSON.parse), and so the error developers get is unexpected token o.
and the last reason for insanity is that this has been the situation for quite a while now.
i tried to see what other libraries do with this insanity:
turns out jQuery doesn't try to fix this issue - $.parseJSON is just as insane.
lodash does not offer anything in this matter.
i know that angular behaves nicely - like i expected - but for projects that don't use angular, it would be overkill to pull in angular just for this.
other than that, i could not find any references to this problem anywhere.
this problem seems to me a lot like the console.log problem - where console does not exist in some browsers - and should have a similar fix.
my current recommendation is to fix this issue by replacing the JSON.parse method with something like:
JSON.parseString = JSON.parse;
JSON.parse = function ( o ) {
    if ( typeof o === 'string' ){
        // delegate to the original parse, passing along any extra arguments (e.g. a reviver)
        return JSON.parseString.apply( JSON, arguments );
    } else {
        return o;
    }
};
but i keep getting strong objections to such a solution, as it is intrusive - it changes a built-in for all code in the process. a less intrusive alternative is sketched below. what do you think? leave comments below.
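if patching JSON.parse is off the table, a standalone helper does the same job - a minimal sketch (safeParse is my own name for it, not a standard api):

function safeParse( value ) {
    // parse only when there is actually a string to parse
    return typeof value === 'string' ? JSON.parse(value) : value;
}

var item = safeParse(result.data); // works whether data arrives as a string or an object

the downside, of course, is that you have to remember to call it everywhere.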

Wednesday, October 22, 2014

seo with phantomjs part 3

this article is part of a series explaining seo for single page applications.
the stack i use contains node, angular, nginx etc.
in this part we are tying up all the loose ends.
by the end of this article you will have a single page application that supports seo.

writing the sitemap.xml and adding index.html to your paths

before we reach the final part of hooking it all together, there are 2 seo things we should do.
the first one is to add index.html to your paths.
it will make your life easier when handling redirects, the default index page and so on.
it is not a requirement, but i recommend it, and i assume you applied it in the rest of this post.
plus - developers are not usually aware of this, but not specifying index.html will cause problems when dealing with iframes.
i am not going to dwell on this here; i will only mention that i had 2 iframes in my application that did not work until i added index.html to the src attribute.

the other thing is adding a sitemap.
a sitemap tells the crawlers which pages they should crawl,
which improves the search results.
i strongly recommend using a sitemap and not relying on the crawlers' behavior of following links.

index.html

adding index.html is done on your front server and is quite easy.
all you need to do is add a rule that redirects / to /index.html.
in nginx it looks like so:

rewrite ^/$ $scheme://$host/index.html break;

sitemap

sitemaps are xml files that are returned when crawlers request the /sitemap.xml path.
you can also expose them on a different path and then submit the sitemaps to the search engines - this is usually what i do.

you can maintain a static file called sitemap.xml, but what is the fun in that?
if you have public content generated at runtime, you should auto-generate the sitemap.
simply use the sitemap module from npm.
it has a pretty straightforward api.
even though you can omit the index.html from the path, since crawlers will now be redirected to it automatically,
i recommend you specify it anyway.
your output should look like so:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/index.html#!/public/item/53f0f7b250dab2f71901abf8/intro</loc>
    <lastmod>2014-09-03</lastmod>
    <changefreq>hourly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
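a minimal sketch of generating and serving such an output with the sitemap module - assuming its createSitemap api and an express app; the url is the one from the example above:

var sm = require('sitemap');

var sitemap = sm.createSitemap({
    hostname: 'http://www.example.org',
    urls: [
        { url: '/index.html#!/public/item/53f0f7b250dab2f71901abf8/intro',
          lastmod: '2014-09-03', changefreq: 'hourly', priority: 0.5 }
    ]
});

// serve the xml on /sitemap.xml
app.get('/sitemap.xml', function (req, res) {
    res.header('Content-Type', 'application/xml');
    res.send(sitemap.toString());
});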
  

note that sitemaps have a size limit:
a single sitemap file can contain up to 50,000 entries.
when i helped implement seo for a site, i preferred to publish only the last 10K records that were updated.
my reasons were:

  • i don't want the crawlers to crawl the entire site every time
  • i don't want to construct a huge sitemap - it will consume a lot of memory and might crash the system

i might be wrong about these 2 assumptions, but i preferred to take the safe road.

hooking it all up

to hook everything up, you need to tell your front server to redirect all requests containing _escaped_fragment_ to your backend.
since we are dealing with a single page application, these requests will actually be for the index.html file - as there is no other route.
in nginx you can add the following:

location ~ /index.html {
    if ($args ~ "_escaped_fragment_") {
        rewrite ^(.*)$ /backend/crawler;
    }
}
  
just change /backend/crawler to your path.

in your express code, map this url to the code that uses phantom:

var phantom = require('phantom');

app.get('/backend/crawler', function(req, res){
    // the crawler sends the original fragment in the _escaped_fragment_ query parameter
    var url = req.param('_escaped_fragment_');
    url = req.absoluteUrl('/index.html#!' + decodeURIComponent(url));
    logger.info('prerendering url : ' + url);

    phantom.create(function (ph) {

        // safety valve - make sure the spawned phantomjs process dies eventually
        setTimeout(function(){
            try{
                ph.exit();
            }catch(e){
                logger.debug('unable to close phantom', e);
            }
        }, 30000);

        return ph.createPage(function (page) {
            page.open(url, function (status) {
                if (status === 'fail'){
                    res.send(500, 'unable to open url');
                    ph.exit();
                }else {
                    // the page has rendered - return its full html to the crawler
                    page.evaluate(function () {
                        return document.documentElement.innerHTML;
                    }, function (result) {
                        res.send(result);
                        ph.exit();
                    });
                }
            });
        });
    });
});
  

there are 2 things in this script that might seem weird.
the first is the absoluteUrl method. i assign this method to the request in a middleware.
this is the implementation:

exports.origin = function origin( req, res, next ){
    var _origin = req.protocol + '://' + req.get('Host');
    req.origin = _origin;

    // expects a URL from root "/some/page" which will result in "protocol://host:port/some/page"
    req.absoluteUrl = function( relativeUrl ){
        return _origin + relativeUrl;
    };
    next();
};
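to make absoluteUrl available, register the middleware before the crawler route - a minimal sketch, assuming the file above lives at middleware/origin.js:

app.use( require('./middleware/origin').origin );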
  

the other thing to note is the 30 second timeout, in which i invoke exit on phantom.
this is a safety valve.
since this code spawns new phantomjs processes, i want to make sure these processes die eventually.
i had the unfortunate opportunity to see this go haywire and bring the machine to 100% cpu.

on the same note, i suggest you add killall phantomjs to your start commands, so that every time you stop/start your application, no orphan phantomjs processes are left behind. for example:
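a minimal sketch of how that could look in package.json - npm runs the prestart script before start; the server.js entry point and the || true fallback (so the script doesn't fail when no phantomjs process exists) are my assumptions:

"scripts": {
    "prestart": "killall phantomjs || true",
    "start": "node server.js"
}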

so now there is only one thing left:
take a url from the sitemap, replace the #! with ?_escaped_fragment_= and fetch it with wget or curl, and see if you get the entire html.
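for example, with the sitemap entry from before:

curl "http://www.example.org/index.html?_escaped_fragment_=/public/item/53f0f7b250dab2f71901abf8/intro"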

if this worked, you can also go to facebook and try to share the url.

The end

thank you for reading this series. i hope it helped you.
please feel free to comment and give feedback.

Wednesday, October 15, 2014

seo with phantomjs part 2

this article is part of a series explaining seo for single page applications.
the stack i use contains node, angular, nginx etc.
in this part we are focusing on crawlers and single page applications.

identifying a crawler and providing the prerendered version

so now that we know how to generate a prerendered version of a page using phantomjs,
all we need to do is identify a crawler and redirect it to the prerendered version.
turns out this is the tricky part.

url fragments

turns out a lot of people don't know this part, so i decided to take a minute and explain.
urls have the general structure http://domain:port/path/?query#fragment
the part we are interested in for this post is the fragment.
if you are dealing with angularjs, you know that part very well.
a lot of developers do not know that fragments are client side only and never reach the backend.
so you cannot write code in the backend that checks if there is a fragment.

another important thing you should know about fragments is that changing one in javascript does not cause the entire page to refresh.
if you change any other part of the url, you will see the entire page refresh,
but fragments will not do that.
and so single page applications, like the ones that use angularjs, rely heavily on fragments.
this method allows them to keep state in the url without reloading the page.
saving the state is important - it allows you to copy-paste the url and send it to someone else - and not refreshing the page gives a nice user experience.
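you can see this behavior right in the browser console (the paths here are just examples):

// changing the fragment - the page does not reload
window.location.hash = '#/items/42';

// changing any other part of the url - triggers a full page load
window.location.pathname = '/items/42';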

it is also important to note that since html5, browsers support changing the url without refreshing the entire page,
so there is no need for fragments anymore.
in an angularjs application you can simply define: $locationProvider.html5Mode(true)

personally, i am still not confident enough to use html5 mode, so i keep using fragments. more on this soon.
however, you should consider using html5 mode, as some crawlers support only that method.

and so the single page applications live happily ever after... until seo comes into the picture.

how do crawlers handle single page applications?

from the name you might think crawling a single page application is very easy - there is only a single page, after all.. but that is misleading.
in fact there are a lot of pages in a single page application; they are all loaded lazily in the background into the one page that actually exists - the single page.
this causes a lot of issues for crawlers, since that "lazy background" loading requires running javascript and invoking ajax calls, which crawlers do not do.

so when a crawler comes to a single page application, it should somehow request a prerendered version of the page.

google's solution to the problem

along came google and declared a new standard.

if the url contains '#!' (hash-bang), the crawler will replace the hash-bang with _escaped_fragment_.
so if your url looks like http://domain:port/path/?query#!fragment (note the ! that was not there before), it will be crawled as
http://domain:port/path/?query&_escaped_fragment_=escaped_fragment - where the escaped fragment is essentially the fragment with the special characters escaped, since they have other meanings when they do not appear after a hash.
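for example (an illustrative url):

http://www.example.org/index.html#!/users/42
will be requested by the crawler as
http://www.example.org/index.html?_escaped_fragment_=/users/42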

html 5 mode

another, more modern option today is html5 mode.
this essentially tells angularjs to stop using the http://domain:port/path/?query#fragment format and start using http://domain:port/fragment.
browsers can now change the url without refreshing, the backend receives the entire path, and everyone is happy.
i chose not to use this method, as it is relatively new and i still need to build some trust in it before i do.

but not all crawlers follow google's standard.
if you try to share your page on linkedin, you will have problems unless you use html5 mode.
you can still expose specific urls for sharing, but it would be nicer to have it work right out of the box.
i encourage you to try using html5 mode.

adding the hash-bang

now comes the sad part of adding a '!' to all your urls.
define the following for angular: $locationProvider.html5Mode(false).hashPrefix('!'); (see the configuration sketch below) and go over all the links you wrote and change them.
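in context, that configuration looks something like this (the module name myApp is just an example):

angular.module('myApp').config(['$locationProvider', function ($locationProvider) {
    $locationProvider.html5Mode(false).hashPrefix('!');
}]);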

for backward compatibility, you should also add a script to your header that redirects from # to #!:

<script>
  try {
      // old urls arrive with '#/' - rewrite them to the new '#!/' format
      if ( window.location.hash.indexOf('#/') === 0 ){
          window.location.hash = '#!/' + window.location.hash.substring(2);
      }
  }catch(e){
      try{
          console.log('unable to redirect to new url', e);
      }catch(e){}
  }
</script>
  

Next time

the next article will help you set up a sitemap and serve prerendered versions of your pages.
it basically applies everything we learned until now.
that will be the last article in this series.

Wednesday, October 8, 2014

seo with phantomjs part 1

this article is part of a series explaining seo for single page applications.
the stack i use contains node, angular, nginx etc.
in this part we are focusing on phantomjs and how to use it to prerender a page that relies on javascript.

phantomjs to the rescue

phantomjs is a headless browser - it runs in memory, no graphics required.
you install it by running npm -g install phantomjs and then verify it is available by running phantomjs --version.
since it is a browser, it can do whatever a browser does, such as render css, execute javascript and so on.

the first thing i did was write a small snippet of code to test phantomjs.
here is a great snippet from the phantomjs official site, which you will find when searching for phantomjs get html:

var webPage = require('webpage');
var page = webPage.create();

page.open('http://phantomjs.org', function (status) {
  // page.content holds the page's html after it has loaded
  var content = page.content;
  console.log('Content: ' + content);
  phantom.exit();
});
  

so i wrote a file called phantom_example.js and, obviously, made the same mistake everyone else makes:
i ran node phantom_example.js.
and i got the following error:

module.js:340
    throw err;
          ^
Error: Cannot find module 'phantom'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/full/path/scripts/phantom_example.js:2:19)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
  
after digging around i found the obvious solution.
phantomjs is a command-line tool, not a library you require.
so running the command phantomjs phantom_example.js resolved it for me, and i got the html.

running this from within my server

so this script required me to run phantomjs directly, while i wanted to get the same result without leaving my node server.
this was even simpler.
turns out there are a lot of libraries that do just that.
my personal favorite is phantom.
these libraries essentially run the phantomjs command line for you,
so when you invoke them, you will see a phantomjs process running in the background,
which is why some of them require you to pre-install phantomjs.
phantom is such a library - it requires phantomjs to be installed.
here is a script with phantom that does the same thing. this time, running node phantom_example.js will produce the right result:
var phantom = require('phantom');
var url = 'http://phantomjs.org';

phantom.create(function (ph) {
    return ph.createPage(function (page) {
        page.open(url, function ( status ) {
            if ( status === 'fail'){
                console.log('unable to open url', status);
                ph.exit();
            }else {
                // the first function runs inside the page;
                // the second callback receives the value it returned
                page.evaluate(function () {
                    return document.documentElement.innerHTML;
                }, function (result) {
                    console.log( result );
                    ph.exit();
                });
            }
        });
    });
});
    

Next part

the next article will talk about crawlers and single page applications, where we will understand the problem and the 2 solutions introduced by google and html5.

Wednesday, October 1, 2014

seo with phantomjs

angularjs, seo, nginx, phantomjs, facebook share, node and sitemaps - in 60 minutes or less.
when i was asked how to make an angular site seo friendly, i was shocked to discover that
even though googlebot is supposed to support javascript, angular apps still show up with placeholders
where values should be, making your search result display as {{title}}.
really? 2014 is almost over, and we still have to deal with prerendering? omg..
as i was getting dizzy at the thought of having some jade template engine in my beautiful mean stack code,
i decided to risk everything and write a solution with phantomjs.
you will not believe how simple it is.
i was then shocked again to discover that there are services doing just that, and they charge a lot of money!
i was unimpressed by services like prerender.io
and what they offer, especially when i knew i was going to have a lot of pages soon.
and besides, why pay when it is so darn easy?

sharing on facebook doesn't work either, so who cares about google's crawler?

even if google has javascript support, i want to be able to share my pages on facebook and other social networks.
so i need a better solution.

in the next couple of articles i will talk in depth about how to add seo support for single page applications.

Monday, November 4, 2013

Gruntfile.js - adding another HTML file to usemin

Adding another HTML file to the Gruntfile's usemin

Recently I started using node, and with it yo, grunt and bower.
It is nice to get a quick kickstart,
but now, when I have to add or modify something in the build process,
I get stumped a lot.

You usually have a single base html file called index.html,
and then you have Angular with ng-view changing the content,
thus creating a Single Page Application.

However, I always find it necessary to have error pages which are self-contained,
which means index.html is not involved.

While yo's generators take care of index.html, your error page will not load correctly.
The reason for this is the usemin task in grunt, which turns your
href attributes to point to the minified version of the files.
For example, if index.html has the following in its header

<!-- build:css({.tmp,app}) styles/main.css -->
<link rel="stylesheet" href="styles/main.css">
<link rel="stylesheet" href="styles/page1.css">
<!-- endbuild -->  
  
grunt usemin will turn it into this
<link rel="stylesheet" href="styles/1b62fe48.main.css">

note that page1.css is not included in the output; that is because the new main.css
contains them both.

So the question is: what should I do if I have index2.html?
How would I get it to work here too?

Solution

The trick is to look at useminPrepare, which by default looks like so

useminPrepare: {
    html: '<%= yeoman.app %>/index.html',
    options: {
        dest: '<%= yeoman.dist %>'
    }
},  
  

If you modify the html field to include another file, that file will be picked up by the build process too.
You do that by simply turning the field into an array, like so:

useminPrepare: {
    html: ['<%= yeoman.app %>/index.html','<%= yeoman.app %>/index2.html'],
    options: {
        dest: '<%= yeoman.dist %>'
    }
},  
  

Assuming your index2.html has something similar to index.html

<!-- build:css({.tmp,app}) styles/main2.css -->
<link rel="stylesheet" href="styles/main2.css">
<link rel="stylesheet" href="styles/page2.css">
<!-- endbuild -->  
  
it will get picked up and processed accordingly, as shown below.
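The build output for index2.html should then contain a single minified reference, along the lines of (the revision hash is generated at build time, so yours will differ):

<link rel="stylesheet" href="styles/2b63fe59.main2.css">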

Monday, October 28, 2013

Configuration Module for NodeJS

NodeJS is great, but it lacks a settings/configuration mechanism.
Actually - I understand why it lacks one: configuration nowadays is written in JSON anyway, so in node you just require it.
But there are some features you still want/need that do not exist out of the box.

Missing Features

  • Overriding with another JS file
    This means I want the same syntax for overriding the configuration. Some libraries suggest command line flags; I do not like that.
  • Front-end support
    I want to manage configuration in a single place, and I want to be able to configure the front-end as well.
  • Failing to load if a required configuration is missing

Implementation

var publicConfiguration = {
    "title": "Hello World",
    "errorEmail": null // don't send emails if null
};

var privateConfiguration = {
    "port": 9040,
    "adminUsername": undefined,
    "adminPassword": undefined
};

var meConf = null;
try{
    meConf = require("../conf/dev/meConf");
}catch( e ) {
    console.log("meConf does not exist. ignoring..");
}




var publicConfigurationInitialized = false;
var privateConfigurationInitialized = false;

function getPublicConfiguration(){
    if (!publicConfigurationInitialized) {
        publicConfigurationInitialized = true;
        if (meConf != null) {
            // override only keys that already exist in publicConfiguration,
            // so nothing private can leak to the front-end
            for (var i in publicConfiguration) {
                if (meConf.hasOwnProperty(i)) {
                    publicConfiguration[i] = meConf[i];
                }
            }
        }
    }
    return publicConfiguration;
}


function getPrivateConfiguration(){
    if ( !privateConfigurationInitialized ) {
        privateConfigurationInitialized = true;

        // private configuration starts from the public values..
        var pubConf = getPublicConfiguration();
        if ( pubConf != null ){
            for ( var j in pubConf ){
                privateConfiguration[j] = pubConf[j];
            }
        }
        // ..and then meConf overrides everything
        if ( meConf != null ){
            for ( var i in meConf ){
                privateConfiguration[i] = meConf[i];
            }
        }
    }
    return privateConfiguration;
}


exports.sendPublicConfiguration = function( req, res ){
    var name = req.param("name") || "conf";

    res.send( "window." + name + " = " + JSON.stringify(getPublicConfiguration()) + ";");
};


// fail fast - crash on startup if a required (undefined) configuration is missing,
// and export every private configuration value
var prConf = getPrivateConfiguration();
if ( prConf != null ){
    for ( var i in prConf ){
        if ( prConf[i] === undefined ){
            throw new Error("undefined configuration [" + i + "]");
        }
        exports[i] = prConf[i];
    }
}

Explanations

sendPublicConfiguration
A route that sends the public configuration, so it can be exposed to the front-end.
You should register it like so:
app.get("/backend/conf", require("./conf.js").sendPublicConfiguration);

and then you can load the configuration by importing a script:
<script src="/backend/conf?name=myConf"></script>

The name parameter decides the global variable name that will hold the configuration.
undefined
Means the configuration key is required.
null
Means the configuration key is optional.
meConf
The overrides file. The code above assumes the module sits under the app folder and that meConf is under the conf/dev folder. You can easily change that so the overrides path is passed in when you require the module.
Overrides logic
Private configuration simply takes all values from the public configuration and then uses meConf for overrides.
Public configuration filters meConf and overrides only keys it already has.
This way we make sure nothing private is exposed.
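For illustration, a minimal meConf.js overrides file could look like so (the values are made up):

module.exports = {
    "title": "Hello Dev",
    "errorEmail": "admin@example.org",
    "adminUsername": "admin",
    "adminPassword": "12345"
};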

Environment Support

Some people find it nice to run node with a flag stating whether they are running in production or development.
I am not one of those people.
However, my code answers their needs as well.
You can have multiple meConf.js files

meConf.production.js
meConf.development.js

and so on, and import the right one using the environment flag, as sketched below.
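A minimal sketch, assuming the conventional NODE_ENV environment variable:

var env = process.env.NODE_ENV || "development";
meConf = require("../conf/dev/meConf." + env);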

You can also have a single meConf file with fields for production, development and so on, and invoke the same logic:

            meConf = require("../conf/dev/meConf")[environmentFlag];
        
Use it as you see fit.