regression toward the datascience

Probability of norepeatword


Building a Python visualization of the probability of a norepeatword.

The problem, as stated in the Stat 110 strategic practice material:

A "norepeatword" is a sequence of at least one (and possibly all) of the usual
26 letters a, b, c, ..., z, with repetitions not allowed. For example, “course”
is a norepeatword, but “statistics” is ...
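
The definition quoted above can be checked mechanically. The following is a minimal sketch, not code from the original post: the helper names is_norepeatword and count_norepeatwords are hypothetical, and the count simply sums the number of ordered selections of k distinct letters for every allowed length k.

```python
from math import perm
from string import ascii_lowercase


def is_norepeatword(word):
    """Return True if `word` is a non-empty string of distinct letters a-z."""
    return (
        len(word) >= 1
        and all(ch in ascii_lowercase for ch in word)
        and len(set(word)) == len(word)
    )


def count_norepeatwords(alphabet_size=26):
    """Count all norepeatwords over an alphabet of the given size.

    A norepeatword of length k is an ordered choice of k distinct letters,
    so there are perm(alphabet_size, k) of them for each k >= 1.
    """
    return sum(perm(alphabet_size, k) for k in range(1, alphabet_size + 1))


if __name__ == "__main__":
    for candidate in ("course", "statistics"):
        print(candidate, is_norepeatword(candidate))
    print(count_norepeatwords())
```

Passing a small alphabet_size (for example 3) makes the count easy to verify by hand.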

Newton-Pepys problem


Building a Python visualization of the Newton-Pepys problem.

The problem is to find the probability of throwing sixes from a certain number of fair dice.

The problem, as stated by Pepys to Newton, is as follows:

Which of the following three propositions has the greatest chance of success?
    A. Six fair dice are ...
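
Each proposition of this kind asks for the chance of getting at least some number of sixes from some number of fair dice, which is a binomial tail probability. The sketch below is an illustration under that assumption, not the original post's code: the function name prob_at_least_sixes is hypothetical, and the dice counts and thresholds in the usage lines are illustrative values rather than a completion of the truncated statement above.

```python
from math import comb


def prob_at_least_sixes(min_sixes, num_dice, p_six=1 / 6):
    """Probability of at least `min_sixes` sixes when rolling `num_dice` fair dice.

    Sums the binomial probabilities P(X = i) for i = min_sixes .. num_dice,
    where X is the number of sixes.
    """
    return sum(
        comb(num_dice, i) * p_six**i * (1 - p_six) ** (num_dice - i)
        for i in range(min_sixes, num_dice + 1)
    )


if __name__ == "__main__":
    # Illustrative (min_sixes, num_dice) pairs; assumptions, not from the excerpt.
    for min_sixes, num_dice in [(1, 6), (2, 12)]:
        p = prob_at_least_sixes(min_sixes, num_dice)
        print(f"at least {min_sixes} six(es) from {num_dice} dice: {p:.4f}")
```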

Birthday paradox


Building a Python visualization of the famous birthday paradox, also known as the birthday problem.

The birthday problem asks for the probability that at least two people in a set of randomly chosen people share a birthday.

Assume a year has 365 days (ignoring February 29th). The following chart shows how the probability increases with ...
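
The probability itself is easiest to compute through the complement: the chance that all birthdays are distinct. The following is a minimal sketch under the 365-day assumption stated above, with birthdays taken to be uniformly distributed; the function name birthday_probability is hypothetical and this is not the plotting code from the original post.

```python
def birthday_probability(num_people, days=365):
    """Probability that at least two of `num_people` share a birthday.

    Computed as 1 minus the probability that all birthdays are distinct,
    assuming each of the `days` birthdays is equally likely.
    """
    p_all_distinct = 1.0
    for i in range(num_people):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(birthday_probability(n), 4))
```

With these assumptions the probability first exceeds one half at 23 people.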

Adding empty directory within GIT


While maintaining Git repositories, we sometimes need to add an empty directory as a placeholder, or we want all the files in a directory to stay local. However, Git tracks only files, not directories.

I ...

Tornado Error during WebSocket handshake


I was getting the following exception in the WebSocket client when trying to connect to a Tornado WebSocket server.

WebSocket connection to 'ws://localhost:5678/echo' failed: Error during WebSocket handshake: Unexpected response code: 403

and the server-side log showed

WARNING:tornado.access:403 GET /echo (::1) 6.00ms

A simple Echo WebSocket server ...

Binding port 80 to tomcat in Ubuntu


Edit the file /etc/default/tomcat7

Change the AUTHBIND line, which is commented out by default, to

AUTHBIND=yes

and run the following commands

$ sudo touch /etc/authbind/byport/80
$ sudo chmod 500 /etc/authbind/byport/80
$ sudo chown tomcat7 /etc/authbind/byport/80

Hope this helps.

Building Hadoop source code


Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using MapReduce.

The steps listed below build and package Hadoop from source code. This guide assumes a fresh installation of Ubuntu 14.04.

  1. Let's start with installing ...

Pig script to process CSV file with quotes and multiline


When writing a Pig script, we usually use PigStorage to load a CSV file.

Consider a sample CSV file in the following format.

2,Loading successfull,2014-09-25
3,Loading successfull,2014-09-25
4,Loading successfull,2014-09-25

It can be loaded as

logs = LOAD 'log_folder/log_file.csv' USING ...

Split array based on difference with NumPy


I had a NumPy array of numbers that I had to split based on changes in value.

For example, consider an array as shown below.

values = [112.0, 111.0, 113.0, 111.0, 112.0, 112.0, 112.0, 113.0, 113.0,
       113.0, 114.0, 114 ...
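
One common way to do this kind of split combines np.diff with np.split. The sketch below is an assumption about the intent, since the excerpt is truncated: it starts a new chunk at every position where the value differs from the previous one, uses only the values visible above, and the function name split_on_change is hypothetical.

```python
import numpy as np

# Only the values visible in the truncated excerpt are used here.
values = np.array([112.0, 111.0, 113.0, 111.0, 112.0, 112.0, 112.0,
                   113.0, 113.0, 113.0, 114.0])


def split_on_change(arr):
    """Split a 1-D array into runs of equal consecutive values."""
    arr = np.asarray(arr)
    # np.diff(arr)[i] is arr[i + 1] - arr[i]; a nonzero entry means
    # a new chunk begins at index i + 1.
    boundaries = np.flatnonzero(np.diff(arr) != 0) + 1
    return np.split(arr, boundaries)


chunks = split_on_change(values)
for chunk in chunks:
    print(chunk)
```

To split only on larger jumps, the condition could be changed to compare np.abs(np.diff(arr)) against a threshold of your choosing.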