Wednesday, April 08, 2015

IPython notebook and R

I chose to use Python 3. Several of the shell commands below carry a "3" suffix in Debian testing as of April 2015: ipython3, pip3.

Install programs

I installed ipython3-notebook (in Debian Jessie) from the Synaptic package manager.

In order to install the R module, I installed pip for Python 3 from the Synaptic package manager. pip is Python's package installer; it downloads modules from the Python Package Index (PyPI). Then I used pip3 to install rpy2:
sudo pip3 install rpy2
There is a blog post on how to avoid using sudo to install pip modules, by passing the --user flag.

Install statsmodels, a module for statistical modelling and econometrics in Python. Maybe I should have installed python-statsmodels as a Debian package instead? But it seems to be linked to Python 2.x rather than Python 3 (it had a dependency on python2.7-dev). Therefore I installed statsmodels with pip3, using the --user flag to install it as a user-only module.
pip3 install --user statsmodels
The installation took several minutes on my system. It seemed to be installing a number of dependencies. Many warnings about variables defined but not used were returned but the installation kept running. The final message was:
Successfully installed statsmodels numpy scipy pandas patsy python-dateutil pytz
Cleaning up...
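
To check that the modules listed above really ended up somewhere Python 3 can find them, something like the following can be run with python3. It is only a minimal sketch: the MODULES list and the find_missing helper are names I made up for illustration, and the list uses import names (python-dateutil is imported as dateutil). It also prints the directory that the --user flag installs into.

```python
import importlib.util
import site

# Import names of the packages reported by pip above
# (assumption: python-dateutil is imported as "dateutil").
MODULES = ["statsmodels", "numpy", "scipy", "pandas", "patsy", "dateutil", "pytz"]


def find_missing(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Directory used by "pip3 install --user" for the current user.
print("User site directory:", site.getusersitepackages())

missing = find_missing(MODULES)
print("Missing modules:", missing if missing else "none")
```

If the last line reports missing modules, the pip3 install probably went to a different Python version than the one running the script.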

Starting the IPython notebook

Move to the directory where the notebooks will be stored, then start the IPython notebook server:
cd python
ipython3 notebook

Shortcuts

See also the IPython Notebook shortcuts. Useful shortcuts are ESCAPE to go to command (navigation) mode and ENTER to enter edit mode. It seems one can use the vim navigation keys j and k to move down and up between cells. Pressing the "d" key twice deletes a cell. CTRL+ENTER runs the cell in place, SHIFT+ENTER runs the cell and jumps to the next one, and ALT+ENTER runs the cell and inserts a new cell below.

Run R commands in the IPython notebook


Load an ipython extension that deals with R commands
%load_ext rpy2.ipython
Display a standard R dataset
%R head(cars)
%R plot(cars)
Use data from the python statsmodels module based on this page.
import statsmodels.datasets as sd
data = sd.longley.load_pandas()
Print column names of the dataset
print(data.endog_name)
print(data.exog_name)
Print a dataset as an html table by simply giving its name in the cell. For example this data frame contains exogenous variables:
data.exog
Python can pass variables to R with the following command:
totemp = data.endog
gnp = data.exog['GNP']
%R -i totemp,gnp
Estimate a linear model with R
%%R
fit <- lm(totemp ~ gnp)  # Least-squares regression.
print(fit$coefficients)  # Display the coefficients of the fit.
plot(gnp, totemp)  # Plot the data points.
abline(fit)  # And plot the linear regression.
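
To see what the R regression above computes when there is a single predictor, here is a rough sketch of the closed-form least-squares formulas in plain Python. The simple_ols function is a hypothetical helper of mine, not part of rpy2 or statsmodels, and the sample numbers are made up for illustration (they are not the Longley values).

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustration data lying exactly on y = 1 + 2x (NOT the Longley data).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = simple_ols(x, y)
print(intercept, slope)  # prints: 1.0 2.0
```

In the notebook, calling simple_ols(list(gnp), list(totemp)) in a Python cell should give values close to the coefficients printed by R, which makes for a quick sanity check.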
Plot the datapoints and linear regression with the ggplot2 package
%%R
library(ggplot2)
ggplot(data = NULL, aes(x = gnp, y = totemp)) +
    geom_point() +
    geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2])
