
Using A Docker Image For Cloudera Certification Preparation

To study for the Cloudera Spark & Hadoop Developer certification, I tried to use the Cloudera QuickStart VM, with no success, on my Ubuntu box (15 GB RAM, stashed away in the basement, so I usually ssh/RDP into it even when at home).

I ran into an incompatibility with the VirtualBox version for Ubuntu 16.04 LTS, and I did not want to run VirtualBox on my Mac – 8 GB RAM, of which almost 6 GB is in use with my everyday tinkering with software.

I opted not to figure out what was wrong with my VirtualBox installation and went for the easy way out – a Docker image! (Later I got KVM to work, but that's a story for another day.)

The instructions are laid out here. I then had to make a few changes/additions to make them work for me.

Rather than what Cloudera has in its documentation:


docker run --hostname=quickstart.cloudera --privileged=true -t -i [OPTIONS] [IMAGE] /usr/bin/docker-quickstart

I use the run command below, making sure I'm in the directory with the source code/commands I'll be using in the container (it will be mounted to the container's source-code directory):


docker run --hostname=quickstart.cloudera --privileged=true -tdi --rm \
  --name=cloudera -p 8888 --mount type=bind,source="$(pwd)",target=/home/cloudera/certification \
  cloudera/quickstart:latest /usr/bin/docker-quickstart

-d so that the container is not attached to my current terminal (I'll be able to 'exec' … '/bin/bash' into the container later)

--rm so the container is removed on stopping or killing it

--name so I don't have to hunt for the container's hash ID when, for example, I want to stop it

So now I can get into the container:

docker exec -itu cloudera cloudera /bin/bash

and inside the container the location "/home/cloudera/certification" is mounted to my computer's directory with the source code for certification prep (I suggest using this website as a tutorial). This way I can take down the container and keep the source code at all times.

In addition to this, I sometimes find it useful to interact with some HTTP endpoints in the image (like Hue and Cloudera Manager) when I'm not at home. I've resorted to using an SSH tunnel to do this.


ssh -N -f -L 7180:172.17.0.2:7180 [my-home-public-ip-address]

172.17.0.2 is the Docker container's local IP. Port 7180 is for Cloudera Manager; I do the same with port 8888 for Hue. So now both are available at my localhost:7180 and localhost:8888.

 

Walkthrough Spark AggregateByKey (Using Pyspark)

I haven't found a good beginner-level explanation of Apache Spark's aggregateByKey method, so I'm writing one up.

This was part of a Coursera big data class at https://www.coursera.org/learn/big-data-essentials

The simple case I was using: given a social graph file with userId and followerId pairs, find the user with the most followers.

So this is a simple grouping by userId, aggregating by the count of followers each user has.
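(As a quick aside before the walkthrough: when all you need is a count per key, map plus reduceByKey also does the job. Below is a minimal sketch of that version, run from the pyspark shell where sc already exists, on a few hard-coded sample edges; aggregateByKey, covered next, is the more general tool.)

import operator

# hard-coded (user, follower) edges, just for illustration
pairs = sc.parallelize([(1, 5), (2, 6), (1, 7)])

# map each edge to (user, 1) and sum the 1s per user
counts = pairs.map(lambda p: (p[0], 1)).reduceByKey(operator.add)

counts.collect()  # e.g. [(1, 2), (2, 1)] - follower counts per user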

Implementing this using the aggregateByKey method is as follows:


# from the pyspark shell (where sc is already defined), read in the file
import operator

#read in a local file
raw_data = sc.textFile('/data/twitter/twitter_sample_small.txt')

#define a method to read the data, split by tab
def parse_edge(s):
    user, follower = s.split('\t')
    return (int(user), int(follower))

# cache the intermediate rdd after parsing it
edges = raw_data.map(parse_edge).cache()

#apply aggregateByKey - see explanation below the code
fol_agg = edges.aggregateByKey(0, lambda v1, v2: v1 + 1, operator.add)

# top user/key with the most followers.
# use operator.itemgetter to make sure the values (aggregated counts),
# not the keys/userIds, are used for the comparison
top_user = fol_agg.top(1, key=operator.itemgetter(1))

Explanation:
0 is the starting ("zero") value for each key.
The 1st closure (lambda v1, v2: v1 + 1) works over all records with the same key, but only within a partition.
Say the 1st 3 records' key, value pairs are:
1\t5
2\t6
1\t7
For userId (key) 1, the 1st iteration takes the values 0 (the start value) and 5, and returns 0 + 1 = 1.
For the 2nd iteration for key 1, it takes 1 (the running total from the previous step) and 7 (from the next record), and returns 1 + 1 = 2.
So key 1 now has an aggregated value of 2, and so on…
Note that the actual values 5 and 7 are not used directly; they are just proxies telling us a record exists for the key, each counting exactly once.

The 2nd function passed (operator.add) adds up the per-partition totals produced by the 1st lambda – this one runs across partitions.
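To make the two roles concrete, here is a minimal plain-Python sketch (no Spark required) of what aggregateByKey is doing conceptually, with the sample records above plus a second, hypothetical partition:

import operator

# (user, follower) records split across two hypothetical partitions
partitions = [[(1, 5), (2, 6), (1, 7)],
              [(1, 9), (2, 3)]]

seq_op = lambda v1, v2: v1 + 1   # within a partition: bump the running count
comb_op = operator.add           # across partitions: add the partial counts

totals = {}
for part in partitions:
    partial = {}
    for user, follower in part:
        # 0 is the starting value the first time a key is seen in a partition
        partial[user] = seq_op(partial.get(user, 0), follower)
    for user, count in partial.items():
        totals[user] = comb_op(totals.get(user, 0), count)

print(totals)  # {1: 3, 2: 2} - user 1 has the most followers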

Source code (slightly modified to run as a Python script rather than from the pyspark shell):
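A minimal sketch of that script form – the key difference from the shell session above is that a standalone script has to create (and stop) the SparkContext itself; the app name is an arbitrary placeholder:

import operator
from pyspark import SparkContext

# the pyspark shell predefines sc; a standalone script must create it
sc = SparkContext(appName='follower-count')

raw_data = sc.textFile('/data/twitter/twitter_sample_small.txt')

def parse_edge(s):
    user, follower = s.split('\t')
    return (int(user), int(follower))

edges = raw_data.map(parse_edge).cache()
fol_agg = edges.aggregateByKey(0, lambda v1, v2: v1 + 1, operator.add)
print(fol_agg.top(1, key=operator.itemgetter(1)))

sc.stop()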

 

Posted by on December 15, 2017 in big data, pyspark, python, software, spark

 

Spring Shell with Spring Boot

Spring Shell is a Spring project that exposes a shell for interacting with a Spring application.
I found this useful in cases where I don't need a full-blown web app but still want Spring features – the Spring packages, Spring Data, property management and dependency injection – to develop an application.
The Spring Shell setup described in the documentation is clear enough, but it does not cover using Spring Shell with Spring Boot, which really speeds up bootstrapping a Spring application.

With a few references, and some digging around the method in Spring Shell that gets called from a Spring application's main method, I found the following solution.

Spring Shell assumes there is an XML config that it reads, something like this:


new ClassPathXmlApplicationContext("classpath*:/META-INF/spring/spring-shell-plugin.xml");

To run the Spring Shell application, the Spring application should call the main method in Spring Shell's Bootstrap class.

public static void main(String[] args) throws IOException {
    Bootstrap.main(args);
}

Then the commands that should be available in the shell are defined as Spring components like this:

@Component
public class HelloWorldCommands implements CommandMarker {

    // use any Spring annotations for Dependency Injection or other Spring
    // interfaces as required.

    // methods with @Cli annotations go here

}

So to change this setup so that Spring Shell reads a context set up by Spring Boot, the code below does the following:

  1. sets up a Spring Boot app
  2. retrieves and runs the shell from the context produced by Spring Boot
  3. carries over the rest of the code from the Spring Shell Bootstrap class's main method, since under the Spring Boot setup we won't be calling Bootstrap.main

@SpringBootApplication
@ComponentScan({"org.springframework.shell.commands" /*, your packages here */})
public class MyshellApplication {

    private static ApplicationContext ctx;
    private static CommandLine commandLine;
    private static StopWatch sw = new StopWatch("Spring Shell");

    @Bean
    public JLineShellComponent jLineShellComponent() {
        return new JLineShellComponent();
    }

    @Bean
    public CommandLine commandLine() {
        return new CommandLine(null, 3000, null);
    }

    public static void main(String[] args) {
        sw.start();
        ctx = SpringApplication.run(MyshellApplication.class);
        commandLine = SimpleShellCommandLineOptions.parseCommandLine(args);
        MyshellApplication application = new MyshellApplication();

        application.runShell();
    }

    private ExitShellRequest runShell() {

        String[] commandsToExecuteAndThenQuit = commandLine.getShellCommandsToExecute();
        JLineShellComponent shell = ctx.getBean(JLineShellComponent.class);
        ExitShellRequest exitShellRequest;

        if (null != commandsToExecuteAndThenQuit) {
            boolean successful = false;
            exitShellRequest = ExitShellRequest.FATAL_EXIT;

            for (String cmd : commandsToExecuteAndThenQuit) {
                successful = shell.executeCommand(cmd).isSuccess();
                if (!successful)
                    break;
            }

            // if all commands were successful, set the normal exit status
            if (successful) {
                exitShellRequest = ExitShellRequest.NORMAL_EXIT;
            }
        } else {
            shell.start();
            shell.promptLoop();
            exitShellRequest = shell.getExitShellRequest();
            if (exitShellRequest == null) {
                // shouldn't really happen, but we'll fall back to this anyway
                exitShellRequest = ExitShellRequest.NORMAL_EXIT;
            }
            shell.waitForComplete();
        }

        ((ConfigurableApplicationContext) ctx).close();
        sw.stop();
        if (shell.isDevelopmentMode()) {
            System.out.println("Total execution time: " + sw.getLastTaskTimeMillis() + "ms");
        }
        return exitShellRequest;
    }
}

So now when I run the Spring Boot application, I get my command shell:

 java -jar <application>.jar 


 

Posted by on October 30, 2015 in java, programming, software engineering

 


UIGrid with a Spring Boot Backend

Frequently, I find myself needing to look up and update values in a SQL Server table we use at work to store client-specific coded values and their descriptions.

Mostly, I do the look-ups while trying to standardize the descriptions across different clients, by assigning a business-specific code and description to logically similar client values.

Tired of looking up and updating these values straight from the database, I decided to create a simple app to cut down on the time spent editing SQL statements.

The Angular data grid UIGrid is no longer in beta, having recently been released. It looked like a good grid to test out. I also decided to use a Spring Boot MVC app for the back end.

The end product will look something like this, allowing filtering on multiple columns and, eventually, updating values in place:

 

Getting the Back end up and running:

I created a new Spring starter project in Eclipse.

I selected the Web and JPA options.

To get the Spring JPA repository to work, I had to do the following:

      1. Enter the database credentials in application.properties:
spring.datasource.url=jdbc:sqlserver://dbserver;database=dbname;integrated security=false
spring.datasource.driver-class-name=com.microsoft.sqlserver.jdbc.SQLServerDriver
spring.datasource.password=passwd
spring.datasource.username=username
spring.jpa.hibernate.naming-strategy=org.hibernate.cfg.DefaultNamingStrategy
      2. Generate the JPA class from the database table – I used the JPA support in Eclipse provided by EclipseLink.

After opening the JPA view in Eclipse, I could create a connection to my DB. Then, by doing Create New -> JPA Entity Class from Table, I was able to get my class with all the annotations.

A few pointers here. If any column in the table being brought in is nullable and has null values, I had to make sure the data type I used was the boxed type (for example Integer, not the primitive int), since primitive types can't hold nulls.
Also, the DefaultNamingStrategy property above was necessary to stop Hibernate from applying its conventions (camelCase to under_scores) when figuring out my table name; otherwise it wouldn't find my table and column names.

      3. Create the JPA repository interface with the signature below – assuming my entity is called Code, mapped from the Code table that holds my coded values:
@Repository
public interface CodeRepository extends CrudRepository <Code, Integer> {

List<Code>; findAll();

}
      4. Create my controller:
@RestController
@RequestMapping("/getCodes")
public class CodeController {

    @Autowired
    private CodeRepository codeRepo;

    @RequestMapping(method = RequestMethod.GET)
    public List<Code> getCodes() {
        List<Code> codes = codeRepo.findAll();
        return codes;
    }

}

      5. Add the @EnableJpaRepositories annotation to the class that runs the main Spring Boot method.

On running the main Spring Boot method, my endpoint was available on port 8080, served by the embedded Tomcat, and the list of my codes was available to a browser at the '/getCodes' path.
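As a quick sanity check of the endpoint before wiring up the grid, something like the sketch below works from any Python prompt (assuming the requests library is installed and the app is running on localhost:8080; the printed field values are whatever your Code entity exposes):

import requests

# fetch the codes the same way the grid will
codes = requests.get('http://localhost:8080/getCodes').json()
print(len(codes), 'codes')
print(codes[0])  # e.g. {'ID': 1, 'Client': ..., 'Code': ..., 'Description': ...}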

Now I had to get UIGrid to request the list of codes and display them.
The HTML page ended up looking as follows:


<!doctype html>
<html ng-app="app">
<head>

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap-theme.min.css">
<link rel="stylesheet" href="http://ui-grid.info/release/ui-grid.css" type="text/css">
<link rel="stylesheet" href="/resources/css/app.css" type="text/css">
</head>
<body>

<div class="container" ng-controller="CodeCtrl">
<div id="gridCode" ui-grid="CodeGridOptions" class="grid" ui-grid-resize-columns ui-grid-pagination ui-grid-auto-resize ui-grid-selection></div>
</div>
<script src="http://ajax.googleapis.com/ajax/libs/angularjs/1.4.3/angular.js"></script>
<script src="http://ajax.googleapis.com/ajax/libs/angularjs/1.4.3/angular-touch.js"></script>
<script src="http://ajax.googleapis.com/ajax/libs/angularjs/1.4.3/angular-animate.js"></script>
<script src="http://ui-grid.info/docs/grunt-scripts/csv.js"></script>
<script src="http://ui-grid.info/docs/grunt-scripts/pdfmake.js"></script>
<script src="http://ui-grid.info/docs/grunt-scripts/vfs_fonts.js"></script>
<script src="http://ui-grid.info/release/ui-grid.js"></script>
<script src="/resources/js/app.js"></script>
</body>
</html>

And the AngularJS JavaScript to provide the module, controller and directives:

var app = angular.module('app', ['ngTouch', 'ui.grid','ui.grid.resizeColumns','ui.grid.pagination',
'ui.grid.autoResize','ui.grid.selection']);

app.controller('CodeCtrl', ['$scope', '$http','$log',function ($scope,$http,$log) {

$http.get('/getCodes').success(function(data){
$scope.CodeGridOptions.data = data;
});

$scope.CodeGridOptions = {
enableFiltering: true,
paginationPageSizes: [25, 50, 75],
paginationPageSize: 25,
enableRowHeaderSelection:false,
multiSelect:false,
columnDefs: [{field:'ID',width:30},
{field:'Client',width:30},
{field:'Group',width:30},
{field:'Category',width:200},
{field:'Code',width:200},
{field:'Description',width:200},
{field:'MappedCode',width:200},
{field:'MappedDescription',width:200},
{field:'VendorID',width:30} ]

}
}]);

 

Healthcare I.T in the big data space: Article 2 – Getting set up, touching some code

This is the 2nd article in this series; you can start with the 1st article if you need some context.

For the infrastructure, I began small.

  1. I launched an Ubuntu EC2 instance
  2. I installed a VNC server on it so I could access the desktop from a VNC client on my Mac – I plan to use this EC2 instance as my development machine
  3. I installed Mirth – I'll need this to generate valid HL7 messages that I will feed into Spring XD's big data pipeline
  4. I installed Eclipse Luna – to develop the Spring Integration modules that will be uploaded into Spring XD

My 1st use case is to create an HL7 feed and ingest it using Spring XD, storing the messages in HDFS.

For the 1st development effort, I plan to create a Mirth channel that reads HL7 from a database and sends it over TCP to a Spring XD custom source module.
The Mirth configuration is shown below.

Source tab of the channel

Destination tab of the channel

For Spring XD to ingest the HL7, I have to create a custom source module, since none of the out-of-the-box modules handle the HL7 MLLP protocol. This module will use a Camel HL7 endpoint to accept messages from Mirth and then place them on a Spring Integration channel.

The plan is to eventually upload this Spring Integration module into Spring XD. 1st things 1st: in Eclipse I created a Spring Boot project. The configuration class / application class is shown below.

This being the 1st iteration, I am writing to a file just to make sure the data is coming in, and sure enough, after I run the code in Eclipse and start my Mirth channel, I have HL7 messages dropping into my folder. Neat!!

Spring Boot / Integration / Camel source code, iteration 1 of the inbound HL7 to a file destination:

package com.wchr.xd;

import java.io.IOException;

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import static org.apache.camel.component.hl7.HL7.ack;
import org.apache.camel.component.hl7.HL7MLLPNettyDecoderFactory;
import org.apache.camel.component.hl7.HL7MLLPNettyEncoderFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@SpringBootApplication
@Configuration
public class Application extends RouteBuilder {

	@Autowired
	CamelContext camelContext; 

	//private static CamelContext cmContext;

    public static void main(String[] args) throws IOException, InterruptedException {
    	 SpringApplication.run(Application.class, args);
    	// cmContext = context.getBean(CamelContext.class);
    	}

    @Bean
    public HL7MLLPNettyDecoderFactory hl7decoder(){
    	return new HL7MLLPNettyDecoderFactory();
    }

    @Bean
    public HL7MLLPNettyEncoderFactory hl7encoder(){
    	return new HL7MLLPNettyEncoderFactory();
    }

	@Override
	public void configure() throws Exception {
		from("netty4:tcp://localhost:9000?sync=true&encoder=#hl7encoder&decoder=#hl7decoder")
		.to("file://inbound?autoCreate=true&fileName=${date:now:yyyyMMddHHmmss}.hl7")
		.transform(ack());
	}

}

Snapshot of the inbound messages from my Eclipse project.

 

Posted by on August 2, 2015 in java, software, spring

 


Healthcare I.T in the big data space: Article 1 – The idea

So I’ve had this idea on my mind for a surprisingly long time, maybe even a couple of years. Finally I’m acting on it.

Over the years I have seen a lot of EMR systems, healthcare integration platforms, tools and applications – both off-the-shelf vendor products and in-house solutions. Most fall into the category I would call legacy (older technologies) or, at best, current proven technologies. Not much out there is bolder about using some of the newer big-data-specific solutions. That's understandable, since in some cases, if the existing solution is not broken, why fix it.

Still, this has left me with an itch that I'm going to try to address here. I want to see a data integration project that aggregates healthcare data (HL7, custom flat files, X12, …) and uses a big data platform to ingest, monitor, transform, store and run some analytics – the whole 9 yards.

If I'm successful, in my next few posts I will set up a simple project to do so, using simple use cases at each of the integration steps I have mentioned above.

Down to some specifics:

I plan to use the following technologies (if I could get away with it, I'd use Spring XD solely):

  1. Source of some test data – Mirth for HL7, Spring XD for other generic data sources, others as my use cases advance
  2. Ingestion of data – Camel for HL7, Spring XD
  3. Monitoring – Spring XD counters and gauges
  4. Transformation – Spring XD processors, sources and sinks
  5. Persistence – HDFS using Spring XD
  6. Analytics – not sure yet; one of the solutions supported by Spring XD

My infrastructure consists of Amazon Web Services EC2 instances (Ubuntu) that I will spin up as required.

In the next post, I will have a short write-up about how I got set up with AWS EC2, installed Mirth and got started with Eclipse.

 

Posted by on August 2, 2015 in programming, software

 


Informatica Java Transform handling of Dates

I love working with the Java transform in Informatica because it gives me all the freedom in the world, and it makes my mappings compact: rather than having tens of dragged-and-dropped components, I can usually achieve a lot with only one Java transform.

One thing confounded me recently: from inside a Java transform, I could not assign a date value to an output port I had defined as a date/time port.
So the code below could not work:
date_port_out = new Date();

I tried a bunch of alternatives – for example, using SimpleDateFormat to produce the format that is the default in Informatica – but nothing was working.
Then I decided to look at the generated code in the Java transform.


I couldn't believe it: when I create a date/time port, the Java transform creates a long out of it. How it does this I could only guess, but I sure know how to convert a Java Date into the milliseconds since the epoch (Jan 1st, 1970):

long millis_since_epoch = new Date().getTime();

Then I assigned the long value to the output port – and it worked!!

date_port_out = new Date().getTime();

It's also important not to forget to import the Date class on the Imports tab:

import java.util.Date;
 

Posted by on July 29, 2015 in programming, software

 


Informatica Java Transform For Data Validation

I have been working in Informatica for a while now, and every time I need to do something that's out of the ordinary filter, sort, query or aggregate, I usually resort to the Swiss Army knife that is the Java transform.

One of the main abilities I get from the Java transform is creating new rows of output from a single row of input. Why would I want to do this? Well, one common use case I have seen examples of out there is unpivoting your data – transposing columns into rows.

So if you have data like

Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff Mary
10 20

And you need it to look more like

Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff 10
Mary 20

All nice and dandy. But I had another use case that I think is more interesting.

I have a set of data I need to validate for errors, and for each row, there are several errors that could occur. So one row could have 10 errors spread out in different columns.

I needed a process that goes through my set of data and creates another set of data holding the collection of validation errors.

So, for example, I have the following row and 2 errors I need to create from it: names should be less than 20 characters, and the number (let's assume it's the age) should be greater than 18.

Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff 10

And I need the outputs

The name “Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff” is longer than 20 characters
The age 10 is less than 18

The way to achieve this was to use an active Java transform (emphasis on 'active') in which I have logic like the code below:

if (name_port_in.length() > 20) {
    error_port_out = "The name " + name_port_in + " is longer than 20 characters";
    generateRow(); // does the magic of adding additional output rows from one input row
}

if (age_port_in < 18) {
    error_port_out = "The age " + age_port_in + " is less than 18";
    generateRow(); // again
}
 

Posted by on July 29, 2015 in software

 


Entity framework 3.5 cryptic error: “Two entities with different keys are mapped to the same row…”

Just putting this out there – it might save someone some pain and time. I got the error after adding an association (one-to-many).

On comparing this association with another one that did not have an error, and on reading this post (http://social.msdn.microsoft.com/Forums/en-US/c1e1d43a-a09f-4692-9372-3133f04d3eeb/error-3034-two-entities-with-different-keys-are-mapped-to-the-same-row), it was clear that the entity designer did not add the all-important line:

<Condition ColumnName="" IsNull="false" />

See the entire AssociationSetMapping below. I deleted the column names, but in the snippet above the column name is the actual DB columnName (the 'store' model in EF, as opposed to the 'conceptual' model) of the primary key on the one side of the association.
<AssociationSetMapping Name="" TypeName="" StoreEntitySet="">
  <EndProperty Name=""><ScalarProperty Name="" ColumnName="" /></EndProperty>
  <EndProperty Name=""><ScalarProperty Name="" ColumnName="" /></EndProperty>
  <Condition ColumnName="" IsNull="false" />
</AssociationSetMapping>

 

 

Posted by on September 11, 2013 in software

 


c# Dictionary with lambda functions

I recently needed a quick way to let a user select a value from a combo box (WinForms)… let me back up a bit and explain why I'm using an outdated UI technology. Isn't everyone using the new kids on the block – JavaScript, jQuery, AngularJS, Bootstrap for CSS and so on? Well, I am limited by the environment I'm developing the app in – a slow-moving enterprise one.

So back to the topic at hand. A user selects a value from a combo box, and the application selects the corresponding presenter (read “controller” for MVC fans) that will manage the view.

To make my app a bit sexier, and to offset the disappointment of having to use WinForms, I decided to use lambda functions in a dictionary to do the trick.

I have a SetupUI method that sets up the event handler for the combo box (among other things):

    class MainPresenter
    {
      private IPresenterFactory presenterFactory;
      private IPresenter presenter;
      //....
      void SetupUI()
        {
            cbEntity.TextChanged += cbEntity_TextChange;
        }
        void cbEntity_TextChange(object sender, EventArgs e)
        {
            try
            {
                this.presenter = presenterFactory.getPresenter(cbEntity.Text);
                //.....
            }
            catch
            {
                //....
            }
        }
     }

cbEntity is the combo box.

The presenter and presenterFactory are injected into the MainPresenter elsewhere (in the main Program.cs) through the constructor – not shown.
Also, ItemPresenter, FormPresenter, etc. all implement the IPresenter interface.
Then the succinct dictionary of lambda functions in the PresenterFactory class:

class PresenterFactory : IPresenterFactory
{
  private Dictionary<string, Func<IPresenter>> presenters
    = new Dictionary<string, Func<IPresenter>>
    {
      {"Items", () => new ItemPresenter(new ItemRepository())},
      {"Forms", () => new FormPresenter(new FormRepository())},
      {"Synonyms", () => new SynonymPresenter(new SynonymRepository())}
    };

  public IPresenter getPresenter(string entityName)
  {
    return this.presenters[entityName]();
  }
}

So if the user selects 'Items' in the combo box, the dictionary lookup for the key 'Items' returns a value that is a lambda function.

Then calling the lambda function with

this.presenters[entityName]();

returns the instantiated ItemPresenter (with the dependency ItemRepository having been injected). The same applies for the other selections (Forms and Synonyms).

 

Posted by on September 3, 2013 in programming, software

 
