Wednesday, November 20, 2013

Graph database (Neo4j) as a datastore for MDM

Background

I am currently part of a Professional Services team; we provide services related to architecting, solutioning, developing, enhancing, and implementing Master Data Management (MDM) solutions. We specialize heavily in IBM Master Data Management.

Across the various implementations I have been engaged on, I have had a fair amount of exposure to using MDM to store customer data.

If you have been exposed to MDM (regardless of vendor [IBM, Oracle, Golden Source] or domain [Customer, Product, Security]), you will have noticed that MDM stores a lot of relationships (directed and bidirectional), along with their properties, between entities (Customer, Email, Phone, etc.). All the MDM products I have been exposed to use an RDBMS (DB2, Oracle, MySQL, etc.) as the datastore. After playing around with Neo4j and realizing the nature of a graph database as a datastore, I wanted to explore how we could use Neo4j for an MDM solution.

As a trial, I want to model MDM customer data for a retail enterprise, "CrossRoads Inc.". In the following sections I will demonstrate:
 - Customer data
 - Relationships between the various source systems and customer data
 - Relationships between a customer and its entities (Email, Phone, etc.)
 - Preferences that a customer sets up
 - Merged nodes

I am using only Cypher queries to demonstrate the above.

NOTE: This is just a POC, so I have highlighted only the basic scenarios. A typical implementation would involve more complex scenarios, based on enterprise needs.

POC

The purpose is just to demo, hence I am not going to walk through the basics of graphs, Neo4j, or Cypher queries. There are great demos, tutorials, and documentation available on the Neo4j website.

Scenario : Initial setup
We start off by creating the basic node that represents the enterprise "CrossRoads Inc."

CREATE (root:ROOT{company : "CrossRoads Inc.", implementor : "y"}) 
RETURN root;





Next we declare the nodes that identify the various source systems from which records will be fed into MDM. CrossRoads Inc. operates retail stores (identified by source "STORES") and also an online e-commerce website (identified by source "ONLINE"), via which customers can buy products.

MATCH (root:ROOT) 
CREATE UNIQUE 
root -[:HAS_SYSTEM]-> (s1:SOURCE {name:'ONLINE'}),
root -[:HAS_SYSTEM]-> (s2:SOURCE {name:'STORES'})
RETURN s1,s2;



Scenario : Creating a customer

The source system STORES has a customer record, which has an id of "STO001000".

Typically there are a lot of requests to search for a customer based on a source id. Since relationships and their properties are not yet indexable (not counting the legacy indexing of Neo4j), I have opted to treat the source id as an entity by itself.

Customer entities are stored as graph nodes. Using the new "Label" feature of Neo4j, we can create the "Customer" node.

A unique enterprise customer id (property "pguid") is generated and stored as part of the customer node. All the demographic information (e.g., gender, ethnicity) can be stored as properties of the customer node.

MATCH (src:SOURCE) 
WHERE src.name='STORES' 
CREATE UNIQUE 
src -[:HAS_PARTY {active: 'Y',last_update_date:'02-01-2013'}]-> (pid:SOURCE_PARTY_ID {source:'STORES',systemid:'STO001000', since_date:'01-01-2013', active: 'Y'}),
pid -[:IDENTIFIES_PARTY {current:'Y'}]-> (p1:PARTY {pguid:"P1376",gender: 'M'})
RETURN p1;




  Node 3 => Source node ("STORES")
  Node 4 => Source Id node
  Node 5 => Customer (party) node

Since there will be a lot of searches, I have opted to create the following indexes:

CREATE INDEX ON :SOURCE_PARTY_ID(systemid);

CREATE INDEX ON :PARTY(pguid);

Now we can issue queries like the following

To search for a party with id "STO001000":

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p:PARTY) 
WHERE pid.systemid="STO001000" AND pid.source='STORES' AND r.current='Y' 
RETURN p;

To search for a party by its "pguid":

MATCH (p:PARTY)
WHERE  p.pguid="P1376"  

RETURN p;

Scenario : Creating history on sub nodes (name)

In the next steps, let's look at how sub-entities (e.g., names) can be created and how historical information can be maintained.

When you walk into a store and the cashier asks your name, most customers give a partial name ("J Doe"), a full name ("John Doe"), or even a fake/avatar name ("Mr. Jesse Pinkman"; yes, Breaking Bad fan here).

The retailer's source system stores these name variations for a single customer record. If the customer changes their name (by calling up a CSR or something), it maintains history as to when the record was changed, the new name, etc.

Hence the name is not stored as a property of the "Customer" entity, but rather as a separate sub-entity. The "name" node cannot exist without a customer, and hence has a relationship with the customer node. The relationship identifies the name type ("Professional", "Online avatar", etc.).

The following Cypher query demonstrates how to create a name node and attach it to the customer node.

The customer would like to "Be Called" as "Louie C.K"

MATCH (p:PARTY)
WHERE  p.pguid="P1376"
CREATE UNIQUE 
p -[:HAS_NAME { nameType : "BE_CALLED", current: 'Y', source : 'STORES', sinceDate : '01-02-2013'}]-> (nm:PARTY_NAME {firstname:"C.K", lastname:"Louie"})

RETURN p,nm;

I have opted to store "source lineage" information in the relationship rather than in the name node itself. By "source lineage" information I mean properties like which source provided this information and on which date it became available. The "current" property is also an important one to note: it identifies whether the relationship is active and the most recent one.




Node 5 => Party / Customer node
Node 6 => Name node

On a different day, the customer came back to the store and decided to buy something. Now he decides to "Be Called" "Charlie Sheen". In this case, the following single Cypher statement replaces the active "Be Called" name "Louie C.K" with "Charlie Sheen":

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p:PARTY) -[hn:HAS_NAME]-> (nm:PARTY_NAME) 
WHERE pid.systemid="STO001000" AND pid.source='STORES' AND r.current='Y' AND hn.nameType = 'BE_CALLED' AND hn.current = 'Y' AND hn.source = 'STORES'
SET hn.current = 'N'
WITH p,nm
CREATE p -[:HAS_NAME { nameType : "BE_CALLED", current: 'Y',  source : 'STORES', sinceDate : '02-02-2013'}]-> (nm2:PARTY_NAME {firstname:"Charlie", lastname:"Sheen"})
,nm2 -[:PREVIOUS]-> nm

RETURN p,nm,nm2;





Node 7 => name node "Charlie Sheen"
Node 6 => name node "Louie C.K"
Node 5 => Party node

The Cypher statement sets the current status of the "Louie C.K" relationship to "N" and creates the "Charlie Sheen" relationship with current status "Y". Also, for faster traversal, it creates a "PREVIOUS" relationship between "Charlie Sheen" and "Louie C.K".

Scenario : Maintaining preferences on sub nodes (email)

Customers typically have a lot of preferences about their various interactions with a typical retail store. For example, when I walk into Best Buy I prefer to shop only for TVs, laptops/tablets, speakers, and games; I don't want to go to Best Buy to look for vacuum cleaners, refrigerators, and such. As another example, I would like to receive deals and coupons on electronic items (via email), but do not want my email information shared with third parties for marketing.

This kind of information is what is called preferences. Preferences are related to various entities (email, phone, address, etc.); a customer maintains some kind of relationship with an entity (e.g., email). For example, this customer ("STO001000") has an email address ("Jessie.Pinkman@breakingbad.tv") that he uses for personal ("Home") purposes.

MATCH (p:PARTY)
WHERE p.pguid="P1376"
CREATE UNIQUE 
p -[:HAS_EMAIL { usageType : "HOME", current: 'Y', source : 'STORES', sinceDate : '02-02-2013'}]-> (em:EMAIL {emailAddress:"Jessie.Pinkman@breakingbad.tv"})

RETURN p,em;

Preference-related information can now be stored as part of the "HAS_EMAIL" relationship that exists between the customer node and the email node.



 Node 8 => email node
 Node 5 => customer node

In the above example, the customer does not want to subscribe to deals (r.subscribeForDeals='N') or have the retail store send his email to its subsidiary (r.sendEmailToSubsidary='N'), but would like to receive coupons (r.receiveCoupons='Y').
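The earlier query did not actually set these preference properties; a minimal sketch that adds them onto the existing relationship (property names taken from the text above) could look like:

MATCH (p:PARTY) -[r:HAS_EMAIL]-> (em:EMAIL)
WHERE p.pguid="P1376" AND r.current='Y'
SET r.subscribeForDeals='N', r.sendEmailToSubsidary='N', r.receiveCoupons='Y'
RETURN r;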


Scenario : Additional entities and additional party

Before going on to the next scenario, "merged parties", I wanted to create more sub-entities.

Let's say the customer bought some valuable items online, and as part of buying those items he decided to provide realistic information like phone, address, etc.


The following Cypher statements create:

  • an online customer with id "ON00891" and pguid "P2498",
  • who has an "avatar" name of "Jessie Pinkmen",
  • uses the email address "Jessie.Pinkman@breakingbad.tv" for his home/personal use,
  • has the phone number "7329876534",
  • and has provided billing & shipping addresses from when he bought some items online.
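Note: the phone and billing-address link queries below MATCH on existing PHONE and ADDRESS nodes, whose creation is not shown in this post (the merged picture later suggests they were attached to the STORES customer in a step not shown here). As a hedged sketch, they could be created against party "P1376" first, so the link queries below have something to match:

MATCH (p:PARTY)
WHERE p.pguid="P1376"
CREATE UNIQUE 
p -[:HAS_PHONE { usageType : "PERSONNEL", phoneType : "SmartPhone", current: 'Y', source : 'STORES', sinceDate : '01-02-2013'}]-> (ph:PHONE {number:"7329876534"}),
p -[:HAS_ADDRESS { usageType : "BILLING", current: 'Y', source : 'STORES', sinceDate : '01-02-2013'}]-> (addr:ADDRESS {addrLine1:"1320 Hollow Way Dr", city:"Matawan", state:"NJ", Country:"US"})
RETURN p,ph,addr;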

#Create another party p2, fed from ONLINE with an id of "ON00891"; the party has a home email "Jessie.Pinkman@breakingbad.tv"
MATCH (src:SOURCE) 
WHERE src.name='ONLINE' 
CREATE UNIQUE 
src -[:HAS_PARTY {active: 'Y',last_update_date:'02-01-2013'}]-> (pid:SOURCE_PARTY_ID{source:'ONLINE',systemid:'ON00891', since_date:'01-01-2013', active: 'Y'}),
pid -[:IDENTIFIES_PARTY {current:'Y'}]-> (p2:PARTY {name:"P2498", pguid:"P2498"})
RETURN p2;

#Add a name node of usage type "AVATAR" to party "P2498"
MATCH (p:PARTY)
WHERE p.pguid="P2498"
CREATE UNIQUE 
p -[:HAS_NAME { nameType : "AVATAR", current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> (nm:PARTY_NAME {firstname:"Jessie", lastname:"Pinkmen"})
RETURN p,nm;

#Create link for email, phone and address
MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY),(em:EMAIL) 
WHERE pid.systemid="ON00891" AND r.current='Y'
AND em.emailAddress="Jessie.Pinkman@breakingbad.tv" 
CREATE UNIQUE 
p2 -[:HAS_EMAIL { usageType : "HOME", current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> em
RETURN p2,em;

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY),(ph:PHONE) 
WHERE pid.systemid="ON00891" AND r.current='Y'
AND ph.number="7329876534" 
CREATE UNIQUE 
p2 -[:HAS_PHONE { usageType : "PERSONNEL", phoneType : "SmartPhone", current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> ph
RETURN p2,ph;

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY),(addr:ADDRESS) 
WHERE pid.systemid="ON00891" AND r.current='Y'
AND addr.addrLine1="1320 Hollow Way Dr" AND addr.city="Matawan" AND addr.state="NJ" AND addr.Country="US"
CREATE UNIQUE 
p2 -[:HAS_ADDRESS { usageType : "BILLING", current: 'Y',  current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> addr
RETURN p2,addr;

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY) 
WHERE pid.systemid="ON00891" AND r.current='Y'
CREATE UNIQUE 
p2 -[:HAS_ADDRESS { usageType : "SHIPPING", current: 'Y',  current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> (addr:ADDRESS {addrLine1:"1564 SleepyHollow Ave", city:"Matawan", state:"NJ", Country:"US"})

RETURN p2,addr;

The following query is a simple way to retrieve all the nodes and relationships attached to a customer node. In other words, by executing this query we get a complete picture of the customer's information.

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY)-[*]->(n)
WHERE pid.systemid="ON00891" AND r.current='Y'

RETURN p2,n;





Scenario : Merging party nodes

In our small database we now have a customer who has done some shopping in a local retail store and has also bought some items via the online store. Here is a picture of both customers:

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY)-[*]->(n)
WHERE r.current='Y'

RETURN p2,n;




From the picture above we find that both customers share the same entities: email (node 8), phone (node 11), and address (node 12).

In an MDM solution, the matching engine component/sub-system helps find duplicate customers and merges the information between the customers into one single entity. Let's suppose that the above customer nodes are the same person and we have elected to merge them. Now we look at how this can be done. As part of merging, the old nodes should be maintained for history and origin determination.
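In Neo4j, the suspect-duplicate search itself reduces to a simple graph pattern. As a hedged sketch (matching on a shared email node only; a real matching engine would weigh many attributes):

MATCH (p1:PARTY) -[:HAS_EMAIL]-> (em:EMAIL) <-[:HAS_EMAIL]- (p2:PARTY)
WHERE p1.pguid < p2.pguid
RETURN p1.pguid, p2.pguid, em.emailAddress;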

Neo4j does not offer a built-in capability to merge nodes, but this can be achieved as follows. First we start off by creating a new customer node that represents the merging of the customer nodes from stores and online. This can be achieved via a single Cypher statement:

MATCH (pid1:SOURCE_PARTY_ID) -[r1:IDENTIFIES_PARTY]-> (p1:PARTY)
WHERE pid1.systemid="STO001000" AND r1.current='Y' 
WITH p1
MATCH (pid2:SOURCE_PARTY_ID) -[r2:IDENTIFIES_PARTY]-> (p2:PARTY) 
WHERE pid2.systemid="ON00891" AND r2.current='Y'
CREATE  p1 -[:MERGED_TO { partyid1:"P1376", partyid2:"P2498", merge_date:'03-01-2013'}]-> (p3:PARTY {pguid:"P5836"})
        ,p2 -[:MERGED_TO { partyid1:"P1376", partyid2:"P2498", merge_date:'03-01-2013'}]-> (p3)

RETURN p1,p2,p3;

What the above does is create a new customer/party node (pguid "P5836") and create a "MERGED_TO" relationship to it from both the stores customer ("STO001000") and the online customer ("ON00891").



  Node 15=> new customer (merged) node
  Node 5 => stores customer node
  Node 10 => online customer node.

At this time, Neo4j does not have a way to copy relationships from one node to another. The "SET" operator can replace/copy all properties between nodes, but it still does not copy the relationships. The only way to do this is to manually create the relationships, or to do it via programmatic code (e.g., via the Java driver classes).

In this POC I have been keeping to Cypher statements only, and hence did not go the route of creating these relationships.
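That said, the relationships can be re-created onto the merged node one relationship type at a time in Cypher. A minimal sketch for HAS_EMAIL (the same pattern would be repeated for HAS_NAME, HAS_PHONE, and HAS_ADDRESS) might look like:

MATCH (p3:PARTY {pguid:"P5836"}), (p:PARTY) -[he:HAS_EMAIL]-> (em:EMAIL)
WHERE p.pguid IN ["P1376","P2498"]
CREATE p3 -[:HAS_EMAIL { usageType : he.usageType, current : he.current, source : he.source, sinceDate : he.sinceDate }]-> em
RETURN p3,em;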

Conclusion
Though I haven't covered all the scenarios, I believe the demonstration gives some overview of how the data could be modeled in Neo4j.

I find that querying and getting results is very easy; there is no need to worry about joins, or joins across multiple tables.

Retrieval is also faster, as I can now get all the party information in literally one simple call.

I wanted to test invocation via Spring Data, but I believe support for this beta version of Neo4j is still in development, and hence I did not take time to test it out.

I definitely feel that this is a modern datastore, and we will probably be seeing it in future MDM-based projects.

For now, a polyglot implementation can be developed with a COTS MDM product and Neo4j, where Neo4j stores the relationships. This avoids multiple joins and gives faster search and retrieval of party information.


Thursday, January 24, 2013

Scraping away at the JigSaw

Introduction

For a concept I am working on, I needed some realistic sample people data. The data fields are name, address, email, phone, occupation, etc. I know that this data could easily be fabricated; I tried that and got results more like this:

Jules abdoullah Desai,
123 Farber Lakes Dr,
Monroeville, Texas

That address is not convincing, and that person may not exist. I wanted to get the sample data from a source such that there is some distribution of people across locations (city, state, etc.) and the addresses are real and valid (or at least close enough). The following example looks far more convincing to me:

Ibrahim Khan
1 Lilly Corporate Ctr
Indianapolis, IN 46285-0001
United States
This blog post details the data source that provided me with this type of data and how I built a program to extract information from it. By the end of this post, you will have some idea of:
  • The data source
  • The method used to get the data
  • Web scraping
  • The programs and libraries that helped in extracting the information

Data source

I searched for a data source (website) which could give me such a dataset. The first stop was www.whitepages.com. You can get hold of a list of people by searching on a last name. The search produced results as below:


  
As you can see, the distribution of people by locality is there, but there are too many duplicates, and the addresses are not complete. Hence I was not satisfied with this.

On further searching and digging I finally came to www.jigsaw.com. Jigsaw's original intention is to host a repository of business lead contacts and provide a medium where people can exchange this information. My intention was different, but it proved to be ideal. To start off, a simple search on the name "Ibrahim Khan" returns the following result:




From the search result I am able to get the company, title, and name. When I click on the name itself, I get a little more detail, as below:


Though the actual phone & email address are hidden (you need to pay for them), the other information, like name, job title, organization, and address, seems realistic and valid. Hence this is a perfect data source for the data I need. I can live with auto-generating the email address (like firstname.lastname@somerandom.org) and phone number for these contacts.

NOTE: Jigsaw had a REST-based service to retrieve contact information, but at the time of this experimentation it was closed, and I was not able to get an account to access the service.

Strategy for data collection

Ideally it would be great if we had a list of names in a database and could query for each name. Unfortunately, Jigsaw does not provide this. A workaround is to search on a last name (e.g., "carpenter") and, from the list of names (each of which has a URL) in the search result, iterate over each one and retrieve the necessary information (name, title, organization, address, etc.). The only downside is that you have to come up with a list of common last names to use for the search.

You can get the list of common last names from the US census, like the list from the 2000 census, Genealogy Data: Frequently Occurring Surnames. The amount of data collected from each last-name search should be good enough, at least for my case.

For the data collection process I am going to make two passes at jigsaw.com. As above, based on a list of last names:
  1. First pass: for each last name, do a search and store the resulting links for each contact.
  2. Second pass: for each URL (obtained from the above), do an HTTP request and extract the required set of data.
Also note that your program must log into Jigsaw and obtain an HTTPS session in order to retrieve all the data from the search results.

Tool

Having found the ideal data source, the next step is to find a way to extract the data from it. A well-known, popular methodology is to use a web crawler/site scraper.

Since I first learned about this methodology, it has come a long way. Now you can easily create site scrapers with scraping software (e.g., automationanywhere) or employ a SaaS (e.g., grepsr or mozenda) which provides sophisticated extraction, scheduled scraping at intervals, etc.

I could simply have used the above-mentioned options (it would have saved a lot of time), but I wanted to get my hands a little dirty and understand what is involved in scraping and how it can be achieved. Hence I chose to program a little. I did not want to start from scratch, though; I wanted to use existing, well-known libraries that provided:
  • website crawling
  • a framework for crawling
  • discovering links and crawling further
  • the ability to program in a language like Python, Java, etc.
  • handling of websites with HTTPS sessions
There are many frameworks and libraries that can achieve the above; for my exploration I chose Scrapy, as it had more references in forums (like StackOverflow) and Google search results. I also liked that Scrapy can store its output as CSV, JSON, or XML, all based on command-line arguments.

In order to scrape a website, you have to understand the web page's HTML source: inspect it and get to know the locations/XPaths of the data of interest. I recommend using the developer tools of Firefox or Chrome for this; I will not be explaining that process here.

Scrapy and dynamic rendered web pages


Before jumping right into the code, I have to explain how to handle JavaScript-rendered web pages. I am covering this first because I learned the hard way how these pages get rendered and how to use Scrapy on such dynamic content.

The reason for this section is that the search results and many other pages on Jigsaw are dynamically rendered by JavaScript. Unless the page is rendered, you will not get the list of contact names, URLs, etc.
Crawling with Scrapy is simple as long as the web pages are pure, pre-rendered HTML. However, if the page content/sections/links you are interested in are rendered by JavaScript, it becomes difficult. The hard way is for your scraper to look at the whole HTML content and determine how the data can be extracted. Another way is to use a web-page JavaScript rendering component/library that first reads the HTML and gives back rendered output; again, this is hard.
Upon exploration I came across a workaround: Selenium. The Selenium WebDriver interacts with a browser (of your choice) and loads the URL. Once the page has been rendered by the browser, we can use the WebDriver classes to get at the rendered page. The following logic, using the Selenium WebDriver component, loads a URL and extracts data (a sketch follows the list):
  1. Open the URL using the Selenium WebDriver.
  2. Selenium opens a browser instance (Firefox in my case) and loads the URL.
  3. Once the URL is loaded, we wait a configurable amount of time for the browser to render the page.
  4. Once rendered, using classes from the Selenium WebDriver, we look for the content (sections, links, etc.) that is of interest to us and proceed further.
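To make this concrete, here is a minimal Python sketch of that logic. The URL and XPath are hypothetical placeholders, a fixed sleep stands in for the configurable wait, and the old-style find_elements_by_xpath call assumes the Selenium bindings of that era:

import time
from selenium import webdriver

driver = webdriver.Firefox()               # step 2: opens a Firefox instance
driver.get("https://www.jigsaw.com/")      # step 1: load the URL (placeholder)
time.sleep(5)                              # step 3: crude wait for JavaScript to render

# Step 4: query the rendered DOM for the content of interest.
# The XPath is a hypothetical placeholder; inspect the real page for the right one.
for link in driver.find_elements_by_xpath("//a[contains(@href, 'contact')]"):
    print(link.get_attribute("href"))

driver.quit()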

Code

The following sections will delve into how I used Scrapy to crawl, scrape, and extract the data. A Scrapy scraper has to be developed in Python, and I am hoping that you know Python. As of today I am not yet an expert in Python, but I know enough to develop simple solutions.

The code was developed in a Windows-based environment. If you want to use this code, I leave it to you to research how to install Python on Windows, along with libraries like libxml, Scrapy, Selenium, etc.

The code is checked into a GitHub project (will be updating this shortly): Jigsaw Scrapper.git

Extracted data

As mentioned in the strategy, the code makes two passes at Jigsaw in order to extract the data of interest. In the first pass, we will extract the following:
  • Contact Name
  • URL for this contact
These will be stored in a text file ("ContactsToURL.csv"). In the second pass, we retrieve the details of each contact and store them in a text file. The details that will be extracted are:
  • Contact Name (First, Middle & Last)
  • Contact Id
  • Title
  • Company or Organization
  • Address
  • Phone (if present)
  • Added by
  • Last updated.
The definition of the above data structure is in the "Items.py" file, as required by Scrapy.
The output format is set on the Scrapy command line. For the first pass the data was stored in CSV format, and for the second pass I stored the output in JSON format.

Code logic

First pass : Searching and capturing the contact URLs

As mentioned earlier, in the first pass the scraper does a search on a last name, iterates over the result set, and stores the contact names and contact URLs.

This is done by the class in "SearchAndRetrieveContactsURL.py". The class extends the CrawlSpider class of Scrapy, which is meant for crawling.

The list of last names is initialized in an array (though it could be read from a file as well). start_urls is empty, as we are going to override the default loading procedure.

We override the method "start_requests" to handle the logon process. We also instantiate the Selenium WebDriver and load the logon URLs. This ensures the Selenium-instantiated browser is loaded and logged in.

If the login is successful, the "after_login" method is called. This then checks that the login was indeed a success and redirects to the search contacts page.

Once on the search page, a search is done using a last name (initialized at the class level), and once we get the search result we parse it. This is done by the method "getLastNameSearchURL".

The "parse_searchResult" then iterates over the result and captures the contact name and URL for the contact and if there is next_page link; then it crawls to the next page in a recursive fashion. The parsing of the search result page is done from the webpage that gets rendered in the browser (that was instantiated by Selenium).


On finishing one last name, the next last name is retrieved, and the search-and-fetch process continues until all the names have been completed. A condensed sketch of the shape of this spider follows.
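This is a hedged, hypothetical sketch, not the actual code from the GitHub project: the URLs, form handling, and XPaths are placeholders, it uses the modern scrapy.Spider base class in place of the post's CrawlSpider, and yielding plain dicts instead of Items also assumes a newer Scrapy:

import time
import scrapy
from selenium import webdriver

class SearchAndRetrieveContactsURL(scrapy.Spider):
    name = "contact_url_spider"
    last_names = ["carpenter", "smith"]   # seed list; could be read from a file instead
    start_urls = []                       # empty: we override the default loading

    def start_requests(self):
        # Log in once through a Selenium-driven browser so that the
        # JavaScript-built search pages can be rendered. URL is a placeholder.
        self.driver = webdriver.Firefox()
        self.driver.get("https://www.jigsaw.com/login")
        # ... fill in and submit the login form via self.driver here ...
        for last_name in self.last_names:
            url = "https://www.jigsaw.com/search?lastname=" + last_name  # placeholder
            yield scrapy.Request(url, callback=self.parse_searchResult, dont_filter=True)

    def parse_searchResult(self, response):
        # Re-load the page in the logged-in browser and parse the rendered DOM.
        self.driver.get(response.url)
        time.sleep(5)                     # wait for JavaScript to finish rendering
        for a in self.driver.find_elements_by_xpath("//a[@class='contact-link']"):
            yield {"name": a.text, "url": a.get_attribute("href")}
        # Follow the next-page link recursively, if present (XPath is a placeholder).
        next_page = self.driver.find_elements_by_xpath("//a[@class='next-page']")
        if next_page:
            yield scrapy.Request(next_page[0].get_attribute("href"),
                                 callback=self.parse_searchResult, dont_filter=True)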

Second pass : Fetching contact data

Given a contact URL, I found that it is not mandatory to log on to Jigsaw: a simple URL load gets you the page, fully rendered, so there is no need to worry about dynamic rendering.

This is done by the scraper class in "RetrieveContactDetails.py". The scraper loads the text file containing the contact URLs and, for each URL, does an HTTP GET, extracts the data content, and goes on to the next one.

The class extends the "BaseSpider" class of Scrapy, as crawling is not needed. A minimal sketch of this pass follows.
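Again a hedged sketch rather than the actual code: the CSV layout and XPaths are assumptions, and it uses scrapy.Spider (which replaced BaseSpider in newer Scrapy versions) and plain dicts instead of the Items.py fields described above:

import csv
import scrapy

class RetrieveContactDetails(scrapy.Spider):
    name = "contact_detail_spider"

    def start_requests(self):
        # Read the URLs captured in the first pass; assumes "name,url" columns.
        with open("ContactsToURL.csv") as f:
            for name, url in csv.reader(f):
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # These pages render without JavaScript, so plain Scrapy selectors suffice.
        # The XPaths are hypothetical placeholders; inspect the page for the real ones.
        yield {
            "name": response.xpath("//h1/text()").extract_first(),
            "title": response.xpath("//span[@class='title']/text()").extract_first(),
            "company": response.xpath("//span[@class='company']/text()").extract_first(),
            "address": response.xpath("//div[@class='address']/text()").extract_first(),
        }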

Retrospective

As mentioned earlier, there are much simpler solutions for website scraping using packaged software and SaaS providers. If it's a one-time setup, I would get the software; if scraping needs to be done on a regular basis, I would recommend exploring a SaaS-provider-based solution, as it relieves you of the cost of hardware, skill sets, etc.

There are also various books on site scraping for programmers to start off with, such as Webbots, Spiders, and Screen Scrapers.

Before employing any of the above, it is best to understand what you want to scrape, any copyright issues, the authentication mechanism, etc.

Last words


Till next time, "Vanakkam"