May 5, 2013

Simple metastore creation for Hive in MySQL

For Hive, the meta-store is like the system catalog which contains metadata about the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The database is a namespace for tables, where ‘default’ is used for tables with no user supplied database name. The metadata for table contains list of columns and their types, owner, storage and SerDe information (which I can detail in future posts). It can also contain any user supplied key and value data; which can be used for table statistics. Storage information includes location of the table’s data in the underlying file system, data formats and bucketing information. SerDe (which controls how Hive serializes/deserializes the data in a row) metadata includes the implementation class of serializer and deserializer methods and any supporting information required by that implementation. The partitions can have its own columns and SerDe and storage information which can be used in the future to evolve Hive schema.The metastore uses either a traditional relational database (like MySQL, Oracle) or file system and not HDFS since it is optimized for sequential scans only),thus the fired HiveQL statements are executed slow which only access metadata objects.

its simple to install the metastore.

-install mysql-conector
$ sudo yum install mysql-connector-java
-create a symbolic link in the Hive directory
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysqlconnector-java.jar

-create the database for the Hive metastore.cdh4 ships with scripts for derby,mysql,oracle and postgre
$ mysql -u root -p
mysql> CREATE DATABASE hivemetastoredb;
mysql> USE hivemetastoredb;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema- 0.9.0.mysql.sql;

-create a user for the metastore
mysql>CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';

-grant access for all hosts in the network
mysql> GRANT ALL PRIVILEGES ON hivemetastoredb.* TO hive@'%' WITH GRANT OPTION;

following entries in the file /etc/hive/conf/hive-sites.xml, if you are trying a jdbc connection

Dec 19, 2012

Data and Brain


Came across an interesting presentation on Using Data to Understand Brain.


Is it possible to read your brain? hmmm

I am a little two-faced with these riddles....

Dec 18, 2012

Dec 17, 2012

Unicode features in various languages

Here’s what each language natively supports in its standard distribution.

Unicode Javascript ᴘʜᴘ Go  Ruby  Python Java  Perl
Internally UCS‐2 or
UTF‐8⁻ UTF‐8 varies UCS‐2 or
UTF‐16 UTF‐8⁺
Casefolding none simple simple full none simple full
Casemapping simple simple simple full simple full full
Normalization ─⁺
UCA Collation ✔⁺
Named Characters
Properties two (non‐regex)⁻ three (non‐regex)⁻ two⁺ every⁺

from Tom Christiansen Unicode Support Shootout: The Good, the Bad, the Mostly Ugly

Grapheme -  A grapheme is the smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages.

Casefolding - Unicode defines case folding through the three case-mapping properties of each character: uppercase, lowercase and titlecase. These properties relate all characters in scripts with differing cases to the other case variants of the character.

Case mapping - is used to handle the mapping of upper-case, lower-case, and title case characters for a given language.

What is the difference between case mapping and case folding? Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.

Normalization - courtesy

Normalization - Unicode has encoded many entities that are really variants of existing nominal characters. The visual representations of these characters are typically a subset of the possible visual representations of the nominal character. more -

UCA Collation - Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds.The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode. more

Named Characters - Unicode characters are assigned a unique Name (na). The name, in English, is composed of A-Z capitals, 0-9 digits, - (hyphen-minus) and .The Unicode Standard specifies notational conventions for referring to sequences of characters (or code points) treated as a unit, using angle brackets surrounding a comma-delimited list of code points, code points plus character names, and so on. For example, both of the designations in Table 1 refer to a combining character sequence consisting of the letter “a” with a circumflex and an acute accent applied to it. more  more

Properties - Each Unicode character belongs to a certain category. Unicode assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes, like in line-breaking, script direction right-to-left or applying controls.more
Perl looks cool!

Aug 1, 2012

Machine generated data

At first, the term "machine-generated  data" can be confusing. One would think,  every data is (or are?)  generated from one device or another is provided by an innocent mortal in this so called era of social media and big data. Then, there should be a clear distinction to such definitions. If an user enter some data in a form, then it is not considered machine generated. At the same time, the same application can track the user's location and log it in a remote server. So it becomes the machine generated data.

Wikipedia says,

Machine-generated data (MGD) is the generic term for information which was automatically created from a computer process, application, or other machine without the intervention of a human. 

According to Monash Research,

In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.

So what are they? Are they stream of logs flowing through the information super waterway?

May be, until they churned into some books or toilet rolls!

Application Logs - Logs generated by web or desktop applications. The server side logs used for debugging and support tickets!

Call Detail Records - The ones recorded your telecom company. They contain useful details of the call or service that passed through the switch etc like the phone number of the calls, its duration etc. Needed for billing.

Web logs - use to count the visitors and similar web analytics done on these data

Database Audit Logs - Enable auditing to audit for suspicious database activity, it is common that not much information is available to target specific users or schema objects

OS logs - tracks crashing or errors

There are many similar generated data by different application and systems like RFIDs, sensors etc. Then these messages can be mashed up. For the machine data, there will be structure or format and semantics based on the domain it relies on.

The growth of such data is fast and continuous. As it is a stream of data and like a history they are not changed. They are like a record of events.

courtesy- link

 Anyone tried iPhonetracker?

courtesy- link

Geolocation and LBS does push a load a data. HTML5 do have a geolocation functionality (even though you have the choice not to track). Following a sample code to test it.