Apache Pig for batch data analysis over Hadoop
2014-08-25
These days I'm experimenting with Apache Pig for running data analysis on Apache Hadoop. Below is a sample word cloud generated from the most frequent nouns in the Italian translation of the Bible.

Copy the file book.txt to the Hadoop Distributed File System (HDFS) with:

```
hadoop-2.4.0/bin/hdfs dfs -copyFromLocal -f book.txt
```
Test the Pig job locally with:

```
pig-0.13.0/bin/pig -x local wordcount.pig
```

Run the Pig job on Hadoop with:

```
pig-0.13.0/bin/pig -x mapreduce wordcount.pig
```

Look at the results with:

```
hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part* | more
```

Copy the results to a local file with:

```
hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part* > frequenza-parole-bibbia.txt
```
Below are the two scripts I used for this short tutorial.

Wordcount (Pig script):

```
a = load '/user/matteo/book.txt';
b = foreach a {
    line = LOWER(REPLACE((chararray)$0, '[!?\\.»«:;,\']', ' '));
    generate flatten(TOKENIZE(line)) as word;
}
c = group b by word;
d = foreach c generate group, COUNT(b) as cnt;
d_ordered = ORDER d BY cnt DESC;
store d_ordered into '/user/matteo/book-wordcount';
```
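For readers unfamiliar with Pig Latin, the same pipeline (lowercase, strip punctuation, tokenize, group, count, sort by count descending) can be sketched in plain Python. This is a local, single-machine equivalent for illustration only, not part of the tutorial's toolchain:

```python
import re
from collections import Counter

def word_counts(lines):
    """Mimic the Pig script: lowercase each line, replace punctuation
    with spaces, tokenize on whitespace, count words, sort by count."""
    counts = Counter()
    for line in lines:
        # Equivalent of LOWER(REPLACE(...)): punctuation becomes spaces.
        cleaned = re.sub(r"[!?\.»«:;,']", " ", line.lower())
        # Equivalent of TOKENIZE + group/COUNT.
        counts.update(cleaned.split())
    # Equivalent of ORDER ... BY cnt DESC.
    return counts.most_common()

# Example run on two sample lines.
print(word_counts(["Nel principio Dio creò il cielo e la terra.",
                   "E la terra era informe e vuota."]))
```

The difference, of course, is that Pig compiles the same five relational steps into MapReduce jobs that scale across the cluster, while this sketch runs in one process.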
Wordcloud (R script):

```
library(wordcloud)
p = read.table(file="frequenza-parole-bibbia.txt")
png("/home/matteo/la-sacra-bibbia-frequenza-parole.png", width=900, height=900)
wordcloud(p$V1, p$V2, scale=c(8,.3), min.freq=2, max.words=200, random.order=T, rot.per=.15)
dev.off()
```