How to manage tweets saved in #Hadoop using #Apache #Spark
2014-11-25
Apache Spark has just passed Hadoop in popularity on the web (according to Google Trends). My first use of Apache Spark was extracting the text of tweets I've been collecting in Hadoop HDFS. My Python script, tweet-texts.py, was:

```
import json

from pyspark import SparkContext


def valid(line):
    # Cheap pre-filter: substring test on the raw line, before any JSON parsing
    return 'text' in line


def gettext(line):
    # Parse one JSON-encoded tweet and return its text
    tweet = json.loads(line)
    return tweet['text']


sc = SparkContext(appName="Tweets")
data = sc.textFile("hdfs://hadoop.redaelli.org:9000/user/matteo/staging/twitter/searches/TheCalExperience.json/*/*/*.gz")

result = data.filter(lambda line: valid(line)) \
             .map(lambda tweet: gettext(tweet))

# collect() brings every text back to the driver
output = result.collect()
for text in output:
    # Python 2 print statement (PySpark 1.1.0 runs on Python 2)
    print text.encode('utf-8')
```
I launched it with:
spark-1.1.0> bin/spark-submit --master local[4] tweet-texts.py
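The `valid` check in the script is a plain substring test on the raw line, so a line that mentions "text" only inside a nested object would pass the filter and then fail in `gettext`. A more defensive variant might look like the sketch below. The helper name `extract_text` and the sample lines are hypothetical and not part of my original script; the function returns a list so that it could be plugged into the RDD's `flatMap`.

```python
import json


def extract_text(line):
    """Return [text] if the line is a JSON object with a top-level 'text' key, else []."""
    line = line.strip()
    if not line:
        # Blank lines carry no tweet
        return []
    try:
        tweet = json.loads(line)
    except ValueError:
        # Malformed JSON is skipped instead of failing the whole job
        return []
    if isinstance(tweet, dict) and 'text' in tweet:
        return [tweet['text']]
    return []


# Hypothetical sample input, for illustration only
sample_lines = [
    '{"id": 1, "text": "hello spark"}',
    '',
    '{"delete": {"status": {"id": 2}}}',
    'not json',
]
texts = [t for line in sample_lines for t in extract_text(line)]
print(texts)
```

In the Spark job, `data.flatMap(extract_text)` would then take the place of the `filter(...).map(...)` pair.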
I used Apache Spark 1.1.0 and Apache Hadoop 2.5.2. When I compiled Spark with
mvn -Phadoop-2.5 -Dhadoop.version=2.5.2 -DskipTests -Pyarn -Phive package
I got an error related to the Protocol Buffers (protobuf) jar version when I tried to read files from Hadoop HDFS:
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CreateSnapshotRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet
So I changed pom.xml, adding a hadoop-2.5 profile:
<profile>
  <id>hadoop-2.5</id>
  <properties>
    <hadoop.version>2.5.2</hadoop.version>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>