How to manage tweets saved in #Hadoop using #Apache #Spark SQL
2015-01-15
Instead of using the old Hadoop way (MapReduce), I suggest using the newer and faster way (Apache Spark on top of Hadoop YARN): in a few lines you can open all the tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*.gz) and query them in a SQL-like language:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="extractStatsFromTweets.py")
sqlContext = SQLContext(sc)

# Load all the gzipped JSON tweets for 2014 in a single call
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt, user.screenName, hashtagEntities FROM tweets")

# Count tweets per day and hashtag frequencies (field 2 of each
# hashtag entity holds its text)
tweets_by_days = count_items(t.map(lambda row: javaTimestampToString(row[0])))
stats_hashtags = count_items(t.flatMap(lambda row: row[2])
                              .map(lambda h: h[2].lower()))
```
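The snippet relies on two helpers, count_items and javaTimestampToString, that aren't shown in this section. Here is a minimal sketch of what they could look like, assuming count_items does a word-count style aggregation on an RDD and that createdAt comes back as a Java-style epoch-milliseconds timestamp:

```python
from datetime import datetime

def count_items(rdd):
    # Classic word-count aggregation: emit (item, 1) pairs, reduce by key
    return rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

def javaTimestampToString(ts):
    # Assumption: createdAt is epoch milliseconds (a Java timestamp);
    # format it as YYYY-MM-DD so tweets can be grouped per day
    return datetime.fromtimestamp(ts / 1000.0).strftime("%Y-%m-%d")
```

With helpers along these lines, stats_hashtags.collect() would return (hashtag, count) pairs ready to be sorted or written back to HDFS.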