How to manage tweets saved in #Hadoop using #Apache #Spark SQL
2015-01-15
Instead of using the old Hadoop way (MapReduce), I suggest using the newer and faster way (Apache Spark on top of Hadoop YARN): in a few lines you can open all the tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*.gz) and query them in a SQL-like language:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="extractStatsFromTweets.py")
sqlContext = SQLContext(sc)

# Load all the gzipped JSON tweets for 2014 in a single call
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt, user.screenName, hashtagEntities FROM tweets")

# Count tweets per day and hashtag frequencies (field 2 of each
# hashtag entity holds its text)
tweets_by_days = count_items(t.map(lambda row: javaTimestampToString(row[0])))
stats_hashtags = count_items(t.flatMap(lambda row: row[2])
                              .map(lambda h: h[2].lower()))
```
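The snippet relies on two helpers, count_items and javaTimestampToString, that aren't shown in this section. Here is a minimal sketch of what they could look like, assuming count_items does a word-count style aggregation on an RDD and that createdAt comes back as a Java-style epoch-milliseconds timestamp:

```python
from datetime import datetime

def count_items(rdd):
    # Classic word-count aggregation: emit (item, 1) pairs, reduce by key
    return rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

def javaTimestampToString(ts):
    # Assumption: createdAt is epoch milliseconds (a Java timestamp);
    # format it as YYYY-MM-DD so tweets can be grouped per day
    return datetime.fromtimestamp(ts / 1000.0).strftime("%Y-%m-%d")
```

With helpers along these lines, stats_hashtags.collect() would return (hashtag, count) pairs ready to be sorted or written back to HDFS.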