Pulling data from Twitter: Steps

In a previous post I introduced the tools I started using to pull data from Twitter.

John Eberly’s article was very helpful: it explains how to automatically back up data from an Amazon EC2 instance. I had to make some changes, and some commands were not available on the Amazon EC2 instance type I am using (a Micro Basic 64-bit Amazon Linux instance). The article also refers to a newer backup application, but I couldn’t use it on this instance. If you know of better methods, don’t hesitate to share.

In this post I will describe the steps to follow to run a program that pulls data from Twitter and to back up the collected data.

    1. Amazon Web Services (AWS)

I created Micro instances (Basic 64-bit Amazon Linux), since pulling data from Twitter uses very little memory. Use the management console and click on “Launch instance” to create a new one. Save the .pem file, then right-click on the newly created instance in the Amazon Management Console and click on “Connect” to read the instructions for connecting to your instance from a terminal.

Note: you cannot log in directly as “root”, so in the SSH command replace “root” in “root@ec2-ip-address.compute-1.amazonaws.com” with “ec2-user”, then type the command “sudo su” to become “root”.
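For reference, a minimal sketch of the connection (the key file name and hostname below are placeholders for your own values):
# connect as ec2-user using the .pem key saved at launch time
$ ssh -i my-key.pem ec2-user@ec2-ip-address.compute-1.amazonaws.com
# then switch to root once logged in
$ sudo su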

    2. cURL

In a previous post I introduced the cURL command line tool, used to send a request to the Twitter Streaming API as below:
curl -d @myfilter.options https://stream.twitter.com/1/statuses/filter.json -uUser:password
where “myfilter.options” is a file containing the filter options required in a streaming request, using at least one of the “track”, “follow” and “location” parameters (e.g. “track=twitter,tweet”), and “User” and “password” are the credentials of an existing Twitter account (registering an application is not required in this case).
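As a sketch, the contents of “myfilter.options” could look like the single line below (the keywords are only examples; several parameters can be combined on that line with “&”, since the file is sent as the POST body of the request):
track=twitter,tweet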

To make it easier to organize the files containing the data collected from Twitter, I execute the following command:
curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json -umyusername:mypassword >> mydata_$(date +%Y%m%d-%H%M).log &
This runs the cURL command in the background (thanks to the trailing “&”), so it keeps running after you disconnect from the Amazon EC2 instance (depending on your shell, you may also need nohup or disown for it to survive closing the terminal; see the sketch below). The data collected is stored in a file with the desired prefix and the current date and time, so you know when the data pull started.
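If the background job does get killed when you log out, a minimal sketch of a more robust invocation (same placeholders as above) is:
# nohup detaches the command from the terminal’s hangup signal; the progress
# meter on stderr is discarded so it does not end up in the data file
$ nohup curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json -umyusername:mypassword >> mydata_$(date +%Y%m%d-%H%M).log 2>/dev/null &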

 

    3. Back up collected data with s3sync

 

The data collected from Twitter using the cURL command is stored in files. In my setup, the files are initially created on the Amazon EC2 instance, but I use the Amazon S3 storage service to back them up and to free space on the EC2 instance (which only has 10GB of disk space). After creating a “Bucket” on Amazon S3, you need to install a synchronization tool on the machine where the data is collected. For that I use s3sync (a Ruby tool). (Note: newer tools exist, but I couldn’t find another one I could use on the Micro Amazon instance. If you know of any, please share.)
To install s3sync:
-> Make sure a recent Ruby version is installed (run ruby -v: the version should be at least 1.8.4; use sudo yum install ruby, or sudo apt-get install ruby libopenssl-ruby on Debian/Ubuntu, if necessary)
-> Download and unpack s3sync:
$ wget http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
$ tar -xvzf s3sync.tar.gz
$ rm -f s3sync.tar.gz
-> Optional: install SSL certificates:
$ mkdir certs
$ cd certs
$ wget http://mirbsd.mirsolutions.de/cvs.cgi/~checkout~/src/etc/ssl.certs.shar
$ sh ssl.certs.shar
To use s3sync:
We need to set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with the access key information available on the Amazon Web Services website (Account → Security Credentials), as in the sketch after the examples below.
Once done, we can use the command “s3cmd.rb” to manage S3 buckets:
s3cmd.rb listbuckets (to list buckets),
s3cmd.rb createbucket MyBucket (to create new bucket named “MyBucket”),
s3cmd.rb list MyBucket 10 (to list only what is in “MyBucket” 10 lines at a time)…
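For example, a minimal sketch of a session from the s3sync directory (the key values and bucket name are placeholders):
# s3sync and s3cmd.rb read the AWS credentials from these environment variables
$ export AWS_ACCESS_KEY_ID=your_amazonaws_access_key
$ export AWS_SECRET_ACCESS_KEY=your_amazonaws_secret_access_key
# list the existing buckets, then create a new one
$ ruby s3cmd.rb listbuckets
$ ruby s3cmd.rb createbucket MyBucket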

 

    4. Automatic synchronization

 

I personally back up the collected data every day using a shell script and the scheduling tool crontab. The shell script calls s3sync to back up the data and cURL to start a new request. Here is an example of a shell script called “synchro.sh”:

#!/bin/bash
# credentials used by s3sync (placeholders)
export AWS_ACCESS_KEY_ID=your_amazonaws_access_key
export AWS_SECRET_ACCESS_KEY=your_amazonaws_secret_access_key
# move the current log file (“date” stands for its timestamp) to a temporary folder
mv /path/to/file/mydata_date.log /path/to/temp/folder
# start a new streaming request, which ends the previous one
curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json -umyusername:mypassword >> mydata_$(date +%Y%m%d-%H%M).log &
# remove the last line of the moved log file, which is usually incomplete
sed -i '$d' /path/to/temp/folder/mydata_date.log
# compress the moved log files before the transfer to Amazon S3
tar -czf /path/to/s3/folder/$(date +%Y%m%d).tar.gz /path/to/temp/folder/mydata_*
# synchronize the folder with the S3 bucket over SSL
ruby /path/to/s3sync/folder/s3sync.rb -r --ssl /path/to/s3/folder/ mybucket_name:subfolder
rm -f /path/to/temp/folder/*

After exporting the environment variables, we move the log file to a different location and start a new cURL request: this ends the previous cURL request and creates a new log file with the current date and time. The sed command removes the last line of the moved log file, as it is usually incomplete after moving the file. We then tar the log file so that the transfer to the Amazon S3 bucket takes less time.
The s3sync.rb command transfers the files in the folder /path/to/s3/folder/ to the bucket mybucket_name (the subfolder is optional). Other options are available for this command:
--delete deletes all files in the destination that are not in the source (be careful before using this option, as you might permanently lose data),
--public-read makes the transferred files readable by everyone.
Once the transfer is complete, we can delete the log files.

I use crontab to run this script once every day at midnight:
0 0 * * * /path/to/script/file/synchro.sh
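To install the job, one way (assuming the script is not yet executable) is:
# make the script executable, then add the line above through the crontab editor
$ chmod +x /path/to/script/file/synchro.sh
$ crontab -e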
So I now have data collected continuously from Twitter on an Amazon EC2 instance, and the collected data is automatically backed up online. You are then free to download this data from your S3 bucket and do whatever you need with it.
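As a sketch, s3sync can also be used in the other direction to pull the archives back from the bucket to a local machine (the bucket name, subfolder and local path are placeholders):
# copy the bucket contents back to a local folder over SSL
$ ruby /path/to/s3sync/folder/s3sync.rb -r --ssl mybucket_name:subfolder /local/backup/folder/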