Pulling data from Twitter: Steps

In a previous post I introduced the tools I started using to pull data from Twitter.

John Eberly’s article was very helpful. It explains how to automatically back up data from an Amazon EC2 instance. I had to make some changes, as some commands were not available on the Amazon EC2 instance I am using (Micro, Basic 64-bit Amazon Linux). The article also refers to a newer backup application, but I couldn’t use it on this instance. If you know better methods, don’t hesitate to share.

In this post I will describe the steps to follow in order to run a program that pulls data from Twitter and to back up the collected data.

    1. Amazon Web Services AWS

I created Micro instances (Basic 64-bit Amazon Linux), as pulling data from Twitter uses very little memory. Use the management console and click on “Launch instance” to create a new one. Save the .pem file, then right-click on your new instance in the Amazon Management Console and click on “Connect” to read the instructions for connecting to your instance from a terminal.

Note: you cannot log in directly as “root”, so in the SSH command replace “root” in “root@ec2-ip-address.compute-1.amazonaws.com” with “ec2-user”, then type the command “sudo su” to become “root”.
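
As a rough sketch of the connection step, the snippet below prints the SSH command with “ec2-user” in place of “root”. The key file name “mykey.pem” is a hypothetical example, and the host is the placeholder from the note above; use the values shown in the “Connect” dialog for your own instance.

```shell
# Hypothetical key file name; the host is the placeholder used in the note above.
KEY_FILE="mykey.pem"
EC2_HOST="ec2-ip-address.compute-1.amazonaws.com"

# Print the SSH command that logs in as ec2-user rather than root.
ssh_command() {
    printf 'ssh -i %s ec2-user@%s\n' "$1" "$2"
}

# ssh refuses private keys that other users can read, so restrict the file first:
#   chmod 400 "$KEY_FILE"
ssh_command "$KEY_FILE" "$EC2_HOST"
# Once connected, "sudo su" switches to root.
```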

    2. cURL

In a previous post I introduced the command-line tool cURL, used to send a request to the Twitter Streaming API as below:
curl -d @myfilter.options https://stream.twitter.com/1/statuses/filter.json -uUser:password
where “myfilter.options” is a file containing the filter options required in a streaming request, using at least one of the “track”, “follow” and “locations” options (e.g.: “track=twitter,tweet”), and “User” and “password” are the credentials of an existing Twitter account (registering an application is not required in this case).
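
For illustration, the filter file can be created from the command line. This is only a sketch: the function name write_track_filter is made up for this example, and it covers the “track” option only. cURL strips newlines from a file passed with “-d @file”, so a trailing newline in the file would not matter either way.

```shell
# Write a filter file that tracks the given comma-separated keywords.
# $1: output file, $2: keywords, e.g. "twitter,tweet"
write_track_filter() {
    printf 'track=%s' "$2" > "$1"
}

# Demo in a temporary location so nothing in the current folder is overwritten.
FILTER_FILE="${TMPDIR:-/tmp}/myfilter.options"
write_track_filter "$FILTER_FILE" "twitter,tweet"
cat "$FILTER_FILE"
```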

To easily classify the files with the data collected from Twitter, I execute the following command:
curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json -umyusername:mypassword >> mydata_$(date +%Y%m%d-%H%M).log &
This runs the cURL command in the background (with the “&”) so it keeps running after you close the terminal or disconnect from the Amazon EC2 instance (if it stops when you log out, prefix the command with “nohup”). The collected data is stored in a file named with the chosen prefix and the current date and time, so you know when pulling data started.
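
The file naming and the background launch can be wrapped in two small functions. This is a sketch under assumptions: log_name and start_stream are names invented for this example, the filter file and credentials are the placeholders used above, and “nohup” is added as a precaution against the process being stopped on logout.

```shell
# Build a log file name made of a prefix plus the current date and time.
log_name() {
    printf '%s_%s.log\n' "$1" "$(date +%Y%m%d-%H%M)"
}

# Start the stream in the background; nohup keeps it alive after logout.
# The filter file and credentials are the placeholders used in the text.
start_stream() {
    nohup curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json \
        -umyusername:mypassword >> "$(log_name "$1")" 2>/dev/null &
}

# Show the name a new log file would get right now.
log_name mydata
```

Calling “start_stream mydata” would then launch the request and append the results to the dated file.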


  • Back up collected data with s3sync


The data collected from Twitter using the cURL command is stored in files. In my case, files are initially created on the Amazon EC2 instance, but I use the Amazon S3 storage service to back up the data and free up space on the instance (10GB of space on EC2). After creating a “Bucket” on Amazon S3, you need to install a synchronization tool on the machine where the data is collected. For that I use s3sync (Ruby). (Note: newer tools exist, but I couldn’t find another one that I could use on a Micro Amazon instance. If you know of any, please share.)
To install s3sync:
-> Make sure a recent Ruby version is installed (ruby -v: the version should be at least “1.8.4”. Use sudo apt-get (or yum) install ruby libopenssl-ruby if necessary)
-> Download and unpack s3sync:
$ wget http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
$ tar -xvzf s3sync.tar.gz
$ rm -f s3sync.tar.gz
-> Optional: install SSL certificates:
$ mkdir certs
$ cd certs
$ wget http://mirbsd.mirsolutions.de/cvs.cgi/~checkout~/src/etc/ssl.certs.shar
$ sh ssl.certs.shar
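
The Ruby version requirement from the first install step can also be checked by a script rather than by eye. This is a minimal sketch: version_at_least is a helper written for this example, and it assumes a “sort” that supports the “-V” (version sort) option, as GNU sort does.

```shell
# Succeeds when version $1 is at least version $2 (relies on "sort -V").
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v ruby >/dev/null 2>&1; then
    # RUBY_VERSION is the interpreter's built-in version constant.
    current=$(ruby -e 'print RUBY_VERSION')
    if version_at_least "$current" 1.8.4; then
        echo "Ruby $current is recent enough"
    else
        echo "Ruby $current is too old, please upgrade"
    fi
else
    echo "Ruby is not installed"
fi
```
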
To use s3sync:
We need to set up two environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) with the access key information available on the Amazon Web Services website (Account → Security Details).
Once done, we can use the command “s3cmd.rb” to manage S3 buckets:
s3cmd.rb listbuckets (to list buckets),
s3cmd.rb createbucket MyBucket (to create new bucket named “MyBucket”),
s3cmd.rb list MyBucket 10 (to list only what is in “MyBucket” 10 lines at a time)…
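
Since the s3sync commands depend on both variables being set, a small guard can catch a missing one before anything is sent to S3. This is a sketch only: check_aws_env is a helper invented for this example, and the key values are placeholders.

```shell
# Fail early when the credentials s3sync expects are missing from the environment.
check_aws_env() {
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must both be set" >&2
        return 1
    fi
}

# Placeholder values; replace them with the keys from your AWS account.
export AWS_ACCESS_KEY_ID=your_amazonaws_access_key
export AWS_SECRET_ACCESS_KEY=your_amazonaws_secret_access_key

if check_aws_env; then
    echo "credentials found, s3cmd.rb can be called"
    # e.g. s3cmd.rb listbuckets
fi
```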


  • Automatic synchronization


I personally back up my collected data every day using a shell script and the scheduling tool crontab. The shell script calls s3sync to back up the data and cURL to start a new request. An example of a shell script called “synchro.sh”:

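#!/bin/sh
# Credentials read by s3sync
export AWS_ACCESS_KEY_ID=your_amazonaws_access_key
export AWS_SECRET_ACCESS_KEY=your_amazonaws_secret_access_key
# Move the current log aside, then start a new request with a fresh log file
mv /path/to/file/mydata_date.log /path/to/temp/folder
curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json -umyusername:mypassword >> /path/to/file/mydata_$(date +%Y%m%d-%H%M).log &
# Remove the last line of the moved log, which is usually incomplete
sed -i '$d' /path/to/temp/folder/mydata_date.log
# Compress the moved log and synchronize the archive folder with S3
tar -czf /path/to/s3/folder/$(date +%Y%m%d).tar.gz -C /path/to/temp/folder mydata_date.log
ruby /path/to/s3sync/folder/s3sync.rb -r --ssl /path/to/s3/folder/ mybucket_name:subfolder
# Delete the temporary copy of the log
rm -f /path/to/temp/folder/*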

After exporting the environment variables, we move the log file to a different location and start a new cURL request: this ends the previous cURL request (Twitter generally allows only one standing streaming connection per account, so the new connection replaces the old one) and creates a new log file with the current date and time. The sed command removes the last line of the log file, as it is usually incomplete after moving the file. We tar the log file so that the transfer to the Amazon S3 bucket takes less time.
The command s3sync.rb will transfer the files in the folder /path/to/s3/folder/ to the bucket mybucket_name (subfolders are optional). There are other options available to this command:
--delete will delete all files in the destination folder that are not in the source folder (be careful before using this option as you might permanently lose data),
--public-read will make the transferred files available to everyone.
Once completed we can delete the log files.
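
A slightly more careful variant of the sed step removes the last line only when it really looks cut off. This is a sketch and not what the script above does: drop_partial_last_line is a name made up for this example, and it assumes that every complete record ends with a newline and that sed supports “-i” (as GNU sed does).

```shell
# Remove the last line of a file only when the file does not end with a
# newline, which suggests the final record was cut off mid-write.
drop_partial_last_line() {
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
        sed -i '$d' "$1"
    fi
}

# Demo on a throwaway file whose third record is incomplete.
sample=$(mktemp)
printf '{"id":1}\n{"id":2}\n{"id":3' > "$sample"
drop_partial_last_line "$sample"
cat "$sample"
rm -f "$sample"
```

The check works because command substitution drops a trailing newline: when the last byte of the file is a newline the result is empty and the file is left alone.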

I use crontab to run this script once every day at midnight:
0 0 * * * /path/to/script/file/synchro.sh
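
The crontab line can be installed from a script instead of by hand. The sketch below is one way to do it: add_cron_entry is a helper invented for this example, and it filters out any existing line that mentions the script so the job is not added twice.

```shell
# Read an existing crontab on stdin and print it with exactly one entry for
# the given command, so running the installer twice does not duplicate the job.
# $1: schedule (five cron fields), $2: command to run
add_cron_entry() {
    grep -vF -- "$2" || true
    printf '%s %s\n' "$1" "$2"
}

# Preview the resulting crontab (starting from an empty one) without installing it.
printf '' | add_cron_entry '0 0 * * *' '/path/to/script/file/synchro.sh'
# To install it for real (the script must be executable, e.g. chmod +x):
#   crontab -l 2>/dev/null | add_cron_entry '0 0 * * *' '/path/to/script/file/synchro.sh' | crontab -
```
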
So now I have data collected continuously from Twitter on an Amazon EC2 instance and collected data is automatically stored online. Then you’re free to download this data from your S3 bucket and do what you need with it.

Amazon Web Services (EC2, S3, EBS…)

In order to pull data from Twitter I used Amazon Web Services to run programs collecting data and backing it up.

Amazon Elastic Compute Cloud EC2 is a web service that provides compute capacity in the cloud. The Web interface is easy to use and allows users to create a customized virtual computing environment. You pay only for what you use and you may even be eligible for the Free Usage Tier for 1 year. When you create a new instance you download a private key file (.pem), which you must save in order to connect to this instance later.

AWS Management Console is a web interface that gives access to all Amazon Web Services. From there we can create and manage instances, volumes, buckets… and monitor the memory and transactions used by these instances.

Amazon Simple Storage Service S3 is an Internet storage service where we can store and retrieve data from anywhere on the Web. Each bucket created and each file stored can be accessed through its own URL. Access permissions can be set for each file. As with other Amazon Web Services, you pay only for what you use, and a single stored object can be up to 5TB.

Amazon Elastic Block Store EBS is a storage volume service offering volumes from 1GB to 1TB. Storage volumes can be mounted as devices on Amazon EC2 instances. Several volumes can be mounted on the same instance, but a volume can only be mounted on one instance at a time. The difference with Amazon S3 is that Amazon EBS volumes can be used as boot partitions and support point-in-time snapshots to recover data.

Pulling data from Twitter: Tools

In order to pull data from Twitter I started to use the following tools:

Twitter Stream API

Twitter exposes its data via different types of API (Application Programming Interface): REST, Search and Streaming. In order to collect nearly real-time data from Twitter, we will be using the Streaming API to access public statuses filtered in various ways. The following URL is called:

https://stream.twitter.com/1/statuses/filter.json

When a request is sent, at least one filter option must be specified among keywords (“track”), user ids to follow (“follow”) and geographic locations (“locations”). With too many parameters the URL might become too long and be rejected, which is why we send the parameters in the body of a POST request.

  • Limitation:

Filter parameters are rate-limited. We can find different information on the Twitter documentation pages:

I think that in any case these limits are high enough. However, I know from experience that the stream limits the number of statuses collected, so if you use a lot of keywords or other filter options and leave your stream running for too long, it might be stopped and you will have to restart it.

So far my biggest streaming program collects around 1.5 million tweets a day and I have to restart it about once a week. Twitter does not give precise information about this limit and says only that it is appropriate for long-term connections.
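
Restarting by hand can be replaced by a small loop that starts the request again whenever it exits. This is a sketch under assumptions, not the setup I actually ran: run_with_restart and stream_once are names invented for this example, the filter file and credentials are placeholders, and a real version might want a longer or growing delay to stay within the rate limits.

```shell
# Run a command again each time it exits, up to a maximum number of attempts.
# $1: maximum attempts, $2: seconds to wait between attempts, rest: the command.
run_with_restart() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        status=0
        "$@" || status=$?
        echo "attempt $attempt ended with status $status" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 0
}

# Hypothetical wrapper around the cURL request (placeholders, not real credentials).
stream_once() {
    curl -d @myfilter.conf https://stream.twitter.com/1/statuses/filter.json \
        -umyusername:mypassword >> "mydata_$(date +%Y%m%d-%H%M).log"
}

# Example call, left commented out so nothing is sent when this file is sourced:
#   run_with_restart 5 60 stream_once
```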

  • Identification:

Using the Twitter Streaming API requires authenticating with valid credentials from a Twitter account. There are 2 types of identification: Basic Auth and OAuth.

Basic Auth is the classic way to be identified, by providing the username and password of a Twitter account. Twitter announced that this method will soon be deprecated, without giving a fixed date, so it is now advised to use OAuth identification.

OAuth is an authorization method using tokens instead of credentials. This method lets users grant a third-party application access to their personal information, protected by credentials, on another application. With this access granted by the user, OAuth can then let these 2 applications communicate with each other without having to share security credentials.

More info about Twitter Streaming API.

cURL


In order to collect data from Twitter using the Streaming API, meaning collecting results from the URL https://stream.twitter.com/1/statuses/filter.json, we started to use cURL. cURL is a command-line tool to transfer data from or to a server using one of the supported protocols (HTTP, FTP, SMTP…). It should be available by default on Amazon EC2 instances. If not, run the command sudo apt-get install curl or sudo yum install curl.
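
The availability check can be scripted as follows. This is a sketch: have_cmd and install_hint are helper names made up for this example, and it only prints the install command to run rather than running it.

```shell
# Succeeds when the given command is available on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Suggest an install command based on the package manager that is present.
install_hint() {
    if have_cmd apt-get; then
        echo "sudo apt-get install curl"
    elif have_cmd yum; then
        echo "sudo yum install curl"
    else
        echo "no known package manager found"
    fi
}

if have_cmd curl; then
    curl --version | head -n 1
else
    install_hint
fi
```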

cURL has many options, including one to send data in a POST request: “-d”. So we can simply put the filter parameters in a separate file and pass this file using “-d @myfile.conf”.

Another option handles credentials when the URL requires identification, for example “-uUsername:Password”. We use this method of identification because, for now, we just run the cURL command from a terminal before moving on to another application.

The command line looks like:

curl -d @myfilter.options https://stream.twitter.com/1/statuses/filter.json -uUser:password

Amazon EC2

Running several Twitter data pull programs from the same machine in my research laboratory wouldn’t have been possible for several reasons, including the fact that programs sharing the same IP address would quickly have been stopped by rate limits.

Using Amazon Elastic Compute Cloud (Amazon EC2) makes it possible to run each Twitter pull program on its own EC2 instance.

The advantage in this project was more flexibility with rate limits, as each program was running from a different IP address. The programs were also running on other machines and so were not using memory on the local machine (very useful during blackout periods…).

Micro instances were sufficient for this kind of work. Storage solutions such as Amazon S3 and Amazon EBS are also used and will be covered later.

More info about Amazon EC2.

Now see how to use these tools in the next article listing the instructions I follow.