To pull data from Twitter, I started using the following tools:
Twitter exposes its data via several types of APIs (Application Programming Interfaces): REST, Search and Streaming. To collect near real-time data from Twitter, we will use the Streaming API to access public statuses filtered in various ways. The following URL is called:
https://stream.twitter.com/1/statuses/filter.json
When a request is sent, at least one filter option must be specified among keywords (“track“), user ids to follow (“follow“) and geographic locations (“locations“). With too many parameters the URL might become too long and be rejected, which is why we send the filter parameters in the POST body of the request.
Filter parameters are rate-limited. Details can be found on the Twitter documentation pages:
I think these limits are high enough in any case. However, I know from experience that the stream limits the number of statuses collected: if you use many keywords or other filter options and leave your stream running for too long, it might be stopped and you will have to restart it.
So far my biggest streaming program collects around 1.5 million tweets a day, and I have to restart it around once a week. Twitter does not give precise information about this limit and just says it is appropriate for long-term connections:
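To put that figure in perspective, a quick back-of-the-envelope calculation in the shell (using the daily volume mentioned above):

```shell
# 1.5 million tweets per day averages out to about 17 tweets per second
# (integer division; the exact value is ~17.4).
echo $((1500000 / 86400))
# prints 17
```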
Using the Twitter Streaming API requires authentication with valid credentials from a Twitter account. There are two authentication methods: Basic Auth and OAuth.
Basic Auth is the classic way to authenticate, by providing the username and password of a Twitter account. Twitter has announced that this method will soon be deprecated, without giving a fixed date, so it is now advised to use OAuth.
OAuth is an authorization method that uses tokens instead of credentials. It lets users grant a third-party application access to their personal information, protected by credentials, on another application. With this access granted by the user, OAuth then lets the two applications communicate with each other without having to share security credentials.
More info about Twitter Streaming API.
To collect data from Twitter using the Streaming API, that is, to collect results from the URL https://stream.twitter.com/1/statuses/filter.json, we started to use cURL. cURL is a command-line tool to transfer data from or to a server using one of the supported protocols (HTTP, FTP, SMTP…). It should be available by default on Amazon clouds. If not, run the command
sudo apt-get install curl or
sudo yum install curl.
cURL has many options, including one to send data in a POST request, “-d“. So we can simply put the filter parameters in a separate file and pass this file using “-d @myfile.conf“.
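As an illustration, here is what such a parameter file might contain. The keywords, user id and bounding box below are placeholders, not values from the actual project:

```shell
# Write a sample filter file; cURL sends its content as the POST body.
# "track", "follow" and "locations" values here are made-up examples.
cat > myfilter.options <<'EOF'
track=earthquake,flood&follow=12345678&locations=-122.75,36.8,-121.75,37.8
EOF
```

Multiple predicates are joined with “&”, just as in a URL query string.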
Another option handles credentials when the URL requires authentication, for example “-uUsername:Password“. We use this authentication method since we just run the cURL command from a terminal before using another application.
The command line looks like:
curl -d @myfilter.options https://stream.twitter.com/1/statuses/filter.json -uUser:password
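Since the stream can be stopped after a while, as noted above, a small wrapper that relaunches the command can save manual restarts. A minimal sketch, with a hypothetical restart_loop helper; the max-restart count is there only to keep the sketch bounded, in practice you would loop indefinitely:

```shell
# restart_loop runs a command repeatedly, relaunching it each time it
# exits (e.g. when Twitter drops a long-running filtered stream).
# $1 = number of (re)starts, $2 = command to run (a placeholder here;
# it would wrap the curl command shown above).
restart_loop() {
  max=$1; cmd=$2; n=0
  while [ "$n" -lt "$max" ]; do
    "$cmd" || echo "stream exited, restarting" >&2
    n=$((n + 1))
  done
  echo "$n"   # number of starts performed
}
```

For example, restart_loop 3 true runs the (here trivial) command three times and prints 3.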
Running several Twitter data pull programs from the same machine in my research laboratory would not have been possible, for several reasons: in particular, with all programs sharing the same IP address, they would have been stopped quickly by rate limits.
Using Amazon Elastic Compute Cloud (Amazon EC2) makes it possible to run each Twitter pull program on its own EC2 instance.
The advantages in this project were more flexibility with rate limits, as each program ran from a different IP address, and the fact that the programs ran on remote machines, using no memory on the local machine (very useful during blackout periods…).
Micro instances were sufficient for this kind of work. Storage solutions such as Amazon S3 and Amazon EBS are also used and will be described later.
More info about Amazon EC2.
The next article shows how to use these tools, listing the instructions I follow.