Pulling data from Twitter: Tools

In order to pull data from Twitter I started to use the following tools:

Twitter Stream API

Twitter exposes its data via different types of APIs (Application Programming Interfaces): REST, Search and Streaming. To collect near real-time data from Twitter, we will be using the Streaming API to access public statuses filtered in various ways. The following URL is called:

https://stream.twitter.com/1/statuses/filter.json

When a request is sent, at least one filter option must be specified among keywords (“track”), user IDs to follow (“follow”) and geographic locations (“locations”). With too many parameters the URL might be too long and therefore rejected, which is why we send the filter parameters in the body of a POST request.

  • Limitations:

Filter parameters are rate-limited. More details can be found on the Twitter documentation pages:
http://dev.twitter.com/pages/streaming_api_methods#statuses-filter
http://dev.twitter.com/pages/streaming_api_concepts#filter-limiting

These limits should be high enough in most cases. However, I know from experience that the stream limits the number of statuses collected: if you use many keywords or other filter options and leave your stream running for too long, it might be stopped and you will have to restart it.

So far my biggest streaming program collects around 1.5 million tweets a day and I have to restart it about once a week. Twitter does not give precise information about this limit and just says it is appropriate for long-term connections:
http://dev.twitter.com/pages/rate-limiting#streaming

  • Identification:

Using the Twitter Stream API requires authenticating with valid credentials from a Twitter account. There are two types of authentication: Basic Auth and OAuth.

Basic Auth is the classic way to authenticate, by providing the username and password of a Twitter account. Twitter has announced that this method will soon be deprecated, without giving a fixed date, so it is now advised to use OAuth instead.

OAuth is an authorization method that uses tokens instead of credentials. It lets users grant a third-party application access to their personal information, which is protected by credentials on another application. Once the user has granted access, OAuth lets the two applications communicate with each other without sharing security credentials.

More info about Twitter Streaming API.

cURL

To collect data from Twitter using the Streaming API, meaning collecting results from the URL https://stream.twitter.com/1/statuses/filter.json, we started to use cURL. cURL is a command-line tool for transferring data from or to a server using one of the supported protocols (HTTP, FTP, SMTP…). It should be available by default on Amazon clouds. If not, run the command sudo apt-get install curl or sudo yum install curl.

cURL has many options, including one to send data in a POST request: “-d”. So we can simply put the filter parameters in a separate file and pass this file using “-d @myfile.conf”.
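As an illustration, the contents of such a filter file might look like this (the keywords and bounding box here are hypothetical; the parameters form a regular URL-encoded POST body, and cURL strips newlines when reading the file):

```text
track=snow,storm&locations=-122.75,36.8,-121.75,37.8
```

Commas inside “track” separate alternative keywords, and “locations” takes pairs of longitude/latitude coordinates describing the south-west and north-east corners of a bounding box.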

Another option handles credentials when the URL requires authentication, for example “-uUsername:Password”. We use this method of authentication since we just run the cURL command from a terminal before using another application.

The command line looks like:

curl -d @myfilter.options https://stream.twitter.com/1/statuses/filter.json -uUser:password
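Each status then arrives on standard output as one JSON object per line. As a minimal sketch of how that output could be post-processed afterwards (this is not the program I use, just an illustration in Python; the “text” and “user” fields come from the public status format):

```python
import json

def parse_stream_lines(lines):
    """Yield (screen_name, text) pairs from line-delimited stream output.

    Blank keep-alive lines and non-status messages such as limit
    notices (which carry no "text" field) are skipped.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # blank keep-alive line sent by the stream
        status = json.loads(raw)
        if "text" not in status:
            continue  # e.g. {"limit": {...}} or {"delete": {...}}
        yield status["user"]["screen_name"], status["text"]

# Hypothetical captured lines for illustration:
sample = [
    '{"text": "hello world", "user": {"screen_name": "alice"}}',
    '',
    '{"limit": {"track": 42}}',
]
print(list(parse_stream_lines(sample)))  # [('alice', 'hello world')]
```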

Amazon EC2

Running several Twitter data pull programs from the same machine in my research laboratory would not have been possible for several reasons, including the fact that sharing the same IP address across all programs would quickly have gotten them stopped by rate limits.

Using Amazon Elastic Compute Cloud (Amazon EC2) makes it possible to run each Twitter pull program on an individual EC2 instance.

The advantages in this project were that we had more flexibility with rate limits, as each program ran from a different IP address. The programs also ran on remote machines and therefore did not use memory on the local machine (very useful during blackout periods…).

Micro instances were sufficient for this kind of work. Storage solutions such as Amazon S3 and Amazon EBS are also used and will be covered later.

More info about Amazon EC2.

Now see how to use these tools in the next article, which lists the instructions I follow.

13 thoughts on “Pulling data from Twitter: Tools”

  1. I have no idea how to use the API, and I want to start collecting data for an independent project using a set of keywords and time frames. Is it possible if I have very little programming experience? Do you know of a website or company that does this already? I just want to have the number of people using the keywords. Thank you for any help

  2. In future posts I will detail what I am currently doing to collect data using the tools described in this post, so it might be an option for you; advanced programming knowledge is not required. I will detail the instructions and I hope it will help.
    I am currently trying to develop an alternative way to pull data though, using a more stable program, and I will probably explain it later too.
    In the meantime you can also have a look at the different libraries listed by Twitter (http://dev.twitter.com/pages/libraries). I am using Java and I had to do some programming for my needs, but maybe you can find something that provides what you need.

  3. Hi, I am working for an org which has 744 followers. I want to get these onto a spreadsheet so we can analyse who is following us. Is there a way of extracting this data rather than sitting and going through all the names on Twitter and putting them onto a spreadsheet?
    Thanks

  4. Hi,

    I am currently working on a project and would like to analyze the Twitter behavior of people who have public accounts.
    Therefore, I need to collect tweets from the last 6 to 7 months. Is there a way to pull that many tweets? (or even entire (almost) timelines?)
    All I could find so far were programs with which I could collect the last 20 tweets.
    Thank you!
    Best,
    Patricia

    • Hi,
      I am not too sure about how to collect past data because what I implemented uses Twitter Streaming and collects data in real time.
      The only option I can think of is with Twitter Search. You can find more details here: https://dev.twitter.com/docs/using-search
      It seems like there are some options to retrieve more than just the last 20 tweets, but I don’t think it is possible to go back 6 to 7 months in the past.
      I hope this helps a little.
      Regards

      • Hi Sebastien, I am currently working on a project where I have to get all the tweets from Twitter without specifying conditions like track, location, etc… I need all the tweets. Is that possible? I need some ideas…

        • Hi,
          Twitter doesn’t allow collecting all tweets because there are too many. That’s why you have to provide filter options (tracks, geolocations and/or usernames).
          However it is possible to collect data without any of these options with “sample” (https://stream.twitter.com/1.1/statuses/sample.json). It gives a random sample of all public statuses and you can get a large number of tweets.
          That said, the sample is much smaller than the total number of tweets, but that’s how you can get tweets without specifying conditions.

  5. Dear Sebastien:
    Thank you for this post, really interesting.

    I have also been working with Twitter data; however, I have noticed that if we want to download the geo-location of a post, the user must have activated this feature.
    Am I right? Is it possible to get the geo-location information from each post?

    • Hi,
      Yes, you’re right: in order to collect the geo-location, the user must activate this option in their privacy settings.
      Otherwise, the “geo-location” field will be null.
