Tweets Collector Script for your Data Science project

The other day I was looking for datasets to start working on a new personal project. After searching for a while, I realized that some datasets have a lot of potential if we enrich them with social media information like tweets.
For example, I found this dataset on Kaggle (Link) that could become a pretty interesting project if you combine it with weather data and the tweet alerts the agency publishes.
That’s when I started my journey to collect data from Twitter and decided to build a script that inserts tweets from a Twitter account into a database (you can also save the output as a CSV or any other format you want).
For this demo, I’m going to use the official Twitter account of @NJTransit, so it can be really helpful if you are already working with, or have in mind to use, the dataset linked above.
1. Create your API Keys
I’m not going to dive deep into this part, just a brief overview of how and where to create your Twitter API keys.
First, you need to create a new App on the Twitter Developer website.
After creating your new App, you can grab your Consumer API keys (API Key and API Secret Key) and your Access Token & Access Token Secret, because you are going to need them in just a moment.

2. Build the Script
- Define Key Credentials using the ConfigParser library
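A minimal sketch of this step could look like the following (the file name config.ini and the section and key names are assumptions, adjust them to your own setup):

import configparser

# Read the API keys from a local configuration file kept out of version control.
config = configparser.ConfigParser()
config.read('config.ini')

API_KEY = config['TWITTER']['API_KEY']
API_SECRET_KEY = config['TWITTER']['API_SECRET_KEY']
ACCESS_TOKEN = config['TWITTER']['ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = config['TWITTER']['ACCESS_TOKEN_SECRET']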
- Create an Authorization Token using Key Credentials
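A sketch of what this function could look like, using Twitter's application-only authentication endpoint (the function name create_bearer_token is an assumption):

import base64
import requests
from urllib.parse import quote

def create_bearer_token(api_key, api_secret_key):
    # URL-encode and base64-encode the consumer credentials, as the
    # application-only (OAuth2) flow requires.
    credentials = f"{quote(api_key, safe='')}:{quote(api_secret_key, safe='')}"
    b64_credentials = base64.b64encode(credentials.encode('utf-8')).decode('utf-8')

    headers = {
        'Authorization': f'Basic {b64_credentials}',
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    }
    # Exchange the consumer credentials for a bearer token.
    response = requests.post('https://api.twitter.com/oauth2/token',
                             headers=headers,
                             data={'grant_type': 'client_credentials'})
    response.raise_for_status()
    return response.json()['access_token']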
Basically, in this function we request authorization to use the Twitter API. This authorization is based on the OAuth2 protocol.
After this, we are ready to make a request and start collecting tweets for an account.
- Tweets Request Function
You can find all the available requests in the official documentation. For the purposes of this demo, the request is the User Timeline.
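Here is a sketch of what that request function could look like; the exact tweet fields stored and the tweets_alert_insert statement (defined in the Additional Information section below) are assumptions:

import psycopg2
import requests

def tweets_request(bearer_token, search_parameters, cur, conn):
    # Standard v1.1 User Timeline endpoint.
    url = 'https://api.twitter.com/1.1/statuses/user_timeline.json'
    headers = {'Authorization': f'Bearer {bearer_token}'}

    response = requests.get(url, headers=headers, params=search_parameters)
    response.raise_for_status()

    for tweet in response.json():
        # The stored columns are assumptions; adapt them to your tweets_alert table.
        tweet_data = (tweet['id'], tweet['created_at'], tweet['text'])
        try:
            cur.execute(tweets_alert_insert, tweet_data)
        except psycopg2.Error as e:
            print(e)

    # Persist all inserts for this run.
    conn.commit()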
I decided to insert all this data into a Postgres database. That’s why the function above receives the cur and conn variables.
All the configuration of the database is out of the scope of this demo, but I’m going to provide the code to connect to the Postgres database.
- Connection and Main Function to execute the script
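Below is a sketch of that part, assuming the helper functions from the sketches above and a local Postgres instance (the database section of config.ini and the connection details are assumptions):

import configparser
import psycopg2

def main():
    config = configparser.ConfigParser()
    config.read('config.ini')

    # Connect to the Postgres database.
    conn = psycopg2.connect(host='localhost',
                            dbname=config['POSTGRES']['DB_NAME'],
                            user=config['POSTGRES']['USER'],
                            password=config['POSTGRES']['PASSWORD'])
    cur = conn.cursor()

    bearer_token = create_bearer_token(config['TWITTER']['API_KEY'],
                                       config['TWITTER']['API_SECRET_KEY'])

    # Account, number of tweets, retweets, etc. are all configurable here.
    search_parameters = {
        'screen_name': 'NJTRANSIT_NEC',
        'count': 200,
        'include_rts': False
    }

    tweets_request(bearer_token, search_parameters, cur, conn)

    cur.close()
    conn.close()


if __name__ == '__main__':
    main()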
This is where you can provide the account you want to collect tweets from, how many tweets you want to collect, whether to include retweets, etc. The configuration is up to you; feel free to change and modify it as much as you want. Following the official documentation is a good way to figure out what you can achieve with this.
Additional Information
To complete this project, I’m also going to provide the code that I’m using to insert data into our database.
The first query helps us insert data into the tweets_alert table. As you can see in the code above (specifically in the tweets_request function), I’m using a variable called tweets_alert_insert to make the insert.
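The statement itself depends on how you defined the table; assuming a tweets_alert table with tweet_id, created_at and text columns, a minimal version could look like this:

tweets_alert_insert = """
    INSERT INTO tweets_alert (tweet_id, created_at, text)
    VALUES (%s, %s, %s);
"""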
try:
    cur.execute(tweets_alert_insert, tweet_data)
except psycopg2.Error as e:
    # Log the failed insert and keep processing the remaining tweets.
    print(e)
After that, I decided to make this project more robust and retrieve the last ID inserted into the tweets_alert table. Why did I do that? Because it allows us to execute the script every day and insert only new tweets. That’s why our Twitter API request has a parameter called since_id, which tells the API to return only tweets more recent than that ID.
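Retrieving that last ID is a single query against the table; here is a sketch, assuming the tweet_id column from the insert statement above:

def get_last_tweet_id(cur):
    # Highest tweet ID already stored, or None if the table is still empty.
    cur.execute('SELECT MAX(tweet_id) FROM tweets_alert;')
    return cur.fetchone()[0]

id_last_tweet = get_last_tweet_id(cur)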
search_parameters = {
    'screen_name': 'NJTRANSIT_NEC',
    'since_id': id_last_tweet,
    'count': 200,
    'include_rts': False
}
After executing the script, we can see the data and verify that everything is working fine.

You can find the entire code on my Github page 👨🏻💻
What’s Next?
There are so many things you can improve in this demo. For example, you can create an Airflow DAG with this code and schedule it to run every day, so tweets keep getting collected without you running the script on your own, or, more simply, create a cron job. You can even deploy your DAG using Cloud Composer and have everything running on GCP.
I would love to hear improvements, questions, requests, etc. from you, so please feel free to contact me. 🙌🏼
You can find me on:
- Github: https://github.com/StriderKeni
- Medium: https://medium.com/@kennycontreras