Welcome to the latest edition of the Humans of Data newsletter.
In this issue, we talk about:
- Building ETL pipelines using Apache Airflow
- Creating effective data science resumes
- Understanding different types of databases
- Boosting collaboration among the humans of data
- Common mistakes that data scientists make
As usual, towards the end, you’ll find a list of must-attend events.
Our top reads for you
How to create a workflow in Apache Airflow to track disease outbreaks in India
In a perfect world, all data would be organized as CSV files. In the real world though, it’s often in the form of PDFs, audio files, videos or comments, among others—formats that make it a bit difficult to extract data. 😔
When our team discovered a great data source for detecting and (hopefully) preventing disease outbreaks in India, we were ecstatic. And then bummed.
The catch: the data was available only as PDFs, a format infamous for being difficult to scrape. So Vinayak and team built an ETL (Extract, Transform, Load) pipeline to extract the PDF data, transform it into CSVs and then load the CSV data into a data store.
Read on to find out how they tackled our PDF data problem using Airflow. 🚀
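To make the three ETL steps concrete, here is a minimal sketch in plain Python. It is not the team's actual pipeline: the sample text, the pipe-separated layout, the function names and the in-memory "data store" are all assumptions made for illustration. In Airflow, each step would typically become its own task in a DAG.

```python
import csv
import io

# Hypothetical text, standing in for what a PDF-to-text step might produce.
# The real source's layout and values are not shown in this newsletter.
SAMPLE_PDF_TEXT = """\
State | Disease | Cases
State A | Disease X | 12
State B | Disease Y | 7
"""


def extract(pdf_text: str) -> list:
    """Extract: keep the non-empty lines of text pulled from the PDF."""
    return [line.strip() for line in pdf_text.splitlines() if line.strip()]


def transform(lines: list) -> str:
    """Transform: convert pipe-separated lines into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        writer.writerow([cell.strip() for cell in line.split("|")])
    return buffer.getvalue()


def load(csv_text: str, store: list) -> int:
    """Load: parse the CSV and append its rows to an in-memory store."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    store.extend(rows)
    return len(rows)


def run_pipeline(pdf_text: str, store: list) -> int:
    """Chain the three steps and return the number of rows loaded."""
    return load(transform(extract(pdf_text)), store)


if __name__ == "__main__":
    data_store = []  # stand-in for a real database or warehouse
    print(run_pipeline(SAMPLE_PDF_TEXT, data_store), "rows loaded")
```

Keeping each step as a separate function means a failed step can be retried on its own, which is the same idea behind splitting a workflow into tasks.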
The 6 best ways for student resumes to get noticed by data science employers
Data Science Weekly
Getting a data science job might feel like a catch-22—you get hired only if you have relevant experience, but to get that experience you need a job as a data scientist. 😩
It’s frustrating, we know. Here’s a resource with six potential solutions to your catch-22.
From working on independent projects to building a profile on GitHub, find out how you can show off your expertise (and bag that dream data science job!). 😎
Comparing database types: How database types evolved to meet different needs
Databases are where you organize data. You might have come across several types of databases—relational, columnar, graph, key-value, document … the list goes on. No wonder it can get confusing. 😓
We get you. We might have something that can help.
Here’s a handy resource that explores the various database types and the advantages and limitations that each type offers. 🙏
Collaboration between data engineers, data analysts and data scientists
The humans of data are a diverse lot. Each data team brings together people with varying skill sets, which is wonderful for problem-solving.
However, having diverse team members can also lead to a collaboration problem.
While there aren’t any perfect solutions, we have one that might help. This article tells the story of the data team at Dailymotion and how they learned to collaborate better and work more efficiently. ⭐
Did you know?
Mistakes data scientists make
We’re starting this sneak peek with a quote directly from the article.
“If you don’t make mistakes, you’re not working on hard enough problems. And that’s a big mistake.”
Frank Wilczek, winner of the 2004 Nobel Prize in Physics
Making mistakes is part of the learning process. And data scientists are no exception. What mistakes do data scientists make? Adam Green has a list.
From using too many metrics and models to not working with a sample data set, he goes through the most common ones.
While this isn’t an exhaustive list, it might help you build a rough checklist while you work on your data sets (and avoid these mistakes). 💡
Save the dates
Metis Data Science Bootcamp [Free] | October 19, 2019 | Chicago, USA
If you’re looking for a free, one-day event to learn Python (emphasis on the Pandas and Scikit-learn libraries) and you’re in Chicago, here’s an event you shouldn’t miss!
Senior Data Scientist John Tate is conducting a session that introduces machine learning and the basic ML algorithms that a data scientist should know.
Since the seats are limited, we suggest you don’t wait too long and sign up right away. Register here.
ODSC India [Call for Papers] | September 16-19, 2020 | Bengaluru, India
The Open Data Science Conference (ODSC) is one of the biggest data science events in India. They’re accepting proposals for potential sessions until October 9, 2019. Here’s your opportunity to share your story and experience in the data universe.
Worried about expenses? Don’t be. The organizers will take care of flights, accommodation and event
GraphConnect 2020 | April 20-22, 2020 | New York, USA
One of the largest graph database events is here. (Early bird pricing alert!)
Here’s your chance to meet graph database developers, architects and CTOs under one roof. Past speakers were from major organizations such as Adobe, Citi, Nordstrom, Georgia Tech and more. Register here.
P.S. Their CFP (Call for Proposals) is open. It’ll close on November 1, 2019. So if you have a cool story to share, submit your proposal here.
That’s all, folks! Stay tuned for future editions of the Humans of Data newsletter.