Of Airflow, Database Types, Effective Data Teams — October 2019 - Atlan

Welcome to the latest edition of the Humans of Data newsletter.

In this issue, we talk about:

Building ETL pipelines using Apache Airflow
Creating effective data science resumes
Understanding different types of databases
Boosting collaboration among the humans of data
Common mistakes that data scientists make

As usual, towards the end, you’ll find a list of must-attend events.

Our top reads for you

How to create a workflow in Apache Airflow to track disease outbreaks in India

Vinayak Mehta

In a perfect world, all data would be organized as CSV files. In the real world though, it’s often in the form of PDFs, audio files, videos or comments, among others—formats that make it a bit difficult to extract data. 😔

When our team discovered a great data source for detecting and (hopefully) preventing disease outbreaks in India, we were ecstatic. And then bummed.

The data was available as PDFs, a format infamous for being difficult to scrape. So Vinayak and team built an ETL (Extract, Transform, Load) pipeline that extracts the data from the PDFs, transforms it into CSVs and then loads the CSV data into a data store.

Read on to find out how they tackled our PDF data problem using Airflow. 🚀
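If you'd like a feel for the three ETL steps before diving in, here is a minimal sketch in plain Python. It is not the pipeline from the article: the function names, the sample rows, the `outbreaks` table and the use of SQLite as the data store are all assumptions made for illustration, and the PDF parsing step is stubbed out with hard-coded placeholder rows.

```python
import csv
import io
import sqlite3


def extract():
    """Stand-in for the PDF extraction step.

    Returns hypothetical placeholder rows; a real pipeline would parse
    these out of a PDF table with a PDF library.
    """
    return [
        ["State", "Disease", "Cases"],
        [" Kerala ", "Dengue", "12"],
        ["Bihar", " Cholera", "7"],
    ]


def transform(rows):
    """Strip stray whitespace from every cell and serialize the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([cell.strip() for cell in row])
    return buffer.getvalue()


def load(csv_text, conn):
    """Load the CSV text into a SQLite table (the assumed data store)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbreaks "
        "(state TEXT, disease TEXT, cases INTEGER)"
    )
    conn.executemany(
        "INSERT INTO outbreaks VALUES (?, ?, ?)",
        [(state, disease, int(cases)) for state, disease, cases in reader],
    )
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform and load in order."""
    load(transform(extract()), conn)
```

Calling `run_pipeline(sqlite3.connect(":memory:"))` runs all three steps against an in-memory database. In a workflow tool such as Airflow, each of these functions would typically become its own task, so a failed step can be retried without rerunning the others.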

The 6 best ways for student resumes to get noticed by data science employers

Data Science Weekly

Getting a data science job might feel like a catch-22—you get hired only if you have relevant experience, but to get that experience you need a job as a data scientist. 😩

It’s frustrating, we know. Here’s a resource with six potential solutions to your catch-22.

From working on independent projects to building a profile on GitHub, find out how you can show off your expertise (and bag that dream data science job!). 😎

Comparing database types: How database types evolved to meet different needs

Justin Ellingwood

Databases are where you organize data. You might have come across several types of databases—relational, columnar, graph, key-value, document … the list goes on. No wonder it can get confusing. 😓

We get you. We might have something that can help.

Here’s a handy resource that explores the various database types and the advantages and limitations that each type offers. 🙏
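To make the differences a little more concrete, here is a rough sketch of how the same two made-up records might be shaped under each model. Plain Python data structures stand in for the databases; the names, keys and the `MEMBER_OF` relationship are hypothetical, and real systems add storage, indexing and query layers that this leaves out.

```python
import json

# Relational: a fixed set of columns, one tuple per row.
columns = ("id", "name", "team")
rows = [(1, "Asha", "analytics"), (2, "Ravi", "engineering")]

# Columnar: the same table, stored column by column instead of row by row.
columnar = {col: [row[i] for row in rows] for i, col in enumerate(columns)}

# Key-value: an opaque value (here a JSON string) looked up by a single key.
key_value = {
    f"user:{row[0]}": json.dumps(dict(zip(columns, row))) for row in rows
}

# Document: self-describing records that can nest other structures.
documents = [
    {"id": 1, "name": "Asha", "team": {"name": "analytics"}},
    {"id": 2, "name": "Ravi", "team": {"name": "engineering"}},
]

# Graph: nodes plus explicit, named relationships between them.
nodes = {"u1": "Asha", "u2": "Ravi", "t1": "analytics", "t2": "engineering"}
edges = [("u1", "MEMBER_OF", "t1"), ("u2", "MEMBER_OF", "t2")]
```

Reading every name is a single lookup in the columnar layout (`columnar["name"]`), while fetching one whole record is a single lookup in the key-value layout (`key_value["user:1"]`). That trade-off between access patterns is one way to think about why so many database types exist.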

Collaboration between data engineers, data analysts and data scientists

Germain Tanguy

The humans of data are a diverse lot. Each data team brings together people with varying skill sets, which is wonderful for problem-solving.

However, having diverse team members can also lead to a collaboration problem.

There’s no perfect fix, but this article offers one approach. It tells the story of the data team at Dailymotion and how they learned to collaborate better and work more efficiently. ⭐

Did you know?

Mistakes data scientists make

Adam Green

We’re starting this sneak peek with a quote directly from the article.

“If you don’t make mistakes, you’re not working on hard enough problems. And that’s a big mistake.”

Frank Wilczek, winner of the 2004 Nobel Prize in Physics

Making mistakes is part of the learning process. And data scientists are no exception. What mistakes do data scientists make? Adam Green has a list.

From too many metrics and models to not working with a sample data set, he walks through the most common ones.

While this isn’t an exhaustive list, it might help you build a rough checklist while you work on your data sets (and avoid these mistakes). 💡

Save the dates

Metis Data Science Bootcamp [Free] | October 19, 2019 | Chicago, USA

If you’re looking for a free, one-day event to learn Python (emphasis on the Pandas and Scikit-learn libraries) and you’re in Chicago, here’s an event you shouldn’t miss!

Senior Data Scientist John Tate is conducting a session that introduces machine learning and the basic ML algorithms that a data scientist should know.

Since the seats are limited, we suggest you don’t wait too long and sign up right away. Register here.

ODSC India [Call for Papers] | September 16-19, 2020 | Bengaluru, India

The Open Data Science Conference (ODSC) is one of the biggest data science events in India. They’re accepting proposals for potential sessions until October 9, 2019, so here’s your opportunity to share your story and experience in the data universe.

Worried about expenses? Don’t be. The organizers will take care of flights, accommodation and event passes. If you’re interested, submit your proposal here.

GraphConnect 2020 | April 20-22, 2020 | New York, USA

One of the largest graph database events is here. (Early bird pricing alert!)

Here’s your chance to meet graph database developers, architects and CTOs under one roof. Past speakers were from major organizations such as Adobe, Citi, Nordstrom, Georgia Tech and more. Register here.

P.S. Their CFP (Call for Proposals) is open. It’ll close on November 1, 2019. So if you have a cool story to share, submit your proposal here.

That’s all, folks! Stay tuned for future editions of the Humans of Data newsletter.