<h1 align="center">More than 2000 questions to prepare for a Data Engineer interview.</h1> <h2 align="center"><a href="./content/full.md">Full list of questions</a></h2> <h1 align="center">Interview questions for Data Engineer</h1> <div> <table> <tr> <th colspan="5">Databases and Data Warehouses</th> </tr> <tr> <th>GitHub Repo</th> <th>Official page</th> <th>Questions</th> <th>Description</th> <th>Useful links</th> </tr> <tr> <th><a href="https://github.com/apache/cassandra"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Cassandra"></a></th> <th><a href="https://cassandra.apache.org/_/index.html"><img style="vertical-align:middle" src="img/icon/cassandra.ico" alt="Cassandra"></a></th> <th><a href="./content/cassandra.md">Apache Cassandra</a></th> <th>Cassandra is a distributed, wide-column store, NoSQL database management system.</th> <th><a href="https://github.com/Anant/awesome-cassandra">Awesome Cassandra</a></th> </tr> <tr> <th><a href="https://github.com/greenplum-db/gpdb"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Greenplum"></a></th> <th><a href="https://greenplum.org/"><img style="vertical-align:middle" src="img/icon/greenplum.ico" alt="Greenplum"></a></th> <th><a href="./content/greenplum.md">Greenplum</a></th> <th>Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology.</th> <th><a href="https://github.com/kongyew/awesome-greenplum">Awesome Greenplum</a></th> </tr> <tr> <th><a href="https://github.com/mongodb/mongo"><img style="vertical-align:middle" src="img/icon/github.ico" alt="MongoDB"></a></th> <th><a href="https://www.mongodb.com/"><img style="vertical-align:middle" src="img/icon/mongo.ico" alt="MongoDB"></a></th> <th><a href="./content/mongo.md">MongoDB</a></th> <th>MongoDB is a document-oriented database.</th> <th><a href="https://github.com/ramnes/awesome-mongodb">Awesome MongoDB</a></th> </tr> <tr> <th><a
href="https://github.com/apache/hbase"><img style="vertical-align:middle" src="img/icon/github.ico" alt="HBase"></a></th> <th><a href="https://hbase.apache.org/"><img style="vertical-align:middle" src="img/icon/hbase.ico" alt="HBase"></a></th> <th><a href="./content/hbase.md">Apache HBase</a></th> <th>HBase is an open-source non-relational distributed database.</th> <th><a href="https://github.com/rayokota/awesome-hbase">Awesome HBase</a></th> </tr> <tr> <th><a href="https://github.com/apache/hive"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Hive"></a></th> <th><a href="https://hive.apache.org/"><img style="vertical-align:middle" src="img/icon/hive.ico" alt="Hive"></a></th> <th><a href="./content/hive.md">Apache Hive</a></th> <th>Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.</th> <th><a href="https://github.com/dharmeshkakadia/awesome-hive">Awesome Hive</a></th> </tr> <tr> <th colspan="2"><a href="https://aws.amazon.com/dynamodb/"><img style="vertical-align:middle" src="img/icon/dynamodb.ico" alt="Amazon DynamoDB"></a></th> <th><a href="./content/dynamodb.md">Amazon DynamoDB</a></th> <th>Amazon DynamoDB is a fully managed proprietary NoSQL database service.</th> <th><a href="https://github.com/alexdebrie/awesome-dynamodb">Awesome DynamoDB</a> <a href="https://github.com/donnemartin/awesome-aws">Awesome AWS</a></th> </tr> <tr> <th colspan="2"><a href="https://aws.amazon.com/redshift"><img style="vertical-align:middle" src="img/icon/redshift.ico" alt="Amazon Redshift"></a></th> <th><a href="./content/redshift.md">Amazon Redshift</a></th> <th>Amazon Redshift is a data warehouse product.</th> <th><a href="https://github.com/awslabs/amazon-redshift-utils">Amazon Redshift Utilities</a> <a href="https://github.com/donnemartin/awesome-aws">Awesome AWS</a></th> </tr> <tr> <th colspan="2"><a href="https://cloud.google.com/bigquery"><img style="vertical-align:middle"
src="img/icon/bigquery.ico" alt="BigQuery"></a></th> <th><a href="./content/bigquery.md">BigQuery GCP</a></th> <th>BigQuery is a fully-managed, serverless data warehouse.</th> <th><a href="https://github.com/coty/awesome-bigquery">Awesome BigQuery</a></th> </tr> <tr> <th colspan="2"><a href="https://cloud.google.com/bigtable"><img style="vertical-align:middle" src="img/icon/bigtable.ico" alt="Bigtable"></a></th> <th><a href="./content/bigtable.md">Bigtable GCP</a></th> <th>Bigtable is a fully managed wide-column and key-value NoSQL database service.</th> <th><a href="https://github.com/zrosenbauer/awesome-bigtable">Awesome Bigtable</a></th> </tr> <th colspan="5"><a></a></th> <tr> <th colspan="5">Data Formats</th> </tr> <tr> <th><a href="https://github.com/apache/avro"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Avro"></a></th> <th><a href="https://avro.apache.org/"><img style="vertical-align:middle" src="img/icon/avro.ico" alt="Avro"></a></th> <th><a href="./content/avro.md">Apache Avro</a></th> <th>Avro is a row-oriented remote procedure call and data serialization framework.</th> <th><a href="https://github.com/m0nhawk/awesome-avro">Awesome Avro</a></th> </tr> <tr> <th><a href="https://github.com/apache/parquet"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Parquet"></a></th> <th><a href="https://parquet.apache.org/"><img style="vertical-align:middle" src="img/icon/parquet.ico" alt="Parquet"></a></th> <th><a href="./content/parquet.md">Apache Parquet</a></th> <th>Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval.</th> <th><a href="TODO">TODO</a></th> </tr> <tr> <th><a href="https://github.com/delta-io"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Delta"></a></th> <th><a href="https://delta.io/"><img style="vertical-align:middle" src="img/icon/deltalake.ico" alt="Delta"></a></th> <th><a href="./content/delta.md">Delta</a></th> <th>Delta Lake is a 
storage framework that enables building a Lakehouse architecture with compute engines.</th> <th><a href="https://github.com/MrPowers/delta-examples">Delta examples</a></th> </tr> <th colspan="5"><a></a></th> <tr> <th colspan="5">Big Data Frameworks</th> </tr> <tr> <th><a href="https://github.com/apache/airflow"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Airflow"></a></th> <th><a href="https://airflow.apache.org/"><img style="vertical-align:middle" src="img/icon/airflow.ico" alt="Airflow"></a></th> <th><a href="./content/airflow.md">Apache Airflow</a></th> <th>Apache Airflow is a workflow management platform for data engineering pipelines.</th> <th><a href="https://github.com/jghoman/awesome-apache-airflow">Awesome Airflow</a></th> </tr> <tr> <th><a href="https://github.com/apache/flume"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Flume"></a></th> <th><a href="https://flume.apache.org/"><img style="vertical-align:middle" src="img/icon/flume.ico" alt="Flume"></a></th> <th><a href="./content/flume.md">Apache Flume</a></th> <th>Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.</th> <th><a href="TODO">TODO</a></th> </tr> <tr> <th><a href="https://github.com/apache/hadoop"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Hadoop"></a></th> <th><a href="https://hadoop.apache.org/"><img style="vertical-align:middle" src="img/icon/hadoop.ico" alt="Hadoop"></a></th> <th><a href="./content/hadoop.md">Apache Hadoop</a></th> <th>Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.</th> <th><a href="https://github.com/youngwookim/awesome-hadoop">Awesome Hadoop</a></th> </tr> <tr> <th><a href="https://github.com/apache/impala"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Impala"></a></th>
<th><a href="https://impala.apache.org/"><img style="vertical-align:middle" src="img/icon/impala.ico" alt="Impala"></a></th> <th><a href="./content/impala.md">Apache Impala</a></th> <th>Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop.</th> <th><a href="TODO">TODO</a></th> </tr> <tr> <th><a href="https://github.com/apache/kafka"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Kafka"></a></th> <th><a href="https://kafka.apache.org/"><img style="vertical-align:middle" src="img/icon/kafka.ico" alt="Kafka"></a></th> <th><a href="./content/kafka.md">Apache Kafka</a></th> <th>Apache Kafka is a distributed event store and stream-processing platform.</th> <th><a href="https://github.com/semantalytics/awesome-kafka">Awesome Kafka</a></th> </tr> <tr> <th><a href="https://github.com/apache/nifi"><img style="vertical-align:middle" src="img/icon/github.ico" alt="NiFi"></a></th> <th><a href="https://nifi.apache.org/"><img style="vertical-align:middle" src="img/icon/nifi.ico" alt="NiFi"></a></th> <th><a href="./content/nifi.md">Apache NiFi</a></th> <th>Apache NiFi is a software project designed to automate the flow of data between software systems.</th> <th><a href="https://github.com/jfrazee/awesome-nifi">Awesome NiFi</a></th> </tr> <tr> <th><a href="https://github.com/apache/spark"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Spark"></a></th> <th><a href="https://spark.apache.org/"><img style="vertical-align:middle" src="img/icon/spark.ico" alt="Spark"></a></th> <th><a href="./content/spark.md">Apache Spark</a></th> <th>Apache Spark is a unified analytics engine for large-scale data processing.</th> <th><a href="https://github.com/awesome-spark/awesome-spark">Awesome Spark</a></th> </tr> <tr> <th><a href="https://github.com/apache/flink"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Flink"></a></th> <th><a href="https://flink.apache.org/"><img
style="vertical-align:middle" src="img/icon/flink.ico" alt="Flink"></a></th> <th><a href="./content/flink.md">Apache Flink</a></th> <th>Apache Flink is a unified stream-processing and batch-processing framework.</th> <th><a href="https://github.com/wuchong/awesome-flink">Awesome Flink</a></th> </tr> <tr> <th><a href="https://github.com/kubernetes/kubernetes"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Kubernetes"></a></th> <th><a href="https://kubernetes.io/"><img style="vertical-align:middle" src="img/icon/kuber.ico" alt="Kubernetes"></a></th> <th><a href="./content/kubernetes.md">Kubernetes</a></th> <th>Kubernetes is a system for managing containerized applications across multiple hosts.</th> <th><a href="https://github.com/ramitsurana/awesome-kubernetes">Awesome Kubernetes</a></th> </tr> <th colspan="5"><a></a></th> <tr> <th colspan="5">Cloud providers</th> </tr> <tr> <th><a href="https://github.com/aws"><img style="vertical-align:middle" src="img/icon/github.ico" alt="AWS"></a></th> <th><a href="https://aws.amazon.com/"><img style="vertical-align:middle" src="img/icon/aws.ico" alt="AWS"></a></th> <th><a href="./content/aws.md">Amazon Web Services</a></th> <th>Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions.</th> <th><a href="https://github.com/donnemartin/awesome-aws">Awesome AWS</a></th> </tr> <tr> <th><a href="https://github.com/Azure"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Azure"></a></th> <th><a href="https://azure.microsoft.com/"><img style="vertical-align:middle" src="img/icon/azure.ico" alt="Azure"></a></th> <th><a href="./content/azure.md">Microsoft Azure</a></th> <th>Microsoft Azure is Microsoft's public cloud computing platform.</th> <th><a href="https://github.com/kristofferandreasen/awesome-azure">Awesome Azure</a></th> </tr> <tr> <th><a href="https://github.com/GoogleCloudPlatform"><img style="vertical-align:middle" src="img/icon/github.ico"
alt="GCP"></a></th> <th><a href="https://cloud.google.com/"><img style="vertical-align:middle" src="img/icon/gcp.ico" alt="GCP"></a></th> <th><a href="./content/gcp.md">Google Cloud Platform</a></th> <th>Google Cloud Platform is a suite of cloud computing services.</th> <th><a href="https://github.com/GoogleCloudPlatform/awesome-google-cloud">Awesome GCP</a></th> </tr> <th colspan="5"><a></a></th> <tr> <th colspan="5"><b>Theory</b></th> </tr> <tr> <th colspan="2"><a href="./content/dwha.md"><img style="vertical-align:middle" src="img/icon/dwha.ico" alt="DWHA"></a></th> <th><a href="./content/dwha.md">DWH Architectures</a></th> <th>A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise.</th> <th><a href="https://github.com/numetriclabz/awesome-db">Awesome databases</a></th> </tr> <tr> <th colspan="2"><a href="./content/data-structure.md"><img style="vertical-align:middle" src="img/icon/datastruct.ico" alt="Data Structures"></a></th> <th><a href="./content/data-structure.md">Data Structures</a></th> <th>A data structure is a specialized format for organizing, processing, retrieving and storing data.
</th> <th><a href="TODO">TODO</a></th> </tr> <tr> <th colspan="2"><a href="./content/sql.md"><img style="vertical-align:middle" src="img/icon/sql.ico" alt="SQL"></a></th> <th><a href="./content/sql.md">SQL</a></th> <th>SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).</th> <th><a href="https://github.com/danhuss/awesome-sql">Awesome SQL</a></th> </tr> <th colspan="5"><a></a></th> <tr> <th colspan="5">Data visualization tools/BI</th> </tr> <tr> <th colspan="2"><a href="./content/tableau.md"><img style="vertical-align:middle" src="img/icon/tableau.ico" alt="Tableau"></a></th> <th><a href="./content/tableau.md">Tableau</a></th> <th>Tableau is a powerful data visualization tool used in business intelligence.</th> <th><a href="TODO">TODO</a></th> </tr> <tr> <th colspan="2"><a href="./content/looker.md"><img style="vertical-align:middle" src="img/icon/looker.ico" alt="Looker"></a></th> <th><a href="./content/looker.md">Looker</a></th> <th>Looker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time.</th> <th><a href="TODO">TODO</a></th> </tr> <tr> <th><a href="https://github.com/apache/superset"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Superset"></a></th> <th><a href="https://superset.apache.org/"><img style="vertical-align:middle" src="img/icon/superset.ico" alt="Apache Superset"></a></th> <th><a href="./content/superset.md">Apache Superset</a></th> <th>Superset is a modern data exploration and data visualization platform.</th> <th><a href="TODO">TODO</a></th> </tr> </table> </div> <div> <h2 align="center"> Contribution </h2> <h3>Please contribute to this repository to help make it better. Any change, such as a new question, a code improvement, or a documentation improvement, is very welcome.</h3> </div>