Hello Kaggle! :wave:
I summarized the definition of Kaggle and its basic usage after reading Kaggle's Official Document and the Kaggle Guide.
<br>
I hope it will help those who, like me, have just been introduced to Kaggle.
<br>
If anything needs to be corrected, please leave it in an Issue.
<br>
FYI, the 'Hello Kaggle' document rarely deals with Python programming or machine learning theory and focuses on Kaggle usage.
<br>
For those of you who are looking for programming, data science, and machine learning materials, here are some links that have helped me.<br>
- DATA SCIENCE ROADMAP 2020
- data engineer roadmap by datastacktv
- My Data Science Online Learning Journey on Coursera <br>
<br>
What is Kaggle?
- Kaggle is a platform that hosts data analysis competitions.
- Competitions are commonly hosted by companies, which provide data to be analyzed for their research challenges or key services. <br>
- Driven by the artificial intelligence and machine learning boom, the number of participants has kept increasing, and Kaggle was acquired by Google, a subsidiary of Alphabet, in 2017.
- Since the acquisition, Kaggle has become a critical site for data scientists and engineers, not just a platform.
Kaggler? Kaggling?
- Just as searching on Google is called Googling, participating in a Competition is called Kaggling, and Kaggle's users are called Kagglers. <br>
Kaggle Service and Features
- Jobs
  - The Jobs service was originally provided, but it ended on December 22, 2020.<br> Simply put, this is because it had few users.<br> For more information, see https://www.kaggle.com/jobs-board-closed. <br>
- Courses
  - Provides practical lectures on Python, machine learning, visualization, and so on. Kaggle's courses can be quite useful if you haven't learned these topics step by step or if you studied from an outdated course.
  - All lectures are in English, free of charge, and come with a certificate of completion. <br>
- English
  - Data scientists from all over the world gather on Kaggle, so English is used by default.
  - Competition notices, Datasets, and Discussions are also in English. <br>
  - The winners of Competitions come from a variety of countries, such as the USA, Korea, Russia, China, and India.
- Programming Language
  - Python and R are the most widely used. <br>
Required Kaggling Knowledge
| Purpose | Knowledge Required |
| --- | --- |
| Competition participation | Python, R, data analysis |
| Competition organizer | Data analysis, English |
| Discussion with Kagglers | English |
| Learning through Courses | English |
Preparation before becoming a Kaggler
- Required: Internet access, Python and R, and a PC
- Recommended: a server or workstation with a GPU, and a high-capacity HDD or SSD
<br>
<br>
How is Kaggle used?
Infrastructure for data analytics
- Kaggle is web-based and provides tools for data analysis (Notebook).
- It is also a community where a variety of Kagglers compete and cooperate. <br>
Notebook
- The programming environment for data analysis provided by Kaggle.
- A SaaS environment that runs the code written in your Notebook on a server.
- Because it provides a programming environment, there is no need to build a separate development environment (no Python or Anaconda installation, etc.).
- It is similar to Jupyter Notebook.
- By default it provides a 4-core CPU + 16GB RAM. The GPU server provides a 2-core CPU + GPU + 13GB RAM.
- It is provided free of charge, and the GPU can be used for 30 hours a week. <br>
Dataset
- The first thing to do when developing a machine learning-based data analysis program is to prepare a Dataset.
- Datasets are either opened for academic purposes or created and released by Kagglers.
- If you don't want to share your Dataset, you can use the Private setting to keep it hidden from others.
- Once a Dataset or Notebook is set to Public, the Apache 2.0 License is applied, so decide carefully. <br>
Company Training
- Example: staff training for creating neural network-based machine learning programs
  - Sign up for Kaggle
  - Employees copy and run the moderator's Notebook
  - Modify the neural network model in the Notebook
  - Submit the results of the modified model to a Competition and check the score
- What if we didn't use Kaggle?
  - Set up a development environment on each training computer
  - Distribute example machine learning programs (neural network models)
  - Create a program that evaluates the model's execution results by converting them into scores
  - Check the evaluation score of the executed model
  - Modify the neural network model
  - Confirm that the score varies depending on the outcome of the run <br>
- With Kaggle, building a development environment, checking the score, and deployment are much easier and less expensive.
Discussion
- If you don't know something, you can ask in the Site Forums or in the Competition section of Communities.<br>
<br>
Kaggle Competition?
Refer to Competitions Documentation. <br>
Featured, the most common Competition
- These are difficult competitions, generally held for commercial purposes.
- Most Kagglers participate in this type. In the competitions held so far, prizes have ranged from $100 to $1,500,000. <br>
Research
- These mainly deal with research topics and generally do not have prize money or rewards. (At the time of writing, however, all of the ongoing Research Competitions had prize money.)
- Instead, you can do research while discussing with Kagglers who are less competitive and more intellectually curious. <br>
Getting Started for New Kagglers
- The Competitions shown here are for beginners.
- In particular, Titanic: Machine Learning from Disaster, House Prices: Advanced Regression Techniques, and Digit Recognizer are the three most recommended and helpful competitions for those new to machine learning. <br>
Playground
for data scientists and engineers
- These Competitions are held mainly on topics that data scientists and engineers might find interesting.
- Playground competitions are not easy. They usually cover recent academic or technical issues and public social issues.
- In some cases, the organizers offer prize money or rewards. <br>
Recruitment for job opportunities
- These are hosted by companies, and the prize is usually a job interview opportunity. Participants can upload a resume at the end of the Competition. <br>
Annual Competition held regularly
- Kaggle has several regularly held Competitions. You can find them on the current Kaggle site. <br>
Analytics to effectively explain the results
- This type is not explained in the Documentation, so I wrote this after reading the Analytics Competitions that are currently open.
- According to the evaluation and submission sections of each Competition, Analytics entries are scored by people who review a directly submitted notebook.<br> The analyzed data should be presented according to the organizers' requirements. It resembles persuading a company's management through a presentation. <br>
<br>
Getting Started with Kaggle
Sign Up
- Before starting with Kaggle, click the Register button in the upper right to sign up. <br>
Take a look at Kaggle Courses
- If you do not have enough knowledge about machine learning or data analytics, it is a good idea to study the areas you need in Courses, as described above.
- Each course consists of 2 to 8 classes and offers a variety of hands-on examples. <br>
Refer to Kaggle Progression System.<br>
Before I explain how to become a Contributor, I will explain Kaggle Tiers and Medals.
Kaggle Tiers
- Kaggle has a Progression System, which is simply the Kaggler Tier.<br> This rating is a good indicator of your ability as a data scientist.<br> It also intuitively shows how much you've grown.
- The Kaggle Tiers are divided into five levels, and each has conditions to achieve it.
  - Novice
  - Contributor
  - Expert
  - Master
  - Grandmaster
- Kaggle Tier is rated separately for Competitions, Datasets, Notebooks, and Discussion.
- Click the account icon in the upper right and select My Profile to go to the profile page.<br> There you can check your profile information, Kaggle activity, and tiers.<br> <br>
Medal
- Medals show a Kaggler's performance in each field.
  - Kagglers with excellent results in Competitions
  - Kagglers who write and share popular Notebooks
  - Kagglers who share useful Datasets
  - Kagglers who write good Comments <br>
- To become a Contributor, you just need to satisfy the conditions. From Expert upward, however, you must collect the medals required in each discipline.
- Competitions have different medal criteria depending on the number of participating teams.<br>
- Datasets, Notebooks, and Discussion are evaluated by Votes. The more Votes a post has, the more Kagglers have recommended it.<br>
- Note that each post holds only one medal per category.<br> For example, if a Dataset post reaches 20 Votes, its bronze medal is replaced by a silver medal.
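The "one medal per post" rule can be sketched in a few lines of Python. Only the silver threshold of 20 Votes comes from the text above. The bronze and gold thresholds (5 and 50) and the names `DATASET_THRESHOLDS` and `medal_for_votes` are assumptions I made for illustration, so please check the Progression System page for the current values.

```python
# Vote thresholds for a Dataset post, highest medal first.
# Only silver = 20 comes from the text above; bronze = 5 and gold = 50
# are assumptions made for this sketch.
DATASET_THRESHOLDS = [("gold", 50), ("silver", 20), ("bronze", 5)]


def medal_for_votes(votes, thresholds=DATASET_THRESHOLDS):
    """Return the single highest medal a post has earned, or None."""
    for medal, minimum in thresholds:
        if votes >= minimum:
            return medal
    return None


for votes in (3, 5, 19, 20, 50):
    print(votes, medal_for_votes(votes))
```

Because the thresholds are checked from the highest medal down, a post never holds two medals at once: reaching the next threshold simply replaces the previous medal.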
Being Contributor
1. Adding User Profile Information
- Go to your profile, click Edit Profile, and enter the following:
  - Bio (self-introduction)
  - Occupation
  - Organization
  - City
- In addition, you can freely set a profile image and Social Media links. <br>
2. SMS Verification
- Click Phone Verification on the profile screen.
- Fill in the Country Code and Phone Number, check the Not a Robot box, and click Send Code.
- Enter the code you receive and click Verify to complete verification. <br>
3. Run Script
- You can achieve this by learning in Courses or by creating your own Notebook and executing any code.
- Step 4, Participate in the Competition, also runs a notebook, so you can skip this step. <br>
4. Participate in the Competition
- Select one Competition in the 'Getting Started' category.
- Once you enter it, you will see a menu in the middle of the screen.<br>
- Click 'Notebooks' there and take a look at other people's notebooks. <br>
- Pick one notebook and open it. In the upper right corner you will see a copy button. Click it to copy the notebook. <br>
- Once the copy is complete, click Save Version in the upper right corner.
  - Version Name: You can enter a name.
  - Version Type: There are two options, Quick Save and Save & Run All (Commit). Quick Save saves without executing, while Save & Run All (Commit) saves and executes. <br>
- Select Save & Run All here and press the Save button.
- Go back to your profile and click Notebook to see the notebook you just copied.<br> When you click this notebook, there is an Output item in the right menu.<br> Select Submission.csv, which appears under Output, and click Submit to Competition on the right.
- The screen then moves to the Leaderboard menu, and the submitted file is scored automatically.<br> After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
5. Comment on other people's posts or comments and cast an upvote (Make 1 comment & Cast 1 upvote)
- In Discussion, choose the topic you want and click any article you are interested in (Getting Started in Site Forums is recommended).
- Read it carefully and write a comment. If the post is useful or you like it, press Vote as well. <br>
6. Now you are a Contributor!
<br>
Wait!
- Let me add one more thing: Kaggle Rankings.
- Rankings are separated into Competitions, Datasets, Notebooks, and Discussion.
- On the Competitions ranking page, you can also check how many people are in each tier. <br>
<br>
Getting to know Notebook
Please re-read here for a brief introduction to Notebook!
What can you do with your Notebook?
- Programming for data analysis is the primary purpose, and the programs you create run on the Kaggle server.
- You can submit to a Competition or share your Notebook with other Kagglers. Some Notebooks are shared only for training or to demonstrate skills.
- Use Code Cells and Markdown Cells to write code together with descriptions of the code, text, images, and so on. <br> How to use Markdown<br> Markdown emoji-cheat-sheet<br> I referred to the two links above when I first used Markdown, and I still look at the emoji sheet whenever I need it. <br>
Create and Use Notebook
- Go to the Notebook menu and click the button for creating a new notebook in the upper right corner.
- Kaggle Notebook has two types: Script and Notebook.
  - Script is a way of writing and executing code as in a commonly used code editor.
  - Notebook is an interactive development environment similar to Jupyter Notebook. Its characteristic is that you can divide the code into cells and execute only the cells you want.
- Press File in the upper left corner and hover your cursor over Edit Type to select the type. You can also choose between Python and R under Language.<br>
- You can change the name by clicking the title field at the top left.<br> <br>
- The first time you create a Notebook, you will see the following code:

```python
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```

  The above code loads the NumPy and pandas libraries and then lists the files under the /kaggle/input directory, from which data files are imported.
- Let's print Hello Kaggle! in the Notebook. Place the cursor in any code cell and press the + Code button.
- Then write code that prints Hello Kaggle! in the new cell.<br> <br>
- Press the play button at the top left of the cell, or press Ctrl + Enter or Shift + Enter, to execute the code. The output appears below the cell.<br>
- The buttons shown on each cell do the following, in order:
  - Move the cell up one position.<br>
  - Move the cell down one position.<br>
  - Delete the cell.<br>
  - Hide or show the cell.<br>
  - Open a menu with additional features.<br> <br>
Various settings for Notebook
- Set Public & Private
  - A Notebook can be made public to share it with other Kagglers. If you don't want to share it, or when you work as a team, you can set it to Private or share it only with specific users.
  - Press the Share button in the upper right corner to open the window for the public or private setting.
  - If Privacy is set to Public, the Notebook is released under the Apache 2.0 License.
  - Use Collaborators to add users as collaborators. <br>
- Settings
  - Language: Sets the programming language, Python or R.
  - Environment: Sets the Docker image. Original keeps the development environment from when the Notebook was created, and Latest Available uses the latest development environment provided by Kaggle.
  - Accelerator: Sets whether to use a GPU or TPU.
  - GPU/TPU Quota: Shows the usage time of the GPU and TPU.
  - Internet: Sets whether or not to connect to the Internet.<br> Setting Internet to On lets you install additional packages. With a Google account, you can also use the BigQuery, Cloud Storage, and AutoML services from GCP (Google Cloud Platform). <br>
How to import Data from Notebook
- Kaggle Notebook can use not only Competition Data but also a variety of shared Datasets.<br> To use them, the files must first be attached to the Notebook in one of the following ways.
- i. How to create a new Notebook
  - Go to the Dataset you want to use and press New Notebook to attach the files automatically.<br> <br>
- ii. How to add to an existing Notebook
  - To add new data to an existing Notebook, first open the Notebook.<br> Then click the + Add Data button in the upper right corner.<br> In the window that appears, search for the desired Dataset and press Add. <br>
- iii. How to upload your own data
  - Go to the Data menu and click the + New Data button in the upper right corner.<br> Then enter a name in Enter Dataset Title and click Select Files to Upload to upload the files. (Compressed files such as zip or tar.gz are also accepted.)<br> Finally, press Create to upload the Dataset. You can then import the uploaded Dataset using method i or ii. <br>
- iv. How to use output data from another Notebook
  - Following method ii, a window appears where you can click the Kernel Output Files tab to use the output data from another Notebook. <br>
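Once a Dataset is attached, its files appear under the input directory of the Notebook. Here is a minimal sketch, using only the standard library, of how you might find the CSV files and peek at the first rows. The helper names `find_csv_files` and `read_rows` are my own, and I assume the files are plain comma-separated text. In a real Notebook you would probably use `pd.read_csv` instead.

```python
import csv
import os


def find_csv_files(root="/kaggle/input"):
    """Return the sorted paths of all .csv files found under `root`."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(".csv"):
                paths.append(os.path.join(dirname, filename))
    return sorted(paths)


def read_rows(path, limit=5):
    """Read up to `limit` data rows of a CSV file as dictionaries."""
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if len(rows) >= limit:
                break
            rows.append(row)
    return rows


# Outside Kaggle the input directory does not exist, so nothing is printed.
for path in find_csv_files():
    print(path, read_rows(path, limit=2))
```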
Use external packages in Notebook
- External packages that are available through pip can be installed by clicking Console at the bottom of the Notebook and running pip install package_name.<br>
- You can also use pip directly in a code cell, as shown in these two examples:

```python
!pip install package_name
```

```python
import os
os.system('pip install package_name')
```
Use Source Code from Dataset in Notebook
- If you add example dataset, which contains the package hello_kaggle, to a Notebook, you can add the ../input/example-dataset/hello_kaggle directory to the module search path.<br> The code to add is as follows:

```python
import sys
sys.path.append("../input/example-dataset/hello_kaggle")
```
<br>
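To see why appending to `sys.path` makes the source code importable, here is a small self-contained sketch. A temporary directory stands in for the attached Dataset, and the module name `greeting` and the helper `add_source_dir` are hypothetical names I picked for illustration.

```python
import importlib
import os
import sys
import tempfile


def add_source_dir(path):
    """Put `path` on the module search path once and return it."""
    path = os.path.abspath(path)
    if path not in sys.path:
        sys.path.append(path)
    return path


# Stand-in for an attached Dataset: a temporary directory with one module.
dataset_dir = tempfile.mkdtemp()
with open(os.path.join(dataset_dir, "greeting.py"), "w") as f:
    f.write("def hello():\n    return 'Hello Kaggle!'\n")

add_source_dir(dataset_dir)
greeting = importlib.import_module("greeting")
print(greeting.hello())
```

Python looks for modules in every directory listed in `sys.path`, so any `.py` file inside the added directory can be imported by its file name.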
Competitions and Notebooks
What else can the Notebook be used for besides data analysis Competition?
- In general, if the goal is to win a prize, Notebooks are shared (made Public) after the Competition is finished.<br> However, Notebooks also provide a way to discuss with other Kagglers even while the Competition is in progress. <br>
How to handle Data File to use in Competition Notebook?
- When working on a Competition, the Data tab is located in the upper right corner of the Notebook. There are three types of files you can click on:
  - train.csv: Training data with the correct answer labels.
  - test.csv: Test data without the correct answer labels.
  - sample_submission.csv: An example of the data to submit. <br>
- View the Data menu of the Competition to see what data each file contains.<br> For example, let's look at Titanic - Machine Learning from Disaster.<br> Click the Data menu to read the overview.<br> Further down, you can select each file to view its data and download it. <br>
- Let's use these files to build a model and create a csv file to submit.<br> (The same steps are explained in 4. Participate in the Competition.) <br>
  - Click Save Version in the upper right corner of the Notebook screen. (If the code has not been executed, select Save & Run All (Commit).)
  - In Save & Run All (Commit), Commit has the same meaning as a Git commit on GitHub.<br> Therefore, Kaggle Notebook lets you refer back to previously written versions of the source code.
- Now return to your profile and click Notebook to see the notebook you just saved.<br> When you click this notebook, there is an Output item in the right menu.<br> Select Submission.csv, which appears under the Output menu, and click Submit to Competition on the right.
- The screen then moves to the Leaderboard menu, and the submitted file is scored automatically.<br> After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
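As a rough idea of what "creating a csv file to submit" means, the sketch below writes a submission that predicts the same value for every row of test.csv. I assume Titanic-style column names (`PassengerId` and `Survived`), and the function name `write_constant_submission` and the tiny stand-in test file are made up for this example. A real Notebook would replace the constant with model predictions.

```python
import csv
import os
import tempfile


def write_constant_submission(test_path, out_path, id_column="PassengerId",
                              target_column="Survived", prediction=0):
    """Write one submission row per test row and return the row count."""
    count = 0
    with open(test_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow([id_column, target_column])
        for row in csv.DictReader(src):
            writer.writerow([row[id_column], prediction])
            count += 1
    return count


# Tiny stand-in for test.csv so the sketch runs anywhere.
workdir = tempfile.mkdtemp()
test_csv = os.path.join(workdir, "test.csv")
with open(test_csv, "w", newline="") as f:
    f.write("PassengerId,Pclass\n892,3\n893,1\n")

submission_csv = os.path.join(workdir, "submission.csv")
print(write_constant_submission(test_csv, submission_csv), "rows written")
```

The important part is the shape of the output: the header and the id column must match the sample submission file of the Competition.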
<br>
Competitions Progress Flow
- The types and order presented here are the personal opinion of Toshiyuki Sakamoto, author of Kaggle Guide.
Baseline implementing the general-purpose algorithm
- When you first start analyzing the data, obtain output data with a general-purpose algorithm.
- Then develop machine learning models in earnest and compare their output with the results of the general-purpose algorithm.
- If the model performs worse than the general-purpose algorithm, you can assume that the model has a problem. <br>
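The comparison described above could look roughly like this. As the "general-purpose algorithm" I swapped in the simplest baseline I know, which always predicts the most frequent training label. The labels and the model predictions are made-up numbers, and the function names are my own.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_baseline(train_labels, n_predictions):
    """Predict the most frequent training label for every row."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_predictions


# Made-up labels, for illustration only.
train_labels = [0, 0, 0, 1, 1]
valid_labels = [0, 1, 0, 0]
model_predictions = [0, 1, 0, 0]  # output of some hypothetical model

baseline_score = accuracy(
    valid_labels, majority_baseline(train_labels, len(valid_labels))
)
model_score = accuracy(valid_labels, model_predictions)
print("baseline:", baseline_score, "model:", model_score)
if model_score < baseline_score:
    print("The model is worse than the baseline, so it probably has a problem.")
```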
Data Analysis Notebook
- This refers to a Notebook that analyzes Competition data and visualizes it.
- It focuses on identifying correlations, rules, and structure in the data rather than creating data to submit. It also looks for independent variables that explain the dependent variable well.
- If you have little Competition experience, looking at data analyzed by other Kagglers is a good way to start building knowledge and insight. <br>
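One simple way to look for independent variables that fit the dependent variable is to rank them by correlation. Below is a minimal sketch using the Pearson correlation coefficient, written by hand so that it needs no extra packages. The feature names and numbers are invented, and in practice you would more likely call `DataFrame.corr()` in pandas.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Sort feature names by the absolute correlation with the target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up data: one feature follows the target, the other does not.
target = [1, 2, 3, 4]
features = {"x_up": [2, 4, 6, 8], "x_noise": [1, -1, 1, -1]}
for name, score in rank_features(features, target):
    print(name, round(score, 3))
```

Keep in mind that this only captures linear relationships, so it is a starting point for exploration rather than a full analysis.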
Fork Notebook
- For those who are new to machine learning and Kaggle, one way to start is to fork a public notebook instead of doing the data analysis or model development yourself. Fork means to copy a version of the source code.
- Press the copy button at the top right of the Notebook you'd like to fork. <br>
Merge, Blending, Stacking, Ensemble Notebook
- These are Notebooks whose titles contain words such as Merge, Blending, Stacking, and Ensemble.
- As the names suggest, such a Notebook combines the results of several Notebooks. <br>
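The simplest form of combining is blending, where the predictions of several Notebooks are averaged row by row. Here is a minimal sketch under that assumption. The function name `blend`, the weights, and the prediction values are all made up, and real Stacking notebooks are more involved because they train another model on top of the predictions.

```python
def blend(prediction_lists, weights=None):
    """Weighted average of several equally long lists of predictions."""
    if not prediction_lists:
        raise ValueError("need at least one list of predictions")
    length = len(prediction_lists[0])
    if any(len(p) != length for p in prediction_lists):
        raise ValueError("all prediction lists must have the same length")
    if weights is None:
        weights = [1.0] * len(prediction_lists)
    if len(weights) != len(prediction_lists):
        raise ValueError("need exactly one weight per prediction list")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, prediction_lists)) / total
        for i in range(length)
    ]


# Made-up probabilities from two hypothetical notebooks.
notebook_a = [0.2, 0.8, 0.6]
notebook_b = [0.4, 0.6, 1.0]
print(blend([notebook_a, notebook_b]))
print(blend([notebook_a, notebook_b], weights=[3, 1]))
```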
Conclusion of Competitions Progress Flow
- Since a Competition proceeds in this order, I think it is better to study a variety of Notebooks to understand the process rather than looking only at the winner's notebook.
- Also, a Competition is literally a competition, so participants make a Notebook public only when sharing it does not seriously affect their own score.<br> In fact, if you look at the winners' Notebooks, you can often see that they used the latest technology or a different solution from the shared notebooks. <br>
<br>
Rule of Competitions
Competitions in Kaggle sometimes have specific rules. This is because Competitions are usually hosted by a company or organization, and special rules are often created to achieve the results that the company or organization wants. <br>
What rules should I check?
- Rules: To win a Competition, you must first know its rules. Check the Rules menu of each Competition.<br>
- Evaluation: On the Evaluation page under Overview, look at the evaluation function to see which evaluation method is applied. Usually, statistics-based functions are used.<br>
- Per-person score check limit: If participants could check their score as often as they liked by submitting a result file after every small change, the competition would not produce meaningful results, so there is usually a limit on the number of submissions that are scored.<br>
- Notebook Only Competition: Results may be submitted only through Kaggle Notebook.<br> When only Kaggle Notebook is used, Kagglers are more likely to share Notebooks, and all participants can easily find good ideas by viewing shared Notebooks. <br>Also, all participants have the same computing resources, which helps address the inequality between those who have personal workstations and those who do not. <br>
<br>
Flow of Technology in Kaggle
Exploring in Closed Competition
- One characteristic of Kaggle is that it keeps the discussions and notebooks of Competitions that ended a long time ago.<br> By looking at these, you can see which technologies were applied where and in what ways.
- Example

| Competition | Used Technology | Description |
| --- | --- | --- |
| Mercari Price Suggestion Challenge (2018.2) | TF-IDF Vector + Fully Connected Neural Network | Learn the frequency of each word with neural networks |
| Toxic Comment Classification Challenge (2018.3) | FastText, GloVe + GRU + LightGBM | A combination of word vector dictionaries learned from time series data |
| Avito Demand Prediction Challenge (2018.6) | FastText + LSTM + 2D-CNN | Learn data and images of sentences simultaneously with neural networks |
| Quora Insincere Questions Classification (2019.1) | GloVe, para + OOV Token + LSTM + 1D-CNN | Learn vocabularies through OOV token |
| Jigsaw Unintended Bias in Toxicity Classification (2019.6) | BERT + XLNet + GPT2 | The BERT model appeared on Kaggle |
Winner Solutions at a Glance
- Data-Science-Competitions is a GitHub repository that presents solutions that won Competitions, topic by topic (when I checked, its last commit was 11 months ago).
- A winning solution is based on the technology of its time, so check whether better technology is available today.
- For most Competitions, solutions using the latest technology continue to be released from the Private Leaderboard page after the Competition ends. <br>
<br>
Kaggle Dataset and API
Use public Dataset
- When studying common algorithms, it is recommended to test performance with a widely known public Dataset. The UCI Machine Learning Repository is a famous source.<br> Its datasets are also used in many academic papers. <br>
Use it as a Data Repository
- Like GitHub, Kaggle can be used as a convenient place to store Datasets and Notebooks (for free!).
- It also has the advantage that a Dataset can be connected directly to a Notebook.
- There is a capacity limit of up to 20GB per public Dataset and up to 20GB in total for all private Datasets. <br>
Kaggle API
- Kaggle API is an API that lets you use various functions of Kaggle from various development environments.
- It is developed in Python 3 and is used by entering commands in a terminal. <br>
Install Kaggle API
- You must install Python and pip before starting. <br>
1. First, install Kaggle API using pip install kaggle.
2. Then go to your profile, click the account button, and press Accounts.
3. Click Create New API Token there to download the json file.
4. Save the downloaded json file in the user's home directory as .kaggle/kaggle.json. Now you are ready to use Kaggle API. <br>
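If the API complains about missing credentials, a quick check like the sketch below can tell you whether the token file is where it is expected. I assume the file holds a JSON object with `username` and `key` fields, and the helper names `kaggle_token_path` and `load_kaggle_token` are my own, not part of the Kaggle API.

```python
import json
import os


def kaggle_token_path(home=None):
    """Return the expected location of kaggle.json under a home directory."""
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".kaggle", "kaggle.json")


def load_kaggle_token(home=None):
    """Return the token as a dict, or None when the file does not exist."""
    path = kaggle_token_path(home)
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        token = json.load(f)
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError("kaggle.json is missing: " + ", ".join(sorted(missing)))
    return token


# Only report whether the token exists; never print the key itself.
print("expected at:", kaggle_token_path())
print("token found:", load_kaggle_token() is not None)
```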
Use Kaggle API
- Open a terminal on your PC to run the commands.
- Run the kaggle competitions list command to see which Competitions are currently in progress.<br>
- To view a Competition's files, run kaggle competitions files COMPETITION_NAME, and to download them, run kaggle competitions download COMPETITION_NAME.
- To learn more about the Kaggle API, please visit Kaggle Public API Documentation. <br>
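If you prefer to call these commands from a Python script instead of typing them, a thin wrapper around `subprocess` is one option. The sketch below only builds the same three commands mentioned above. The function names are hypothetical, `titanic` is used just as an example competition name, and `run` returns `None` when the `kaggle` executable is not installed.

```python
import shutil
import subprocess


def list_command():
    """Command that lists the Competitions."""
    return ["kaggle", "competitions", "list"]


def files_command(competition):
    """Command that shows the files of one Competition."""
    return ["kaggle", "competitions", "files", competition]


def download_command(competition):
    """Command that downloads the files of one Competition."""
    return ["kaggle", "competitions", "download", competition]


def run(command):
    """Run a command and return its output, or None if kaggle is missing."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout


# Print the command instead of running it, so the sketch works offline.
print(" ".join(files_command("titanic")))
```

Passing the command as a list instead of one string avoids shell quoting problems when a name contains special characters.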
Finished!
First of all, thank you for reading Hello Kaggle!
<br>
I studied Python for the first time in April 2020, but I was unable to concentrate fully on my studies because I started military service in July of the same year.<br>
That's why I couldn't study data science in depth, and I still need more knowledge to understand it.<br>
Now I'm finally stepping into machine learning and Kaggle.<br>
While writing Hello Kaggle!, I improved my understanding of Kaggle, and I'm going to start with a Getting Started Competition.<br>I'm also eager to keep up with the latest technology by looking at other outstanding Kagglers' Notebooks.<br>Hopefully, everyone who reads Hello Kaggle! will have a great 2021. Let's Keep Going!
<br>