Materials for GWU DNSC 6279 and 6290

DNSC 6279 ("Data Mining") provides exposure to various data preprocessing, statistics, and machine learning techniques that can be used both to discover relationships in large data sets and to build predictive models. Techniques covered will include basic and analytical data preprocessing, regression models, decision trees, neural networks, clustering, association analysis, and basic text mining. Techniques will be presented in the context of data-driven organizational decision making using statistical and machine learning approaches.

DNSC 6290 ("Machine Learning") is a follow-up course to DNSC 6279 that expands on both the theoretical and practical aspects of subjects covered in the prerequisite course while optionally introducing new materials. Techniques covered may include feature engineering, penalized regression, neural networks and deep learning, ensemble models including stacked generalization and super learner approaches, matrix factorization, model validation, and model interpretation. Classes will be taught as workshops where groups of students will apply lecture materials to the ongoing Kaggle Advanced Regression and Digit Recognizer contests.

Course Topics

Topics
Section 00: Intro and History
Section 01: Basic Data Prep
Section 02: Analytical Data Prep
Section 03: Regression
Section 04: Decision Trees and Ensembles
Section 05: Neural Networks
Section 06: Clustering
Section 07: Association Rules
Section 08: Text Mining
Section 09: Matrix Factorization
Section 10: Model Interpretability

Some external reference material

Course Syllabi (Outdated/Unofficial)

Pre-requisite Courses

Instructor

Mr. Patrick Hall

E-mail: jphall@gwu.edu

Twitter: @jpatrickhall

LinkedIn: https://www.linkedin.com/in/jpatrickhall/

Course Location

Location: Duques Hall, Room 255, Thursdays 6:10-8:40 PM

Office Hours: Funger Hall, Room 415, Thursdays 5:00-6:00 PM

Copyrights and Licenses

Some teaching materials are copyrighted by the instructor. Some copyrights are owned by other individuals and entities.

Most code examples are copyrighted by the instructor and provided with an MIT license, meaning they can be used for almost anything as long as the copyright and license notice are preserved. Some code examples are copyrighted by other entities and are usually provided with an Apache Version 2 license. These code examples can also be used for nearly any purpose, even commercially, as long as the copyright and license notice are preserved.

Recommended Textbooks

DNSC 6279 ("Data Mining")
DNSC 6290 ("Machine Learning")

Reading Assignments

The student is responsible for studying and understanding all assigned materials. If reading generates questions that are not discussed in class, the student has the responsibility of addressing the instructor privately or raising the issue in an appropriate digital medium.

Blackboard

Some materials for this class have personal or corporate copyrights or licenses that prevent them from being shared on GitHub. Those materials or other internal information will be shared with students via Blackboard.

Grading

DNSC 6279 ("Data Mining")
Numeric Grade: Letter Grade
94-100: A
90-93.99: A-
87-89.99: B+
84-86.99: B
80-83.99: B-
77-79.99: C+
74-76.99: C
70-73.99: C-
<= 69.99: F
DNSC 6290 ("Machine Learning")

Academic Integrity

If you are struggling with an assignment or class materials, require extra time for an assignment, or simply require additional assistance, see the instructor immediately.

Cheating and plagiarism will not be tolerated. Any case will automatically result in the loss of all points for the assignment, and may be a reason for a failing grade and/or grounds for dismissal. In the case of a group assignment, all group members will receive a zero grade.

Any suspected case of cheating or plagiarism or behavior in violation of the rules of this course will be reported to the Office of Academic Integrity. Students are expected to know and understand all college policies, especially the code of academic integrity.

Disability Services

Please contact Disability Support Services to establish eligibility and to coordinate reasonable accommodation.

Attendance

Regular attendance is expected, except for remote students. All students are held responsible for all of the work of the courses in which they are registered, and all absences must be excused by the instructor before provision is made to make up the work missed.

Class Policy Changes

The instructor reserves the right to revise any item on this syllabus, including, but not limited to, any class policy, course outline or schedule, grading policy, tests, etc. Note that the requirements for deliverables may be clarified and expanded in class, via email, on GitHub, or on Blackboard. Students are expected to complete the deliverables incorporating such additions.

Software

Using Git for this Material

You are welcome to use git and/or GitHub to save and manage your own copies of class materials.

The easiest way to do so is to download this entire repository as a zip file. However, you will need to download a new copy of the repository whenever changes are made to it. To download the course repository, navigate to the course GitHub repository (i.e., this page), click the 'Clone or Download' button, and then select 'Download Zip'.

If you would like to take advantage of the version control capabilities of git, follow these steps.

Install required software
Fork and pull materials

Navigate to the course GitHub repository (i.e., this page) and click the 'Fork' button.

Enter the following statements on the git bash command line:

$ cd <parent directory>

$ mkdir GWU_data_mining

$ cd GWU_data_mining

$ git init

$ git remote add origin https://github.com/<your username>/GWU_data_mining.git

$ git remote add upstream https://github.com/jphall663/GWU_data_mining.git

$ git pull origin master

$ git lfs install

$ git lfs track '*.jpg' '*.png' '*.csv' '*.sas7bdat'
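The steps above add an `upstream` remote but do not show how it is used. The following is a minimal sketch, not part of the course repository, of how the instructor's later changes might be merged into a local copy instead of downloading a fresh zip file. The function name `sync_course_materials` is hypothetical, and the sketch assumes the course materials remain on the `master` branch.

```shell
# Hypothetical helper (not part of the course repository): bring the
# instructor's latest changes into your local copy.
# Assumes the "upstream" remote was added as in the steps above and that
# the course materials live on the "master" branch unless another branch
# name is passed as the first argument.
sync_course_materials() {
    local branch="${1:-master}"
    # Download new commits from the instructor's repository.
    git fetch upstream || return 1
    # Merge them into the branch that is currently checked out.
    git merge "upstream/${branch}"
}
```

Running `sync_course_materials` from inside the GWU_data_mining directory, followed by `git push origin master`, should also bring your fork on GitHub up to date. If you have edited course files locally, the merge step may report conflicts that need to be resolved by hand.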

Docker

Dockerfile to create an Anaconda Python 3.5 environment with H2O, XGBoost, and GraphViz.

Start the image with:

docker run -i -t -p 8888:8888 <image_id> /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && /opt/conda/bin/jupyter notebook --notebook-dir=/GWU_data_mining --ip='*' --port=8888 --no-browser"
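The run command above needs an image ID, which exists only after the image has been built. As a rough sketch, assuming the command is run from the directory that contains the Dockerfile, the image could be built and its ID printed as follows. The function name `build_course_image` and the tag `gwu_data_mining` are hypothetical names chosen for illustration; neither is defined by the course.

```shell
# Hypothetical helper: build the course image from the Dockerfile in the
# current directory and print the resulting image ID.
# The default tag "gwu_data_mining" is an assumed name, not one defined
# by the course materials.
build_course_image() {
    local tag="${1:-gwu_data_mining}"
    # Build the image from the Dockerfile in the current directory.
    docker build -t "$tag" . || return 1
    # Print only the ID of the tagged image, for use as <image_id>.
    docker images -q "$tag"
}
```

The printed ID can be substituted for `<image_id>` in the run command. Because `docker run` also accepts an image name, the tag itself should work in its place.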