datascience-fails

Collection of articles listing reasons why data science projects fail.

If you have an article that should be added, please suggest it with its link in the Issues.

I summarised my findings on my blog: Data Science Risk Categorisation

The post is also available on my new company's blog (hypergolic.co.uk): Data Science Risk Categorisation

Follow me at @xLaszlo on Twitter for updates. <br><br>

Categorisation

Looking through the 300+ failures listed below, I noticed a notable absence of any concern about domain experts, or of any collaboration with them, apart from off-hand mentions of labelled data. The reader should take this into account when using the categorisation. (Laszlo) <br><br>

I created this image to show how I imagine communicating this on a single slide (excuse my design skills; it's a 2x3 table in Lucidchart with the middle row merged). <br> <img src="images/DSRisks.png" width="50%" height="50%"> <br><br>

David Dao's collection of Awful AI on GitHub (link)

51 things that can go wrong in a real-world ML project (link)

01. Vague success metrics of the ML model
02. Even if we had the perfect model — no clue of how it will be used within existing workflows
03. Building a 100% accurate model — no clarity on the acceptable trade-offs such as precision versus recall
04. Using a hammer to kill an ant — not checking the performance of simpler alternatives
05. Not all ML problems are worth solving — the impact may not be worth the effort
06. Drowning the business team in technical mumbo jumbo
07. I thought this dataset attribute means something else
08. 5 definitions of a business metric
09. Where is the dataset I need for my model?
10. The data warehouse is stale
11. Need to instrument app for more clickstream events — it will take months
12. Assuming all the datasets have the same quality
13. Customer changed preference to not use their data for ML. Why are those records still included?
14. Uncoordinated schema changes at the data source
15. We have lots of data — don't forget data expires
16. Systematic data issues making the overall dataset biased
17. Unnoticed sudden distribution changes in the data
18. Using all the data for training — each model iteration can take days
19. We are using the best polyglot datastores — but how do I now write queries effectively across this data?
20. Training versus inference inconsistency
21. Model accuracy too good to be true — check for feature leakage
22. Limited feature value coverage
23. Flaky pipeline for generating features that are time-dependent
24. Lack of balance between bias (underfitting) and variance (overfitting)
25. Compromising interpretability prematurely for performance
26. Always using deep learning instead of traditional feature engineering
27. Not applying hashing for sparse features
28. Not attempting to reduce the dimensionality of models
29. Ad-hoc tuning is faster compared to a scientific approach
30. Improper tracking of details related to model versions and experiments
31. Ignoring the specificity and sparsity trade-off
32. Prematurely jumping to online experimentation
33. Not measuring the model's sensitivity to recency
34. Not paying attention to infrastructure capacity
35. Evaluating models using different datasets
36. Reporting model accuracy for the overall data
37. Training results not reproducible
38. Long time before first online experiment
39. Model behaves differently in online experimentation compared to offline validation
40. Ignoring feedback loops
41. Making multiple changes within an experiment
42. Ad-hoc framework to analyze the results of the experiment
43. No backup plan if the test goes south
44. Not calibrating the model
45. ETL pipeline SLA was 8 am. It's now 4 pm and still processing — why is my metrics processing slow today?
46. Metrics processing pipelines completed successfully but results are wrong
47. Response time to generate an inference is too high
48. Data quality issues at source, on ingestion into the lake, or in ETL processing
49. Cloud costs jumped up 3X this month
50. Model has not been re-trained for 3 months — it was supposed to happen weekly
51. No checks and bounds for data and concept drift (see the drift-check sketch after this list)
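Items 17 and 51 above point at the same missing safeguard: nothing in the pipeline checks whether the data the model sees in production still looks like the data it was trained on. Below is a minimal sketch of such a check, assuming per-feature numpy arrays and an illustrative 0.01 threshold; it uses a two-sample Kolmogorov-Smirnov test from scipy, which is one possible choice rather than a method prescribed by the linked article.

```python
# Minimal drift-check sketch (illustrates items 17 and 51 above).
# Assumptions: each feature is available as a 1-D numpy array; the alpha
# threshold and the synthetic data below are illustrative only.
import numpy as np
from scipy.stats import ks_2samp


def drift_report(train_values: np.ndarray, live_values: np.ndarray, alpha: float = 0.01) -> dict:
    """Compare one feature's training distribution against live traffic."""
    result = ks_2samp(train_values, live_values)  # two-sample Kolmogorov-Smirnov test
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "drifted": result.pvalue < alpha,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    train = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature as seen at training time
    live = rng.normal(loc=0.4, scale=1.0, size=5_000)   # same feature in production, shifted
    print(drift_report(train, live))                    # expect "drifted": True for the shifted data
```

In practice a check like this would run per feature on a schedule and alert rather than print, and would be paired with a check on the model's outputs for concept drift; the point of items 17 and 51 is simply that some such check exists at all.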

Why 87% of Machine learning Projects Fail (link)

Top 10 Challenges to Practicing Data Science at Work (link)

The State of Data Science & Machine Learning 2017 (link, webarchive)

OpML '20 - How ML Breaks: A Decade of Outages for One Large ML Pipeline (Google) (link, youtube)

geckoboard's Data fallacies (link)

Three Risks in Building Machine Learning Systems (link)

AI Engineering: 11 Foundational Practices (link, pdf)

Machine Learning: The High-Interest Credit Card of Technical Debt (link, pdf)

Managing the Risks of Adopting AI Engineering (link)

What is ML Ops? Best Practices for DevOps for ML (Cloud Next '18) (link, youtube)

A Brief Guide to Running ML Systems in Production (link)

6 myths about big data (link)

How your executives will screw up your next analytics project (link)

The state of data quality in 2020 (link)

AI adoption in the enterprise 2020 (link)

Move Fast and Break Things? The AI Governance Dilemma (link)

9 machine learning myths (link)

10 signs you’re ready for AI — but might not succeed (link)

AI’s Biggest Risk Factor is Big Data Itself (link)

Forrester Predictions 2018 (link)

How To Underwhelm With Artificial Intelligence (link)

A Guide to Underwhelming with AI (link)

AI is not set and forget (link)

How to Fail with Artificial Intelligence (link)

Top 5 AI Failures From 2017 Which Prove That ‘Perfect AI’ Is Still A Dream (link)

Stories of AI Failure and How to Avoid Similar AI Fails (link)

NewVantage Partners: Big Data Executive Survey 2017 (link, pdf)

Five Reasons Why Your Data Science Project is Likely to Fail (link)

6 Reasons Why Data Science Projects Fail (link)

Why Data Science Succeeds or Fails (link)

Why data science projects fail revisited (link)

Why Most AI Projects Fail (link)

Why You’re Not Getting Value from Your Data Science (link)

Data Science Project Failures (link)

Why do 87% of data science projects never make it into production? (link)

How to fail as a data scientist: 3 common mistakes (link)

We need to spend more time talking about data science failures (link)

Why Data Science Projects Fail (link)

Data Science: 4 Reasons Why Most Are Failing to Deliver (link)

Why so many Data Science projects fail to deliver (link)

<br><br>