Biased vs Unbiased: Debunking Statistical Myths

A reflection on the biases we use in data science.

Anyone who has attended college-level statistical training has been taught the four rules you should always abide by when developing statistical models and predictions:

  1. You should only use unbiased estimates
  2. You should use estimates that have minimum variance
  3. In any optimization problem (for instance to compute an estimate from a maximum likelihood function, or to detect the best, most predictive subset of variables), you should always shoot for a global optimum, not a local one.
  4. And if you violate any of the above three rules, at least you need to make sure that your estimate, when the number of observations is large, satisfies them.

As a data scientist and ex-statistician, I violate these rules (especially #1 – #3) almost daily. Indeed, that’s part of what makes data science different from statistical science.
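
As a rough illustration of why rule #1 can be worth breaking, the sketch below compares two variance estimators on simulated normal data: the unbiased one (divisor n - 1) and the biased one (divisor n). The sample size, number of trials, seed, and function names are assumptions chosen for illustration and do not come from the original article. For normally distributed data, the biased estimator typically shows the lower mean squared error.

```python
import random


def variance_estimates(sample):
    """Return (unbiased, biased) variance estimates for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    return ss / (n - 1), ss / n


def compare_estimators(n=10, trials=20000, true_var=1.0, seed=0):
    """Simulate normal samples; report the mean and MSE of each estimator."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    sigma = true_var ** 0.5
    unbiased, biased = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        u, b = variance_estimates(sample)
        unbiased.append(u)
        biased.append(b)

    def mean(values):
        return sum(values) / len(values)

    def mse(values):
        return mean([(v - true_var) ** 2 for v in values])

    return {
        "unbiased": {"mean": mean(unbiased), "mse": mse(unbiased)},
        "biased": {"mean": mean(biased), "mse": mse(biased)},
    }


if __name__ == "__main__":
    for name, stats in compare_estimators().items():
        print(f"{name:>8}: mean={stats['mean']:.3f}  mse={stats['mse']:.3f}")
```

The biased estimator is systematically too small on average, yet its estimates sit closer to the true variance overall, which is the kind of trade-off the author has in mind.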

Making it easier to discover datasets

A new Google resource for finding datasets.

In today’s world, scientists in many disciplines and a growing number of journalists live and breathe data. There are many thousands of data repositories on the web, providing access to millions of datasets; and local and national governments around the world publish their data as well. To enable easy access to this data, we launched Dataset Search, so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.

Basketball Stat Cherry Picking

Deep into the NBA playoffs, we are graced with stats-o-plenty before, during, and after every game. Some of the numbers are informative. Most of them are randomly used to illustrate a commentator’s point.

One of the most common stats is the conditional that says something like, “When player X scores at least Y points, the team wins 90 percent of their games.” It implies a cause-and-effect relationship.

The Cleveland Cavaliers won the most games when LeBron James scored 30 or more points. So James should just score that many points every time. Easy. I should be a coach.

It’s a bit of stat cherry picking, trying to find something in common among games won. So to make things easier, and for you to wow your friends during the games, I compiled winning percentages for several stats during the 2017-18 regular season. Select among the star players still in the playoffs.
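
To make the arithmetic behind such a stat concrete, here is a minimal sketch that computes a team's win percentage conditional on a player reaching a points threshold. The game log is invented purely for illustration (it is not real NBA data), and the function name is hypothetical.

```python
def win_pct_when(games, min_points):
    """Win percentage over games where the player scored >= min_points.

    `games` is a list of (player_points, team_won) tuples.
    Returns None if no game meets the threshold.
    """
    selected = [won for points, won in games if points >= min_points]
    if not selected:
        return None
    return 100.0 * sum(selected) / len(selected)


# Made-up game log: (points scored by the player, did the team win?)
toy_games = [
    (34, True), (28, False), (31, True), (22, False),
    (40, True), (30, False), (18, True), (35, True),
]

if __name__ == "__main__":
    overall = win_pct_when(toy_games, 0)
    high_scoring = win_pct_when(toy_games, 30)
    print(f"Overall win %: {overall:.1f}")
    print(f"Win % when scoring 30+: {high_scoring:.1f}")
```

Comparing the conditional figure with the overall win rate is one quick way to judge how much the condition adds; neither number says anything about cause and effect.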

Data Analysis Method: Mathematics Optimization to Build Decision Making

A short introduction to using optimization in data analysis.

Optimization is the problem of finding the best decision, one that is both effective and efficient, whether the goal is a maximum or a minimum, by determining a satisfactory solution.

Optimization is not a new science. It has been developing ever since Newton, in the 17th century, discovered how to compute roots. The science of optimization is still evolving, both in its techniques and in its applications, and many everyday problems involve optimization in their solution. Recent progress has been especially strong in new techniques for solving optimization problems, among them conic programming, semidefinite programming, semi-infinite programming, and several metaheuristic techniques.
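
The root-finding idea credited to Newton still underlies many optimization routines: a minimum of a smooth function can be located by finding a root of its derivative. The following is a minimal sketch of the Newton iteration, assuming the derivative is supplied by the caller. The function name, tolerance, and example functions are illustrative choices, not taken from the original text.

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Find x with f(x) ~ 0 using Newton's iteration x <- x - f(x)/df(x)."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished; try another start")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:  # stop once the update is negligibly small
            return x
    raise RuntimeError("Newton iteration did not converge")


if __name__ == "__main__":
    # Root of x^2 - 2, i.e. the square root of 2.
    root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
    print(root)

    # Minimize g(x) = (x - 3)^2 + 1 by finding the root of g'(x) = 2(x - 3).
    minimizer = newton_root(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
    print(minimizer)
```

Newton's iteration only finds a nearby stationary point, so the starting value matters whenever the function has more than one.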

Summary guide to SPSS tutorials

A good site with a range of resources on using IBM SPSS.

Catalogue of SPSS tutorials is an Excel *.xlsm file containing a full listing (with hyperlinks) of all tutorial files. [May not be completely up to date.]

Guide to pop-out menus shows all the screenshots for menus and sub-menus for Survey Analysis Workshop. [May not be completely up to date; the site has been reorganised, so the guide needs a rewrite, but it is still useful for showing you what to expect.]

There are more than 600 pages of downloadable tutorials arranged in four blocks.

Block 1: From questionnaire to SPSS saved file

1.1:   The language of survey analysis
1.2:   How do data relate to questionnaires?
1.3:   Reading raw data into SPSS
1.4:   Completing your data dictionary
1.5:   Utilities [still in preparation]

Block 2: Analysing one variable

2.1:   Nominal and ordinal variables
2.2:   Interval scale variables
2.3:   Data transformations

Block 3: Analysing two variables (and sometimes three)

3.1:   Contingency tables
3.2:   Three variables
3.3:   Multiple response
3.4:   Comparing means
3.5:   Conditional transformations

Block 4: Hypothesis testing
[Still in preparation. Provisional contents are listed below; the page also has links to some useful resources for statistical concepts.]

Hypothesis testing
4.2a:  t-test and one-way ANOVA
4.2b:  Testing differences between three or more means
4.3:   Chi-square (has one tutorial)
4.4:   Regression and correlation
4.5:   Association, structure and cause

SPSS files and documentation used for tutorials and exercises

SPSS videos and trials

An IBM site with plenty of resources on SPSS.

Software Trial: IBM SPSS Statistics Desktop

Start leveraging your data today to identify your best customers, forecast future trends, improve supplier performance, and more.

Software Trial: IBM SPSS Text Analytics for Surveys Trial Software [US]

IBM SPSS Text Analytics for Surveys uses powerful natural language processing technologies specifically designed for survey text.

Software Trial: IBM SPSS Amos Trial

IBM SPSS Amos gives you the power to easily perform structural equation modeling (SEM).

Online Demo: IBM SPSS Regression in action

Watch how powerful regression techniques can help you discover hidden relationships in your data.

Online Demo: Two-step cluster analysis: Find natural groups in your data [US]

Watch a short demonstration of the two-step cluster analysis technique in SPSS Statistics Base.

Online Demo: Statistical analysis with confidence using IBM SPSS Statistics [US]

Explore the power of statistical analysis in your organization.

White Paper: Better decision making under uncertain conditions [US]

This paper describes Monte Carlo simulation, the value of this technique for risk analysis, and how SPSS Statistics and its Monte Carlo simulation capabilities can help businesses assess risk.

White Paper: The Risk of Using Spreadsheets for Statistical Analysis [US]

Despite their popularity, spreadsheets may not be well suited for analysis and decision making. This paper explores why, and describes a better alternative.

KNIME course

A very good KNIME course; it is introductory but covers a large number of features.

KNIME Online Self-Training

Welcome to the KNIME Self-training course. The focus of this document is to get you started with KNIME as quickly as possible and guide you through essential steps of advanced analytics with KNIME. Optional and very useful topics such as reporting, KNIME Server and database handling are also included to give you an idea of what else is possible with KNIME.

  1. Installing KNIME Analytics Platform and Extensions
  2. Data Import / Export and Database / Big Data
  3. ETL
  4. Visualization
  5. Advanced Analytics
  6. Reporting
  7. KNIME Server

Decision trees: Do Splitting Rules Really Matter?

A good article on the criterion for splitting into subgroups in decision trees.

Do decision-tree splitting criteria matter? Contrary to popular opinion in data mining circles, our experience indicates that splitting criteria do matter; in fact, the difference between using the right rule and the wrong rule could add up to millions of dollars of lost opportunity.

So, why haven’t the differences been noticed? The answer is simple. When data sets are small and highly-accurate trees can be generated easily, the particular splitting rule does not matter. When your golf ball is one inch from the cup, which club or even which end you use is not important because you will be able to sink the ball in one stroke. Unfortunately, previous examinations of splitting rule performance, the ones that found no differences, did not look at data-mining problems with large data sets where obtaining a good answer is genuinely difficult.

When you are trying to detect fraud, identify borrowers who will declare bankruptcy in the next 12 months, target a direct mail campaign, or tackle other real-world business problems that do not admit of 90+ percent accuracy rates (with currently available data), the splitting rule you choose could materially affect the accuracy and value of your decision tree. Further, even when different splitting rules yield similarly accurate classifiers, the differences between them may still matter. With multiple classes, you might care how the errors are distributed across classes. Between two trees with equal overall error rates, you might prefer a tree that performs better on a particular class or classes. If the purpose of a decision tree is to yield insight into a causal process or into the structure of a database, splitting rules of similar accuracy can yield trees that vary greatly in their usefulness for interpreting and understanding the data.

This paper explores the key differences between three important splitting criteria: Gini, Twoing and Entropy, for three- and greater-level classification trees, and suggests how to choose the right one for a particular problem type. Although we can make recommendations as to which splitting rule is best suited to which type of problem, it is good practice to always use several splitting rules and compare the results. You should experiment with several different splitting rules and should expect different results from each. As you work with different types of data and problems, you will begin to learn which splitting rules typically work best for specific problem types. Nevertheless, you should never rely on a single rule alone; experimentation is always wise.

Gini, Twoing, and Entropy

The best known rules for binary recursive partitioning are Gini, Twoing, and Entropy. Because each rule represents a different philosophy as to the purpose of the decision tree, each grows a different style of tree.
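
As a rough sketch of how these three rules score a candidate binary split, the functions below use the standard textbook definitions: Gini impurity 1 - Σ p_k², entropy -Σ p_k log2 p_k, and the twoing value (P_L · P_R / 4) · (Σ_k |p(k|L) - p(k|R)|)². The function names and toy labels are assumptions for illustration, and this is not the implementation used by any particular product.

```python
from collections import Counter
from math import log2


def class_probs(labels):
    """Relative frequency of each class in a list of labels."""
    total = len(labels)
    return {k: c / total for k, c in Counter(labels).items()}


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probs(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in class_probs(labels).values())


def impurity_gain(impurity, left, right):
    """Decrease in impurity from splitting the parent into left/right."""
    parent = left + right
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - w_left * impurity(left) - w_right * impurity(right)


def twoing(left, right):
    """Twoing value of a split; larger means a better separation."""
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    probs_l, probs_r = class_probs(left), class_probs(right)
    classes = set(probs_l) | set(probs_r)
    diff = sum(abs(probs_l.get(k, 0.0) - probs_r.get(k, 0.0)) for k in classes)
    return p_left * p_right / 4.0 * diff ** 2


if __name__ == "__main__":
    # Toy three-class split (made-up labels).
    left, right = ["a", "a", "a", "b"], ["b", "b", "c", "c"]
    print("Gini gain:   ", impurity_gain(gini, left, right))
    print("Entropy gain:", impurity_gain(entropy, left, right))
    print("Twoing:      ", twoing(left, right))
```

Scoring the same candidate splits with all three functions is a cheap way to see where the rules disagree on a given dataset.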

MARS – Multivariate Adaptive Regression Splines

A good description of these data analysis algorithms by the authors themselves.

An Overview of MARS

What is “MARS”?

MARS®, an acronym for Multivariate Adaptive Regression Splines, is a multivariate non-parametric regression procedure introduced in 1991 by world-renowned Stanford statistician and physicist, Jerome Friedman (Friedman, 1991). Salford Systems’ MARS, based on the original code, has been substantially enhanced with new features and capabilities in exclusive collaboration with Friedman.

SAP video analytics

Lots of videos on analytics from SAP.
Digital Enterprise Platform
SAP Digital Business Services
SAPIndustry
SAPLineOfBusiness

SME Solutions and Partner Innovation
