Making it easier to discover datasets
Posted by Armando Brito Mendes | Filed under databases, materials for professionals
A new Google resource for finding datasets.
In today’s world, scientists in many disciplines and a growing number of journalists live and breathe data. There are many thousands of data repositories on the web, providing access to millions of datasets; and local and national governments around the world publish their data as well. To enable easy access to this data, we launched Dataset Search, so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.
Tags: data analysis
Basketball Stat Cherry Picking
Posted by Armando Brito Mendes | Filed under statistics, visualization
Deep into the NBA playoffs, we are graced with stats-o-plenty before, during, and after every game. Some of the numbers are informative. Most of them are randomly used to illustrate a commentator’s point.
One of the most common stats is the conditional that says something like, “When player X scores at least Y points, the team wins 90 percent of their games.” It implies a cause-and-effect relationship.
The Cleveland Cavaliers won the most games when LeBron James scored 30 or more points. So James should just score that many points every time. Easy. I should be a coach.
It’s a bit of stat cherry picking, trying to find something in common among games won. So to make things easier, and for you to wow your friends during the games, I compiled winning percentages for several stats during the 2017-18 regular season. Select among the star players still in the playoffs.
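The "when player X scores at least Y points, the team wins Z percent" stat is just a conditional win rate, and it helps to see how little it takes to compute one. Below is a minimal sketch in Python. The game log is made-up toy data (not real NBA results), and the names `conditional_win_pct`, `points` and `won` are hypothetical, chosen only for illustration.

```python
def conditional_win_pct(games, min_points):
    """Share of games won among those where the player scored >= min_points.

    `games` is a list of dicts with hypothetical keys "points" and "won".
    Returns None when no game meets the condition.
    """
    qualifying = [g for g in games if g["points"] >= min_points]
    if not qualifying:
        return None
    wins = sum(1 for g in qualifying if g["won"])
    return wins / len(qualifying)


if __name__ == "__main__":
    # Toy game log, invented for this example only.
    toy_games = [
        {"points": 34, "won": True},
        {"points": 31, "won": True},
        {"points": 22, "won": False},
        {"points": 30, "won": False},
        {"points": 18, "won": True},
    ]
    for threshold in (20, 25, 30):
        print(threshold, conditional_win_pct(toy_games, threshold))
```

Sliding the threshold until the percentage looks impressive is exactly the cherry picking described above: the number summarizes games already won and says nothing about cause and effect.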
Tags: data analysis
Data Analysis Method: Mathematics Optimization to Build Decision Making
Posted by Armando Brito Mendes | Filed under operations research, mathematics, decision support systems (DSS)
A short introduction to the use of optimization in data analysis.
Optimization is the problem of finding the best decision, one that is both effective and efficient, whether that means maximizing or minimizing some quantity, by determining a satisfactory solution.
Optimization is not a new science. It has been developing ever since Newton, in the 17th century, discovered how to compute roots. The science of optimization is still evolving today, in both techniques and applications, and many everyday problems involve optimization. Recent progress has come especially from new techniques for solving optimization problems, among them conic programming, semidefinite programming, semi-infinite programming and several metaheuristic techniques.
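The link between root finding and optimization can be made concrete: a smooth function reaches a minimum where its derivative is zero, so a root finder applied to the derivative becomes a simple optimizer. The sketch below is a minimal Newton iteration in Python. It is only an illustration, not the method of the article above, and the function name `newton_root` and the example functions are assumptions made for this example.

```python
def newton_root(f, df, x0, tol=1e-10, max_iter=50):
    """Find x with f(x) close to 0 using Newton's iteration x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        slope = df(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished; try another starting point")
        x -= fx / slope
    return x  # best estimate after max_iter steps


if __name__ == "__main__":
    # Root finding: solve x^2 - 2 = 0, starting from x = 1.
    print(newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0))

    # Optimization: minimize g(x) = x^2 - 4x + 7 by solving g'(x) = 2x - 4 = 0.
    # Here f is g' and df is g'' (constant 2).
    print(newton_root(lambda x: 2 * x - 4, lambda x: 2.0, 0.0))
```

Newton's iteration only finds a stationary point near the starting value, so on functions with several minima the result depends on `x0`.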
Tags: data analysis, data mining, optimization
Summary guide to SPSS tutorials
Posted by Armando Brito Mendes | Filed under data sets, statistics, teaching materials, software
A good site with several resources on using IBM SPSS.
Catalogue of SPSS tutorials is an Excel *.xlsm file containing a full listing (with hyperlinks) of all tutorial files. [may not be completely up-to-date]
Guide to pop-out menus shows all the screenshots for menus and sub-menus for Survey Analysis Workshop [may not be completely up-to-date and site has been re-organised, so needs a re-write, but still useful to show you what to expect]
There are more than 600 pages of downloadable tutorials arranged in four blocks.
Block 1: From questionnaire to SPSS saved file
1.1: The language of survey analysis
1.2: How do data relate to questionnaires?
1.3: Reading raw data into SPSS
1.4: Completing your data dictionary
1.5: Utilities [still in preparation]
Block 2: Analysing one variable
2.1: Nominal and ordinal variables
2.2: Interval scale variables
2.3: Data transformations
Block 3: Analysing two variables (and sometimes three)
3.1: Contingency tables
3.2: Three variables
3.3: Multiple response
3.4: Comparing means
3.5: Conditional transformations
Block 4: Hypothesis testing
[Still in preparation: provisional contents listed below: page also has links to some useful resources for statistical concepts]
Hypothesis testing
4.2a t-test and one-way ANOVA
4.2b Testing differences between three or more means
4.3 Chi-square (has one tutorial)
4.4 Regression and correlation
4.5 Association, structure and cause
SPSS files and documentation used for tutorials and exercises
Tags: data analysis, IBM SPSS Statistics, surveys, statistical software
SPSS videos and trials
Posted by Armando Brito Mendes | Filed under statistics, software
An IBM site with plenty of resources on SPSS.
Software Trial: IBM SPSS Statistics Desktop
Start leveraging your data today to identify your best customers, forecast future trends, improve supplier performance, and more.
Software Trial: IBM SPSS Text Analytics for Surveys Trial Software [US]
IBM SPSS Text Analytics for Surveys uses powerful natural language processing technologies specifically designed for survey text.
Software Trial: IBM SPSS Amos Trial
IBM SPSS Amos gives you the power to easily perform structural equation modeling (SEM).
Online Demo: IBM SPSS Regression in action
Watch how powerful regression techniques can help you discover hidden relationships in your data.
Online Demo: Two-step cluster analysis: Find natural groups in your data [US]
Watch a short demonstration of the two-step cluster analysis technique in SPSS Statistics Base.
Online Demo: Statistical analysis with confidence using IBM SPSS Statistics [US]
Explore the power of statistical analysis in your organization.
White Paper: Better decision making under uncertain conditions [US]
This paper describes Monte Carlo simulation, the value of this technique for risk analysis and how SPSS Statistics and its Monte Carlo simulation capabilities can help businesses assess for risk.
White Paper: The Risk of Using Spreadsheets for Statistical Analysis [US]
Despite their popularity, spreadsheets may not be well suited for analysis and decision making. This paper explores why, and describes a better alternative.
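For readers new to the Monte Carlo simulation mentioned in the first white paper above, the idea is to draw many random scenarios and count how often a risky outcome occurs. The sketch below is a minimal illustration in plain Python, not SPSS Statistics and not taken from the white paper. The revenue and cost distributions, their parameters and the function name `simulate_loss_probability` are all invented for the example.

```python
import random


def simulate_loss_probability(n_trials=20_000, seed=42):
    """Estimate the probability that cost exceeds revenue.

    Assumed toy model (illustration only):
      revenue ~ Normal(mean=100, sd=15)
      cost    ~ Normal(mean=90,  sd=10)
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    losses = 0
    for _ in range(n_trials):
        revenue = rng.gauss(100.0, 15.0)
        cost = rng.gauss(90.0, 10.0)
        if revenue - cost < 0:
            losses += 1
    return losses / n_trials


if __name__ == "__main__":
    print(f"Estimated probability of a loss: {simulate_loss_probability():.3f}")
```

With these assumed inputs the estimate should land close to the analytical value of about 0.29. The attraction of simulation is that the same loop still works when the model is too complicated for a closed-form answer.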
Tags: data analysis, IBM SPSS Statistics, statistical software
KNIME course
Posted by Armando Brito Mendes | Filed under GIS maps, materials for professionals, software, videos, visualization
A very good KNIME course; it is introductory but covers a large number of features.
KNIME Online Self-Training
Welcome to the KNIME Self-training course. The focus of this document is to get you started with KNIME as quickly as possible and guide you through essential steps of advanced analytics with KNIME. Optional and very useful topics such as reporting, KNIME Server and database handling are also included to give you an idea of what else is possible with KNIME.
- Installing KNIME Analytics Platform and Extensions
- Data Import / Export and Database / Big Data
- ETL
- Visualization
- Advanced Analytics
- Reporting
- KNIME Server
Tags: data analysis, big data, data mining, Knime, text mining
Decision trees: Do Splitting Rules Really Matter?
Posted by Armando Brito Mendes | Filed under Uncategorized
A good text on the criterion used to split nodes into subgroups in decision trees.
Do decision-tree splitting criteria matter? Contrary to popular opinion in data mining circles, our experience indicates that splitting criteria do matter; in fact, the difference between using the right rule and the wrong rule could add up to millions of dollars of lost opportunity.
So, why haven’t the differences been noticed? The answer is simple. When data sets are small and highly-accurate trees can be generated easily, the particular splitting rule does not matter. When your golf ball is one inch from the cup, which club or even which end you use is not important because you will be able to sink the ball in one stroke. Unfortunately, previous examinations of splitting rule performance, the ones that found no differences, did not look at data-mining problems with large data sets where obtaining a good answer is genuinely difficult.
When you are trying to detect fraud, identify borrowers who will declare bankruptcy in the next 12 months, target a direct mail campaign, or tackle other real-world business problems that do not admit of 90+ percent accuracy rates (with currently available data), the splitting rule you choose could materially affect the accuracy and value of your decision tree. Further, even when different splitting rules yield similarly accurate classifiers, the differences between them may still matter. With multiple classes, you might care how the errors are distributed across classes. Between two trees with equal overall error rates, you might prefer a tree that performs better on a particular class or classes. If the purpose of a decision tree is to yield insight into a causal process or into the structure of a database, splitting rules of similar accuracy can yield trees that vary greatly in their usefulness for interpreting and understanding the data.
This paper explores the key differences between three important splitting criteria: Gini, Twoing and Entropy, for three- and greater-level classification trees, and suggests how to choose the right one for a particular problem type. Although we can make recommendations as to which splitting rule is best suited to which type of problem, it is good practice to always use several splitting rules and compare the results. You should experiment with several different splitting rules and should expect different results from each. As you work with different types of data and problems, you will begin to learn which splitting rules typically work best for specific problem types. Nevertheless, you should never rely on a single rule alone; experimentation is always wise.
Gini, Twoing, and Entropy
The best known rules for binary recursive partitioning are Gini, Twoing, and Entropy. Because each rule represents a different philosophy as to the purpose of the decision tree, each grows a different style of tree.
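To make the three rules concrete, here is a minimal sketch in Python of how each one scores a single binary split, using the standard textbook definitions: Gini impurity 1 - sum(p_k^2), entropy -sum(p_k log2 p_k), and the twoing value (P_L * P_R / 4) * (sum_k |p(k|L) - p(k|R)|)^2. It is not the implementation used by the paper's authors, the function names are assumptions made for this example, and both child nodes are assumed to be non-empty.

```python
from collections import Counter
from math import log2


def class_probs(labels):
    """Relative frequency of each class label in a node."""
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}


def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probs(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum p * log2(p)."""
    return -sum(p * log2(p) for p in class_probs(labels).values())


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))


def twoing(left, right):
    """Twoing value of a split: (P_L * P_R / 4) * (sum_k |p(k|L) - p(k|R)|)^2."""
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    probs_l, probs_r = class_probs(left), class_probs(right)
    classes = set(probs_l) | set(probs_r)
    diff = sum(abs(probs_l.get(k, 0.0) - probs_r.get(k, 0.0)) for k in classes)
    return p_left * p_right / 4 * diff ** 2


if __name__ == "__main__":
    # Toy node with three classes and one candidate split (invented labels).
    parent = ["a", "a", "a", "b", "b", "c"]
    left, right = ["a", "a", "a"], ["b", "b", "c"]
    print("Gini decrease:   ", impurity_decrease(parent, left, right, gini))
    print("Entropy decrease:", impurity_decrease(parent, left, right, entropy))
    print("Twoing value:    ", twoing(left, right))
```

A tree grower evaluates every candidate split with one of these scores and keeps the highest. Because the three scores rank candidate splits differently when there are several classes, the resulting trees can differ, which is the point made above.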
Tags: data analysis, data mining
MARS – Multivariate Adaptive Regression Splines
Posted by Armando Brito Mendes | Filed under teaching materials, materials for professionals
A good description of these data analysis algorithms by the authors themselves.
An Overview of MARS
What is “MARS”?
MARS®, an acronym for Multivariate Adaptive Regression Splines, is a multivariate non-parametric regression procedure introduced in 1991 by world-renowned Stanford statistician and physicist, Jerome Friedman (Friedman, 1991). Salford Systems’ MARS, based on the original code, has been substantially enhanced with new features and capabilities in exclusive collaboration with Friedman.
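As general background, a MARS model is usually written as a weighted sum of hinge functions such as max(0, x - t) and max(0, t - x), where t is a knot. The Python sketch below only evaluates such a model for one predictor. It does not search for knots or fit coefficients, and it is not Salford Systems' code; the names `hinge` and `mars_predict` and the example coefficients are invented for illustration.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) if direction is +1,
    max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum(coef * hinge(x, knot, direction)).

    `terms` is a list of (coef, knot, direction) tuples.
    """
    return intercept + sum(
        coef * hinge(x, knot, direction) for coef, knot, direction in terms
    )


if __name__ == "__main__":
    # Invented piecewise-linear model with a single knot at x = 1.
    terms = [(2.0, 1.0, +1), (-0.5, 1.0, -1)]
    for x in (0.0, 1.0, 3.0):
        print(x, mars_predict(x, 1.0, terms))
```

Each pair of hinges lets the fitted line change slope at the knot, which is how a model built from simple linear pieces can follow a non-linear relationship.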
Tags: data analysis, data mining, machine learning
SAP video analytics
Posted by Armando Brito Mendes | Filed under materials for professionals, videos
SME Solutions and Partner Innovation
Tags: data analysis, data mining
Deeplearning4j Documentation
Posted by Armando Brito Mendes | Filed under materials for professionals, software
The site of a Java package for deep learning, with lots of information about neural networks and related topics.
- How To
- Quickstart: Running Examples and DL4J in Your Projects
- Comprehensive Setup Guide
- Build Locally From Master
- Contribute to DL4J (Developer Guide)
- Choose a Neural Net
- Use the Maven Build Tool
- Vectorize Data With Canova
- Build a Data Pipeline
- Run Benchmarks
- Configure DL4J in Ivy, Gradle, SBT etc
- Find a DL4J Class or Method
- Save and Load Models
- Interpret Neural Net Output
- Visualize Data with t-SNE
- Swap CPUs for GPUs
- Customize an Image Pipeline
- Perform Regression With Neural Nets
- Troubleshoot Training & Select Network Hyperparameters
- Visualize, Monitor and Debug Network Learning
- Speed Up Spark With Native Binaries
- Build a Recommendation Engine With DL4J
- Use Recurrent Networks in DL4J
- Build Complex Network Architectures with Computation Graph
- Train Networks using Early Stopping
- Download Snapshots With Maven
- Customize a Loss Function
- Introduction to Neural Networks
- Multilayer Neural Nets
- Tutorials
- Datasets
- Scaleout
- Text
- Resources
- DL4J, Torch7, Theano and Caffe
- Glossary of Terms for Deep Learning and Neural Nets
- Deep Learning’s Accuracy
- DataVec: ETL for ML
- ND4J Backends: How They Work
- Model Zoo
- Unsupervised Learning: Use Cases
- Eigenvectors, PCA, Covariance and Entropy
- Thought Vectors, AI and NLP
- Questions to Ask When Applying DL
- AI, Machine Learning and Deep Learning
- DL and Reinforcement Learning
- Javadoc: DL4J Methods and Classes
- Canova Javadoc: Canova Methods and Classes
- ND4J User Guide
- ND4J Javadoc
- Scala, Spark and Deep Learning
- Further Reading on Deep Learning
- Deep Learning in Other Languages
- Use Cases
- Architecture
- Features
- Roadmap
- About
- Open Data
- Latest Release Notes
Tags: data analysis, big data, data mining, software development, machine learning