Posted by Armando Brito Mendes | Filed under data mining, statistics

A good article with unique conclusions

1. Re-sampling and Statistical Inference

- Main Result
- Sampling with or without Replacement
- Illustration
- Optimum Sample Size
- Optimum *K* in *K*-fold Cross-Validation
- Confidence Intervals, Tests of Hypotheses

2. Generic, All-purposes Algorithm

- Re-sampling Algorithm with Source Code
- Alternative Algorithm
- Using a Good Random Number Generator

3. Applications

- A Challenging Data Set
- Results and Excel Spreadsheet
- A New Fundamental Statistics Theorem
- Some Statistical Magic
- How does this work?
- Does this contradict entropy principles?

4. Conclusions

Tags: inference, machine learning

## Biased vs Unbiased: Debunking Statistical Myths

Posted by Armando Brito Mendes | Filed under data mining, statistics

A reflection on the biases we use in data science.

Anyone who attended statistical training at the college level has been taught the four rules that you should always abide by, when developing statistical models and predictions:

- You should only use unbiased estimates
- You should use estimates that have minimum variance
- In any optimization problem (for instance to compute an estimate from a maximum likelihood function, or to detect the best, most predictive subset of variables), you should always shoot for a global optimum, not a local one.
- And if you violate any of the above three rules, at least you need to make sure that your estimate, when the number of observations is large, satisfies them.

As a data scientist and ex-statistician, I violate these rules (especially #1 – #3) almost daily. Indeed, that’s part of what makes data science different from statistical science.
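To make the trade-off behind rule #1 concrete, here is a minimal sketch (not taken from the original article) comparing two variance estimators on simulated normal data: the usual unbiased one, which divides by n - 1, and a biased one that divides by n + 1. The function names, sample size, and number of trials are illustrative assumptions. For normal samples the biased version is known to have a lower mean squared error, which is one reason a practitioner might knowingly accept bias.

```python
import random

def variance_estimates(sample):
    """Return (unbiased, biased) variance estimates for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    # Dividing by n - 1 is unbiased; dividing by n + 1 is biased downward.
    return ss / (n - 1), ss / (n + 1)

def compare_mse(n=10, trials=20000, true_var=1.0, seed=0):
    """Monte Carlo estimate of bias and mean squared error for both estimators."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = [0.0, 0.0]
    sq_err = [0.0, 0.0]
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_var ** 0.5) for _ in range(n)]
        for i, est in enumerate(variance_estimates(sample)):
            total[i] += est
            sq_err[i] += (est - true_var) ** 2
    bias = [t / trials - true_var for t in total]
    mse = [s / trials for s in sq_err]
    return bias, mse

bias, mse = compare_mse()
print("unbiased (n-1): bias=%.4f mse=%.4f" % (bias[0], mse[0]))
print("biased   (n+1): bias=%.4f mse=%.4f" % (bias[1], mse[1]))
```

The simulation only covers normally distributed data with a small sample; with other distributions or larger samples the gap between the two estimators can shrink or change.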

Tags: data analysis

## The 5 Computer Vision Techniques

Posted by Armando Brito Mendes | Filed under data mining, lessons, teaching materials

A good introduction to the topic of computer vision

# The 5 Computer Vision Techniques That Will Change How You See The World

Computer Vision is one of the hottest research fields within Deep Learning at the moment. It sits at the intersection of many academic subjects, such as Computer Science (Graphics, Algorithms, Theory, Systems, Architecture), Mathematics (Information Retrieval, Machine Learning), Engineering (Robotics, Speech, NLP, Image Processing), Physics (Optics), Biology (Neuroscience), and Psychology (Cognitive Science). As Computer Vision represents a relative understanding of visual environments and their contexts, many scientists believe the field paves the way towards Artificial General Intelligence due to its cross-domain mastery.

So **what is Computer Vision?**

Tags: machine learning, robot

## SQL Server Data Mining News

Posted by Armando Brito Mendes | Filed under Databases, data mining

A site presenting Microsoft's vision for data mining

# Welcome to SQLServerDataMining.com

This site has been designed by the SQL Server Data Mining team to provide the SQL Server community with access to and information about our in-database data mining and analytics features. SQL Server 2000 was the first major database release to put analytics in the database. Catch up with the latest SQL Server Data Mining news in our newsletter.

### SQL Server 2012 SP1 Data Mining Add-ins for Office (with 32-bit or 64-bit Support)

The Data Mining Add-ins allow you to harness the power of SQL Server 2012 predictive analytics in Excel and Visio and they have been updated to include 32-bit or 64-bit support for Office 2010 or Office 2013. Use Table Analysis Tools to get insight with a couple of clicks. Use the Data Mining tab for full-lifecycle data mining, and build models which can be exported to a production server. Visualize your models in Visio.

### SQL Server 2012 Data Mining

Microsoft expert Rafal Lukawiecki provides free and paid videos on data mining for SQL Server 2012 at Project Botticelli. The website has other Microsoft BI topics too from leading Microsoft experts.

### SQL Server DM with Excel 2010 and PowerPivot

Microsoft MVP Mark Tabladillo shows you how to unleash SQL Server 2008 Data Mining with Excel 2010 and SQL Server PowerPivot for Excel, Microsoft’s new self-service BI offering.

Tags: DW \ BI, SQL, text mining

## When Variable Reduction Doesn’t Work

Posted by Armando Brito Mendes | Filed under data mining, statistics, materials for professionals

A good example of how the usual procedures do not always work

***Summary:*** *Exceptions sometimes make the best rules. Here’s an example of well-accepted variable reduction techniques resulting in an inferior model and a case for dramatically expanding the number of variables we start with.*

One of the things that keeps us data scientists on our toes is that the well-established rules of thumb don’t always work. Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables. And woe to you who violate this rule. Your model will overfit, include false random correlations, or at the very least will be judged slow and clunky.

Certainly this is a rule I embrace when building models, so I was surprised and then delighted to find a well-conducted study by Lexis/Nexis that lays out a case where this clearly isn’t true.

Tags: problems

## How To Use Multivariate Time Series Techniques For Capacity Planning on VMs

Posted by Armando Brito Mendes | Filed under data mining, statistics, Operations Research, teaching materials

Multivariate methods for time series with VMs

Capacity planning is an arduous, ongoing task for many operations teams, especially for those who rely on Virtual Machines (VMs) to power their business. At Pivotal, we have developed a data science model capable of forecasting hundreds of thousands of models to automate this task using a multivariate time series approach. Open to reuse in other areas such as industrial equipment or vehicle engines, this technique can be applied broadly wherever regular monitoring data can be collected.

Tags: machine learning, forecasting

## Three classes of metrics: centrality, volatility, and bumpiness

Posted by Armando Brito Mendes | Filed under data mining, statistics, Operations Research

Introduces a new class of statistics for time series: bumpiness

All statistical textbooks focus on centrality (median, average or mean) and volatility (variance). None mention the third fundamental class of metrics: bumpiness.

Here we introduce the concept of *bumpiness* and show how it can be used. Two different datasets can have the same *mean* and *variance* but a different *bumpiness*. Bumpiness is linked to how the data points are ordered, while centrality and volatility completely ignore order. So, bumpiness is useful for datasets where order matters, in particular time series. Also, bumpiness integrates the notion of dependence (among the data points), while centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness. The converse is also true.
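As a rough illustration of why ordering matters, the sketch below builds two series from exactly the same values, so their mean and variance are identical, and then compares an order-sensitive statistic. The excerpt does not give the formula for the bumpiness coefficient, so lag-1 autocorrelation is used here purely as an assumed stand-in; the function name and the sample values are hypothetical.

```python
from statistics import mean, pvariance

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: an order-sensitive statistic (assumed proxy, not the article's formula)."""
    m = mean(series)
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# Same ten values in two different orders.
smooth = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2]   # rises, then falls gently
bumpy = [1, 6, 2, 5, 3, 4, 2, 5, 3, 4]    # zig-zags between low and high

for name, series in (("smooth", smooth), ("bumpy", bumpy)):
    print(name, mean(series), pvariance(series),
          round(lag1_autocorrelation(series), 3))
```

Mean and variance cannot tell the two series apart, while the order-sensitive statistic is positive for the gently varying series and negative for the zig-zag one.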

The attached Excel spreadsheet shows computations of the bumpiness coefficient r for various time series. It is also of interest to readers who wish to learn new Excel concepts such as random number generation with Rand, indirect references with Indirect, Rank, Large and other powerful but little-known Excel functions. It is also an example of a fully interactive Excel spreadsheet driven by two core parameters.

Finally, this article shows (1) how a new concept is conceived, (2) how a robust, modern definition then materializes, and (3) how a more meaningful definition is eventually created, based on and compatible with previous science.

Tags: forecasting

## Recurrent neural networks, Time series data and IoT – Part One

Posted by Armando Brito Mendes | Filed under data mining, statistics, Operations Research, teaching materials

Using neural networks to forecast univariate series

RNNs are already used for Time series analysis. Because IoT problems can often be modelled as a Time series, RNNs could apply to IoT data. In this multi-part blog, we first discuss Time series applications and then discuss how RNNs could apply to Time series applications. Finally, we discuss applicability to IoT.

In this article (Part One), we present the overall thought process behind the use of Recurrent neural networks and Time series applications – especially a type of RNN called Long Short Term Memory networks (LSTMs).
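Before a sequence model such as an LSTM can be trained on sensor readings, the series is commonly reshaped into fixed-length input windows, each paired with the value that follows it. The sketch below shows that framing step only; it is a generic illustration rather than the article's own code, and the function name `make_windows`, the window length, and the sample readings are all hypothetical.

```python
def make_windows(series, window):
    """Split a series into (input window, next value) pairs for supervised training."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])  # the last `window` observations
        targets.append(series[start + window])       # the value to be predicted
    return inputs, targets

# Hypothetical temperature readings from a single IoT sensor.
readings = [20.1, 20.4, 20.9, 21.3, 21.0, 20.6]
X, y = make_windows(readings, window=3)

for window_values, target in zip(X, y):
    print(window_values, "->", target)
```

Each row of `X` would then be fed to the network as one input sequence, with the matching entry of `y` as the training target.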

Tags: machine learning, forecasting

## Time Series Analysis using R-Forecast package

Posted by Armando Brito Mendes | Filed under data mining, statistics, Operations Research

Demonstrates some of the features of the R forecast package

In today’s blog post, we look at time series analysis using the R package forecast. The objective of the post is to explain the different methods available in the forecast package that can be applied to time series analysis and forecasting.

Tags: forecasting, R-software

## The 7 Most Important Data Mining Techniques

Posted by Armando Brito Mendes | Filed under data mining, materials for professionals

A short introduction to some of the most widely used data mining methods

Data mining is the process of looking at large banks of information to generate new information. Intuitively, you might think that data “mining” refers to the extraction of new data, but this isn’t the case; instead, data mining is about extrapolating patterns and new knowledge from the data you’ve already collected.

Relying on techniques and technologies from the intersection of database management, statistics, and machine learning, specialists in data mining have dedicated their careers to better understanding how to process and draw conclusions from vast amounts of information. But what are the techniques they use to make this happen?