Time series decomposition splits a time series into three components: seasonality, trend, and random fluctuation. To show how this works, we will study the decompose() and stl() functions in the R language.

Posted by Armando Brito Mendes | Filed under data mining, statistics

A good text with unique conclusions.

1. Re-sampling and Statistical Inference

- Main Result
- Sampling with or without Replacement
- Illustration
- Optimum Sample Size
- Optimum *K* in *K*-fold Cross-Validation
- Confidence Intervals, Tests of Hypotheses

2. Generic, All-purposes Algorithm

- Re-sampling Algorithm with Source Code
- Alternative Algorithm
- Using a Good Random Number Generator

3. Applications

- A Challenging Data Set
- Results and Excel Spreadsheet
- A New Fundamental Statistics Theorem
- Some Statistical Magic
- How does this work?
- Does this contradict entropy principles?

4. Conclusions

Tags: inference, machine learning

## Eurostat – European Statistics

Posted by Armando Brito Mendes | Filed under data sets, statistics

## Extracting Seasonality and Trend from Data: Decomposition Using R

Posted by Armando Brito Mendes | Filed under statistics, operations research, lessons, programming languages, teaching materials, materials for professionals

An excellent description of classical decomposition with Python and R.

## Understanding Decomposition

#### Decompose One Time Series into Multiple Series

Time series decomposition is a mathematical procedure which transforms a time series into multiple different time series. The original time series is often split into 3 component series:

- **Seasonal:** Patterns that repeat with a fixed period of time. For example, a website might receive more visits during weekends; this would produce data with a seasonality of 7 days.
- **Trend:** The underlying trend of the metric. A website increasing in popularity should show a general trend that goes up.
- **Random:** Also called “noise”, “irregular” or “remainder,” these are the residuals of the original time series after the seasonal and trend series are removed.
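
The three components can be illustrated with a minimal sketch of additive classical decomposition in Python with NumPy. It is similar in spirit to the moving-average approach behind R's decompose(), but it is not that function's implementation. The function name `classical_decompose`, the `weekly` pattern, and the synthetic series are assumptions made up for illustration.

```python
import numpy as np

def classical_decompose(y, period):
    """Additive classical decomposition: y = trend + seasonal + random."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Trend: centered moving average over one full period.
    # For an even period, use a 2 x period average with half weights at the ends.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # edges stay NaN: the window does not fit there
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    # Seasonal: average the detrended values at each position in the cycle,
    # then center the pattern so it sums to zero.
    detrended = y - trend
    pattern = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    pattern -= pattern.mean()
    seasonal = np.tile(pattern, n // period + 1)[:n]
    # Random: whatever is left after removing trend and seasonality.
    random = y - trend - seasonal
    return trend, seasonal, random

# Hypothetical daily website visits: upward trend plus a weekend bump (period 7).
rng = np.random.default_rng(0)
t = np.arange(70)
weekly = np.array([-2, -2, -1, -1, 0, 3, 3], dtype=float)  # assumed pattern, sums to 0
y = 0.5 * t + np.tile(weekly, 10) + rng.normal(0, 0.3, size=t.size)

trend, seasonal, random = classical_decompose(y, period=7)
print(np.round(seasonal[:7], 2))  # estimated weekly pattern
```

The trend is undefined for the first and last few observations because the moving-average window does not fit there; those entries are left as NaN rather than extrapolated.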

Tags: engineering, inference, optimization, forecasting

## Biased vs Unbiased: Debunking Statistical Myths

Posted by Armando Brito Mendes | Filed under data mining, statistics

A reflection on the biases we use in data science.

Anyone who attended statistical training at the college level has been taught the four rules that you should always abide by, when developing statistical models and predictions:

- You should only use unbiased estimates
- You should use estimates that have minimum variance
- In any optimization problem (for instance to compute an estimate from a maximum likelihood function, or to detect the best, most predictive subset of variables), you should always shoot for a global optimum, not a local one.
- And if you violate any of the above three rules, at least you need to make sure that your estimate, when the number of observations is large, satisfies them.

As a data scientist and ex-statistician, I violate these rules (especially #1 – #3) almost daily. Indeed, that’s part of what makes data science different from statistical science.
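
One standard textbook case where breaking rule #1 pays off, offered here as an illustration and not as the author's own example: for normally distributed data, the variance estimate that divides by n + 1 is biased, yet it has a lower mean squared error than the unbiased estimate that divides by n - 1. The simulation below is a sketch under those assumptions; the sample size, number of trials, and true variance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials, true_var = 10, 20000, 4.0  # arbitrary illustrative settings

# Draw many small samples and compute each sample's sum of squared deviations.
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

def mse(divisor):
    """Mean squared error of the variance estimate ss / divisor."""
    estimates = ss / divisor
    return float(np.mean((estimates - true_var) ** 2))

mse_unbiased = mse(n - 1)  # classical unbiased estimator
mse_biased = mse(n + 1)    # biased, but shrinks toward zero

print(f"unbiased (n-1): MSE = {mse_unbiased:.3f}")
print(f"biased   (n+1): MSE = {mse_biased:.3f}")
```

The biased estimator trades a small systematic error for a larger reduction in variance, which is the kind of trade-off the rules above forbid.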

Tags: data analysis

## Basketball Stat Cherry Picking

Posted by Armando Brito Mendes | Filed under statistics, visualization

Deep into the NBA playoffs, we are graced with stats-o-plenty before, during, and after every game. Some of the numbers are informative. Most of them are randomly used to illustrate a commentator’s point.

One of the most common stats is the conditional that says something like, “When player *X* scores at least *Y* points, the team wins 90 percent of their games.” It implies a cause-and-effect relationship.

The Cleveland Cavaliers won the most games when LeBron James scored 30 or more points. So James should just score that many points every time. Easy. I should be a coach.

It’s a bit of stat cherry picking, trying to find something in common among games won. So to make things easier, and for you to wow your friends during the games, I compiled winning percentages for several stats during the 2017-18 regular season. Select among the star players still in the playoffs.

Tags: data analysis

## When Variable Reduction Doesn’t Work

Posted by Armando Brito Mendes | Filed under data mining, statistics, materials for professionals

A good example of how the usual procedures do not always work.

***Summary:** Exceptions sometimes make the best rules. Here’s an example of well-accepted variable reduction techniques resulting in an inferior model and a case for dramatically expanding the number of variables we start with.*

One of the things that keeps us data scientists on our toes is that the well-established rules of thumb don’t always work. Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables. And woe to you who violate this rule. Your model will overfit, include false random correlations, or at the very least will be judged slow and clunky.

Certainly this is a rule I embrace when building models, so I was surprised and then delighted to find a well-conducted study by Lexis/Nexis that lays out a case where this clearly isn’t true.

Tags: problems

## How signal processing can be used to identify patterns in complex time series

Posted by Armando Brito Mendes | Filed under statistics, operations research

Use of signal processing techniques on time series.

The trend and seasonality can be accounted for in a linear model by including sinusoidal components with a given frequency. However, finding the appropriate frequency for each sinusoidal component requires a little more digging. This post shows how to use fast Fourier transforms to find these frequencies.
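
As a rough sketch of the idea, and not the linked post's code, the strongest periodic components of a series can be read off the peaks of its FFT power spectrum. The helper name `dominant_periods` and the synthetic hourly signal (daily and weekly cycles) are assumptions for illustration.

```python
import numpy as np

def dominant_periods(y, top=2):
    """Return the periods (in samples) of the `top` strongest spectral peaks."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                        # remove the mean so bin 0 does not dominate
    power = np.abs(np.fft.rfft(y)) ** 2     # power spectrum of the real-valued signal
    freqs = np.fft.rfftfreq(len(y), d=1.0)  # frequencies in cycles per sample
    # Rank all bins except the zero-frequency one by power, strongest first.
    order = np.argsort(power[1:])[::-1][:top] + 1
    return [float(1.0 / freqs[i]) for i in order]

# Hypothetical hourly signal: daily (24 h) and weekly (168 h) cycles plus noise.
rng = np.random.default_rng(3)
t = np.arange(24 * 7 * 4)
y = (3.0 * np.sin(2 * np.pi * t / 24)
     + 1.5 * np.sin(2 * np.pi * t / 168)
     + rng.normal(0, 0.5, size=t.size))

periods = dominant_periods(y, top=2)
print([round(p, 1) for p in periods])
```

A series with a strong trend would normally be detrended first, since a trend concentrates power in the lowest frequencies and can mask the seasonal peaks.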

## How To Forecast Time Series Data With Multiple Seasonal Periods

Posted by Armando Brito Mendes | Filed under statistics, mathematics, materials for professionals

Analysis of complex series with multiple seasonal periods.

Time series data is produced in domains such as IT operations, manufacturing, and telecommunications. Examples of time series data include the number of client logins to a website on a daily basis, cell phone traffic collected per minute, and temperature variation in a region by the hour. Forecasting a time series signal ahead of time helps us make decisions such as planning capacity and estimating demand. Previous time series analysis blog posts focused on processing time series data that resides on Greenplum database using SQL functions. In this post, I will examine the modeling steps involved in forecasting a time series sequence with multiple seasonal periods. The various steps involved are outlined below:

- Multiple seasonality is modelled with the help of Fourier series with different periods
- External regressors in the form of Fourier terms are added to an ARIMA model to account for the seasonal behavior
- The Akaike Information Criterion (AIC) is used to find the best-fit model
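
The steps above can be sketched in a simplified form. To stay dependency-free, this example swaps the ARIMA model for plain least squares on the Fourier regressors, so it shows only the Fourier-term construction and the AIC comparison, not the post's full method. The function names, the candidate grid, and the synthetic hourly series are assumptions for illustration.

```python
import numpy as np

def fourier_terms(t, period, K):
    """Columns sin(2*pi*k*t/period) and cos(2*pi*k*t/period) for k = 1..K."""
    cols = []
    for k in range(1, K + 1):
        angle = 2 * np.pi * k * t / period
        cols += [np.sin(angle), np.cos(angle)]
    return np.column_stack(cols)

def ols_aic(X, y):
    """Gaussian AIC of a least-squares fit, up to an additive constant."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * (p + 1)  # +1 for the noise variance

# Hypothetical hourly series with daily (24) and weekly (168) seasonality.
rng = np.random.default_rng(1)
t = np.arange(24 * 7 * 6)
y = (10 + 3 * np.sin(2 * np.pi * t / 24)
     + 2 * np.cos(2 * np.pi * t / 168)
     + rng.normal(0, 0.5, size=t.size))

def design(k_day, k_week):
    """Intercept plus the requested number of daily and weekly harmonics."""
    parts = [np.ones((t.size, 1))]
    if k_day:
        parts.append(fourier_terms(t, 24, k_day))
    if k_week:
        parts.append(fourier_terms(t, 168, k_week))
    return np.hstack(parts)

# Score every combination of 0..2 harmonics per period and keep the lowest AIC.
candidates = {(kd, kw): ols_aic(design(kd, kw), y)
              for kd in range(3) for kw in range(3)}
best = min(candidates, key=candidates.get)
print("best (daily K, weekly K):", best)
```

Models that omit either seasonal period score far worse, while extra harmonics are penalized by the 2 × (number of parameters) term, which is how AIC balances fit against complexity.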

Tags: forecasting

## How To Use Multivariate Time Series Techniques For Capacity Planning on VMs

Posted by Armando Brito Mendes | Filed under data mining, statistics, operations research, teaching materials

Multivariate methods for time series with VMs.

Capacity planning is an arduous, ongoing task for many operations teams, especially for those who rely on Virtual Machines (VMs) to power their business. At Pivotal, we have developed a data science model capable of forecasting hundreds of thousands of models to automate this task using a multivariate time series approach. Open to reuse in other areas such as industrial equipment or vehicle engines, this technique can be applied broadly wherever regular monitoring data can be collected.

Tags: machine learning, forecasting

## Three classes of metrics: centrality, volatility, and bumpiness

Posted by Armando Brito Mendes | Filed under data mining, statistics, operations research

Introduces a new class of statistics for time series: bumpiness.

All statistical textbooks focus on centrality (median, average or mean) and volatility (variance). None mention the third fundamental class of metrics: bumpiness.

Here we introduce the concept of *bumpiness* and show how it can be used. Two different datasets can have the same *mean* and *variance*, but a different *bumpiness*. Bumpiness is linked to how the data points are ordered, while centrality and volatility completely ignore order. So, bumpiness is useful for datasets where order matters, in particular time series. Also, bumpiness integrates the notion of dependence (among the data points), while centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness. The converse is also true.
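
The point about ordering can be illustrated with a small sketch. It uses lag-1 autocorrelation as an assumed stand-in for an order-sensitive statistic; the article's own bumpiness coefficient may be defined differently. The same set of values is arranged in two orders, so mean and variance match while the order-sensitive statistic does not.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation: an order-sensitive statistic (assumed proxy)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d ** 2))

rng = np.random.default_rng(7)
values = rng.normal(size=200)

smooth = np.sort(values)             # same values, slowly varying order
shuffled = rng.permutation(values)   # same values, random order

for name, series in [("smooth", smooth), ("shuffled", shuffled)]:
    print(f"{name:9s} mean={series.mean():.3f} "
          f"var={series.var():.3f} lag1={lag1_autocorr(series):.3f}")
```

Because both series contain exactly the same values, any statistic that ignores order cannot tell them apart; only a statistic that depends on neighboring points can.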

The attached Excel spreadsheet shows computations of the bumpiness coefficient r for various time series. It is also of interest to readers who wish to learn new Excel concepts such as random number generation with Rand, indirect references with Indirect, Rank, Large and other powerful but not well-known Excel functions. It is also an example of a fully interactive Excel spreadsheet driven by two core parameters.

Finally, this article shows (1) how a new concept is conceived, (2) how a robust, modern definition then materializes, and (3) how a more meaningful definition is eventually created, based on and compatible with previous science.

Tags: forecasting