Time series decomposition works by splitting a time series into three components: seasonality, trend and random fluctuation. To show how this works, we will study the decompose() and stl() functions in the R language.
Eurostat – European Statistics
Posted by Armando Brito Mendes | Filed under data sets, statistics
Extracting Seasonality and Trend from Data: Decomposition Using R
Posted by Armando Brito Mendes | Filed under statistics, Operations Research, lessons, programming languages, teaching materials, materials for professionals
An excellent description of classical decomposition using Python and R.
Understanding Decomposition
Decompose One Time Series into Multiple Series
Time series decomposition is a mathematical procedure that transforms a time series into multiple different time series. The original series is often split into three component series (a short R sketch follows the list):
- Seasonal: Patterns that repeat with a fixed period of time. For example, a website might receive more visits during weekends; this would produce data with a seasonality of 7 days.
- Trend: The underlying trend of the metrics. A website increasing in popularity should show a general trend that goes up.
- Random: Also called “noise”, “irregular” or “remainder”, this is the residual of the original time series once the seasonal and trend components are removed.
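As a minimal sketch of the idea (using R's built-in AirPassengers dataset rather than any data from the post), the decompose() and stl() functions return exactly these three component series:

```r
# Minimal sketch: classical and STL decomposition of a built-in R dataset.
# AirPassengers is monthly, so the seasonal period is 12.
data(AirPassengers)

# Classical decomposition via moving averages (decompose)
dec <- decompose(AirPassengers, type = "multiplicative")
plot(dec)                # seasonal, trend and random panels
head(dec$random)         # the remainder after removing seasonality and trend

# Seasonal-trend decomposition using Loess (stl); additive, so log-transform first
fit <- stl(log(AirPassengers), s.window = "periodic")
plot(fit)
```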
Tags: engineering, inference, optimization, forecasting
Biased vs Unbiased: Debunking Statistical Myths
Posted by Armando Brito Mendes | Filed under statistics
A reflection on the biases we use in data science.
Anyone who attended statistical training at the college level has been taught the four rules that you should always abide by when developing statistical models and predictions:
- You should only use unbiased estimates
- You should use estimates that have minimum variance
- In any optimization problem (for instance to compute an estimate from a maximum likelihood function, or to detect the best, most predictive subset of variables), you should always shoot for a global optimum, not a local one.
- And if you violate any of the above three rules, you should at least make sure that your estimate satisfies them asymptotically, as the number of observations grows large.
As a data scientist and ex-statistician, I violate these rules (especially #1 – #3) almost daily. Indeed, that’s part of what makes data science different from statistical science.
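As a hedged illustration of why rules 1 and 2 can pull in opposite directions (this example is mine, not the post's): the maximum-likelihood variance estimator, which divides by n, is biased, yet it often has a lower mean squared error than the unbiased estimator that divides by n - 1.

```r
# Compare MSE of the unbiased (divide by n - 1) and biased MLE (divide by n)
# variance estimators on small normal samples with true variance 1.
set.seed(42)
true_var <- 1

sims <- replicate(10000, {
  x <- rnorm(10, mean = 0, sd = sqrt(true_var))
  c(unbiased = var(x),                                   # divides by n - 1
    biased   = var(x) * (length(x) - 1) / length(x))     # divides by n
})

rowMeans((sims - true_var)^2)   # MSE per estimator; the biased one is typically smaller
```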
Tags: data analysis, data mining
Basketball Stat Cherry Picking
Posted by Armando Brito Mendes | Filed under statistics, visualization
Deep into the NBA playoffs, we are graced with stats-o-plenty before, during, and after every game. Some of the numbers are informative. Most of them are randomly used to illustrate a commentator’s point.
One of the most common stats is the conditional that says something like, “When player X scores at least Y points, the team wins 90 percent of their games.” It implies a cause-and-effect relationship.
The Cleveland Cavaliers won the most games when LeBron James scored 30 or more points. So James should just score that many points every time. Easy. I should be a coach.
It’s a bit of stat cherry picking, trying to find something in common among games won. So to make things easier, and for you to wow your friends during the games, I compiled winning percentages for several stats during the 2017-18 regular season. Select among the star players still in the playoffs.
Tags: data analysis
When Variable Reduction Doesn’t Work
Posted by Armando Brito Mendes | Filed under statistics, materials for professionals
A good example of how the usual procedures don't always work.
Summary: Exceptions sometimes make the best rules. Here’s an example of well accepted variable reduction techniques resulting in an inferior model and a case for dramatically expanding the number of variables we start with.
One of the things that keeps us data scientists on our toes is that well-established rules of thumb don't always work. Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables. And woe to you who violate this rule. Your model will overfit, include false random correlations, or at the very least will just be judged to be slow and clunky.
Certainly this is a rule I embrace when building models, so I was surprised and then delighted to find a well-conducted study by Lexis/Nexis that lays out a case where this clearly isn't true.
Tags: data mining, problems
How signal processing can be used to identify patterns in complex time series
Posted by Armando Brito Mendes | Filed under statistics, Operations Research
Using signal processing techniques on time series.
The trend and seasonality can be accounted for in a linear model by including sinusoidal components with a given frequency. However, finding the appropriate frequency for each sinusoidal component requires a little more digging. This post shows how to use fast Fourier transforms to find these frequencies.
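A minimal sketch of the idea (on synthetic data, not the post's dataset): compute the periodogram with R's fft() and read off the dominant frequency, which then sets the period of the sinusoidal regressor.

```r
# Locate the dominant frequency of a noisy seasonal signal with the FFT.
set.seed(1)
n <- 500
t <- 1:n
period <- 25                                   # hidden seasonal period
x <- 2 * sin(2 * pi * t / period) + rnorm(n)   # sinusoid + noise

spec  <- Mod(fft(x))^2            # periodogram (up to scaling)
freqs <- (0:(n - 1)) / n          # frequencies in cycles per observation
keep  <- 2:floor(n / 2)           # drop the DC term and the mirrored half

dominant_freq <- freqs[keep][which.max(spec[keep])]
1 / dominant_freq                 # recovered period, close to 25
```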
How To Forecast Time Series Data With Multiple Seasonal Periods
Posted by Armando Brito Mendes | Filed under statistics, mathematics, materials for professionals
Analysis of complex series with multiple seasonal periods.
Time series data is produced in domains such as IT operations, manufacturing, and telecommunications. Examples of time series data include the number of client logins to a website on a daily basis, cell phone traffic collected per minute, and temperature variation in a region by the hour. Forecasting a time series signal ahead of time helps us make decisions such as planning capacity and estimating demand. Previous time series analysis blog posts focused on processing time series data that resides on a Greenplum database using SQL functions. In this post, I will examine the modeling steps involved in forecasting a time series sequence with multiple seasonal periods. The steps involved are outlined below:
- Multiple seasonality is modelled with Fourier series of different periods
- External regressors in the form of Fourier terms are added to an ARIMA model to account for the seasonal behavior
- The Akaike Information Criterion (AIC) is used to select the best-fitting model (see the R sketch after this list)
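A hedged sketch of this workflow using the forecast package in R (the taylor half-hourly electricity demand data stands in for the post's own series, and the number of Fourier pairs K is fixed by hand here rather than tuned via AIC):

```r
# ARIMA with Fourier-term regressors for a series with two seasonal periods.
library(forecast)

# taylor: half-hourly electricity demand with daily (48) and weekly (336) seasonality
y <- msts(taylor, seasonal.periods = c(48, 336))

# Fourier terms for each seasonal period enter the model as external regressors;
# K sets how many sine/cosine pairs per period (in practice, compare choices by AIC)
xreg <- fourier(y, K = c(5, 5))
fit  <- auto.arima(y, xreg = xreg, seasonal = FALSE)

# Forecast 2 days (96 half-hours) ahead with matching future Fourier terms
fc <- forecast(fit, xreg = fourier(y, K = c(5, 5), h = 96))
plot(fc)
```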
Tags: forecasting
How To Use Multivariate Time Series Techniques For Capacity Planning on VMs
Posted by Armando Brito Mendes | Filed under statistics, Operations Research, teaching materials
Multivariate methods for time series from VMs.
Capacity planning is an arduous, ongoing task for many operations teams, especially for those who rely on Virtual Machines (VMs) to power their business. At Pivotal, we have developed a data science model capable of forecasting with hundreds of thousands of models to automate this task using a multivariate time series approach. Open to reuse for other areas such as industrial equipment or vehicle engines, this technique can be applied broadly to anything where regular monitoring data can be collected.
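As a hypothetical sketch of a multivariate approach (this is not Pivotal's implementation; the metrics below are simulated), a vector autoregression fitted with the vars package forecasts two related VM metrics jointly:

```r
# Fit a VAR to two related VM metrics (e.g. CPU and memory) and forecast both.
library(vars)   # assumes the 'vars' package is installed

set.seed(7)
n   <- 200
cpu <- arima.sim(list(ar = 0.7), n)
mem <- 0.5 * cpu + arima.sim(list(ar = 0.5), n)   # memory roughly tracks CPU
metrics <- ts(cbind(cpu = cpu, mem = mem))

# Choose the lag order by AIC, fit the VAR, forecast 24 steps ahead
p   <- VARselect(metrics, lag.max = 8)$selection["AIC(n)"]
fit <- VAR(metrics, p = p)
fc  <- predict(fit, n.ahead = 24)
plot(fc)
```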
Tags: data mining, machine learning, forecasting
Three classes of metrics: centrality, volatility, and bumpiness
Posted by Armando Brito Mendes | Filed under statistics, Operations Research
Introduces a new class of statistics for time series: bumpiness.
All statistical textbooks focus on centrality (median, average or mean) and volatility (variance). None mention the third fundamental class of metrics: bumpiness.
Here we introduce the concept of bumpiness and show how it can be used. Two different datasets can have the same mean and variance, but a different bumpiness. Bumpiness is linked to how the data points are ordered, while centrality and volatility completely ignore order. So bumpiness is useful for datasets where order matters, in particular time series. Also, bumpiness integrates the notion of dependence among the data points, while centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness. The converse is also true.
The attached Excel spreadsheet shows computations of the bumpiness coefficient r for various time series. It is also of interest to readers who wish to learn new Excel concepts such as random number generation with Rand, indirect references with Indirect, Rank, Large and other powerful but not well-known Excel functions. It is also an example of a fully interactive Excel spreadsheet driven by two core parameters.
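The post's exact formula for r lives in that spreadsheet; as a hedged illustration of the idea, one simple order-sensitive measure is the lag-1 correlation between successive values. Two series with identical values, and therefore identical mean and variance, can differ sharply on it:

```r
# Same values, same mean and variance, different "bumpiness"
# (approximated here by lag-1 correlation; the post's r may differ).
vals   <- 1:10
smooth <- vals                                # sorted: consecutive points close together
bumpy  <- c(1, 10, 2, 9, 3, 8, 4, 7, 5, 6)    # same values, alternating order

lag1_cor <- function(x) cor(x[-length(x)], x[-1])

c(mean(smooth) == mean(bumpy), var(smooth) == var(bumpy))  # TRUE TRUE
lag1_cor(smooth)   # close to +1: low bumpiness
lag1_cor(bumpy)    # strongly negative: high bumpiness
```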
Finally, this article shows (1) how a new concept is thought of, (2) how a robust, modern definition then materializes, and (3) how a more meaningful definition is eventually created, based on and compatible with previous science.
Tags: data mining, forecasting
Recurrent neural networks, Time series data and IoT – Part One
Posted by Armando Brito Mendes | Filed under statistics, Operations Research, teaching materials
Using neural networks to forecast univariate series.
RNNs are already used for time series analysis. Because IoT problems can often be modelled as a time series, RNNs could apply to IoT data. In this multi-part blog, we first discuss time series applications and then discuss how RNNs could apply to them. Finally, we discuss applicability to IoT.
In this article (Part One), we present the overall thought process behind the use of recurrent neural networks in time series applications, especially a type of RNN called Long Short-Term Memory (LSTM) networks.
Tags: data mining, machine learning, forecasting