Axes of evil: How to lie with graphs

A blog with examples and links to other sites.

As Mark Twain once said, “Never let the truth get in the way of a good story.” Here are a few techniques to hide those pesky numbers and tell the story you feel, not the one you can prove.

Don your handlebar mustache and practice your evil laugh — we’re going in.

A Beginner’s Guide to learn web scraping with python!

A good description of web scraping with Python.

Web Scraping with Python

Imagine you have to pull a large amount of data from websites and you want to do it as quickly as possible. How would you do it without manually going to each website and getting the data? Well, “Web Scraping” is the answer. Web Scraping just makes this job easier and faster. 
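
To make the idea concrete, here is a minimal sketch of the extraction step using only Python's standard library. The `SAMPLE_HTML` string, the `LinkCollector` class, and the `extract_links` helper are made up for illustration and are not taken from the linked article; in practice the HTML would first be downloaded from the site, and the article may well use a different library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page fragment standing in for a downloaded document.
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
"""


def extract_links(html):
    """Parse an HTML string and return the links it contains."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
print(links)
```

The same pattern scales to many pages: fetch each page, feed it to the parser, and store the collected rows.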

In this article on Web Scraping with Python, you will learn about web scraping in brief and see how to extract data from a website with a demonstration. I will be covering the following topics:

The Beautiful Hidden Logic of Cities

Patterns identified in city maps.

After finishing my map of the most common road suffixes by length, I realized I could also map each individual road, colored by its suffix. This has led to the loveliest maps I’ve made.

Driving around your city, you’re probably somewhat aware of Avenues and Boulevards and Streets and Roads and so on. Here in Portland, at least, I know that Avenues run north-south and Streets run east-west. However, it’s hard to get an overall view of how all these road designations knit together. By coloring them, we can suddenly see a new, stunning view of what we normally take for granted.

Making of the Illustrations of the Natural Orders of Plants

If someone had told me when I was young that I would spend three months of my time tracing nineteenth-century botanical illustrations and enjoy it, I would have scoffed, but that’s what I did to reproduce Elizabeth Twining’s Illustrations of the Natural Orders of Plants, and I loved every minute.

After the unexpected successes of my Byrne’s Euclid and Werner’s Nomenclature of Colours projects (for which I’m very grateful) I got the itch to follow them up with another reproduction of an obscure catalog from the 1800s. However, finding interesting obscure catalogs wasn’t an easy task when I didn’t know what would pique my interest. Anything was fair game but I had an inkling that something based on the sciences would be most interesting. Scientific catalogs are organized, structured, and data can be extracted from them with some elbow grease.

Extracting Seasonality and Trend from Data: Decomposition Using R

An excellent description of classical decomposition with Python and R.

Time series decomposition works by splitting a time series into three components: seasonality, trend, and random fluctuation. To show how this works, we will study the decompose() and stl() functions in the R language.

Understanding Decomposition

Decompose One Time Series into Multiple Series

Time series decomposition is a mathematical procedure which transforms a time series into multiple different time series. The original time series is often split into 3 component series:

  • Seasonal: Patterns that repeat with a fixed period of time. For example, a website might receive more visits during weekends; this would produce data with a seasonality of 7 days.
  • Trend: The underlying trend of the metrics. A website increasing in popularity should show a general trend that goes up.
  • Random: Also called “noise,” “irregular,” or “remainder,” these are the residuals of the original time series after the seasonal and trend series are removed.
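
The three components above can be sketched in a few lines of Python with NumPy. This is a simplified additive decomposition (centered moving average for the trend, per-position averages for the seasonality), written as an illustration of the idea rather than a reimplementation of the R functions the article covers. The function name `decompose_additive` and the synthetic website-visits data are assumptions made for the example.

```python
import numpy as np


def decompose_additive(y, period):
    """Split y into trend + seasonal + remainder (additive model)."""
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Trend: centered moving average over one full period.
    # For an even period the two end points get half weight.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)          # edges stay NaN (window incomplete)
    trend[half:n - half] = np.convolve(y, weights, mode="valid")

    # Seasonal: average the detrended values at each position in the cycle,
    # then center the pattern so it sums to zero over one period.
    detrended = y - trend
    pattern = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    pattern -= pattern.mean()
    seasonal = np.tile(pattern, n // period + 1)[:n]

    # Remainder: whatever is left after removing trend and seasonality.
    remainder = y - trend - seasonal
    return trend, seasonal, remainder


# Synthetic daily visits: upward trend, a bump on days 5 and 6 of each week,
# and a little noise. All numbers here are invented for the example.
rng = np.random.default_rng(0)
t = np.arange(70)
visits = 0.5 * t + 10.0 * (t % 7 >= 5) + rng.normal(0.0, 0.5, len(t))

trend, seasonal, remainder = decompose_additive(visits, period=7)
print(np.round(seasonal[:7], 2))
```

The first and last few trend values are NaN because the moving-average window does not fit there, which mirrors the missing edge values a classical decomposition produces.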

Making it easier to discover datasets

A new Google tool for finding datasets.

In today’s world, scientists in many disciplines and a growing number of journalists live and breathe data. There are many thousands of data repositories on the web, providing access to millions of datasets; and local and national governments around the world publish their data as well. To enable easy access to this data, we launched Dataset Search, so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.

When Variable Reduction Doesn’t Work

A good example of how the usual procedures do not always work.

Summary: Exceptions sometimes make the best rules. Here’s an example of well-accepted variable reduction techniques resulting in an inferior model and a case for dramatically expanding the number of variables we start with.

One of the things that keeps us data scientists on our toes is that the well-established rules of thumb don’t always work. Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables. And woe to you who violate this rule. Your model will overfit, include false random correlations, or at the very least will just be judged to be slow and clunky.

Certainly this is a rule I embrace when building models, so I was surprised and then delighted to find a well-conducted study by Lexis/Nexis that lays out a case where this clearly isn’t true.

How To Forecast Time Series Data With Multiple Seasonal Periods

Analysis of complex series with multiple seasonal periods.

Time series data is produced in domains such as IT operations, manufacturing, and telecommunications. Examples of time series data include the number of client logins to a website on a daily basis, cell phone traffic collected per minute, and temperature variation in a region by the hour. Forecasting a time series signal ahead of time helps us make decisions such as planning capacity and estimating demand. Previous time series analysis blog posts focused on processing time series data that resides in a Greenplum database using SQL functions. In this post, I will examine the modeling steps involved in forecasting a time series sequence with multiple seasonal periods. The various steps involved are outlined below:

  • Multiple seasonality is modelled with the help of Fourier series with different periods
  • External regressors in the form of Fourier terms are added to an ARIMA model to account for the seasonal behavior
  • The Akaike Information Criterion (AIC) is used to find the best-fitting model
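
The steps above can be sketched roughly as follows. To keep the example self-contained, the ARIMA model is swapped for a plain least-squares regression on the Fourier terms, so this only illustrates how the regressors are built and how AIC compares candidate models; it is not the post's actual pipeline. The helper names, the two periods (24 and 168 hours), and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def fourier_terms(t, period, n_harmonics):
    """Sine/cosine pairs for harmonics 1..n_harmonics of one seasonal period."""
    cols = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)


def design_matrix(t, periods, n_harmonics):
    """Intercept plus Fourier terms for every seasonal period."""
    blocks = [np.ones((len(t), 1))]
    blocks += [fourier_terms(t, p, n_harmonics) for p in periods]
    return np.hstack(blocks)


def gaussian_aic(y, X):
    """AIC (up to an additive constant) of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    rss = float(resid @ resid)
    k = X.shape[1] + 1  # regression coefficients plus the noise variance
    return n * np.log(rss / n) + 2 * k


# Synthetic hourly series with a daily (24 h) and a weekly (168 h) cycle.
rng = np.random.default_rng(0)
t = np.arange(24 * 7 * 4)  # four weeks of hourly observations
y = (10.0
     + 3.0 * np.sin(2 * np.pi * t / 24)
     + 2.0 * np.cos(2 * np.pi * 2 * t / 24)   # second daily harmonic
     + 4.0 * np.sin(2 * np.pi * t / 168)
     + rng.normal(0.0, 0.5, len(t)))

# Compare models with 1 to 4 harmonics per period and keep the lowest AIC.
periods = (24, 168)
aic_by_k = {k: gaussian_aic(y, design_matrix(t, periods, k)) for k in range(1, 5)}
best_k = min(aic_by_k, key=aic_by_k.get)
print(aic_by_k, best_k)
```

In a fuller treatment the same design matrix would be passed as external regressors to an ARIMA fit, with AIC comparing both the number of harmonics and the ARIMA orders.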

Avoiding a common mistake with time series

A case where the trend masks the rest of the series, producing high correlations.

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

If you work with data, throughout your career you’ll probably have to re-learn it several times. But you often see the principle demonstrated with a graph like this:

Dow Jones vs. Jennifer Lawrence

One line is something like a stock market index, and the other is an (almost certainly) unrelated time series like “Number of times Jennifer Lawrence is mentioned in the media.” The lines look amusingly similar. There is usually a statement like: “Correlation = 0.86”.  Recall that a correlation coefficient is between +1 (a perfect linear relationship) and -1 (perfectly inversely related), with zero meaning no linear relationship at all.  0.86 is a high value, demonstrating that the statistical relationship of the two time series is strong.

The correlation passes a statistical test. This is a great example of mistaking correlation for causality, right? Well, no, not really: it’s actually a time series problem analyzed poorly, and a mistake that could have been avoided. You never should have seen this correlation in the first place.

The more basic problem is that the author is comparing two trended time series. The rest of this post will explain what that means, why it’s bad, and how you can avoid it fairly simply. If any of your data involves samples taken over time, and you’re exploring relationships between the series, you’ll want to read on.
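
A small simulation makes the problem visible. Below, two series are generated independently and share nothing except an upward drift, yet their correlation is high; after first differencing, which is one common way to remove a trend (the post may use a different one), the correlation is close to zero. The slopes, noise levels, and seed are arbitrary choices for this sketch.

```python
import numpy as np


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(42)
n = 500
t = np.arange(n)

# Two unrelated series: each is its own linear trend plus independent noise.
a = 0.05 * t + rng.normal(0.0, 1.0, n)
b = 0.03 * t + rng.normal(0.0, 1.0, n)

# Correlating the raw levels mostly measures the shared trend.
corr_levels = correlation(a, b)

# Differencing removes the trend, leaving only the period-to-period changes.
corr_diffs = correlation(np.diff(a), np.diff(b))

print(f"levels: {corr_levels:.2f}  differences: {corr_diffs:.2f}")
```

The high value on the levels says only that both series go up over time, not that their movements are related.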

How and Why: Decorrelate Time Series

The problem of autocorrelation in time series.

When dealing with time series, the first step consists in isolating trends and periodicities. Once this is done, we are left with a normalized time series, and studying the auto-correlation structure is the next step, called model fitting. The purpose is to check whether the underlying data follows some well-known stochastic process with a similar auto-correlation structure, such as ARMA processes, using tools such as the Box-Jenkins method. Once a fit with a specific model is found, model parameters can be estimated and used to make predictions.

A deeper investigation consists in isolating the auto-correlations to see whether the remaining values, once decorrelated, behave like white noise, or not. If departure from white noise is found (using a few tests of randomness), then it means that the time series in question exhibits unusual patterns not explained by trends, seasonality or auto-correlations. This can be useful knowledge in some contexts such as high frequency trading, random number generation, cryptography or cyber-security. The analysis of decorrelated residuals can also help identify change points and instances of slope changes in time series, or reveal otherwise undetected outliers.
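
As a rough illustration of decorrelation, the sketch below simulates an AR(1) process, estimates its coefficient, and subtracts the predictable part so that the residuals can be checked against white noise. AR(1) is chosen only because it is the simplest case; the article's own procedure may differ, and the function names, coefficient (0.7), and sample size are assumptions made for the example.

```python
import numpy as np


def autocorr(x, lag):
    """Sample auto-correlation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


def fit_ar1(x):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1]))


def decorrelate_ar1(x):
    """Remove the fitted AR(1) dependence and return (residuals, phi)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    phi = fit_ar1(x)
    return x[1:] - phi * x[:-1], phi


# Simulate an AR(1) series with a known coefficient.
rng = np.random.default_rng(1)
n = 2000
noise = rng.normal(0.0, 1.0, n)
x = np.empty(n)
x[0] = noise[0]
for i in range(1, n):
    x[i] = 0.7 * x[i - 1] + noise[i]

residuals, phi_hat = decorrelate_ar1(x)

# For white noise, auto-correlations should mostly fall within this band.
band = 1.96 / np.sqrt(len(residuals))
print(f"phi_hat={phi_hat:.2f}  band=+/-{band:.3f}")
for lag in range(1, 6):
    print(lag, round(autocorr(x, lag), 3), round(autocorr(residuals, lag), 3))
```

If the residual auto-correlations stay inside the band, the AR(1) structure explains the dependence; values well outside it would point to patterns the model has not captured.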
