Spreadsheet Addiction
Posted by Armando Brito Mendes | Filed under statistics, materials for professionals
A good and very thorough account of MS Excel's shortcomings for data analysis.
Some people will think that the “addiction” in the title is over the top, or at least used metaphorically. It is used literally, and is not an exaggeration.
Addiction is the persistent use of a substance where that use is detrimental to the user. It is not the substance that is the problem — more limited use may be beneficial. It is the extent and circumstances of the use that determine if the behavior is addictive or not.
Spreadsheets are a wonderful invention. They are an excellent tool for what they are good at. The problem is that they are often stretched far beyond their home territory. Dangerous abuse of spreadsheets is only too common.
I know there are many spreadsheets in financial companies that take all night to compute. These are complicated and commonly fail. When such spreadsheets are replaced by code more suited to the task, it is not unusual for the computation time to be cut to a few minutes and for the process to become much easier to understand.
A 2012 example of spreadsheet addiction.
The technology acceptance model holds that there are two main factors that determine the uptake of a technology: the perceived usefulness and the perceived ease-of-use. Perception need not correspond to reality.
The perception of the ease-of-use of spreadsheets is to some extent an illusion. It is dead easy to get an answer from a spreadsheet; however, it is not necessarily easy to get the right answer. Thus the distorted view.
The difficulty of using alternatives to spreadsheets is overestimated by many people. Safety features can give the appearance of difficulty when in fact these are an aid.
The hard way looks easy, the easy way looks hard.
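As a minimal sketch of the point about safety features, the Python below contrasts a spreadsheet-style sum that silently skips a text cell with a strict sum that refuses to continue. The names `lenient_sum` and `strict_sum` and the sample column are hypothetical, made up for illustration. The lenient version only imitates how a spreadsheet SUM over a range typically ignores text cells; it is not taken from the original article.

```python
def lenient_sum(cells):
    """Imitate a spreadsheet SUM over a range: non-numeric cells are skipped silently."""
    total = 0.0
    for cell in cells:
        if isinstance(cell, (int, float)) and not isinstance(cell, bool):
            total += cell
    return total


def strict_sum(cells):
    """Refuse to produce a total if any cell is not a number."""
    total = 0.0
    for position, cell in enumerate(cells):
        if isinstance(cell, bool) or not isinstance(cell, (int, float)):
            raise TypeError(f"cell {position} holds {cell!r}, not a number")
        total += cell
    return total


# Hypothetical column: the third entry was typed as text by mistake.
column = [120.0, 80.0, "95.5", 40.0]

# The lenient sum returns an answer, but it leaves out the text cell.
print(lenient_sum(column))

# The strict sum stops and says which cell is wrong.
try:
    strict_sum(column)
except TypeError as error:
    print("strict_sum stopped:", error)
```

The strict version looks harder to use because it can fail, yet the failure is the aid: it points at the bad cell instead of returning a plausible but wrong total.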
The remainder of this page is divided into the sections:
Spreadsheet Computation
The Treatment Center (Alternatives)
If You Must Persist
Specific Problems with Excel
Additional Links
Tags: data analysis, Excel, spreadsheet programming
Using Open Source Technology in Higher Education
Posted by Armando Brito Mendes | Filed under statistics, software
Using R for Basic Cross Tabulation Analysis: Part Three, Using the xtabs Function
crosstabs, r, r programming, r statistics, table analysis
Using R to Work with GSS Survey Data: Cross Tabulation Tables
chi squared, cross tables, crosstabs, r, r programming, r statistics, table analysis
R Tutorial: Using R to Work With Datasets From the NORC General Social Science Survey
create csv file, file conversion, r, r programming, r statistics, r tutorial, read spss files, research
How to Set Up SSH to Remotely Control Your Raspberry Pi
command line, raspberry pi, raspberry pi computing, Raspberry Pi Software Configuration, remote access with ssh, set up ssh, ssh, terminal program
Tags: data analysis, data mining, software development, descriptive statistics, R-software, statistical software
Income inequality seen in satellite images from Google Earth
Posted by Armando Brito Mendes | Filed under statistics, visualization
Researchers Pengyu Zhu and Yaoqi Zhang noted in their 2008 paper that “the demand for urban forests is elastic with respect to price and highly responsive to changes in income.” Poor neighborhoods tend to have fewer trees, and their tree cover grows more slowly than that of richer neighborhoods.
Tim De Chant of Per Square Mile wondered if this difference could be seen through satellite images in Google Earth. It turns out that you can see the distinct difference in a lot of places. The example above shows two areas in Rio de Janeiro: Rocinha on the left and Zona Sul on the right. Notice the tree-lined streets versus the not so green.
De Chant notes:
It’s easy to see trees as a luxury when a city can barely keep its roads and sewers in working order, but that glosses over the many benefits urban trees provide. They shade houses in the summer, reducing cooling bills. They scrub the air of pollution, especially of the particulate variety, which in many poor neighborhoods is responsible for increased asthma rates and other health problems. They also reduce stress, which has its own health benefits. Large, established trees can even fight crime.
Okay, I don’t know about that last part about fighting crime. Without seeing the data, I think that sounds like a correlation more than anything else, but still. Trees. Good.
Tags: data analysis, data mining, image mining, maps
GE.com's data visualization site
Posted by Armando Brito Mendes | Filed under statistics, visualization
GE Works. Building, Moving, Powering and Curing the world. In the process, our technologies are generating data on a petabyte scale. This data contains valuable information that will drive insights, innovations, and discoveries, but it can be difficult to access and digest. Using data visualization, we’re pairing science and design to simplify the complexity and drive a deeper understanding of the context in which we operate.
We encourage you to explore the projects below.
For further information about GE’s data visualization program, please contact us at datavizinfo@ge.com
To share your own visualizations, please visit www.visualizing.org
Tags: data analysis, beautiful, data mining, descriptive statistics, maps
Better data centers through machine learning
Posted by Armando Brito Mendes | Filed under materials for professionals
It’s no secret that we’re obsessed with saving energy. For over a decade we’ve been designing and building data centers that use half the energy of a typical data center, and we’re always looking for ways to reduce our energy use even further. In our pursuit of extreme efficiency, we’ve hit upon a new tool: machine learning. Today we’re releasing a white paper (PDF) on how we’re using neural networks to optimize data center operations and drive our energy use to new lows.
Tags: data analysis, data mining, forecasting
Errors in charts in the news
Posted by Armando Brito Mendes | Filed under statistics, visualization
Fox News bar chart gets it wrong
Because Fox News. See also this, this, and this. [Thanks, Meron]
Tags: data analysis, beautiful, data mining, descriptive statistics
Read Histograms and Use Them in R
Posted by Armando Brito Mendes | Filed under statistics, materials for professionals, visualization
How to Read Histograms and Use Them in R
The histogram is one of my favorite chart types, and for analysis purposes, it’s probably the one I use the most. Devised by Karl Pearson (the father of mathematical statistics) in the late 1800s, it’s simple geometrically, robust, and allows you to see the distribution of a dataset.
If you don’t understand what’s driving the chart though, it can be confusing, which is probably why you don’t see it often in general publications.
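To make "what’s driving the chart" concrete, here is a minimal sketch of the counting step behind a histogram: values are sorted into fixed-width bins and each bar's height is the number of values in its bin. The linked tutorial works in R, where `hist()` handles this; the sketch below is plain Python, and the function name `histogram_counts`, the bin width, and the sample ages are all hypothetical choices made for illustration.

```python
import math


def histogram_counts(values, bin_width):
    """Return (left_edge, count) pairs for fixed-width bins [left_edge, left_edge + bin_width)."""
    # Align the first bin to a multiple of the bin width at or below the minimum.
    start = math.floor(min(values) / bin_width) * bin_width
    n_bins = math.floor((max(values) - start) / bin_width) + 1
    counts = [0] * n_bins
    for value in values:
        counts[math.floor((value - start) / bin_width)] += 1
    return [(start + i * bin_width, count) for i, count in enumerate(counts)]


# Hypothetical sample, not data from the original post.
ages = [21, 22, 25, 29, 30, 31, 31, 34, 38, 47]

# Print one text bar per bin; empty bins are kept so gaps stay visible.
for left_edge, count in histogram_counts(ages, 10):
    print(f"{left_edge:>3}-{left_edge + 10:<3} {'#' * count}")
```

Changing the bin width changes the shape of the chart, which is one reason a histogram can be confusing when the binning is not stated.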
Tags: data analysis, data mining, descriptive statistics, R-software, statistical software
The SmartData Collective portal (smartdatacollective.com)
Posted by Armando Brito Mendes | Filed under materials for professionals
SmartData Collective, an online community moderated by Social Media Today, provides enterprise leaders access to the latest trends in Business Intelligence and Data Management. Our innovative model serves as a platform for recognized, global experts to share their insights through peer contributions, custom content publishing and alignment with industry leaders. SmartData Collective is a key resource for executives who need to make informed data management decisions.
Tags: data analysis, big data, bioinformatics, knowledge capture, data mining, group decision-making
17 short tutorials all data scientists should read
Posted by Armando Brito Mendes | Filed under statistics, materials for professionals
Here’s the list:
- Practical illustration of Map-Reduce (Hadoop-style), on real data
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict…
- A little known component that should be part of most data science a…
- 11 Features any database, SQL or NoSQL, should have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- Marrying computer science, statistics and domain expertize
- New pattern to predict stock prices, multiplies return by factor 5
- What Map Reduce can’t do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- Source code for our Big Data keyword correlation API
- The curse of big data
- How to detect a pattern? Problem and solution
- Interesting Data Science Application: Steganography
Related link: The Data Science Toolkit
Tags: data analysis, big data, knowledge capture, data mining, Excel, R-software
Apache Spark
Posted by Armando Brito Mendes | Filed under materials for professionals, software
What is Apache Spark?
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
What can it do?
Spark was initially developed for two applications where placing data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce. However, you can use Spark for general data processing too. Check out our example jobs.
Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
While Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.
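The claim that keeping data in memory helps iterative algorithms can be sketched without Spark at all. The plain-Python analogy below is not Spark's API: `CountingSource` and `fit_mean` are hypothetical names, and the "storage" is just a list whose reads are counted. It shows that an algorithm making many passes over the same data goes back to storage once per pass unless the data is cached after the first read.

```python
class CountingSource:
    """Stand-in for slow storage: counts how many times the data is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def fit_mean(get_rows, iterations=20, step=0.5):
    """Gradient descent on squared error; every iteration makes a full pass over the rows."""
    estimate = 0.0
    for _ in range(iterations):
        rows = get_rows()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= step * gradient
    return estimate


source = CountingSource([2.0, 4.0, 6.0, 8.0])

# Disk-style: go back to storage on every iteration.
fit_mean(source.load)
reads_without_cache = source.reads

# In-memory style: read once, then iterate over the cached copy.
source.reads = 0
cached_rows = source.load()
result = fit_mean(lambda: cached_rows)
reads_with_cache = source.reads

print("reads without cache:", reads_without_cache)
print("reads with cache:", reads_with_cache)
print("estimated mean:", round(result, 3))
```

The sketch only counts reads; it makes no claim about actual speedups, which depend on the storage, the cluster, and the workload.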
Tags: data analysis, big data, data mining, DW \ BI