Learn R interactively with the swirl package
Posted by Armando Brito Mendes | Filed under estatística, materiais ensino, software
swirl is a software package for the R statistical programming language. Its purpose is to teach users statistics and R simultaneously and interactively.
Tags: data mining, desenvolvimento de software, R-software, software estatístico
How R came to be
Posted by Armando Brito Mendes | Filed under estatística, software, videos
Statistician John Chambers, the creator of S and a member of the R core team, talks about how R came to be in the short video below. Warning: Super nerdy waters ahead.
Tags: desenvolvimento de software, R-software, software estatístico
Introducing R to a non-programmer in one hour
Posted by Armando Brito Mendes | Filed under estatística, materiais ensino, software
Biostatistics PhD candidate Alyssa Frazee was tasked with teaching her sister, an undergraduate in sociology, how to use R. She had only one hour.
Once you load in a dataset, things start to get fun. We learned a whole bunch of stuff from this data frame, like how to do basic tabulations and calculate summary statistics, how to figure out if you have missing data, and how to fit a simple linear model. This part was pretty fun because my sister started leading the session: instead of me saying “I’m going to show you how to do this,” it was her asking “Hey, could we make a scatterplot?” or “Do you think we could put the best-fit line on that plot?” I was really glad this happened — I hope it meant she was engaged and enjoying herself!
This is the nice thing about R. There are so many built-in functions and packages that you can get something useful with a few lines of code, and you don’t really even have to know what a function is to get started (although you should eventually). Then you can go as far down the rabbit hole as you want.
Tags: análise de dados, bioinformatica, Estat Descritiva, R-software, software estatístico
The Age of Data
Posted by Armando Brito Mendes | Filed under estatística, videos
Whiteboards
The Age of Data
Actian Big Data Analytics Platform
Actian DataCloud Platform
Big Data Analytics
Creating Value from Big Data and Hadoop
A New World for Analytics
The Need for an Analytic Platform
Seamless Integration
Analytic Offload
Creating Business Value with Analytics
Tags: big data, data mining, DW \ BI, Estat Descritiva
Posted by Armando Brito Mendes | Filed under estatística, materiais para profissionais
Alex Reinhart, a PhD statistics student at Carnegie Mellon University, covers some of the common analysis mistakes in Statistics Done Wrong.
Statistics Done Wrong is a guide to the most popular statistical errors and slip-ups committed by scientists every day, in the lab and in peer-reviewed journals. Many of the errors are prevalent in vast swathes of the published literature, casting doubt on the findings of thousands of papers. Statistics Done Wrong assumes no prior knowledge of statistics, so you can read it before your first statistics course or after thirty years of scientific practice.
The text is available for free online, and there’s a physical book version on the way.
Tags: análise de dados, data mining, decisão médica, inferência
Probability and Monte Carlo methods
Posted by Armando Brito Mendes | Filed under estatística, Habilitações Académicas, matemática, materiais ensino
This is a lecture post for my students in the CUNY MS Data Analytics program. In this series of lectures I discuss mathematical concepts from different perspectives. The goal is to ask questions and challenge standard ways of thinking about what are generally considered basic concepts. I also emphasize using programming to help gain insight into mathematics. Consequently these lectures will not always be as rigorous as they could be.
Tags (source post): monte carlo, numerical integration, probability, simulation
Tags: Estat Descritiva, R-software, software estatístico
Machine Learning MOOC
Posted by Armando Brito Mendes | Filed under estatística, materiais ensino, videos
About the Course
Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.
This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.
FAQ
- What is the format of the class? The class will consist of lecture videos, which are broken into small chunks, usually between eight and twelve minutes each. Some of these may contain integrated quiz questions. There will also be standalone quizzes that are not part of video lectures, and programming assignments.
- How much programming background is needed for the course? The course includes programming assignments and some programming background will be helpful.
- Do I need to buy a textbook for the course? No, it is self-contained.
- Will I get a statement of accomplishment after completing this class? Yes. Students who successfully complete the class will receive a statement of accomplishment signed by the instructor.
Tags: big data, bioinformatica, captura de conhecimento, data mining, DW \ BI
Why Predictive Modelers Should be Suspicious of Statistical Tests
Posted by Armando Brito Mendes | Filed under estatística
Well, the danger is really not the statistical test per se; it is the interpretation of the statistical test.
Yesterday I tweeted (@deanabb) this fun factoid: “Redskins predict Romney wins POTUS #overfit. if Redskins lose home game before election => challenger wins (17/18) http://www.usatoday.com/story/gameon/2012/11/04/nfl-redskins-rule-romney/1681023/” I frankly had never heard of this “rule” before and found it quite striking. It even has its own Wikipedia page (http://en.wikipedia.org/wiki/Redskins_Rule).
For those of us in the predictive analytics or data mining community, and those of us who use statistical tests to help interpret small data, 17/18 is, as we know, a hugely significant finding. This can frequently be good: statistical tests help us gain intuition about the value of relationships in data even when they aren’t obvious.
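To put a number on the "17/18" claim, here is a minimal sketch of an exact one-sided binomial test in Python. It assumes, purely for illustration, that under the null hypothesis the rule is right with probability 0.5 in each of 18 independent elections; the helper name `binomial_tail` is hypothetical and not taken from the original post.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumption: a coin-flip null (p = 0.5) and independent elections.
p_value = binomial_tail(17, 18)
print(f"P(at least 17 of 18 correct by chance) = {p_value:.6f}")
```

Under these assumptions the tail probability is 19/262144, far below any conventional threshold. The test itself does not account for how many candidate "rules" could have been examined before this one was noticed, which is where interpretation comes in.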
Tags: data mining, IBM SPSS Statistics, inferência, software estatístico
4 Faces of Big Data
Posted by Armando Brito Mendes | Filed under estatística
The 4 Faces of Big Data Challenges You Just Can’t Ignore
Tags: análise de dados, big data, data mining
Paddy – design a multi-stage survey
Posted by Armando Brito Mendes | Filed under estatística, materiais ensino, software
This game is a rice survey based on an actual survey carried out in Sri Lanka. In a small district there are 10 villages with a total of 160 farmers who each have one field in which to grow rice. A census of the area has been undertaken and the acreage cultivated by each farmer is known. There is now to be a crop-cutting survey whose main aim is to estimate the mean yield of rice per acre and hence the total production of rice in the district. The survey will also be used to investigate the use of fertilisers and the different varieties of rice used in the district.
The resources available allow for 30 plots to be sampled. The plots to be harvested are 1/80 acre but the yields are recorded in bushels per acre. Students use a multistage sampling scheme. For example:
- Select x villages
- From each village choose y fields
- Select z plots from each field
The game consists of 10 boxes, each containing a number of envelopes, which themselves contain a number of slips of paper. Each box represents a village, so students select the boxes corresponding to their chosen villages. They open the boxes and select the envelopes labelled with their chosen field number. Information on the size of the field, the variety of rice used and the amount of fertiliser applied is also displayed on the envelope label. Finally, they select the slip of paper labelled with their chosen plot number and record the yield.
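The village, field and plot selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 160 fields are taken to be spread evenly over the 10 villages (16 each), each field is given a hypothetical 20 plots, and simple random sampling is used at every stage. None of these details come from the game itself, and the function name `multistage_sample` is made up for this example.

```python
import random

N_VILLAGES = 10          # from the game description
FIELDS_PER_VILLAGE = 16  # assumption: 160 fields spread evenly over 10 villages
PLOTS_PER_FIELD = 20     # assumption: the number of slips per envelope is not given


def multistage_sample(x, y, z, seed=None):
    """Return (village, field, plot) triples for a three-stage sample."""
    rng = random.Random(seed)
    sample = []
    # Stage 1: choose x villages (boxes).
    villages = rng.sample(range(1, N_VILLAGES + 1), x)
    for village in sorted(villages):
        # Stage 2: choose y fields (envelopes) within the village.
        fields = rng.sample(range(1, FIELDS_PER_VILLAGE + 1), y)
        for field in sorted(fields):
            # Stage 3: choose z plots (slips) within the field.
            plots = rng.sample(range(1, PLOTS_PER_FIELD + 1), z)
            for plot in sorted(plots):
                sample.append((village, field, plot))
    return sample


# One allocation that uses exactly the 30 available plots: 3 x 5 x 2.
plan = multistage_sample(x=3, y=5, z=2, seed=1)
print(len(plan), "plots selected")
for village, field, plot in plan[:5]:
    print(f"village {village}, field {field}, plot {plot}")
```

Other allocations with x * y * z = 30, such as 5 villages, 3 fields and 2 plots, fit the same budget; comparing them is the point of the exercise.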