Apache Spark

Posted by Armando Brito Mendes | Filed under materials for professionals, software

An alternative to Hadoop for in-memory data computing

What is Apache Spark?

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

What can it do?

Spark was initially developed for two applications where placing data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce. However, you can use Spark for general data processing too. Check out our example jobs.
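To illustrate why keeping data in memory helps iterative algorithms, here is a toy sketch in plain Python. It does not use Spark itself: the `DiskDataset` class, the `fit_slope` function and the data are hypothetical stand-ins, and a counter simulates the cost of re-reading the dataset from disk on every iteration.

```python
class DiskDataset:
    """Toy stand-in for a dataset that must be re-read from disk on every pass."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0  # counts simulated full scans of the on-disk data

    def scan(self):
        self.reads += 1
        return list(self._rows)


def fit_slope(load_points, iterations=20, learning_rate=0.01):
    """Fit y = w * x by gradient descent, calling load_points() once per iteration."""
    w = 0.0
    for _ in range(iterations):
        points = load_points()
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w


# Made-up data lying exactly on the line y = 3x.
data = [(x, 3.0 * x) for x in range(1, 6)]

# Disk-style: every iteration triggers a fresh scan of the dataset.
disk = DiskDataset(data)
w_disk = fit_slope(disk.scan)

# Memory-style: scan once, keep the rows in memory, and reuse them.
cached_source = DiskDataset(data)
cached_rows = cached_source.scan()
w_cached = fit_slope(lambda: cached_rows)

print(disk.reads, cached_source.reads)
```

Both runs compute the same slope, but the first scans the dataset once per iteration while the second scans it only once. That repeated-scan cost is what an in-memory engine avoids in iterative workloads.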

Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.

While Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.

Tags: data analysis, big data, data mining, DW \ BI

Read more | Comments off | January 13th, 2014

Introducing R to a non-programmer in one hour

Posted by Armando Brito Mendes | Filed under statistics, teaching materials, software

A very quick introduction

Biostatistics PhD candidate Alyssa Frazee was tasked with teaching her sister, an undergraduate in sociology, how to use R. She had only one hour. In her words:

Once you load in a dataset, things start to get fun. We learned a whole bunch of stuff from this data frame, like how to do basic tabulations and calculate summary statistics, how to figure out if you have missing data, and how to fit a simple linear model. This part was pretty fun because my sister started leading the session: instead of me saying “I’m going to show you how to do this,” it was her asking “Hey, could we make a scatterplot?” or “Do you think we could put the best-fit line on that plot?” I was really glad this happened — I hope it meant she was engaged and enjoying herself!

This is the nice thing about R. There are so many built-in functions and packages that you can get something useful with a few lines of code, and you don’t really even have to know what a function is to get started (although you should eventually). Then you can go as far down the rabbit hole as you want.

Tags: data analysis, bioinformatics, descriptive statistics, R-software, statistical software

Read more | Comments off | January 8th, 2014

What is Apache Mahout?

Posted by Armando Brito Mendes | Filed under software

One example of the many open source projects for big data

The Apache Mahout™ machine learning library’s goal is to build scalable machine learning libraries.

Mahout currently has

User and Item based recommenders
Matrix factorization based recommenders
K-Means, Fuzzy K-Means clustering
Latent Dirichlet Allocation
Singular value decomposition
Logistic regression based classifier
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High performance Java collections (previously Colt collections)
A vibrant community

By scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized so that non-distributed algorithms also perform well.

Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.
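The map/reduce paradigm mentioned above can be illustrated with one of the listed algorithms, K-Means. The following is a minimal single-process sketch in plain Python, not Mahout's or Hadoop's API: the function names `nearest`, `map_step`, `reduce_step` and `kmeans`, the sample points and the starting centroids are all hypothetical. The map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_step(points, centroids):
    """Map: emit one (cluster index, point) pair per input point."""
    return [(nearest(point, centroids), point) for point in points]


def reduce_step(pairs, centroids):
    """Reduce: average the points emitted for each cluster index."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # a cluster with no points keeps its old centroid
    for index, members in groups.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds and return the final centroids."""
    for _ in range(iterations):
        centroids = reduce_step(map_step(points, centroids), centroids)
    return centroids


# Made-up two-dimensional points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

In a distributed setting the map step would run in parallel over partitions of the data and the reduce step would combine the per-cluster results; here both run in one process to keep the idea visible.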

Tags: big data, data mining, DW \ BI

Read more | Comments off | January 8th, 2014

Data Mining with Weka MOOC

Posted by Armando Brito Mendes | Filed under academic qualifications, teaching materials, software, videos

A video course on using WEKA for data mining

Welcome to the free online course Data Mining with Weka

This 5-week MOOC introduced data mining concepts through practical experience with the free Weka tool.

The course featured:

video lectures by Professor Ian H. Witten
- available on YouTube and YouKu
- English captions on YouTube
- English & Chinese subtitles on YouKu
- CC-BY videos & slides
the open-source Weka data mining platform
access to chapters from Data Mining (3rd Edition)
- discounts from Morgan Kaufmann
online assessment leading to a statement of completion

The course will run again in early March 2014. To get notified about dates (enrolment, commencement), please subscribe to the announcement forum.

You can access the course material (videos, slides, etc) from here.

Tags: data mining, statistical software, WEKA

Read more | Comments off | December 18th, 2013

Paddy – design a multi-stage survey

Posted by Armando Brito Mendes | Filed under statistics, teaching materials, software

A serious game for survey design

This game is a rice survey based on an actual survey carried out in Sri Lanka. In a small district there are 10 villages with a total of 160 farmers who each have one field in which to grow rice. A census of the area has been undertaken and the acreage cultivated by each farmer is known. There is now to be a crop-cutting survey whose main aim is to estimate the mean yield of rice per acre and hence the total production of rice in the district. The survey will also be used to investigate the use of fertilisers and the different varieties of rice used in the district.

The resources available allow for 30 plots to be sampled. The plots to be harvested are 1/80 acre but the yields are recorded in bushels per acre. Students use a multistage sampling scheme. For example:

Select x villages
From each village choose y fields
Select z plots from each field
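The three-stage scheme above can be sketched in Python. This is only an illustration under stated assumptions: the excerpt does not say how the 160 fields are split across villages or how many plots each field has, so the district below (16 fields per village, 8 plots per field, random yields) is entirely made up, as are the function name `multistage_sample` and the choice x = 5, y = 3, z = 2.

```python
import random


def multistage_sample(district, x, y, z, rng):
    """Select x villages, then y fields per village, then z plots per field."""
    sample = []
    for village in rng.sample(sorted(district), x):
        fields = district[village]
        for field in rng.sample(sorted(fields), y):
            plots = fields[field]
            for plot in rng.sample(range(len(plots)), z):
                sample.append((village, field, plot, plots[plot]))
    return sample


rng = random.Random(2013)

# Hypothetical district: 10 villages x 16 fields (160 farmers, one field each),
# 8 candidate plots per field, with made-up yields in bushels per acre.
district = {
    v: {f: [round(rng.uniform(20, 60), 1) for _ in range(8)] for f in range(1, 17)}
    for v in range(1, 11)
}

x, y, z = 5, 3, 2  # 5 * 3 * 2 = 30 plots, matching the survey budget
sample = multistage_sample(district, x, y, z, rng)

# Unweighted mean of the sampled yields. It ignores field acreage, which a
# real analysis of this survey would need to take into account.
mean_yield = sum(row[3] for row in sample) / len(sample)
print(len(sample), round(mean_yield, 1))
```

Any combination with x * y * z = 30 uses the full budget; trying several combinations on the same district shows how the choice of stages trades coverage of villages against detail within each field.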

The game consists of 10 boxes, each containing a number of envelopes, which themselves contain a number of slips of paper. Each box represents a village, so students select the boxes corresponding to their chosen villages. They open the boxes and select the envelopes labelled with their chosen field number. Information on the size of the field, the variety of rice used and the amount of fertiliser applied is also displayed on the envelope label. Finally, they select the slip of paper labelled with their chosen plot number and record the yield.

Tags: descriptive statistics, surveys, statistical software

Read more | Comments off | September 26th, 2013

To the Woods – a detailed comparison of Sampling methods

Posted by Armando Brito Mendes | Filed under statistics, teaching materials, software

A simulation for learning simple and stratified sampling

To the Woods – a detailed comparison of Simple Random Sampling and Stratified Sampling

In this game the aim is to conduct a small survey to estimate the total number of trees in a forest and the proportion of large trees. A tree is considered ‘large’ if its diameter at breast height (DBH) is greater than 30cm. The area of forest from which the sample is to be taken is divided into two regions (‘East’ and ‘West’) by a river. Within each region it is possible to count the number of trees in any 50m x 50m plot. There are 168 plots in total – 96 to the West of the river and 72 to the East.

There are two alternative sampling solutions. Students take a sample of 14 plots and can either use simple random sampling or stratified sampling to choose them. They record the number of small trees, the number of large trees and the total number of trees for each of the 14 observations.
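The two sampling options can be compared with a short Python sketch. The tree counts below are made up, the helper names `make_plots` and `estimate` are hypothetical, and the stratified sample assumes proportional allocation (8 West plots and 6 East plots), which the excerpt does not specify.

```python
import random

rng = random.Random(42)


def make_plots(count, low, high):
    """Hypothetical plots as (small_trees, large_trees); the counts are made up."""
    return [(rng.randint(low, high), rng.randint(0, 5)) for _ in range(count)]


forest = {"West": make_plots(96, 10, 30), "East": make_plots(72, 30, 60)}
all_plots = forest["West"] + forest["East"]


def estimate(strata):
    """strata: list of (stratum_size, sampled_plots).

    Returns (estimated total trees, estimated proportion of large trees).
    """
    total = large = 0.0
    for size, plots in strata:
        scale = size / len(plots)  # each sampled plot stands for this many plots
        total += scale * sum(small + big for small, big in plots)
        large += scale * sum(big for _, big in plots)
    return total, large / total


# Simple random sampling: 14 plots from all 168, treated as a single stratum.
srs = estimate([(len(all_plots), rng.sample(all_plots, 14))])

# Stratified sampling with proportional allocation (an assumption):
# 14 * 96/168 = 8 West plots and 14 * 72/168 = 6 East plots.
stratified = estimate([
    (96, rng.sample(forest["West"], 8)),
    (72, rng.sample(forest["East"], 6)),
])

true_total = sum(small + big for small, big in all_plots)
print(true_total, srs, stratified)
```

Repeating the draw many times with different seeds and comparing the spread of the two estimates around `true_total` is one way to see when stratification pays off, namely when the two regions differ noticeably in tree density.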

The game consists of 168 small pieces of card, which represent the plots, slipped into slits in a large piece of card representing the forest. A river can be drawn on the large piece of card to divide the forest into two regions. One side is labelled ‘West’ and the other ‘East’. The protruding sections of the plots are labelled with their region (West or East) and plot number (1 to 96 and 1 to 72, respectively). The student pulls out the chosen plots and records the numbers of large and small trees, which are printed on the lower section of each plot.

Tags: descriptive statistics, statistical software

Read more | Comments off | September 26th, 2013

Excel: Hide duplicate values

Posted by Armando Brito Mendes | Filed under materials for professionals, software

A tip for hiding repeated values using conditional formatting

Duplicate values aren’t wrong or bad, but they can be distracting. You probably won’t want to delete them, but you might just want to hide them. For example, the simple sheet below repeats date values in column A. In a reporting scenario, those duplicate dates might be annoying.

Tags: Excel, spreadsheet programming

Read more | Comments off | August 29th, 2013

Rattle: A Graphical User Interface for Data Mining using R

Posted by Armando Brito Mendes | Filed under statistics, teaching materials, materials for professionals, software

Rattle (the R Analytical Tool To Learn Easily) presents statistical and visual summaries of data, transforms data into forms that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

Tags: data mining, R-software, software estatístico, text mining

Read more | Comments off | August 1st, 2013

Reddit Data Is Beautiful

Posted by Armando Brito Mendes | Filed under statistics, software, visualization

A blog about visualization and R

Data is Beautiful

A place for visual representations of data: Graphs, charts, maps, etc.

Best of 2012 Results

Rules

A post must be a data visualization.
Link to original authors or tag as [OC] if you made it.
Questions must include a visualization.
Infographics belong in /r/infographics

Infographic vs. Visualization? Data from Star Trek? Data ARE? How do I make one? Read the FAQ

Tags: data analysis, beautiful, IBM SPSS Statistics, R-software, statistical software

Read more | Comments off | July 12th, 2013

Data-Driven Documents D3.js

Posted by Armando Brito Mendes | Filed under materials for professionals, software, visualization

A JavaScript library for building data visualizations

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

Download the latest version here:

d3.v3.zip

Or, to link directly to the latest release, copy this snippet:

<script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>

Armando B. Mendes
