Lexical Distance Among the Languages of Europe
Posted by Armando Brito Mendes | Filed under Investigação Operacional, visualização
This chart shows the lexical distance — that is, the degree of overall vocabulary divergence — among the major languages of Europe.
The size of each circle represents the number of speakers for that language. Circles of the same color belong to the same language group. All the groups except for Finno-Ugric (in yellow) are in turn members of the Indo-European language family.
English is a member of the Germanic group (blue) within the Indo-European family. But thanks to 1066, William of Normandy, and all that, about 75% of the modern English vocabulary comes from French and Latin (i.e., the Romance languages, in orange) rather than from Germanic sources. As a result, English (a Germanic language) and French (a Romance language) are actually closer to each other in lexical terms than Romanian (a Romance language) and French.
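The post does not say how the chart's distances were computed. As a rough illustration only, one simple proxy for lexical distance is the mean normalized edit distance (Levenshtein distance) between translations of the same word list. The function names and the four-word list below are assumptions made for this sketch, not the chart's method or data.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def lexical_distance(words_a: List[str], words_b: List[str]) -> float:
    """Mean normalized edit distance over two aligned word lists.

    0.0 means every word pair is identical; 1.0 means no pair shares anything.
    """
    if not words_a or len(words_a) != len(words_b):
        raise ValueError("word lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(words_a, words_b):
        longest = max(len(a), len(b), 1)  # guard against empty strings
        total += levenshtein(a, b) / longest
    return total / len(words_a)


# Illustrative word list (an assumption of this sketch): water, night, mother, three.
WORDS = {
    "english": ["water", "night", "mother", "three"],
    "german": ["wasser", "nacht", "mutter", "drei"],
    "french": ["eau", "nuit", "mère", "trois"],
}

for other in ("german", "french"):
    distance = lexical_distance(WORDS["english"], WORDS[other])
    print(f"english vs {other}: {distance:.2f}")
```

A list this short, made of everyday words, leans on the Germanic core of English, so it puts English nearer to German than to French. A whole-vocabulary comparison like the one the chart describes would need a far larger and more representative word list.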
Tags: ARS\SNA applicações, belo, data mining, grafos
Apache Spark
Posted by Armando Brito Mendes | Filed under materiais para profissionais, software
What is Apache Spark?
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
What can it do?
Spark was initially developed for two applications where placing data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce. However, you can use Spark for general data processing too. Check out our example jobs.
Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
While Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.
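To see why keeping data in memory helps iterative algorithms, here is a minimal single-machine sketch in plain Python. It does not use Spark or its API: `ToyDataset`, `load_points` and `iterative_mean` are hypothetical names invented for the illustration, and the "expensive read" is only simulated by a counter. The idea it mimics is that an iterative job scans the same dataset on every pass, so caching it after the first read avoids reloading it each time.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for a dataset read from slow storage (not Spark's API)."""

    def __init__(self, loader: Callable[[], Iterable[float]]) -> None:
        self._loader = loader                        # simulates an expensive read
        self._cached: Optional[List[float]] = None
        self._use_cache = False
        self.loads = 0                               # times the source was read

    def cache(self) -> "ToyDataset":
        """Request that the data be kept in memory after the first read."""
        self._use_cache = True
        return self

    def collect(self) -> List[float]:
        """Return the data, reading from the source unless a cached copy exists."""
        if self._cached is not None:
            return self._cached
        self.loads += 1
        data = list(self._loader())
        if self._use_cache:
            self._cached = data
        return data


def iterative_mean(dataset: ToyDataset, steps: int = 20, rate: float = 0.5) -> float:
    """Gradient descent on squared error; each step scans the whole dataset."""
    estimate = 0.0
    for _ in range(steps):
        points = dataset.collect()
        gradient = sum(estimate - x for x in points) / len(points)
        estimate -= rate * gradient
    return estimate


def load_points() -> List[float]:
    # Stand-in for reading a large file from disk.
    return [1.0, 2.0, 3.0, 4.0]


uncached = ToyDataset(load_points)
cached = ToyDataset(load_points).cache()

result_uncached = iterative_mean(uncached)
result_cached = iterative_mean(cached)

print("reads without cache:", uncached.loads)
print("reads with cache:", cached.loads)
```

Both runs reach the same estimate. Only the number of reads differs, and on real data that is where the time goes.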
Tags: análise de dados, big data, data mining, DW \ BI
Century of rock history
Posted by Armando Brito Mendes | Filed under visualização
Jessica Edmondson visualized the history of rock music, from foundations in the pre-1900s to a boom in the 1960s and finally to what we have now. Nodes represent music styles, and edges represent musical connections. There are a lot of them and as a whole it’s a screen of spaghetti, but it’s animated, which is key. It starts at the beginning and develops over time, so you know where to go and what to look at. Music samples for each genre are also a nice touch. [Thanks, Jessica]
Tags: ARS\SNA applicações, ARS\SNA intro, belo, captura de conhecimento, grafos
introducing R to a non-programmer in one hour
Posted by Armando Brito Mendes | Filed under estatística, materiais ensino, software
Biostatistics PhD candidate Alyssa Frazee was tasked with teaching her sister, an undergraduate in sociology, how to use R. She had only one hour.
Once you load in a dataset, things start to get fun. We learned a whole bunch of stuff from this data frame, like how to do basic tabulations and calculate summary statistics, how to figure out if you have missing data, and how to fit a simple linear model. This part was pretty fun because my sister started leading the session: instead of me saying “I’m going to show you how to do this,” it was her asking “Hey, could we make a scatterplot?” or “Do you think we could put the best-fit line on that plot?” I was really glad this happened — I hope it meant she was engaged and enjoying herself!
This is the nice thing about R. There are so many built-in functions and packages that you can get something useful with a few lines of code, and you don’t really even have to know what a function is to get started (although you should eventually). Then you can go as far down the rabbit hole as you want.
Tags: análise de dados, bioinformatica, Estat Descritiva, R-software, software estatístico
What is Apache Mahout?
Posted by Armando Brito Mendes | Filed under software
The goal of the Apache Mahout™ project is to build scalable machine learning libraries.
Mahout currently has
- User and Item based recommenders
- Matrix factorization based recommenders
- K-Means, Fuzzy K-Means clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Logistic regression based classifier
- Complementary Naive Bayes classifier
- Random forest decision tree based classifier
- High performance Java collections (previously Colt collections)
- A vibrant community
By scalable we mean:
Scalable to reasonably large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to give good performance for non-distributed algorithms too.
Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.
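As a rough picture of what "clustering on top of map/reduce" means, here is a minimal single-process sketch of K-Means written as a map step and a reduce step. It is not Mahout's code or API and does not use Hadoop. The function names, the 2-D points and the fixed iteration count are all assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def map_step(point: Point, centroids: Sequence[Point]) -> Tuple[int, Point]:
    """Map: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, point


def reduce_step(members: List[Point]) -> Point:
    """Reduce: average all points that were assigned to one centroid."""
    count = len(members)
    return (sum(p[0] for p in members) / count,
            sum(p[1] for p in members) / count)


def kmeans(points: Sequence[Point], initial: Sequence[Point],
           iterations: int = 10) -> List[Point]:
    centroids = list(initial)
    for _ in range(iterations):
        groups: Dict[int, List[Point]] = defaultdict(list)
        for point in points:
            # On a cluster these map calls would run in parallel on data splits.
            key, value = map_step(point, centroids)
            groups[key].append(value)  # "shuffle": group emitted values by key
        for key, members in groups.items():
            centroids[key] = reduce_step(members)
        # A centroid that attracted no points simply keeps its old position.
    return centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = kmeans(points, initial=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Each pass is one map/reduce job. The map phase needs only the current centroids and one point at a time, which is what lets the work be spread across many machines.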
Tags: big data, data mining, DW \ BI
The Age of Data
Posted by Armando Brito Mendes | Filed under estatística, videos
Whiteboards
The Age of Data
Actian Big Data Analytics Platform
Actian DataCloud Platform
Big Data Analytics
Creating Value from Big Data and Hadoop
A New World for Analytics
The Need for an Analytic Platform
Seamless Integration
Analytic Offload
Creating Business Value with Analytics
Tags: big data, data mining, DW \ BI, Estat Descritiva
Big Data or Pig Data?
Posted by Armando Brito Mendes | Filed under materiais para profissionais
(A fable on huge amounts of data and why we don’t need models)
There was a pig who wanted to be a scientist. He was not interested in models. When asked how he planned on making sense of the world, the pig would say in a deep mysterious voice, “I don’t do models: the world is my model” and then with a twinkle in his eyes, look at his interlocutor smugly.
By his phrase, “I don’t do models, the world is my model”, he meant that the world’s data was enough for him, the pig scientist. The more data he had, the pig declared, the more accurately he would be able to predict what might happen in the world.
Tags: big data, data mining, DW \ BI
Brainstorm
Posted by Armando Brito Mendes | Filed under materiais para profissionais, planeamento
Brainstorm, or Brainstorming, literally means “storm of ideas”. In Brazil it is sometimes jokingly called “toró de parpites”. It is a creative technique for generating ideas and solutions. Because it is so simple, it is often applied badly, as if it were just a casual chat. Here at Blogtek we will look at some techniques for finding solutions to problems.
Brainstorm – definition and applications
Brainstorm – principles
Brainstorm – rules
Brainstorm – stages
Tags: captura de conhecimento, decisao em grupo, gestão de projetos
The Field Guide to Data Science
Posted by Armando Brito Mendes | Filed under materiais para profissionais
Data Science is the competitive advantage of the future for organizations interested in turning their data into a product through analytics. Industries from health, to national security, to finance, to energy can be improved by creating better data analytics through Data Science. The winners and the losers in the emerging data economy are going to be determined by their Data Science teams.
Booz Allen Hamilton created The Field Guide to Data Science to help organizations of all types and missions understand how to make use of data as a resource. The text spells out what Data Science is and why it matters to organizations as well as how to create Data Science teams. Along the way, our team of experts provides field-tested approaches, personal tips and tricks, and real-life case studies. Senior leaders will walk away with a deeper understanding of the concepts at the heart of Data Science. Practitioners will add to their toolboxes.
Tags: big data, captura de conhecimento, data mining, DW \ BI
Posted by Armando Brito Mendes | Filed under estatística, materiais para profissionais
Alex Reinhart, a PhD statistics student at Carnegie Mellon University, covers some of the common analysis mistakes in Statistics Done Wrong.
Statistics Done Wrong is a guide to the most popular statistical errors and slip-ups committed by scientists every day, in the lab and in peer-reviewed journals. Many of the errors are prevalent in vast swathes of the published literature, casting doubt on the findings of thousands of papers. Statistics Done Wrong assumes no prior knowledge of statistics, so you can read it before your first statistics course or after thirty years of scientific practice.
The text is available for free online, and there’s a physical book version on the way.
Tags: análise de dados, data mining, decisão médica, inferência