Crunching The Web
Common Crawl meets MIA – Gathering and Crunching Open Web Data.
As the largest and most diverse collection of information in human history, the Web grants us tremendous insight if we can only understand it better. For example, Web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the Web be openly accessible to anyone who desires to utilize it.
In this Data Talk, we present two projects that set out to democratize access to public Web data and provide the means of analysis to virtually anyone. Be our guest when Lisa Green and Jordan Mendelson present Common Crawl, a Web crawl made publicly accessible for further research and dissemination. In a second talk, Peter Adolphs introduces MIA, a cloud-based platform for analyzing Web-scale data sets with a toolbox of natural language processing algorithms.
This event will be held in English!
Common Crawl: Big Open Data
Lisa Green (Director, Common Crawl) and Jordan Mendelson (Chief Technologist, Common Crawl)
Lisa Green and Jordan Mendelson present their insights into the relationship between data accessibility and innovation, and discuss what open data means in general, using Common Crawl as an example. You will get an overview of the technical architecture and learn how you can access the hundreds of terabytes of open Web data. You will also hear about projects that extracted insight from the Common Crawl data, which may inspire you to incorporate Web data analysis into your own research or products.
MIA: A Cloud-Based Platform for Analyzing the Web
Peter Adolphs (Project Manager R&D, Neofonie)
The MIA platform gives small and medium-sized enterprises in particular a first opportunity to analyze the Web and other publicly available datasets at minimal cost. To this end, the MIA platform provides access to a crawl of the German-speaking Web, its own cloud-based execution platform, and the query language MiaQL for easily accessing the data that is relevant to the job at hand. Data sources of different origin, size and variety can be easily searched, linked and evaluated using a variety of algorithms. Custom data sources and algorithms can be added if the stock options are not sufficient.
The Common Crawl Foundation is a California-registered non-profit founded by Gil Elbaz with the goal of democratizing access to Web information by producing and maintaining an open repository of Web crawl data that is universally accessible and analyzable.
Further information at: http://commoncrawl.org/
MIA is a cloud-based software platform that hosts different data sources as well as algorithms that can be applied to the data. The biggest data source that we offer is a crawl of the German-speaking Web. At the moment, the crawl covers about half a billion Web pages and is constantly updated.
Further information at: http://labs.neofonie.de/Projekte/MIA
Lisa Green – Director of Common Crawl. Lisa is motivated by a strong belief in the power of open systems to drive innovation in education, arts and research. Over the last several years she has been active in the areas of Open Access publishing, Open Science, Open Data, copyright, digital rights and policy. Immediately prior to joining Common Crawl, Lisa was Chief of Staff at Creative Commons. She holds a PhD in physical chemistry from the University of California Berkeley.
Jordan Mendelson – Chief Technologist at Common Crawl. Jordan is a product-focused technologist with over 18 years of experience building tech startups. Prior to joining Common Crawl, he was Chief Architect of the music startup Napster, worked on big data analytics at LinkedIn, and was CTO of the restaurant technology startup SeatMe.
Peter Adolphs – R&D Project Manager at Neofonie. Peter studied computer science and linguistics at the Humboldt University in Berlin and worked for 6 years at the German Research Centre for Artificial Intelligence (DFKI). His work focuses on developing and improving information access solutions by applying linguistic analyses at various levels to large amounts of data, in order to arrive at application-specific semantics with minimal human effort.
Time and Place
Neofonie is a leading full-service provider for IT, Web and mobile, and a specialist in data solutions. www.neofonie.de