As you know from our previous article, the entire idea behind machine learning is to go from data to insight. No matter what type or kind of project you plan to conduct, you always need datasets for machine learning projects. Your algorithm has to have a base to learn from and gain the necessary information. Thankfully, there are tons of datasets for machine learning projects. Many of them are available for free, from open sources.
You can obtain machine learning datasets in two ways. First, if you work for a client, they can provide you with the dataset you need. Such a dataset can consist of, for instance, a list of orders, data from Google Analytics, past financial results, and other operational data. In such a situation, the company itself is the source of the necessary dataset. Things get complicated when you need a dataset you don’t possess. That is where the second option comes in: publicly available datasets. You’d be surprised how many of them are readily available on the Internet! But before we get to that, let’s talk for a moment about what makes a dataset suitable. As it turns out, not every dataset is usable, let alone suitable for a given project.
The datasets for machine learning projects
When deciding which dataset ought to be used, follow two simple rules:
- Search for datasets with relevant information
- Search for datasets of high quality
Why is this approach crucial? There are two reasons. First, if you feed irrelevant data to your algorithm, you will receive a distorted outcome or, in many instances, no outcome at all. Second, a high-quality dataset makes efficient work possible. If the algorithm has to plough through unnecessary data instead of doing its job, the whole process will take much longer. No one wants to fight with useless information, and machine learning algorithms are no exception.
So, what does a high-quality dataset look like? First, a high-quality dataset should not be messy or padded with information you do not need. You do not want to spend a lot of time cleaning and selecting data, or deleting unnecessary columns and rows. Keep it simple and concentrate only on relevant information. Second, always have your goal in mind. You should have a question to answer or a decision to make, and the data you possess should be able to support it.
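To make the idea of keeping only relevant information more concrete, here is a minimal sketch in Python. It assumes a small, made-up CSV export (the column names and the `load_relevant_rows` helper are hypothetical, not part of any real client dataset). It keeps only the columns you name and skips rows where one of those columns is empty.

```python
import csv
import io

# Hypothetical sample standing in for a client's order export.
RAW_CSV = """order_id,amount,country,internal_note
1,19.99,PL,checked
2,,DE,
3,5.00,PL,resend
"""

def load_relevant_rows(text, relevant_columns):
    """Keep only the relevant columns and drop rows with missing values in them."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = []
    for row in reader:
        selected = {name: row[name].strip() for name in relevant_columns}
        if all(selected.values()):  # skip rows with any empty relevant field
            cleaned.append(selected)
    return cleaned

rows = load_relevant_rows(RAW_CSV, ["order_id", "amount", "country"])
print(rows)
```

In a real project you would probably reach for a dedicated data library, but the principle stays the same: decide which columns answer your question, and discard the rest before the algorithm ever sees them.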
How to find relevant datasets?
There are hundreds of datasets available, many containing hundreds of megabytes of information. Thankfully, you don’t have to search through them manually. There are three useful dataset finders and aggregators that can do that for you.
The UCI ML Repository
The UCI Machine Learning Repository is maintained by the University of California, Irvine, School of Information and Computer Science. It currently has 488 publicly available datasets, specifically for machine learning and data analysis. The repository is a collection of databases, domain theories, and data generators that the machine learning community uses for the empirical analysis of machine learning algorithms. It was created as an FTP archive in 1987 by David Aha and fellow graduate students. Since then, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning datasets. The datasets in the UCI finder are tagged with specific categories, e.g. classification or regression, to help you practice a given ML technique. The vast majority of the datasets within UCI come from the real world.
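As an illustration of what a dataset tagged as "classification" looks like in practice, here is a small sketch. It assumes the comma-separated layout used by the Iris data file hosted in the repository (four numeric measurements followed by a class label) and works on three sample rows embedded in the code rather than downloading anything. The `split_features_and_labels` helper is our own name, not part of any UCI tooling.

```python
import csv
import io

# Three rows in the comma-separated layout used by the Iris data file:
# four numeric measurements followed by the class label.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def split_features_and_labels(text):
    """Separate the numeric inputs from the label the model should predict."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # tolerate blank lines at the end of the file
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels

features, labels = split_features_and_labels(SAMPLE)
print(len(features), sorted(set(labels)))
```

For a regression dataset the last column would hold a number instead of a class name, but the split into inputs and target would look much the same.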
Google Dataset Search
In late 2018, Google launched its own dataset finder (they call it “the toolbox”) that can search for datasets by name. Even though it is still in beta, it is fully operational. Google Dataset Search lets you look for datasets available on the Internet using keywords of your choice. The search engine draws on information about datasets stored in thousands of online repositories.
Kaggle
Kaggle is another outstanding resource for machine learning datasets. Compared to UCI, it is simply enormous: it contains over 19,500 datasets! Kaggle is not merely a search engine. It is an online community of data scientists and machine learning specialists, where you can discuss data, find public code, or even create your own projects. Kaggle lets users find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. Currently, it has over 1 million users in almost 200 countries. Interestingly, Google acquired Kaggle in 2017.
Obviously, your search does not have to end with these three tools. You can explore hundreds of other datasets available on the Internet. Many of them relate to a specific discipline or industry, but the range of available datasets is much wider! Below, we examine some public datasets for machine learning. Let’s begin with government datasets.
The public and open datasets for machine learning
Many governments and international organizations share their datasets openly, and most can be downloaded free of charge directly from their websites. Here are a couple of interesting sources of open datasets for machine learning provided by governments and organizations.
Data.gov
Data.gov is a vast catalog of US datasets, managed and hosted by the US General Services Administration, Technology Transformation Service. As of June 2017, approximately 200,000 datasets were reported as the total on Data.gov, representing about 10 million data resources. That is a staggering number! You can search Data.gov through its catalog of government data. The search process is quite simple and based on keywords. The results can be narrowed down by type, tag, format, group, organization type, organization, and category.
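The combination of a keyword search with filters such as format or tag can be pictured with a short sketch. Note that this does not call Data.gov or any real API: the `CATALOG` records and the `search` function are hypothetical and exist only to show how a keyword and optional filters narrow down a list of results.

```python
# Hypothetical catalog records; real entries carry many more fields.
CATALOG = [
    {"title": "City air quality measurements", "tags": {"environment", "air"}, "format": "CSV"},
    {"title": "Air traffic delays", "tags": {"transport"}, "format": "JSON"},
    {"title": "School enrollment", "tags": {"education"}, "format": "CSV"},
]

def search(catalog, keyword, data_format=None, tag=None):
    """Return titles matching the keyword, optionally narrowed by format and tag."""
    keyword = keyword.lower()
    results = []
    for record in catalog:
        if keyword not in record["title"].lower():
            continue
        if data_format is not None and record["format"] != data_format:
            continue
        if tag is not None and tag not in record["tags"]:
            continue
        results.append(record["title"])
    return results

print(search(CATALOG, "air"))
print(search(CATALOG, "air", data_format="CSV"))
```

The first call returns both titles containing "air", and the second keeps only the one published as CSV, which is roughly what happens when you tick a format filter next to the search results.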
EU Open Data Portal
The EU Open Data Portal is another fascinating catalog comprising thousands of datasets for machine learning projects. Interestingly, all data within this catalog is free to use and reuse, even for commercial purposes! You can browse datasets by subject or group. There are nine primary subjects: agriculture, fisheries, forestry and food; economy and finance; education, culture and sport; energy; environment; government and public sector; health; international issues; and justice, legal system, and public safety. A similar portal is Eurostat (ec.europa.eu/Eurostat), the statistical office of the European Union. Eurostat’s key task is to “provide the European Union with statistics at European level that enable comparisons between countries and regions.” Here, you can also browse databases by theme, for instance: economy and finance; population and social conditions; industry, trade, and services; science and technology; or international trade.
These are the two most interesting sources of government datasets. Apart from them, you can also find dataset catalogs from the following countries:
- New Zealand: data.govt.nz
- India: data.gov.in
- Northern Ireland: opendatani.gov.uk
The datasets for machine translation
Machine translation is another interesting application of machine learning datasets. Machine translation (MT) applications work on a basis similar to machine learning. MT software translates a text from the source language into the target language on its own. To make that possible, these applications rely on massive databases containing hundreds of millions of words, phrases, and expressions. With this source available, an MT application can “decide” which translation is the most appropriate and accurate.
One of the most modern and complex MT systems in the world is Microsoft Translator. Initially, it was based entirely on the Statistical Machine Translation (SMT) method, which means that it searched for the most probable translation, that is, statistically the one most frequently found in its database. This is a good place for a small digression: what exactly is in such a database? The simplest example is the entire database of Wikipedia entries. And how much is that? As it turns out, 27 billion words in 40 million articles in 293 languages.
Currently, however, Microsoft is focusing on a technology called Neural Machine Translation (NMT). This does not mean that SMT has been abandoned. The new technology is designed to complement it and thus significantly increase the accuracy of the translation. NMT has been in development since 2016. A huge database is still required, but the new system translates not only on the basis of how frequently a given translation occurs but also on the context in which it appears in the text.
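The idea of picking the statistically most frequent translation can be shown with a toy sketch. This is not how Microsoft Translator is implemented; production SMT systems combine several statistical models. The phrase pairs below are made up, and `build_phrase_table` and `most_probable_translation` are hypothetical helper names used only for illustration.

```python
from collections import Counter

# Hypothetical phrase pairs standing in for a large parallel corpus.
PHRASE_PAIRS = [
    ("bank", "Bank"),
    ("bank", "Ufer"),
    ("bank", "Bank"),
    ("river", "Fluss"),
]

def build_phrase_table(pairs):
    """Count how often each target phrase was seen for each source phrase."""
    table = {}
    for source, target in pairs:
        table.setdefault(source, Counter())[target] += 1
    return table

def most_probable_translation(table, source):
    """Return the most frequent target phrase and its relative frequency."""
    counts = table[source]
    target, count = counts.most_common(1)[0]
    return target, count / sum(counts.values())

table = build_phrase_table(PHRASE_PAIRS)
best, probability = most_probable_translation(table, "bank")
print(best, round(probability, 2))
```

With only four phrase pairs the estimate is crude, which is exactly why such systems need databases with hundreds of millions of entries: the more examples, the more reliable the frequencies.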
Other public datasets for machine learning
When thinking of possible datasets for machine learning projects, you are spoiled for choice. Machine learning datasets are available for almost every field, discipline, and industry. Here are a few interesting examples:
- Finances and economics: Quandl.com
- Imaging: xViewdataset.org
- Autonomous vehicles: Comma.ai
- Health: mimic.physionet.org
- Face recognition: SCface.org
- Object detection: VisualGenome.org
- Handwriting: yann.lecun.com/exdb/mnist/
- E-mails: cs.cmu.edu/~enron
- Music: millionsongdataset.com
And many, many more. For instance, you can see over 300 datasets, grouped into different categories here.
As you can see, there are tons of various open datasets for machine learning. Sometimes, we also use them in our projects when they can be beneficial. Mostly, however, we work with datasets and databases provided by our clients, in order to solve their challenges and issues. We can help your company as well! Give us a call or write us an e-mail and let’s get in touch! You can go straight to the contact section.
We’re looking forward to hearing from you!