License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
It is not always easy to find datasets from real-world manufacturing plants, especially mining plants, so I would like to share this database with the community. It comes from one of the most important parts of a mining process: a flotation plant!
PLEASE HELP ME GET MORE DATASETS LIKE THIS BY FILLING OUT A 30-SECOND SURVEY:
The main goal is to use this data to predict how much impurity is in the ore concentrate. Because this impurity is measured only every hour, predicting how much silica (impurity) is in the ore concentrate gives the engineers early information to act on (empowering!). They can then take corrective actions in advance (reducing impurity, if needed) and also help the environment, since reducing silica in the ore concentrate reduces the amount of ore that goes to tailings.
The first column shows the date and time (from March 2017 to September 2017). Some columns were sampled every 20 seconds; others were sampled on an hourly basis.
The second and third columns are quality measures of the iron ore pulp right before it is fed into the flotation plant. Columns 4 through 8 are the most important variables that affect ore quality at the end of the process. Columns 9 through 22 contain process data (level and air flow inside the flotation columns), which also affect ore quality. The last two columns are the final iron ore pulp quality measurements from the lab. The target is to predict the last column, the % of silica in the iron ore concentrate.
I have been working with this dataset for at least six months and would like to see if the community can help answer the following questions:
Is it possible to predict % Silica Concentrate every minute?
How many steps (hours) ahead can we predict % Silica in Concentrate? This would help engineers act in a predictive and optimized way, mitigating the % of iron that could otherwise have gone to tailings.
Is it possible to predict % Silica in Concentrate without using the % Iron Concentrate column (as they are highly correlated)? A minimal modeling sketch addressing this question follows below.
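Since these questions describe a time-series regression task, here is a minimal sketch (not the dataset author's code) of one way to approach the last question. The CSV file name, the 'date', '% Silica Concentrate' and '% Iron Concentrate' column names, and the comma decimal separator are assumptions that may need to be adapted.

```python
# Minimal sketch: predict % Silica Concentrate without the % Iron Concentrate
# column. File name, column names and decimal separator are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv",
                 decimal=",", parse_dates=["date"])

# Align the 20-second process measurements with the hourly lab measurements.
hourly = df.set_index("date").resample("1H").mean().dropna()

target = "% Silica Concentrate"
leaky = "% Iron Concentrate"      # highly correlated with the target, so excluded
X = hourly.drop(columns=[target, leaky])
y = hourly[target]

# Keep the time order: train on the earlier months, test on the later ones.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```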
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.
Timo Baumann, Arne Köhn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening. Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016.
Arne Köhn, Florian Stegen and Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
CLARIN Metadata summary for The Spoken Wikipedia Corpora (CMDI-based)
Title: The Spoken Wikipedia Corpora
Description: The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.
Publication date: 2017
Data owner: Timo Baumann - Universität Hamburg
Contributors: Timo Baumann (author), Arne Köhn (author), Florian Stegen (author)
Languages: English (eng), German (deu), Dutch (nld)
Size: 5,397 articles, 1,005 hours
Segmentation units: other
Genre: encyclopedia
Modality: spoken
References:
Timo Baumann, Arne Köhn and Felix Hennig (2018). The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening.
Arne Köhn, Florian Stegen and Timo Baumann (2016). Mining the Spoken Wikipedia for Speech Data and Beyond.
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The Bible (Biblia in Greek) is a collection of sacred texts or scriptures that Jews and Christians consider to be a product of divine inspiration and a record of the relationship between God and humans (Wikipedia). For data mining purposes, the Bible scriptures can be used for many tasks, such as NLP, classification, sentiment analysis, and other topics at the intersection of data science and theology.
Here you will find the following Bible versions in SQL, SQLite, XML, CSV, and JSON formats:
American Standard-ASV1901 (ASV)
Bible in Basic English (BBE)
Darby English Bible (DARBY)
King James Version (KJV)
Webster's Bible (WBT)
World English Bible (WEB)
Young's Literal Translation (YLT)
Each verse is accessed by a unique key, the combination of the BOOK+CHAPTER+VERSE id.
Example:
Genesis 1:1 (Genesis chapter 1, verse 1) = 01001001 (01 001 001)
Exodus 2:3 (Exodus chapter 2, verse 3) = 02002003 (02 002 003)
The verse-id system is used for faster, simplified queries.
For instance, 01001001 - 02001005 captures all verses from Genesis 1:1 through Exodus 1:5.
Written simply:
SELECT * FROM bible.t_asv WHERE id BETWEEN 01001001 AND 02001005
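As a small illustration of the same id scheme, here is a minimal Python sketch against the SQLite distribution of the data. The database file name is a placeholder, and the table is referenced without the bible. schema prefix, as it would appear in an SQLite file.

```python
# Minimal sketch of composing verse ids and running a range query.
# "bible.db" is a placeholder file name for the SQLite version of the data.
import sqlite3

def verse_id(book: int, chapter: int, verse: int) -> int:
    """Compose the BOOK+CHAPTER+VERSE key, e.g. (1, 1, 1) -> 01001001."""
    # Stored numerically, so the leading zero of the book number is dropped,
    # but range comparisons such as BETWEEN still behave as expected.
    return int(f"{book:02d}{chapter:03d}{verse:03d}")

conn = sqlite3.connect("bible.db")
rows = conn.execute(
    "SELECT * FROM t_asv WHERE id BETWEEN ? AND ?",
    (verse_id(1, 1, 1), verse_id(2, 1, 5)),  # Genesis 1:1 through Exodus 1:5
).fetchall()
```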
Coordinating Tables
There is also a number-to-book key (key_english table), a cross-reference list (cross_reference table), and a bible key containing meta information about the included translations (bible_version_key table). See the SQL table layout below. These tables work together, providing a great basis for a bible-reading and cross-referencing app. In addition, each book is marked with a particular genre, mapped in the number-to-genre key (key_genre_english table), and common abbreviations for each book can be looked up in the abbreviations list (key_abbreviations_english table). While it is expected that your programs will use the verse-id system, book #, chapter #, and verse # columns have been included in the bible version tables.
A Valuable Cross-Reference Table
A very special and valuable addition to these databases is the extensive cross-reference table. It was created from the project at http://www.openbible.info/labs/cross-references/ (see the .txt version included from the http://www.openbible.info website). It is extremely useful in Bible study for discovering related scriptures. For any given verse, you simply query its vid (verse id), and a list of rows will be returned. Each of those rows has a rank (r) for relevance, a start verse (sv), and an end verse (ev) if there is one.
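For example, a minimal cross-reference lookup against the SQLite distribution might look like the sketch below. The database file name is a placeholder; the vid, r, sv, and ev column names come from the description above, and the assumption that a higher rank means greater relevance is mine.

```python
# Minimal sketch of a cross-reference lookup. "bible.db" is a placeholder
# file name for the SQLite version of the data.
import sqlite3

conn = sqlite3.connect("bible.db")
john_3_16 = 43003016  # book 43 (John), chapter 3, verse 16

# Each row holds a relevance rank (r), a start verse (sv) and, when the
# reference spans several verses, an end verse (ev).
related = conn.execute(
    "SELECT r, sv, ev FROM cross_reference WHERE vid = ? ORDER BY r DESC",
    (john_3_16,),
).fetchall()
for rank, start_verse, end_verse in related:
    print(rank, start_verse, end_verse)
```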
Basic Web Interaction
The web folder contains two PHP files. Edit the first few lines of index.php to match your server's settings, then place the files in a folder on your web server. The references search box accepts multiple comma-separated values (e.g. John 3:16, Rom 3:23, 1 Jn 1:9, Romans 10:9-10). You can also link directly to a verse by altering the URI: http://localhost/index.php?b=John 3:16, Rom 3:23, 1 Jn 1:9, Romans 10:9-10
In the CSV folder, you will find the following tables (in the same order as in the other formats):
![bible_version_key](http://i.imgur.com/S9JialN.png)
![key_abbreviations_english](http://i.imgur.com/v59SpQs.png)
![key_english](http://i.imgur.com/BbKMQgF.png)
![key_genre_english](http://i.imgur.com/lJVVW2C.png)
![t_version](http://i.imgur.com/jJ4cf4q.png)
On behalf of the original contributors (GitHub)
WordNet as an additional semantic resource for NLP
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.
- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.
- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications (a minimal sketch of this kind of processing is shown below).
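The original processing snippets are not reproduced here; the following is a minimal sketch, in the spirit of the tools described above, for streaming a .bz2 text dump and counting word frequencies. The file name is a placeholder.

```python
# Minimal sketch: stream a .bz2 text dump and count word frequencies.
# "tawiki_articles.txt.bz2" is a placeholder file name.
import bz2
from collections import Counter

counts = Counter()
with bz2.open("tawiki_articles.txt.bz2", mode="rt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Print the 20 most frequent tokens in the corpus.
for word, freq in counts.most_common(20):
    print(word, freq)
```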
Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.
- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications (see the sketch below).
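As a minimal illustration of the embeddings use case, here is a sketch using gensim. The input file name is a placeholder, and whitespace tokenization is a deliberate simplification.

```python
# Minimal sketch: train Tamil word embeddings on the raw text with gensim.
# "tamil_wiki.txt" is a placeholder; whitespace tokenization is a simplification.
from gensim.models import Word2Vec

with open("tamil_wiki.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model.save("tamil_word2vec.model")

# Nearest neighbours of a given word, e.g.:
# print(model.wv.most_similar("தமிழ்", topn=10))
```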
I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.
This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dominance hierarchies have been studied for almost 100 years. The science of science approach used here provides high-level insight into how the dynamics of dominance hierarchy research have shifted over this long timescale. To summarize these patterns, I extracted publication metadata using a Google Scholar search for the phrase ‘dominance hierarchy’, resulting in over 26 000 publications. I used text mining approaches to assess patterns in three areas: (1) general patterns in publication frequency and rate, (2) dynamics of term usage and (3) term co-occurrence in publications across the history of the field. While the overall number of publications per decade continues to rise, the percent growth rate has fallen in recent years, demonstrating that although there is sustained interest in dominance hierarchies, the field is no longer experiencing the explosive growth it showed in earlier decades. Results from title term co-occurrence networks and community structure show that the different subfields of dominance hierarchy research were most strongly separated early in the field’s history while modern research shows more evidence for cohesion and a lack of distinct term community boundaries. These methods provide a general view of the history of research on dominance hierarchies and can be applied to other fields or search terms to gain broad synthetic insight into patterns of interest, especially in fields with large bodies of literature.
This article is part of the theme issue ‘The centennial of the pecking order: current state and future prospects for the study of dominance hierarchies’.
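The abstract above does not include the underlying analysis code; the following is a minimal sketch, under my own assumptions about the input, of how a title term co-occurrence network with community detection could be built.

```python
# Minimal sketch: build a term co-occurrence network from publication titles
# and detect term communities. The example titles are placeholders; real input
# would be the titles mined from the Google Scholar search results.
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

titles = [
    "dominance hierarchy formation in chickens",
    "aggression and dominance rank in primates",
]

G = nx.Graph()
for title in titles:
    terms = set(title.lower().split())
    for a, b in combinations(sorted(terms), 2):
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)

# Term communities give a rough picture of subfields in the literature.
communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(i, sorted(community))
```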
I always wanted to access a dataset related to the world's population (country-wise), but I could not find a properly documented one, so I created one myself.
I knew I wanted to create a dataset but did not know how to do so, so I started searching the internet for country population data. Wikipedia was my first stop, but the results were not acceptable, and it listed only around 190 countries. I kept searching for quite some time until I stumbled upon a great website you have probably heard of: Worldometer. It was exactly the website I was looking for, with more details than Wikipedia and more rows, that is, more countries with their populations.
Once I found the data, my next task was to get it onto my machine. Of course, I could not get the raw data directly, and I did not email the site for it. Instead, I learned a new skill that is very important for a data scientist. I read somewhere that to obtain data from websites you need to use this technique. Any guesses? Keep reading and you will find out in the next paragraph.
![Web scraping and data mining with Python](https://fiverr-res.cloudinary.com/images/t_main1,q_auto,f_auto/gigs/119580480/original/68088c5f588ec32a6b3a3a67ec0d1b5a8a70648d/do-web-scraping-and-data-mining-with-python.png)
You are right, it's web scraping. I learned it so that I could convert the data into CSV format. Below I share the scraper code that I wrote; I also found a way to convert the pandas DataFrame directly to a CSV (comma-separated values) file and store it on my computer. Just go through my code and you will see what I'm talking about.
Below is the code that I used to scrape the data from the website:
![Scraper code screenshot](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3200273%2Fe814c2739b99d221de328c72a0b2571e%2FCapture.PNG?generation=1581314967227445&alt=media)
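Since the original scraper is only shown as a screenshot above, here is a minimal sketch of the same idea. The Worldometer URL, the table index, and the output file name are my assumptions, not the author's code.

```python
# Minimal sketch (not the original scraper): read the population-by-country
# table from Worldometer with pandas and save it as CSV. URL, table index and
# output file name are assumptions.
import pandas as pd

URL = "https://www.worldometers.info/world-population/population-by-country/"
tables = pd.read_html(URL)        # parses every <table> element on the page
population = tables[0]            # assuming the country table is the first one
population.to_csv("population_by_country.csv", index=False)
print(population.head())
```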
I could not have obtained the data without Worldometer, so special thanks to the website.
I don't have any specific questions to ask. Feel free to find your own ways to use the data, and let me know via a kernel if you find something interesting.
I always wanted to access a dataset related to the coronavirus (country-wise), but I could not find a properly documented one, so I created one myself, thinking it would be really helpful for others.
I knew I wanted to create a dataset but did not know how to do so, so I started searching the internet for country-wise coronavirus case counts. Wikipedia was my first stop, but the results were not satisfactory. I kept searching for quite some time until I stumbled upon a great website you have probably heard of: Worldometer. It was exactly the website I was looking for, with more details than Wikipedia and more rows, that is, more countries with more details about their cases.
Once I found the data, my next task was to get it onto my machine. Of course, I could not get the raw data directly, and I did not email the site for it. Instead, I learned a new skill that is very important for a data scientist. I read somewhere that to obtain data from websites you need to use this technique. Any guesses? Keep reading and you will find out in the next paragraph.
![Web scraping and data mining with Python](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3200273%2F929b6e449f4d4962299445bc9cf9e7f2%2Fdo-web-scraping-and-data-mining-with-python.jfif?generation=1585172688729088&alt=media)
You are right, it's web scraping. I learned it so that I could convert the data into CSV format. Below I share the scraper code that I wrote; I also found a way to convert the pandas DataFrame directly to a CSV (comma-separated values) file and store it on my computer. Just go through my code and you will see what I'm talking about.
Below is the code that I used to scrape the data from the website:
![Scraper code screenshot](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3200273%2F20da1f48036897a048a72e94f982acb8%2FCapture.PNG?generation=1585172815269902&alt=media)
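As with the population dataset, the original scraper is only shown as a screenshot; a minimal sketch of a date-stamped daily snapshot, under my own assumptions about the URL, table index, and file naming, might look like this.

```python
# Minimal sketch (not the original scraper): save a date-stamped daily snapshot
# of the Worldometer coronavirus table. URL, table index and file naming are
# assumptions.
from datetime import date
import pandas as pd

URL = "https://www.worldometers.info/coronavirus/"
cases = pd.read_html(URL)[0]      # assuming the main country table comes first
cases.to_csv(f"corona_cases_{date.today().isoformat()}.csv", index=False)
```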
I could not have obtained the data without Worldometer, so special thanks to the website. This data was scraped on 25th March at 3:45 PM, and I will try to update it every day.
I don't have any specific questions to ask. Feel free to find your own ways to use the data, and let me know via a kernel if you find something interesting.
License: Database Contents License (DbCL) 1.0, http://opendatacommons.org/licenses/dbcl/1.0/
The two datasets are related to the red and white variants of the Portuguese "**Vinho Verde**" wine. For more details, consult the reference [*Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis*, 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
This dataset is also available from the UCI Machine Learning Repository (Source); I just shared it on Kaggle for convenience. If I am mistaken and the license does not allow this, I will remove the dataset if requested and notified. For more information, please read [Cortez et al., 2009].
| Table | Rows |
|---|---|
| Red Wine | 1599 |
| White Wine | 4898 |
There are 11 input attributes plus 1 output attribute; a minimal loading and modeling sketch follows the attribute list below.
Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Output variable (based on sensory data):
12. quality (score between 0 and 10)
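A minimal sketch of loading one of the CSV files and modeling quality from the eleven inputs above. The file name and the semicolon separator are assumptions that match the UCI distribution of the data.

```python
# Minimal sketch: treat the red wine data as a regression task on the eleven
# physicochemical inputs. File name and ";" separator are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

red = pd.read_csv("winequality-red.csv", sep=";")
X = red.drop(columns=["quality"])   # the 11 input attributes listed above
y = red["quality"]                  # sensory score between 0 and 10

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("Cross-validated MAE:", -scores.mean())
```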
This dataset is also available from the UCI Machine Learning Repository (Source).
I just shared it on Kaggle for convenience. If I am mistaken and the license does not allow this, I will remove the dataset if requested and notified; I am not the owner of this dataset. Also, if you plan to use this dataset in your research or elsewhere, you should read and cite the main source at the UCI Machine Learning Repository.