100+ datasets found
  1. DatasetofDatasets (DoD)

    • kaggle.com
    Updated Aug 12, 2024
    Cite
    Konstantinos Malliaridis (2024). DatasetofDatasets (DoD) [Dataset]. https://www.kaggle.com/datasets/terminalgr/datasetofdatasets-124-1242024
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Konstantinos Malliaridis
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is essentially the metadata of 164 datasets. Each row describes one dataset through 22 extracted features, which are used to classify it into one of the categories 0-Unmanaged, 2-INV, 3-SI, or 4-NOA (DatasetType).

    The dataset consists of 164 rows. Each row is the metadata of another dataset. The target column is DatasetType, which has four values indicating the dataset type. These are:

    2 - Invoice detail (INV): This dataset type is a special report (usually called a Detailed Sales Statement) produced by company accounting or Enterprise Resource Planning (ERP) software. Using an INV-type dataset directly for ARM is extremely convenient for users, as it relieves them from the tedious work of transforming data into another, more suitable form. INV-type data input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier that defines a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items used for data mining (e.g., Product Code, Product Name, Product ID).

    3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.

    4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.

    0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent item sets. Examples include datasets characterized predominantly by numerical features with arbitrary values, or datasets that involve fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Such datasets are not considered in the present treatise and are termed (0) Unmanaged in the sequel.

    Determining the dataset type is crucial for ARM, and the current dataset is used to train a supervised machine learning model that classifies a dataset's type.

    There is also another dataset type, 1 - Market Basket List (MBL), where each dataset row is a transaction. A transaction involves a variable number of items. However, due to this characteristic, these datasets can be easily categorized using procedural programming, and DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining": https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
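
    As a minimal illustration of this intended use, the sketch below loads the DoD metadata and trains a classifier on the 22 features to predict datasetType. The file name is a placeholder and the feature columns are assumed to be numeric; this is not the authors' original model.

      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split

      # Placeholder file name; assumes the 22 feature columns are numeric.
      dod = pd.read_csv("DatasetofDatasets.csv")

      X = dod.drop(columns=["datasetType"])
      y = dod["datasetType"]  # 0 (Unmanaged), 2 (INV), 3 (SI), 4 (NOA)

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, stratify=y, random_state=0)

      clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
      print("held-out accuracy:", clf.score(X_test, y_test))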

  2. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning step. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
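
    A minimal sketch of reading one of the tab-separated ".csv" files described in section 1.1, assuming the column order given there (the file name matches step 1 of section 2.4):

      import csv

      # Column order as described in section 1.1; the files have no header row.
      columns = ["isogramy", "length", "word", "source_pos", "count",
                 "vol_count", "count_per_million", "vol_count_as_percent",
                 "is_palindrome", "is_tautonym"]

      with open("ngrams-isograms.csv", newline="", encoding="utf-8") as f:
          reader = csv.reader(f, delimiter="\t")
          for row in reader:
              record = dict(zip(columns, row))
              # e.g. keep only second-order isograms that are also palindromes
              if record["isogramy"] == "2" and record["is_palindrome"] == "1":
                  print(record["word"])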

  3. A dataset of for cross-course learning path planning with 7 types of learner...

    • scidb.cn
    Updated May 14, 2024
    Cite
    Yong-Wei Zhang (2024). A dataset of for cross-course learning path planning with 7 types of learner and 7 types of course materials [Dataset]. http://doi.org/10.57760/sciencedb.18420
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 14, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Yong-Wei Zhang
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset accompanies the research paper titled "Enhancing Personalized Learning in Online Education through Integrated Cross-Course Learning Path Planning." The dataset consists of MATLAB data files (.mat format).

    The dataset includes data on seven types of learner attributes, named from LearnerA.mat to LearnerG.mat. Each learner dataset contains two variables: L and LP. L is a 10x16 matrix that stores learner attributes, where each row represents a learner. The first column indicates the learner's ability level, the second column indicates the expected learning time, columns 3 to 6 represent normalized learning styles, and columns 7 to 16 represent learning objectives. LP is a structure that stores statistical information about this matrix.

    The dataset also includes data on seven types of learning resource attributes, named DatasetA.mat, DatasetB.mat, DatasetC.mat, DatasetAB.mat, DatasetAC.mat, DatasetBC.mat, and DatasetABC.mat. Each resource dataset contains two variables: M and MP. M is a matrix that stores the attributes of learning materials, where each row represents a material. The first column indicates the material's difficulty level, the second column represents the learning time required for the material, columns 3 to 6 describe the type of material, columns 7 to 16 cover the knowledge points addressed by the material, and columns 17 to 26 list the prerequisite knowledge points required for the material. MP is a structure that stores statistical information about this matrix.

    The dataset encompasses results from learning path planning involving seven types of learners across seven datasets, totaling 49 datasets, named in the format PathCost4_LSHADE_cnEpSin_D_X_L_Y.mat. Here, X represents the type of learning resource dataset (A, B, C, AB, AC, BC, ABC) and Y represents the type of learner (A to G). Each data file contains three variables: Gbest, Gtime, and S. Gbest is a 30x10 matrix, where each column stores the best cost function obtained from 30 runs of path planning for a learner on the corresponding dataset. Gtime is a 30x10 matrix, where each column stores the time spent on each run for a learner on the corresponding dataset. S is a 30x10 cell array storing the status information from each run.

    Finally, the dataset includes a compilation of the best cost functions for all runs for all learners across all learning material datasets, named learnerBest.mat. The file contains a variable, learnerBest, which is a 7x7x10x30 four-dimensional array. The first dimension represents the type of learner, the second dimension represents the type of learning material, the third dimension represents the learner index, and the fourth dimension represents the run index.
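
    A minimal sketch of loading these .mat files from Python with SciPy; the result file name below is just one instance of the stated naming pattern (X = A, Y = A):

      from scipy.io import loadmat

      # One learner file: L is the 10x16 learner-attribute matrix described above.
      learner = loadmat("LearnerA.mat")
      L = learner["L"]
      ability = L[:, 0]            # first column: ability level
      print(L.shape)               # expected (10, 16)

      # One result file (naming pattern PathCost4_LSHADE_cnEpSin_D_X_L_Y.mat).
      results = loadmat("PathCost4_LSHADE_cnEpSin_D_A_L_A.mat")
      gbest = results["Gbest"]     # 30x10: best cost per run (rows) and learner (columns)
      print(gbest.mean(axis=0))    # average best cost for each of the 10 learners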

  4. Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jenna Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.

    Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

    1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with the following information:
    - 1st column: instance id (numeric): unique id assigned to a name instance
    - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
    - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
    - 4th column: author name (string): name string formatted as surname, comma, and forename(s)
    - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
    - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
    - 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
    - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

    2. 'Records' files contain lists of papers. Each paper is associated with the following information:
    - 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
    - 2nd column: year (numeric): year of publication. Some papers may have wrong publication years due to incorrect indexing or delayed updates in the original data.
    - 3rd column: venue (string): name of journal or conference in which the paper is published. Venue names can be in full string or in a shortened format according to the formats in the original data.
    - 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline. Author names are formatted into surname, comma, and forename(s).
    - 5th column: title words (string; separated by space): words in the title of the paper. Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

    3. 'Clusters' files contain lists of clusters. Each cluster is associated with the following information:
    - 1st column: cluster id (numeric): unique id of a cluster
    - 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column)

    Signatures and Clusters files consist of two subsets - train and test files - of the original labeled data, which were randomly split 50%-50% by the authors of this study.

    Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

    [AMiner.zip]
    Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
    Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

    [KISTI.zip]
    Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
    Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
    Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

    [GESIS.zip]
    Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
    Note that this study reuses the 'Evaluation Set' among the original GESIS data, to which titles were added by the study below.
    Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

    [UM-IRIS.zip]
    This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
    Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
    For details on the labeling method and limitations, see the paper below.
    Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
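
    A minimal sketch of loading a Signatures file with pandas; the delimiter is not stated in the description, so tab-separation is assumed and may need adjusting:

      import pandas as pd

      # Column names follow the Signatures description above; tab-separated layout assumed.
      cols = ["instance_id", "paper_id", "byline_position", "author_name",
              "ethnic_name_group", "affiliation", "block", "author_id"]
      signatures = pd.read_csv("signatures_train.txt", sep="\t", header=None, names=cols)

      # e.g. how many labeled name instances fall into each ethnic name group
      print(signatures["ethnic_name_group"].value_counts())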

  5. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the data sets published here contain actual data, they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 refers to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column  Type    Description
    UniProt ID          string  protein identification
    label               string  protein label (type of node)
    properties          string  a dictionary containing properties related to the protein

    CSV edges

    Each dataset contains the following columns:

    Name of the Column  Type    Description
    Relationship ID     string  relationship identification
    Source ID           string  identification of the source protein in the relationship
    Target ID           string  identification of the target protein in the relationship
    label               string  relationship label (type of relationship)
    properties          string  a dictionary containing properties related to the relationship

    Metadata

    Graph           Number of Nodes  Number of Edges  Sparse graph
    dataset_30*     30               47               Y
    dataset_60*     60               181              Y
    dataset_120*    120              689              Y
    dataset_240*    240              2819             Y
    dataset_300*    300              4658             Y
    dataset_600*    600              18004            Y
    dataset_1200*   1200             71785            Y
    dataset_2400*   2400             288600           Y
    dataset_3000*   3000             449727           Y
    dataset_6000*   6000             1799413          Y
    dataset_12000*  12000            7199863          Y
    dataset_24000*  24000            28792361         Y
    dataset_30000*  30000            44991744         Y

    This repository includes two (2) additional tiny graph datasets to experiment with before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column  Type    Description
    ID                  string  node identification
    label               string  node label (type of node)
    properties          string  a dictionary containing properties related to the node

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column  Type    Description
    ID                  string  relationship identification
    source              string  identification of the source node in the relationship
    target              string  identification of the target node in the relationship
    label               string  relationship label (type of relationship)
    properties          string  a dictionary containing properties related to the relationship

    Metadata (tiny graphs)

    Graph            Number of Nodes  Number of Edges  Sparse graph
    dataset_dummy*   3                6                N
    dataset_dummy2*  3                6                N
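
    A minimal sketch of loading one node/edge CSV pair and assembling a graph with networkx; it assumes the CSV headers match the column names listed above and treats the graph as directed:

      import pandas as pd
      import networkx as nx  # assumes networkx is installed

      nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
      edges = pd.read_csv("dataset_30_edges_interactions.csv")

      g = nx.DiGraph()
      for _, row in nodes.iterrows():
          g.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])
      for _, row in edges.iterrows():
          g.add_edge(row["Source ID"], row["Target ID"],
                     label=row["label"], properties=row["properties"])

      print(g.number_of_nodes(), g.number_of_edges())  # expected: 30 47
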
  6. Data from: A Community Resource for Exploring and Utilizing Genetic...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: A Community Resource for Exploring and Utilizing Genetic Diversity in the USDA Pea Single Plant Plus Collection [Dataset]. https://catalog.data.gov/dataset/data-from-a-community-resource-for-exploring-and-utilizing-genetic-diversity-in-the-usda-p-3edc2
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    Included in this dataset are SNP and fasta data for the Pea Single Plant Plus Collection (PSPPC) and the PSPPC augmented with 25 P. fulvum accessions. These 6 datasets can be roughly divided into two groups. Group 1 consists of three datasets labeled PSPPC, which refer to SNP data pertaining to the USDA Pea Single Plant Plus Collection. Group 2 consists of three datasets labeled PSPPC + P. fulvum, which refer to SNP data pertaining to the USDA PSPPC with 25 accessions of Pisum fulvum added. SNPs for each of these groups were called independently; therefore SNP names that are shared between the PSPPC and PSPPC + P. fulvum groups should NOT be assumed to refer to the same locus.

    For analysis, SNP data is available in two widely used formats: hapmap and vcf. These formats can be successfully loaded into TASSEL v. 5.2.25 (http://www.maizegenetics.net/tassel). Explanations of fields (columns) in the VCF files are contained within commented (##) rows at the top of the file. Descriptions of the first 11 columns in the hapmap file are as follows:

    rs# - Name of locus (i.e. SNP name)
    alleles - Indicates the SNPs for each allele at the locus
    chrom - Irrelevant for these datasets, since markers are unordered
    pos - Irrelevant for these datasets, since markers are unordered
    strand - Irrelevant for these datasets, since markers are unordered
    assembly# - Required field for hapmap format. NA for these datasets
    center - Required field for hapmap format. NA for these datasets
    protLSID - Required field for hapmap format. NA for these datasets
    assayLSID - Required field for hapmap format. NA for these datasets
    panel - Required field for hapmap format. NA for these datasets
    QCcode - Required field for hapmap format. NA for these datasets

    The fasta sequences containing the SNPs are also available for such downstream applications as development of primers for platform-specific markers. For more information about this dataset, contact Clarice Coyne at Clarice.Coyne@usda.gov or coynec@wsu.edu.

    Resources in this dataset:

    Resource Title: PSPPC SNPs in hapmap format. File Name: PSPPC.hmp.txt
    Resource Description: 66591 unanchored SNPs for the PSPPC collection in hapmap format
    Resource Software Recommended: TASSEL, url: http://www.maizegenetics.net/tassel

    Resource Title: PSPPC SNP FASTA Sequences. File Name: PSPPC.fa.txt
    Resource Description: FASTA sequences for each allele of the PSPPC SNP dataset

    Resource Title: PSPPC + P. fulvum SNPs in hapmap format. File Name: PSPPC+fulvums.hmp.txt
    Resource Description: 67400 SNPs from the PSPPC augmented with 25 P. fulvum accessions in hapmap format. SNP names are independent and unrelated to plain PSPPC SNP files.
    Resource Software Recommended: TASSEL, url: http://www.maizegenetics.net/tassel

    Resource Title: PSPPC + P. fulvum SNP FASTA Sequences. File Name: PSPPC+fulvums.fa.txt
    Resource Description: FASTA sequences for each allele of the PSPPC + P. fulvum SNP dataset. SNP names are independent and unrelated to plain PSPPC SNP files.

    Resource Title: PSPPC + P. fulvum SNPs in vcf format. File Name: PSPPC+fulvums.vcf.txt
    Resource Description: 67400 SNPs from the PSPPC augmented with 25 P. fulvum accessions in vcf format. SNP names are independent and unrelated to plain PSPPC SNP files.
    Resource Software Recommended: TASSEL, url: http://www.maizegenetics.net/tassel

    Resource Title: PSPPC SNPs in vcf format. File Name: PSPPC.vcf.txt
    Resource Description: 66591 SNPs from the PSPPC in vcf format
    Resource Software Recommended: TASSEL, url: http://www.maizegenetics.net/tassel

    Resource Title: README. File Name: Data Dictionary.docx
    Resource Description: Data dictionary for the PSPPC and PSPPC + P. fulvum datasets; it repeats the group definitions, file formats, and hapmap column descriptions given above.
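
    TASSEL is the recommended tool, but as a minimal sketch the hapmap file can also be inspected with pandas, assuming the standard tab-separated hapmap layout with a header row:

      import pandas as pd

      # Tab-separated hapmap file with a header row (standard hapmap layout assumed).
      hmp = pd.read_csv("PSPPC.hmp.txt", sep="\t", low_memory=False)

      print(hmp.shape)              # rows = SNPs (66591 expected); 11 metadata columns + accessions
      print(hmp["rs#"].head())      # SNP names
      print(hmp["alleles"].head())  # alleles observed at each locus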

  7. Data set collection for flow delegation - Vdataset - LDM

    • service.tib.eu
    Updated Aug 4, 2023
    + more versions
    Cite
    (2023). Data set collection for flow delegation - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1218
    Explore at:
    Dataset updated
    Aug 4, 2023
    Description

    Abstract: This data set collection consists of 17 data sets used for the analytical / simulative evaluation of the flow delegation concept presented in "Flow Delegation: Flow Table Capacity Bottleneck Mitigation for Software-defined Networks". Example code for processing the data sets can be found at https://github.com/kit-tm/fdeval.

    Technical remarks: The data set collection is a zip file that contains 17 sqlite database files that can be inspected with any sqlite-capable database reader (such as https://sqlitebrowser.org/). The folder names in the unzipped file indicate the names of the data sets (from d20 to d5050). Each database consists of a single table called "statistics" that gives access to the scenario parameters and evaluation results. Each row in the table represents a single execution of the evaluation environment (i.e., one experiment). The columns starting with scenario are the parameters used for scenario / experiment generation. All other columns except for id and resultid (those two columns are not essential to the data set and can be ignored) refer to statistics gathered for one experiment.

    Columns starting with json contain a serialized json object and need to be de-serialized, e.g., by something like arr = json.loads(string) if python is used, where string is the content from the column and arr is an array of floating point numbers. These columns contain time series data, i.e., the statistics were gathered for multiple time slots. Example code for processing the data sets can be found at https://github.com/kit-tm/fdeval (plotters folder). The GitHub page also contains additional details about the data sets in this collection.
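
    A minimal sketch of reading one of the databases from Python, following the json.loads hint above; the database file name is a placeholder, since only the folder names (d20 to d5050) are stated:

      import json
      import sqlite3

      con = sqlite3.connect("d20/statistics.db")   # placeholder path; adjust to the actual file name
      con.row_factory = sqlite3.Row
      row = con.execute("SELECT * FROM statistics LIMIT 1").fetchone()

      # De-serialize every json-encoded time-series column of this experiment.
      for key in row.keys():
          if key.startswith("json"):
              arr = json.loads(row[key])           # list of floats, one value per time slot
              print(key, len(arr))
      con.close()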

  8. Dataset of books called Developing analytical database applications

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Developing analytical database applications [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Developing+analytical+database+applications
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Developing analytical database applications. It features 7 columns including author, publication date, language, and book publisher.

  9. Data from: Proxy data

    • figshare.com
    zip
    Updated Mar 24, 2017
    Cite
    Lilo Henke (2017). Proxy data [Dataset]. http://doi.org/10.6084/m9.figshare.4787575.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 24, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lilo Henke
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Zip file containing .csv files of all proxy records listed in Tables A1 and A2 of the paper "Was the Little Ice Age more or less El Niño-like than the Medieval Climate Anomaly? Evidence from hydrological and temperature proxy data".

    Tree ring records taken from the Mann et al. (2008) dataset are in a separate folder called "Mann2008". These are named as abbreviated location followed by a core number. All other file names follow the format [author][yyyy][optional suffix].

    Each file contains a "YearBP" column (age in years before present) and a column with proxy data points.

    All proxy records in this collection were originally downloaded from the NOAA Paleoclimatology/Pangaea Databases (https://www.ncdc.noaa.gov/data-access/paleoclimatology-data/datasets), including the Mann et al. (2008) tree ring dataset and the Tierney et al. (2015) coral dataset.

  10. Dataset of books called Formal set theory

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Formal set theory [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Formal+set+theory
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Formal set theory. It features 7 columns including author, publication date, language, and book publisher.

  11. U.S. Select Demographics by Census Block Groups

    • dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 8, 2023
    Cite
    Bryan, Michael (2023). U.S. Select Demographics by Census Block Groups [Dataset]. http://doi.org/10.7910/DVN/UZGNMM
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Bryan, Michael
    Area covered
    United States
    Description

    Overview

    This dataset re-shares cartographic and demographic data from the U.S. Census Bureau to provide an obvious supplement to Open Environments Block Group publications. These results do not reflect any proprietary or predictive model. Rather, they extract from Census Bureau results with some proportions and aggregation rules applied. For additional support or more detail, please see the Census Bureau citations below.

    Cartographics refer to shapefiles shared in the Census TIGER/Line publications. Block Group areas are updated annually, with major revisions accompanying the Decennial Census at the turn of each decade. These shapes are useful for visualizing estimates as a map and relating geographies based upon geo-operations like overlapping. This data is kept in a geodatabase file format and requires the geopandas package and its supporting fiona and GDAL software.

    Demographics are taken from popular variables in the American Community Survey (ACS), including age, race, income, education and family structure. This data simply requires csv reader software or python's pandas package.

    While the demographic data has many columns, the cartographic data has a very, very large column called "geometry" storing the many-point boundaries of each shape. So, this process saves the data separately, with the demographics columns in a csv file and the geometry in a pickle file needing an installation of geopandas, fiona and GDAL software. More details on the ACS variables selected and derivation rules applied can be found in the commentary docstrings in the source code found here: https://github.com/OpenEnvironments/blockgroupdemographics.

    Files

    The demographics columns are saved in a csv file named YYYYblockgroupdemographics.csv. The cartographic column, 'geometry', is shared as a file named YYYYblockgroupdemographics-geometry.pkl. This file needs an installation of geopandas, fiona and GDAL software.
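
    A minimal sketch of recombining the two files, assuming the pickled 'geometry' column aligns with the csv rows by index (2020 is used purely as an example year):

      import pandas as pd
      import geopandas as gpd  # requires geopandas with fiona/GDAL installed

      demographics = pd.read_csv("2020blockgroupdemographics.csv")
      geometry = pd.read_pickle("2020blockgroupdemographics-geometry.pkl")

      # Reattach the geometry column to the demographics table.
      gdf = gpd.GeoDataFrame(demographics, geometry=geometry)
      print(gdf.total_bounds)  # bounding box of all block groups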

  12. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (Repository creator) (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (Repository creator)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official website of Amazon. The data was scraped in January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before usage.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
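
    A minimal sketch of the suggested shuffle, assuming the two columns are literally named "reviews" and "labels":

      import pandas as pd

      rt = pd.read_csv("data_rt.csv")
      # Shuffle so positive and negative samples are interleaved before training.
      rt = rt.sample(frac=1, random_state=42).reset_index(drop=True)

      print(rt["labels"].value_counts())  # expected: 5331 fresh (1) and 5331 rotten (0)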

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine-learning models for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  13. The Con Espressione Game Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 5, 2020
    Cite
    Chowdhury, Shreyan (2020). The Con Espressione Game Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3968827
    Explore at:
    Dataset updated
    Nov 5, 2020
    Dataset provided by
    Widmer, Gerhard
    Cancino-Chacón, Carlos Eduardo
    Chowdhury, Shreyan
    Aljanaki, Anna
    Peter, Silvan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Con Espressione Game Dataset

    A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listeners were asked to describe, using freely chosen words (preferably: adjectives), how they perceive the expressive character of the different performances. The aim of this research is to find the dimensions of musical expression (in Western classical piano music) that can be attributed to a performance, as perceived and described in natural language by listeners.

    The Con Espressione Game was launched on the 3rd of April 2018.

    Dataset structure

    Listeners’ Descriptions of Expressive performance

    piece_performer_data.csv: A comma separated file (CSV) containing information about the pieces in the dataset. Strings are delimited with ". The columns in this file are:

    music_id: An integer ID for each performance in the dataset.

    performer_name: (Last) name of the performer.

    piece_name: (Short) name of the piece.

    performance_name: Name of the performance. All files in different modalities (alignments, MIDI, loudness features, etc.) corresponding to a single performance will have the same name (but possibly different extensions).

    composer: Name of the composer of the piece.

    piece: Full name of the piece.

    album: Name of the album.

    performer_name_full: Full name of the performer.

    year_of_CD_issue: Year of the issue of the CD.

    track_number: Number of the track in the CD.

    length_of_excerpt_seconds: Length of the excerpt in seconds.

    start_of_excerpt_seconds: Start of the excerpt in its corresponding track (in seconds).

    end_of_excerpt_seconds: End of the excerpt in its corresponding track (in seconds).

    con_espressione_game_answers.csv: This is the main file of the dataset which contains listener’s descriptions of expressive character. This CSV file contains the following columns:

    answer_id: An integer representing the ID of the answer. Each answer gets a unique ID.

    participant_id: An integer representing the ID of a participant. Answers with the same ID come from the same participant.

    music_id: An integer representing the ID of the performance. This is the same as the music_id in piece_performer_data.csv described above.

    answer: (cleaned/formatted) participant description. All answers have been written as lower-case, typos were corrected, spaces replaced by underscores (_) and individual terms are separated by commas. See cleanup_rules.txt for a more detailed description of how the answers were formatted.

    original_answer: Raw answers provided by the participants.

    timestamp: Timestamp of the answer.

    favorite: A boolean (0 or 1) indicating if this performance of the piece is the participant’s favorite.

    translated_to_english: Raw translation (from German, Russian, Spanish and Italian).

    performer: (Last) name of the performer. See piece_performer_data.csv described above.

    piece_name: (Short) name of the piece. See piece_performer_data.csv described above.

    performance_name: Name of the performance. See piece_performer_data.csv described above.

    participant_profiles.csv. A CSV file containing musical background information of the participants. Empty cells mean that the participant did not provide an answer. This file contains the following columns:

    participant_id: An integer representing the ID of a participant.

    music_education_years: (Self reported) number of years of musical education of the participants

    listening_to_classical_music: Answers to the question “How often do you listen to classical music?”. The possible answers are:

    1: Never

    2: Very rarely

    3: Rarely

    4: Occasionally

    5: Frequently

    6: Very frequently

    registration_date: Date and time of registration of the participant.

    playing_piano: Answer to the question “Do you play the piano?”. The possible answers are

    1: No

    2: A little bit

    3: Quite well

    4: Very well

    cleanup_rules.txt: Rules for cleaning/formatting the terms in the participant’s answers.

    translations_GERMAN.txt: How the translations from German to English were made.

    Metadata

    Related meta data is stored in the MetaData folder.

    Alignments. This folder contains the manually-corrected score-to-performance alignments for each of the pieces in the dataset. Each of these alignments is a text file.

    ApproximateMIDI. This folder contains reconstructed MIDI performances created from the alignments and the loudness curves. The onset time and offset times of the notes were determined from the alignment times and the MIDI velocity was computed from the loudness curves.

    Match. This folder contains score-to-performance alignments in Matchfile format.

    Scores_MuseScore. Manually encoded sheet music in MuseScore format (.mscz)

    Scores_MusicXML. Sheet music in MusicXML format.

    Scores_pdf. Images of the sheet music in pdf format.

    Audio Features

    Audio features computed from the audio files. These features are located in the AudioFeatures folder.

    Loudness: Text files containing loudness curves in dB of the audio files. These curves were computed using code provided by Olivier Lartillot. Each of these files contains the following columns:

    performance_time_(seconds): Performance time in seconds.

    loudness_(db): Loudness curve in dB.

    smooth_loudness_(db): Smoothed loudness curve.

    Spectrograms. Numpy files (.npy) containing magnitude spectrograms (as Numpy arrays). The shape of each array is (149 frequency bands, number of frames of the performance). The spectrograms were computed from the audio files with the following parameters:

    Sample rate (sr): 22050 samples per second

    Window length: 2048

    Frames per Second (fps): 31.3 fps

    Hop size: sample_rate // fps = 704

    Filterbank: log scaled filterbank with 24 bands per octave and min frequency 20 Hz
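
    A minimal sketch of loading one spectrogram and converting frame counts back to seconds with the stated frame rate; the file name is a placeholder for a real performance_name:

      import numpy as np

      spec = np.load("Spectrograms/performance_name.npy")  # placeholder file name
      print(spec.shape)                 # (149 frequency bands, number of frames)

      fps = 31.3                        # frames per second, as stated above
      print(spec.shape[1] / fps)        # approximate excerpt duration in seconds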

    MIDI Performances

    Since the dataset consists of commercial recordings, we cannot include the audio files in the dataset. We can, however, share the 2 synthesized MIDI performances used in the Con Espressione game (for Bach’s Prelude in C and the second movement of Mozart’s Sonata in C K 545) in mp3 format. These performances can be found in the MIDIPerformances folder.

  14. United States Baby Names Count

    • kaggle.com
    Updated Dec 4, 2023
    Cite
    The Devastator (2023). United States Baby Names Count [Dataset]. https://www.kaggle.com/datasets/thedevastator/united-states-baby-names-count/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Area covered
    United States
    Description

    United States Baby Names Count

    United States Baby Names Dataset

    By Amber Thomas [source]

    About this dataset

    The data is based on a complete sample of records on Social Security card applications as of March 2021 and is presented in three main files: baby-names-national.csv, baby-names-state.csv, and baby-names-territories.csv. These files contain detailed information about names given to babies at the national level (50 states and District of Columbia), state level (individual states), and territory level (including American Samoa, Guam, Northern Mariana Islands Puerto Rico and U.S. Virgin Islands) respectively.

    Each entry in the dataset includes several key attributes such as state_abb or territory_code representing the abbreviation or code indicating the specific state or territory where the baby was born. The sex attribute denotes the gender of each baby – either male or female – while year represents the specific birth year when each baby was born.

    Another important attribute is name, which indicates the given name selected for each individual newborn. The count attribute provides numerical data about how many babies received a particular name within a specific state/territory, gender, and year combination.

    It's also worth noting that all names included have at least two characters in length to ensure high data quality standards.

    How to use the dataset

    - Understanding the Columns

    The dataset consists of multiple columns with specific information about each baby name entry. Here are the key columns in this dataset:

    • state_abb: The abbreviation of the state or territory where the baby was born.
    • sex: The gender of the baby.
    • year: The year in which the baby was born.
    • name: The given name of the baby.
    • count: The number of babies with a specific name born in a certain state, gender, and year.

    - Exploring National Data

    To analyze national trends or overall popularity across all states and years: a) Focus on baby-names-national.csv. b) Use columns like name, sex, year, and count to study trends over time.

    - Analyzing State-Level Data

    To examine specific states' data: a) Utilize baby-names-state.csv file. b) Filter data by desired states using state_abb column values. c) Combine analysis with other relevant attributes like gender, year, etc., for detailed insights.

    - Understanding Territory Data

    For insights into United States territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, U.S Virgin Islands): a) Access informative data from baby-names-territories.csv. b) Analyze based on similar principles as state-level data but considering unique territory factors.

    - Gender-Specific Analysis

    You can study names' popularity specifically among males or females by filtering the data using the sex column. This will allow you to explore gender-specific naming trends and preferences.

    - Identifying Regional Patterns

    To identify naming patterns in specific regions: a) Analyze state-level or territory-level data. b) Look for variations in name popularity across different states or territories.

    - Analyzing Name Popularity over Time

    Track the popularity of specific names over time using the name, year, and count columns. This can help uncover trends, fluctuations, and changes in names' usage and popularity (a short example follows this list).

    - Comparing Names and Variations

    Use this
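
    A minimal sketch for the name-popularity analysis mentioned above, assuming the national file uses the column names listed earlier (the chosen name is just an example):

      import pandas as pd

      names = pd.read_csv("baby-names-national.csv")

      # Popularity of one example name over time.
      mary = names[(names["name"] == "Mary") & (names["sex"] == "F")]
      by_year = mary.groupby("year")["count"].sum().sort_index()
      print(by_year.tail())  # counts for the most recent years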

    Research Ideas

    • Tracking Popularity Trends: This dataset can be used to analyze the popularity of baby names over time. By examining the count of babies with a specific name born in different years, trends and shifts in naming preferences can be identified.
    • Gender Analysis: The dataset includes information on the gender of each baby. It can be used to study gender patterns and differences in naming choices. For example, it would be possible to compare the frequency and popularity of certain names among males and females.
    • Regional Variations: With state abbreviations provided, it is possible to explore regional variations in baby naming trends within the United States. Researchers could examine how certain names are more popular or unique to specific states or territories, highlighting cultural or geographical factors that influence naming choices

    Acknowledgements

    If you use this dataset in your research, please credit the original a...

  15. Column heading and attribute field name correlation and description for the Titanium_vanadium_deposits.csv and Titanium_vanadium_deposits.shp files

    • datasets.ai
    • data.usgs.gov
    • +2more
    Updated Aug 8, 2024
    + more versions
    Cite
    Department of the Interior (2024). Column heading and attribute field name correlation and description for the Titanium_vanadium_deposits.csv, and Titanium_vanadium_deposits.shp files. [Dataset]. https://datasets.ai/datasets/column-heading-and-attribute-field-name-correlation-and-description-for-the-titanium-vanad
    Explore at:
    Available download formats
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Department of the Interior
    Description

    This Titanium_vanadium_column_headings.csv file correlates the column headings in the Titanium_vanadium_deposits.csv file with the attribute field names in the Titanium_vanadium_deposits.shp file and provides a brief description of each column heading and attribute field name. Also included with this data release are the following files: Titanium_vanadium_deposits.csv file, which lists the deposits and associated information such as the host intrusion, location, grade, and tonnage data, along with other miscellaneous descriptive data about the deposits; Titanium_vanadium_deposits.shp file, which duplicates the information in the Titanium_vanadium_deposits.csv file in a spatial format for use in a GIS; Titanium_vanadium_deposits_concentrate_grade.csv file, which lists the concentrate grade data for the deposits, when available; and Titanium_vanadium_deposits_references.csv file, which lists the abbreviated and full references that are cited in the Titanium_vanadium_deposits.csv, and Titanium_vanadium_deposits.shp, and Titanium_vanadium_deposits_concentrate_grade.csv files.
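
    As a small illustration, the correlation file can be loaded alongside the deposits table to check which column headings are present; the snippet below is a sketch that assumes only the file names given above (the internal column names of the correlation file are not specified here).

    import pandas as pd

    headings = pd.read_csv("Titanium_vanadium_column_headings.csv")
    deposits = pd.read_csv("Titanium_vanadium_deposits.csv")

    # Inspect the heading/field-name correlation table and the deposits columns.
    print(headings.head())
    print(deposits.columns.tolist())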

  16. Open Science in Asia

    • kaggle.com
    Updated Jan 24, 2023
    Cite
    The Devastator (2023). Open Science in Asia [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-science-in-asia
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Asia
    Description

    Open Science in Asia

    Understanding the Dynamics and Growth of Events and Orgs

    By [source]

    About this dataset

    This dataset encapsulates an innovative joint effort to construct a thriving and progressive Open Science community from the ground up. Covering both Chinese Open Science Network (COSN) events and various Open Science organizations across Asia, it includes information about topics, reads, countries of origin, representative codes, development status, and the platform utilized. By examining this data in aggregate, we can gain insight into success factors at a global level that will inform efforts to bring scientific resources together effectively. Explore these datasets to discover more about the potential of Open Science.


    How to use the dataset


    This dataset contains information about Chinese Open Science Network (COSN) events and various Open Science organizations across Asia. It includes event names, topics, countries, representative codes, development statuses, platforms, and the full names of the organizations. It will be helpful for anyone interested in learning more about Open Science in Asia.

    To get started with this dataset, first explore its columns: events, reads, topic, country, rep_code, developed, platform, and fullname. This gives an overview of what is included.

    Once you have a better understanding of the contents, you can begin exploring the data for each type of record. For example, if you want to find out more about COSN events in a specific country, look at the events columns; similarly, if you are looking to discover new Open Science organizations in your region, look at the reads column, which indicates how many reads an organization receives on its website/platform.

    Once familiar with the different files, you can use R or Python scripts as needed. This lets you go beyond simple querying and produce visualizations or correlation analyses that would otherwise be tedious. Other techniques such as feature engineering can also be applied, depending on what you intend to do with the data. With these tools you can draw meaningful insights from your own exploration of the dataset.
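
    As a minimal illustration (assuming the two CSV files listed in the Columns section below are in the working directory), a short pandas sketch might look like this:

    import pandas as pd

    events = pd.read_csv("cosn_events.csv")
    orgs = pd.read_csv("os_organizations.csv")

    # Most-read COSN events.
    print(events.sort_values("Reads", ascending=False).head(10))

    # Number of Open Science organizations listed per country.
    print(orgs["country"].value_counts())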

    Research Ideas

    • Using the data about COSN events and Open Science organizations across Asia, organizations can track the popularity and success of their events and programs by the number of reads, and analyze the topics discussed at each event.
    • The rep_code, platform, and developed columns can be used to segment users based on the products they have bought or used. This would help manufacturers understand the preferences of buyers involved in Open Science community activities and target them with the right campaign message.
    • The full name column can be useful for governmental or non-governmental regulatory bodies to take corrective action against organizations that operate without proper authorisation.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: cosn_events.csv

    | Column name | Description                                 |
    |:------------|:--------------------------------------------|
    | Events      | Name of the event. (String)                 |
    | Date        | Date on which the event took place. (Date)  |
    | Reads       | Number of reads for the event. (Integer)    |
    | Topic       | Topics discussed at the event. (String)     |

    File: os_organizations.csv

    | Column name | Description                                                       |
    |:------------|:------------------------------------------------------------------|
    | country     | Country of operation for the Open Science Organization. (String) ...

  17. NASICON-type solid electrolyte materials named entity recognition dataset

    • scidb.cn
    Updated Apr 27, 2023
    Cite
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi (2023). NASICON-type solid electrolyte materials named entity recognition dataset [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00001
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi
    Description

    1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. Firstly, we utilize a traceable automatic acquisition scheme for literature to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

    2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.

    2.1 Data collection and preprocessing. Firstly, 55 materials science articles related to the NASICON system are collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored in portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of each article, encompassing title, authors, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
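
    The PDF-to-text and cleanup steps described above can be sketched in Python as follows; this is a minimal illustration using pdfminer.six, with a hypothetical input file and illustrative cleanup rules rather than the authors' exact ones.

    import re
    from pdfminer.high_level import extract_text  # pdfminer.six

    # Hypothetical input file: one collected NASICON-related article.
    raw = extract_text("nasicon_article.pdf")

    # Illustrative cleanup: re-join words hyphenated across line breaks and
    # collapse the whitespace introduced by the column layout.
    text = re.sub(r"-\n", "", raw)
    text = re.sub(r"\s+", " ", text)

    # Store the whole article as a unified TXT document.
    with open("nasicon_article.txt", "w", encoding="utf-8") as fh:
        fh.write(text)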

  18. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  19. Data from: Dataset for the Identification of an Energy Harvester

    • data.mendeley.com
    • observatorio-investigacion.unavarra.es
    Updated Oct 7, 2024
    + more versions
    Cite
    Aitor Plaza (2024). Dataset for the Identification of an Energy Harvester [Dataset]. http://doi.org/10.17632/y4cmrsdv9f.3
    Explore at:
    Dataset updated
    Oct 7, 2024
    Authors
    Aitor Plaza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the measurements of the procedure to fully characterise a novel vibration energy harvester (presented in https://doi.org/10.1016/j.apenergy.2023.120715) operating in the ultra-low-frequency range. The details of the experimental setup and the full characterization procedure can be found in: https://doi.org/10.3390/s24123813

    The experimental measurements include not only the input (acceleration)–output (energy) response but also the (internal) dynamic behaviour of the system, making use of a synchronised image processing and signal acquisition system.

    The results of the video processing (angular positions of the masses) and the calculation of their two time derivatives by means of Tikhonov regularization have also been included, along with the calculation of the power generated in the windings.

    In "Videos" folder are the recorded videos at 240fps. In "Measurements" folder data form each experiment can be found in different files. First column is time, second and third columns are accelerometer measurements, and the rest of the columns are the different voltages measured at each coil. In "Post-processed data" are the estimated angular positions (2,3,4 columns), velocities (4,5,6 columns) and accelerations (7,8,9 columns) obtained form the post-processed videos. The last column is the estimated power harvested by the device. The name of the file name is in line with the experiment.

  20. PARQUET - Basic climatological data - monthly - daily - hourly - 6 minutes...

    • gimi9.com
    Cite
    PARQUET - Basic climatological data - monthly - daily - hourly - 6 minutes (parquet format) [Dataset]. https://gimi9.com/dataset/eu_66159f1bf0686eb4806508e1
    Explore at:
    Description

    Format .parquet

    This dataset gathers data in .parquet format. Instead of having one .csv.gz file per department per period, all departments are grouped into a single file per period. When possible (depending on the size), several periods are grouped in the same file.

    Data origin

    The data come from:
    - Basic climatological data - monthly
    - Basic climatological data - daily
    - Basic climatological data - hourly
    - Basic climatological data - 6 minutes

    Data preparation

    The files ending in .prepared have undergone slight preparation steps: deleting spaces in the column names and flexible typing. The data are typed as follows:
    - date (YYYYMM, YYYYMMDD, YYYYMMDDHH, YYYYMMDDHHMN): integer
    - NUM_POSTE: string
    - USUAL_NAME: string
    - LAT: float
    - LON: float
    - ALTI: integer
    - columns beginning with Q ("quality") or NB ("number"): integer

    Update

    The data are updated at least once a week (depending on my availability) for the period "latest-2023-2024". If you have specific needs, feel free to get in touch.

    Re-use: Meteo Squad

    These files are used in the Meteo Squad web application: https://www.meteosquad.com

    Contact

    If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com
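
    A minimal pandas sketch of applying these typing rules is given below; the parquet file name is illustrative, the column names follow the list above, and reading parquet requires pyarrow or fastparquet to be installed.

    import pandas as pd

    # Illustrative file name; use any of the published .parquet files.
    df = pd.read_parquet("latest-2023-2024.parquet")

    # Station metadata as strings/floats, per the typing rules above.
    df["NUM_POSTE"] = df["NUM_POSTE"].astype("string")
    df["USUAL_NAME"] = df["USUAL_NAME"].astype("string")
    df["LAT"] = df["LAT"].astype("float64")
    df["LON"] = df["LON"].astype("float64")
    df["ALTI"] = df["ALTI"].astype("Int64")

    # Quality (Q...) and count (NB...) columns as nullable integers.
    for col in df.columns:
        if col.startswith(("Q", "NB")):
            df[col] = df[col].astype("Int64")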
