100+ datasets found
  1. Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jenna Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology. Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

    1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with the following information:
    - 1st column: instance id (numeric): unique id assigned to a name instance
    - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
    - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
    - 4th column: author name (string): name string formatted as surname, comma, and forename(s)
    - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
    - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
    - 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
    - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

    2. 'Records' files contain lists of papers. Each paper is associated with the following information:
    - 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
    - 2nd column: year (numeric): year of publication. Some papers may have wrong publication years due to incorrect indexing or delayed updates in the original data.
    - 3rd column: venue (string): name of the journal or conference in which the paper is published. Venue names can be in full string or in a shortened format according to the formats in the original data.
    - 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline. Author names are formatted as surname, comma, and forename(s).
    - 5th column: title words (string; separated by space): words in the title of the paper. Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

    3. 'Clusters' files contain lists of clusters. Each cluster is associated with the following information:
    - 1st column: cluster id (numeric): unique id of a cluster
    - 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column)

    Signatures and Clusters files consist of two subsets (train and test files) of the original labeled data, which were randomly split 50%-50% by the authors of this study.

    Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

    [AMiner.zip]
    Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
    Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

    [KISTI.zip]
    Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
    Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
    Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

    [GESIS.zip]
    Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
    Note that this study reuses the 'Evaluation Set' among the original GESIS data, to which titles were added by the study below.
    Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

    [UM-IRIS.zip]
    This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
    Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
    For details on the labeling method and limitations, see the paper below.
    Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
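    A minimal sketch of loading one of the 'Signatures' files with pandas, using the column layout described above. This assumes the .txt files are tab-delimited with no header row; the file name and delimiter should be adjusted to the actual data.

    ```python
    import pandas as pd

    # Column layout described above for the 'Signatures' files.
    # Assumption: tab-delimited text with no header row.
    signature_columns = [
        "instance_id", "paper_id", "byline_position", "author_name",
        "ethnic_name_group", "affiliation", "block", "author_id",
    ]

    signatures = pd.read_csv(
        "signatures_train.txt",
        sep="\t",
        header=None,
        names=signature_columns,
        dtype=str,
    )

    # Group name instances by block (surname + first forename initial),
    # the unit within which disambiguation is typically performed.
    print(signatures.groupby("block")["instance_id"].count()
          .sort_values(ascending=False).head())
    ```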

  2. Master Street Name Table

    • catalog.data.gov
    • data.nola.gov
    • +3more
    Updated Feb 9, 2024
    Cite
    data.nola.gov (2024). Master Street Name Table [Dataset]. https://catalog.data.gov/dataset/master-street-name-table
    Explore at:
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    data.nola.gov
    Description

    This list is a work in progress and will be updated at least quarterly. This version updates column names and corrects the spellings of several streets in order to alleviate confusion and simplify street name research. It represents an inventory of official street name spellings in the City of New Orleans. Several sources contain various spellings and formats of street names; this list represents street name spellings and formats researched by the City of New Orleans GIS and the City Planning Commission. Note: this list may not represent what is currently displayed on street signs. The City of New Orleans official street list is derived from the New Orleans street centerline file, the 9-1-1 centerline file, and CPC plat maps. Fields include the full street name and the parsed elements, along with abbreviations using US Postal standards. We invite your input as we work toward one enterprise street name list. Status values: Current: currently a known, used street name in New Orleans. Other: currently a known, used street name on a planned but not developed street; may be a retired street name.

  3. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well the Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers.
    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories:
    - The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column to indicate whether the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in.
    - One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset.
    - The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the datasets' Dataverse JSON exports.
    The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
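    A minimal sketch of building the two-column "hostname"/"apikey" CSV that the download script expects, as described above. The hostnames and tokens below are placeholders, and the output file name is an assumption.

    ```python
    import csv

    # Placeholder installation URLs and API tokens; replace with real values.
    installations = [
        ("https://dataverse.example.edu", "xxxx-xxxx-xxxx"),
        ("https://data.example.org", "yyyy-yyyy-yyyy"),
    ]

    # Write the CSV with the two column names the script looks for.
    with open("installations_with_api_tokens.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["hostname", "apikey"])
        writer.writerows(installations)
    ```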

  4. United States Baby Names Count

    • kaggle.com
    Updated Dec 4, 2023
    Cite
    The Devastator (2023). United States Baby Names Count [Dataset]. https://www.kaggle.com/datasets/thedevastator/united-states-baby-names-count/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    United States Baby Names Count

    United States Baby Names Dataset

    By Amber Thomas [source]

    About this dataset

    The data is based on a complete sample of records on Social Security card applications as of March 2021 and is presented in three main files: baby-names-national.csv, baby-names-state.csv, and baby-names-territories.csv. These files contain detailed information about names given to babies at the national level (50 states and the District of Columbia), the state level (individual states), and the territory level (American Samoa, Guam, the Northern Mariana Islands, Puerto Rico, and the U.S. Virgin Islands), respectively.

    Each entry in the dataset includes several key attributes such as state_abb or territory_code representing the abbreviation or code indicating the specific state or territory where the baby was born. The sex attribute denotes the gender of each baby – either male or female – while year represents the specific birth year when each baby was born.

    Another important attribute is name, which indicates the given name selected for each newborn. The count attribute provides numerical data about how many babies received a particular name within a specific state/territory, gender, and year combination.

    It's also worth noting that all included names are at least two characters long, to ensure high data quality standards.

    How to use the dataset

    - Understanding the Columns

    The dataset consists of multiple columns with specific information about each baby name entry. Here are the key columns in this dataset:

    • state_abb: The abbreviation of the state or territory where the baby was born.
    • sex: The gender of the baby.
    • year: The year in which the baby was born.
    • name: The given name of the baby.
    • count: The number of babies with a specific name born in a certain state, gender, and year.

    - Exploring National Data

    To analyze national trends or overall popularity across all states and years: a) Focus on baby-names-national.csv. b) Use columns like name, sex, year, and count to study trends over time.
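    For instance, a minimal sketch of tracking one name over time with pandas, assuming the national file has the columns described above (name, sex, year, count); the example name "Emma" and file path are placeholders.

    ```python
    import pandas as pd

    # Load the national file (path is an assumption about where it lives).
    national = pd.read_csv("baby-names-national.csv")

    # Filter to one name/sex combination and show the most recent years.
    emma = national[(national["name"] == "Emma") & (national["sex"] == "F")]
    print(emma.sort_values("year")[["year", "count"]].tail(10))
    ```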

    - Analyzing State-Level Data

    To examine specific states' data: a) Utilize baby-names-state.csv file. b) Filter data by desired states using state_abb column values. c) Combine analysis with other relevant attributes like gender, year, etc., for detailed insights.

    - Understanding Territory Data

    For insights into United States territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, U.S. Virgin Islands): a) Access informative data from baby-names-territories.csv. b) Analyze based on similar principles as state-level data but considering unique territory factors.

    - Gender-Specific Analysis

    You can study names' popularity specifically among males or females by filtering the data using the sex column. This will allow you to explore gender-specific naming trends and preferences.

    - Identifying Regional Patterns

    To identify naming patterns in specific regions: a) Analyze state-level or territory-level data. b) Look for variations in name popularity across different states or territories.

    - Analyzing Name Popularity over Time

    Track the popularity of specific names over time using the name, year, and count columns. This can help uncover trends, fluctuations, and changes in names' usage and popularity.

    - Comparing Names and Variations

    Use this

    Research Ideas

    • Tracking Popularity Trends: This dataset can be used to analyze the popularity of baby names over time. By examining the count of babies with a specific name born in different years, trends and shifts in naming preferences can be identified.
    • Gender Analysis: The dataset includes information on the gender of each baby. It can be used to study gender patterns and differences in naming choices. For example, it would be possible to compare the frequency and popularity of certain names among males and females.
    • Regional Variations: With state abbreviations provided, it is possible to explore regional variations in baby naming trends within the United States. Researchers could examine how certain names are more popular or unique to specific states or territories, highlighting cultural or geographical factors that influence naming choices

    Acknowledgements

    If you use this dataset in your research, please credit the original a...

  5. Column heading and attribute field name correlation and description for the...

    • datasets.ai
    • data.usgs.gov
    • +2more
    Updated Aug 8, 2024
    + more versions
    Cite
    Department of the Interior (2024). Column heading and attribute field name correlation and description for the Titanium_vanadium_deposits.csv, and Titanium_vanadium_deposits.shp files. [Dataset]. https://datasets.ai/datasets/column-heading-and-attribute-field-name-correlation-and-description-for-the-titanium-vanad
    Explore at:
    Available download formats
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Department of the Interior
    Description

    This Titanium_vanadium_column_headings.csv file correlates the column headings in the Titanium_vanadium_deposits.csv file with the attribute field names in the Titanium_vanadium_deposits.shp file and provides a brief description of each column heading and attribute field name. Also included with this data release are the following files:
    - Titanium_vanadium_deposits.csv, which lists the deposits and associated information such as the host intrusion, location, grade, and tonnage data, along with other miscellaneous descriptive data about the deposits;
    - Titanium_vanadium_deposits.shp, which duplicates the information in the Titanium_vanadium_deposits.csv file in a spatial format for use in a GIS;
    - Titanium_vanadium_deposits_concentrate_grade.csv, which lists the concentrate grade data for the deposits, when available; and
    - Titanium_vanadium_deposits_references.csv, which lists the abbreviated and full references that are cited in the Titanium_vanadium_deposits.csv, Titanium_vanadium_deposits.shp, and Titanium_vanadium_deposits_concentrate_grade.csv files.

  6. doc-formats-tsv-3

    • huggingface.co
    Updated Nov 23, 2023
    Cite
    Datasets examples (2023). doc-formats-tsv-3 [Dataset]. https://huggingface.co/datasets/datasets-examples/doc-formats-tsv-3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 23, 2023
    Dataset authored and provided by
    Datasets examples
    Description

    [doc] formats - tsv - 3

    This dataset contains one tsv file at the root:

    data.tsv

    dog     woof
    cat     meow
    pokemon pika
    human   hello

    The config name, the file's exact location, and the column names are defined in the YAML config. Because we provide the names option but not the header option, the first row in the file is treated as a row of values, not a row of column names. The delimiter is set to "\t" (tabulation) due to the file's extension. The reference for the options is the documentation of… See the full description on the dataset page: https://huggingface.co/datasets/datasets-examples/doc-formats-tsv-3.
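    A minimal sketch of how such a header-less TSV is typically read with pandas. The actual column names come from the dataset's YAML config; the names used here ("animal", "sound") are placeholders.

    ```python
    import pandas as pd

    df = pd.read_csv(
        "data.tsv",
        sep="\t",          # tab delimiter, matching the .tsv extension
        header=None,       # no header row: the first row is data
        names=["animal", "sound"],  # placeholder column names
    )
    print(df)
    ```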

  7. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".
    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
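    A minimal sketch of querying the resulting SQLite database from Python, using the columns documented above. The table name "ngrams" is a placeholder; the actual table names depend on how create-database.sql names the compacted and intersected datasets.

    ```python
    import sqlite3

    conn = sqlite3.connect("isograms.db")
    # Placeholder table name; adjust to the actual table in isograms.db.
    rows = conn.execute(
        """
        SELECT word, length, count_per_million
        FROM ngrams
        WHERE is_palindrome = 1 AND length >= 5
        ORDER BY count_per_million DESC
        LIMIT 10
        """
    ).fetchall()
    for word, length, cpm in rows:
        print(f"{word:15s} len={length:2d} per-million={cpm}")
    conn.close()
    ```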

  8. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the data sets published here contain actual data, they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 refers to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    UniProt ID | string | protein identification
    label | string | protein label (type of node)
    properties | string | a dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    Relationship ID | string | relationship identification
    Source ID | string | identification of the source protein in the relationship
    Target ID | string | identification of the target protein in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.
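    A minimal sketch of loading one graph (the dataset_30 pair) into networkx using the node and edge columns described above. The file paths follow the naming scheme from the description and may need adjusting.

    ```python
    import pandas as pd
    import networkx as nx

    nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
    edges = pd.read_csv("dataset_30_edges_interactions.csv")

    # Build the graph from the edge list, keeping edge attributes.
    g = nx.from_pandas_edgelist(
        edges,
        source="Source ID",
        target="Target ID",
        edge_attr=["label", "properties"],
    )
    # Attach node attributes keyed by the UniProt ID column.
    for _, row in nodes.iterrows():
        g.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])

    print(g.number_of_nodes(), g.number_of_edges())  # expected: 30, 47
    ```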

    Metadata

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_30* | 30 | 47 | Y
    dataset_60* | 60 | 181 | Y
    dataset_120* | 120 | 689 | Y
    dataset_240* | 240 | 2819 | Y
    dataset_300* | 300 | 4658 | Y
    dataset_600* | 600 | 18004 | Y
    dataset_1200* | 1200 | 71785 | Y
    dataset_2400* | 2400 | 288600 | Y
    dataset_3000* | 3000 | 449727 | Y
    dataset_6000* | 6000 | 1799413 | Y
    dataset_12000* | 12000 | 7199863 | Y
    dataset_24000* | 24000 | 28792361 | Y
    dataset_30000* | 30000 | 44991744 | Y

    This repository includes two (2) additional tiny graph datasets for experimenting before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | node identification
    label | string | node label (type of node)
    properties | string | a dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | relationship identification
    source | string | identification of the source node in the relationship
    target | string | identification of the target node in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy* | 3 | 6 | N
    dataset_dummy2* | 3 | 6 | N

  9. Trade Name

    • catalog.data.gov
    • opendata.dc.gov
    • +3more
    Updated Jun 11, 2025
    + more versions
    Cite
    Department of Licensing and Consumer Protection (2025). Trade Name [Dataset]. https://catalog.data.gov/dataset/trade-name
    Explore at:
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Department of Licensing and Consumer Protection
    Description

    If a business or unregistered entity (sole proprietor, general partnership, etc.) wishes to do business under a name that is different from its registered name or true legal name, it may register a trade name. A trade name or "Doing Business As" name is optional and is not required in order to conduct business in DC. However, if a sole proprietor, general partnership or registered entity is using a trade name, it must be registered and on record with the Corporations Division. The dataset contains the following columns: trade name, effective date, trade name status, file number, trade name expiration date, and initial file number. More information can be found at https://dlcp.dc.gov/node/1619191

  10. Data set collection for flow delegation - Vdataset - LDM

    • service.tib.eu
    Updated Aug 4, 2023
    + more versions
    Cite
    (2023). Data set collection for flow delegation - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1218
    Explore at:
    Dataset updated
    Aug 4, 2023
    Description

    Abstract: This data set collection consists of 17 data sets used for the analytical / simulative evaluation of the flow delegation concept presented in "Flow Delegation: Flow Table Capacity Bottleneck Mitigation for Software-defined Networks". Example code for processing the data sets can be found at https://github.com/kit-tm/fdeval.

    Technical remarks: The data set collection is a zip file that contains 17 sqlite database files that can be inspected with any sqlite-capable database reader (such as https://sqlitebrowser.org/). The folder names in the unzipped file indicate the names of the data sets (from d20 to d5050). Each database consists of a single table called "statistics" that gives access to the scenario parameters and evaluation results. Each row in the table represents a single execution of the evaluation environment (i.e., one experiment). The columns starting with scenario are the parameters used for scenario / experiment generation. All other columns except for id and resultid (those two columns are not essential to the data set and can be ignored) refer to statistics gathered for one experiment. Columns starting with json contain a serialized json object and need to be de-serialized, e.g., by something like arr = json.loads(string) if Python is used, where string is the content from the column and arr is an array of floating point numbers. These columns contain time series data, i.e., the statistics were gathered for multiple time slots. Example code for processing the data sets can be found at https://github.com/kit-tm/fdeval (plotters folder). The GitHub page also contains additional details about the data sets in this collection.
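    A minimal sketch of reading one experiment row from the "statistics" table and de-serializing one of the json-prefixed columns, as described above. The database path and the column name "json_table_utilization" are placeholders; use any column whose name starts with "json" in the actual database.

    ```python
    import json
    import sqlite3

    # Placeholder path to one of the 17 database files (e.g. from folder d20).
    conn = sqlite3.connect("d20.sqlite")
    conn.row_factory = sqlite3.Row

    row = conn.execute("SELECT * FROM statistics LIMIT 1").fetchone()

    # Scenario / experiment generation parameters.
    scenario_params = {k: row[k] for k in row.keys() if k.startswith("scenario")}
    print(scenario_params)

    # Time-series statistics, one value per time slot (placeholder column name).
    series = json.loads(row["json_table_utilization"])
    print(len(series), series[:5])
    conn.close()
    ```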

  11. Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 21, 2020
    Cite
    Tóth, Zoltán Gábor (2020). Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based Invocation Metrics (Training Dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4281475
    Explore at:
    Dataset updated
    Nov 21, 2020
    Dataset provided by
    Tóth, Zoltán Gábor
    Hegedűs, Péter
    Ferenc, Rudolf
    Antal, Gábor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of multiple files which contain bug prediction training data.

    The entries in the dataset are JavaScript functions that are either buggy or non-buggy. Bug-related information was obtained from the project EsLint contained in BugsJS (https://github.com/BugsJS/eslint). The buggy instances were collected throughout the lifetime of the project; however, we added non-buggy entries from the latest version, which is tagged as a fix (entries which were previously included as buggy were not included as non-buggy later on).

    The dataset is based on hybrid call graphs which are constructed by https://github.com/sed-szeged/hcg-js-framework. The result of this tool is a call graph where the edges are associated with a confidence level which shows how likely the given edge is a valid call edge.

    We used different threshold values at or above which we considered the edges to be valid. The following threshold values were used: 0.00, 0.05, 0.20, and 0.30.

    The prefix in the dataset file names comes from the threshold used. The datasets include the coupling metrics NII (Number of Incoming Invocations) and NOI (Number of Outgoing Invocations), which were calculated by a static source code analyzer called SourceMeter. Hybrid counterparts of these metrics (HNII and HNOI) are based on the given threshold values.

    There are four variants for all of these datasets:

    Both static (NII, NOI) and hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics and information about the entries (file without any postfix). Columns contained only in this dataset are: ID, Name, Longname, Parent ID, Component ID, Path, Line, Column, EndLine, and EndColumn.

    Both static (NII, NOI) and hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics (file with '_h+s' postfix).

    Only static (NII, NOI) coupling metrics are included with additional static source code metrics (file with '_s' postfix).

    Only hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics (file with '_h' postfix).

    Static source code metrics which are contained in all dataset are the following:

    McCC - McCabe Cyclomatic Complexity

    NL - Nesting Level

    NLE - Nesting Level Else If

    CD - Comment Density

    CLOC - Comment Lines of Code

    DLOC - Documentation Lines of Code

    TCD - Total Comment Density (comment lines in an embedded function are also considered)

    TCLOC - Total Comment Lines of Code (comment lines in an embedded function are also considered)

    LLOC - Logical Lines of Code (Comment and empty lines not counted)

    LOC - Lines of Code (Comment and empty lines are counted)

    NOS - Number of Statements

    NUMPAR - Number of Parameters

    TLLOC - Logical Lines of Code (Lines in embedded functions are also counted)

    TLOC - Lines of Code (Lines in embedded functions are also counted)

    TNOS - Total Number of Statements (Statements in embedded functions are also counted)

  12. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of 1K+ Amazon products, as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage (as sketched below).

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
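    A minimal shuffling sketch with pandas, assuming the two columns described above (reviews, labels); the random seed is arbitrary.

    ```python
    import pandas as pd

    # The positive and negative rows are stored in two contiguous blocks,
    # so shuffle before any train/test split.
    rt = pd.read_csv("data_rt.csv")
    rt = rt.sample(frac=1, random_state=42).reset_index(drop=True)
    print(rt["labels"].value_counts())
    ```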

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw), and division (manually added - categorical label generated using the overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  13. Tennessee Eastman Process Simulation Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2020
    Cite
    Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
    Explore at:
    Available download formats: zip (1370814903 bytes)
    Dataset updated
    Feb 9, 2020
    Authors
    Sergei Averkiev
    Description

    Intro

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Content

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.
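    For working in Python rather than R, a minimal sketch using the third-party pyreadr package (not mentioned by the dataset authors, and an assumption on my part) to read one of the .RData files; the file name below is also an assumption about how the archive names its files.

    ```python
    import pyreadr

    # read_r returns a dict-like mapping of R object names to pandas DataFrames.
    result = pyreadr.read_r("TEP_FaultFree_Training.RData")  # placeholder file name
    df = result["fault_free_training"]  # variable name given in the description

    # 55 columns: faultNumber, simulationRun, sample, then 52 process variables.
    print(df.shape)
    print(df[["faultNumber", "simulationRun", "sample"]].head())
    ```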

    Acknowledgements

    This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

    User Agreement

    By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

    Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

    When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

    This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

  14. UWB Positioning and Tracking Data Set

    • zenodo.org
    zip
    Updated Aug 25, 2023
    Cite
    Klemen Bregar; Klemen Bregar (2023). UWB Positioning and Tracking Data Set [Dataset]. http://doi.org/10.5281/zenodo.8280736
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Klemen Bregar; Klemen Bregar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # UWB Positioning and Tracking Data Set

    UWB positioning data set contains measurements from four different indoor environments. The data set contains measurements that can be used for range-based positioning evaluation in different indoor environments.

    # Measurement system

    The measurements were made using 9 DW1000 UWB transceivers (DWM1000 modules) connected to networked RaspberryPi computers using the in-house radio board SNPN_UWB. 8 nodes were used as positioning anchor nodes with fixed locations in each indoor environment, and one node was used as a mobile positioning tag.

    Each UWB node is designed around a RaspberryPi computer and is wirelessly connected to the measurement controller (e.g. a laptop) using Wi-Fi and MQTT communication technologies.

    All tag positions were generated beforehand to resemble a human walking path as closely as possible. All walking path points are equally spaced to represent equidistant samples of a walking path in the time domain. The sampled walking path (measurement TAG positions) is included in a downloadable data set file under the downloads section.

    # Folder structure

    The folder structure is represented below this text. The folder contains four subfolders named by the indoor environments measured during the measurement campaign and a folder raw_data where raw measurement data is saved. Each environment folder has an anchors.csv file with anchor names and locations, a data.json file with measurements, a walking_path.csv file with tag positions, and a subfolder floorplan with floorplan.dxf (AutoCAD format), floorplan.png and floorplan_track.jpg.

    The subfolder raw_data contains raw data in subfolders named by the four indoor environments where the measurements were taken. Each location subfolder contains a subfolder data where the data from each tag position in walking_path.csv is collected in a separate folder. The data folder contains exactly as many folders as there are measurement points in walking_path.csv. Each measurement subfolder contains 48 .csv files named by the communication channel and anchor used for those measurements. For example: ch1_A1.csv contains all measurements at the selected tag location with anchor A1 on UWB channel ch1. The location folder also contains anchors.csv and walking_path.csv files which are identical to the files mentioned previously.

    The last folder in the data set is the technical_validation folder, where results of technical validation of the data set are collected. They are separated into 8 subfolders:

    - cir_min_max_mean
    - los_nlos
    - positioning_wls
    - range
    - range_error
    - range_error_A6
    - range_error_histograms
    - rss

    The organization of the data set is the following:

    data_set
    + location0
      - anchors.csv
      - data.json
      - walking_path.csv
      + floorplan
        - floorplan.dxf
        - floorplan.png
        - floorplan_track.jpg
      - walking_path.csv
    + location1
      - ...
    + location2
      - ...
    + location3
      - ...
    + raw_data
      + location0
        + data
          + 1.07_9.37_1.2
            - ch1_A1.csv
            - ch7_A8.csv
            - ...
          + 1.37_9.34_1.2
            - ...
          + ...
      + location1
        + ...
      + location2
        + ...
      + location3
        + ...
    + technical_validation
      + cir_min_max_mean
      + positioning_wls
      + range
      + range_error
      + range_error_histograms
      + rss
    - LICENSE
    - README

    # Data format

    Raw measurements are saved in .csv files. Each file starts with a header: the first line gives the version of the file and the second line gives the data column names. One column name is missing from that header row; the actual columns included in the .csv files are:

    TAG_ID
    ANCHOR_ID
    X_TAG
    Y_TAG
    Z_TAG
    X_ANCHOR
    Y_ANCHOR
    Z_ANCHOR
    NLOS
    RANGE
    FP_INDEX
    RSS
    RSS_FP
    FP_POINT1
    FP_POINT2
    FP_POINT3
    STDEV_NOISE
    CIR_POWER
    MAX_NOISE
    RXPACC
    CHANNEL_NUMBER
    FRAME_LENGTH
    PREAMBLE_LENGTH
    BITRATE
    PRFR
    PREAMBLE_CODE
    CIR (starts with this column; all columns until the end of the line represent the channel impulse response)
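    A minimal sketch of reading one raw measurement file with pandas, skipping the version line and naming the fixed columns listed above; everything after PREAMBLE_CODE is treated as the variable-length channel impulse response (CIR). The file path and the assumption that CIR length is constant within a file are placeholders to adjust.

    ```python
    import pandas as pd

    fixed_columns = [
        "TAG_ID", "ANCHOR_ID", "X_TAG", "Y_TAG", "Z_TAG",
        "X_ANCHOR", "Y_ANCHOR", "Z_ANCHOR", "NLOS", "RANGE",
        "FP_INDEX", "RSS", "RSS_FP", "FP_POINT1", "FP_POINT2", "FP_POINT3",
        "STDEV_NOISE", "CIR_POWER", "MAX_NOISE", "RXPACC",
        "CHANNEL_NUMBER", "FRAME_LENGTH", "PREAMBLE_LENGTH",
        "BITRATE", "PRFR", "PREAMBLE_CODE",
    ]

    # Skip the two header lines (version + incomplete column names) and
    # read everything positionally.
    raw = pd.read_csv("raw_data/location0/data/1.07_9.37_1.2/ch1_A1.csv",
                      skiprows=2, header=None)

    fixed = raw.iloc[:, :len(fixed_columns)]
    fixed.columns = fixed_columns
    cir = raw.iloc[:, len(fixed_columns):]   # CIR samples, one column per tap

    print(fixed[["RANGE", "RSS", "NLOS"]].describe())
    print("CIR length:", cir.shape[1])
    ```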

    # Availability of CODE

    Code for data analysis and preprocessing of all data available in this data set is published on GitHub:

    https://github.com/KlemenBr/uwb_positioning.git

    The code is licensed under the Apache License 2.0.

    # Authors and License

    The author of the data set in this repository is Klemen Bregar, klemen.bregar@ijs.si.

    This work is licensed under a Creative Commons Attribution 4.0 International License.

    # Funding

    The research leading to the data collection has been partially funded by the European Horizon 2020 Programme project eWINE under grant agreement No. 688116, by the Slovenian Research Agency under grant numbers P2-0016 and J2-2507, and by a bilateral project with grant number BI-ME/21-22-007.

  15. Data for: Integrating open education practices with data analysis of open...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 27, 2024
    Cite
    Marja Bakermans (2024). Data for: Integrating open education practices with data analysis of open science in an undergraduate course [Dataset]. http://doi.org/10.5061/dryad.37pvmcvst
    Explore at:
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marja Bakermans
    Description

    The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...

    Article and dataset fairness: To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored '1' or '0' based on whether it met that criteria, with a total possible score of ten.

    Open grading policies: Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...

    # Data for: Integrating open education practices with data analysis of open science in an undergraduate course

    Author: Marja H Bakermans
    Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA
    ORCID: https://orcid.org/0000-0002-4879-7771
    Institutional IRB approval: IRB-24–0314

    Data and file overview

    The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available

    1. BestPracticesData.csv
      • Description: Data to assess the adherence of articles and datasets to open science best practices (a short loading sketch follows this column list).
      • Column headers and descriptions:
        • Article: articles used in the study, numbered randomly
        • F1: Findable, Data are assigned a unique and persistent doi
        • F2: Findable, Metadata includes an identifier of data
        • F3: Findable, Data are registered in a searchable database
        • A1: ...
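    The sketch below is one way to tally the fairness score described above, assuming BestPracticesData.csv has been exported from the OEPandOSdata workbook. It treats every column other than Article as a 0/1 indicator and sums them per article (maximum of ten); the file path is an assumption.

    ```python
    # Hedged sketch: per-article best-practices score as the row sum of the
    # binary indicator columns (F1, F2, F3, A1, ...). File name is assumed.
    import pandas as pd

    bp = pd.read_csv("BestPracticesData.csv")
    practice_cols = [c for c in bp.columns if c != "Article"]
    bp["total_score"] = bp[practice_cols].sum(axis=1)
    print(bp[["Article", "total_score"]].head())
    ```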
  16. Data from: The role of fish life histories in allometrically scaled food-web...

    • datadryad.org
    • zenodo.org
    zip
    Updated Feb 25, 2019
    Cite
    Stephanie Bland; Fernanda Valdovinos; Jeffrey Hutchings; Anna Kuparinen (2019). The role of fish life histories in allometrically scaled food-web dynamics [Dataset]. http://doi.org/10.5061/dryad.1hd6dg7
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 25, 2019
    Dataset provided by
    Dryad
    Authors
    Stephanie Bland; Fernanda Valdovinos; Jeffrey Hutchings; Anna Kuparinen
    Time period covered
    2019
    Description
    1. Body size determines key ecological and evolutionary processes of organisms. Therefore, organisms undergo extensive shifts in resources, competitors and predators as they grow in body size. While empirical and theoretical evidence shows that these size-dependent ontogenetic shifts vastly influence the structure and dynamics of populations, theory on how those ontogenetic shifts affect the structure and dynamics of ecological networks is still virtually absent.
    2. Here, we expand the Allometric Trophic Network (ATN) theory in the context of aquatic food webs to incorporate size-structure in the population dynamics of fish species. We do this by modifying a food web generating algorithm, the niche model, to produce food webs where different fish life-history stages are described as separate nodes which are connected through growth and reproduction. Then, we apply a bioenergetic model that uses the food webs and the body sizes generated by our niche model to evaluate the effect of incorp...
  17. QA4MRE (Reading Comprehension Q&A)

    • opendatabay.com
    Updated Jun 23, 2025
    Cite
    Datasimple (2025). QA4MRE (Reading Comprehension Q&A) [Dataset]. https://www.opendatabay.com/data/ai-ml/e20ba707-f7d5-4e77-b2da-e90a67e77b9d
    Explore at:
    Available download formats
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Healthcare Providers & Services Utilization
    Description

    The QA4MRE dataset offers a magnificent collection of passages with connected questions and answers, providing researchers with a well-defined set of data to work from. With its wide range, it has been the go-to source for many research projects, such as the CLEF 2011, 2012 and 2013 Shared Tasks, where training datasets are available for the main track as well as documents ready to be used in two pilot studies related to Alzheimer's disease and entrance exams. This expansive dataset can allow you to unleash your creativity in ways you never thought possible, uncovering new possibilities and exciting findings as it serves as an abundant source of information. No matter which field you come from or what kind of insights you're looking for, this powerhouse dataset will have something special waiting just around the corner.


    How to Use the QA4MRE Dataset for Your Research

    The QA4MRE (Question Answering for Machine Reading Evaluation) dataset is a great resource for researchers who want to use comprehensive datasets to explore creative approaches and solutions. This dataset provides several versions of training and development data in the form of passages with accompanying questions and answers. Additionally, gold standard documents are included that can be used in two different pilot studies related to Alzheimer's disease and entrance exams. The following is a guide on how to make the most of this valuable data set:

    Analyze Data Structures - Once you've downloaded all necessary materials, analyze the structure each file follows so that you can access its contents accordingly. Knowing what each column holds helps refine your searching process, as some files go beyond questions and answers to provide, for example, the topic name associated with each passage. The table below gives a basic overview of each column provided in both the train and dev variants of this dataset:

    Column Name    Description                               Datatype
    Topic name     Name of topic the passage represents      String
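    Before querying the data, a quick structure check along the lines described above might look like the sketch below; it assumes the train split has been exported to a CSV file, and the file name is hypothetical.

    ```python
    # Hedged sketch: inspect column names, datatypes, and a few rows of the
    # assumed CSV export of the QA4MRE train split.
    import pandas as pd

    train = pd.read_csv("qa4mre_train.csv")  # hypothetical file name
    print(train.dtypes)   # column names and datatypes
    print(train.head(3))  # a peek at the first passages/questions
    ```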

    Refine Data Searching Process - Lastly, if you plan to develop an automated system or algorithm to uncover precise content from the articles/passages, then refine the already established search process involving

    Research Ideas

    • Creating an automated question answering system that is capable of engaging in conversations with a user. This could be used as a teaching assistant to help students study for exams and other tests, or as a virtual assistant for customer service.
    • Developing a summarization tool dedicated specifically to the QA4MRE dataset, which can extract key information from each passage and output concise summaries with confidence scores indicating the likelihood of the summary being accurate compared to the original text.
    • Utilizing natural language processing techniques to analyze questions related to Alzheimer's disease and creating machine learning models that accurately predict patient responses when asked various sets of questions about their condition, thus aiding in diagnosing Alzheimer's disease early on in its development stages.

    License

    CC0

    Original Data Source: QA4MRE (Reading Comprehension Q&A)

  18. Testing and Demonstration Data for DataRig Software

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 27, 2023
    Cite
    Matthew S. Caudill (2023). Testing and Demonstration Data for DataRig Software [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7868944
    Explore at:
    Dataset updated
    Apr 27, 2023
    Dataset authored and provided by
    Matthew S. Caudill
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository holds the testing and demonstration data for DataRig, an open-source software program for downloading datasets from data repositories that expose RESTful APIs. The repository contains 5 sample datasets.

    annotations_001.txt

    This data set is a tab-separated text file containing 6 columns that start on line number 7. The column headers are:

    'Number' 'Start Time' 'End Time' 'Time From Start' 'Channel' 'Annotation'

    There are 13 rows of data under each of these column headers, representing the start and end times of annotated events from an EEG recording file in this repository called recording_001.edf. The events describe the behavior of a mouse in 5 sec increments, with each behavior being one of 'exploring', 'grooming' or 'rest'.

    recording_001.edf

    A European Data Format file consisting of 4 channels of EEG data lasting approximately 1 hour. The times in the annotations_001.txt file are referenced against this file.

    sample_arr.npy

    A numpy array of shape (4, 250) with values sequentially running from 0 to 1000.

    sample_excel.xls

    An excel file with a single column of 10 numbers from 0-9 sequentially.

    sample_text.txt

    A text file with 4 rows containing 250 values per row. The values in the file run from 0 to 1000 sequentially.
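    A minimal loading sketch for two of these sample files follows. It assumes the files sit in the current working directory; the annotations file is read as tab-separated text with its header on line 7 (so the first six lines are skipped), and the numpy file is loaded directly.

    ```python
    # Hedged sketch: load the annotations table and the sample numpy array
    # described above. Paths are assumptions; header quoting in the raw
    # annotations file may differ slightly from the names used here.
    import numpy as np
    import pandas as pd

    annotations = pd.read_csv("annotations_001.txt", sep="\t", skiprows=6)
    print(annotations[["Start Time", "End Time", "Annotation"]].head())

    arr = np.load("sample_arr.npy")
    print(arr.shape)  # expected (4, 250) per the description above
    ```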

  19. GitTables benchmark - column type detection

    • zenodo.org
    csv, zip
    Updated Feb 22, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Demiralp (2022). GitTables benchmark - column type detection [Dataset]. http://doi.org/10.5281/zenodo.5706316
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Feb 22, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Demiralp
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: the download page of the entire GitTables corpus is here: https://zenodo.org/record/4943312.

    This dataset represents a small subset of tables from GitTables curated for benchmarking column type detection methods. This benchmark evaluates systems that match table columns to semantic types from the DBpedia and Schema.org ontologies. It is featured in the SemTab 2021 challenge (CTA task).

    This dataset consists of the following files:

    • “tables.zip”: directory with a sample of 1101 tables from GitTables. Filenames correspond to table IDs; the first column (without a column name) corresponds to row indices, and column names are replaced with "col_0", "col_1", etc., which match the targets and labels (semantic types), as sketched below.
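    A reading sketch under stated assumptions: the tables inside tables.zip are taken to be CSV files (the record lists csv and zip formats), the unnamed first column is used as the row index, and the remaining col_0, col_1, ... columns are the ones matched to semantic types. The file name is made up.

    ```python
    # Hedged sketch: read one benchmark table with the unnamed first column as
    # the index, leaving only the col_0, col_1, ... target columns.
    import pandas as pd

    table = pd.read_csv("tables/12345.csv", index_col=0)  # hypothetical table ID
    print(table.columns.tolist())  # e.g. ['col_0', 'col_1', ...]
    ```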

    The labels (semantic types) from each ontology come from:

    For the entire GitTables corpus, please refer to this dataset. Visit https://gittables.github.io for more background and contact details.

  20. Species Abundance Data

    • figshare.com
    txt
    Updated Aug 10, 2022
    Cite
    Carlos Martorell; Alejandra Martínez Blancas (2022). Species Abundance Data [Dataset]. http://doi.org/10.6084/m9.figshare.20468535.v3
    Explore at:
    txtAvailable download formats
    Dataset updated
    Aug 10, 2022
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Carlos Martorell; Alejandra Martínez Blancas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Species abundance data used for the population models. Each csv file contains the data for one of our focal species and is labeled with the first three letters of the species' name and genus. The columns are as follows:

    • Column 1: name of the locality where the data were collected.
    • Column 2: locality treatment; Dentro means the locality was fenced to exclude livestock, and Fuera means the locality did not have a livestock-exclusion fence.
    • Column 3: quadrat from which the data come; Column 4: subquadrat.
    • Column 5: year in which the data were taken.
    • Columns X and Y: coordinates used to assess dispersal from and to neighboring subquadrats.
    • t1: species abundance at time t; t2: species abundance at time t+1.
    • Columns y1 to y16: a matrix indicating the year the data were collected; if, for instance, the data were collected in 2010, the y10 column contains ones while the rest of these columns contain zeros.
    • depth: soil depth of the subquadrat where the data were taken.
    • Interacting species' abundances: columns named with the first three letters of each species' name and genus. Because not all species in our study interact with the focal species, the remaining species were added together and allowed to interact with the focal species as a single "multispecies" for which interaction parameters were estimated; this column is labeled multi.
    • Some species' interactions with the focal species change with soil depth, so we added a matrix to identify which species' interactions do so. This matrix is also labeled with the first three letters of each species' name and genus and occupies the last columns of the datasets.
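    As a small illustration of this layout, the sketch below loads one focal-species file and checks the y1 to y16 year-indicator block; the file name is a made-up example of the three-letter naming scheme, and the column names simply follow the description above.

    ```python
    # Hedged sketch: load one species csv and confirm that exactly one of the
    # y1..y16 indicator columns is set per row. File name is an assumption.
    import pandas as pd

    df = pd.read_csv("TriMex.csv")  # hypothetical focal-species file
    year_cols = [f"y{i}" for i in range(1, 17)]
    assert (df[year_cols].sum(axis=1) == 1).all()  # one-hot year encoding
    print(df[["t1", "t2", "depth"]].describe())
    ```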

Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning

Description

This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.Four zipped files are uploaded.Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with information as follows. - 1st column: instance id (numeric): unique id assigned to a name instance - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper - 4th column: author name (string): name string formatted as surname, comma, and forename(s) - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data - 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial) - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data2. 'Records' files contain lists of papers. Each paper is associated with information as follows. -1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files -2nd column: year (numeric): year of publication * Some papers may have wrong publication years due to incorrect indexing or delayed updates in original data -3rd column: venue (string): name of journal or conference in which the paper is published * Venue names can be in full string or in a shortened format according to the formats in original data -4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline * Author names are formatted into surname, comma, and forename(s) -5th column: title words (string; separated by space): words in a title of the paper. * Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.3. 'Clusters' files contain lists of clusters. Each cluster is associated with information as follows. -1st column: cluster id (numeric): unique id of a cluster -2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column). Signatures and Clusters files consist of two subsets - train and test files - of original labeled data which are randomly split into 50%-50% by the authors of this study.Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below.If you use one of the uploaded data files, please cite them accordingly.[AMiner.zip]Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.[KISTI.zip]Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. 
Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001

Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

[GESIS.zip]
Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
Note that this study reuses the 'Evaluation Set' of the original GESIS data, to which titles were added by the study below.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

[UM-IRIS.zip]
This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
For details on the labeling method and limitations, see the paper below.
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
