The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently, datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method. Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking, the level of difficulty of a dataset depends on the algorithm. These levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets project is also supported by the Standard Reference Data Program.
These datasets contain reviews from the Goodreads book review website and a variety of attributes describing the items. Critically, these datasets capture multiple levels of user interaction, ranging from adding a book to a shelf to rating and reading it.
Metadata includes:
reviews
add-to-shelf, read, review actions
book attributes: title, ISBN
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Advance population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, both in absolute numbers and in percentage terms. The dataset can be used to understand the population change of Advance across the last two decades: for example, we can identify whether the population is declining or increasing, when the population peaked, and whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.
Key observations
In 2023, the population of Advance was 505, a 0.40% year-over-year increase from 2022. Previously, in 2022, the population of Advance was 503, a decline of 0.59% compared to a population of 506 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Advance decreased by 54. In this period, the peak population was 598, in the year 2009. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).
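As a quick illustration, the year-on-year changes above can be reproduced with pandas (a minimal sketch; the column names are hypothetical, not taken from the dataset):

import pandas as pd

df = pd.DataFrame({"year": [2021, 2022, 2023], "population": [506, 503, 505]})
df["change"] = df["population"].diff()                  # absolute year-on-year change
df["change_pct"] = df["population"].pct_change() * 100  # change in percentage terms
print(df)  # 2022: -3 (-0.59%); 2023: +2 (+0.40%), matching the key observations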
When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).
Data Coverage:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Advance Population by Year. You can refer to it here:
https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. This dataset addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social, and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1] The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities. Dataset description: The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine. The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated. The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when only the interfaces of apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments. Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and also represents an interesting dataset to screen for interface similarities. References: [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted) [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012. [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265. [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055. This work was supported by the German Federal Ministry of Education and Research as part of CompLS and de.NBI [031L0172, 031L0105]. C.E. is funded by Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter (Grant-ID: HIDSS-0002).
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
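For instance, in Python (a minimal sketch; the CSV file name here is illustrative, not an actual file name from the release):

import pandas as pd

daily = pd.read_csv("daily_fitbit.csv")  # hypothetical file name; use the actual CSV names
print(daily.head())
print(daily.columns.tolist())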
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit <path/to/fitbit.bson>
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema <path/to/sema.bson>
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys <path/to/surveys.bson>
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
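For example, an authenticated restore might look like the following (the username, password, and dump path are placeholders):
mongorestore --host localhost:27017 --username <user> --password <pass> --authenticationDatabase admin -d rais_anonymized -c fitbit <path/to/fitbit.bson>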
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{ _id: <ObjectId>, id (or user_id): <string>, type: <string>, data: <embedded object> }
Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the steps count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
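To illustrate, a minimal pymongo sketch (the user id value is hypothetical) that retrieves all steps documents for one user from the fitbit collection:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]
# "steps" is one of the type values described above; the id value is illustrative.
for doc in db["fitbit"].find({"type": "steps", "id": "some_user_id"}):
    print(doc["data"])  # type-dependent embedded object, e.g. a steps count per timestamp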
Surveys Encoding
BREQ2
Why do you engage in exercise?
engage[SQ001]: I exercise because other people say I should
engage[SQ002]: I feel guilty when I don’t exercise
engage[SQ003]: I value the benefits of exercise
engage[SQ004]: I exercise because it’s fun
engage[SQ005]: I don’t see why I should have to exercise
engage[SQ006]: I take part in exercise because my friends/family/partner say I should
engage[SQ007]: I feel ashamed when I miss an exercise session
engage[SQ008]: It’s important to me to exercise regularly
engage[SQ009]: I can’t see why I should bother exercising
engage[SQ010]: I enjoy my exercise sessions
engage[SQ011]: I exercise because others will not be pleased with me if I don’t
engage[SQ012]: I don’t see the point in exercising
engage[SQ013]: I feel like a failure when I haven’t exercised in a while
engage[SQ014]: I think it is important to make the effort to exercise regularly
engage[SQ015]: I find exercise a pleasurable activity
engage[SQ016]: I feel under pressure from my friends/family to exercise
engage[SQ017]: I get restless if I don’t exercise regularly
engage[SQ018]: I get pleasure and satisfaction from participating in exercise
engage[SQ019]: I think exercising is a waste of time
PANAS
Indicate the extent you have felt this way over the past week
P1[SQ001]: Interested
P1[SQ002]: Distressed
P1[SQ003]: Excited
P1[SQ004]: Upset
P1[SQ005]: Strong
P1[SQ006]: Guilty
P1[SQ007]: Scared
P1[SQ008]: Hostile
P1[SQ009]: Enthusiastic
P1[SQ010]: Proud
P1[SQ011]: Irritable
P1[SQ012]: Alert
P1[SQ013]: Ashamed
P1[SQ014]: Inspired
P1[SQ015]: Nervous
P1[SQ016]: Determined
P1[SQ017]: Attentive
P1[SQ018]: Jittery
P1[SQ019]: Active
P1[SQ020]: Afraid
Personality
How Accurately Can You Describe Yourself?
ipip[SQ001]: Am the life of the party.
ipip[SQ002]: Feel little concern for others.
ipip[SQ003]: Am always prepared.
ipip[SQ004]: Get stressed out easily.
ipip[SQ005]: Have a rich vocabulary.
ipip[SQ006]: Don't talk a lot.
ipip[SQ007]: Am interested in people.
ipip[SQ008]: Leave my belongings around.
ipip[SQ009]: Am relaxed most of the time.
ipip[SQ010]: Have difficulty understanding abstract ideas.
ipip[SQ011]: Feel comfortable around people.
ipip[SQ012]: Insult people.
ipip[SQ013]: Pay attention to details.
ipip[SQ014]: Worry about things.
ipip[SQ015]: Have a vivid imagination.
ipip[SQ016]: Keep in the background.
ipip[SQ017]: Sympathize with others' feelings.
ipip[SQ018]: Make a mess of things.
ipip[SQ019]: Seldom feel blue.
ipip[SQ020]: Am not interested in abstract ideas.
ipip[SQ021]: Start conversations.
ipip[SQ022]: Am not interested in other people's problems.
ipip[SQ023]: Get chores done right away.
ipip[SQ024]: Am easily disturbed.
ipip[SQ025]: Have excellent ideas.
ipip[SQ026]: Have little to say.
ipip[SQ027]: Have a soft heart.
ipip[SQ028]: Often forget to put things back in their proper place.
ipip[SQ029]: Get upset easily.
ipip[SQ030]: Do not have a good imagination.
ipip[SQ031]: Talk to a lot of different people at parties.
ipip[SQ032]: Am not really interested in others.
ipip[SQ033]: Like order.
ipip[SQ034]: Change my mood a lot.
ipip[SQ035]: Am quick to understand things.
ipip[SQ036]: Don't like to draw attention to myself.
ipip[SQ037]: Take time out for others.
ipip[SQ038]: Shirk my duties.
ipip[SQ039]: Have frequent mood swings.
ipip[SQ040]: Use difficult words.
ipip[SQ041]: Don't mind being the centre of attention.
ipip[SQ042]: Feel others' emotions.
ipip[SQ043]: Follow a schedule.
ipip[SQ044]: Get irritated easily.
ipip[SQ045]: Spend time reflecting on things.
ipip[SQ046]: Am quiet around strangers.
ipip[SQ047]: Make people feel at ease.
ipip[SQ048]: Am exacting in my work.
ipip[SQ049]: Often feel blue.
ipip[SQ050]: Am full of ideas.
STAI
Indicate how you feel right now
STAI[SQ001]: I feel calm
STAI[SQ002]: I feel secure
STAI[SQ003]: I am tense
STAI[SQ004]: I feel strained
STAI[SQ005]: I feel at ease
STAI[SQ006]: I feel upset
STAI[SQ007]: I am presently worrying over possible misfortunes
STAI[SQ008]: I feel satisfied
STAI[SQ009]: I feel frightened
STAI[SQ010]: I feel comfortable
STAI[SQ011]: I feel self-confident
STAI[SQ012]: I feel nervous
STAI[SQ013]: I am jittery
STAI[SQ014]: I feel indecisive
STAI[SQ015]: I am relaxed
STAI[SQ016]: I feel content
STAI[SQ017]: I am worried
STAI[SQ018]: I feel confused
STAI[SQ019]: I feel steady
STAI[SQ020]: I feel pleasant
TTM
Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?
processes[SQ002]: I read articles to learn more about physical activity
Companies and individuals are storing increasingly more data digitally; however, much of it goes unused because it is unclassified. How many times have you opened your downloads folder and found a file you downloaded a year ago with no idea what its contents are? You could read through those files individually, but imagine doing that for thousands of files. All that raw data in storage facilities creates data lakes. As the amount of data grows and its complexity rises, data lakes become data swamps, and potentially valuable and interesting datasets will likely remain unused. Our tool addresses the need to classify these large pools of data in a visually effective and succinct manner by identifying keywords in datasets and classifying datasets into a consistent taxonomy.
The files listed within kaggleDatasetSummaryTopicsClassification.csv have been processed with our tool to generate the keywords and taxonomic classification shown. The summaries are not generated by our system; instead, they were retrieved from user input when the files were uploaded to Kaggle. We planned to utilize these summaries to create an NLG model to generate summaries from any input file. Unfortunately, we were not able to collect enough data to build a good model. Hopefully the data within this set will help future users achieve that goal.
Developed with the Senior Design Center at NC State in collaboration with SAS. Senior Design Team: Tanya Chu, Katherine Marsh, Nikhil Milind, Anna Owens. SAS Representatives: Nancy Rausch, Marty Warner, Brant Kay, Tyler Wendell, JP Trawinski.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Countries of the World’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fernandol/countries-of-the-world on 12 November 2021.
--- Dataset description provided by original source is as follows ---
World fact sheet, fun to link with other datasets.
Information on population, region, area size, infant mortality and more.
Source: All these data sets are made up of data from the US government. Generally they are free to use if you use the data within the US. If you are outside of the US, you may need to contact the US government to ask for permission.
Data from the World Factbook is public domain. The website says "The World Factbook is in the public domain and may be used freely by anyone at anytime without seeking permission."
https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html
When making visualisations related to countries, sometimes it is interesting to group them by attributes such as region, or weigh their importance by population, GDP or other variables.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Inferring gene regulatory relationships from observational data is challenging. Manipulation and intervention are often required to unravel causal relationships unambiguously. However, gene copy number changes, as they frequently occur in cancer cells, might be considered natural manipulation experiments on gene expression. An increasing number of data sets on matched array comparative genomic hybridisation and transcriptomics experiments from a variety of cancer pathologies are becoming publicly available. Here we explore the potential of a meta-analysis of thirty such data sets. The aim of our analysis was to assess the potential of in silico inference of trans-acting gene regulatory relationships from this type of data. We found sufficient correlation signal in the data to infer gene regulatory relationships, with interesting similarities between data sets. A number of genes had highly correlated copy number and expression changes in many of the data sets, and we present predicted potential trans-acted regulatory relationships for each of these genes. The study also investigates to what extent heterogeneity between cell types and between pathologies determines the number of statistically significant predictions available from a meta-analysis of experiments.
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.
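The core idea of using a statistical dependency test as an interestingness measure for bins can be sketched as follows (a simplified illustration with made-up counts, not the authors' implementation):

import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: record counts per bin of attribute A (rows) vs. attribute B (columns).
counts = np.array([[30, 5, 2], [4, 40, 6], [3, 7, 25]])
chi2, p_value, dof, _ = chi2_contingency(counts)
# A small p-value signals a dependency between the binned attributes, which a
# system like the one described could rank as an interesting bin relation.
print(f"chi2={chi2:.1f}, p={p_value:.2e}, dof={dof}")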
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of accompanying categorical, numerical, or textual metadata given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or, loosely speaking, correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.

The dataset contains 265 child links to BSRN datasets. Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Gert König-Langlo (mailto:Gert.Koenig-Langlo@awi.de) for an account to download these datasets.
Please refer to Yelp for the original JSON file and other datasets. This dataset was created in June 2020 by Yelp. The usage of this dataset should be for academic purposes.
I read the JSON file in Python and converted it to three CSV files, as sketched below.
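A minimal sketch of such a conversion with pandas (the file name follows Yelp's usual naming, but treat it as illustrative):

import pandas as pd

# Yelp distributes its data as JSON Lines: one JSON object per line.
business = pd.read_json("yelp_academic_dataset_business.json", lines=True)
business.to_csv("business.csv", index=False)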
Please read Dataset_User_Agreement.pdf before you proceed with all data files.
It would be interesting to see how virtual services were offered by restaurants during COVID in 2020 and how restaurant businesses strove to communicate and connect with customers on Yelp. There is no numeric data to play with; however, it's still valuable to do some visualizations.
https://creativecommons.org/publicdomain/zero/1.0/
Twitter is an online social media platform where people share their thoughts as tweets. It has been observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help it by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?
Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.
Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)
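A minimal loading sketch (the column names follow the competition description above; treat the details as illustrative):

import pandas as pd

train = pd.read_csv("train.csv")
# Strip stray leading/trailing quotes from the text fields, as advised above.
for col in ("text", "selected_text"):
    train[col] = train[col].astype(str).str.strip('"')
print(train.head())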
The dataset is downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘US Figure Skating Data - 2020 Regionals Int Ladies’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/katiewerner/us-figure-skating-data-2020-regionals-int-ladies on 28 January 2022.
--- Dataset description provided by original source is as follows ---
US Figure Skating hosts several competitions at the regional level; top skaters from these competitions progress to the next echelon of competition. Depending on the region, there may be qualifying events before the final round to determine the top skaters. In a qualifying round, skaters perform their free skate program for a panel of judges; top qualifiers move on to the final round. The final round consists of each skater performing a short program and their free skate program; the official panels are not necessarily the same between the short program and free skate program events (likewise for qualifying rounds).
Skaters receive scores based on: technical elements, program components, and any deductions incurred. Technical elements performed are assessed by the technical officials on the panel of officials, and the element is assigned a base value. This base value has several considerations beyond just the element performed - for example, if a jump is under-rotated this affects the base value, and if the skater performs an element that receives a bonus, the bonus is added to the base value. The judges on the panel of officials then assign each technical element a grade of execution (GOE) mark, an integer in the range from -5 to +5. The base value is then scaled by the average GOE, dropping the highest and lowest GOEs.
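As a rough illustration of the arithmetic described above (a sketch that assumes each GOE unit is worth 10% of the base value; this may not match the official scoring rules exactly):

def element_score(base_value, goes, goe_step=0.10):
    # Drop the highest and lowest GOEs, average the rest, and scale the base value.
    trimmed = sorted(goes)[1:-1]
    avg_goe = sum(trimmed) / len(trimmed)
    return base_value + base_value * goe_step * avg_goe

# Example: a 3.30-point element with judges' GOEs of [2, 1, 1, 0, -1, 1, 2] scores 3.63.
print(round(element_score(3.30, [2, 1, 1, 0, -1, 1, 2]), 2))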
Judges on the official panel are also responsible for assigning program component scores, which consider interpretation of the music, program choreography/composition, performance, transitions between and going into the technical elements, and overall skating skills. At some skating levels, only a subset of the components is scored. Component marks are adjusted by removing the high and low scores, and are also adjusted by a set of scalar multipliers, as provided by US Figure Skating.
Deductions may be incurred for time violations, falls, or costume/prop issues. Fall deductions may be more severe depending on the level.
This dataset packaging includes 5 CSV files that contain the scraped data from the 2020 Regional Intermediate Ladies events:
- Event Details (event index, round, program type)
- Official Details (judge and technical official names, cities)
- Skater Details (skater placement, skate order, club, overall scores and deductions)
- Technical Score Details (base values, GOEs)
- Component Score Details (component scores)
The event index and skater placement columns are keys that can be used to join these datasets together, as appropriate.
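For example, a join sketch in pandas (the file and column names are hypothetical; use the actual headers from the CSVs):

import pandas as pd

events = pd.read_csv("event_details.csv")
skaters = pd.read_csv("skater_details.csv")
technical = pd.read_csv("technical_score_details.csv")
merged = (skaters.merge(events, on="event_index")
                 .merge(technical, on=["event_index", "skater_placement"]))
print(merged.head())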
Many thanks to my husband, for encouraging me to tackle this project during my personal sabbatical. Special shoutout to James Robertson and Joshua Merry for giving this dataset publication a quick once over.
As ever, I also owe thanks to stackoverflow for inspiring code solutions when all hope is seemingly lost.
Photo credit: Greg Pembroke
I created this dataset simply as a fun data project, using data I am intimately familiar with and fascinated by the creation of -- indeed, I helped create this data back in October 2019 by serving as a judge on some of these panels! I intend to publish another notebook containing EDA and some models in the near future - stay tuned!
The scripts and notebooks provided are strictly intended to serve as an example to demonstrate my approach to working with data. It is ABSOLUTELY NOT intended to be used to draw any definitive conclusions about: the sport of figure skating, the judges, the technical officials, the skaters, the coaches, the skating clubs, US Figure Skating, any region or section of US Figure Skating, etc.
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This EEG dataset contains resting-state EEG extracted from the experimental paradigm used in the Stimulus-Selective Response Modulation (SRM) project at the Dept. of Psychology, University of Oslo, Norway.
The data is recorded with a BioSemi ActiveTwo system, using 64 electrodes following the positional scheme of the extended 10-20 system (10-10). Each datafile comprises four minutes of uninterrupted EEG acquired while the subjects were resting with their eyes closed. The dataset includes EEG from 111 healthy control subjects (the "t1" session), of which a number underwent an additional EEG recording at a later date (the "t2" session). Thus, some subjects have one associated EEG file, whereas others have two.
The dataset is provided "as is". Hereunder, the authors take no responsibility with regard to data quality. The user is solely responsible for ascertaining that the data used for publications or in other contexts fulfil the required quality criteria.
The raw EEG data signals are re-referenced to the average reference. Other than that, no operations have been performed on the data. The files contain no events; the whole continuous segment is resting-state data. The data signals are unfiltered (recorded in Europe, so the line noise frequency is 50 Hz). The time points for the subject's EEG recording(s) are listed in the *_scans.tsv file (particularly interesting for the subjects with two recordings).
Please note that the quality of the raw data has not been carefully assessed. While most data files are of high quality, a few might be of poorer quality. The data files are provided "as is", and it is the user's responsibility to ascertain the quality of the individual data file.
For convenience, a cleaned dataset is provided. The files in this derived dataset have been preprocessed with a basic, fully automated pipeline (see /code/s2_preprocess.m for details). The derived files are stored as EEGLAB .set files in a directory structure identical to that of the raw files. Please note that the *_channels.tsv files associated with the derived files have been updated with status information about each channel ("good" or "bad"). The "bad" channels are – for the sake of consistency – interpolated, and thus still present in the data. It might be advisable to remove these channels in some analyses, as they (by definition) do not add information to the EEG data. The cleaned data signals are referenced to the average reference (including the interpolated channels).
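For instance, a cleaned file can be loaded with MNE-Python (a minimal sketch; the file path and bad-channel name are hypothetical):

import mne

raw = mne.io.read_raw_eeglab("path/to/sub-001_task-rest_eeg.set", preload=True)
# Interpolated "bad" channels are flagged in the matching *_channels.tsv sidecar;
# drop them here if the analysis should exclude interpolated signals.
raw.info["bads"] = ["T7"]  # hypothetical example taken from the sidecar
raw.drop_channels(raw.info["bads"])
print(raw.info)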
Please mind the automatic nature of the employed pipeline. It might not perform optimally on all data files (e.g. over-/underestimating proportion of bad channels). For publications, we recommend implementing a more sensitive cleaning pipeline.
The participants.tsv file in the root folder contains the variables age, sex, and a range of cognitive test scores. See the sidecar participants.json for more information on the behavioural measures. Please note that these measures were collected in connection with the "t1" session recording.
All use of this dataset in a publication context requires the following paper to be cited:
Hatlestad-Hall, C., Rygvold, T. W., & Andersson, S. (2022). BIDS-structured resting-state electroencephalography (EEG) data extracted from an experimental paradigm. Data in Brief, 45, 108647. https://doi.org/10.1016/j.dib.2022.108647
Questions regarding the EEG data may be addressed to Christoffer Hatlestad-Hall (chr.hh@pm.me).
Questions regarding the project in general may be addressed to Stein Andersson (stein.andersson@psykologi.uio.no) or Trine W. Rygvold (t.w.rygvold@psykologi.uio.no).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and the relationships between those to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
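To make the idea concrete, here is a conceptual sketch of topic modeling on molecules (not the CheTo implementation): treat Morgan fingerprint bits as "words", molecules as "documents", and fit an LDA model.

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
# Each molecule becomes a bag of fingerprint-bit "words".
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)) for m in mols])
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(fps)  # per-molecule topic distribution
print(doc_topics.round(2))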
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Premier League’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zaeemnalla/premier-league on 12 November 2021.
--- Dataset description provided by original source is as follows ---
Official football data organised and formatted in csv files ready for download is quite hard to come by. Stats providers are hesitant to release their data to anyone and everyone, even if it's for academic purposes. That was my exact dilemma which prompted me to scrape and extract it myself. Now that it's at your disposal, have fun with it.
The data was acquired from the Premier League website and is representative of seasons 2006/2007 to 2017/2018. Visit both sets to get a detailed description of what each entails.
Use it to the best of your ability to predict match outcomes or for a thorough data analysis to uncover some intriguing insights. Be safe and only use this dataset for personal projects. If you'd like to use this type of data for a commercial project, contact Opta to access it through their API instead.
--- Original source retains full ownership of the source dataset ---
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
[NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype to phenotype, hypothesis building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them to previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies, displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions, or check any gene’s suitability as a steady-state control. Resources in this dataset: Resource Title: Website Pointer for Plant Expression Database, Iowa State University. File Name: Web Page. URL: https://www.bcb.iastate.edu/plant-expression-database (project description for the Plant Expression Database (PLEXdb) and integrated tools).
https://choosealicense.com/licenses/cc0-1.0/
🧠 Awesome ChatGPT Prompts [CSV dataset]
This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.
License
CC-0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The automatic extraction of topics is a standard technique for summarizing text corpora from various domains (e.g., news articles, transport or logistics reports, scientific publications) and has several applications. Since, in many cases, topics are subject to continuous change, there is a need to monitor the evolution of a set of topics of interest as the corresponding corpora are updated. The evolution of scientific topics, in particular, is of great interest to researchers, policy makers, fund managers, and other professionals/engineers in the research and academic community. In this dataset, we provide a set of topics for scientific publications gathered from Crossref. The topics have been produced by performing a topic modeling analysis on two distinct sets of publications, each coming from a different time period. Acknowledgements: This research was partially funded by project ENIRISST under grant agreement No. MIS 5027930 (co-financed by Greece and the EU through the European Regional Development Fund).