The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently, datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method. Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking, the level of difficulty of a dataset depends on the algorithm. These levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets project is also supported by the Standard Reference Data Program.
These datasets contain reviews from the Goodreads book review website and a variety of attributes describing the items. Critically, these datasets capture multiple levels of user interaction, ranging from adding a book to a shelf to rating and reading it.
Metadata includes:
reviews
add-to-shelf, read, review actions
book attributes: title, ISBN
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Advance population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, both in absolute numbers and in percentage terms. The dataset can be used to understand the population change of Advance across the last two decades: for example, we can identify whether the population is declining or increasing, when the population peaked, and whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.
Key observations
In 2023, the population of Advance was 505, a 0.40% year-over-year increase from 2022. Previously, in 2022, the population of Advance was 503, a decline of 0.59% compared to a population of 506 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Advance decreased by 54. In this period, the peak population was 598, in the year 2009. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).
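As a quick illustration, the year-on-year changes above can be reproduced with pandas (a minimal sketch; the column names are hypothetical, not taken from the dataset):

import pandas as pd

df = pd.DataFrame({"year": [2021, 2022, 2023], "population": [506, 503, 505]})
df["change"] = df["population"].diff()                  # absolute year-on-year change
df["change_pct"] = df["population"].pct_change() * 100  # change in percentage terms
print(df)  # 2022: -3 (-0.59%); 2023: +2 (+0.40%), matching the key observations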
When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).
Data Coverage:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Advance Population by Year. You can refer to it here:
https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. This dataset addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social, and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1] The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities. Dataset description: The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine. The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated. The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when only the interfaces of apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments. Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and also represents an interesting dataset to screen for interface similarities. References: [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted) [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012. [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265. [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055. This work was supported by the German Federal Ministry of Education and Research as part of CompLS and de.NBI [031L0172, 031L0105]. C.E. is funded by Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter (Grant-ID: HIDSS-0002).
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
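For instance, in Python (a minimal sketch; the CSV file name here is illustrative, not an actual file name from the release):

import pandas as pd

daily = pd.read_csv("daily_fitbit.csv")  # hypothetical file name; use the actual CSV names
print(daily.head())
print(daily.columns.tolist())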
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit <path/to/fitbit.bson>
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema <path/to/sema.bson>
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys <path/to/surveys.bson>
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
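For example, an authenticated restore might look like the following (the username, password, and dump path are placeholders):
mongorestore --host localhost:27017 --username <user> --password <pass> --authenticationDatabase admin -d rais_anonymized -c fitbit <path/to/fitbit.bson>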
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{ _id: <ObjectId>, id (or user_id): <string>, type: <string>, data: <embedded object> }
Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the steps count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
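To illustrate, a minimal pymongo sketch (the user id value is hypothetical) that retrieves all steps documents for one user from the fitbit collection:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]
# "steps" is one of the type values described above; the id value is illustrative.
for doc in db["fitbit"].find({"type": "steps", "id": "some_user_id"}):
    print(doc["data"])  # type-dependent embedded object, e.g. a steps count per timestamp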
Surveys Encoding
BREQ2
Why do you engage in exercise?
engage[SQ001]: I exercise because other people say I should
engage[SQ002]: I feel guilty when I don’t exercise
engage[SQ003]: I value the benefits of exercise
engage[SQ004]: I exercise because it’s fun
engage[SQ005]: I don’t see why I should have to exercise
engage[SQ006]: I take part in exercise because my friends/family/partner say I should
engage[SQ007]: I feel ashamed when I miss an exercise session
engage[SQ008]: It’s important to me to exercise regularly
engage[SQ009]: I can’t see why I should bother exercising
engage[SQ010]: I enjoy my exercise sessions
engage[SQ011]: I exercise because others will not be pleased with me if I don’t
engage[SQ012]: I don’t see the point in exercising
engage[SQ013]: I feel like a failure when I haven’t exercised in a while
engage[SQ014]: I think it is important to make the effort to exercise regularly
engage[SQ015]: I find exercise a pleasurable activity
engage[SQ016]: I feel under pressure from my friends/family to exercise
engage[SQ017]: I get restless if I don’t exercise regularly
engage[SQ018]: I get pleasure and satisfaction from participating in exercise
engage[SQ019]: I think exercising is a waste of time
PANAS
Indicate the extent you have felt this way over the past week
P1[SQ001]: Interested
P1[SQ002]: Distressed
P1[SQ003]: Excited
P1[SQ004]: Upset
P1[SQ005]: Strong
P1[SQ006]: Guilty
P1[SQ007]: Scared
P1[SQ008]: Hostile
P1[SQ009]: Enthusiastic
P1[SQ010]: Proud
P1[SQ011]: Irritable
P1[SQ012]: Alert
P1[SQ013]: Ashamed
P1[SQ014]: Inspired
P1[SQ015]: Nervous
P1[SQ016]: Determined
P1[SQ017]: Attentive
P1[SQ018]: Jittery
P1[SQ019]: Active
P1[SQ020]: Afraid
Personality
How Accurately Can You Describe Yourself?
ipip[SQ001]: Am the life of the party.
ipip[SQ002]: Feel little concern for others.
ipip[SQ003]: Am always prepared.
ipip[SQ004]: Get stressed out easily.
ipip[SQ005]: Have a rich vocabulary.
ipip[SQ006]: Don't talk a lot.
ipip[SQ007]: Am interested in people.
ipip[SQ008]: Leave my belongings around.
ipip[SQ009]: Am relaxed most of the time.
ipip[SQ010]: Have difficulty understanding abstract ideas.
ipip[SQ011]: Feel comfortable around people.
ipip[SQ012]: Insult people.
ipip[SQ013]: Pay attention to details.
ipip[SQ014]: Worry about things.
ipip[SQ015]: Have a vivid imagination.
ipip[SQ016]: Keep in the background.
ipip[SQ017]: Sympathize with others' feelings.
ipip[SQ018]: Make a mess of things.
ipip[SQ019]: Seldom feel blue.
ipip[SQ020]: Am not interested in abstract ideas.
ipip[SQ021]: Start conversations.
ipip[SQ022]: Am not interested in other people's problems.
ipip[SQ023]: Get chores done right away.
ipip[SQ024]: Am easily disturbed.
ipip[SQ025]: Have excellent ideas.
ipip[SQ026]: Have little to say.
ipip[SQ027]: Have a soft heart.
ipip[SQ028]: Often forget to put things back in their proper place.
ipip[SQ029]: Get upset easily.
ipip[SQ030]: Do not have a good imagination.
ipip[SQ031]: Talk to a lot of different people at parties.
ipip[SQ032]: Am not really interested in others.
ipip[SQ033]: Like order.
ipip[SQ034]: Change my mood a lot.
ipip[SQ035]: Am quick to understand things.
ipip[SQ036]: Don't like to draw attention to myself.
ipip[SQ037]: Take time out for others.
ipip[SQ038]: Shirk my duties.
ipip[SQ039]: Have frequent mood swings.
ipip[SQ040]: Use difficult words.
ipip[SQ041]: Don't mind being the centre of attention.
ipip[SQ042]: Feel others' emotions.
ipip[SQ043]: Follow a schedule.
ipip[SQ044]: Get irritated easily.
ipip[SQ045]: Spend time reflecting on things.
ipip[SQ046]: Am quiet around strangers.
ipip[SQ047]: Make people feel at ease.
ipip[SQ048]: Am exacting in my work.
ipip[SQ049]: Often feel blue.
ipip[SQ050]: Am full of ideas.
STAI
Indicate how you feel right now
STAI[SQ001]: I feel calm
STAI[SQ002]: I feel secure
STAI[SQ003]: I am tense
STAI[SQ004]: I feel strained
STAI[SQ005]: I feel at ease
STAI[SQ006]: I feel upset
STAI[SQ007]: I am presently worrying over possible misfortunes
STAI[SQ008]: I feel satisfied
STAI[SQ009]: I feel frightened
STAI[SQ010]: I feel comfortable
STAI[SQ011]: I feel self-confident
STAI[SQ012]: I feel nervous
STAI[SQ013]: I am jittery
STAI[SQ014]: I feel indecisive
STAI[SQ015]: I am relaxed
STAI[SQ016]: I feel content
STAI[SQ017]: I am worried
STAI[SQ018]: I feel confused
STAI[SQ019]: I feel steady
STAI[SQ020]: I feel pleasant
TTM
Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?
processes[SQ002]: I read articles to learn more about physical activity
Companies and individuals are storing increasingly more data digitally; however, much of it goes unused because it is unclassified. How many times have you opened your downloads folder and found a file you downloaded a year ago with no idea what its contents are? You could read through those files individually, but imagine doing that for thousands of files. All that raw data in storage facilities creates data lakes. As the amount of data grows and its complexity rises, data lakes become data swamps, and potentially valuable and interesting datasets will likely remain unused. Our tool addresses the need to classify these large pools of data in a visually effective and succinct manner by identifying keywords in datasets and classifying datasets into a consistent taxonomy.
The files listed within kaggleDatasetSummaryTopicsClassification.csv have been processed with our tool to generate the keywords and taxonomic classification shown. The summaries are not generated by our system; instead, they were retrieved from user input when the files were uploaded to Kaggle. We planned to utilize these summaries to create an NLG model to generate summaries from any input file. Unfortunately, we were not able to collect enough data to build a good model. Hopefully the data within this set will help future users achieve that goal.
Developed with the Senior Design Center at NC State in collaboration with SAS. Senior Design Team: Tanya Chu, Katherine Marsh, Nikhil Milind, Anna Owens. SAS Representatives: Nancy Rausch, Marty Warner, Brant Kay, Tyler Wendell, JP Trawinski.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Countries of the World’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fernandol/countries-of-the-world on 12 November 2021.
--- Dataset description provided by original source is as follows ---
World fact sheet, fun to link with other datasets.
Information on population, region, area size, infant mortality and more.
Source: All these data sets are made up of data from the US government. Generally they are free to use if you use the data within the US. If you are outside of the US, you may need to contact the US government to ask for permission.
Data from the World Factbook is public domain. The website says "The World Factbook is in the public domain and may be used freely by anyone at anytime without seeking permission."
https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html
When making visualisations related to countries, sometimes it is interesting to group them by attributes such as region, or weigh their importance by population, GDP or other variables.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Inferring gene regulatory relationships from observational data is challenging. Manipulation and intervention are often required to unravel causal relationships unambiguously. However, gene copy number changes, as they frequently occur in cancer cells, might be considered natural manipulation experiments on gene expression. An increasing number of data sets on matched array comparative genomic hybridisation and transcriptomics experiments from a variety of cancer pathologies are becoming publicly available. Here we explore the potential of a meta-analysis of thirty such data sets. The aim of our analysis was to assess the potential of in silico inference of trans-acting gene regulatory relationships from this type of data. We found sufficient correlation signal in the data to infer gene regulatory relationships, with interesting similarities between data sets. A number of genes had highly correlated copy number and expression changes in many of the data sets, and we present predicted potential trans-acted regulatory relationships for each of these genes. The study also investigates to what extent heterogeneity between cell types and between pathologies determines the number of statistically significant predictions available from a meta-analysis of experiments.
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.
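The core idea of using a statistical dependency test as an interestingness measure for bins can be sketched as follows (a simplified illustration with made-up counts, not the authors' implementation):

import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: record counts per bin of attribute A (rows) vs. attribute B (columns).
counts = np.array([[30, 5, 2], [4, 40, 6], [3, 7, 25]])
chi2, p_value, dof, _ = chi2_contingency(counts)
# A small p-value signals a dependency between the binned attributes, which a
# system like the one described could rank as an interesting bin relation.
print(f"chi2={chi2:.1f}, p={p_value:.2e}, dof={dof}")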
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of accompanying categorical, numerical, or textual metadata given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or, loosely speaking, correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.

The dataset contains 265 child links to BSRN datasets. Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Gert König-Langlo (mailto:Gert.Koenig-Langlo@awi.de) for an account to download these datasets.
Please refer to Yelp for the original JSON file and other datasets. This dataset was created in June 2020 by Yelp. The usage of this dataset should be for academic purposes.
I read the JSON file in Python and converted it to three CSV files, as sketched below.
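A minimal sketch of such a conversion with pandas (the file name follows Yelp's usual naming, but treat it as illustrative):

import pandas as pd

# Yelp distributes its data as JSON Lines: one JSON object per line.
business = pd.read_json("yelp_academic_dataset_business.json", lines=True)
business.to_csv("business.csv", index=False)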
Please read Dataset_User_Agreement.pdf before you proceed with all data files.
It would be interesting to see how virtual services were offered by restaurants during COVID in 2020 and how restaurant businesses strove to communicate and connect with customers on Yelp. There is no numeric data to play with; however, it's still valuable to do some visualizations.
https://creativecommons.org/publicdomain/zero/1.0/
Twitter is an online social media platform where people share their thoughts as tweets. It has been observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help it by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?
Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.
Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)
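A minimal loading sketch (the column names follow the competition description above; treat the details as illustrative):

import pandas as pd

train = pd.read_csv("train.csv")
# Strip stray leading/trailing quotes from the text fields, as advised above.
for col in ("text", "selected_text"):
    train[col] = train[col].astype(str).str.strip('"')
print(train.head())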
The dataset is downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘US Figure Skating Data - 2020 Regionals Int Ladies’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/katiewerner/us-figure-skating-data-2020-regionals-int-ladies on 28 January 2022.
--- Dataset description provided by original source is as follows ---
US Figure Skating hosts several competitions at the regional level; top skaters from these competitions progress to the next echelon of competition. Depending on the region, there may be qualifying events before the final round to determine the top skaters. In a qualifying round, skaters perform their free skate program for a panel of judges; top qualifiers move on to the final round. The final round consists of each skater performing a short program and their free skate program; the official panels are not necessarily the same between the short program and free skate program events (likewise for qualifying rounds).
Skaters receive scores based on: technical elements, program components, and any deductions incurred. Technical elements performed are assessed by the technical officials on the panel of officials, and the element is assigned a base value. This base value has several considerations beyond just the element performed - for example, if a jump is under-rotated this affects the base value, and if the skater performs an element that receives a bonus, the bonus is added to the base value. The judges on the panel of officials then assign each technical element a grade of execution (GOE) mark, an integer in the range from -5 to +5. The base value is then scaled by the average GOE, dropping the highest and lowest GOEs.
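As a rough illustration of the arithmetic described above (a sketch that assumes each GOE unit is worth 10% of the base value; this may not match the official scoring rules exactly):

def element_score(base_value, goes, goe_step=0.10):
    # Drop the highest and lowest GOEs, average the rest, and scale the base value.
    trimmed = sorted(goes)[1:-1]
    avg_goe = sum(trimmed) / len(trimmed)
    return base_value + base_value * goe_step * avg_goe

# Example: a 3.30-point element with judges' GOEs of [2, 1, 1, 0, -1, 1, 2] scores 3.63.
print(round(element_score(3.30, [2, 1, 1, 0, -1, 1, 2]), 2))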
Judges on the official panel are also responsible for assigning program component scores, which consider interpretation of the music, program choreography/composition, performance, transitions between and going into the technical elements, and overall skating skills. At some skating levels, only a subset of the components is scored. Component marks are adjusted by removing the high and low scores, and are also adjusted by a set of scalar multipliers, as provided by US Figure Skating.
Deductions may be incurred for time violations, falls, or costume/prop issues. Fall deductions may be more severe depending on the level.
This dataset packaging includes 5 CSV files that contain the scraped data from the 2020 Regional Intermediate Ladies events:
- Event Details (event index, round, program type)
- Official Details (judge and technical official names, cities)
- Skater Details (skater placement, skate order, club, overall scores and deductions)
- Technical Score Details (base values, GOEs)
- Component Score Details (component scores)
The event index and skater placement columns are keys that can be used to join these datasets together, as appropriate.
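For example, a join sketch in pandas (the file and column names are hypothetical; use the actual headers from the CSVs):

import pandas as pd

events = pd.read_csv("event_details.csv")
skaters = pd.read_csv("skater_details.csv")
technical = pd.read_csv("technical_score_details.csv")
merged = (skaters.merge(events, on="event_index")
                 .merge(technical, on=["event_index", "skater_placement"]))
print(merged.head())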
Many thanks to my husband, for encouraging me to tackle this project during my personal sabbatical. Special shoutout to James Robertson and Joshua Merry for giving this dataset publication a quick once over.
As ever, I also owe thanks to stackoverflow for inspiring code solutions when all hope is seemingly lost.
Photo credit: Greg Pembroke
I created this dataset simply as a fun data project, using data I am intimately familiar with and fascinated by the creation of -- indeed, I helped create this data back in October 2019 by serving as a judge on some of these panels! I intend to publish another notebook containing EDA and some models in the near future - stay tuned!
The scripts and notebooks provided are strictly intended to serve as an example to demonstrate my approach to working with data. It is ABSOLUTELY NOT intended to be used to draw any definitive conclusions about: the sport of figure skating, the judges, the technical officials, the skaters, the coaches, the skating clubs, US Figure Skating, any region or section of US Figure Skating, etc.
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This EEG dataset contains resting-state EEG extracted from the experimental paradigm used in the Stimulus-Selective Response Modulation (SRM) project at the Dept. of Psychology, University of Oslo, Norway.
The data is recorded with a BioSemi ActiveTwo system, using 64 electrodes following the positional scheme of the extended 10-20 system (10-10). Each datafile comprises four minutes of uninterrupted EEG acquired while the subjects were resting with their eyes closed. The dataset includes EEG from 111 healthy control subjects (the "t1" session), of which a number underwent an additional EEG recording at a later date (the "t2" session). Thus, some subjects have one associated EEG file, whereas others have two.
The dataset is provided "as is". Hereunder, the authors take no responsibility with regard to data quality. The user is solely responsible for ascertaining that the data used for publications or in other contexts fulfil the required quality criteria.
The raw EEG data signals are re-referenced to the average reference. Other than that, no operations have been performed on the data. The files contain no events; the whole continuous segment is resting-state data. The data signals are unfiltered (recorded in Europe, so the line noise frequency is 50 Hz). The time points for the subject's EEG recording(s) are listed in the *_scans.tsv file (particularly interesting for the subjects with two recordings).
Please note that the quality of the raw data has not been carefully assessed. While most data files are of high quality, a few might be of poorer quality. The data files are provided "as is", and it is the user's responsibility to ascertain the quality of the individual data file.
For convenience, a cleaned dataset is provided. The files in this derived dataset have been preprocessed with a basic, fully automated pipeline (see /code/s2_preprocess.m for details). The derived files are stored as EEGLAB .set files in a directory structure identical to that of the raw files. Please note that the *_channels.tsv files associated with the derived files have been updated with status information about each channel ("good" or "bad"). The "bad" channels are – for the sake of consistency – interpolated, and thus still present in the data. It might be advisable to remove these channels in some analyses, as they (by definition) do not add information to the EEG data. The cleaned data signals are referenced to the average reference (including the interpolated channels).
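For instance, a cleaned file can be loaded with MNE-Python (a minimal sketch; the file path and bad-channel name are hypothetical):

import mne

raw = mne.io.read_raw_eeglab("path/to/sub-001_task-rest_eeg.set", preload=True)
# Interpolated "bad" channels are flagged in the matching *_channels.tsv sidecar;
# drop them here if the analysis should exclude interpolated signals.
raw.info["bads"] = ["T7"]  # hypothetical example taken from the sidecar
raw.drop_channels(raw.info["bads"])
print(raw.info)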
Please mind the automatic nature of the employed pipeline. It might not perform optimally on all data files (e.g. over-/underestimating proportion of bad channels). For publications, we recommend implementing a more sensitive cleaning pipeline.
The participants.tsv file in the root folder contains the variables age, sex, and a range of cognitive test scores. See the sidecar participants.json for more information on the behavioural measures. Please note that these measures were collected in connection with the "t1" session recording.
All use of this dataset in a publication context requires the following paper to be cited:
Hatlestad-Hall, C., Rygvold, T. W., & Andersson, S. (2022). BIDS-structured resting-state electroencephalography (EEG) data extracted from an experimental paradigm. Data in Brief, 45, 108647. https://doi.org/10.1016/j.dib.2022.108647
Questions regarding the EEG data may be addressed to Christoffer Hatlestad-Hall (chr.hh@pm.me).
Questions regarding the project in general may be addressed to Stein Andersson (stein.andersson@psykologi.uio.no) or Trine W. Rygvold (t.w.rygvold@psykologi.uio.no).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and the relationships between those to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
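To make the idea concrete, here is a conceptual sketch of topic modeling on molecules (not the CheTo implementation): treat Morgan fingerprint bits as "words", molecules as "documents", and fit an LDA model.

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
# Each molecule becomes a bag of fingerprint-bit "words".
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)) for m in mols])
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(fps)  # per-molecule topic distribution
print(doc_topics.round(2))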
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Premier League’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zaeemnalla/premier-league on 12 November 2021.
--- Dataset description provided by original source is as follows ---
Official football data organised and formatted in csv files ready for download is quite hard to come by. Stats providers are hesitant to release their data to anyone and everyone, even if it's for academic purposes. That was my exact dilemma which prompted me to scrape and extract it myself. Now that it's at your disposal, have fun with it.
The data was acquired from the Premier League website and is representative of seasons 2006/2007 to 2017/2018. Visit both sets to get a detailed description of what each entails.
Use it to the best of your ability to predict match outcomes or for a thorough data analysis to uncover some intriguing insights. Be safe and only use this dataset for personal projects. If you'd like to use this type of data for a commercial project, contact Opta to access it through their API instead.
--- Original source retains full ownership of the source dataset ---
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
[NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype to phenotype, hypothesis building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them to previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies, displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions, or check any gene’s suitability as a steady-state control. Resources in this dataset: Resource Title: Website Pointer for Plant Expression Database, Iowa State University. File Name: Web Page. URL: https://www.bcb.iastate.edu/plant-expression-database (project description for the Plant Expression Database (PLEXdb) and integrated tools).
https://choosealicense.com/licenses/cc0-1.0/
🧠 Awesome ChatGPT Prompts [CSV dataset]
This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.
License
CC-0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The automatic extraction of topics is a standard technique for summarizing text corpora from various domains (e.g., news articles, transport or logistics reports, scientific publications) and has several applications. Since, in many cases, topics are subject to continuous change, there is a need to monitor the evolution of a set of topics of interest as the corresponding corpora are updated. The evolution of scientific topics, in particular, is of great interest to researchers, policy makers, fund managers, and other professionals/engineers in the research and academic community. In this dataset, we provide a set of topics for scientific publications gathered from Crossref. The topics have been produced by performing a topic modeling analysis on two distinct sets of publications, each coming from a different time period. Acknowledgements: This research was partially funded by project ENIRISST under grant agreement No. MIS 5027930 (co-financed by Greece and the EU through the European Regional Development Fund).