100+ datasets found

n
Census Microdata Samples Project
neuinfo.org
dknet.org
+2more
Updated Jan 29, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). Census Microdata Samples Project [Dataset]. http://identifiers.org/RRID:SCR_008902
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008902
Dataset updated
Jan 29, 2022
Description
A data set of cross-nationally comparable microdata samples for 15 Economic Commission for Europe (ECE) countries (Bulgaria, Canada, Czech Republic, Estonia, Finland, Hungary, Italy, Latvia, Lithuania, Romania, Russia, Switzerland, Turkey, UK, USA) based on the 1990 national population and housing censuses in countries of Europe and North America to study the social and economic conditions of older persons. These samples have been designed to allow research on a wide range of issues related to aging, as well as on other social phenomena. A common set of nomenclatures and classifications, derived on the basis of a study of census data comparability in Europe and North America, was adopted as a standard for recoding. This series was formerly called Dynamics of Population Aging in ECE Countries. The recommendations regarding the design and size of the samples drawn from the 1990 round of censuses envisaged: (1) drawing individual-based samples of about one million persons; (2) progressive oversampling with age in order to ensure sufficient representation of various categories of older people; and (3) retaining information on all persons co-residing in the sampled individual''''s dwelling unit. Estonia, Latvia and Lithuania provided the entire population over age 50, while Finland sampled it with progressive over-sampling. Canada, Italy, Russia, Turkey, UK, and the US provided samples that had not been drawn specially for this project, and cover the entire population without over-sampling. Given its wide user base, the US 1990 PUMS was not recoded. Instead, PAU offers mapping modules, which recode the PUMS variables into the project''''s classifications, nomenclatures, and coding schemes. Because of the high sampling density, these data cover various small groups of older people; contain as much geographic detail as possible under each country''''s confidentiality requirements; include more extensive information on housing conditions than many other data sources; and provide information for a number of countries whose data were not accessible until recently. Data Availability: Eight of the fifteen participating countries have signed the standard data release agreement making their data available through NACDA/ICPSR (see links below). Hungary and Switzerland require a clearance to be obtained from their national statistical offices for the use of microdata, however the documents signed between the PAU and these countries include clauses stipulating that, in general, all scholars interested in social research will be granted access. Russia requested that certain provisions for archiving the microdata samples be removed from its data release arrangement. The PAU has an agreement with several British scholars to facilitate access to the 1991 UK data through collaborative arrangements. Statistics Canada and the Italian Institute of statistics (ISTAT) provide access to data from Canada and Italy, respectively. * Dates of Study: 1989-1992 * Study Features: International, Minority Oversamples * Sample Size: Approx. 1 million/country Links: * Bulgaria (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02200 * Czech Republic (1991), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06857 * Estonia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06780 * Finland (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06797 * Romania (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06900 * Latvia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02572 * Lithuania (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03952 * Turkey (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03292 * U.S. (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06219
d
Statistics review 2: Samples and populations
catalog.data.gov
data.virginia.gov
Updated Sep 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institutes of Health (2025). Statistics review 2: Samples and populations [Dataset]. https://catalog.data.gov/dataset/statistics-review-2-samples-and-populations
Explore at:
Dataset updated
Sep 6, 2025
Dataset provided by
National Institutes of Health
Description
The previous review in this series introduced the notion of data description and outlined some of the more common summary measures used to describe a dataset. However, a dataset is typically only of interest for the information it provides regarding the population from which it was drawn. The present review focuses on estimation of population values from a sample.
Data from: Evaluating Supplemental Samples in Longitudinal Research:...
tandf.figshare.com
txt
Updated Feb 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12162072.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Laura K. Taylor; Xin Tong; Scott E. Maxwell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.
Biological Samples Database (BSD)
fisheries.noaa.gov
catalog.data.gov
Updated May 6, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Southeast Fisheries Science Center (2023). Biological Samples Database (BSD) [Dataset]. https://www.fisheries.noaa.gov/inport/item/11312
Explore at:
Dataset updated
May 6, 2023
Dataset provided by
Southeast Fisheries Science Center
Time period covered
1992 - Dec 3, 2125
Area covered

Description
The Biological Sampling Database (BSD) is an Oracle relational database that is maintained at the NMFS Panama City Laboratory and NOAA NMFS Beaufort Laboratory. Data set includes port samples of reef fish species collected from commercial and recreational fishery landings in the U.S. South Atlantic (NC - FL Keys). The data set serves as an inventory of samples stored at the NMFS Beaufort Labor...
u
Example (synthetic) electronic health record data
rdr.ucl.ac.uk
application/csv
Updated Apr 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Steve Harris; Wai Shing Lai (2024). Example (synthetic) electronic health record data [Dataset]. http://doi.org/10.5522/04/25676298.v1
Explore at:
application/csvAvailable download formats
Unique identifier
https://doi.org/10.5522/04/25676298.v1
Dataset updated
Apr 24, 2024
Dataset provided by
University College London
Authors
Steve Harris; Wai Shing Lai
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
These data are modelled using the OMOP Common Data Model v5.3.Correlated Data SourceNG tube vocabulariesGeneration RulesThe patient’s age should be between 18 and 100 at the moment of the visit.Ethnicity data is using 2021 census data in England and Wales (Census in England and Wales 2021) .Gender is equally distributed between Male and Female (50% each).Every person in the record has a link in procedure_occurrence with the concept “Checking the position of nasogastric tube using X-ray”2% of person records have a link in procedure_occurrence with the concept of “Plain chest X-ray”60% of visit_occurrence has visit concept “Inpatient Visit”, while 40% have “Emergency Room Visit”NotesVersion 0Generated by man-made rule/story generatorStructural correct, all tables linked with the relationshipWe used national ethnicity data to generate a realistic distribution (see below)2011 Race Census figure in England and WalesEthnic Group : Population(%)Asian or Asian British: Bangladeshi - 1.1Asian or Asian British: Chinese - 0.7Asian or Asian British: Indian - 3.1Asian or Asian British: Pakistani - 2.7Asian or Asian British: any other Asian background -1.6Black or African or Caribbean or Black British: African - 2.5Black or African or Caribbean or Black British: Caribbean - 1Black or African or Caribbean or Black British: other Black or African or Caribbean background - 0.5Mixed multiple ethnic groups: White and Asian - 0.8Mixed multiple ethnic groups: White and Black African - 0.4Mixed multiple ethnic groups: White and Black Caribbean - 0.9Mixed multiple ethnic groups: any other Mixed or multiple ethnic background - 0.8White: English or Welsh or Scottish or Northern Irish or British - 74.4White: Irish - 0.9White: Gypsy or Irish Traveller - 0.1White: any other White background - 6.4Other ethnic group: any other ethnic group - 1.6Other ethnic group: Arab - 0.6
Dataset #1: Cross-sectional survey data
figshare.com
txt
Updated Jul 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adam Baimel (2023). Dataset #1: Cross-sectional survey data [Dataset]. http://doi.org/10.6084/m9.figshare.23708730.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23708730.v1
Dataset updated
Jul 19, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Adam Baimel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
N.B. This is not real data. Only here for an example for project templates.

Project Title: Add title here

Project Team: Add contact information for research project team members

Summary: Provide a descriptive summary of the nature of your research project and its aims/focal research questions.

Relevant publications/outputs: When available, add links to the related publications/outputs from this data.

Data availability statement: If your data is not linked on figshare directly, provide links to where it is being hosted here (i.e., Open Science Framework, Github, etc.). If your data is not going to be made publicly available, please provide details here as to the conditions under which interested individuals could gain access to the data and how to go about doing so.

Data collection details: 1. When was your data collected? 2. How were your participants sampled/recruited?

Sample information: How many and who are your participants? Demographic summaries are helpful additions to this section.

Research Project Materials: What materials are necessary to fully reproduce your the contents of your dataset? Include a list of all relevant materials (e.g., surveys, interview questions) with a brief description of what is included in each file that should be uploaded alongside your datasets.

List of relevant datafile(s): If your project produces data that cannot be contained in a single file, list the names of each of the files here with a brief description of what parts of your research project each file is related to.

Data codebook: What is in each column of your dataset? Provide variable names as they are encoded in your data files, verbatim question associated with each response, response options, details of any post-collection coding that has been done on the raw-response (and whether that's encoded in a separate column).

Examples available at: https://www.thearda.com/data-archive?fid=PEWMU17 https://www.thearda.com/data-archive?fid=RELLAND14
Genomics examples
redivis.com
Updated Oct 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Redivis Demo Organization (2025). Genomics examples [Dataset]. https://redivis.com/datasets/yz1s-d09009dbb
Explore at:
Dataset updated
Oct 20, 2025
Dataset provided by
Redivis Inc.
Authors
Redivis Demo Organization
Time period covered
Jan 30, 2025
Description
This is an auto-generated index table corresponding to a folder of files in this dataset with the same name. This table can be used to extract a subset of files based on their metadata, which can then be used for further analysis. You can view the contents of specific files by navigating to the "cells" tab and clicking on an individual file_id.
Sample Customer data
kaggle.com
zip
Updated Nov 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ravi Kolluru (2022). Sample Customer data [Dataset]. https://www.kaggle.com/datasets/ravick/sample-customer-data
Explore at:
zip(59911 bytes)Available download formats
Dataset updated
Nov 13, 2022
Authors
Ravi Kolluru
Description
Dataset

This dataset was created by Ravi Kolluru

Contents
PCA Data Samples
kaggle.com
zip
Updated Mar 2, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alex Wolski (2021). PCA Data Samples [Dataset]. https://www.kaggle.com/datasets/alexwolski/pca-data-samples
Explore at:
zip(1526 bytes)Available download formats
Dataset updated
Mar 2, 2021
Authors
Alex Wolski
Description
Dataset

This dataset was created by Alex Wolski

Contents
h
example-data-frame
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AI Robotics Ethics Society (PUCRS), example-data-frame [Dataset]. https://huggingface.co/datasets/AiresPucrs/example-data-frame
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
AI Robotics Ethics Society (PUCRS)
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Example DataFrame (Teeny-Tiny Castle)

This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.

How to Use

from datasets import load_dataset

dataset = load_dataset("AiresPucrs/example-data-frame", split = 'train')
B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
d
Data from: An example data set for exploration of Multiple Linear Regression...
catalog.data.gov
data.usgs.gov
Updated Nov 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
h
Wikipedia-example-data
huggingface.co
Updated Jul 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TopicNavi (2024). Wikipedia-example-data [Dataset]. https://huggingface.co/datasets/TopicNavi/Wikipedia-example-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 6, 2024
Dataset authored and provided by
TopicNavi
Description
TopicNavi/Wikipedia-example-data dataset hosted on Hugging Face and contributed by the HF Datasets community
s
Snowplow Modeled Customer Data Sample
snowplow.io
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Snowplow Analytics, Snowplow Modeled Customer Data Sample [Dataset]. https://snowplow.io/explore-snowplow-data-part-2
Explore at:
Dataset authored and provided by
Snowplow Analytics
Time period covered
Apr 1, 2020 - Apr 3, 2020
Variables measured
user_id, mkt_source, page_views, session_id, conversions, geo_country, device_class, mkt_campaign, session_length, time_engaged_in_s
Description
Example of modeled customer behavioral data showing user sessions, engagement metrics, and conversion data across multiple platforms and devices
European Union Statistics on Income and Living Conditions 2013 -...
catalog.ihsn.org
datacatalog.ihsn.org
Updated Mar 29, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eurostat (2019). European Union Statistics on Income and Living Conditions 2013 - Cross-Sectional User Database - Netherlands [Dataset]. https://catalog.ihsn.org/index.php/catalog/7684
Explore at:
Dataset updated
Mar 29, 2019
Dataset authored and provided by
Eurostathttps://ec.europa.eu/eurostat
Time period covered
2013
Area covered
Netherlands
Description
Abstract

In 2013, the EU-SILC instrument covered all EU Member States plus Iceland, Turkey, Norway, Switzerland and Croatia. EU-SILC has become the EU reference source for comparative statistics on income distribution and social exclusion at European level, particularly in the context of the "Program of Community action to encourage cooperation between Member States to combat social exclusion" and for producing structural indicators on social cohesion for the annual spring report to the European Council. The first priority is to be given to the delivery of comparable, timely and high quality cross-sectional data.

There are two types of datasets: 1) Cross-sectional data pertaining to fixed time periods, with variables on income, poverty, social exclusion and living conditions. 2) Longitudinal data pertaining to individual-level changes over time, observed periodically - usually over four years.

Social exclusion and housing-condition information is collected at household level. Income at a detailed component level is collected at personal level, with some components included in the "Household" section. Labor, education and health observations only apply to persons aged 16 and over. EU-SILC was established to provide data on structural indicators of social cohesion (at-risk-of-poverty rate, S80/S20 and gender pay gap) and to provide relevant data for the two 'open methods of coordination' in the field of social inclusion and pensions in Europe.

This is the 1st version of the 2013 Cross-Sectional User Database as released in July 2015.

Geographic coverage

The survey covers following countries: Austria; Belgium; Bulgaria; Croatia; Cyprus; Czech Republic; Denmark; Estonia; Finland; France; Germany; Greece; Spain; Ireland; Italy; Latvia; Lithuania; Luxembourg; Hungary; Malta; Netherlands; Poland; Portugal; Romania; Slovenia; Slovakia; Serbia; Sweden; United Kingdom; Iceland; Norway; Turkey; Switzerland

Small parts of the national territory amounting to no more than 2% of the national population and the national territories listed below may be excluded from EU-SILC: France - French Overseas Departments and territories; Netherlands - The West Frisian Islands with the exception of Texel; Ireland - All offshore islands with the exception of Achill, Bull, Cruit, Gorumna, Inishnee, Lettermore, Lettermullan and Valentia; United Kingdom - Scotland north of the Caledonian Canal, the Scilly Islands.

Analysis unit

Households;

Individuals 16 years and older.

Universe

The survey covered all household members over 16 years old. Persons living in collective households and in institutions are generally excluded from the target population.

Kind of data

Sample survey data [ssd]

Sampling procedure

On the basis of various statistical and practical considerations and the precision requirements for the most critical variables, the minimum effective sample sizes to be achieved were defined. Sample size for the longitudinal component refers, for any pair of consecutive years, to the number of households successfully interviewed in the first year in which all or at least a majority of the household members aged 16 or over are successfully interviewed in both the years.

For the cross-sectional component, the plans are to achieve the minimum effective sample size of around 131.000 households in the EU as a whole (137.000 including Iceland and Norway). The allocation of the EU sample among countries represents a compromise between two objectives: the production of results at the level of individual countries, and production for the EU as a whole. Requirements for the longitudinal data will be less important. For this component, an effective sample size of around 98.000 households (103.000 including Iceland and Norway) is planned.

Member States using registers for income and other data may use a sample of persons (selected respondents) rather than a sample of complete households in the interview survey. The minimum effective sample size in terms of the number of persons aged 16 or over to be interviewed in detail is in this case taken as 75 % of the figures shown in columns 3 and 4 of the table I, for the cross-sectional and longitudinal components respectively.

The reference is to the effective sample size, which is the size required if the survey were based on simple random sampling (design effect in relation to the 'risk of poverty rate' variable = 1.0). The actual sample sizes will have to be larger to the extent that the design effects exceed 1.0 and to compensate for all kinds of non-response. Furthermore, the sample size refers to the number of valid households which are households for which, and for all members of which, all or nearly all the required information has been obtained. For countries with a sample of persons design, information on income and other data shall be collected for the household of each selected respondent and for all its members.

At the beginning, a cross-sectional representative sample of households is selected. It is divided into say 4 sub-samples, each by itself representative of the whole population and similar in structure to the whole sample. One sub-sample is purely cross-sectional and is not followed up after the first round. Respondents in the second sub-sample are requested to participate in the panel for 2 years, in the third sub-sample for 3 years, and in the fourth for 4 years. From year 2 onwards, one new panel is introduced each year, with request for participation for 4 years. In any one year, the sample consists of 4 sub-samples, which together constitute the cross-sectional sample. In year 1 they are all new samples; in all subsequent years, only one is new sample. In year 2, three are panels in the second year; in year 3, one is a panel in the second year and two in the third year; in subsequent years, one is a panel for the second year, one for the third year, and one for the fourth (final) year.

According to the Commission Regulation on sampling and tracing rules, the selection of the sample will be drawn according to the following requirements:

For all components of EU-SILC (whether survey or register based), the crosssectional and longitudinal (initial sample) data shall be based on a nationally representative probability sample of the population residing in private households within the country, irrespective of language, nationality or legal residence status. All private households and all persons aged 16 and over within the household are eligible for the operation.

Representative probability samples shall be achieved both for households, which form the basic units of sampling, data collection and data analysis, and for individual persons in the target population.

The sampling frame and methods of sample selection shall ensure that every individual and household in the target population is assigned a known and non-zero probability of selection.

By way of exception, paragraphs 1 to 3 shall apply in Germany exclusively to the part of the sample based on probability sampling according to Article 8 of the Regulation of the European Parliament and of the Council (EC) No 1177/2003 concerning

Community Statistics on Income and Living Conditions. Article 8 of the EU-SILC Regulation of the European Parliament and of the Council mentions: 1. The cross-sectional and longitudinal data shall be based on nationally representative probability samples. 2. By way of exception to paragraph 1, Germany shall supply cross-sectional data based on a nationally representative probability sample for the first time for the year 2008. For the year 2005, Germany shall supply data for one fourth based on probability sampling and for three fourths based on quota samples, the latter to be progressively replaced by random selection so as to achieve fully representative probability sampling by 2008. For the longitudinal component, Germany shall supply for the year 2006 one third of longitudinal data (data for year 2005 and 2006) based on probability sampling and two thirds based on quota samples. For the year 2007, half of the longitudinal data relating to years 2005, 2006 and 2007 shall be based on probability sampling and half on quota sample. After 2007 all of the longitudinal data shall be based on probability sampling.

Detailed information about sampling is available in Quality Reports in Related Materials.

Mode of data collection

Mixed
H
Political Analysis Using R: Example Code and Data, Plus Data for Practice...
dataverse.harvard.edu
search.dataone.org
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jamie Monogan (2020). Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. http://doi.org/10.7910/DVN/ARKOTI
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ARKOTI
Dataset updated
Apr 28, 2020
Dataset provided by
Harvard Dataverse
Authors
Jamie Monogan
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Data example for Thompson sampling demonstration
kaggle.com
zip
Updated Nov 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wesley Silva (2023). Data example for Thompson sampling demonstration [Dataset]. https://www.kaggle.com/datasets/xslyrs/data-example-for-thompson-sampling-demonstration/code
Explore at:
zip(19043 bytes)Available download formats
Dataset updated
Nov 6, 2023
Authors
Wesley Silva
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Wesley Silva

Released under Apache 2.0

Contents
github-final-datasets
kaggle.com
zip
Updated Nov 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Olga Ivanova (2023). github-final-datasets [Dataset]. https://www.kaggle.com/datasets/olgaiv39/github-final-datasets
Explore at:
zip(1877861953 bytes)Available download formats
Dataset updated
Nov 9, 2023
Authors
Olga Ivanova
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Github Clean Code Snippets Dataset

Here is a description, how the datasets for a training notebook used for Telegram ML Contest solution were prepared.

1 Step - Github Samples Database parsing

The first part of the code samples was taken from a private version of this notebook.

Here is the statistics about classes of programming languages from Github Code Snippets database https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F2fdc091661198e80559f8cb1d1a306ff%2FScreenshot%202023-11-07%20at%2021.24.42.png?generation=1699390166413391&alt=media" alt="">

From this database, 2 csv files were created - with 50000 code samples for each of the 20 programming languages included, with equal by numbers and stratified sampling. The files related here are sample_equal_prop_50000.csv and sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.

2 Step - Github Bigquery Database parsing

Second option for capturing out additional examples was to run this notebook with making up larger amount of queries, 10000.

The resulted file is dataset-10000.csv - included to the data card

The statistics for the code programming languages is as on the next chart - it has 32 labeled classes
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F7c04342da8ec1df266cd90daf00204f9%2FScreenshot%202023-10-13%20at%2020.52.13.png?generation=1699392769199533&alt=media" alt="">

3 Step - collection of code samples of raw coding samples

To get a model more robust, code samples of 20 additional languages were collected in amount from 10 till 15 samples on more-less popular use cases. Also, for the class "OTHER", like regular language examples, according to the task of the competition, the text examples from this dataset with promts on Huggingface were added to the file. The resulted file here is rare_languages.csv - also in data card

The statistics for rare languages code snippets is as follows: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F0b340781c774d2acb988ce1567f4afa3%2FScreenshot%202023-11-08%20at%2001.13.07.png?generation=1699402436798661&alt=media" alt="">

4 Step - First and second datasets combining

For this stage of dataset creation, the number of the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv was cut out just for 2 - "snippet", "language", the version of file with equal numbers is in the data card - sample_equal_prop_50000_clean.csv

To prepare Bigquery dataset file, the column with index was cut out, and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv

After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined together and saved as github-combined-file.csv

5 Step - Datasets cleaning from symbols and merging together with rare languages

The prepared files took too much RAM to be read by Pandas library, so that is why additional prepocessing has been made - the symbols like quatas, commas, ampersands, new lines and adding tabs characters were cleaned out. After clieaning, the flies were merged with rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.

The final distribution of classes turned out to be the next one https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2Ff43e0cea4c565c9f7c808527b0dfa2da%2FScreenshot%202023-11-09%20at%2020.26.30.png?generation=1699558064765454&alt=media" alt="">

6 Step - Fixing up the labels

To be suitable for TF-DF format, to each programming language a certain label was given as well. The final labels are in the data card.
Public Use Microdata Samples (PUMS) - Dataset - NASA Open Data Portal
data.nasa.gov
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov, Public Use Microdata Samples (PUMS) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/public-use-microdata-samples-pums
Explore at:
Dataset provided by
NASAhttp://nasa.gov/
Description
The Public Use Microdata Samples (PUMS) are computer-accessible files containing records for a sample of housing Units, with information on the characteristics of each housing Unit and the people in it for 1940-1990. Within the limits of sample size and geographical detail, these files allow users to prepare virtually any tabulations they require. Each datafile is documented in a codebook containing a data dictionary and supporting appendix information. Electronic versions for the codebooks are only available for the 1980 and 1990 datafiles. Identifying information has been removed to protect the confidentiality of the respondents. PUMS is produced by the United States Census Bureau (USCB) and is distributed by USCB, Inter-university Consortium for Political and Social Research (ICPSR), and Columbia University Center for International Earth Science Information Network (CIESIN).
h
amazon-product-data-sample
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Iftach Arbel, amazon-product-data-sample [Dataset]. https://huggingface.co/datasets/iarbel/amazon-product-data-sample
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Iftach Arbel
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for "amazon-product-data-filter"

Dataset Summary

The Amazon Product Dataset contains product listing data from the Amazon US website. It can be used for various NLP and classification tasks, such as text generation, product type classification, attribute extraction, image recognition and more. NOTICE: This is a sample of the full Amazon Product Dataset, which contains 1K examples. Follow the link to gain access to the full dataset.

Languages… See the full description on the dataset page: https://huggingface.co/datasets/iarbel/amazon-product-data-sample.

Facebook

Twitter

Click to copy link

Link copied

Cite

(2022). Census Microdata Samples Project [Dataset]. http://identifiers.org/RRID:SCR_008902

Census Microdata Samples Project

RRID:SCR_008902, nlx_151430, Census Microdata Samples Project (RRID:SCR_008902), Census Microdata Samples Project, Status of Older Persons in UNECE Countries, Dynamics of Population Aging in ECE Countries, PAU Census Microdata Samples Project, Population Activities Unit Census Microdata Samples Project, Dynamics of Population Aging in Economic Commission for Europe Countries

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://identifiers.org/RRID:SCR_008902

Dataset updated

Jan 29, 2022

Description

A data set of cross-nationally comparable microdata samples for 15 Economic Commission for Europe (ECE) countries (Bulgaria, Canada, Czech Republic, Estonia, Finland, Hungary, Italy, Latvia, Lithuania, Romania, Russia, Switzerland, Turkey, UK, USA) based on the 1990 national population and housing censuses in countries of Europe and North America to study the social and economic conditions of older persons. These samples have been designed to allow research on a wide range of issues related to aging, as well as on other social phenomena. A common set of nomenclatures and classifications, derived on the basis of a study of census data comparability in Europe and North America, was adopted as a standard for recoding. This series was formerly called Dynamics of Population Aging in ECE Countries. The recommendations regarding the design and size of the samples drawn from the 1990 round of censuses envisaged: (1) drawing individual-based samples of about one million persons; (2) progressive oversampling with age in order to ensure sufficient representation of various categories of older people; and (3) retaining information on all persons co-residing in the sampled individual''''s dwelling unit. Estonia, Latvia and Lithuania provided the entire population over age 50, while Finland sampled it with progressive over-sampling. Canada, Italy, Russia, Turkey, UK, and the US provided samples that had not been drawn specially for this project, and cover the entire population without over-sampling. Given its wide user base, the US 1990 PUMS was not recoded. Instead, PAU offers mapping modules, which recode the PUMS variables into the project''''s classifications, nomenclatures, and coding schemes. Because of the high sampling density, these data cover various small groups of older people; contain as much geographic detail as possible under each country''''s confidentiality requirements; include more extensive information on housing conditions than many other data sources; and provide information for a number of countries whose data were not accessible until recently. Data Availability: Eight of the fifteen participating countries have signed the standard data release agreement making their data available through NACDA/ICPSR (see links below). Hungary and Switzerland require a clearance to be obtained from their national statistical offices for the use of microdata, however the documents signed between the PAU and these countries include clauses stipulating that, in general, all scholars interested in social research will be granted access. Russia requested that certain provisions for archiving the microdata samples be removed from its data release arrangement. The PAU has an agreement with several British scholars to facilitate access to the 1991 UK data through collaborative arrangements. Statistics Canada and the Italian Institute of statistics (ISTAT) provide access to data from Canada and Italy, respectively. * Dates of Study: 1989-1992 * Study Features: International, Minority Oversamples * Sample Size: Approx. 1 million/country Links: * Bulgaria (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02200 * Czech Republic (1991), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06857 * Estonia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06780 * Finland (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06797 * Romania (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06900 * Latvia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02572 * Lithuania (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03952 * Turkey (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03292 * U.S. (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06219

Clear search

Close search

Google apps

Main menu

Census Microdata Samples Project

Statistics review 2: Samples and populations

Data from: Evaluating Supplemental Samples in Longitudinal Research:...

Biological Samples Database (BSD)

Example (synthetic) electronic health record data

Dataset #1: Cross-sectional survey data

Genomics examples

Sample Customer data

Dataset

Contents

PCA Data Samples

Dataset

Contents

example-data-frame

Data Cleaning Sample

Data from: An example data set for exploration of Multiple Linear Regression...

Wikipedia-example-data

Snowplow Modeled Customer Data Sample

European Union Statistics on Income and Living Conditions 2013 -...

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Political Analysis Using R: Example Code and Data, Plus Data for Practice...

Data example for Thompson sampling demonstration

Dataset

Contents

github-final-datasets

Github Clean Code Snippets Dataset

1 Step - Github Samples Database parsing

2 Step - Github Bigquery Database parsing

3 Step - collection of code samples of raw coding samples

4 Step - First and second datasets combining

5 Step - Datasets cleaning from symbols and merging together with rare languages

6 Step - Fixing up the labels

Public Use Microdata Samples (PUMS) - Dataset - NASA Open Data Portal

amazon-product-data-sample

Census Microdata Samples ProjectSee More Versions

Census Microdata Samples Project