66 datasets found
  1. VA Personal Health Record Sample Data

    • catalog.data.gov
    • data.va.gov
    • +3 more
    Updated Apr 25, 2021
    Cite
    Department of Veterans Affairs (2021). VA Personal Health Record Sample Data [Dataset]. https://catalog.data.gov/dataset/va-personal-health-record-sample-data
    Dataset updated
    Apr 25, 2021
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    My HealtheVet (www.myhealth.va.gov) is a Personal Health Record portal designed to improve the delivery of health care services to Veterans, to promote health and wellness, and to engage Veterans as more active participants in their health care. The My HealtheVet portal enables Veterans to create and maintain a web-based PHR that provides access to patient health education information and resources, a comprehensive personal health journal, and electronic services such as online VA prescription refill requests and Secure Messaging. Veterans can visit the My HealtheVet website and self-register to create an account, although registration is not required to view the professionally-sponsored health education resources, including topics of special interest to the Veteran population. Once registered, Veterans can create a customized PHR that is accessible from any computer with Internet access.

  2. Data from: Login Data Set for Risk-Based Authentication

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 30, 2022
    Cite
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono (2022). Login Data Set for Risk-Based Authentication [Dataset]. http://doi.org/10.5281/zenodo.6782156
    Available download formats: zip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Login Data Set for Risk-Based Authentication

    Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

    This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

    The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

    WARNING: The feature values are plausible, but still entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.

    Overview

    The data set contains the following features related to each login attempt on the SSO:

    Feature | Data Type | Description | Range or Example
    IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255
    Country | String | Country derived from the IP address | US
    Region | String | Region derived from the IP address | New York
    City | String | City derived from the IP address | Rochester
    ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000
    User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ...
    OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10
    Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538
    Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1]
    User ID | Integer | Identification number related to the affected user account | [Random pseudonym]
    Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp]
    Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000
    Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false)
    Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false)
    Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false)
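
    A minimal sketch of loading this data set and inspecting the features listed above, assuming a pandas-readable CSV export; the file name "rba-dataset.csv" and the use of the table's feature names as column headers are assumptions, not part of the dataset documentation.

    # Sketch: load a slice of the login data set and check the documented schema.
    # Assumptions: CSV file name and column names match the feature table above.
    import pandas as pd

    df = pd.read_csv("rba-dataset.csv", nrows=1_000_000)   # hypothetical file name; full set is >33M rows
    print(df.dtypes)                                        # String/Integer/Boolean features
    print(df["Login Successful"].value_counts())            # (true, false)
    print(df["Device Type"].unique())                       # mobile, desktop, tablet, bot, unknown
    print(int(df["Is Account Takeover"].sum()))              # number of labeled takeovers in the slice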

    Data Creation

    As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining the logical relations and temporal order between the features.

    The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

    • The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

    • The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.

    • The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.

    Regarding the Data Values

    Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not affect the risk scores generated by the Freeman et al. (2016) model.

    You can recognize them by the following values:

    • ASNs with values >= 500,000

    • IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
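
    A minimal sketch of flagging these artificial entries, assuming the data is loaded into a pandas DataFrame with the column names from the feature table above (an assumption, not the documented loading procedure).

    # Sketch: flag rows with clearly synthetic IP/ASN values (ASN >= 500,000 or IP in 10.0.0.0/8).
    import ipaddress
    import pandas as pd

    def is_private_10(ip: str) -> bool:
        """True if the address falls inside the 10.0.0.0/8 CIDR range."""
        try:
            return ipaddress.ip_address(ip) in ipaddress.ip_network("10.0.0.0/8")
        except ValueError:
            return False

    def flag_artificial(df: pd.DataFrame) -> pd.Series:
        """Boolean mask of rows with synthetic IP/ASN values, per the rules above."""
        return (df["ASN"] >= 500_000) | df["IP Address"].map(is_private_10)

    # Usage: mask = flag_artificial(df); df[~mask] keeps only realistic-looking rows.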

    Study Reproduction

    Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

    The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but show similar tendencies. The same is true for the median RTTs per country. This is because the number of available entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

    See RESULTS.md for more details.

    Ethics

    By using the SSO service, the users agreed to the collection and evaluation of their data for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

    The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

    Publication

    You can find more details on our conducted study in the following journal article:

    Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
    ACM Transactions on Privacy and Security

    Bibtex

    @article{Wiefling_Pump_2022,
     author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
     title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
     journal = {{ACM} {Transactions} on {Privacy} and {Security}},
     doi = {10.1145/3546069},
     publisher = {ACM},
     year  = {2022}
    }

    License

    This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

    [1] A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. This parse error may be useful information for your studies, so we kept these 1,526 entries.

  3. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability - and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
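
    The long/wide relationship described above can be illustrated with a short pandas sketch; the file extension and the column names ("film_id", "festival_edition_year") are illustrative assumptions, not the dataset's documented variable names (see the codebook for those).

    # Sketch: derive a wide-format table (one row per unique film, keeping the
    # first sample festival) from the long-format festival-program table.
    import pandas as pd

    long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

    wide_df = (
        long_df
        .sort_values("festival_edition_year")               # earliest festival appearance first
        .drop_duplicates(subset="film_id", keep="first")     # one row per unique film
    )
    print(len(wide_df))   # should approach the documented n=9,348 unique films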


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
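
    The two fuzzy title-matching methods named above can be illustrated with a small self-contained sketch; the original project uses R scripts, so the Python below is only an illustration of the idea (cosine similarity over character n-grams, OSA edit distance), not the authors' code.

    # Sketch: the two fuzzy-matching measures used for title matching.
    from collections import Counter
    from math import sqrt

    def cosine_similarity(a: str, b: str, n: int = 2) -> float:
        """Cosine similarity between character n-gram count vectors of two titles."""
        grams = lambda s: Counter(s[i:i + n] for i in range(len(s) - n + 1))
        va, vb = grams(a.lower()), grams(b.lower())
        dot = sum(va[g] * vb[g] for g in va)
        norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    def osa_distance(a: str, b: str) -> int:
        """Optimal string alignment distance (edits plus adjacent transpositions)."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
        return d[len(a)][len(b)]

    # A high cosine similarity suggests near-identical titles, while a small OSA
    # distance tolerates typos such as transposed characters.
    print(cosine_similarity("Parasite", "Parasite "))   # close to 1.0
    print(osa_distance("Parastie", "Parasite"))          # 1 (one transposition)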

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the function defined in “r_4_scraping_functions” in order to scrape the IMDb data for the identified matches. This script does so for the first 100 films to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  4. My Digital Footprint

    • kaggle.com
    zip
    Updated Jun 29, 2023
    Cite
    Girish (2023). My Digital Footprint [Dataset]. https://www.kaggle.com/datasets/girish17019/my-digital-footprint
    Available download formats: zip (874430159 bytes)
    Dataset updated
    Jun 29, 2023
    Authors
    Girish
    Description

    Dataset Info:

    MyDigitalFootprint (MDF) is a novel large-scale dataset composed of smartphone embedded sensors data, physical proximity information, and Online Social Network interactions, aimed at supporting multimodal context recognition and social relationship modelling in mobile environments. The dataset includes two months of measurements and information collected from the personal mobile devices of 31 volunteer users following an in-the-wild data collection approach: the data has been collected in the users' natural environment, without limiting their usual behaviour. Existing public datasets generally consist of a limited set of context data aimed at optimising specific application domains (human activity recognition is the most common example). In contrast, MDF contains a comprehensive set of information describing the user context in the mobile environment.

    The complete analysis of the data contained in MDF has been presented in the following publication:

    https://www.sciencedirect.com/science/article/abs/pii/S1574119220301383?via%3Dihub

    The full anonymised dataset is contained in the folder MDF. Moreover, in order to demonstrate the efficacy of MDF, there are three proof of concept context-aware applications based on different machine learning tasks:

    1. A social link prediction algorithm based on physical proximity data,
    2. The recognition of daily-life activities based on smartphone-embedded sensors data,
    3. A pervasive context-aware recommender system.

    For the sake of reproducibility, the data used to evaluate the proof-of-concept applications are contained in the folders link-prediction, context-recognition, and cars, respectively.

  5. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    ODC-BY: https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

    What is it?

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
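
    Given its size, a common way to explore FineWeb is to stream it rather than download it in full; a minimal sketch using the Hugging Face datasets library follows. The config name "CC-MAIN-2024-10" is an example dump and an assumption here; check the dataset page for the configs that actually exist.

    # Sketch: stream a few FineWeb documents instead of downloading the full dump.
    from datasets import load_dataset

    fw = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                      split="train", streaming=True)

    for i, doc in enumerate(fw):
        print(doc["text"][:200])   # each record carries the cleaned web text
        if i == 2:
            break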

  6. Web page phishing detection

    • data.mendeley.com
    Updated Sep 28, 2020
    + more versions
    Cite
    Abdelhakim Hannousse (2020). Web page phishing detection [Dataset]. http://doi.org/10.17632/c2gw7fy2j4.2
    Dataset updated
    Sep 28, 2020
    Authors
    Abdelhakim Hannousse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine learning based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their correspondent pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used for the extraction of the features for potential replication or extension.

    dataset_A: contains a list of URLs together with their DOM tree objects that can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifetime of phishing web pages.

    dataset_B: contains the extracted feature values that can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed with URLs, so one needs to remove the index before experimentation.
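
    A minimal sketch of the workflow described above, assuming a CSV export of dataset_B named "dataset_B.csv" with the URL as index and a "status" label column; both names are assumptions for illustration, and the real column names are documented with the dataset.

    # Sketch: drop the URL index, then train a baseline classifier on the 87 features.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("dataset_B.csv", index_col=0)   # URLs as index -> not a feature
    X = df.drop(columns=["status"])                  # extracted feature values
    y = df["status"]                                 # phishing vs legitimate

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))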

    The datasets were constructed in May 2020. Due to the huge size of dataset A, only a sample of the dataset is provided; it will be divided into sample files and uploaded one by one. If you urgently need a full copy, please contact the author directly at: hannousse.abdelhakim@univ-guelma.dz

  7. Data from: ComBase: A Web Resource for Quantitative and Predictive Food...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). ComBase: A Web Resource for Quantitative and Predictive Food Microbiology [Dataset]. https://catalog.data.gov/dataset/combase-a-web-resource-for-quantitative-and-predictive-food-microbiology-d652f
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    ComBase includes a systematically formatted database of quantified microbial responses to the food environment with more than 65,000 records, and is used for:
    • Informing the design of food safety risk management plans
    • Producing Food Safety Plans and HACCP plans
    • Reducing food waste
    • Assessing microbiological risk in foods
    The ComBase Browser enables you to search thousands of microbial growth and survival curves that have been collated in research establishments and from publications. The ComBase Predictive Models are a collection of software tools based on ComBase data to predict the growth or inactivation of microorganisms as a function of environmental factors such as temperature, pH and water activity in broth. Interested users can also contribute growth or inactivation data via the Donate Data page, which includes instructional videos, data template and sample, and an Excel demo file of data and macros for checking data format and syntax.
    Resources in this dataset:
    Resource Title: Website Pointer to ComBase. File Name: Web Page, url: https://www.combase.cc/index.php/en/
    ComBase is an online tool for quantitative food microbiology. Its main features are the ComBase database and ComBase models, and it can be accessed on any web platform, including mobile devices. The focus of ComBase is describing and predicting how microorganisms survive and grow under a variety of primarily food-related conditions. ComBase is a highly useful tool for food companies to understand safer ways of producing and storing foods. This includes developing new food products and reformulating foods, designing challenge test protocols, producing Food Safety plans, and helping public health organizations develop science-based food policies through quantitative risk assessment. Over 60,000 records have been deposited into ComBase, describing how food environments, such as temperature, pH, and water activity, as well as other factors (e.g. preservatives and atmosphere) affect the growth of bacteria. Each data record shows users how bacteria populations change for a particular combination of environmental factors. Mathematical models (the ComBase Predictor and Food models) were developed on systematically generated data to predict how various organisms grow or survive under various conditions.

  8. Annotated Terms of Service of 100 Online Platforms

    • data.mendeley.com
    Updated Dec 12, 2023
    + more versions
    Cite
    Przemyslaw Palka (2023). Annotated Terms of Service of 100 Online Platforms [Dataset]. http://doi.org/10.17632/dtbj87j937.3
    Dataset updated
    Dec 12, 2023
    Authors
    Przemyslaw Palka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains information about the contents of 100 Terms of Service (ToS) of online platforms. The documents were analyzed and evaluated from the point of view of the European Union consumer law. The main results have been presented in the table titled "Terms of Service Analysis and Evaluation_RESULTS." This table is accompanied by the instruction followed by the annotators, titled "Variables Definitions," allowing for the interpretation of the assigned values. In addition, we provide the raw data (analyzed ToS, in the folder "Clear ToS") and the annotated documents (in the folder "Annotated ToS," further subdivided).

    SAMPLE: The sample contains 100 contracts of digital platforms operating in sixteen market sectors: Cloud storage, Communication, Dating, Finance, Food, Gaming, Health, Music, Shopping, Social, Sports, Transportation, Travel, Video, Work, and Various. The selected companies' main headquarters span four legal surroundings: the US, the EU, Poland specifically, and Other jurisdictions. The chosen platforms are both privately held and publicly listed and offer both fee-based and free services. Although the sample cannot be treated as representative of all online platforms, it nevertheless accounts for the most popular consumer services in the analyzed sectors and contains a diverse and heterogeneous set.

    CONTENT: Each ToS has been assigned the following information: 1. Metadata: 1.1. the name of the service; 1.2. the URL; 1.3. the effective date; 1.4. the language of ToS; 1.5. the sector; 1.6. the number of words in ToS; 1.7–1.8. the jurisdiction of the main headquarters; 1.9. if the company is public or private; 1.10. if the service is paid or free. 2. Evaluative Variables: remedy clauses (2.1– 2.5); dispute resolution clauses (2.6–2.10); unilateral alteration clauses (2.11–2.15); rights to police the behavior of users (2.16–2.17); regulatory requirements (2.18–2.20); and various (2.21–2.25). 3. Count Variables: the number of clauses seen as unclear (3.1) and the number of other documents referred to by the ToS (3.2). 4. Pull-out Text Variables: rights and obligations of the parties (4.1) and descriptions of the service (4.2)

    ACKNOWLEDGEMENT: The research leading to these results has received funding from the Norwegian Financial Mechanism 2014-2021, project no. 2020/37/K/HS5/02769, titled “Private Law of Data: Concepts, Practices, Principles & Politics.”

  9. BIDS Phenotype External Example Dataset

    • openneuro.org
    Updated Jun 4, 2022
    Cite
    Samuel Guay; Eric Earl; Hao-Ting Wang; Remi Gau; Dorota Jarecka; David Keator; Melissa Kline Struhl; Satra Ghosh; Louis De Beaumont; Adam G. Thomas (2022). BIDS Phenotype External Example Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds004131.v1.0.1
    Dataset updated
    Jun 4, 2022
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Samuel Guay; Eric Earl; Hao-Ting Wang; Remi Gau; Dorota Jarecka; David Keator; Melissa Kline Struhl; Satra Ghosh; Louis De Beaumont; Adam G. Thomas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    BIDS Phenotype External Dataset Example COPY OF "The NIMH Healthy Research Volunteer Dataset" (ds003982)

    Modality-agnostic files were copied over and the CHANGES file was updated.

    THE ORIGINAL DATASET ds003982 README FOLLOWS

    A comprehensive clinical, MRI, and MEG collection characterizing healthy research volunteers collected at the National Institute of Mental Health (NIMH) Intramural Research Program (IRP) in Bethesda, Maryland using medical and mental health assessments, diagnostic and dimensional measures of mental health, cognitive and neuropsychological functioning, structural and functional magnetic resonance imaging (MRI), along with diffusion tensor imaging (DTI), and a comprehensive magnetoencephalography battery (MEG).

    In addition, blood samples of healthy volunteers are currently banked for future genetic and other analyses. All data collected in this protocol are broadly shared in the OpenNeuro repository, in the Brain Imaging Data Structure (BIDS) format, and task paradigms and basic pre-processing scripts are shared on GitHub. This dataset is unique in its depth of characterization of a healthy population in terms of brain health and will contribute to a wide array of secondary investigations of non-clinical and clinical research questions.

    This dataset is licensed under the Creative Commons Zero (CC0) v1.0 License.

    Recruitment

    Inclusion criteria for the study require that participants are adults at or over 18 years of age in good health with the ability to read, speak, understand, and provide consent in English. All participants provided electronic informed consent for online screening and written informed consent for all other procedures. Exclusion criteria include:

    • A history of significant or unstable medical or mental health condition requiring treatment
    • Current self-injury, suicidal thoughts or behavior
    • Current illicit drug use by history or urine drug screen
    • Abnormal physical exam or laboratory result at the time of in-person assessment
    • Less than an 8th grade education or IQ below 70
    • Current employees, or first-degree relatives of NIMH employees

    Study participants are recruited through direct mailings, bulletin boards and listservs, outreach exhibits, print advertisements, and electronic media.

    Clinical Measures

    All potential volunteers first visit the study website (https://nimhresearchvolunteer.ctss.nih.gov), check a box indicating consent, and complete preliminary self-report screening questionnaires. The study website is HIPAA compliant and therefore does not collect PII ; instead, participants are instructed to contact the study team to provide their identity and contact information. The questionnaires include demographics, clinical history including medications, disability status (WHODAS 2.0), mental health symptoms (modified DSM-5 Self-Rated Level 1 Cross-Cutting Symptom Measure), substance use survey (DSM-5 Level 2), alcohol use (AUDIT), handedness (Edinburgh Handedness Inventory), and perceived health ratings. At the conclusion of the questionnaires, participants are again prompted to send an email to the study team. Survey results, supplemented by NIH medical records review (if present), are reviewed by the study team, who determine if the participant is likely eligible for the protocol. These participants are then scheduled for an in-person assessment. Follow-up phone screenings were also used to determine if participants were eligible for in-person screening.

    In-person Assessments

    At this visit, participants undergo a comprehensive clinical evaluation to determine final eligibility to be included as a healthy research volunteer. The mental health evaluation consists of a psychiatric diagnostic interview (Structured Clinical Interview for DSM-5 Disorders, SCID-5), along with self-report surveys of mood (Beck Depression Inventory-II, BDI-II) and anxiety (Beck Anxiety Inventory, BAI) symptoms. An intelligence quotient (IQ) estimation is determined with the Kaufman Brief Intelligence Test, Second Edition (KBIT-2). The KBIT-2 is a brief (20-30 minute) assessment of intellectual functioning administered by a trained examiner. There are three subtests, including verbal knowledge, riddles, and matrices.

    Medical Evaluation

    Medical evaluation includes medical history elicitation and systematic review of systems. Biological and physiological measures include vital signs (blood pressure, pulse), as well as weight, height, and BMI. Blood and urine samples are taken and a complete blood count, acute care panel, hepatic panel, thyroid stimulating hormone, viral markers (HCV, HBV, HIV), C-reactive protein, creatine kinase, urine drug screen and urine pregnancy tests are performed. In addition, blood samples that can be used for future genomic analysis, development of lymphoblastic cell lines or other biomarker measures are collected and banked with the NIMH Repository and Genomics Resource (Infinity BiologiX). The Family Interview for Genetic Studies (FIGS) was later added to the assessment in order to provide better pedigree information; the Adverse Childhood Events (ACEs) survey was also added to better characterize potential risk factors for psychopathology. The entirety of the in-person assessment not only collects information relevant for eligibility determination, but it also provides a comprehensive set of standardized clinical measures of volunteer health that can be used for secondary research.

    MRI Scan

    Participants are given the option to consent for a magnetic resonance imaging (MRI) scan, which can serve as a baseline clinical scan to determine normative brain structure, and also as a research scan with the addition of functional sequences (resting state and diffusion tensor imaging). The MR protocol used was initially based on the ADNI-3 basic protocol, but was later modified to include portions of the ABCD protocol in the following manner:

    1. The T1 scan from ADNI3 was replaced by the T1 scan from the ABCD protocol.
    2. The Axial T2 2D FLAIR acquisition from ADNI2 was added, and fat saturation turned on.
    3. Fat saturation was turned on for the pCASL acquisition.
    4. The high-resolution in-plane hippocampal 2D T2 scan was removed and replaced with the whole brain 3D T2 scan from the ABCD protocol (which is resolution and bandwidth matched to the T1 scan).
    5. The slice-select gradient reversal method was turned on for DTI acquisition, and reconstruction interpolation turned off.
    6. Scans for distortion correction were added (reversed-blip scans for DTI and resting state scans).
    7. The 3D FLAIR sequence was made optional and replaced by one where the prescription and other acquisition parameters provide resolution and geometric correspondence between the T1 and T2 scans.

    At the time of the MRI scan, volunteers are administered a subset of tasks from the NIH Toolbox Cognition Battery. The four tasks include:

    1. Flanker inhibitory control and attention task assesses the constructs of attention and executive functioning.
    2. Executive functioning is also assessed using a dimensional change card sort test.
    3. Episodic memory is evaluated using a picture sequence memory test.
    4. Working memory is evaluated using a list sorting test.

    MEG

    An optional MEG study was added to the protocol approximately one year after the study was initiated, so there are relatively fewer MEG recordings in comparison to the MRI dataset. MEG studies are performed on a 275-channel CTF MEG system (CTF MEG, Coquitlam, BC, Canada). The position of the head was localized at the beginning and end of each recording using three fiducial coils. These coils were placed 1.5 cm above the nasion, and at each ear, 1.5 cm from the tragus on a line between the tragus and the outer canthus of the eye. For 48 participants (as of 2/1/2022), photographs were taken of the three coils and used to mark the points on the T1-weighted structural MRI scan for co-registration. For the remainder of the participants (n=16 as of 2/1/2022), a Brainsight neuronavigation system (Rogue Research, Montréal, Québec, Canada) was used to coregister the MRI and fiducial localizer coils in real time prior to MEG data acquisition.

    Specific Measures within Dataset

    Online and In-person behavioral and clinical measures, along with the corresponding phenotype file name, sorted first by measurement location and then by file name.

    Location | Measure | File Name
    Online | Alcohol Use Disorders Identification Test (AUDIT) | audit
    Online | Demographics | demographics
    Online | DSM-5 Level 2 Substance Use - Adult | drug_use
    Online | Edinburgh Handedness Inventory (EHI) | ehi
    Online | Health History Form | health_history_questions
    Online | Perceived Health Rating - self | health_rating
    Online | DSM-5 Self-Rated Level 1 Cross-Cutting Symptoms Measure – Adult (modified) | mental_health_questions
    Online | World Health Organization Disability Assessment Schedule
  10. Ground-Based Global Navigation Satellite System (GNSS) Observation Data...

    • data.nasa.gov
    • s.cnmilf.com
    • +6 more
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Ground-Based Global Navigation Satellite System (GNSS) Observation Data (30-second sampling, daily, 24 hour files) from NASA CDDIS [Dataset]. https://data.nasa.gov/dataset/ground-based-global-navigation-satellite-system-gnss-observation-data-30-second-sampling-d-f66f6
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This dataset consists of ground-based Global Navigation Satellite System (GNSS) Observation Data (30-second sampling, daily 24 hour files) from the NASA Crustal Dynamics Data Information System (CDDIS). GNSS provide autonomous geo-spatial positioning with global coverage. GNSS data sets from ground receivers at the CDDIS consist primarily of the data from the U.S. Global Positioning System (GPS) and the Russian GLObal NAvigation Satellite System (GLONASS). Since 2011, the CDDIS GNSS archive includes data from other GNSS (Europe’s Galileo, China’s Beidou, Japan’s Quasi-Zenith Satellite System/QZSS, the Indian Regional Navigation Satellite System/IRNSS, and worldwide Satellite Based Augmentation Systems/SBASs), which are similar to the U.S. GPS in terms of the satellite constellation, orbits, and signal structure. The daily GNSS observation files (un-compacted) contain one day of GPS or multi-GNSS observation (30-second sampling) data in RINEX format from a global permanent network of ground-based receivers, one file per site. More information about these data is available on the CDDIS website at https://cddis.nasa.gov/Data_and_Derived_Products/GNSS/daily_30second_data.html.

  11. Data from: Public Data Set of Protein–Ligand Dissociation Kinetic Constants...

    • acs.figshare.com
    xlsx
    Updated Jun 6, 2023
    Cite
    Huisi Liu; Minyi Su; Hai-Xia Lin; Renxiao Wang; Yan Li (2023). Public Data Set of Protein–Ligand Dissociation Kinetic Constants for Quantitative Structure–Kinetics Relationship Studies [Dataset]. http://doi.org/10.1021/acsomega.2c02156.s002
    Available download formats: xlsx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Huisi Liu; Minyi Su; Hai-Xia Lin; Renxiao Wang; Yan Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Protein–ligand binding affinity reflects the equilibrium thermodynamics of the protein–ligand binding process. Binding/unbinding kinetics is the other side of the coin. Computational models for interpreting the quantitative structure–kinetics relationship (QSKR) aim at predicting protein–ligand binding/unbinding kinetics based on protein structure, ligand structure, or their complex structure, which in principle can provide a more rational basis for structure-based drug design. Thus far, most of the public data sets used for deriving such QSKR models are rather limited in sample size and structural diversity. To tackle this problem, we have compiled a set of 680 protein–ligand complexes with experimental dissociation rate constants (koff), which were mainly curated from the references accumulated for updating our PDBbind database. The three-dimensional structure of each protein–ligand complex in this data set was either retrieved from the Protein Data Bank or carefully modeled based on a proper template. The entire data set covers 155 types of protein, with their dissociation kinetic constants (koff) spanning nearly 10 orders of magnitude. To the best of our knowledge, this data set is the largest of its kind reported publicly. Utilizing this data set, we derived a random forest (RF) model based on protein–ligand atom pair descriptors for predicting koff values. We also demonstrated that utilizing modeled structures as additional training samples will benefit the model performance. The RF model with mixed structures can serve as a baseline for testing other, more sophisticated QSKR models. The whole data set, namely, PDBbind-koff-2020, is available for free download at our PDBbind-CN web site (http://www.pdbbind.org.cn/download.php).
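
    An illustrative sketch (not the authors' code) of the modelling choice described above: fitting a random forest on precomputed protein–ligand descriptors to predict log-scaled koff. The file name "pdbbind_koff_2020.csv" and the "log_koff" column are hypothetical stand-ins for whatever descriptor table one derives from the released data.

    # Sketch: random forest regression on atom-pair descriptors for koff prediction.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("pdbbind_koff_2020.csv")     # hypothetical descriptor table
    X = data.drop(columns=["log_koff"])             # atom-pair descriptors
    y = data["log_koff"]                            # koff spans ~10 orders of magnitude

    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    print(cross_val_score(rf, X, y, cv=5, scoring="r2"))   # baseline cross-validated fit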

  12. Hazardous Waste Manifest Data (CT) 1984 – 2008

    • catalog.data.gov
    • data.ct.gov
    • +3 more
    Updated Jul 26, 2025
    + more versions
    Cite
    data.ct.gov (2025). Hazardous Waste Manifest Data (CT) 1984 – 2008 [Dataset]. https://catalog.data.gov/dataset/hazardous-waste-manifest-data-ct-1984-2008
    Dataset updated
    Jul 26, 2025
    Dataset provided by
    data.ct.gov
    Description

    NOTES:
    1. Please use this link to leave the data view and see the full description: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008/h6d8-qiar
    2. Please use ALL CAPS when searching with the "Filter" function on text such as: LITCHFIELD. This is not needed for the upper right corner "Find in this Dataset" search, where for example "Litchfield" can be used.
    Dataset Description: We know there are errors in the data although we strive to minimize them. Examples include:
    • Manifests completed incorrectly by the generator or the transporter - data was entered based on the incorrect information. We can only enter the information we receive.
    • Data entry errors - we now have QA/QC procedures in place to prevent or catch and fix a lot of these.
    • Historically there are multiple records of the same generator. Each variation in spelling in name or address generated a separate handler record. We have worked to minimize these but many remain. The good news is that as long as they all have the same EPA ID they will all show up in your search results.
    • Handlers provide erroneous data to obtain an EPA ID - data entry was based on erroneous information. Examples include incorrect or bogus addresses and names. There are also a lot of misspelled names and addresses.
    • Missing manifests - not every required manifest gets submitted to DEEP. Also, of the more than 100,000 paper manifests we receive each year, some were incorrectly handled and never entered.
    • Missing data - we know that the records for approximately 25 boxes of manifests, mostly prior to 1985, were lost from the database in the 1980's.
    • Translation errors - the data has been migrated to newer data platforms numerous times, and each time there have been errors and data losses.
    • Wastes incorrectly entered - mostly due to complex names that were difficult to spell, or typos in quantities or units of measure.
    Since Summer 2019, scanned images of manifest hardcopies may be viewed at the DEEP Document Online Search Portal: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/

  13. ckanext-fedmsg

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-fedmsg [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-fedmsg
    Dataset updated
    Jun 4, 2025
    Description

    The ckanext-fedmsg extension for CKAN enables real-time dissemination of changes made to the CKAN database to a fedmsg bus. This allows external systems and applications to be immediately notified of updates affecting public datasets and metadata within CKAN, promoting increased responsiveness and integration capabilities. By broadcasting these changes, the extension helps facilitate the creation of event-driven architectures around CKAN data.
    Key Features:
    • Real-time Message Broadcasting: Publishes messages to a fedmsg bus whenever modifications occur in the CKAN database, such as dataset creation, modification, or deletion.
    • Public Change Monitoring: Monitors and reports only public changes, ensuring that sensitive or internal modifications are not exposed.
    • Integration Simplicity: Designed to be easily integrated into existing CKAN installations by adding "fedmsg" to the ckan.plugins line in the CKAN configuration file.
    Technical Integration: The extension integrates with CKAN by hooking into the CKAN plugin system. Specifically, to enable the extension, you need to modify CKAN's INI configuration file by adding fedmsg to the ckan.plugins setting. This activates the extension and allows it to monitor and broadcast database changes as messages. The exact mechanics of fedmsg integration require further configuration of the fedmsg bus itself, but ckanext-fedmsg handles the publishing side of the integration.
    Benefits & Impact: By sending real-time updates to a fedmsg bus, ckanext-fedmsg allows for timely synchronization of CKAN data with other systems. This feature enables automated workflows, quicker data replication, and more responsive applications that rely on CKAN data. For example, an external website could automatically update its data catalog whenever a new dataset is published on the CKAN instance.
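
    A minimal sketch of the kind of external listener this enables, using the fedmsg Python consumption helper; it assumes a configured fedmsg bus, and the "ckan" topic filter is a guess since the extension's exact topic names are not documented in this summary.

    # Sketch: an external consumer reacting to CKAN change messages on the fedmsg bus.
    import fedmsg

    for name, endpoint, topic, msg in fedmsg.tail_messages():
        if "ckan" in topic:                     # assumed topic substring for CKAN-originated events
            print(topic, msg)                   # e.g. trigger a catalog refresh on dataset changes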

  14. Amazon Web Services - Public Data Sets

    • data.wu.ac.at
    Updated Oct 10, 2013
    + more versions
    Cite
    Global (2013). Amazon Web Services - Public Data Sets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NTYxNjkxNmYtNmZlNS00N2EwLWJkYTktZjFjZWJkNTM2MTNm
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Global
    Description

    About

    From website:

    Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

    Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

  15. Historical groundwater chemistry data compiled for the Placerita and Newhall...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Historical groundwater chemistry data compiled for the Placerita and Newhall Oil Fields, Los Angeles County, southern California [Dataset]. https://catalog.data.gov/dataset/historical-groundwater-chemistry-data-compiled-for-the-placerita-and-newhall-oil-fields-lo
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Newhall, Los Angeles County, California, Southern California
    Description

    This digital dataset contains historical geochemical and other information for 580 samples of groundwater from 12 wells located within the Placerita and Newhall Oil Fields in Los Angeles County, southern California. The sampled wells include 5 monitoring wells (Dataset IDs 1-5) associated with a groundwater remediation site, 6 water-supply wells (Dataset IDs 6-11) used to supply groundwater in support of oil production, and 1 well (Dataset ID 12) constructed as an observation well to monitor water-disposal operations. The groundwater remediation site wells represent a subset of a much larger number of monitoring wells that were selected for inclusion in this dataset because they were perforated within the regional groundwater aquifer, and the geochemistry data for these wells include the greatest variety of constituents available in the data source. The numerical water chemistry data, well locations, and well construction information for Dataset IDs 1-5 were compiled from PDFs (Portable Document Format) documents located on the California Department of Toxic Substances Control EnviroStor (DTSC-EnviroStor) website. Water chemistry data for the remaining wells were compiled from a combination of scanned laboratory analysis reports available from the California Geologic Energy Management Division (CalGEM) (Dataset IDs 6-9), analytical reports available as PDFs located on the State Water Resources Control Board GeoTracker (SWRCB-GT) website (Dataset ID 10), and analytical reports located within well history files in CalGEM's online Well Finder (WF) database (Dataset IDs 11-12). The availability of location and well construction information for Dataset IDs 6-12 varied by site. A combination of scanned laboratory analysis reports, CalGEM WF well history files, and California Department of Water Resources Well Completion Reports (CDWR-WCR) were the primary sources of location and well construction information for 5 wells (Dataset IDs 7-10, and 12) and 3 wells (Dataset IDs 8, 10, and 12), respectively. Approximate locations for Dataset IDs 7 and 9 represent meridian, township, range, and section (MTRS) centroids. Google Earth was used to determine the approximate location of Dataset ID 8 based on information from the WCR for that well. No location or well construction information was found for Dataset IDs 6 and 11. Data were manually compiled into two separate files described as follows: 1) a summary data file that includes well identifiers, location, construction, the number of chemistry samples, the period of record, specific sample dates for each site, and an inventory of which constituent groups were sampled on each date; and 2) a data file of geochemistry analyses for selected constituents classified into one of the following groups: water-quality indicators, major and minor ions, nutrients, trace elements, naturally occurring radioactive material (NORM), volatile organic compounds (VOCs), and miscellaneous organics compounds. Ion (charge) balance calculations and percent error of these calculations were included for samples having a complete suite of major ion analyses. Analytical method, reporting level, reporting level type, and supplemental notes were included where available or pertinent. A data dictionary was created to describe the geochemistry data file and is provided with this data release.
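
    The ion (charge) balance percent error mentioned above follows the standard formula 100 * (sum of cations - sum of anions) / (sum of cations + sum of anions), with concentrations in milliequivalents per litre. A small illustrative sketch follows; the ion lists and units are assumptions, and the release's data dictionary defines the actual fields.

    # Sketch: charge-balance percent error from meq/L concentrations.
    def charge_balance_error(cations_meq: dict, anions_meq: dict) -> float:
        """Percent charge-balance error; small absolute values indicate a consistent analysis."""
        cat = sum(cations_meq.values())
        an = sum(anions_meq.values())
        return 100.0 * (cat - an) / (cat + an)

    # Example: a roughly balanced sample should fall within a few percent.
    print(charge_balance_error(
        {"Ca": 2.5, "Mg": 1.1, "Na": 3.0, "K": 0.1},
        {"HCO3": 3.3, "SO4": 1.8, "Cl": 1.5},
    ))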

  16. Server Logs

    • kaggle.com
    Updated Oct 12, 2021
    Cite
    Vishnu U (2021). Server Logs [Dataset]. https://www.kaggle.com/vishnu0399/server-logs/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vishnu U
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The dataset is a synthetically generated server log based on the Apache server logging format. Each line corresponds to one log entry, made up of the components listed below (a parsing sketch follows the list).

    Components in a log entry:

    • IP of client: This refers to the IP address of the client that sent the request to the server.
    • Remote Log Name: Remote name of the User performing the request. In the majority of the applications, this is confidential information and is hidden or not available.
    • User ID: The ID of the user performing the request. In the majority of the applications, this is a piece of confidential information and is hidden or not available.
    • Date and Time in UTC format: The date and time of the request, represented as Day/Month/Year:Hour:Minutes:Seconds +Time-Zone-Correction.
    • Request Type: The type of request (GET, POST, PUT, DELETE) that the server received; this depends on the operation the request performs.
    • API: The API of the website to which the request is related. Example: when a user accesses the cart on a shopping website, the API appears as /usr/cart.
    • Protocol and Version: The protocol used to connect to the server, and its version.
    • Status Code: The status code that the server returned for the request. For example, 404 is sent when a requested resource is not found, and 200 is sent when the request was successfully served.
    • Byte: The amount of data in bytes that was sent back to the client.
    • Referrer: The websites/source from where the user was directed to the current website. If none it is represented by “-“.
    • UA String: The user agent string contains details of the browser and the host device (like the name, version, device type etc.).
    • Response Time: The response time the server took to serve the request. This is the difference between the timestamps when the request was received and when the request was served.
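
    As a rough illustration of how such entries can be consumed, here is a minimal parsing sketch for one line in this format. The exact field order and the sample line are assumptions based on the component list above, not a specification taken from the dataset itself.

     import re

     # Assumed layout: IP, remote log name, user ID, [timestamp], "METHOD API PROTOCOL",
     # status, bytes, "referrer", "user agent", response time.
     LOG_PATTERN = re.compile(
         r'(?P<ip>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
         r'"(?P<method>\S+) (?P<api>\S+) (?P<protocol>[^"]+)" '
         r'(?P<status>\d{3}) (?P<bytes>\d+) "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)" (?P<response_time>\d+)'
     )

     sample = ('203.0.113.7 - - [12/Oct/2021:10:15:32 +0000] "GET /usr/cart HTTP/1.1" '
               '200 1043 "-" "Mozilla/5.0" 4300')  # hypothetical entry, not from the dataset

     match = LOG_PATTERN.match(sample)
     if match:
         entry = match.groupdict()
         print(entry["api"], entry["status"], entry["response_time"])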

    Content

    The dataset consists of two files:

    • logfiles.log: the actual log file in text format
    • TestFileGenerator.py: the synthetic log file generator; the number of log entries required can be edited in the code.

  17. Amazon Fine Food Reviews

    • kaggle.com
    zip
    Updated May 1, 2017
    Cite
    Stanford Network Analysis Project (2017). Amazon Fine Food Reviews [Dataset]. https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
    Explore at:
    zip (253873708 bytes). Available download formats.
    Dataset updated
    May 1, 2017
    Dataset authored and provided by
    Stanford Network Analysis Project
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.

    Contents

    • Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite
    • database.sqlite: Contains the table 'Reviews' (a minimal query sketch follows the data summary below)

    Data includes:
    - Reviews from Oct 1999 - Oct 2012
    - 568,454 reviews
    - 256,059 users
    - 74,258 products
    - 260 users with > 50 reviews
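
    A quick way to pull a sample of the reviews is to query the SQLite file directly. The snippet below is a minimal sketch; the table name Reviews comes from the description above, while the LIMIT clause and the use of pandas are illustrative choices.

     import sqlite3
     import pandas as pd

     # Open the bundled SQLite database and pull a small sample from the Reviews table.
     con = sqlite3.connect("database.sqlite")
     sample = pd.read_sql_query("SELECT * FROM Reviews LIMIT 5", con)
     con.close()

     print(sample.shape)
     print(sample.columns.tolist())  # inspect the available columns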


    Acknowledgements

    See this SQLite query for a quick sample of the dataset.

    If you publish articles based on this dataset, please cite the following paper:

  18. E-commerce - Users of a French C2C fashion store

    • kaggle.com
    Updated Feb 24, 2024
    Cite
    Jeffrey Mvutu Mabilama (2024). E-commerce - Users of a French C2C fashion store [Dataset]. https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store/notebooks
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 24, 2024
    Dataset provided by
    Kaggle
    Authors
    Jeffrey Mvutu Mabilama
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France
    Description

    Foreword

    This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc...).

    My Telegram bot will answer your queries and allow you to contact me.

    Context

    There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.

    Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other user's articles).

    This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try to understand what you can expect of your users and anticipate how your store may grow.

    • For instance, if you see that most of your users are not very active, you may look into this dataset to compare your store's performance.

    If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to better suit your needs.

    This dataset is part of a preview of a much larger dataset. Please contact me for more.

    Content

    The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.

    Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Questions you might want to answer using this dataset:

    • Are e-commerce users interested in social network features?
    • Are my users active enough (compared to those of this dataset)?
    • How likely are people from other countries to sign up on a C2C website?
    • How many users are likely to drop off after years of using my service?

    Example works:

    • Report(s) made using SQL queries can be found on the data.world page of the dataset.
    • Notebooks may be found on the Kaggle page of the dataset.

    License

    CC-BY-NC-SA 4.0

    For other licensing options, contact me.

  19. Data from: Commercial and Residential Hourly Load Profiles for all TMY3 Locations in the United States

    • catalog.data.gov
    • data.openei.org
    • +2more
    Updated Jun 19, 2024
    + more versions
    Cite
    National Renewable Energy Laboratory (2024). Commercial and Residential Hourly Load Profiles for all TMY3 Locations in the United States [Dataset]. https://catalog.data.gov/dataset/commercial-and-residential-hourly-load-profiles-for-all-tmy3-locations-in-the-united-state-bbc75
    Explore at:
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    National Renewable Energy Laboratory
    Area covered
    United States
    Description

    Note: This dataset has been superseded by the dataset found at "End-Use Load Profiles for the U.S. Building Stock" (submission 4520; linked in the submission resources), which is a comprehensive and validated representation of hourly load profiles in the U.S. commercial and residential building stock. The End-Use Load Profiles project website includes links to data viewers for this new dataset. For documentation of dataset validation, model calibration, and uncertainty quantification, see Wilson et al. (2022). These data were first created around 2012 as a byproduct of various analyses of solar photovoltaics and solar water heating (see references below for two examples). This dataset contains several errors and limitations, and it is recommended that users transition to the updated version of the dataset posted in the resources. This dataset contains weather data, commercial load profile data, and residential load profile data.

    Weather: The Typical Meteorological Year 3 (TMY3) provides one year of hourly data for around 1,000 locations. The TMY weather represents 30-year normals, which are typical weather conditions over a 30-year period.

    Commercial: The commercial load profiles included are the 16 ASHRAE 90.1-2004 DOE Commercial Prototype Models simulated in all TMY3 locations, with building insulation levels changing based on ASHRAE 90.1-2004 requirements in each climate zone. The folder names within each resource represent the weather station location of the profiles, whereas the file names represent the building type and the representative city for the ASHRAE climate zone that was used to determine code-compliance insulation levels. As indicated by the file names, all building models represent construction that complied with the ASHRAE 90.1-2004 building energy code requirements. No older or newer vintages of buildings are represented.

    Residential: The BASE residential load profiles are five EnergyPlus models (one per climate region) representing 2009 IECC construction single-family detached homes simulated in all TMY3 locations. No older or newer vintages of buildings are represented. Each of the five climate regions includes only one heating fuel type; electric heating is only found in the Hot-Humid climate. Air conditioning is not found in the Marine climate region. One major issue with the residential profiles is that for each of the five climate zones, certain location-specific algorithms from one city were applied to the entire climate zone. For example, in the Hot-Humid files, the heating season calculated for Tampa, FL (December 1 - March 31) was unknowingly applied to all other locations in the Hot-Humid zone, which restricts heating operation outside of those days (for example, heating is disabled in Dallas, TX during cold weather in November). This causes the heating energy to be artificially low in colder parts of that climate zone, and conversely the cooling season restriction leads to artificially low cooling energy use in hotter parts of each climate zone. Additionally, the ground temperatures for the representative city were used across the entire climate zone. This affects water heating energy use (because inlet cold water temperature depends on ground temperature) and heating/cooling energy use (because of ground heat transfer through foundation walls and floors). Representative cities were Tampa, FL (Hot-Humid), El Paso, TX (Mixed-Dry/Hot-Dry), Memphis, TN (Mixed-Humid), Arcata, CA (Marine), and Billings, MT (Cold/Very-Cold).

    The residential dataset includes a HIGH building load profile that was intended to provide a rough approximation of older home vintages, but it combines poor thermal insulation with larger house size, tighter thermostat setpoints, and less efficient HVAC equipment. Conversely, the LOW building combines excellent thermal insulation with smaller house size, wider thermostat setpoints, and more efficient HVAC equipment. However, it is not known how well these HIGH and LOW permutations represent the range of energy use in the housing stock. Note that on July 2nd, 2013, the Residential High and Low load files were updated from 366 days in a year for leap years to the more general 365 days in a normal year. The archived residential load data from prior to that date is also included.
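
    For readers who do download the legacy files, each profile is a simple hourly time series that can be read with standard tooling. The following is a minimal sketch; the file name and column label shown here are assumptions for illustration and should be verified against the actual files in the resources.

     import pandas as pd

     # Hypothetical file name; actual names encode building type and representative city.
     csv_path = "RefBldgMediumOfficeNew2004_USA_CA_LOS.ANGELES.csv"  # assumed example

     # Each legacy profile is an 8,760-row hourly table; column labels vary by end use.
     profile = pd.read_csv(csv_path)
     print(profile.columns.tolist())  # inspect the available end-use columns

     # Sum an assumed whole-building electricity column into an annual total (kWh).
     col = "Electricity:Facility [kW](Hourly)"  # assumed label; verify in the file
     if col in profile.columns:
         print("Annual electricity:", profile[col].sum(), "kWh")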

  20. 3D MNIST

    • kaggle.com
    Updated Oct 18, 2019
    Cite
    David de la Iglesia Castro (2019). 3D MNIST [Dataset]. https://www.kaggle.com/daavoo/3d-mnist/home
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    David de la Iglesia Castro
    Description

    Context

    The aim of this dataset is to provide a simple way to get started with 3D computer vision problems such as 3D shape recognition.

    Accurate 3D point clouds can (easily and cheaply) be acquired nowadays from different sources.

    However, there is a lack of large 3D datasets (you can find a good one here based on triangular meshes); it's especially hard to find datasets based on point clouds (which is the raw output from every 3D sensing device).

    This dataset contains 3D point clouds generated from the original images of the MNIST dataset to bring a familiar introduction to 3D to people used to working with 2D datasets (images).

    In the 3D_from_2D notebook you can find the code used to generate the dataset.

    You can use the code in the notebook to generate a bigger 3D dataset from the original.

    Content

    full_dataset_vectors.h5

    The entire dataset is stored as 4096-D vectors obtained from the voxelization (x:16, y:16, z:16) of all the 3D point clouds.

    In addition to the original point clouds, it contains randomly rotated copies with noise.

    The full dataset is split into arrays:

    • X_train (10000, 4096)
    • y_train (10000)
    • X_test(2000, 4096)
    • y_test (2000)

    Example Python code reading the full dataset:

     import h5py

     # Load the voxelized 4096-D vectors; the raw point clouds live in
     # train_point_clouds.h5 / test_point_clouds.h5 (described below).
     with h5py.File("../input/full_dataset_vectors.h5", "r") as hf:
         X_train = hf["X_train"][:]
         y_train = hf["y_train"][:]
         X_test = hf["X_test"][:]
         y_test = hf["y_test"][:]

    train_point_clouds.h5 & test_point_clouds.h5

    5000 (train), and 1000 (test) 3D point clouds stored in HDF5 file format. The point clouds have zero mean and a maximum dimension range of 1.

    Each file is divided into HDF5 groups.

    Each group is named after its corresponding array index in the original MNIST dataset and contains:

    • "points" dataset: x, y, z coordinates of each 3D point in the point cloud.
    • "normals" dataset: nx, ny, nz components of the unit normal associate to each point.
    • "img" dataset: the original mnist image.
    • "label" attribute: the original mnist label.

    Example Python code reading two digits and storing some of the group content in tuples:

     import h5py

     # Each top-level group ("0", "1", ...) holds the data for one digit sample.
     with h5py.File("../input/train_point_clouds.h5", "r") as hf:
         a = hf["0"]
         b = hf["1"]
         digit_a = (a["img"][:], a["points"][:], a.attrs["label"])
         digit_b = (b["img"][:], b["points"][:], b.attrs["label"])

    voxelgrid.py

    Simple Python class that generates a grid of voxels from the 3D point cloud. See the kernel for example usage.

    plot3D.py

    Module with functions to plot point clouds and voxel grids inside a Jupyter notebook. You have to run this locally because Kaggle notebooks do not support rendering IFrames; see the GitHub issue here.

    Functions included (a usage sketch follows this list):

    • array_to_color: converts a 1D array to RGB values for use as the color kwarg in plot_points()

    • plot_points(xyz, colors=None, size=0.1, axis=False)

    • plot_voxelgrid(v_grid, cmap="Oranges", axis=False)
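
    As a rough usage sketch (assuming the helper module is importable from the working directory and that array_to_color and plot_points behave as described above), plotting one training digit might look like this:

     import h5py

     from plot3D import array_to_color, plot_points  # helper module shipped with the dataset

     # Read one point cloud and color its points by z-coordinate (illustrative choice).
     with h5py.File("../input/train_point_clouds.h5", "r") as hf:
         points = hf["0"]["points"][:]

     colors = array_to_color(points[:, 2])   # map heights to RGB values
     plot_points(points, colors=colors, size=0.1, axis=False)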

    Acknowledgements

    Have fun!
