100+ datasets found
  1. Data from: Login Data Set for Risk-Based Authentication

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jun 30, 2022
    Cite
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono (2022). Login Data Set for Risk-Based Authentication [Dataset]. http://doi.org/10.5281/zenodo.6782156
    Available download formats: zip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Login Data Set for Risk-Based Authentication

    Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

    This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

    The users used this SSO to access sensitive data provided by the online service, e.g., a cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce these results made on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

    WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in productive systems, e.g., intrusion detection systems.

    Overview

    The data set contains the following features related to each login attempt on the SSO:

    Feature | Data Type | Description | Range or Example
    IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255
    Country | String | Country derived from the IP address | US
    Region | String | Region derived from the IP address | New York
    City | String | City derived from the IP address | Rochester
    ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000
    User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ...
    OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10
    Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538
    Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [see note 1]
    User ID | Integer | Identification number related to the affected user account | [Random pseudonym]
    Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp]
    Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000
    Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false)
    Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false)
    Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false)

    Data Creation

    As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and timely order between the features.

    The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

    • The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

    • The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.

    • The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.

    Regarding the Data Values

    Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.

    You can recognize them by the following values:

    • ASNs with values >= 500,000

    • IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
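
    The two criteria above can be turned into a simple filter. The following is a minimal sketch, not part of the published data set: the function name is_placeholder and the sample rows are made up for illustration, and only the two thresholds come from the description.

```python
import ipaddress

# Thresholds taken from the data set description above.
PLACEHOLDER_ASN_MIN = 500_000
PLACEHOLDER_NET = ipaddress.ip_network("10.0.0.0/8")


def is_placeholder(ip: str, asn: int) -> bool:
    """Return True if the IP address or the ASN is one of the artificial values."""
    return asn >= PLACEHOLDER_ASN_MIN or ipaddress.ip_address(ip) in PLACEHOLDER_NET


# Made-up (IP address, ASN) pairs, for illustration only.
rows = [("10.4.7.1", 29695), ("84.35.213.10", 500123), ("84.35.213.10", 29695)]
flags = [is_placeholder(ip, asn) for ip, asn in rows]
print(flags)
```

    Whether such rows should be dropped or kept depends on the analysis; according to the description, they do not affect the risk scores of the Freeman et al. (2016) model.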

    Study Reproduction

    Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

    The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

    See RESULTS.md for more details.

    Ethics

    By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

    The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

    Publication

    You can find more details on our conducted study in the following journal article:

    Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
    ACM Transactions on Privacy and Security

    Bibtex

    @article{Wiefling_Pump_2022,
     author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
     title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
     journal = {{ACM} {Transactions} on {Privacy} and {Security}},
     doi = {10.1145/3546069},
     publisher = {ACM},
     year  = {2022}
    }

    License

    This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

    1. A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.

  2. Data from: NACP Regional: Original Observation Data and Biosphere and...

    • catalog.data.gov
    • search.dataone.org
    • +4 more
    Updated Sep 19, 2025
    Cite
    ORNL_DAAC (2025). NACP Regional: Original Observation Data and Biosphere and Inverse Model Outputs [Dataset]. https://catalog.data.gov/dataset/nacp-regional-original-observation-data-and-biosphere-and-inverse-model-outputs-acd36
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Oak Ridge National Laboratory Distributed Active Archive Center
    Description

    This data set contains the originally-submitted observation measurement data, terrestrial biosphere model output data, and inverse model simulations that various investigator teams contributed to the North American Carbon Program (NACP) Regional Synthesis activities. The data set provides nine (9) data packages of remote sensing and ground observation measurements (OM) (MODIS gross primary productivity (GPP), MODIS net primary production (NPP), MODIS fraction of photosynthetically active radiation (fPar), MODIS leaf area index (LAI), MODIS enhanced vegetation index (EVI), MODIS normalized difference vegetation index (NDVI), Forest Inventory and Analysis (FIA) forest biomass, National Agricultural Statistics Service (NASS) crop NPP, and Flux Anomaly). The data set also provides data packages of simulation results from 19 terrestrial biosphere models (TBM) and eight (8) inverse models (IM). The data packages are respectively OM, TBM, and IM data files listed in Tables 4-6. Each OM, TBM, and IM data package contains all of the original data (and documentation, if any) that the NACP Modeling and Synthesis Thematic Data Center (MAST-DC) acquired or received. These originally-submitted data were processed by the MAST-DC to produce the three standardized gridded data sets of carbon flux for inter-comparison purposes (see Related Data Products below). These original data and documentation are provided so that users of the standardized gridded data products can trace back to the data origins when needed. The Data Center (ORNL DAAC) transformed some of the originally-submitted data files to file formats that are more suitable for long-term archiving. For example, .xlsx files were saved as .csv, ERDAS Imagine files were converted to GeoTIFFs, and MATLAB files were converted to GeoTIFF and NetCDF formats as appropriate. Files received in NetCDF, GeoTIFF, and HDF formats were not transformed.

  3. SweFraCas 1.0

    • researchdata.se
    • data.europa.eu
    Updated Jan 1, 2024
    Cite
    Språkbanken Text (2024). SweFraCas 1.0 [Dataset]. http://doi.org/10.23695/GFWN-QK37
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    University of Gothenburg
    Authors
    Språkbanken Text
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    I. IDENTIFYING INFORMATION

    Title* SweFracas v1.0

    Subtitle A Swedish version of the Fracas inference/entailment dataset

    Created by* Lars Borin (lars.borin@gu.se)

    Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)

    Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/swefracas

    License(s)* CC BY 4.0

    Abstract* A textual inference/entailment problem set, derived from FraCas. The original English Fracas [1] was converted to html and edited by Bill MacCartney [2], and then automatically translated to Swedish by Peter Ljunglöf and Magdalena Siverbo [3]. The current tabular form of the set was created by Aleksandrs Berdicevskis by merging the Swedish and English versions and removing some of the problems. Finally, Lars Borin went through all the translations, correcting and Swedifying them manually. As a result, many translations are rather liberal and diverge noticeably from the English original.

    Funded by* Vinnova (grant no. 2020-02523)

    Cite as

    Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim). See also Abstract

    II. USAGE

    Key applications Machine Learning, Inference, Entailment, Evaluation of language models, Diagnostics

    Intended task(s)/usage(s) (1) Evaluate models on the following task: given the question and the premises, choose the suitable answer (Ja 'Yes'; Nej 'No'; Vet ej 'Don't know'; Jo 'Positive answer to a negated question')

    Recommended evaluation measures (1) R4 (Matthews correlation coefficient)

    Dataset function(s) Testing

    Recommended split(s) Test data only

    III. DATA

    Primary data* Text

    Language* Swedish

    Dataset in numbers* 305 problems

    Nature of the content* Inference problems, where a question has to be answered, given a number of premises

    Format* Tab-separated, five columns: "id" -- unique integer id of the problem; "original_id" -- the id of the corresponding problem in the original dataset; "attribute" -- which attribute the row within the problem contains: "premiss" (premise), "fråga" (question), "svar" (answer), "why" and "note". The latter two are taken from MacCartney's conversion and refer only to English data. They are kept for information only; "value" -- the Swedish sentence. "why" and "note" are always empty for Swedish; "original_value" -- the original English sentence. Provided for information only. Note that many translations are rather liberal.
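
    The long format described above (one row per attribute) can be regrouped into one record per problem. The following is a minimal sketch under two assumptions that the documentation does not confirm: that the file starts with a header row carrying exactly these five column names, and that no quoting is used. The function name load_problems and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_problems(tsv_text):
    """Group long-format rows into one dict per problem id."""
    problems = defaultdict(lambda: {"premiss": [], "fråga": None, "svar": None})
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entry = problems[int(row["id"])]
        attribute, value = row["attribute"], row["value"]
        if attribute == "premiss":
            entry["premiss"].append(value)   # a problem can have several premises
        elif attribute in ("fråga", "svar"):
            entry[attribute] = value
        # "why" and "note" rows are skipped: they refer only to the English data
    return dict(problems)


# Made-up placeholder rows; the header row is an assumption.
sample = (
    "id\toriginal_id\tattribute\tvalue\toriginal_value\n"
    "1\t1\tpremiss\tpremiss-text-1\tpremise-text-1\n"
    "1\t1\tpremiss\tpremiss-text-2\tpremise-text-2\n"
    "1\t1\tfråga\tfråga-text\tquestion-text\n"
    "1\t1\tsvar\tJa\tYes\n"
    "1\t1\twhy\t\twhy-text\n"
)
problems = load_problems(sample)
print(problems[1])
```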

    Data source(s)* See Abstract

    Data collection method(s)* See Abstract

    Data selection and filtering* 41 problems in the original set did not have a definite answer (different answers were possible depending on the interpretation). They were excluded.

    Data preprocessing* None

    Data labeling* Most of the labels map straightforwardly on the original English labels (Yes → Ja, Don't know → Vet ej, No → Nej), with three exceptions: 97, 98 (Nej → Jo) and 108 (No → Vet ej)

    Annotator characteristics PhD in linguistics; native speaker of Swedish

    IV. ETHICS AND CAVEATS

    Ethical considerations

    Things to watch out for In the original dataset, all examples were classified by the linguistic phenomena they represent. It is not necessary that the Swedish translations follow exactly the same classification (most of them probably do, but it has not been checked).

    V. ABOUT DOCUMENTATION

    Data last updated* 2021-06-09, v1.0

    Which changes have been made, compared to the previous version* This is the first official version

    Access to previous versions

    This document created* 2021-06-09, Aleksandrs Berdicevskis

    This document last updated* 2021-06-09, Aleksandrs Berdicevskis

    Where to look for further details

    Documentation template version* v1.0

    VI. OTHER

    Related projects

    References [1] Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report, Technical Report LRE 62-051 D-16, The FraCaS Consortium. ftp://ftp.cogsci.ed.ac.uk/pub/FRACAS/del16.ps.gz [2] https://nlp.stanford.edu/~wcmac/downloads/fracas.xml [3] Peter Ljunglöf and Magdalena Siverbo. 2012. A bilingual treebank for the FraCas test suite. In SLTC 2012, page 53. https://gup.ub.gu.se/publication/168965?lang=en

  4. Vehicle licensing statistics data tables

    • gov.uk
    • s3.amazonaws.com
    Updated Oct 15, 2025
    Cite
    Department for Transport (2025). Vehicle licensing statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/vehicle-licensing-statistics-data-tables
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    GOV.UK
    Authors
    Department for Transport
    Description

    Data files containing detailed information about vehicles in the UK are also available, including make and model data.

    Some tables have been withdrawn and replaced. The table index for this statistical series has been updated to provide a full map between the old and new numbering systems used in this page.

    The Department for Transport is committed to continuously improving the quality and transparency of our outputs, in line with the Code of Practice for Statistics. In line with this, we have recently concluded a planned review of the processes and methodologies used in the production of Vehicle licensing statistics data. The review sought to seek out and introduce further improvements and efficiencies in the coding technologies we use to produce our data and as part of that, we have identified several historical errors across the published data tables affecting different historical periods. These errors are the result of mistakes in past production processes that we have now identified, corrected and taken steps to eliminate going forward.

    Most of the revisions to our published figures are small, typically changing values by less than 1% to 3%. The key revisions are:

    Licensed Vehicles (2014 Q3 to 2016 Q3)

    We found that some unlicensed vehicles during this period were mistakenly counted as licensed. This caused a slight overstatement, about 0.54% on average, in the number of licensed vehicles during this period.

    3.5 - 4.25 tonnes Zero Emission Vehicles (ZEVs) Classification

    Since 2023, ZEVs weighing between 3.5 and 4.25 tonnes have been classified as light goods vehicles (LGVs) instead of heavy goods vehicles (HGVs). We have now applied this change to earlier data and corrected an error in table VEH0150. As a result, the number of newly registered HGVs has been reduced by:

    • 3.1% in 2024

    • 2.3% in 2023

    • 1.4% in 2022

    Table VEH0156 (2018 to 2023)

    Table VEH0156, which reports average CO₂ emissions for newly registered vehicles, has been updated for the years 2018 to 2023. Most changes are minor (under 3%), but the e-NEDC measure saw a larger correction, up to 15.8%, due to a calculation error. Changes to the other measures (WLTP and Reported) were less notable, except for April 2020, when COVID-19 led to very few new registrations and therefore to greater volatility in the resultant percentages.

    Neither these specific revisions, nor any of the others introduced, have had a material impact on the statistics overall, the direction of trends nor the key messages that they previously conveyed.

    Specific details of each revision made have been included in the relevant data table notes to ensure transparency and clarity. Users are advised to review these notes as part of their regular use of the data to ensure their analysis accounts for these changes accordingly.

    If you have questions regarding any of these changes, please contact the Vehicle statistics team.

    All vehicles

    Licensed vehicles

    Overview

    VEH0101 (https://assets.publishing.service.gov.uk/media/68ecf5acf159f887526bbd7c/veh0101.ods): Vehicles at the end of the quarter by licence status and body type: Great Britain and United Kingdom (ODS, 99.7 KB)

    Detailed breakdowns

    VEH0103 (https://assets.publishing.service.gov.uk/media/68ecf5abf159f887526bbd7b/veh0103.ods): Licensed vehicles at the end of the year by tax class: Great Britain and United Kingdom (ODS, 23.8 KB)

    VEH0105 (https://assets.publishing.service.gov.uk/media/68ecf5ac2adc28a81b4acfc8/veh0105.ods): Licensed vehicles at

  5. Code and data set for data analysis published as manuscript "Bacttle: a...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 23, 2024
    Cite
    University of Lausanne; Miguel Trabajo, Tania; Benigno, Valentina; Dorcey, Eavan; van der Meer, Jan Roelof (2024). Code and data set for data analysis published as manuscript "Bacttle: a microbiology educational board game for lay public and schools" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_12800100
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    University of Lausanne
    Authors
    University of Lausanne; Miguel Trabajo, Tania; Benigno, Valentina; Dorcey, Eavan; van der Meer, Jan Roelof
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code that processes the raw data and plots the figures of the manuscript "Bacttle: a microbiology educational board game for lay public and schools"

    Below is a table with the original survey questions. The ID corresponds to the column displayed on the data set. When letters are followed by a number (1 or 2), it means that the question was answered before playing the game (1) and after playing the game (2).

    ID (1) | Question text | Possible answers (2)
    A | How old are you? |
    B | Do you know what a bacterium is? | y/n
    C | Do you know what a bacterial capsule is? | y/n
    D | Do bacteria have tools to harm each other? | y/n/idk
    E | Do bacteria reproduce at the same pace? | y/n/idk
    F | What is sporulation? | A resistant state that some bacteria can achieve under unfavorable conditions. / The release of toxins by bacteria. / idk
    G | What are flagella used for? | Sticking to surfaces. / Motility in liquid environments. / idk
    H | What does it mean to be lithotrophic? | A bacterium can get energy from minerals. / A bacterium can get energy from the sunlight. / idk
    I | Can bacteria be infected by viruses? | y/n/idk
    J | Are all bacteria harmful for humans? | y/n/idk
    K | How many bacteria are in a coffee spoon of yoghurt? | Millions / Hundreds / idk
    L | How easy did you find the gameplay? | VE/E/A/D/VD
    M | Did you find the card content easy to understand? | VE/E/A/D/VD
    N | Did you like the setup of the game? | y/n/idk
    O | Would you like to play this game again? | y/n/idk
    P | What can we improve? |

    1) Question A categorizes the player's age; B and C assess the initial level of knowledge in microbiology: none (both questions are answered negatively), basic (the player knows what a bacterium is but not a bacterial capsule), or advanced (both answers are positive); questions D-I score knowledge acquisition; J and K are control questions; L-O evaluate the appreciation of the game; and P is an optional free text-entry answer for additional feedback.

    2) y = yes, n = no, idk = I don't know, VE = very easy, E = easy, A = adequate, D = difficult, VD = very difficult.
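
    The knowledge-level rule in note 1 (based on the answers to questions B and C) can be sketched as a small function. This is an illustration only, not the authors' analysis code: the function name is hypothetical, and returning None for the combination the note does not define (B answered "n", C answered "y") is an assumption.

```python
def initial_knowledge_level(answer_b, answer_c):
    """Map the y/n answers to questions B and C to none/basic/advanced."""
    b = answer_b.strip().lower()
    c = answer_c.strip().lower()
    if b == "n" and c == "n":
        return "none"        # both questions answered negatively
    if b == "y" and c == "n":
        return "basic"       # knows a bacterium, but not a bacterial capsule
    if b == "y" and c == "y":
        return "advanced"    # both answers positive
    return None              # combination not covered by the note (assumption)


levels = [initial_knowledge_level(b, c) for b, c in [("n", "n"), ("y", "n"), ("y", "y")]]
print(levels)
```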

  6. Facebook users worldwide 2017-2027

    • statista.com
    • de.statista.com
    Cite
    Stacy Jo Dixon, Facebook users worldwide 2017-2027 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users has increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
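
    The three figures quoted above (391 million, +14.36 percent, 3.1 billion) can be checked against each other with a short calculation. This is only a consistency sketch, assuming the percentage is measured relative to the 2023 user base; the variable names are made up.

```python
increase_m = 391.0   # forecast increase 2023-2027 in millions (from the text)
growth = 0.1436      # +14.36 percent (from the text), assumed relative to 2023

base_2023_m = increase_m / growth            # implied 2023 user base in millions
total_2027_m = base_2023_m + increase_m      # implied 2027 user base in millions

print(round(base_2023_m), round(total_2027_m))
```

    Under that assumption the implied 2027 total rounds to 3.1 billion, which matches the figure in the description.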

  7. Data from: Evapotranspiration Data

    • data.wu.ac.at
    • search.dataone.org
    co%3b2, html
    Updated Dec 12, 2017
    Cite
    Department of the Interior (2017). Evapotranspiration Data [Dataset]. https://data.wu.ac.at/schema/data_gov/YzJjY2FkZDctYjU0Yy00MTMxLTllMzMtNzFlMjlkZmFjYjRk
    Available download formats: html, co%3b2
    Dataset updated
    Dec 12, 2017
    Dataset provided by
    Department of the Interior
    Description

    A regional evaluation of evapotranspiration (ET) in the Florida Everglades began in 1996 with operation of 9 sites at locations selected to represent the sawgrass or cattail marshes, wet prairie, and open-water areas that constitute most of the natural Everglades system. The Bowen-ratio energy-budget method was used to measure ET at 30-minute intervals. Site models were developed to determine ET for intervals when a Bowen ratio could not be accurately determined. Regional models were then developed for determining 30-minute ET at any location as a function of solar intensity and water depth using data from the 9 sites for 1996-97. Five of the original 9 sites continued in operation after 1997 for various periods. Two of these sites were operated continuously until September 2003. Three new sites were installed in the western part of Shark Valley in November 2001 for the purpose of testing regional model transferability. Additionally, an evaporation pan was installed at one site in April 2001 for comparing actual ET determined by the Bowen-ratio site with potential pan evaporation. All data collection ended in September 2003. The dataset contains the meteorological and evapotranspiration data. Additionally, tables listing model coefficients and goodness-of-fit statistics for site models for the period 1998-2003 are included, as well as tables comparing measured ET with ET estimated from the regional models. Data is available by year for each of the collection sites. The a_read_me file in the Data summary and data files for Everglades ET sites, 1996-2003 describes the format of data files of meteorological and evapotranspiration data.

    This latest data release is different in format from the original release for all data from 1998 on. No changes were made in the 1996-97 data. One change made in reporting format is that ET data from 1998 on are not smoothed by averaging over one or more measurement intervals. With this release, data are provided at the measurement interval so that users may apply whatever smoothing technique is appropriate for the intended use. Another change in format for data from 1998 on is that ET sums are provided for "raw" and "edited" 30-minute periods. The "raw" data refer to ET sums that have not been edited from computed results, although the ET sum may be an actual measurement that has passed all input-data screening tests (see WRI 00-4217), or may be a "gap-filled" value computed from the Priestley-Taylor site model that was developed using only data that passed all screening tests. Data in the "edited" column have been edited graphically by comparing each value to the pattern of ET defined by the entire set of data during part of a day. The final change in format for data from 1998 on is that a flag indicator is provided to show which 30-minute ET data are measured and which are model derived because the input data did not pass screening criteria.

  8. Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

    • acs.figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
    Available download formats: xlsx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie J. Tyler; Neal Gallagher; Matthew R. Linford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.

  9. Online News Popularity Data Set

    • academictorrents.com
    • kaggle.com
    bittorrent
    Updated Feb 11, 2016
    Cite
    Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela (2016). Online News Popularity Data Set [Dataset]. https://academictorrents.com/details/95d3b03397a0bafd74a662fe13ba3550c13b7ce1
    Explore at:
    Available download formats: bittorrent (7476401)
    Dataset updated
    Feb 11, 2016
    Dataset authored and provided by
    Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela
    License

    https://academictorrents.com/nolicensespecified

    Description

    Data Set Information:
    • The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to them. Hence, this dataset does not share the original content but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
    • Acquisition date: January 8, 2015
    • The estimated relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

    Attribute Information:
    Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
    0. url: URL of the article (non-predictive)
    1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
    2. n_tokens_title: Number of words in the title
    3. n_tokens_content: Number of words in the content
    4. n_unique_tokens: Rate of unique words in the conte

  10. Lithium and Manganese Uptake Data from Initial Set of Imprinted Polymers

    • catalog.data.gov
    • gdr.openei.org
    • +1more
    Updated Jan 20, 2025
    Cite
    SRI International (2025). Lithium and Manganese Uptake Data from Initial Set of Imprinted Polymers [Dataset]. https://catalog.data.gov/dataset/lithium-and-manganese-uptake-data-from-initial-set-of-imprinted-polymers-a7654
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    SRI International
    Description

    Batch tests of cross-linked lithium and manganese imprinted polymers of variable composition to assess their ability to extract lithium and manganese from synthetic brines at T = 45 °C.

  11. Data from: FunQG: Molecular Representation Learning via Quotient Graphs

    • acs.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Hossein Hajiabolhassan; Zahra Taheri; Ali Hojatnia; Yavar Taheri Yeganeh (2023). FunQG: Molecular Representation Learning via Quotient Graphs [Dataset]. http://doi.org/10.1021/acs.jcim.3c00445.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Hossein Hajiabolhassan; Zahra Taheri; Ali Hojatnia; Yavar Taheri Yeganeh
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    To accurately predict molecular properties, it is important to learn expressive molecular representations. Graph neural networks (GNNs) have made significant advances in this area, but they often face limitations like neighbors-explosion, under-reaching, oversmoothing, and oversquashing. Additionally, GNNs tend to have high computational costs due to their large number of parameters. These limitations emerge or increase when dealing with larger graphs or deeper GNN models. One potential solution is to simplify the molecular graph into a smaller, richer, and more informative one on which GNNs are easier to train. Our proposed molecular graph coarsening framework, called FunQG, uses functional groups as building blocks to determine a molecule’s properties, based on a graph-theoretic concept called Quotient Graph. We show through experiments that the resulting informative graphs are much smaller than the original molecular graphs and are thus more suitable for training GNNs. We apply FunQG to popular molecular property prediction benchmarks and compare the performance of popular baseline GNNs on the resulting data sets to that of state-of-the-art baselines on the original data sets. Our experiments demonstrate that FunQG yields notable results on various data sets while dramatically reducing the number of parameters and computational costs. By utilizing functional groups, we can achieve an interpretable framework that indicates their significant role in determining the properties of molecular quotient graphs. Consequently, FunQG is a straightforward, computationally efficient, and generalizable solution for addressing the molecular representation learning problem.
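
    The quotient-graph idea mentioned above can be sketched generically: every group of nodes collapses into one super-node, and an edge survives only if it connects two different groups. This is not the FunQG implementation. The function name quotient_graph, the toy molecule, and the hand-written group labels are all assumptions made for illustration, and the rules FunQG uses to detect functional groups are not reproduced here.

```python
def quotient_graph(edges, group_of):
    """Collapse each group of nodes into one super-node.

    edges: iterable of (u, v) pairs of an undirected graph.
    group_of: dict mapping every node to its group label.
    Returns the set of undirected edges between distinct groups.
    """
    q_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # edges inside a group vanish in the quotient
            q_edges.add(tuple(sorted((gu, gv))))
    return q_edges

# Hypothetical toy input: heavy atoms of acetic acid, CH3-C(=O)-OH,
# with group labels assigned by hand.
edges = [("C1", "C2"), ("C2", "O1"), ("C2", "O2")]
group_of = {"C1": "methyl", "C2": "carboxyl", "O1": "carboxyl", "O2": "carboxyl"}
print(quotient_graph(edges, group_of))  # {('carboxyl', 'methyl')}
```

    Four atoms and three bonds shrink to two super-nodes and one edge, which is the size reduction the description attributes to coarsening.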

  12. Old Town, ME Population Breakdown by Gender and Age

    • neilsberg.com
    csv, json
    Updated Sep 14, 2023
    + more versions
    Cite
    Neilsberg Research (2023). Old Town, ME Population Breakdown by Gender and Age [Dataset]. https://www.neilsberg.com/research/datasets/674a3d42-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Sep 14, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Old Town, Maine
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Old Town by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Old Town. The dataset can be utilized to understand the population distribution of Old Town by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Old Town. Additionally, it can be used to see how the male-to-female ratio changes from the youngest to the most senior age group in Old Town.

    Key observations

    Largest age group (population): Male # 20-24 years (587) | Female # 35-39 years (454). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender:

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Old Town population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population of Old Town.
    • Population (Female): This column shows the female population of Old Town.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Old Town for each age group.
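
    The Gender Ratio column described above is males per 100 females. A minimal sketch of that arithmetic follows. The function name gender_ratio is hypothetical, the counts are made up for illustration and are not taken from the dataset, and returning None for a group with no females is an assumption about how such a case might be handled.

```python
def gender_ratio(males, females):
    """Males per 100 females; None when the group has no females."""
    if females == 0:
        return None
    return 100.0 * males / females

# Hypothetical counts, not taken from the dataset.
rows = [("Under 5 years", 120, 100), ("85 years and over", 30, 60)]
for age_group, m, f in rows:
    print(age_group, gender_ratio(m, f))
```

    A value above 100 means more males than females in that age group, and a value below 100 means the reverse.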

    Good to know

    Margin of Error

    Data in the dataset are estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Old Town Population by Gender.

  13. Old Bridge Township, New Jersey Population Breakdown by Gender and Age

    • neilsberg.com
    csv, json
    Updated Sep 14, 2023
    + more versions
    Cite
    Neilsberg Research (2023). Old Bridge Township, New Jersey Population Breakdown by Gender and Age [Dataset]. https://www.neilsberg.com/research/datasets/6749fff9-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Sep 14, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Old Bridge, New Jersey
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Old Bridge township by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Old Bridge township. The dataset can be utilized to understand the population distribution of Old Bridge township by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Old Bridge township. Additionally, it can be used to see how the male-to-female ratio changes from the youngest to the most senior age group in Old Bridge township.

    Key observations

    Largest age group (population): Male # 55-59 years (2,937) | Female # 55-59 years (3,130). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
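
    A key observation of this kind can be derived from the table by taking the row with the highest count in each population column. The sketch below assumes rows of the form (age group, male count, female count). The function name largest_age_group and the sample counts are hypothetical and do not come from the dataset.

```python
def largest_age_group(rows, column):
    """Return (age_group, count) for the row with the highest count.

    rows: list of (age_group, male_count, female_count) tuples.
    column: 1 for the male column, 2 for the female column.
    """
    best = max(rows, key=lambda row: row[column])
    return best[0], best[column]

# Hypothetical counts, not taken from the dataset.
rows = [
    ("50 to 54 years", 40, 45),
    ("55 to 59 years", 60, 50),
    ("60 to 64 years", 55, 70),
]
print("Male:", largest_age_group(rows, 1))
print("Female:", largest_age_group(rows, 2))
```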

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender:

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Old Bridge township population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population of Old Bridge township.
    • Population (Female): This column shows the female population of Old Bridge township.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Old Bridge township for each age group.

    Good to know

    Margin of Error

    Data in the dataset are estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Old Bridge township Population by Gender.

  14. Old Field, NY Population Breakdown by Gender and Age

    • neilsberg.com
    csv, json
    Updated Sep 14, 2023
    + more versions
    Cite
    Neilsberg Research (2023). Old Field, NY Population Breakdown by Gender and Age [Dataset]. https://www.neilsberg.com/research/datasets/674a0cc9-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Sep 14, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Old Field, New York
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Old Field by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Old Field. The dataset can be utilized to understand the population distribution of Old Field by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Old Field. Additionally, it can be used to see how the male-to-female ratio changes from the youngest to the most senior age group in Old Field.

    Key observations

    Largest age group (population): Male # 55-59 years (82) | Female # 60-64 years (56). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender:

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Old Field population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population of Old Field.
    • Population (Female): This column shows the female population of Old Field.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Old Field for each age group.

    Good to know

    Margin of Error

    Data in the dataset are estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Old Field Population by Gender.

  15. Old Mill Creek, IL Population Breakdown by Gender and Age Dataset: Male and...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Old Mill Creek, IL Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/old-mill-creek-il-population-by-gender/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Illinois, Old Mill Creek
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Old Mill Creek by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Old Mill Creek. The dataset can be utilized to understand the population distribution of Old Mill Creek by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Old Mill Creek. Additionally, it can be used to see how the male-to-female ratio changes from the youngest to the most senior age group in Old Mill Creek.

    Key observations

    Largest age group (population): Male # 70-74 years (9) | Female # 40-44 years (9). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender:

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Old Mill Creek population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population of Old Mill Creek.
    • Population (Female): This column shows the female population of Old Mill Creek.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Old Mill Creek for each age group.

    Good to know

    Margin of Error

    Data in the dataset are estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Old Mill Creek Population by Gender.

  16. RuBQ 1.0

    • kaggle.com
    zip
    Updated Aug 9, 2021
    + more versions
    Cite
    Valentin Biryukov (2021). RuBQ 1.0 [Dataset]. https://www.kaggle.com/valentinbiryukov/rubq-10
    Explore at:
    Available download formats: zip (174857 bytes)
    Dataset updated
    Aug 9, 2021
    Authors
    Valentin Biryukov
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    RuBQ 1.0: A Russian Knowledge Base Question Answering Data Set

    Introduction

    We present RuBQ (pronounced [`rubik]) -- Russian Knowledge Base Questions, a KBQA dataset that consists of 1,500 Russian questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, as well as a subset of Wikidata covering entities with Russian labels. 300 RuBQ questions are unanswerable, which poses a new challenge for KBQA systems and makes the task more realistic. The dataset is based on a collection of quiz questions. The data generation pipeline combines automatic processing, crowdsourced and in-house verification, see details in the paper. To the best of our knowledge, this is the first Russian KBQA and semantic parsing dataset.

    Links

    ISWC 2020 paper (newest)

    arXiv paper

    Test and Dev subsets

    RuWikidata sample

    Dataset is also published on Zenodo

    Usage

    The dataset is intended to be used as development and test sets in cross-lingual transfer, few-shot learning, or learning with synthetic data scenarios.

    Format

    Data set files are presented in JSON format as an array of dictionary entries. See full specifications here.

    Examples

    Example 1
    Question (Rus): Кто написал роман «Хижина дяди Тома»?
    Question (Eng): Who wrote the novel "Uncle Tom's Cabin"?
    Query:
        SELECT ?answer
        WHERE {
          wd:Q2222 wdt:P50 ?answer .
        }
    Answers: wd:Q102513 (Harriet Beecher Stowe)
    Tags: 1-hop

    Example 2
    Question (Rus): Кто сыграл князя Андрея Болконского в фильме С. Ф. Бондарчука «Война и мир»?
    Question (Eng): Who played Prince Andrei Bolkonsky in S. F. Bondarchuk's film "War and peace"?
    Query:
        SELECT ?answer
        WHERE {
          wd:Q845176 p:P161 [
            ps:P161 ?answer;
            pq:P453 wd:Q2737140
          ] .
        }
    Answers: wd:Q312483 (Vyacheslav Tikhonov)
    Tags: qualifier-constraint

    Example 3
    Question (Rus): Кто на работе пользуется теодолитом?
    Question (Eng): Who uses a theodolite for work?
    Query:
        SELECT ?answer
        WHERE {
          wd:Q181517 wdt:P366 [
            wdt:P3095 ?answer
          ] .
        }
    Answers: wd:Q1734662 (cartographer), wd:Q11699606 (geodesist), wd:Q294126 (land surveyor)
    Tags: multi-hop

    Example 4
    Question (Rus): Какой океан самый маленький?
    Question (Eng): Which ocean is the smallest?
    Query:
        SELECT ?answer
        WHERE {
          ?answer p:P2046/
                  psn:P2046/
                  wikibase:quantityAmount ?sq .
          ?answer wdt:P31 wd:Q9430 .
        }
        ORDER BY ASC(?sq)
        LIMIT 1
    Answers: wd:Q788 (Arctic Ocean)
    Tags: multi-constraint, reverse, ranking
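
    The 1-hop pattern in the first example has a fixed shape: one entity, one direct property, one unknown. A small sketch that builds such a query string is given here. The helper name one_hop_query is hypothetical and is not part of the RuBQ tooling. The identifiers Q2222 and P50 come from the example above.

```python
def one_hop_query(entity_id, property_id):
    """Build a 1-hop SPARQL query: entity --property--> ?answer."""
    return (
        "SELECT ?answer\n"
        "WHERE {\n"
        f"  wd:{entity_id} wdt:{property_id} ?answer .\n"
        "}"
    )

# Rebuilds the query for: Who wrote the novel "Uncle Tom's Cabin"?
print(one_hop_query("Q2222", "P50"))
```

    The wd: and wdt: prefixes are predefined on the public Wikidata Query Service. An endpoint loaded from a local dump would likely need explicit PREFIX declarations at the top of the query.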

    RuWikidata8M Sample

    We provide a Wikidata sample containing all the entities with Russian labels. It consists of about 212M triples with 8.1M unique entities. This snapshot mitigates the problem of Wikidata’s dynamics – a reference answer may change with time as the knowledge base evolves. The sample guarantees the correctness of the queries and answers. In addition, the smaller dump makes it easier to conduct experiments with our dataset.

    We strongly recommend using this sample for evaluation.

    Details

    Sample is a collection of several RDF files in Turtle.

    • wdt_all.ttl contains all the truthy statements.
    • names.ttl contains Russian and English labels and aliases for all entities. Names in other languages are also provided when needed.
    • onto.ttl contains all Wikidata triples with the relation wdt:P279 (subclass of). It represents some class hierarchy, but remember that there are no class or instance concepts in Wikidata.
    • pch_{0,6}.ttl contain all statement nodes and their data for all entities.

    Evaluation

    rdfs:label and skos:altLabel predicates convention

    Some questions in our dataset require using rdfs:label or skos:altLabel to retrieve an answer that is a literal. In cases where the answer language doesn't have to be inferred from the question, our evaluation script takes into account Russian literals only.
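
    The Russian-only convention can be pictured as a language-tag filter applied before answers are compared. The sketch below is not the authors' evaluation script. The function names, the (value, language tag) pair representation, and the exact-set-match scoring rule are assumptions made for illustration.

```python
def russian_literals(literals):
    """Keep only the values whose language tag is 'ru'.

    literals: iterable of (value, language_tag) pairs, for example the
    results of an rdfs:label or skos:altLabel lookup.
    """
    return {value for value, lang in literals if lang == "ru"}

def is_correct(predicted, reference):
    """Assumed scoring rule: exact set match on Russian literals only."""
    return russian_literals(predicted) == russian_literals(reference)

predicted = [("Северный Ледовитый океан", "ru"), ("Arctic Ocean", "en")]
reference = [("Северный Ледовитый океан", "ru")]
print(is_correct(predicted, reference))  # True
```

    The extra English label in the prediction does not count against it, because only literals tagged ru take part in the comparison.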

    Reference

    If you use RuBQ dataset in your work, please cite:

    @inproceedings{RuBQ2020,
     title={{RuBQ}: A {Russian} Dataset for Question Answering over {Wikidata}},
     author={Vladislav Korablinov and Pavel Braslavski},
     booktitle={ISWC},
     year={2020},
     pages={97--110}
    }
    

    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).

    CC BY-SA 4.0

  17. Sigma Dolphin Filtered and Cleaned

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Cite
    Ryan Mutiga (2024). Sigma Dolphin Filtered and Cleaned [Dataset]. https://www.kaggle.com/datasets/ryanmutiga/sigma-dolphin-filtered-and-cleaned
    Available download formats: zip (60569 bytes)
    Dataset updated
    Jun 25, 2024
    Authors
    Ryan Mutiga
    Description

    Dataset Description for Filtered Sigma Dolphin Dataset

    Overview

    This dataset is a cleaned and filtered version of the Sigma Dolphin dataset (https://www.kaggle.com/datasets/saurabhshahane/sigmadolphin), designed to aid in solving maths word problems using AI techniques. This was used as an effort towards taking part in the AI Mathematical Olympiad - Progress Prize 1 (https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/overview). The dataset was processed using TF-IDF vectorisation and K-means clustering, specifically targeting questions relevant to the AIME (American Invitational Mathematics Examination) and AMC 12 (American Mathematics Competitions).

    Context

    The Sigma Dolphin dataset is a project initiated by Microsoft Research Asia, aimed at building an intelligent system with natural language understanding and reasoning capacities to automatically solve maths word problems written in natural language. This project began in early 2013, and the dataset includes maths word problems from various sources, including community question-answering sites like Yahoo! Answers.

    Source and Original Dataset Details

    Content

    The filtered dataset includes problems that are relevant for preparing for maths competitions such as AIME and AMC. The data is structured to facilitate the training and evaluation of AI models aimed at solving these types of problems.

    Datasets:

    There are several filtered versions of the dataset based on different similarity thresholds (0.3 and 0.5). These thresholds were used to determine the relevance of problems from the original Sigma Dolphin dataset to the AIME and AMC problems.

    1. Number Word Problems Filtered at 0.3 Threshold:

      • File: number_word_test_filtered_0.3_Threshold.csv
      • Description: Contains problems filtered with a similarity threshold of 0.3, ensuring moderate relevance to AIME and AMC 12 problems.
    2. Number Word Problems Filtered at 0.5 Threshold:

      • File: number_word_std.test_filtered_0.5_Threshold.csv
      • Description: Contains problems filtered with a higher similarity threshold of 0.5, ensuring higher relevance to AIME and AMC 12 problems.
    3. Filtered Number Word Problems 2 at 0.3 Threshold:

      • File: filtered_number_word_problems2_Threshold.csv
      • Description: Another set of problems filtered at a 0.3 similarity threshold.
    4. Filtered Number Word Problems 2 at 0.5 Threshold:

      • File: filtered_number_word_problems_Threshold.csv
      • Description: Another set of problems filtered at a 0.5 similarity threshold.

    Why Different Similarity Thresholds?

    Different similarity thresholds (0.3 and 0.5) are used to provide flexibility in selecting problems based on their relevance to AIME and AMC problems. A lower threshold (0.3) includes a broader range of problems, ensuring a diverse set of questions, while a higher threshold (0.5) focuses on problems with stronger relevance, offering a more targeted and precise dataset. This allows users to choose the level of specificity that best fits their needs.
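
    The effect of the two thresholds can be sketched as follows. This is a simplified pure-Python TF-IDF and cosine-similarity filter, not the code from the author's notebook (which also uses K-means clustering); the function names, the smoothed IDF variant, and the example problems are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) with a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(candidates: List[str], references: List[str],
                         threshold: float) -> List[str]:
    """Keep candidates whose best similarity to any reference meets the threshold."""
    vectors = tfidf_vectors(candidates + references)
    cand_vecs = vectors[:len(candidates)]
    ref_vecs = vectors[len(candidates):]
    return [
        cand for cand, vec in zip(candidates, cand_vecs)
        if max((cosine(vec, ref) for ref in ref_vecs), default=0.0) >= threshold
    ]


# Hypothetical example problems, not taken from the dataset.
candidates = [
    "Find the sum of two consecutive integers whose product is 56",
    "What colour is the sky today",
]
references = ["Find the product of two consecutive integers whose sum is 15"]

print(filter_by_similarity(candidates, references, threshold=0.3))
print(filter_by_similarity(candidates, references, threshold=0.5))
```

    Raising the threshold can only shrink the selected set, which is the trade-off between breadth and relevance described above.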

    For a detailed explanation of the preprocessing and filtering process, please refer to the Sigma Dolphin Filtered & Cleaned Notebook.

    Acknowledgements

    We extend our gratitude to all the original authors of the Sigma Dolphin dataset and the creators of the AIME and AMC problems. This project leverages the work of numerous researchers and datasets to build a comprehensive resource for AI-based problem solving in mathematics.

    Usage

    This dataset is intended for research and educational purposes. It can be used to train AI models for natural language processing and problem-solving tasks, specifically targeting maths word problems in competitive environments like AIME and AMC.

    Licensing

    This dataset is shared under the Computational Use of Data Agreement v1.0.


  18. SweNLI 1.0

    • data.europa.eu
    • researchdata.se
    unknown
    Updated Oct 18, 2025
    Cite
    Göteborgs universitet (2025). SweNLI 1.0 [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-23695-ds6w-d280?locale=it
    Available download formats: unknown
    Dataset updated
    Oct 18, 2025
    Dataset authored and provided by
    Göteborgs universitet
    Description

    I. IDENTIFYING INFORMATION

    Title* SweNLI

    Subtitle

    Created by* Felix Morger (felix.morger@gu.se), Lars Borin, Aleksandrs Berdicevskis (Gothenburg University)

    Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)

    Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/superlim

    License(s)* CC BY 4.0

    Abstract* A Swedish NLI dataset. Train and dev are machine-translated from the English MNLI dataset, test is manually translated and adapted from the English Fracas dataset.

    Funded by* Vinnova (grants no. 2020-02523, 2021-04165)

    Cite as

    Related datasets Part of the SuperLim collection. Similar to SuperGLUE diagnostic dataset.

    II. USAGE

    Key applications Machine Learning, Inference, Entailment, Evaluation of language models, Diagnostics

    Intended task(s)/usage(s) Natural language inference.

    Recommended evaluation measures Krippendorff's Alpha (the official SuperLim measure), Accuracy

    Dataset function(s) Training, testing

    Recommended split(s) Train, dev, test (provided)
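
    As an illustration of the recommended measures, the sketch below computes accuracy and Krippendorff's alpha for nominal labels, treating the gold labels and the model predictions as two coders. Whether SuperLim computes alpha in exactly this way is an assumption; the function names and example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Share of items where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def krippendorff_alpha_nominal(gold: Sequence[Hashable],
                               pred: Sequence[Hashable]) -> float:
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    # Coincidence matrix: each item contributes the pairs (g, p) and (p, g).
    coincidences = Counter()
    for g, p in zip(gold, pred):
        coincidences[(g, p)] += 1
        coincidences[(p, g)] += 1
    totals = Counter()
    for (label, _), count in coincidences.items():
        totals[label] += count
    n = sum(totals.values())  # equals 2 * number of items
    observed = sum(c for (a, b), c in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        # Only one label occurs; alpha is undefined, report 1.0 by convention here.
        return 1.0
    return 1.0 - (n - 1) * observed / expected


# Hypothetical labels for four items.
gold = ["entailment", "entailment", "neutral", "neutral"]
pred = ["entailment", "entailment", "neutral", "entailment"]
print(accuracy(gold, pred), krippendorff_alpha_nominal(gold, pred))
```

    Unlike accuracy, alpha corrects for the agreement expected by chance given the label distribution, so it is lower than accuracy on this example.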

    III. DATA

    Primary data* Text

    Language* Swedish. Train and dev: machine-translated

    Dataset in numbers* Train: 392704 items, dev: 9815 items, test: 305 items

    Nature of the content* Inference problems, where a relation between a premise and a hypothesis has to be detected: entailment, neutral or contradiction.

    Format* JSON Lines, with one item per line. Each item contains an id, a premise (in test, the premise may contain several sentences, but is still represented as a single item), a hypothesis and a label. The dataset is also available as a tsv with self-explanatory column names. For test, an additional file is provided where the items can be matched with the original Fracas items.
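
    A minimal sketch of reading this format is shown below. The key names "id", "premise", "hypothesis" and "label" follow the description above, but their exact spelling in the released files is an assumption, and the two sample items are invented for illustration.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed item per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample data; a real file would be opened with open(path, encoding="utf-8").
sample = io.StringIO(
    '{"id": "1", "premise": "En katt sover.", '
    '"hypothesis": "Ett djur sover.", "label": "entailment"}\n'
    '{"id": "2", "premise": "En katt sover.", '
    '"hypothesis": "Katten springer.", "label": "contradiction"}\n'
)
items = list(read_jsonl(sample))
print(len(items), items[0]["label"])
```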

    Data source(s)* Train and dev: see [1]. Machine translated from English to Swedish using OPUS-MT. Test: see [2] and 'Data collection methods'.

    Data collection method(s)* Train and dev: see [1]. Test: SweFracas (part of the SuperLim 1.0). The original English Fracas [2] was converted to html and edited by Bill MacCartney [3], and then automatically translated to Swedish by Peter Ljunglöf and Magdalena Siverbo [4]. The current form of the set was created by Aleksandrs Berdicevskis by merging the Swedish and English versions and removing some of the problems. Finally, Lars Borin went through all the translations, correcting and Swedifying them manually. As a result, many translations are rather liberal and diverge noticeably from the English original.

    Data selection and filtering* Train and dev: We keep only the mismatched validation as a dev set and do not include the matched version. We also do not include the test MNLI datasets. Test: 41 problems in the original set did not have a definite answer (different answers were possible depending on the interpretation). They were excluded.

    Data preprocessing* Train and dev: see [1]. All extra column labels except for hypothesis (sentence1), premise (sentence2) have been removed for this data source. Test: SweFracas used questions (Ja/Nej/Vet ej/Jo) instead of hypotheses. Questions were semi-automatically converted to hypotheses by Aleksandrs Berdicevskis to fit the train and dev format.

    Data labeling* Train and dev: see [1]. Test: Most of the labels map straightforwardly on the original English labels, with one exception: 108 (No Neutral)

    Annotator characteristics Train and dev: see [1]. Test: PhD in linguistics; native speaker of Swedish

    IV. ETHICS AND CAVEATS

    Ethical considerations Train and dev: see [1].

    Things to watch out for Train and dev: see [1]. Remember that the data were machine-translated. Test: In the original dataset, all examples were classified by the linguistic phenomena they represent. It is not necessary that the Swedish translations follow exactly the same classification (most of them probably do, but it has not been checked).

    V. ABOUT DOCUMENTATION

    Data last updated* 2023-01-25

    Which changes have been made, compared to the previous version* The translated MNLI and SweFracas were merged to create a complete dataset.

    Access to previous versions

    This document created* 2023-01-25, Felix Morger.

    This document last updated* 2023-02-08, Aleksandrs Berdicevskis.

    Where to look for further details

    Documentation template version* v1.1

    VI. OTHER

    Related projects

    References [1] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

    [2] Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report, Technical Report LRE 62-051 D-16, The FraCaS Consortium. ftp://ftp.cogsci.ed.ac.uk/pub/FRACAS/del16

  19. Wine Quality Data Set (Red & White Wine)

    • kaggle.com
    zip
    Updated Nov 3, 2021
    Cite
    ruthgn (2021). Wine Quality Data Set (Red & White Wine) [Dataset]. https://www.kaggle.com/datasets/ruthgn/wine-quality-data-set-red-white-wine
    Available download formats: zip (100361 bytes)
    Dataset updated
    Nov 3, 2021
    Authors
    ruthgn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Set Information

    This data set contains records related to red and white variants of the Portuguese Vinho Verde wine. It contains information from 1599 red wine samples and 4898 white wine samples. Input variables in the data set consist of the type of wine (either red or white wine) and metrics from objective tests (e.g. acidity levels, pH values, ABV, etc.), while the target/output variable is a numerical score based on sensory data: the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Due to privacy and logistic issues, there is no data about grape types, wine brand, and wine selling price.

    This data set is a combined version of the two separate files (distinct red and white wine data sets) originally shared in the UCI Machine Learning Repository.

    The following are some existing data sets on Kaggle from the same source (with notable differences from this data set):

    • Red Wine Quality (contains red wine data only)
    • Wine Quality (combination of red and white wine data but with some values randomly removed)
    • Wine Quality (red and white wine data not combined)

    Contents

    Input variables:

    1 - type of wine: type of wine (categorical: 'red', 'white')

    (continuous variables based on physicochemical tests)

    2 - fixed acidity: The acids that naturally occur in the grapes used to ferment the wine and carry over into the wine. They mostly consist of tartaric, malic, citric or succinic acid that mostly originate from the grapes used to ferment the wine. They also do not evaporate easily. (g / dm^3)

    3 - volatile acidity: Acids that evaporate at low temperatures—mainly acetic acid which can lead to an unpleasant, vinegar-like taste at very high levels. (g / dm^3)

    4 - citric acid: Citric acid is used as an acid supplement which boosts the acidity of the wine. It's typically found in small quantities and can add 'freshness' and flavor to wines. (g / dm^3)

    5 - residual sugar: The amount of sugar remaining after fermentation stops. It's rare to find wines with less than 1 gram/liter. Wines with a residual sugar level greater than 45 grams/liter are considered sweet. On the other end of the spectrum, a wine that does not taste sweet is considered dry. (g / dm^3)

    6 - chlorides: The amount of chloride salts (sodium chloride) present in the wine. (g / dm^3)

    7 - free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. All else constant, the higher the free sulfur dioxide content, the stronger the preservative effect. (mg / dm^3)

    8 - total sulfur dioxide: The amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)

    9 - density: The density of wine juice depending on the percent alcohol and sugar content; it's typically similar but higher than that of water (wine is 'thicker'). (g / cm^3)

    10 - pH: A measure of the acidity of wine; most wines are between 3-4 on the pH scale. The lower the pH, the more acidic the wine is; the higher the pH, the less acidic the wine. (The pH scale technically is a logarithmic scale that measures the concentration of free hydrogen ions floating around in your wine. Each point of the pH scale is a factor of 10. This means a wine with a pH of 3 is 10 times more acidic than a wine with a pH of 4)

    11 - sulphates: Amount of potassium sulphate as a wine additive which can contribute to sulfur dioxide gas (SO2) levels; it acts as an antimicrobial and antioxidant agent. (g / dm^3)

    12 - alcohol: How much alcohol is contained in a given volume of wine (ABV). Wine generally contains between 5–15% of alcohols. (% by volume)

    Output variable:

    13 - quality: score between 0 (very bad) and 10 (very excellent) by wine experts
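
    The logarithmic pH relation described for variable 10 can be made concrete with a short worked example. It uses the standard definition of pH as the negative base-10 logarithm of the hydrogen-ion concentration; the function name is hypothetical.

```python
def acidity_ratio(ph_a: float, ph_b: float) -> float:
    """Return how many times higher the hydrogen-ion concentration is at ph_a than at ph_b."""
    # pH = -log10([H+]), so [H+]_a / [H+]_b = 10 ** (ph_b - ph_a).
    return 10 ** (ph_b - ph_a)


# A wine at pH 3 compared with a wine at pH 4.
print(acidity_ratio(3.0, 4.0))
```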

    Acknowledgements

    Source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Data credit goes to UCI. Visit their website to access the original data set directly: https://archive.ics.uci.edu/ml/datasets/wine+quality

    Context

    So much about wine making remains elusive—taste is very subjective, making it extremely challenging to predict exactly how consumers will react to a certain bottle of wine. There is no doubt that winemakers, connoisseurs, and scientists have greatly contributed their expertise to ...

  20. E-News Express

    • kaggle.com
    zip
    Updated Sep 28, 2023
    Cite
    Mariyam Al Shatta (2023). E-News Express [Dataset]. https://www.kaggle.com/datasets/mariyamalshatta/e-news-express
    Available download formats: zip (925 bytes)
    Dataset updated
    Sep 28, 2023
    Authors
    Mariyam Al Shatta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Business Context

    The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on the day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. There are multiple advantages of transmitting news electronically, like faster access to the content and the ability to utilize different technologies such as audio, graphics, video, and other interactive elements that are either not being used or aren’t common yet in traditional newspapers.

    E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current webpage is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.

    [Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]

    Objective

    The design team of the company has researched and created a new landing page that has a new outline & more relevant content shown compared to the old page. In order to test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (control group) and the new landing page to the second group (treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. Being a data scientist in E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:

    • Do the users spend more time on the new landing page than on the existing landing page?
    • Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
    • Does the converted status depend on the preferred language?
    • Is the time spent on the new page the same for the different language users?
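
    The conversion-rate question can be approached with a one-sided two-proportion z-test at the 5% significance level. The sketch below is one possible approach, not a prescribed solution; the function name is hypothetical and the conversion counts are invented, not taken from the data set.

```python
import math
from statistics import NormalDist
from typing import Tuple


def one_sided_two_proportion_z_test(conv_new: int, n_new: int,
                                    conv_old: int, n_old: int) -> Tuple[float, float]:
    """Test H0: p_new <= p_old against H1: p_new > p_old; return (z, p_value)."""
    p_new = conv_new / n_new
    p_old = conv_old / n_old
    # Pooled proportion under the null hypothesis of equal conversion rates.
    pooled = (conv_new + conv_old) / (n_new + n_old)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / std_err
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Invented counts: 33 of 50 treatment users and 21 of 50 control users converted.
z, p_value = one_sided_two_proportion_z_test(33, 50, 21, 50)
alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

    With only 50 users per group, the normal approximation is rough; an exact test could be considered as a cross-check.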

    Data Dictionary

    The data contains information regarding the interaction of users in both groups with the two versions of the landing page.

    • user_id - Unique user ID of the person visiting the website
    • group - Whether the user belongs to the first group (control) or the second group (treatment)
    • landing_page - Whether the landing page is new or old
    • time_spent_on_the_page - Time (in minutes) spent by the user on the landing page
    • converted - Whether the user gets converted to a subscriber of the news portal or not
    • language_preferred - Language chosen by the user to view the landing page

Cite
Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono (2022). Login Data Set for Risk-Based Authentication [Dataset]. http://doi.org/10.5281/zenodo.6782156

Data from: Login Data Set for Risk-Based Authentication

Related Article
Available download formats: zip
Dataset updated
Jun 30, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Login Data Set for Risk-Based Authentication

Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in productive systems, e.g., intrusion detection systems.

Overview

The data set contains the following features related to each login attempt on the SSO:

| Feature | Data Type | Description | Range or Example |
|---|---|---|---|
| IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
| Country | String | Country derived from the IP address | US |
| Region | String | Region derived from the IP address | New York |
| City | String | City derived from the IP address | Rochester |
| ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
| User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
| OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
| Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
| Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1] |
| User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
| Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
| Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
| Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
| Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
| Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |

Data Creation

As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and temporal order between the features.

The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

  • The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

  • The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.

  • The RTT was randomly drawn from the login success status and synthesized geolocation data. We did this to ensure that the RTTs are realistic ones.

Regarding the Data Values

Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.

You can recognize them by the following values:

  • ASNs with values >= 500000

  • IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
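
The two rules above can be checked with a minimal sketch using Python's standard ipaddress module. The function and constant names are hypothetical, and how the IP address and ASN are read from the data files is left out, since the file layout is not described here.

```python
import ipaddress

# Thresholds taken from the two rules listed above.
PLACEHOLDER_ASN_MIN = 500_000
PLACEHOLDER_NETWORK = ipaddress.ip_network("10.0.0.0/8")


def is_placeholder_entry(ip: str, asn: int) -> bool:
    """Return True if the entry carries an unrealistic ASN or IP address."""
    in_placeholder_range = ipaddress.ip_address(ip) in PLACEHOLDER_NETWORK
    return asn >= PLACEHOLDER_ASN_MIN or in_placeholder_range


# Illustrative values only.
print(is_placeholder_entry("10.1.2.3", 1234))
print(is_placeholder_entry("8.8.8.8", 15169))
```

Such a check can be used to report how many entries are affected before running an analysis that depends on realistic network data.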

Study Reproduction

Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

See RESULTS.md for more details.

Ethics

By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

Publication

You can find more details on our conducted study in the following journal article:

Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
ACM Transactions on Privacy and Security

Bibtex

@article{Wiefling_Pump_2022,
 author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
 title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
 journal = {{ACM} {Transactions} on {Privacy} and {Security}},
 doi = {10.1145/3546069},
 publisher = {ACM},
 year  = {2022}
}

License

This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

  1. A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
