Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under native physiological conditions. They participate in critical biological processes and are therefore associated with the pathogenesis of many severe human diseases. Identifying IDPs/IDRs and their functions supports a comprehensive understanding of protein structure and function and informs rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence information has widened the gap between uncharacterized and annotated disordered sequences. Protein language models have recently demonstrated a powerful ability to capture complex structural and functional information from enormous quantities of unlabelled protein sequences, providing an opportunity to uncover intrinsic disorder and its biological properties directly from amino acid sequences. In this study, we propose a computational predictor called IDP-LM for predicting intrinsic disorder and disorder functions by leveraging pre-trained protein language models. IDP-LM takes the embeddings extracted from three pre-trained protein language models as its exclusive inputs: ProtBERT, ProtT5, and a disorder-specific language model (IDP-BERT). The ablation analysis showed that IDP-BERT provides fine-grained feature representations of disorder, and that the combination of the three language models is the key to the performance improvement of IDP-LM. Evaluation results on independent test datasets demonstrated that IDP-LM provides high-quality predictions of intrinsic disorder and four common disordered functions.
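As a minimal illustration of the kind of inputs IDP-LM consumes (not the authors' pipeline), the sketch below extracts per-residue embeddings from the publicly available ProtBERT checkpoint with the Hugging Face transformers library; embeddings from ProtT5 and IDP-BERT would be extracted analogously and concatenated residue-wise. The checkpoint name and the space-separated sequence format follow the public Rostlab release; treat everything else as an assumption.

```python
# Sketch: per-residue embeddings from ProtBERT (Rostlab/prot_bert) via Hugging Face transformers.
# Illustrative only, not the IDP-LM pipeline; ProtT5/IDP-BERT embeddings would be extracted
# the same way and concatenated per residue before being fed to the predictor.
import re
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = AutoModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"        # example amino-acid sequence
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))    # ProtBERT expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (1, seq_len + special tokens, 1024)

# Drop [CLS]/[SEP] to keep one 1024-dimensional embedding per residue.
per_residue = hidden[0, 1:-1, :]
print(per_residue.shape)                                 # torch.Size([len(sequence), 1024])
```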
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set was generated in accordance with semiconductor-industry practice and contains sensor recordings from high-precision, high-tech production equipment. Semiconductor production consists of hundreds of process steps performing physical and chemical operations on so-called wafers, i.e., thin slices of semiconductor material. Typically, wafers are aggregated into so-called lots of size 25, which always pass through the same operations in the production chain.
In the production chain, each piece of process equipment is fitted with several sensors recording physical parameters such as gas flow, temperature, and voltage, resulting in so-called sensor data recorded during each process step. From these time-dependent sensor data, so-called key numbers (KNs) are extracted over a time frame in the individual sensor recordings judged by experts to be important for the process. To keep the entire production as stable as possible, the KNs are used to intervene in case of deviations.
After production, each device on the wafer is tested thoroughly, resulting in so-called wafer test data. In some cases, suspicious patterns occur in the wafer test data, potentially leading to failures. In such cases, the root cause must be found in the production chain. For this purpose, the given KNs are provided. The aim is to find correlations between the wafer test data and the KNs in order to identify the root cause.
The given data is divided into three data sets: "process1.csv", "process2.csv" and "response.csv". "process1.csv" and "process2.csv" contain the KNs extracted from two pieces of process equipment. The "response.csv" data set contains the corresponding wafer test data. For unique identification, the first two columns in each data set are the lot number and the wafer number, respectively.
The exact column structure is as follows.
For "process1.csv" and "process2.csv":
lot: the lot number
wafer: the wafer number
KN1: the recordings of the first sensor
KN2: the recordings of the second sensor
...
KN50: the recordings of the last sensor
"KN1"-"KN36" belong to "process1" and "KN37"-"KN50" belong to "process2".
For "response.csv":
lot: the lot number
wafer: the wafer number
response: the numerical test values
class: the "good"/"bad" classification depending on the response value (threshold: 0.75)
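A minimal sketch of how these three files could be joined and screened for correlations, assuming they are plain comma-separated files with the column names listed above (paths are placeholders; delimiter and encoding should be checked):

```python
# Sketch: join the two KN files with the wafer test data on (lot, wafer) and rank KNs
# by their correlation with the numerical test response. File paths are placeholders.
import pandas as pd

process1 = pd.read_csv("process1.csv")   # lot, wafer, KN1..KN36
process2 = pd.read_csv("process2.csv")   # lot, wafer, KN37..KN50
response = pd.read_csv("response.csv")   # lot, wafer, response, class

merged = (
    process1
    .merge(process2, on=["lot", "wafer"])
    .merge(response, on=["lot", "wafer"])
)

kn_cols = [c for c in merged.columns if c.startswith("KN")]
correlations = (
    merged[kn_cols]
    .corrwith(merged["response"])
    .abs()
    .sort_values(ascending=False)
)
print(correlations.head(10))             # KNs most strongly associated with the wafer test response
```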
This dataset contains all data and code necessary to reproduce the analysis described under the heading "Experiment 3" in the manuscript: Taliercio, E., Eickholt, D., Read, Q. D., Carter, T., Waldeck, N., & Fallen, B. (2023). Parental choice and seed size impact the uprightness of progeny from interspecific Glycine hybridizations. Crop Science. https://doi.org/10.1002/csc2.21015

The attached files are:

G_max_G_soja_seedweight_seedcolor_analysis.Rmd: RMarkdown notebook containing all analysis code. The CSV data files should be placed in a subdirectory called data within the working directory from which the notebook is rendered.

G_max_G_soja_seedweight_seedcolor_analysis.html: Rendered HTML output from the RMarkdown notebook, including figures, tables, and explanatory text.

counts_seedwt.csv: CSV file containing the number of progeny selected and average 100-seed weight data for each combination of cross, size class, and replicate. Columns are:
F3_location: text identifier of F3 nursery location, either "CLA" or "FF"
plot: numeric ID of plot
pop: numeric ID of population
max: name of G. max parent
soja: name of G. soja parent
F2_location: text identifier of F2 nursery location, either "Caswell" or "Hugo"
n_planted: number of seeds planted (raw)
n_selected: number of progeny selected
size_ordered: seed size class, to be converted to an ordered factor
size_combined: seed size class aggregated to fewer unique levels
ave_100sw: average 100-seed weight for the given size class
n_planted_trials: number of seeds planted, rounded to the nearest integer

seedcolor.csv: CSV file with additional data on the number of seeds of each color by population. Columns are:
cross: text identifier of cross
line: text identifier of line
light: number of light seeds
mid: number of mid-green seeds
brown: number of brown seeds
dark: number of dark or black seeds
population: identifier of population type (F2 derived or selected)
max: name of G. max parent
n_total: sum of the light, mid, brown, and dark columns
soja: name of G. soja parent

The data processing and analysis pipeline in the RMarkdown notebook includes:
Importing the data (slightly cleaned version is provided)
Creating boxplots of proportion selected by cross, nursery location, and size class
Fitting a logistic GLMM to estimate the probability of selection as a function of parent, 100-seed weight, and their interactions
Extracting and plotting random effect estimates from the model
Calculating and plotting estimated marginal means from the model
Taking contrasts between pairs of estimated marginal means and trends
Calculating Bayes factors associated with the contrasts
Generating figures and tables for all of the above results
Additional seed color analysis: importing data (slightly cleaned version is provided)
Additional seed color analysis: drawing an exploratory bar plot
Additional seed color analysis: fitting a multinomial GLM modeling the proportion of seeds of each color as a function of population
Additional seed color analysis: generating expected value predictions from the GLM and taking contrasts
Additional seed color analysis: creating figures and tables for model results

This research was funded by CRIS 6070-21220-069-00D, United Soybean Board Project # 2333-203-0101, and falls under National Program NP301.

Resources in this dataset:
Resource Title: RMarkdown document with all analysis code. File Name: G_max_G_soja_seedweight_seedcolor_analysis.Rmd
Resource Title: Rendered HTML version of notebook. File Name: G_max_G_soja_seedweight_seedcolor_analysis.html
Resource Title: Progeny counts and seed weight data. File Name: counts_seedwt.csv
Resource Title: Seed color counts data. File Name: seedcolor.csv
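The formal analysis (logistic GLMM, marginal-mean contrasts, Bayes factors) lives in the RMarkdown notebook; purely as an illustration of the counts_seedwt.csv layout, here is a hedged Python sketch that computes the raw proportion selected per cross and size class. Column names follow the list above; the file path is assumed.

```python
# Sketch: exploratory summary of counts_seedwt.csv (not the published GLMM analysis,
# which is implemented in the accompanying RMarkdown notebook).
import pandas as pd

counts = pd.read_csv("data/counts_seedwt.csv")

# Raw selection proportion per row, then averaged within cross x size class.
counts["prop_selected"] = counts["n_selected"] / counts["n_planted"]
summary = (
    counts
    .groupby(["max", "soja", "size_combined"], as_index=False)
    .agg(mean_prop_selected=("prop_selected", "mean"),
         mean_100_seed_wt=("ave_100sw", "mean"))
)
print(summary.head())
```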
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the calculation folder: each file contains a matrix called “MATR”. Each row of the matrix “MATR” is a trial.
The columns contain the following information:
1st: Number of trial
2nd: Subject response
4th: Response time
5th: first number
6th: math symbol (1=*; 2= +; 3= –)
7th: second number
8th: third number
In the calculation folder: each file contains a matrix called “matr”. Each row of the matrix “matr” is a trial.
The columns contain the following information:
1st: subject response in the numerosity task
2nd: the presented numerosity
3rd: subject response in the numerosity task
4th: zero
5th: stimulus duration
6th: Response time in the numerosity task
7th: Grouped (1) or random (2) presentation
8th: 1
9th: 1
10th: Number of items of the upper-left quadrant
11th: Number of items of the lower-left quadrant
12th: Number of items of the upper-right quadrant
13th: Number of items of the lower-right quadrant
14th: odd shape presented (1=diamond; 2=triangle; 3=circle)
15th: subject response in the shape task
16th: 0.2 in the single task; response time in the shape task in the dual task
17th: single (0) or dual (1) task
18th: time stimulus on
19th: time stimulus off
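The matrices MATR and matr suggest MATLAB-style data files; assuming the files are .mat files (an assumption, since the format is not stated above), a minimal loading sketch looks like this, with a hypothetical file name:

```python
# Sketch: load one trial matrix, assuming MATLAB .mat files (format not stated in the
# descriptions above; adjust if the data are stored differently). File name is hypothetical.
import scipy.io as sio

data = sio.loadmat("calculation/subject01.mat")   # hypothetical path
matr = data["MATR"]                               # trials x columns (use "matr" for the numerosity files)

# Column indices follow the listings above (0-based here): e.g. in the calculation task,
# column 1 is the subject response and column 3 is the response time.
responses = matr[:, 1]
print(matr.shape, responses[:5])
```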
The US Consumer Phone file contains phone numbers, mobile and landline, tied to an individual in the Consumer Database. The fields available include the phone number, phone type, mobile carrier, and Do Not Call registry status.
All phone numbers can be processed and cleansed using telecom carrier data. The telecom data, including phone and texting activity, porting instances, carrier scoring, spam, and known fraud activity, feed a proprietary Phone Quality Level (PQL), a data-science-derived score designed to ensure high deliverability at a fraction of the cost compared to competitors.
We have developed this file to be tied to our Consumer Demographics Database so additional demographics can be applied as needed. Each record is ranked by confidence and only the highest quality data is used.
Note - all Consumer packages can include necessary PII (address, email, phone, DOB, etc.) for merging, linking, and activation of the data.
BIGDBM Privacy Policy: https://bigdbm.com/privacy.html
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three incidents, sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches
Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

Largest data exposures worldwide
In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This is by far the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records; then later, in 2017, it came up with an updated number of leaked records, which was three billion. In March 2018, the third biggest data breach happened, involving India's national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
Each data set name incorporates the date on which the data were acquired. There are a total of 6 data sets, as shown in the table below (the current one, 20201112, is among them). Each data set contains a series of txt files (the current one has 30) with relevant information about the data set, such as the magnetic field (both relative B* and absolute values), the FQH state, and a table with the raw data file names and the corresponding quantities of interest (input square-wave DC offset + amplitude in volts, set temperature in kelvin, mean/uncertainty of the temperature in kelvin, signal samples for averaging, resistor values in ohms, gain, frequency in hertz, time stamp, and time difference between acquisitions in seconds). The name of each information file is composed of the date on which the data set was acquired, the cooldown number, the sample name/number, and the conductance (G), followed by the file number (incremented by 1). Furthermore, each data set contains all the raw data saved in txt files (the current one has 570). The name of each raw data file begins with Zurich (ZH), followed by the cooldown number, the date on which the data set was acquired, the row number of the data matrix (one row per thermal bath, i.e. dilution-refrigerator temperature), and the column number of the data matrix (one column per input voltage, i.e. square-wave DC offset + amplitude at the same thermal-bath temperature). Each raw data file contains 2 columns. The first column is the time in seconds. The second column is the voltage drop across the sensing 1 kilo-ohm resistor in volts (which translates to the current through the Corbino device). This is the raw response (to the bipolar square-wave input) of the 2D electron gas (2DEG) measured by the Zurich digitizer. Raw data to download: ZH-20C1-1112-1-00000.txt up to ZH-20C1-1112-30-00018.txt (570 files in total).
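A minimal sketch for reading one of these raw files and converting the sensed voltage to current through the 1 kOhm resistor (assuming whitespace- or tab-delimited columns, which should be verified against an actual file):

```python
# Sketch: parse a raw-data file name and load its two columns (time [s], voltage [V]),
# then convert the voltage across the 1 kOhm sense resistor to current. The delimiter
# is assumed to be whitespace; check an actual file before relying on this.
import numpy as np

fname = "ZH-20C1-1112-1-00000.txt"
_, cooldown, date, row, col = fname.removesuffix(".txt").split("-")
print(f"cooldown {cooldown}, date {date}, temperature row {row}, voltage column {int(col)}")

t, v_sense = np.loadtxt(fname, unpack=True)   # column 1: time [s], column 2: voltage [V]
current = v_sense / 1e3                       # I = V / R with R = 1 kOhm, in amperes
print(t[:3], current[:3])
```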
The dataset contains sales data of an automobile company.
Do explore the pinned 📌 notebook under the Code section for a quick EDA 📊 reference.
Consider an upvote if you find the dataset useful.
Data Description
| Column Name | Description |
|---|---|
| ORDERNUMBER | This column represents the unique identification number assigned to each order. |
| QUANTITYORDERED | It indicates the number of items ordered in each order. |
| PRICEEACH | This column specifies the price of each item in the order. |
| ORDERLINENUMBER | It represents the line number of each item within an order. |
| SALES | This column denotes the total sales amount for each order, which is calculated by multiplying the quantity ordered by the price of each item. |
| ORDERDATE | It denotes the date on which the order was placed. |
| DAYS_SINCE_LASTORDER | This column represents the number of days that have passed since the last order for each customer. It can be used to analyze customer purchasing patterns. |
| STATUS | It indicates the status of the order, such as "Shipped," "In Process," "Cancelled," "Disputed," "On Hold," or "Resolved." |
| PRODUCTLINE | This column specifies the product line categories to which each item belongs. |
| MSRP | It stands for Manufacturer's Suggested Retail Price and represents the suggested selling price for each item. |
| PRODUCTCODE | This column represents the unique code assigned to each product. |
| CUSTOMERNAME | It denotes the name of the customer who placed the order. |
| PHONE | This column contains the contact phone number for the customer. |
| ADDRESSLINE1 | It represents the first line of the customer's address. |
| CITY | This column specifies the city where the customer is located. |
| POSTALCODE | It denotes the postal code or ZIP code associated with the customer's address. |
| COUNTRY | This column indicates the country where the customer is located. |
| CONTACTLASTNAME | It represents the last name of the contact person associated with the customer. |
| CONTACTFIRSTNAME | This column denotes the first name of the contact person associated with the customer. |
| DEALSIZE | It indicates the size of the deal or order, categorized as "Small," "Medium," or "Large." |
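As a quick illustration of this schema (the CSV file name below is a placeholder), the sketch checks the stated SALES = QUANTITYORDERED × PRICEEACH relationship and summarizes revenue by product line:

```python
# Sketch: basic consistency check and aggregation over the sales table described above.
# The CSV file name is a placeholder.
import pandas as pd

sales = pd.read_csv("auto_sales.csv")

# SALES is documented as quantity ordered times unit price; flag rows that deviate.
expected = sales["QUANTITYORDERED"] * sales["PRICEEACH"]
mismatch = (sales["SALES"] - expected).abs() > 0.01
print(f"{mismatch.sum()} rows deviate from QUANTITYORDERED * PRICEEACH")

# Revenue by product line and deal size.
summary = (
    sales.groupby(["PRODUCTLINE", "DEALSIZE"])["SALES"]
    .sum()
    .sort_values(ascending=False)
)
print(summary.head(10))
```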
The ARS Water Data Base is a collection of precipitation and streamflow data from small agricultural watersheds in the United States. This national archive of variable time-series readings for precipitation and runoff contains sufficient detail to reconstruct storm hydrographs and hyetographs. There are currently about 14,000 station-years of data stored in the data base. Watersheds used as study areas range from 0.2 hectare (0.5 acres) to 12,400 square kilometers (4,786 square miles). Raingage networks range from one station per watershed to over 200 stations. The period of record for individual watersheds varies from 1 to 50 years. Some watersheds have been in continuous operation since the mid 1930s.

Resources in this dataset:

Resource Title: FORMAT INFORMATION FOR VARIOUS RECORD TYPES. File Name: format.txt
Resource Description: Format information identifying fields and their lengths is included in this file for all files except those ending with the extension .txt.
TYPES OF FILES: Data are stored by location number in subdirectories of the form LXX, where XX is the location number. In each subdirectory, there are various files using the following naming conventions:
Runoff data: WSXXX.zip, where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
Rainfall data: RGXXXXXX.zip, where XXXXXX is the rain gage station identification.
Maximum-minimum daily air temperature: MMTXXXXX.zip, where XXXXX is the watershed number assigned by the WDC.
Ancillary text files: NOTXXXXX.txt, where XXXXX is the watershed number assigned by the WDC. These files contain textual information including latitude-longitude, the name commonly used in literature, acreage, the most commonly associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
Topographic maps of the watersheds: MAPXXXXX.zip, where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.

Resource Title: Data Inventory - watersheds. File Name: inventor.txt
Resource Description: Watersheds at which records of runoff were being collected by the Agricultural Research Service. Variables: Study Location & Number of Rain Gages; Name; Lat.; Long.; Number; Pub. Code; Record Began; Land Use; Area (Acres); Types of Data.

Resource Title: Information about the ARS Water Database. File Name: README.txt

Resource Title: INDEX TO INFORMATION ON EXPERIMENTAL AGRICULTURAL WATERSHEDS. File Name: INDEX.TXT
Resource Description: This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base; they are so indicated in the column titled ARS Water Data Base. Other watersheds will not have data available here or through the Water Data Center. This index is particularly important since it relates watershed names to the indexing system used by the Water Data Center. Each location has been assigned a number, and the data for that location are stored in a subdirectory coded as LXX, where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed are stored in a compressed file named WSXXXXX.zip, where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information is stored in compressed files named RGXXXXXX.zip, where XXXXXX is a 6-character identification of the rain gage station. The index also provides information such as latitude-longitude for each of the watersheds, acreage, and the period of record for each acreage. Multiple entries for a particular watershed indicate either that the acreage designated for the watershed changed or that there was a break in operations of the watershed.

Resource Title: ARS Water Database files. File Name: ars_water.zip
Resource Description: USING THIS SYSTEM. Before downloading large amounts of data from the ARS Water Data Base, first review the text files included in this directory: INDEX OF ARS EXPERIMENTAL WATERSHEDS (index.txt) and FORMAT INFORMATION FOR VARIOUS RECORD TYPES (format.txt), described above, plus STATION TABLE FOR THE ARS WATER DATA BASE (station.txt), which indicates the period of record for each recording station represented in the ARS Water Data Base; the data for a particular station are stored in a single compressed file. The file and subdirectory naming conventions are the same as described for format.txt and INDEX.TXT above.
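A small, purely illustrative sketch of these naming conventions (location and watershed numbers are hypothetical, and the zero-padding widths should be checked against the actual archive, since the description above uses both WSXXX and WSXXXXX forms):

```python
# Sketch: build archive paths from the LXX / WSxxx / RGxxxxxx naming conventions
# described above. Numbers and padding widths are illustrative assumptions.
from pathlib import Path

def location_dir(location_num: int) -> Path:
    return Path(f"L{location_num:02d}")                 # e.g. L26

def runoff_zip(location_num: int, watershed_num: int) -> Path:
    return location_dir(location_num) / f"WS{watershed_num:03d}.zip"

def raingage_zip(location_num: int, gage_id: str) -> Path:
    return location_dir(location_num) / f"RG{gage_id:0>6}.zip"  # 6-character station id

print(runoff_zip(26, 7))        # e.g. L26/WS007.zip
print(raingage_zip(26, "4501")) # e.g. L26/RG004501.zip
```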
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MOT TASK
Each file refers to a session with a particular condition. The file name spells out the number of objects to track as well as the total number of objects on the screen. Within each file there is a variable called "MatriceRisultati", which contains:
Number of targets to follow
Number of correct answers
Number of trials at the condition
Percent correct responses (i.e. value_2 / value_3)

NECKLACE – DISTANCE TASK
Each file contains raw data for each session. All the data are stored in a variable called RESP, which contains parameters for each trial. The crucial columns are:
Inter-dot distance in the reference (in pixels; typically 1 pixel =~ 0.03 cm)
Inter-dot distance in the test (in pixels)
Subject choice to the question "which contains closer dots"
For analysis, one has to fit a psychometric curve (i.e. a cumulative Gaussian) to the data of column 3 as a function of inter-dot distance (column 2). A variant may include dividing column 2 by column 1 (so as to obtain the ratio between test and reference) and running the psychometric fit on this normalized dimension. A sketch of such a fit appears after this description. Further explanation of the other columns (seeds for generating the stimuli) can be obtained from the authors.

NUMEROSITY DISCRIMINATION
The relevant columns in the matrix 'a' contain the following information:
1st: Numerosity
2nd: Log10 Numerosity
3rd: Response
Further explanation of the other columns (seeds for generating the stimuli) can be obtained from the authors.

NUMEROSITY ESTIMATION
The relevant columns in the matrix 'ContengoRisultati' contain the following information:
1st: Numerosity
2nd: Response
Further explanation of the other columns (seeds for generating the stimuli) can be obtained from the authors.
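A hedged sketch of the psychometric-curve analysis described above for the necklace-distance task. It assumes RESP has been exported to a plain text array with the three crucial columns in the order listed, that the choice is coded 1 when the subject judged the test as having closer dots (coding to be verified), and it uses a rough least-squares fit rather than a binomial likelihood; the file name is hypothetical.

```python
# Sketch: fit a cumulative-Gaussian psychometric function to the necklace-distance data.
# Assumes RESP is an (n_trials x n_columns) array whose first three columns are: reference
# inter-dot distance, test inter-dot distance, and the subject's choice (coding assumed).
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(x, mu, sigma):
    """Cumulative Gaussian in x with point of subjective equality mu and slope sigma."""
    return norm.cdf((x - mu) / sigma)

RESP = np.loadtxt("necklace_session01.txt")   # hypothetical export of the RESP variable
ratio = RESP[:, 1] / RESP[:, 0]               # test / reference inter-dot distance (normalized variant)
choice = RESP[:, 2]                           # per-trial binary response (coding assumed)

(mu, sigma), _ = curve_fit(psychometric, ratio, choice, p0=[1.0, 0.2])
print(f"PSE = {mu:.3f}, slope sigma = {sigma:.3f}")
```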
This data release contains numerous comma-separated text files with data summarizing observations within and adjacent to the Woodbury Fire, which burned from 8 June to 15 July 2019. In particular, this monitoring was focused on debris flows in burned and unburned areas.

Rainfall data (Wdby_Rainfall.zip) are contained in csv files called Wdby_Rainfall for 3 rain gages named B2, B6, and Reavis. This is time-series data where the total rainfall is recorded at each timestamp. The location of each rain gage is listed as a latitude/longitude in each file.

Data from absolute (i.e. not vented) pressure transducers (Wdby_Pressure.zip), which can be used to constrain the time of passage of a flood or debris flow, are available in csv files called Wdby_Pressure for four drainages (B1, B6, Reavis 1, and Reavis 2). This is time-series data where the measured pressure in kilopascals is recorded at each timestamp. The location of each pressure transducer is listed as a latitude/longitude in each file.

Infiltration data are located in the csv file called WoodburyInfiltration.csv. The location of the measurement is listed as a latitude/longitude. Three measurement values are reported at each location: Saturated Hydraulic Conductivity (Ks) [mm/hr], Sorptivity (S) [mm/h^(1/2)], and pressure head (hf) [m]. The date of each measurement and soil burn severity class are also reported at each location, as well as a table explaining the burn-severity numerical class conversion.

Particle size analyses using laser diffraction (WoodburyLaserDiffractionSummary.zip) are located in the files called WoodburyLaserDiffractionSummary for the fine fraction (< 2 mm) of hillslope and debris flow deposits. The diameter of each particle size class is listed in the first column. All subsequent columns begin with the sample name. The value in each row is the percentage of the grain sizes in the size class. Location data for each of these samples are listed in the accompanying data table titled WoodburyParticleSizeSummary.csv.

The particle size data are summarized in the csv files (WoodburyParticleSizeSummary.zip) called WoodburyParticleSizeSummary for debris flow deposits and hillslope samples. These files group the raw data into more usable information. The sample name (Lab ID) is used to identify the laser diffraction data. The data columns (Lat) and (Lon) show the latitude and longitude of the sample locations. The total fraction of all the grain sizes, determined by sieving, is listed in three classes (Fraction < 16 mm, Fraction < 4 mm, Fraction < 2 mm). The fine fractions (< 2 mm) are also summarized in the columns (%Sand, %Silt, %Clay), as determined by laser diffraction. Samples within the burn area are identified with entries of Yes, whereas samples from unburned areas are marked No, indicating no burn. The median particle size (D50) is listed if the sample collected in the field was representative of the deposit. In some cases, large cobbles and boulders had to be removed from the sample because they were much too large to be included in sample bags that were brought back to the lab for analysis. The last column, labeled Description, contains notes about each sample.

Pebble count data (WoodburyPebbleCountsSummary.zip) are available in csv files called WoodburyPebbleCountsSummary for six drainages (U10 Fan, U10 Channel, U22 Channel, B1 Channel, B7 Fan, and U42 Fan). Here U represents unburned and B represents burned. The data name indicates whether the data come from a deposit located in a channel or a fan.
In each file the particle is numbered (Num) and the B-axis measurement of the particle is reported in centimeters. The location of each pebble count is listed as a latitude/longitude in each file. Channel width measurements for 23 channels are saved in unique shapefiles within the file called Channel_Width_Transects.zip. These width measurements were made using Digital Globe imagery from 19 October 2019. The study basins for the entire study can be found in the shapefile Woodbury_StudyBasins.shp. The attribute table, along with many morphometric and fire-related statistics for each basin, is also available in the file Woodbury_StudyBasins_Table.csv. A description of each column name in the table is available in the file Woodbury_StudyBasins_Table_descriptions.csv. Debris flow volumes were available in eleven drainage basins. The volume data are contained in the file Wdby_FlowVolume.csv in a column named Volume; the volume units are cubic meters. The other column is the Basin ID, which can be found in the shapefile Woodbury_StudyBasins.shp.
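A hedged sketch of working with one of the rainfall time series; the file name and the column names ("timestamp", "total_rainfall_mm") are hypothetical placeholders, so the actual headers in the Wdby_Rainfall csv files should be checked first.

```python
# Sketch: load one rain-gage time series from Wdby_Rainfall.zip and compute rainfall
# intensity between records. File and column names are hypothetical placeholders.
import pandas as pd

rain = pd.read_csv("Wdby_Rainfall_B6.csv", parse_dates=["timestamp"])
rain = rain.sort_values("timestamp")

# The description says total (cumulative) rainfall is recorded at each timestamp, so
# the incremental depth is the difference between successive records.
rain["depth"] = rain["total_rainfall_mm"].diff()
rain["hours"] = rain["timestamp"].diff().dt.total_seconds() / 3600.0
rain["intensity"] = rain["depth"] / rain["hours"]
print(rain[["timestamp", "depth", "intensity"]].head())
```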
i. .\File_Mapping.csv: This file relates historical reconstructed hydrology streamflow from the U.S. Army Corps of Engineers (2020) to the appropriate stochastic streamflow file for disaggregation of streamflow. Column A is an assigned ID; column B is named "Stochastic" and is the stochastic streamflow file needed for disaggregation; column C is called "RH_Ratio_Col" and is the name of the column in the reconstructed hydrology dataset associated with a stochastic streamflow file; and column D is named "Col_Num" and is the column number in the reconstructed hydrology dataset with the name given in column C.
ii. .\Original_Draw_YearDat.csv: This file contains the historical year from 1930 to 2017 with the closest total streamflow for the Souris River Basin to each year in the stochastic streamflow dataset. Column A is an index number; column B is named "V1" and is the year in a simulation; column C is called "V2" and is the stochastic simulation number; column D is an integer that can be related to historical years by adding 1929; and column E is named "year" and is the historical year with the closest total Souris River Basin streamflow volume to the associated year in the stochastic traces.
iii. .\revdrawyr.csv: This file is set up the same way as .\Original_Draw_YearDat.csv except that, when a year had over 400 occurrences, it was randomly replaced with one of the 20 other closest years. The replacement process was repeated until there were fewer than 400 occurrences of each reconstructed hydrology year associated with stochastic simulation years. Column A is an index number; column B is named "V1" and is the year in a simulation; column C is called "V2" and is the stochastic simulation number; column D is called "V3" and is the historical year whose streamflow ratios will be multiplied by stochastic streamflow; and column E is called "Stoch_yr" and is the sum of 2999 and the year in column B.
iv. .\RH_1930_2017.csv: This file contains the daily streamflow from the U.S. Army Corps of Engineers (2020) reconstructed hydrology for the Souris River Basin for the period 1930 to 2017. Column A is the date, and columns B through AA are the daily streamflow in cubic feet per second.
v. .\rhmoflow_1930Present.csv: This file was created from .\RH_1930_2017.csv and provides streamflow for each site in cubic meters for a given month. Column A is an unnamed index column; column B is the historical year; column C is the historical month associated with the historical year; column D provides a day equal to 1 but has no particular significance; and columns E through AD are the monthly streamflow volume for each site location.
vi. .\Stoch_Annual_TotVol_CubicDecameters.csv: This file contains the total volume of streamflow for each of the 26 sites for each month in the stochastic streamflow timeseries and provides a total streamflow volume divided by 100,000 on a monthly basis for the entire Souris River Basin. Column A is unnamed and contains an index number; column B is the month and is named "V1"; column C is the year in a simulation; column D is the simulation number; columns E through AD (V4 through V29) are streamflow volume in cubic meters; and column AE (V30) is the total Souris River Basin monthly streamflow volume in cubic decameters/1,000.
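As an illustration of the revdrawyr.csv structure described in item iii (column letters above map to the V1/V2/V3/Stoch_yr names; the presence of a header row is an assumption to check), a short sketch that verifies no historical year is paired with 400 or more stochastic simulation years:

```python
# Sketch: check the replacement constraint described for revdrawyr.csv, i.e. that each
# historical year ("V3") is associated with fewer than 400 stochastic simulation years.
# Assumes the file has a header row with the V1/V2/V3/Stoch_yr names; adjust if not.
import pandas as pd

revdraw = pd.read_csv("revdrawyr.csv")
counts = revdraw["V3"].value_counts()        # occurrences of each historical year
print(counts.head())
assert (counts < 400).all(), "some historical year is used 400 or more times"
```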
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains tagged accident data and is provided for reproducibility of our journal paper:
Pablo Moriano, Andy Berres, Haowen Xu, Jibonananda Sanyal. “Spatiotemporal Features of Traffic Help Reduce Automatic Accident Detection Time.” Expert Systems with Applications 244 (2024): 122813. https://doi.org/10.1016/j.eswa.2023.122813
The accompanying Data in Brief publication discusses the methodology behind the creation of these data.
Berres, Andy, Pablo Moriano, Haowen Xu, Sarah Tennille, Lee Smith, Jonathan Storey, and Jibonananda Sanyal. "A Traffic Accident Dataset for Chattanooga, Tennessee." Data in Brief (2024): 110675.
The zip folder annotatedData.zip contains two subfolders: allData and bestData. The bestData folder contains all data for which a full neighborhood of five sensors upstream and five sensors downstream is available, whereas allData includes everything from bestData as well as data with a smaller number of neighboring sensors. Each folder contains one subfolder called accidents and one subfolder called non-accidents. The accidents folder contains one file per accident. The non-accidents folder contains files for the same location, day of the week and time as a corresponding accident, for each week during which there was no accident impact on the traffic.
The file names in both folders are formatted as follows: yyyy-mm-dd-hhmm-rrrrrXaaa.a.csv, consisting of date (yyyy-mm-dd), time (hhmm in 24-hour format), and sensor name (rrrrrXaaa.a), which consists of road name (rrrrr; 5 alphanumerical characters), heading (X), and mile marker (aaa.a). For example, the file 2020-11-03-1611-00I24W182.8.csv contains data for an accident which occurred at 4:11 p.m. on November 3, 2020 on I-24 Westbound near the radar sensor at mile marker 182.8.
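A small sketch of parsing this file-name convention (the regex is written from the description above and is worth validating against a few real file names, e.g. if road names can contain lowercase characters):

```python
# Sketch: parse the yyyy-mm-dd-hhmm-rrrrrXaaa.a.csv naming convention described above.
import re

PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<time>\d{4})-"
    r"(?P<road>[0-9A-Z]{5})(?P<heading>[A-Z])(?P<mile>\d+\.\d)\.csv$"
)

m = PATTERN.match("2020-11-03-1611-00I24W182.8.csv")
print(m.groupdict())
# {'date': '2020-11-03', 'time': '1611', 'road': '00I24', 'heading': 'W', 'mile': '182.8'}
```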
The content of each CSV file is a timeseries of radar data beginning 15 minutes prior to the reported incident and ending 15 minutes after the reported incident. It also contains metadata, such as the accident type. Each CSV file contains the following columns:
incident at sensor(i): 1 for yes (accidents folder), 0 for no (non-accidents folder)
road: road name with heading, e.g. 00I24E
mile: mile marker of nearest radar sensor, e.g. 182.8
type: accident type, e.g. “Prop Damage (over)” for property damage exceeding a certain threshold. For non-accidents, the type is given as “None”.
date: date of the data sample. For accidents, this is the date on which the accident occurred. For non-accidents, this is the date for which the non-accident data sample is collected.
incident_time: time the reference accident was reported in hh:mm. This is the time which is provided in E-TRIMS as the time the 911 call was made.
incident_hour: just the hour from the incident_time, in integer format.
data_time: timestamp for the timeseries contained in the file in hh:mm:ss format. The timeseries consists of 30 second timesteps.
weather: weather during data_time, based on data collected from NASA POWER. We used dry bulb temperature (°C), precipitation (mm/h), and wind speed (m/s) from the raw NASA POWER data to produce the classifications of rain (at least 1 mm precipitation and temperatures above 2°C), snow (at least 1 mm precipitation and temperatures at or below 2°C), and wind (wind speeds over 30 mph or 13.5 m/s). If there were no inclement weather conditions, we set the category to "--". A sketch of this classification rule appears after this list.
light: light conditions during data_time. To produce this field, we collected sunrise, sunset, civil twilight start and civil twilight end times from https://sunrise-sunset.org, and derived the categories dawn, daylight, dusk, and dark using these start and end times.
The last 33 columns contain radar data for the 11 sensors surrounding the accident or non-accident. For each sensor, we collected speed (mean over the 30-second interval in miles per hour, or empty if no vehicles passed), volume (count of all vehicles passing during the 30-second interval), and occupancy (mean % occupancy over the 30-second interval). These three variables are grouped in triples of speed(k), volume(k), occupancy(k), where k indicates the sensor position relative to the closest sensor i to the incident; k < i indicates upstream sensors and k > i indicates downstream sensors. For example, speed(i-5) refers to the mean speed at the sensor which is 5 hops upstream from the accident, and volume(i+1) refers to the number of vehicles at the sensor immediately downstream from the accident.
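A sketch of the weather classification rule stated in the weather column description above. The thresholds follow the text; the function and variable names are mine, and the precedence when both precipitation and high wind occur is an assumption (precipitation wins here), since the text does not specify it.

```python
# Sketch: the rain/snow/wind/"--" classification rule described for the weather column.
# Thresholds follow the text above; names and precedence are illustrative assumptions.
def classify_weather(temp_c: float, precip_mm_per_h: float, wind_m_per_s: float) -> str:
    if precip_mm_per_h >= 1.0:
        return "rain" if temp_c > 2.0 else "snow"
    if wind_m_per_s > 13.5:          # ~30 mph
        return "wind"
    return "--"                      # no inclement weather

print(classify_weather(temp_c=5.0, precip_mm_per_h=2.3, wind_m_per_s=4.0))    # rain
print(classify_weather(temp_c=1.0, precip_mm_per_h=1.5, wind_m_per_s=4.0))    # snow
print(classify_weather(temp_c=10.0, precip_mm_per_h=0.0, wind_m_per_s=15.0))  # wind
```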
The folder metaData.zip contains the following files:
Accidents.csv: cleaned-up accidents file with all accidents which happened on Chattanooga area highways between November 1, 2020 and April 29, 2021. We have removed accidents which happened on non-highway roads, and we have corrected the timestamps (which were in 12-hour format but missing a.m./p.m. markers) by cross-referencing light and weather conditions.
WeatherDict.json: a dictionary containing the weather data synthesized from NASA POWER.
LightDict.json: a dictionary containing the light data synthesized from Sunrise-and-Sunset.
SensorTopology.csv: neighborhood information for each radar sensor in the Chattanooga area.
SensorZones.geojson: polygons used to determine the nearest radar sensor for each accident location. Each polygon is tagged with the corresponding radar sensor’s name.
As of November 2025, there were a reported 4,165 data centers in the United States, the most of any country worldwide. A further 499 were located in the United Kingdom, while 487 were located in Germany.

What is a data center?
Data centers are facilities designed to store and compute vast amounts of data efficiently and securely. Growing in importance amid the rise of cloud computing and artificial intelligence, data centers form the core infrastructure powering global digital transformation. Modern data centers consist of critical computing hardware such as servers, storage systems, and networking equipment organized into racks, alongside specialized secondary infrastructure providing power, cooling, and security.

AI data centers
Data centers are vital for artificial intelligence, with the world's leading technology companies investing vast sums in new facilities across the globe. Purpose-built AI data centers provide the immense computing power required to train the most advanced AI models, as well as to process user requests in real time, a task known as inference. Increasing attention has therefore turned to the location of these powerful facilities, as governments grow more concerned with AI sovereignty. At the same time, rapid data center expansion has sparked a global debate over resource use, including land, energy, and water, as modern facilities begin to strain local infrastructure.
The Global Ecosystem Dynamics Investigation (GEDI) mission aims to characterize ecosystem structure and dynamics to enable radically improved quantification and understanding of the Earth's carbon cycle and biodiversity. The GEDI instrument produces high-resolution laser ranging observations of the 3-dimensional structure of the Earth. GEDI is attached to the International Space Station (ISS) and collects data globally between 51.6° N and 51.6° S latitudes at the highest resolution and densest sampling of any light detection and ranging (lidar) instrument in orbit to date. Each GEDI Version 2 granule encompasses one-fourth of an ISS orbit and includes georeferenced metadata to allow for spatial querying and subsetting.

The GEDI instrument was removed from the ISS and placed into storage on March 17, 2023. No data were acquired during the hibernation period from March 17, 2023, to April 24, 2024. GEDI has since been reinstalled on the ISS and resumed operations as of April 26, 2024.

The GEDI Level 1B Geolocated Waveforms product (GEDI01_B) provides geolocated corrected and smoothed waveforms, geolocation parameters, and geophysical corrections for each laser shot for all eight GEDI beams. GEDI01_B data are created by geolocating the GEDI01_A raw waveform data. The GEDI01_B product is provided in HDF5 format and has a spatial resolution (average footprint) of 25 meters. The GEDI01_B data product contains 85 layers for each of the eight beams, including the geolocated corrected and smoothed waveform datasets and parameters and the accompanying ancillary, geolocation, and geophysical corrections. Additional information can be found in the GEDI L1B Product Data Dictionary.

Known Issues:
Data acquisition gaps: GEDI data acquisitions were suspended on December 19, 2019 (2019 Day 353) and resumed on January 8, 2020 (2020 Day 8).
Incorrect Reference Ground Track (RGT) number in the filename for select GEDI files: GEDI Science Data Products for six orbits on August 7, 2020, and November 12, 2021, had the incorrect RGT number in the filename. There is no impact to the science data, but users should reference this document for the correct RGT numbers.
Known Issues: Section 8 of the User Guide provides additional information on known issues.

Improvements/Changes from Previous Versions:
Metadata has been updated to include spatial coordinates.
Granule size has been reduced from one full ISS orbit (~7.99 GB) to four segments per orbit (~2.00 GB).
Filename has been updated to include segment number and version number.
Improved geolocation for an orbital segment.
Added elevation from the SRTM digital elevation model for comparison.
Modified the method to predict an optimum algorithm setting group per laser shot.
Added additional land cover datasets related to phenology, urban infrastructure, and water persistence.
Added selected_mode_flag dataset to root beam group using selected algorithm.
Removed shots when the laser is not firing.
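A hedged sketch of opening a GEDI01_B granule with h5py. The beam-group naming (BEAM0000 etc.), the rxwaveform / rx_sample_start_index / rx_sample_count layout, and the 1-based start index are assumptions drawn from the product family and should be verified against the GEDI L1B Product Data Dictionary; the file name is a placeholder.

```python
# Sketch: inspect a GEDI01_B HDF5 granule and pull one shot's received waveform.
# Dataset names below are assumptions to be checked against the L1B data dictionary.
import h5py
import numpy as np

with h5py.File("GEDI01_B_granule.h5", "r") as f:          # placeholder file name
    beams = [k for k in f.keys() if k.startswith("BEAM")]
    print("beam groups:", beams)

    beam = f[beams[0]]
    start = int(beam["rx_sample_start_index"][0]) - 1      # assumed 1-based indexing
    count = int(beam["rx_sample_count"][0])
    waveform = np.asarray(beam["rxwaveform"][start:start + count])
    print(f"first shot: {count} waveform samples, mean amplitude {waveform.mean():.1f}")
```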
Sentinel2GlobalLULC is a deep-learning-ready dataset of RGB images from the Sentinel-2 satellites designed for global land use and land cover (LULC) mapping. Sentinel2GlobalLULC v2.1 contains 194,877 images in GeoTiff and JPEG format corresponding to 29 broad LULC classes. Each image has 224 x 224 pixels at 10 m spatial resolution and was produced by assigning the 25th percentile of all available observations in the Sentinel-2 collection between June 2015 and October 2020 in order to remove atmospheric effects (i.e., clouds, aerosols, shadows, snow, etc.). A spatial purity value was assigned to each image based on the consensus across 15 different global LULC products available in Google Earth Engine (GEE).

Our dataset is structured into 3 main zip-compressed folders, an Excel file with a dictionary for class names and descriptive statistics per LULC class, and a python script to convert RGB GeoTiff images into JPEG format. The first folder, "Sentinel2LULC_GeoTiff.zip", contains 29 zip-compressed subfolders, each corresponding to a specific LULC class with hundreds to thousands of GeoTiff Sentinel-2 RGB images. The second folder, "Sentinel2LULC_JPEG.zip", contains 29 zip-compressed subfolders with a JPEG-formatted version of the same images provided in the first main folder. The third folder, "Sentinel2LULC_CSV.zip", includes 29 zip-compressed CSV files with as many rows as provided images and 12 columns containing the following metadata (this same metadata is provided in the image filenames):

Land Cover Class ID: the identification number of each LULC class
Land Cover Class Short Name: the short name of each LULC class
Image ID: the identification number of each image within its corresponding LULC class
Pixel purity Value: the spatial purity of each pixel for its corresponding LULC class, calculated as the spatial consensus across up to 15 land-cover products
GHM Value: the spatial average of the Global Human Modification index (gHM) for each image
Latitude: the latitude of the center point of each image
Longitude: the longitude of the center point of each image
Country Code: the Alpha-2 country code of each image as described in the ISO 3166 international standard. To understand the country codes, we recommend visiting the following website, which lists the Alpha-2 code for each country as described in the ISO 3166 international standard: https://www.iban.com/country-codes
Administrative Department Level1: the administrative level 1 name to which each image belongs
Administrative Department Level2: the administrative level 2 name to which each image belongs
Locality: the name of the locality to which each image belongs
Number of S2 images: the number of instances found in the corresponding Sentinel-2 image collection between June 2015 and October 2020 when compositing and exporting the corresponding image tile

For seven LULC classes, we could not export from GEE all images that fulfilled a spatial purity of 100%, since there were millions of them. In this case, we exported a stratified random sample of 14,000 images and provided an additional CSV file with the images actually contained in our dataset. That is, for these seven LULC classes, we provide these 2 CSV files:
A CSV file that contains all exported images for this class
A CSV file that contains all images available for this class at a spatial purity of 100%, both the ones exported and the ones not exported, in case the user wants to export them. These CSV filenames end with "including_non_downloaded_images".

To clearly state the geographical coverage of images available in this dataset, we included in version v2.1 a compressed folder called "Geographic_Representativeness.zip". This zip-compressed folder contains a csv file for each LULC class that provides the complete list of countries represented in that class. Each csv file has two columns: the first gives the country code and the second gives the number of images provided in that country for that LULC class. In addition to these 29 csv files, we provide another csv file that maps each ISO Alpha-2 country code to its original full country name.

© Sentinel2GlobalLULC Dataset by Yassir Benhammou, Domingo Alcaraz-Segura, Emilio Guirado, Rohaifa Khaldi, Boujemâa Achchab, Francisco Herrera & Siham Tabik is marked with Attribution 4.0 International (CC-BY 4.0)
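A hedged sketch of using one of the per-class metadata CSV files; the header spellings follow the field list above but should be verified against the actual files, and the path is a placeholder.

```python
# Sketch: load one per-class metadata CSV from Sentinel2LULC_CSV.zip and keep only
# highly pure images. Header names follow the field list above but should be verified.
import pandas as pd

meta = pd.read_csv("Sentinel2LULC_CSV/class_01.csv")       # placeholder path

pure = meta[meta["Pixel purity Value"] >= 95]              # e.g. keep >= 95% spatial purity
by_country = pure.groupby("Country Code").size().sort_values(ascending=False)
print(f"{len(pure)} high-purity images across {by_country.size} countries")
print(by_country.head())
```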
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
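A minimal sketch of the two country-detection approaches described above, i.e. a regular-expression search over a fixed country-name list plus spaCy named entity recognition. The country list here is truncated for illustration and the spaCy model name is the standard small English model, not necessarily the one used by the authors; in practice, GPE entities would still need to be mapped onto country names.

```python
# Sketch: detect countries of study in an abstract via (1) regex over a fixed name list
# and (2) spaCy named entity recognition (GPE entities). Illustrative only.
import re
import spacy

COUNTRY_NAMES = ["Kenya", "India", "Brazil", "United States"]   # truncated ISO 3166 list
country_pattern = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRY_NAMES)) + r")\b")

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def countries_of_study(text: str) -> set[str]:
    found = set(country_pattern.findall(text))                      # approach 1: regex
    doc = nlp(text)
    found |= {ent.text for ent in doc.ents if ent.label_ == "GPE"}  # approach 2: NER
    return found

abstract = "We analyse household survey data from Kenya and Brazil collected in 2015."
print(countries_of_study(abstract))
```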
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to MTurk workers).
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
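A hedged sketch of the kind of DistilBERT-based classifier described above, using the Hugging Face transformers wrapper around PyTorch; the checkpoint, label convention, and threshold handling are placeholders rather than the authors' exact configuration.

```python
# Sketch: encode an abstract with DistilBERT and score it with a binary "uses data"
# classification head. The head here is untrained; in the study it would be fine-tuned
# on the hand-labeled abstracts before its probabilities are meaningful.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2      # labels: 0 = no data, 1 = uses data
)
model.eval()

abstract = "Using a nationally representative household survey, we estimate ..."
inputs = tokenizer(abstract, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

uses_data_prob = probs[0, 1].item()
print(f"P(uses data) = {uses_data_prob:.2f}  ->  label: {int(uses_data_prob >= 0.9)}")
```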
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model perform the same kind of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). In case that serious flaws are detected in this early release (called ERA5T), this data could be different from the final release 2 to 3 months later. So far this has only been the case for the month September 2021, while it will also be the case for October, November and December 2021. For months prior to September 2021 the final release has always been equal to ERA5T, and the goal is to align the two again after December 2021. ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). In case that serious flaws are detected in this early release (called ERA5T), this data could be different from the final release 2 to 3 months later. In case that this occurs users are notified. The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main sub sets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 monthly mean data on pressure levels from 1940 to present".
Discover the ultimate resource for your B2B needs with our meticulously curated dataset, featuring 148MM+ highly relevant US B2B Contact Data records and associated company information.
Very high fill rates for Phone Number, including for Mobile Phone!
This encompasses a diverse range of fields, including Contact Name (First & Last), Work Address, Work Email, Personal Email, Mobile Phone, Direct-Dial Work Phone, Job Title, Job Function, Job Level, LinkedIn URL, Company Name, Domain, Email Domain, HQ Address, Employee Size, Revenue Size, Industry, NAICS and SIC Codes + Descriptions, ensuring you have the most detailed insights for your business endeavors.
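To make the record layout concrete, the sketch below filters a hypothetical CSV export of these contact records by job level and industry with pandas. The file name and exact column headers are assumptions based on the fields listed above; a real export may name them differently.

```python
# Illustrative only: load a hypothetical CSV export of the contact records and
# narrow it down to director-level contacts in a target industry.
# Column names are assumed from the field list above, not from the vendor's schema.
import pandas as pd

contacts = pd.read_csv("b2b_contacts_sample.csv")  # hypothetical file name

prospects = contacts[
    (contacts["Job Level"] == "Director")
    & (contacts["Industry"] == "Computer Software")
    & contacts["Work Email"].notna()
]

# Keep only the columns needed for an outreach list.
outreach = prospects[["First Name", "Last Name", "Job Title", "Company Name",
                      "Work Email", "Direct-Dial Work Phone", "LinkedIn URL"]]
outreach.to_csv("director_prospects.csv", index=False)
```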
Key Features:
Extensive Data Coverage: Access a vast pool of B2B Contact Data records, providing valuable information on where the contacts work now, empowering your sales, marketing, recruiting, and research efforts.
Versatile Applications: Leverage this robust dataset for Sales Prospecting, Lead Generation, Marketing Campaigns, Recruiting initiatives, Identity Resolution, Analytics, Research, and more.
Phone Number Data Inclusion: Benefit from our comprehensive Phone Number Data, ensuring you have direct and effective communication channels. Explore our Phone Number Datasets and Phone Number Databases for an even more enriched experience.
Flexible Pricing Models: Tailor your investment to match your unique business needs, data use-cases, and specific requirements. Choose from targeted lists, CSV enrichment, or licensing our entire database or subsets to seamlessly integrate this data into your products, platform, or service offerings.
Strategic Utilization of B2B Intelligence:
Sales Prospecting: Identify and engage with the right decision-makers to drive your sales initiatives.
Lead Generation: Generate high-quality leads with precise targeting based on specific criteria.
Marketing Campaigns: Amplify your marketing strategies by reaching the right audience with targeted campaigns.
Recruiting: Streamline your recruitment efforts by connecting with qualified candidates.
Identity Resolution: Enhance your data quality and accuracy by resolving identities with our reliable dataset.
Analytics and Research: Fuel your analytics and research endeavors with comprehensive and up-to-date B2B insights.
Access Your Tailored B2B Data Solution:
Reach out to us today to explore flexible pricing options and discover how Salutary Data Company Data, B2B Contact Data, B2B Marketing Data, B2B Email Data, Phone Number Data, Phone Number Datasets, and Phone Number Databases can transform your business strategies. Elevate your decision-making with top-notch B2B intelligence.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains five Unicode Origin Graph (.opju) files, which can be opened with the Origin 2021 software. Details of the five data files are as follows:
1. Fig.2-data.opju contains the original information from Figure 2 in the associated article. It holds two workbooks: one with the density distributions and error-band information for the training set, the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set and containing the density and error-band information for that nucleus. Each table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the charge density distribution predicted by FNN-3I; the error-band value of the charge density predicted by FNN-3I; the charge density distribution predicted by FNN-4I; the error-band value of the charge density predicted by FNN-4I; and the experimental charge density distribution (2pF model). Each row gives the density and error-band values at the corresponding r. The procedure used to obtain these data is described in the associated article.
2. Fig.3-data.opju contains the original information from Figure 3 in the associated article. It holds two workbooks: one with the density distributions for the training set, the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set. Each table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the average charge density obtained by the density-averaging method; the average charge density obtained by the parameter-averaging method; the charge density from the network with the smallest loss function; the charge density from the network with the largest loss function; and the experimental charge density distribution (2pF model). Each row gives the density values at the corresponding r. The procedure used to obtain these data is described in the associated article.
3. Fig.4-data.opju contains the original information from Figure 4 in the associated article. It holds two workbooks: one with the density distributions and error-band information for the training set, the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set. Each table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the average charge density obtained by the density-averaging method; the error-band value obtained by the density-averaging method; the average charge density obtained by the parameter-averaging method; the error-band value obtained by the parameter-averaging method; and the experimental charge density distribution (2pF model). Each row gives the density and error-band values at the corresponding r. The procedure used to obtain these data is described in the associated article.
4. Fig.5 6 7-data.opju contains the original information from Figures 5, 6 and 7 in the associated article. It consists of two tables, one with the training results for the training set and the other with the prediction results for the prediction set. The training-set table has 15 columns and 86 rows, where the columns are: proton number; neutron number; the experimental value of parameter c; the experimental value of parameter z; the predicted value of parameter c; the predicted value of parameter z; the experimental charge radius R (2pF model); the predicted charge radius R; the difference between the experimental and predicted charge radius R; the experimental second moment of charge (2pF model); the predicted second moment of charge; the difference of the second moment of charge; the experimental fourth moment of charge (2pF model); the predicted fourth moment of charge; and the difference of the fourth moment of charge. Each row holds the data of one nucleus. The prediction-set table has 7 columns and 284 rows, each row containing the data of a single nucleus. The seven columns are: proton number; neutron number; mass number; the predicted value of parameter c; the predicted value of parameter z; the predicted charge radius R; and the experimental charge radius R.
5. Fig.8-data.opju contains the original information from Figure 8 in the associated article. It has a single workbook with several tables, each containing the information for one calcium isotope and named after that isotope. Each table has 3 columns and 1000 rows: the radial coordinate r, in fm; the charge density value; and the charge density error-band value. Each row corresponds to one plotted coordinate point and its error band.
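Since the tables repeatedly refer to the 2pF (two-parameter Fermi) model together with parameters c and z, the charge radius R and the second and fourth moments of charge, the sketch below shows how these quantities are conventionally computed from c and z. The parameterization follows the standard 2pF form and the numerical values are illustrative placeholders, not taken from the dataset itself.

```python
# Illustrative sketch of the standard two-parameter Fermi (2pF) charge density,
#   rho(r) = rho0 / (1 + exp((r - c) / z)),
# and of the radial moments <r^2> and <r^4> referred to in the Fig.5/6/7 tables.
# The parameter values below are placeholders, not values from the dataset.
import numpy as np
from scipy.integrate import quad

def pf2_density(r, c, z):
    """Unnormalized 2pF density profile (r, c, z in fm)."""
    return 1.0 / (1.0 + np.exp((r - c) / z))

def radial_moment(n, c, z):
    """<r^n> = integral of r^n * rho(r) * r^2 dr, divided by the normalization integral."""
    num, _ = quad(lambda r: r**n * pf2_density(r, c, z) * r**2, 0.0, 30.0)
    den, _ = quad(lambda r: pf2_density(r, c, z) * r**2, 0.0, 30.0)
    return num / den

c, z = 3.6, 0.52               # placeholder 2pF parameters in fm
r2 = radial_moment(2, c, z)    # second moment of charge <r^2>
r4 = radial_moment(4, c, z)    # fourth moment of charge <r^4>
R = np.sqrt(r2)                # rms charge radius R = sqrt(<r^2>)
print(f"R = {R:.3f} fm, <r^2> = {r2:.3f} fm^2, <r^4> = {r4:.3f} fm^4")
```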