Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under native physiological conditions. They participate in critical biological processes and are therefore associated with the pathogenesis of many severe human diseases. Identifying IDPs/IDRs and their functions supports a comprehensive understanding of protein structure and function and informs rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence information has widened the gap between uncharacterized and annotated disordered sequences. Protein language models have recently demonstrated a powerful ability to capture complex structural and functional information from enormous quantities of unlabelled protein sequences, providing an opportunity to uncover intrinsic disorder and its biological properties directly from amino acid sequences. In this study, we propose a computational predictor called IDP-LM for predicting intrinsic disorder and disorder functions by leveraging pre-trained protein language models. IDP-LM takes the embeddings extracted from three pre-trained protein language models as its exclusive inputs: ProtBERT, ProtT5, and a disorder-specific language model (IDP-BERT). The ablation analysis showed that IDP-BERT provides fine-grained feature representations of disorder, and that the combination of the three language models is the key to the performance improvement of IDP-LM. Evaluation results on independent test datasets demonstrated that IDP-LM provides high-quality predictions of intrinsic disorder and four common disordered functions.
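As a minimal illustration of the kind of inputs IDP-LM consumes (not the authors' pipeline), the sketch below extracts per-residue embeddings from the publicly available ProtBERT checkpoint with the Hugging Face transformers library; embeddings from ProtT5 and IDP-BERT would be extracted analogously and concatenated residue-wise. The checkpoint name and the space-separated sequence format follow the public Rostlab release; treat everything else as an assumption.

```python
# Sketch: per-residue embeddings from ProtBERT (Rostlab/prot_bert) via Hugging Face transformers.
# Illustrative only, not the IDP-LM pipeline; ProtT5/IDP-BERT embeddings would be extracted
# the same way and concatenated per residue before being fed to the predictor.
import re
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = AutoModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"        # example amino-acid sequence
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))    # ProtBERT expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (1, seq_len + special tokens, 1024)

# Drop [CLS]/[SEP] to keep one 1024-dimensional embedding per residue.
per_residue = hidden[0, 1:-1, :]
print(per_residue.shape)                                 # torch.Size([len(sequence), 1024])
```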
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set was generated in accordance with semiconductor-industry practice and contains sensor recordings from high-precision, high-tech production equipment. Semiconductor production consists of hundreds of process steps performing physical and chemical operations on so-called wafers, i.e., thin slices of semiconductor material. Typically, wafers are aggregated into so-called lots of size 25, which always pass through the same operations in the production chain.
In the production chain, each piece of process equipment is fitted with several sensors recording physical parameters such as gas flow, temperature, and voltage, resulting in so-called sensor data recorded during each process step. From these time-dependent sensor data, so-called key numbers (KNs) are extracted over a time frame in the individual sensor recordings judged by experts to be important for the process. To keep the entire production as stable as possible, the KNs are used to intervene in case of deviations.
After production, each device on the wafer is tested thoroughly, resulting in so-called wafer test data. In some cases, suspicious patterns occur in the wafer test data, potentially leading to failures. In such cases, the root cause must be found in the production chain. For this purpose, the given KNs are provided. The aim is to find correlations between the wafer test data and the KNs in order to identify the root cause.
The given data is divided into three data sets: "process1.csv", "process2.csv" and "response.csv". "process1.csv" and "process2.csv" contain the KNs extracted from two pieces of process equipment. The "response.csv" data set contains the corresponding wafer test data. For unique identification, the first two columns in each data set are the lot number and the wafer number, respectively.
The exact column structure is as follows.
For "process1.csv" and "process2.csv":
lot: the lot number
wafer: the wafer number
KN1: the recordings of the first sensor
KN2: the recordings of the second sensor
...
KN50: the recordings of the last sensor
"KN1"-"KN36" belong to "process1" and "KN37"-"KN50" belong to "process2".
For "response.csv":
lot: the lot number
wafer: the wafer number
response: the numerical test values
class: the "good"/"bad" classification depending on the response value (threshold: 0.75)
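A minimal sketch of how these three files could be joined and screened for correlations, assuming they are plain comma-separated files with the column names listed above (paths are placeholders; delimiter and encoding should be checked):

```python
# Sketch: join the two KN files with the wafer test data on (lot, wafer) and rank KNs
# by their correlation with the numerical test response. File paths are placeholders.
import pandas as pd

process1 = pd.read_csv("process1.csv")   # lot, wafer, KN1..KN36
process2 = pd.read_csv("process2.csv")   # lot, wafer, KN37..KN50
response = pd.read_csv("response.csv")   # lot, wafer, response, class

merged = (
    process1
    .merge(process2, on=["lot", "wafer"])
    .merge(response, on=["lot", "wafer"])
)

kn_cols = [c for c in merged.columns if c.startswith("KN")]
correlations = (
    merged[kn_cols]
    .corrwith(merged["response"])
    .abs()
    .sort_values(ascending=False)
)
print(correlations.head(10))             # KNs most strongly associated with the wafer test response
```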
This dataset contains all data and code necessary to reproduce the analysis described under the heading "Experiment 3" in the manuscript: Taliercio, E., Eickholt, D., Read, Q. D., Carter, T., Waldeck, N., & Fallen, B. (2023). Parental choice and seed size impact the uprightness of progeny from interspecific Glycine hybridizations. Crop Science. https://doi.org/10.1002/csc2.21015

The attached files are:

G_max_G_soja_seedweight_seedcolor_analysis.Rmd: RMarkdown notebook containing all analysis code. The CSV data files should be placed in a subdirectory called data within the working directory from which the notebook is rendered.

G_max_G_soja_seedweight_seedcolor_analysis.html: Rendered HTML output from the RMarkdown notebook, including figures, tables, and explanatory text.

counts_seedwt.csv: CSV file containing the number of progeny selected and average 100-seed weight data for each combination of cross, size class, and replicate. Columns are:
F3_location: text identifier of F3 nursery location, either "CLA" or "FF"
plot: numeric ID of plot
pop: numeric ID of population
max: name of G. max parent
soja: name of G. soja parent
F2_location: text identifier of F2 nursery location, either "Caswell" or "Hugo"
n_planted: number of seeds planted (raw)
n_selected: number of progeny selected
size_ordered: seed size class, to be converted to an ordered factor
size_combined: seed size class aggregated to fewer unique levels
ave_100sw: average 100-seed weight for the given size class
n_planted_trials: number of seeds planted, rounded to the nearest integer

seedcolor.csv: CSV file with additional data on the number of seeds of each color by population. Columns are:
cross: text identifier of cross
line: text identifier of line
light: number of light seeds
mid: number of mid-green seeds
brown: number of brown seeds
dark: number of dark or black seeds
population: identifier of population type (F2 derived or selected)
max: name of G. max parent
n_total: sum of the light, mid, brown, and dark columns
soja: name of G. soja parent

The data processing and analysis pipeline in the RMarkdown notebook includes:
Importing the data (slightly cleaned version is provided)
Creating boxplots of proportion selected by cross, nursery location, and size class
Fitting a logistic GLMM to estimate the probability of selection as a function of parent, 100-seed weight, and their interactions
Extracting and plotting random effect estimates from the model
Calculating and plotting estimated marginal means from the model
Taking contrasts between pairs of estimated marginal means and trends
Calculating Bayes factors associated with the contrasts
Generating figures and tables for all of the above results
Additional seed color analysis: importing data (slightly cleaned version is provided)
Additional seed color analysis: drawing an exploratory bar plot
Additional seed color analysis: fitting a multinomial GLM modeling the proportion of seeds of each color as a function of population
Additional seed color analysis: generating expected value predictions from the GLM and taking contrasts
Additional seed color analysis: creating figures and tables for model results

This research was funded by CRIS 6070-21220-069-00D, United Soybean Board Project # 2333-203-0101, and falls under National Program NP301.

Resources in this dataset:
Resource Title: RMarkdown document with all analysis code. File Name: G_max_G_soja_seedweight_seedcolor_analysis.Rmd
Resource Title: Rendered HTML version of notebook. File Name: G_max_G_soja_seedweight_seedcolor_analysis.html
Resource Title: Progeny counts and seed weight data. File Name: counts_seedwt.csv
Resource Title: Seed color counts data. File Name: seedcolor.csv
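The formal analysis (logistic GLMM, marginal-mean contrasts, Bayes factors) lives in the RMarkdown notebook; purely as an illustration of the counts_seedwt.csv layout, here is a hedged Python sketch that computes the raw proportion selected per cross and size class. Column names follow the list above; the file path is assumed.

```python
# Sketch: exploratory summary of counts_seedwt.csv (not the published GLMM analysis,
# which is implemented in the accompanying RMarkdown notebook).
import pandas as pd

counts = pd.read_csv("data/counts_seedwt.csv")

# Raw selection proportion per row, then averaged within cross x size class.
counts["prop_selected"] = counts["n_selected"] / counts["n_planted"]
summary = (
    counts
    .groupby(["max", "soja", "size_combined"], as_index=False)
    .agg(mean_prop_selected=("prop_selected", "mean"),
         mean_100_seed_wt=("ave_100sw", "mean"))
)
print(summary.head())
```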
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the calculation folder: each file contains a matrix called “MATR”. Each row of the matrix “MATR” is a trial.
The columns contain the following information:
1st: Number of trial
2nd: Subject response
4th: Response time
5th: first number
6th: math symbol (1=*; 2= +; 3= –)
7th: second number
8th: third number
In the calculation folder: each file contains a matrix called “matr”. Each row of the matrix “matr” is a trial.
The columns contain the following information:
1st: subject response in the numerosity task
2nd: the presented numerosity
3rd: subject response in the numerosity task
4th: zero
5th: stimulus duration
6th: Response time in the numerosity task
7th: Grouped (1) or random (2) presentation
8th: 1
9th: 1
10th: Number of items of the upper-left quadrant
11th: Number of items of the lower-left quadrant
12th: Number of items of the upper-right quadrant
13th: Number of items of the lower-right quadrant
14th: odd shape presented (1=diamond; 2=triangle; 3=circle)
15th: subject response in the shape task
16th: 0.2 in the single task; response time in the shape task in the dual task
17th: single (0) or dual (1) task
18th: time stimulus on
19th: time stimulus off
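The matrices MATR and matr suggest MATLAB-style data files; assuming the files are .mat files (an assumption, since the format is not stated above), a minimal loading sketch looks like this, with a hypothetical file name:

```python
# Sketch: load one trial matrix, assuming MATLAB .mat files (format not stated in the
# descriptions above; adjust if the data are stored differently). File name is hypothetical.
import scipy.io as sio

data = sio.loadmat("calculation/subject01.mat")   # hypothetical path
matr = data["MATR"]                               # trials x columns (use "matr" for the numerosity files)

# Column indices follow the listings above (0-based here): e.g. in the calculation task,
# column 1 is the subject response and column 3 is the response time.
responses = matr[:, 1]
print(matr.shape, responses[:5])
```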
The US Consumer Phone file contains phone numbers, mobile and landline, tied to an individual in the Consumer Database. The fields available include the phone number, phone type, mobile carrier, and Do Not Call registry status.
All phone numbers can be processed and cleansed using telecom carrier data. The telecom data, including phone and texting activity, porting instances, carrier scoring, spam, and known fraud activity, feed a proprietary Phone Quality Level (PQL), a data-science-derived score designed to ensure high deliverability at a fraction of the cost compared to competitors.
We have developed this file to be tied to our Consumer Demographics Database so additional demographics can be applied as needed. Each record is ranked by confidence and only the highest quality data is used.
Note - all Consumer packages can include necessary PII (address, email, phone, DOB, etc.) for merging, linking, and activation of the data.
BIGDBM Privacy Policy: https://bigdbm.com/privacy.html
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three incidents, sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches
Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

Largest data exposures worldwide
In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This is by far the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records; then later, in 2017, it came up with an updated number of leaked records, which was three billion. In March 2018, the third biggest data breach happened, involving India's national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
Each data set name incorporates the date on which the data were acquired. There are a total of 6 data sets, as shown in the table below (the current one, 20201112, is among them). Each data set contains a series of txt files (the current one has 30) with relevant information about the data set, such as the magnetic field (both relative B* and absolute values), the FQH state, and a table with the raw data file names and the corresponding quantities of interest (input square-wave DC offset + amplitude in volts, set temperature in kelvin, mean/uncertainty of the temperature in kelvin, signal samples for averaging, resistor values in ohms, gain, frequency in hertz, time stamp, and time difference between acquisitions in seconds). The name of each information file is composed of the date on which the data set was acquired, the cooldown number, the sample name/number, and the conductance (G), followed by the file number (incremented by 1). Furthermore, each data set contains all the raw data saved in txt files (the current one has 570). The name of each raw data file begins with Zurich (ZH), followed by the cooldown number, the date on which the data set was acquired, the row number of the data matrix (one row per thermal bath, i.e. dilution-refrigerator temperature), and the column number of the data matrix (one column per input voltage, i.e. square-wave DC offset + amplitude at the same thermal-bath temperature). Each raw data file contains 2 columns. The first column is the time in seconds. The second column is the voltage drop across the sensing 1 kilo-ohm resistor in volts (which translates to the current through the Corbino device). This is the raw response (to the bipolar square-wave input) of the 2D electron gas (2DEG) measured by the Zurich digitizer. Raw data to download: ZH-20C1-1112-1-00000.txt up to ZH-20C1-1112-30-00018.txt (570 files in total).
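A minimal sketch for reading one of these raw files and converting the sensed voltage to current through the 1 kOhm resistor (assuming whitespace- or tab-delimited columns, which should be verified against an actual file):

```python
# Sketch: parse a raw-data file name and load its two columns (time [s], voltage [V]),
# then convert the voltage across the 1 kOhm sense resistor to current. The delimiter
# is assumed to be whitespace; check an actual file before relying on this.
import numpy as np

fname = "ZH-20C1-1112-1-00000.txt"
_, cooldown, date, row, col = fname.removesuffix(".txt").split("-")
print(f"cooldown {cooldown}, date {date}, temperature row {row}, voltage column {int(col)}")

t, v_sense = np.loadtxt(fname, unpack=True)   # column 1: time [s], column 2: voltage [V]
current = v_sense / 1e3                       # I = V / R with R = 1 kOhm, in amperes
print(t[:3], current[:3])
```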
The dataset contains sales data of an automobile company.
Do explore the pinned 📌 notebook under the Code section for a quick EDA 📊 reference.
Consider an upvote if you find the dataset useful.
Data Description
| Column Name | Description |
|---|---|
| ORDERNUMBER | This column represents the unique identification number assigned to each order. |
| QUANTITYORDERED | It indicates the number of items ordered in each order. |
| PRICEEACH | This column specifies the price of each item in the order. |
| ORDERLINENUMBER | It represents the line number of each item within an order. |
| SALES | This column denotes the total sales amount for each order, which is calculated by multiplying the quantity ordered by the price of each item. |
| ORDERDATE | It denotes the date on which the order was placed. |
| DAYS_SINCE_LASTORDER | This column represents the number of days that have passed since the last order for each customer. It can be used to analyze customer purchasing patterns. |
| STATUS | It indicates the status of the order, such as "Shipped," "In Process," "Cancelled," "Disputed," "On Hold," or "Resolved." |
| PRODUCTLINE | This column specifies the product line categories to which each item belongs. |
| MSRP | It stands for Manufacturer's Suggested Retail Price and represents the suggested selling price for each item. |
| PRODUCTCODE | This column represents the unique code assigned to each product. |
| CUSTOMERNAME | It denotes the name of the customer who placed the order. |
| PHONE | This column contains the contact phone number for the customer. |
| ADDRESSLINE1 | It represents the first line of the customer's address. |
| CITY | This column specifies the city where the customer is located. |
| POSTALCODE | It denotes the postal code or ZIP code associated with the customer's address. |
| COUNTRY | This column indicates the country where the customer is located. |
| CONTACTLASTNAME | It represents the last name of the contact person associated with the customer. |
| CONTACTFIRSTNAME | This column denotes the first name of the contact person associated with the customer. |
| DEALSIZE | It indicates the size of the deal or order, categorized as "Small," "Medium," or "Large." |
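As a quick illustration of this schema (the CSV file name below is a placeholder), the sketch checks the stated SALES = QUANTITYORDERED × PRICEEACH relationship and summarizes revenue by product line:

```python
# Sketch: basic consistency check and aggregation over the sales table described above.
# The CSV file name is a placeholder.
import pandas as pd

sales = pd.read_csv("auto_sales.csv")

# SALES is documented as quantity ordered times unit price; flag rows that deviate.
expected = sales["QUANTITYORDERED"] * sales["PRICEEACH"]
mismatch = (sales["SALES"] - expected).abs() > 0.01
print(f"{mismatch.sum()} rows deviate from QUANTITYORDERED * PRICEEACH")

# Revenue by product line and deal size.
summary = (
    sales.groupby(["PRODUCTLINE", "DEALSIZE"])["SALES"]
    .sum()
    .sort_values(ascending=False)
)
print(summary.head(10))
```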
The ARS Water Data Base is a collection of precipitation and streamflow data from small agricultural watersheds in the United States. This national archive of variable time-series readings for precipitation and runoff contains sufficient detail to reconstruct storm hydrographs and hyetographs. There are currently about 14,000 station-years of data stored in the data base. Watersheds used as study areas range from 0.2 hectare (0.5 acres) to 12,400 square kilometers (4,786 square miles). Raingage networks range from one station per watershed to over 200 stations. The period of record for individual watersheds varies from 1 to 50 years. Some watersheds have been in continuous operation since the mid 1930s.

Resources in this dataset:

Resource Title: FORMAT INFORMATION FOR VARIOUS RECORD TYPES. File Name: format.txt
Resource Description: Format information identifying fields and their lengths is included in this file for all files except those ending with the extension .txt.
TYPES OF FILES: Data are stored by location number in subdirectories of the form LXX, where XX is the location number. In each subdirectory, there are various files using the following naming conventions:
Runoff data: WSXXX.zip, where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
Rainfall data: RGXXXXXX.zip, where XXXXXX is the rain gage station identification.
Maximum-minimum daily air temperature: MMTXXXXX.zip, where XXXXX is the watershed number assigned by the WDC.
Ancillary text files: NOTXXXXX.txt, where XXXXX is the watershed number assigned by the WDC. These files contain textual information including latitude-longitude, the name commonly used in literature, acreage, the most commonly associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
Topographic maps of the watersheds: MAPXXXXX.zip, where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.

Resource Title: Data Inventory - watersheds. File Name: inventor.txt
Resource Description: Watersheds at which records of runoff were being collected by the Agricultural Research Service. Variables: Study Location & Number of Rain Gages; Name; Lat.; Long.; Number; Pub. Code; Record Began; Land Use; Area (Acres); Types of Data.

Resource Title: Information about the ARS Water Database. File Name: README.txt

Resource Title: INDEX TO INFORMATION ON EXPERIMENTAL AGRICULTURAL WATERSHEDS. File Name: INDEX.TXT
Resource Description: This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base; they are so indicated in the column titled ARS Water Data Base. Other watersheds will not have data available here or through the Water Data Center. This index is particularly important since it relates watershed names to the indexing system used by the Water Data Center. Each location has been assigned a number, and the data for that location are stored in a subdirectory coded as LXX, where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed are stored in a compressed file named WSXXXXX.zip, where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information is stored in compressed files named RGXXXXXX.zip, where XXXXXX is a 6-character identification of the rain gage station. The index also provides information such as latitude-longitude for each of the watersheds, acreage, and the period of record for each acreage. Multiple entries for a particular watershed indicate either that the acreage designated for the watershed changed or that there was a break in operations of the watershed.

Resource Title: ARS Water Database files. File Name: ars_water.zip
Resource Description: USING THIS SYSTEM. Before downloading large amounts of data from the ARS Water Data Base, first review the text files included in this directory: INDEX OF ARS EXPERIMENTAL WATERSHEDS (index.txt) and FORMAT INFORMATION FOR VARIOUS RECORD TYPES (format.txt), described above, plus STATION TABLE FOR THE ARS WATER DATA BASE (station.txt), which indicates the period of record for each recording station represented in the ARS Water Data Base; the data for a particular station are stored in a single compressed file. The file and subdirectory naming conventions are the same as described for format.txt and INDEX.TXT above.
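A small, purely illustrative sketch of these naming conventions (location and watershed numbers are hypothetical, and the zero-padding widths should be checked against the actual archive, since the description above uses both WSXXX and WSXXXXX forms):

```python
# Sketch: build archive paths from the LXX / WSxxx / RGxxxxxx naming conventions
# described above. Numbers and padding widths are illustrative assumptions.
from pathlib import Path

def location_dir(location_num: int) -> Path:
    return Path(f"L{location_num:02d}")                 # e.g. L26

def runoff_zip(location_num: int, watershed_num: int) -> Path:
    return location_dir(location_num) / f"WS{watershed_num:03d}.zip"

def raingage_zip(location_num: int, gage_id: str) -> Path:
    return location_dir(location_num) / f"RG{gage_id:0>6}.zip"  # 6-character station id

print(runoff_zip(26, 7))        # e.g. L26/WS007.zip
print(raingage_zip(26, "4501")) # e.g. L26/RG004501.zip
```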
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MOT TASK
Each file refers to a session with a particular condition. The file name spells out the number of objects to track as well as the total number of objects on the screen. Within each file there is a variable called "MatriceRisultati", which contains:
Number of targets to follow
Number of correct answers
Number of trials at the condition
Percent correct responses (i.e. value_2 / value_3)

NECKLACE – DISTANCE TASK
Each file contains raw data for each session. All the data are stored in a variable called RESP, which contains parameters for each trial. The crucial columns are:
Inter-dot distance in the reference (in pixels; typically 1 pixel =~ 0.03 cm)
Inter-dot distance in the test (in pixels)
Subject choice to the question "which contains closer dots"
For analysis, one has to fit a psychometric curve (i.e. a cumulative Gaussian) to the data of column 3 as a function of inter-dot distance (column 2). A variant may include dividing column 2 by column 1 (so as to obtain the ratio between test and reference) and running the psychometric fit on this normalized dimension. A sketch of such a fit appears after this description. Further explanation of the other columns (seeds for generating the stimuli) can be obtained from the authors.

NUMEROSITY DISCRIMINATION
The relevant columns in the matrix 'a' contain the following information:
1st: Numerosity
2nd: Log10 Numerosity
3rd: Response
Further explanation of the other columns (seeds for generating the stimuli) can be obtained from the authors.

NUMEROSITY ESTIMATION
The relevant columns in the matrix 'ContengoRisultati' contain the following information:
1st: Numerosity
2nd: Response
Further explanation of the other columns (seeds for generating the stimuli) can be obtained from the authors.
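A hedged sketch of the psychometric-curve analysis described above for the necklace-distance task. It assumes RESP has been exported to a plain text array with the three crucial columns in the order listed, that the choice is coded 1 when the subject judged the test as having closer dots (coding to be verified), and it uses a rough least-squares fit rather than a binomial likelihood; the file name is hypothetical.

```python
# Sketch: fit a cumulative-Gaussian psychometric function to the necklace-distance data.
# Assumes RESP is an (n_trials x n_columns) array whose first three columns are: reference
# inter-dot distance, test inter-dot distance, and the subject's choice (coding assumed).
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(x, mu, sigma):
    """Cumulative Gaussian in x with point of subjective equality mu and slope sigma."""
    return norm.cdf((x - mu) / sigma)

RESP = np.loadtxt("necklace_session01.txt")   # hypothetical export of the RESP variable
ratio = RESP[:, 1] / RESP[:, 0]               # test / reference inter-dot distance (normalized variant)
choice = RESP[:, 2]                           # per-trial binary response (coding assumed)

(mu, sigma), _ = curve_fit(psychometric, ratio, choice, p0=[1.0, 0.2])
print(f"PSE = {mu:.3f}, slope sigma = {sigma:.3f}")
```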
This data release contains numerous comma-separated text files with data summarizing observations within and adjacent to the Woodbury Fire, which burned from 8 June to 15 July 2019. In particular, this monitoring was focused on debris flows in burned and unburned areas.

Rainfall data (Wdby_Rainfall.zip) are contained in csv files called Wdby_Rainfall for 3 rain gages named B2, B6, and Reavis. This is time-series data where the total rainfall is recorded at each timestamp. The location of each rain gage is listed as a latitude/longitude in each file.

Data from absolute (i.e. not vented) pressure transducers (Wdby_Pressure.zip), which can be used to constrain the time of passage of a flood or debris flow, are available in csv files called Wdby_Pressure for four drainages (B1, B6, Reavis 1, and Reavis 2). This is time-series data where the measured pressure in kilopascals is recorded at each timestamp. The location of each pressure transducer is listed as a latitude/longitude in each file.

Infiltration data are located in the csv file called WoodburyInfiltration.csv. The location of the measurement is listed as a latitude/longitude. Three measurement values are reported at each location: Saturated Hydraulic Conductivity (Ks) [mm/hr], Sorptivity (S) [mm/h^(1/2)], and pressure head (hf) [m]. The date of each measurement and soil burn severity class are also reported at each location, as well as a table explaining the burn-severity numerical class conversion.

Particle size analyses using laser diffraction (WoodburyLaserDiffractionSummary.zip) are located in the files called WoodburyLaserDiffractionSummary for the fine fraction (< 2 mm) of hillslope and debris flow deposits. The diameter of each particle size class is listed in the first column. All subsequent columns begin with the sample name. The value in each row is the percentage of the grain sizes in the size class. Location data for each of these samples are listed in the accompanying data table titled WoodburyParticleSizeSummary.csv.

The particle size data are summarized in the csv files (WoodburyParticleSizeSummary.zip) called WoodburyParticleSizeSummary for debris flow deposits and hillslope samples. These files group the raw data into more usable information. The sample name (Lab ID) is used to identify the laser diffraction data. The data columns (Lat) and (Lon) show the latitude and longitude of the sample locations. The total fraction of all the grain sizes, determined by sieving, is listed in three classes (Fraction < 16 mm, Fraction < 4 mm, Fraction < 2 mm). The fine fractions (< 2 mm) are also summarized in the columns (%Sand, %Silt, %Clay), as determined by laser diffraction. Samples within the burn area are identified with entries of Yes, whereas samples from unburned areas are marked No, indicating no burn. The median particle size (D50) is listed if the sample collected in the field was representative of the deposit. In some cases, large cobbles and boulders had to be removed from the sample because they were much too large to be included in sample bags that were brought back to the lab for analysis. The last column, labeled Description, contains notes about each sample.

Pebble count data (WoodburyPebbleCountsSummary.zip) are available in csv files called WoodburyPebbleCountsSummary for six drainages (U10 Fan, U10 Channel, U22 Channel, B1 Channel, B7 Fan, and U42 Fan). Here U represents unburned and B represents burned. The data name indicates whether the data come from a deposit located in a channel or a fan.
In each file the particle is numbered (Num) and the B-axis measurement of the particle is reported in centimeters. The location of each pebble count is listed as a latitude/longitude in each file. Channel width measurements for 23 channels are saved in unique shapefiles within the file called Channel_Width_Transects.zip. These width measurements were made using Digital Globe imagery from 19 October 2019. The study basins for the entire study can be found in the shapefile Woodbury_StudyBasins.shp. The attribute table, along with many morphometric and fire-related statistics for each basin, is also available in the file Woodbury_StudyBasins_Table.csv. A description of each column name in the table is available in the file Woodbury_StudyBasins_Table_descriptions.csv. Debris flow volumes were available in eleven drainage basins. The volume data are contained in the file Wdby_FlowVolume.csv in a column named Volume; the volume units are cubic meters. The other column is the Basin ID, which can be found in the shapefile Woodbury_StudyBasins.shp.
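A hedged sketch of working with one of the rainfall time series; the file name and the column names ("timestamp", "total_rainfall_mm") are hypothetical placeholders, so the actual headers in the Wdby_Rainfall csv files should be checked first.

```python
# Sketch: load one rain-gage time series from Wdby_Rainfall.zip and compute rainfall
# intensity between records. File and column names are hypothetical placeholders.
import pandas as pd

rain = pd.read_csv("Wdby_Rainfall_B6.csv", parse_dates=["timestamp"])
rain = rain.sort_values("timestamp")

# The description says total (cumulative) rainfall is recorded at each timestamp, so
# the incremental depth is the difference between successive records.
rain["depth"] = rain["total_rainfall_mm"].diff()
rain["hours"] = rain["timestamp"].diff().dt.total_seconds() / 3600.0
rain["intensity"] = rain["depth"] / rain["hours"]
print(rain[["timestamp", "depth", "intensity"]].head())
```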
i. .\File_Mapping.csv: This file relates historical reconstructed hydrology streamflow from the U.S. Army Corps of Engineers (2020) to the appropriate stochastic streamflow file for disaggregation of streamflow. Column A is an assigned ID; column B is named "Stochastic" and is the stochastic streamflow file needed for disaggregation; column C is called "RH_Ratio_Col" and is the name of the column in the reconstructed hydrology dataset associated with a stochastic streamflow file; and column D is named "Col_Num" and is the column number in the reconstructed hydrology dataset with the name given in column C.
ii. .\Original_Draw_YearDat.csv: This file contains the historical year from 1930 to 2017 with the closest total streamflow for the Souris River Basin to each year in the stochastic streamflow dataset. Column A is an index number; column B is named "V1" and is the year in a simulation; column C is called "V2" and is the stochastic simulation number; column D is an integer that can be related to historical years by adding 1929; and column E is named "year" and is the historical year with the closest total Souris River Basin streamflow volume to the associated year in the stochastic traces.
iii. .\revdrawyr.csv: This file is set up the same way as .\Original_Draw_YearDat.csv except that, when a year had over 400 occurrences, it was randomly replaced with one of the 20 other closest years. The replacement process was repeated until there were fewer than 400 occurrences of each reconstructed hydrology year associated with stochastic simulation years. Column A is an index number; column B is named "V1" and is the year in a simulation; column C is called "V2" and is the stochastic simulation number; column D is called "V3" and is the historical year whose streamflow ratios will be multiplied by stochastic streamflow; and column E is called "Stoch_yr" and is the sum of 2999 and the year in column B.
iv. .\RH_1930_2017.csv: This file contains the daily streamflow from the U.S. Army Corps of Engineers (2020) reconstructed hydrology for the Souris River Basin for the period 1930 to 2017. Column A is the date, and columns B through AA are the daily streamflow in cubic feet per second.
v. .\rhmoflow_1930Present.csv: This file was created from .\RH_1930_2017.csv and provides streamflow for each site in cubic meters for a given month. Column A is an unnamed index column; column B is the historical year; column C is the historical month associated with the historical year; column D provides a day equal to 1 but has no particular significance; and columns E through AD are the monthly streamflow volume for each site location.
vi. .\Stoch_Annual_TotVol_CubicDecameters.csv: This file contains the total volume of streamflow for each of the 26 sites for each month in the stochastic streamflow timeseries and provides a total streamflow volume divided by 100,000 on a monthly basis for the entire Souris River Basin. Column A is unnamed and contains an index number; column B is the month and is named "V1"; column C is the year in a simulation; column D is the simulation number; columns E through AD (V4 through V29) are streamflow volume in cubic meters; and column AE (V30) is the total Souris River Basin monthly streamflow volume in cubic decameters/1,000.
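As an illustration of the revdrawyr.csv structure described in item iii (column letters above map to the V1/V2/V3/Stoch_yr names; the presence of a header row is an assumption to check), a short sketch that verifies no historical year is paired with 400 or more stochastic simulation years:

```python
# Sketch: check the replacement constraint described for revdrawyr.csv, i.e. that each
# historical year ("V3") is associated with fewer than 400 stochastic simulation years.
# Assumes the file has a header row with the V1/V2/V3/Stoch_yr names; adjust if not.
import pandas as pd

revdraw = pd.read_csv("revdrawyr.csv")
counts = revdraw["V3"].value_counts()        # occurrences of each historical year
print(counts.head())
assert (counts < 400).all(), "some historical year is used 400 or more times"
```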
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains tagged accident data and is provided for reproducibility of our journal paper:
Pablo Moriano, Andy Berres, Haowen Xu, Jibonananda Sanyal. “Spatiotemporal Features of Traffic Help Reduce Automatic Accident Detection Time.” Expert Systems with Applications 244 (2024): 122813. https://doi.org/10.1016/j.eswa.2023.122813
The accompanying Data in Brief publication discusses the methodology behind the creation of these data.
Berres, Andy, Pablo Moriano, Haowen Xu, Sarah Tennille, Lee Smith, Jonathan Storey, and Jibonananda Sanyal. "A Traffic Accident Dataset for Chattanooga, Tennessee." Data in Brief (2024): 110675.
The zip folder annotatedData.zip contains two subfolders: allData and bestData. The bestData folder contains all data for which a full neighborhood of five sensors upstream and five sensors downstream is available, whereas allData includes everything from bestData as well as data with a smaller number of neighboring sensors. Each folder contains one subfolder called accidents and one subfolder called non-accidents. The accidents folder contains one file per accident. The non-accidents folder contains files for the same location, day of the week and time as a corresponding accident, for each week during which there was no accident impact on the traffic.
The file names in both folders are formatted as follows: yyyy-mm-dd-hhmm-rrrrrXaaa.a.csv, consisting of date (yyyy-mm-dd), time (hhmm in 24-hour format), and sensor name (rrrrrXaaa.a), which consists of road name (rrrrr; 5 alphanumerical characters), heading (X), and mile marker (aaa.a). For example, the file 2020-11-03-1611-00I24W182.8.csv contains data for an accident which occurred at 4:11 p.m. on November 3, 2020 on I-24 Westbound near the radar sensor at mile marker 182.8.
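A small sketch of parsing this file-name convention (the regex is written from the description above and is worth validating against a few real file names, e.g. if road names can contain lowercase characters):

```python
# Sketch: parse the yyyy-mm-dd-hhmm-rrrrrXaaa.a.csv naming convention described above.
import re

PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<time>\d{4})-"
    r"(?P<road>[0-9A-Z]{5})(?P<heading>[A-Z])(?P<mile>\d+\.\d)\.csv$"
)

m = PATTERN.match("2020-11-03-1611-00I24W182.8.csv")
print(m.groupdict())
# {'date': '2020-11-03', 'time': '1611', 'road': '00I24', 'heading': 'W', 'mile': '182.8'}
```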
The content of each CSV file is a timeseries of radar data beginning 15 minutes prior to the reported incident and ending 15 minutes after the reported incident. It also contains metadata, such as the accident type. Each CSV file contains the following columns:
incident at sensor(i): 1 for yes (accidents folder), 0 for no (non-accidents folder)
road: road name with heading, e.g. 00I24E
mile: mile marker of nearest radar sensor, e.g. 182.8
type: accident type, e.g. “Prop Damage (over)” for property damage exceeding a certain threshold. For non-accidents, the type is given as “None”.
date: date of the data sample. For accidents, this is the date on which the accident occurred. For non-accidents, this is the date for which the non-accident data sample is collected.
incident_time: time the reference accident was reported in hh:mm. This is the time which is provided in E-TRIMS as the time the 911 call was made.
incident_hour: just the hour from the incident_time, in integer format.
data_time: timestamp for the timeseries contained in the file in hh:mm:ss format. The timeseries consists of 30 second timesteps.
weather: weather during data_time, based on data collected from NASA POWER. We used dry bulb temperature (°C), precipitation (mm/h), and wind speed (m/s) from the raw NASA POWER data to produce the classifications of rain (at least 1 mm precipitation and temperatures above 2°C), snow (at least 1 mm precipitation and temperatures at or below 2°C), and wind (wind speeds over 30 mph or 13.5 m/s). If there were no inclement weather conditions, we set the category to "--". A sketch of this classification rule appears after this list.
light: light conditions during data_time. To produce this field, we collected sunrise, sunset, civil twilight start and civil twilight end times from https://sunrise-sunset.org, and derived the categories dawn, daylight, dusk, and dark using these start and end times.
The last 33 columns contain radar data for the 11 sensors surrounding the accident or non-accident. For each sensor, we collected speed (mean over the 30-second interval in miles per hour, or empty if no vehicles passed), volume (count of all vehicles passing during the 30-second interval), and occupancy (mean % occupancy over the 30-second interval). These three variables are grouped in triples of speed(k), volume(k), occupancy(k), where k indicates the sensor position relative to the closest sensor i to the incident; k < i indicates upstream sensors and k > i indicates downstream sensors. For example, speed(i-5) refers to the mean speed at the sensor which is 5 hops upstream from the accident, and volume(i+1) refers to the number of vehicles at the sensor immediately downstream from the accident.
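A sketch of the weather classification rule stated in the weather column description above. The thresholds follow the text; the function and variable names are mine, and the precedence when both precipitation and high wind occur is an assumption (precipitation wins here), since the text does not specify it.

```python
# Sketch: the rain/snow/wind/"--" classification rule described for the weather column.
# Thresholds follow the text above; names and precedence are illustrative assumptions.
def classify_weather(temp_c: float, precip_mm_per_h: float, wind_m_per_s: float) -> str:
    if precip_mm_per_h >= 1.0:
        return "rain" if temp_c > 2.0 else "snow"
    if wind_m_per_s > 13.5:          # ~30 mph
        return "wind"
    return "--"                      # no inclement weather

print(classify_weather(temp_c=5.0, precip_mm_per_h=2.3, wind_m_per_s=4.0))    # rain
print(classify_weather(temp_c=1.0, precip_mm_per_h=1.5, wind_m_per_s=4.0))    # snow
print(classify_weather(temp_c=10.0, precip_mm_per_h=0.0, wind_m_per_s=15.0))  # wind
```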
The folder metaData.zip contains the following files:
Accidents.csv: cleaned-up accidents file with all accidents which happened on Chattanooga area highways between November 1, 2020 and April 29, 2021. We have removed accidents which happened on non-highway roads, and we have corrected the timestamps (which were in 12-hour format but missing a.m./p.m. markers) by cross-referencing light and weather conditions.
WeatherDict.json: a dictionary containing the weather data synthesized from NASA POWER.
LightDict.json: a dictionary containing the light data synthesized from Sunrise-and-Sunset.
SensorTopology.csv: neighborhood information for each radar sensor in the Chattanooga area.
SensorZones.geojson: polygons used to determine the nearest radar sensor for each accident location. Each polygon is tagged with the corresponding radar sensor’s name.
As of November 2025, there were a reported 4,165 data centers in the United States, the most of any country worldwide. A further 499 were located in the United Kingdom, while 487 were located in Germany.

What is a data center?
Data centers are facilities designed to store and compute vast amounts of data efficiently and securely. Growing in importance amid the rise of cloud computing and artificial intelligence, data centers form the core infrastructure powering global digital transformation. Modern data centers consist of critical computing hardware such as servers, storage systems, and networking equipment organized into racks, alongside specialized secondary infrastructure providing power, cooling, and security.

AI data centers
Data centers are vital for artificial intelligence, with the world's leading technology companies investing vast sums in new facilities across the globe. Purpose-built AI data centers provide the immense computing power required to train the most advanced AI models, as well as to process user requests in real time, a task known as inference. Increasing attention has therefore turned to the location of these powerful facilities, as governments grow more concerned with AI sovereignty. At the same time, rapid data center expansion has sparked a global debate over resource use, including land, energy, and water, as modern facilities begin to strain local infrastructure.
The Global Ecosystem Dynamics Investigation (GEDI) mission aims to characterize ecosystem structure and dynamics to enable radically improved quantification and understanding of the Earth's carbon cycle and biodiversity. The GEDI instrument produces high-resolution laser ranging observations of the 3-dimensional structure of the Earth. GEDI is attached to the International Space Station (ISS) and collects data globally between 51.6° N and 51.6° S latitudes at the highest resolution and densest sampling of any light detection and ranging (lidar) instrument in orbit to date. Each GEDI Version 2 granule encompasses one-fourth of an ISS orbit and includes georeferenced metadata to allow for spatial querying and subsetting.

The GEDI instrument was removed from the ISS and placed into storage on March 17, 2023. No data were acquired during the hibernation period from March 17, 2023, to April 24, 2024. GEDI has since been reinstalled on the ISS and resumed operations as of April 26, 2024.

The GEDI Level 1B Geolocated Waveforms product (GEDI01_B) provides geolocated corrected and smoothed waveforms, geolocation parameters, and geophysical corrections for each laser shot for all eight GEDI beams. GEDI01_B data are created by geolocating the GEDI01_A raw waveform data. The GEDI01_B product is provided in HDF5 format and has a spatial resolution (average footprint) of 25 meters. The GEDI01_B data product contains 85 layers for each of the eight beams, including the geolocated corrected and smoothed waveform datasets and parameters and the accompanying ancillary, geolocation, and geophysical corrections. Additional information can be found in the GEDI L1B Product Data Dictionary.

Known Issues:
Data acquisition gaps: GEDI data acquisitions were suspended on December 19, 2019 (2019 Day 353) and resumed on January 8, 2020 (2020 Day 8).
Incorrect Reference Ground Track (RGT) number in the filename for select GEDI files: GEDI Science Data Products for six orbits on August 7, 2020, and November 12, 2021, had the incorrect RGT number in the filename. There is no impact to the science data, but users should reference this document for the correct RGT numbers.
Known Issues: Section 8 of the User Guide provides additional information on known issues.

Improvements/Changes from Previous Versions:
Metadata has been updated to include spatial coordinates.
Granule size has been reduced from one full ISS orbit (~7.99 GB) to four segments per orbit (~2.00 GB).
Filename has been updated to include segment number and version number.
Improved geolocation for an orbital segment.
Added elevation from the SRTM digital elevation model for comparison.
Modified the method to predict an optimum algorithm setting group per laser shot.
Added additional land cover datasets related to phenology, urban infrastructure, and water persistence.
Added selected_mode_flag dataset to root beam group using selected algorithm.
Removed shots when the laser is not firing.
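A hedged sketch of opening a GEDI01_B granule with h5py. The beam-group naming (BEAM0000 etc.), the rxwaveform / rx_sample_start_index / rx_sample_count layout, and the 1-based start index are assumptions drawn from the product family and should be verified against the GEDI L1B Product Data Dictionary; the file name is a placeholder.

```python
# Sketch: inspect a GEDI01_B HDF5 granule and pull one shot's received waveform.
# Dataset names below are assumptions to be checked against the L1B data dictionary.
import h5py
import numpy as np

with h5py.File("GEDI01_B_granule.h5", "r") as f:          # placeholder file name
    beams = [k for k in f.keys() if k.startswith("BEAM")]
    print("beam groups:", beams)

    beam = f[beams[0]]
    start = int(beam["rx_sample_start_index"][0]) - 1      # assumed 1-based indexing
    count = int(beam["rx_sample_count"][0])
    waveform = np.asarray(beam["rxwaveform"][start:start + count])
    print(f"first shot: {count} waveform samples, mean amplitude {waveform.mean():.1f}")
```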
Sentinel2GlobalLULC is a deep-learning-ready dataset of RGB images from the Sentinel-2 satellites designed for global land use and land cover (LULC) mapping. Sentinel2GlobalLULC v2.1 contains 194,877 images in GeoTiff and JPEG format corresponding to 29 broad LULC classes. Each image has 224 x 224 pixels at 10 m spatial resolution and was produced by assigning the 25th percentile of all available observations in the Sentinel-2 collection between June 2015 and October 2020 in order to remove atmospheric effects (i.e., clouds, aerosols, shadows, snow, etc.). A spatial purity value was assigned to each image based on the consensus across 15 different global LULC products available in Google Earth Engine (GEE).

Our dataset is structured into 3 main zip-compressed folders, an Excel file with a dictionary for class names and descriptive statistics per LULC class, and a python script to convert RGB GeoTiff images into JPEG format. The first folder, "Sentinel2LULC_GeoTiff.zip", contains 29 zip-compressed subfolders, each corresponding to a specific LULC class with hundreds to thousands of GeoTiff Sentinel-2 RGB images. The second folder, "Sentinel2LULC_JPEG.zip", contains 29 zip-compressed subfolders with a JPEG-formatted version of the same images provided in the first main folder. The third folder, "Sentinel2LULC_CSV.zip", includes 29 zip-compressed CSV files with as many rows as provided images and 12 columns containing the following metadata (this same metadata is provided in the image filenames):

Land Cover Class ID: the identification number of each LULC class
Land Cover Class Short Name: the short name of each LULC class
Image ID: the identification number of each image within its corresponding LULC class
Pixel purity Value: the spatial purity of each pixel for its corresponding LULC class, calculated as the spatial consensus across up to 15 land-cover products
GHM Value: the spatial average of the Global Human Modification index (gHM) for each image
Latitude: the latitude of the center point of each image
Longitude: the longitude of the center point of each image
Country Code: the Alpha-2 country code of each image as described in the ISO 3166 international standard. To understand the country codes, we recommend visiting the following website, which lists the Alpha-2 code for each country as described in the ISO 3166 international standard: https://www.iban.com/country-codes
Administrative Department Level1: the administrative level 1 name to which each image belongs
Administrative Department Level2: the administrative level 2 name to which each image belongs
Locality: the name of the locality to which each image belongs
Number of S2 images: the number of instances found in the corresponding Sentinel-2 image collection between June 2015 and October 2020 when compositing and exporting the corresponding image tile

For seven LULC classes, we could not export from GEE all images that fulfilled a spatial purity of 100%, since there were millions of them. In this case, we exported a stratified random sample of 14,000 images and provided an additional CSV file with the images actually contained in our dataset. That is, for these seven LULC classes, we provide these 2 CSV files:
A CSV file that contains all exported images for this class
A CSV file that contains all images available for this class at a spatial purity of 100%, both the ones exported and the ones not exported, in case the user wants to export them. These CSV filenames end with "including_non_downloaded_images".

To clearly state the geographical coverage of images available in this dataset, we included in version v2.1 a compressed folder called "Geographic_Representativeness.zip". This zip-compressed folder contains a csv file for each LULC class that provides the complete list of countries represented in that class. Each csv file has two columns: the first gives the country code and the second gives the number of images provided in that country for that LULC class. In addition to these 29 csv files, we provide another csv file that maps each ISO Alpha-2 country code to its original full country name.

© Sentinel2GlobalLULC Dataset by Yassir Benhammou, Domingo Alcaraz-Segura, Emilio Guirado, Rohaifa Khaldi, Boujemâa Achchab, Francisco Herrera & Siham Tabik is marked with Attribution 4.0 International (CC-BY 4.0)
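A hedged sketch of using one of the per-class metadata CSV files; the header spellings follow the field list above but should be verified against the actual files, and the path is a placeholder.

```python
# Sketch: load one per-class metadata CSV from Sentinel2LULC_CSV.zip and keep only
# highly pure images. Header names follow the field list above but should be verified.
import pandas as pd

meta = pd.read_csv("Sentinel2LULC_CSV/class_01.csv")       # placeholder path

pure = meta[meta["Pixel purity Value"] >= 95]              # e.g. keep >= 95% spatial purity
by_country = pure.groupby("Country Code").size().sort_values(ascending=False)
print(f"{len(pure)} high-purity images across {by_country.size} countries")
print(by_country.head())
```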
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
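A minimal sketch of the two country-detection approaches described above, i.e. a regular-expression search over a fixed country-name list plus spaCy named entity recognition. The country list here is truncated for illustration and the spaCy model name is the standard small English model, not necessarily the one used by the authors; in practice, GPE entities would still need to be mapped onto country names.

```python
# Sketch: detect countries of study in an abstract via (1) regex over a fixed name list
# and (2) spaCy named entity recognition (GPE entities). Illustrative only.
import re
import spacy

COUNTRY_NAMES = ["Kenya", "India", "Brazil", "United States"]   # truncated ISO 3166 list
country_pattern = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRY_NAMES)) + r")\b")

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def countries_of_study(text: str) -> set[str]:
    found = set(country_pattern.findall(text))                      # approach 1: regex
    doc = nlp(text)
    found |= {ent.text for ent in doc.ents if ent.label_ == "GPE"}  # approach 2: NER
    return found

abstract = "We analyse household survey data from Kenya and Brazil collected in 2015."
print(countries_of_study(abstract))
```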
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to MTurk workers).
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
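A hedged sketch of the kind of DistilBERT-based classifier described above, using the Hugging Face transformers wrapper around PyTorch; the checkpoint, label convention, and threshold handling are placeholders rather than the authors' exact configuration.

```python
# Sketch: encode an abstract with DistilBERT and score it with a binary "uses data"
# classification head. The head here is untrained; in the study it would be fine-tuned
# on the hand-labeled abstracts before its probabilities are meaningful.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2      # labels: 0 = no data, 1 = uses data
)
model.eval()

abstract = "Using a nationally representative household survey, we estimate ..."
inputs = tokenizer(abstract, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

uses_data_prob = probs[0, 1].item()
print(f"P(uses data) = {uses_data_prob:.2f}  ->  label: {int(uses_data_prob >= 0.9)}")
```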
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model perform the same kind of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). In case that serious flaws are detected in this early release (called ERA5T), this data could be different from the final release 2 to 3 months later. So far this has only been the case for the month September 2021, while it will also be the case for October, November and December 2021. For months prior to September 2021 the final release has always been equal to ERA5T, and the goal is to align the two again after December 2021. ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). In case that serious flaws are detected in this early release (called ERA5T), this data could be different from the final release 2 to 3 months later. In case that this occurs users are notified. The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main sub sets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 monthly mean data on pressure levels from 1940 to present".
Discover the ultimate resource for your B2B needs with our meticulously curated dataset, featuring 148MM+ highly relevant US B2B Contact Data records and associated company information.
Very high fill rates for Phone Number, including for Mobile Phone!
This encompasses a diverse range of fields, including Contact Name (First & Last), Work Address, Work Email, Personal Email, Mobile Phone, Direct-Dial Work Phone, Job Title, Job Function, Job Level, LinkedIn URL, Company Name, Domain, Email Domain, HQ Address, Employee Size, Revenue Size, Industry, NAICS and SIC Codes + Descriptions, ensuring you have the most detailed insights for your business endeavors.
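To make the record layout concrete, the sketch below filters a hypothetical CSV export of these contact records by job level and industry with pandas. The file name and exact column headers are assumptions based on the fields listed above; a real export may name them differently.

```python
# Illustrative only: load a hypothetical CSV export of the contact records and
# narrow it down to director-level contacts in a target industry.
# Column names are assumed from the field list above, not from the vendor's schema.
import pandas as pd

contacts = pd.read_csv("b2b_contacts_sample.csv")  # hypothetical file name

prospects = contacts[
    (contacts["Job Level"] == "Director")
    & (contacts["Industry"] == "Computer Software")
    & contacts["Work Email"].notna()
]

# Keep only the columns needed for an outreach list.
outreach = prospects[["First Name", "Last Name", "Job Title", "Company Name",
                      "Work Email", "Direct-Dial Work Phone", "LinkedIn URL"]]
outreach.to_csv("director_prospects.csv", index=False)
```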
Key Features:
Extensive Data Coverage: Access a vast pool of B2B Contact Data records, providing valuable information on where the contacts work now, empowering your sales, marketing, recruiting, and research efforts.
Versatile Applications: Leverage this robust dataset for Sales Prospecting, Lead Generation, Marketing Campaigns, Recruiting initiatives, Identity Resolution, Analytics, Research, and more.
Phone Number Data Inclusion: Benefit from our comprehensive Phone Number Data, ensuring you have direct and effective communication channels. Explore our Phone Number Datasets and Phone Number Databases for an even more enriched experience.
Flexible Pricing Models: Tailor your investment to match your unique business needs, data use-cases, and specific requirements. Choose from targeted lists, CSV enrichment, or licensing our entire database or subsets to seamlessly integrate this data into your products, platform, or service offerings.
Strategic Utilization of B2B Intelligence:
Sales Prospecting: Identify and engage with the right decision-makers to drive your sales initiatives.
Lead Generation: Generate high-quality leads with precise targeting based on specific criteria.
Marketing Campaigns: Amplify your marketing strategies by reaching the right audience with targeted campaigns.
Recruiting: Streamline your recruitment efforts by connecting with qualified candidates.
Identity Resolution: Enhance your data quality and accuracy by resolving identities with our reliable dataset.
Analytics and Research: Fuel your analytics and research endeavors with comprehensive and up-to-date B2B insights.
Access Your Tailored B2B Data Solution:
Reach out to us today to explore flexible pricing options and discover how Salutary Data Company Data, B2B Contact Data, B2B Marketing Data, B2B Email Data, Phone Number Data, Phone Number Datasets, and Phone Number Databases can transform your business strategies. Elevate your decision-making with top-notch B2B intelligence.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains five Unicode Origin Graph (.opju) files, which can be opened with the Origin 2021 software. Details of the five data files are as follows:
1. Fig.2-data.opju contains the original information from Figure 2 in the associated article. It holds two workbooks: one with the density distributions and error-band information for the training set, the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set and containing the density and error-band information for that nucleus. Each table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the charge density distribution predicted by FNN-3I; the error-band value of the charge density predicted by FNN-3I; the charge density distribution predicted by FNN-4I; the error-band value of the charge density predicted by FNN-4I; and the experimental charge density distribution (2pF model). Each row gives the density and error-band values at the corresponding r. The procedure used to obtain these data is described in the associated article.
2. Fig.3-data.opju contains the original information from Figure 3 in the associated article. It holds two workbooks: one with the density distributions for the training set, the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set. Each table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the average charge density obtained by the density-averaging method; the average charge density obtained by the parameter-averaging method; the charge density from the network with the smallest loss function; the charge density from the network with the largest loss function; and the experimental charge density distribution (2pF model). Each row gives the density values at the corresponding r. The procedure used to obtain these data is described in the associated article.
3. Fig.4-data.opju contains the original information from Figure 4 in the associated article. It holds two workbooks: one with the density distributions and error-band information for the training set, the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set. Each table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the average charge density obtained by the density-averaging method; the error-band value obtained by the density-averaging method; the average charge density obtained by the parameter-averaging method; the error-band value obtained by the parameter-averaging method; and the experimental charge density distribution (2pF model). Each row gives the density and error-band values at the corresponding r. The procedure used to obtain these data is described in the associated article.
4. Fig.5 6 7-data.opju contains the original information from Figures 5, 6 and 7 in the associated article. It consists of two tables, one with the training results for the training set and the other with the prediction results for the prediction set. The training-set table has 15 columns and 86 rows, where the columns are: proton number; neutron number; the experimental value of parameter c; the experimental value of parameter z; the predicted value of parameter c; the predicted value of parameter z; the experimental charge radius R (2pF model); the predicted charge radius R; the difference between the experimental and predicted charge radius R; the experimental second moment of charge (2pF model); the predicted second moment of charge; the difference of the second moment of charge; the experimental fourth moment of charge (2pF model); the predicted fourth moment of charge; and the difference of the fourth moment of charge. Each row holds the data of one nucleus. The prediction-set table has 7 columns and 284 rows, each row containing the data of a single nucleus. The seven columns are: proton number; neutron number; mass number; the predicted value of parameter c; the predicted value of parameter z; the predicted charge radius R; and the experimental charge radius R.
5. Fig.8-data.opju contains the original information from Figure 8 in the associated article. It has a single workbook with several tables, each containing the information for one calcium isotope and named after that isotope. Each table has 3 columns and 1000 rows: the radial coordinate r, in fm; the charge density value; and the charge density error-band value. Each row corresponds to one plotted coordinate point and its error band.
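Since the tables repeatedly refer to the 2pF (two-parameter Fermi) model together with parameters c and z, the charge radius R and the second and fourth moments of charge, the sketch below shows how these quantities are conventionally computed from c and z. The parameterization follows the standard 2pF form and the numerical values are illustrative placeholders, not taken from the dataset itself.

```python
# Illustrative sketch of the standard two-parameter Fermi (2pF) charge density,
#   rho(r) = rho0 / (1 + exp((r - c) / z)),
# and of the radial moments <r^2> and <r^4> referred to in the Fig.5/6/7 tables.
# The parameter values below are placeholders, not values from the dataset.
import numpy as np
from scipy.integrate import quad

def pf2_density(r, c, z):
    """Unnormalized 2pF density profile (r, c, z in fm)."""
    return 1.0 / (1.0 + np.exp((r - c) / z))

def radial_moment(n, c, z):
    """<r^n> = integral of r^n * rho(r) * r^2 dr, divided by the normalization integral."""
    num, _ = quad(lambda r: r**n * pf2_density(r, c, z) * r**2, 0.0, 30.0)
    den, _ = quad(lambda r: pf2_density(r, c, z) * r**2, 0.0, 30.0)
    return num / den

c, z = 3.6, 0.52               # placeholder 2pF parameters in fm
r2 = radial_moment(2, c, z)    # second moment of charge <r^2>
r4 = radial_moment(4, c, z)    # fourth moment of charge <r^4>
R = np.sqrt(r2)                # rms charge radius R = sqrt(<r^2>)
print(f"R = {R:.3f} fm, <r^2> = {r2:.3f} fm^2, <r^4> = {r4:.3f} fm^4")
```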