These are tables used to process the loads of Gulf shrimp data. They include pre-validation tables, error tables, and statistics on data loads; they contain no data tables and no code tables, and this information need not be published. The data set contains catch (landed catch) and effort for fishing trips made by the larger vessels that fish near and offshore for the various species of shrimp in the Gulf of Mexico. The data set also contains landings by the smaller boats that fish in the bays, lakes, bayous, and rivers for saltwater shrimp species; however, these landings data may be aggregated across multiple trips and may not provide effort data comparable to the data for the larger vessels. The landings statistics in this data set consist of the quantity and value for the individual species of shrimp by size category, type and quantity of gear, fishing duration, and fishing area. The data collection procedures for the catch/effort data for the large vessels consist of two parts: the landings statistics are collected from the seafood dealers after the trips are unloaded, whereas the data on fishing effort and area are collected by interviews with the captain or crew while the trip is being unloaded.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The Merkel Cell Carcinoma Patient Registry (MCCPR) does not host MRNs/names, and eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional enterprise data warehouses (EDWs) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary; thus, users may need to adapt the initial data-wrangling script to the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
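eLAB itself implements this remapping in R; purely as an illustration of the key-value lookup idea, a minimal Python sketch (the DD code name and example rows are hypothetical, not taken from the registry):

```python
import pandas as pd

# Key-value lookup mapping reported lab subtypes to a Data Dictionary (DD) code.
# Using "potassium" as the DD code is a hypothetical example.
lab_lookup = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium,venous": "potassium",
}

# Mock bulk pull with one subtype that is not defined in the lookup table.
labs = pd.DataFrame({
    "lab_name": ["Potassium(POC)", "Potassium,venous", "Potassium-Unknown-Subtype"],
    "value": [4.1, 3.9, 4.4],
})

# Remap each subtype to its DD code and keep only DD-defined labs,
# mirroring how eLAB filters to the labs/units pre-defined by the registry DD.
labs["dd_code"] = labs["lab_name"].map(lab_lookup)
labs = labs.dropna(subset=["dd_code"])
print(labs)
```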
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
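The published analysis was run with the R survival/survminer packages; as an equivalent illustration in Python, a minimal univariable Cox sketch using lifelines (the file name and column names are hypothetical):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical columns: 'os_months' (time from MCC diagnosis to death or last
# follow-up), 'event' (1 = death, 0 = censored), plus baseline lab values.
df = pd.read_csv("baseline_labs_os.csv")  # hypothetical file name

# One univariable Cox model per lab predictor, mirroring the described analysis.
results = []
for lab in ["albumin", "ldh", "wbc"]:  # hypothetical lab column names
    cph = CoxPHFitter()
    cph.fit(df[["os_months", "event", lab]], duration_col="os_months", event_col="event")
    results.append(cph.summary.loc[lab, ["exp(coef)", "p"]])

# Hazard ratios and (exploratory, uncorrected) p-values for each lab.
print(pd.concat(results, axis=1).T)
```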
Community Data License Agreement - Sharing 1.0: https://cdla.io/sharing-1-0/
Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale
This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.
chroma_tensor: A JSON-safe (stringified) representation of a PyTorch tensor with shape [1, 12, T], where:
- 12 = the 12 pitch classes (C, C#, D, ... B)
- T = time steps

scale_index: An integer label from 0–23 identifying the scale the sample belongs to.

This dataset is ideal for:
- Training deep learning models (CNNs, MLPs) to classify musical scales
- Exploring pitch-class distributions in Western tonal music
- Prototyping models for music key detection, chord prediction, or tonal analysis
- Teaching or demonstrating chromagram-based ML workflows
| Index | Scale |
|---|---|
| 0 | C major |
| 1 | C# major |
| ... | ... |
| 11 | B major |
| 12 | C minor |
| ... | ... |
| 23 | B minor |
Chroma tensors are of shape [1, 12, T], where:
- 1 is the channel dimension (for CNN input)
- 12 represents the 12 pitch classes (C through B)
- T is the number of time frames
import torch
import pandas as pd
from tqdm import tqdm
df = pd.read_csv("/content/scale_dataset.csv")
# Reconstruct chroma tensors
X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
y = df['scale_index'].tolist()
Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.
import torch
import pandas as pd
data = torch.load("chroma_tensors.pt")
X_pt = data['X'] # list of [1, 12, 302] tensors
y_pt = data['y'] # list of scale indices
Generation tools: music21, FluidSynth, librosa.feature.chroma_stft

| Column | Type | Description |
|---|---|---|
| chroma_tensor | str | Flattened 1D chroma tensor [1×12×T] |
| scale_index | int | Label from 0 to 23 |

All tensors share a fixed number of time frames (T) for easy batching.
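As a quick illustration of the scale-classification use case, a minimal PyTorch sketch that consumes the [1, 12, T] tensors loaded above (the architecture is illustrative, not part of the dataset; X_pt comes from the .pt loading snippet):

```python
import torch
import torch.nn as nn

class ScaleClassifier(nn.Module):
    """Tiny CNN over chroma tensors shaped [batch, 1, 12, T]."""
    def __init__(self, n_classes: int = 24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 5), padding=(1, 2)),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=(3, 5), padding=(1, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),  # average over pitch classes and time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = ScaleClassifier()
batch = torch.stack([t.float() for t in X_pt[:8]])  # X_pt from the .pt snippet above
logits = model(batch)                                # shape: [8, 24]
```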
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract
This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
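As an illustration of how such questions can be answered with the stack described here, a minimal sketch using Pandas and SQLAlchemy; the connection string is a placeholder, and the table and column names assume the standard classicmodels schema:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust user, password, host, and database name.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

# Revenue per product, assuming the standard classicmodels orderdetails/products tables.
query = """
    SELECT p.productname,
           SUM(od.quantityordered * od.priceeach) AS revenue
    FROM orderdetails od
    JOIN products p ON p.productcode = od.productcode
    GROUP BY p.productname
    ORDER BY revenue DESC
    LIMIT 10;
"""
top_products = pd.read_sql(query, engine)
print(top_products)
```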
Methodology
1. Data Extraction:
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
Results
- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

Conclusions
This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
MIT License: https://spdx.org/licenses/MIT.html
EchoTables is an innovative accessibility tool developed as part of the IKILeUS project at the University of Stuttgart. It is designed to improve the usability of tabular data for visually impaired users by converting structured tables into concise, auditory-friendly textual summaries. Traditional screen readers navigate tables linearly, which imposes a high cognitive load on users. EchoTables alleviates this issue by summarizing tables, facilitating quicker comprehension and more efficient information retrieval. Initially utilizing RUCAIBox (LLM), EchoTables transitioned to Mistral-7B, a more powerful open-source model, to enhance processing efficiency and scalability. The tool has been tested with widely used screen readers such as VoiceOver to ensure accessibility. EchoTables has been adapted to process diverse data sources, including lecture materials, assignments, and WikiTables, making it a valuable resource for students navigating complex datasets.
SPAN-E Level 2 Electron Energy Spectra Data

File Naming Format: psp_swp_spb_sf1_L2_32E_YYYYMMDD_v01.cdf

The SF1 product is an energy spectrum produced on the spacecraft by summing over the Theta and Phi directions. The units are differential energy flux and eV. The sample filename above includes 32 energies. The larger Theta angles (deflection angles) are artificially enhanced in the "sf1" energy spectra data products due to the method of spectra production on the SPAN-E instrument (straight summing). Thus, SF1 energy spectra are not recommended for rigid statistical analysis.

Parker Solar Probe SWEAP Solar Probe Analyzer (SPAN) Electron Data Release Notes

November 19, 2019 Initial Data Release

Overview of Measurements
The SWEAP team is pleased to release the data from Encounter 1 and Encounter 2. The files contain data from the time range October 31, 2018 - June 18, 2019. The prime mission of Parker Solar Probe is to take data when within 0.25 AU of the Sun during its orbit. However, there have been some extended campaign measurements outside of this distance. The data are available for those days that are within 0.25 AU as well as those days when the instruments were operational outside of 0.25 AU. Each SWEAP data file includes a set of a particular type of measurements over a single observing day. Measurements are provided in Common Data Format (CDF), a self-documenting data framework for which convenient open source tools exist across most scientific computing platforms. Users are strongly encouraged to consult the global metadata in each file, and the metadata that are linked to each variable. The metadata include comprehensive listings of relevant information, including units, coordinate systems, qualitative descriptions, measurement uncertainties, methodologies, links to further documentation, and so forth.

SPAN-E Level 2 Version 01 Release Notes
The SPAN-Ae and SPAN-B instruments together have fields of view covering >90% of the sky; major obstructions to the FOV include the spacecraft heat shield and other intrusions by spacecraft components. Each individual SPAN-E has a FOV of ±60° in Theta and 240° in Phi. The rotation matrices to convert into the spacecraft frame can be found in the individual CDF files, or in the instrument paper. This data set covers all periods for which the instrument was turned on and taking data in the solar wind in ion mode. This includes maneuvers affecting the spacecraft attitude and orientation. Measurements taken by SPAN-B when the spacecraft is pointed away from the sun are taken in sunlight. The data quality flags for the SPAN data can be found in the CDF files as: QUALITY_FLAG (0=good, 1=bad).

General Remarks for Version 01 Data
Users interested in field-aligned electrons should take care regarding potential blockages from the heat shield when B is near radial, especially in SPAN-Ae. Artificial reductions in strahl width can result. Due to the relatively high electron temperature in the inner heliosphere, many secondary electrons are generated from spacecraft and instrument surfaces. As a result, electron measurements in this release below 30 eV are not advised for scientific analysis. The fields of view in SPAN-Ae and SPAN-B have many intrusions by the spacecraft, and erroneous pixels discovered in analysis, in particular near the edges of the FOV, should be viewed with skepticism. Details on FOV intrusion are found in the instrument paper, forthcoming, or by contacting the SPAN-E instrument scientist. The instrument mechanical attenuators are engaged during the eight days around perihelia 1 and 2, which results in a factor of about 10 reduction of the total electron flux into the instrument. During these eight days, halo electron measurements are artificially enhanced in the L2 products as a result of the reduced instrument geometric factor and subsequent ground corrections. A general note for Encounter 1 and Encounter 2 data: a miscalculation in the deflection tables loaded to both SPAN-Ae and SPAN-B resulted in over-deflection of the outermost Theta angles during these encounters. As such, pixels at large Thetas should be ignored. This error was corrected by a table upload prior to Encounter 3. Lastly, when viewing time gaps in the SPAN-E measurements, be advised that the first data point produced by the instrument after a power-on is the maximum value permitted by internal instrument counters. Therefore, the first data point after powerup is erroneous and should be discarded, as indicated by quality flags.

SPAN-E Encounter 1 Remarks
SPAN-E operated nominally for the majority of the first encounter. Exceptions to this include a few instances of corrupted, higher-energy sweep tables, and an instrument commanding error for the two hours surrounding perihelion 1. These and other instrument diagnostic tests are indicated with the QUALITY_FLAG variable in the CDFs. The mechanical attenuator was engaged for the 8 days around perihelion 1: as a result, the microchannel plate (MCP) noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution with this data release if looking for halo electrons when the mechanical attenuator is engaged.

SPAN-E Cruise Phase Remarks
The cruise mode rates of SPAN-E are greatly reduced compared to the encounter mode rates. When the PSP spacecraft is in a communications slew, the SPAN-B instrument occasionally reaches its maximum allowable operating temperature and is powered off by SWEM. Timing for the SF1 products in cruise phase is not corrected in v01, and thus it is not advised to use the data at this time for scientific analysis. The typical return of SF0 products is one spectrum out of every 32 survey spectra, returned every 15 minutes or so. One out of every four 27.75 s SF1 spectra is produced every 111 s.

SPAN-E Encounter 2 Remarks
SPAN-E operated nominally for the majority of the second encounter. Exceptions include instrument diagnostic and health checks and a few instances of corrupted high-energy sweep tables. These tests and corrupted table loads are indicated with the QUALITY_FLAG parameter. The mechanical attenuator was engaged for the 8 days around perihelion 2: as a result, the MCP noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution in this data release if looking for halo electrons when the mechanical attenuator is engaged.

Parker Solar Probe SWEAP Rules of the Road
As part of the development of collaboration with the broader Heliophysics community, the mission has drafted a "Rules of the Road" to govern how PSP instrument data are to be used.
1) Users should consult with the PI to discuss the appropriate use of instrument data or model results and to ensure that the users are accessing the most recently available versions of the data and of the analysis routines. Instrument team Science Operations Centers (SOCs) and/or Virtual Observatories (VOs) should facilitate this process, serving as the contact point between PI and users in most cases.
2) Users should heed the caveats of investigators to the interpretations and limitations of data or model results. Investigators supplying data or models may insist that such caveats be published. Data and model version numbers should also be specified.
3) Browse products, Quicklook, and Planning data are not intended for science analysis or publication and should not be used for those purposes without consent of the PI.
4) Users should acknowledge the sources of data used in all publications, presentations, and reports: "We acknowledge the NASA Parker Solar Probe Mission and the SWEAP team led by J. Kasper for use of data."
5) Users are encouraged to provide the PI a copy of each manuscript that uses the PI data prior to submission of that manuscript for consideration of publication. On publication, the citation should be transmitted to the PI and any other providers of data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 30,000 YouTube video analytics records, created to simulate realistic YouTube Studio performance data from the last 12 months. It provides per-video metrics such as impressions, click-through rate (CTR), average view duration, watch time, likes, comments, and traffic sources.
This dataset is useful for:
- YouTube trend analysis
- Predictive modeling
- Engagement analysis
- Audience retention studies
- Recommender systems
- Machine learning and EDA
- Content performance optimization
All upload dates fall within the previous 365 days, making the dataset aligned with recent YouTube trends.
COLUMN DESCRIPTIONS
- Post_ID – Unique video ID used to join with other tables.
- Upload_Date – Video upload date within the last 1 year.
- Video_Duration_Min – Total length of the video in minutes.
- Avg_View_Duration_Sec – Average watch time per viewer.
- Avg_View_Percentage – Percentage of the video that users watched.
- Subscribers_Gained – Number of subscribers gained from this video.
- Traffic_Source – How viewers discovered the video (Search, Suggested, Browse, External, etc.).
- CTR_Percentage – Click-through rate of the thumbnail impressions.
- Impressions – How many users saw the video thumbnail across YouTube surfaces.
- Likes – Total number of likes received.
- Comments – Number of comments posted.
- Shares – Number of times the video was shared.
- Total_Watch_Time_Hours – Total accumulated watch time in hours (a critical YouTube ranking signal).
WHY THIS DATASET MATTERS
YouTube’s recommendation system prioritizes:
- high watch time
- high CTR
- strong audience retention
- strong engagement (likes, comments, shares)

This dataset includes all of these metrics, allowing deep analysis of:
- what makes videos perform well
- which traffic sources are strongest
- how video length affects watch time
- how engagement influences discoverability
- seasonal or monthly patterns in video performance
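As a starting point for that kind of analysis, a minimal Pandas sketch (the CSV file name is hypothetical; the column names follow the descriptions above):

```python
import pandas as pd

# Hypothetical file name; columns follow the column descriptions above.
df = pd.read_csv("youtube_video_analytics.csv", parse_dates=["Upload_Date"])

# Which traffic sources are strongest? Compare mean CTR and watch time by source.
by_source = (
    df.groupby("Traffic_Source")[["CTR_Percentage", "Total_Watch_Time_Hours"]]
      .mean()
      .sort_values("Total_Watch_Time_Hours", ascending=False)
)
print(by_source)

# How does video length relate to accumulated watch time?
print(df[["Video_Duration_Min", "Total_Watch_Time_Hours"]].corr())
```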
Passive acoustic monitoring (PAM) offers the potential to dramatically increase the scale and robustness of species monitoring in rainforest ecosystems. PAM generates large volumes of data that require automated methods of target species detection. Species-specific recognisers, which often use supervised machine learning, can achieve this goal. However, they require a large training dataset of both target and non-target signals, which is time-consuming and challenging to create. Unfortunately, very little information about creating training datasets for supervised machine learning recognisers is available, especially for tropical ecosystems. Here we show an iterative approach to creating a training dataset that improved recogniser precision from 0.12 to 0.55. By sampling background noise using an initial small recogniser, we addressed one of the significant challenges of training dataset creation in acoustically diverse environments. Our work demonstrates that recognisers will likely f...

Raw data used to create this dataset was collected from autonomous recording units in northern Costa Rica. A template-matching process was used to identify candidate signals, then a one-second window was put around each candidate signal. We extracted a total of 113 acoustic features using the warbleR package in R (R Core Team, 2020): 20 measurements of frequency, time, and amplitude parameters, and 93 Mel-frequency cepstral coefficients (MFCCs) (Araya-Salas and Smith-Vidaurre, 2017). This dataset also includes the results of manually checking detections that were the output of a trained random forest. These were initially output as selection tables; individual sound files were loaded in Raven Lite, selection tables were loaded, and each detection was manually checked and labelled. There is also the random forest model, which is a .rds format model created using tidymodels in R.

Following the code associated with this data will require R; the outputs from the machine learning require Raven Lite to open. The raw recordings are not included in this dataset.
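The published model was built with tidymodels in R; purely as an illustration of the same supervised workflow, a minimal scikit-learn sketch (file and column names are hypothetical, and labels are assumed to be binary 0/1):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

# Hypothetical layout: 113 acoustic feature columns plus a 'label' column
# marking target (1) vs. non-target (0) signals.
df = pd.read_csv("acoustic_features.csv")  # hypothetical file name
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Precision is the metric the study tracked while iterating on the training set.
print("precision:", precision_score(y_test, clf.predict(X_test), pos_label=1))
```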
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For more details about the dataset and its applications, please refer to our GitHub repository.
The Point-DeepONet Dataset is meticulously curated to support advanced research in nonlinear structural analysis and operator learning, specifically tailored for applications in structural mechanics of jet engine brackets. This dataset encompasses a diverse collection of non-parametric three-dimensional (3D) geometries subjected to varying load conditions—vertical, horizontal, and diagonal. It includes high-fidelity simulation results, such as displacement fields and von Mises stress distributions, derived from nonlinear finite element analyses (FEA).
Key Features:
This dataset was utilized to develop and train the Point-DeepONet model, which integrates PointNet within the DeepONet framework to achieve rapid and accurate predictions for structural analyses. By leveraging this dataset, researchers can explore operator-learning techniques, optimize design processes, and enhance decision-making in complex engineering workflows.
We utilize the DeepJEB dataset [1], a synthetic dataset specifically designed for 3D deep learning applications in structural mechanics, focusing on jet engine brackets. This dataset includes various bracket geometries subjected to different load cases—vertical, horizontal, and diagonal—providing a diverse range of scenarios to train and evaluate deep learning models for predicting field values. While the original DeepJEB dataset offers solutions from linear static analyses, in this study we extend its applicability by performing our own nonlinear static finite element analyses to predict displacement fields ($u_x$, $u_y$, $u_z$) and von Mises stress under varying geometric and loading conditions.
Finite element analyses (FEA) are conducted using Altair OptiStruct [2] to simulate the structural response under nonlinear static conditions. Each bracket geometry is discretized using second-order tetrahedral elements with an average element size of 2 mm, enhancing the precision of the displacement and stress predictions. The material properties for the brackets are based on Ti–6Al–4V, specified with a density of $4.47 \times 10^{-3}$ g/mm³, a Young's modulus ($E$) of 113.8 GPa, and a Poisson’s ratio ($\nu$) of 0.342, representing realistic behavior under the applied loads.
An elastic–plastic material model with linear isotropic hardening is employed to capture the nonlinear response, characterized by a yield stress of 227.6 MPa and a hardening modulus of 355.56 MPa. The nonlinear analysis settings include a maximum iteration limit of 10 and a convergence tolerance of 1%, ensuring accurate simulation of the structural response to complex loading conditions.
Figure panels: bracket geometry and load direction; bolted and loaded interfaces; boundary conditions and constraints.
The dataset comprises a range of jet engine bracket geometries with varying structural properties and masses. The node counts range from 127,634 to 380,714, and the mass spans from 0.56 kg to 2.41 kg, ensuring a diverse set of structural complexities and weights.
| Metric | Minimum | Maximum | Average |
|---|---|---|---|
| Number of nodes | 127,634 | 380,714 | 209,974 |
| Number of edges | 468,708 | 1,453,872 | 787,658 |
| Number of cells | 78,118 | 242,312 | 131,276 |
| Mass (kg) | 0.56 | 2.41 | 1.23 |
To facilitate effective model training and evaluation, the dataset was divided into training and validation subsets, with 80% allocated for training...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the accumulation of large amounts of health related data, predictive analytics could stimulate the transformation of reactive medicine towards Predictive, Preventive and Personalized (PPPM) Medicine, ultimately affecting both cost and quality of care. However, the high dimensionality and high complexity of the data involved prevent data-driven methods from easy translation into clinically relevant models. Additionally, the application of cutting-edge predictive methods and data manipulation requires substantial programming skills, limiting their direct exploitation by medical domain experts. This leaves a gap between potential and actual data usage. In this study, the authors address this problem by focusing on open, visual environments, suited to be applied by the medical community. Moreover, we review code-free applications of big data technologies. As a showcase, a framework was developed for the meaningful use of data from critical care patients by integrating the MIMIC-II database in a data mining environment (RapidMiner) supporting scalable predictive analytics using visual tools (RapidMiner’s Radoop extension). Guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM), the ETL process (Extract, Transform, Load) was initiated by retrieving data from the MIMIC-II tables of interest. As a use case, the correlation of platelet count and ICU survival was quantitatively assessed. Using visual tools for ETL on Hadoop and predictive modeling in RapidMiner, we developed robust processes for automatic building, parameter optimization and evaluation of various predictive models, under different feature selection schemes. Because these processes can be easily adopted in other projects, this environment is attractive for scalable predictive analytics in health research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159
For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on Github, and the associated documentation on Read The Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl python package.
Included Data Packages
This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.
- pudl-eia860-eia923:
- pudl-eia860-eia923-epacems: contains everything in the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records.
- pudl-ferc1:

Each data package was generated using the catalystcoop.pudl Python package and the original source data files archived as part of this data release.

Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. You can also:
Using the Data
The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.
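If you just want to inspect the raw files without a database, the metadata and CSVs can be read directly; a minimal sketch, assuming one of the data packages has already been extracted (see the tar commands further down):

```python
import json
import pandas as pd

# Path is illustrative; point it at an extracted data package directory.
pkg_dir = "datapkg/pudl-data-release/pudl-ferc1/"
with open(pkg_dir + "datapackage.json") as f:
    metadata = json.load(f)

# Frictionless data packages list their CSV resources under "resources".
for resource in metadata["resources"][:5]:
    print(resource["name"], "->", resource["path"])

# Load one resource with pandas (pandas handles gzip-compressed CSVs transparently).
df = pd.read_csv(pkg_dir + metadata["resources"][0]["path"])
print(df.head())
```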
Make sure conda is installed
None of these commands will work without the conda Python package manager installed, either via Anaconda or miniconda:
Download the data
First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them.
- pudl-input-data.tgz
- pudl-eia860-eia923-epacems.tgz

Load All of PUDL in a Single Line
Use cd to get into your new directory at the terminal (in Linux or Mac OS), or open up an Anaconda terminal in that directory if you're on Windows.
If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh:
bash load-pudl.sh
This will do the following:
- Load the FERC Form 1 and EIA 860/923 data packages into sqlite/pudl.sqlite.
- Convert the EPA CEMS data to an Apache Parquet dataset under parquet/epacems.
- Clone the raw FERC Form 1 databases into sqlite/ferc1.sqlite.

Selectively Load PUDL Data
If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.
Create the PUDL conda Environment
This installs the PUDL software locally, and a couple of other useful packages:
conda create --yes --name pudl --channel conda-forge \
--strict-channel-priority \
python=3.7 catalystcoop.pudl=0.3.1 dask jupyter jupyterlab seaborn pip
conda activate pudl
Create a PUDL data management workspace
Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some other directories show up, like parquet, sqlite, data etc.
pudl_setup ./
Extract and load the FERC Form 1 and EIA 860/923 data
If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup:
tar -xzf pudl-ferc1.tgz
tar -xzf pudl-eia860-eia923.tgz
To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you:
datapkg_to_sqlite \
datapkg/pudl-data-release/pudl-ferc1/datapackage.json \
datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \
-o datapkg/pudl-data-release/pudl-merged/
Now you should be able to connect to the database (~300 MB) which is stored in sqlite/pudl.sqlite.
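For example, a minimal Python check that the merged database is in place:

```python
import sqlite3

# Connect to the merged PUDL database created by datapkg_to_sqlite.
conn = sqlite3.connect("sqlite/pudl.sqlite")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
).fetchall()
print([name for (name,) in tables])
conn.close()
```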
Extract EPA CEMS and convert to Apache Parquet
If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. Then you can read those files into dataframes directly. In Python you can use the pandas.read_parquet() function. If you need to work with more data than can fit in memory at one time, we recommend using Dask dataframes. Converting the entire dataset from datapackages into Apache Parquet may take an hour or more:
tar -xzf pudl-eia860-eia923-epacems.tgz
epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying.
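A minimal reading sketch; the year/state filter values are examples, and the filters argument assumes the pyarrow engine (partition values may be read as strings depending on how the dataset was written):

```python
import pandas as pd
import dask.dataframe as dd

# Read a single year/state partition with pandas.
epacems_subset = pd.read_parquet(
    "parquet/epacems",
    filters=[("year", "=", 2018), ("state", "=", "CO")],
)
print(len(epacems_subset))

# Or lazily load the whole dataset with Dask when it won't fit in memory.
epacems = dd.read_parquet("parquet/epacems")
print(epacems.npartitions)
```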
Clone the raw FERC Form 1 Databases
If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL) you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release:
tar -xzf pudl-input-data.tgz
ferc1_to_sqlite data-release-settings.yml
You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite.
Data Quality Control
We have performed basic sanity checks on much but not all of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing
This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic, and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

About BigQuery
This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1 TB/mo of free tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

Download Dataset
This public dataset is also hosted in Google Cloud Storage and is available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query; see here for options and instructions. Signed-out users can download the full dataset by using the gcloud CLI. Follow the instructions here to download and install the gcloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True". To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R".
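As a minimal Python sketch of querying the tables with the BigQuery client library (the fully qualified table name and the column name below are assumptions; confirm the exact names in the BigQuery console before running):

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires a GCP project with BigQuery enabled

# Table and column names are assumptions based on the description above;
# verify them against the dataset listing in the BigQuery console.
query = """
    SELECT advertiser_disclosed_name, COUNT(*) AS n_ads
    FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
    GROUP BY advertiser_disclosed_name
    ORDER BY n_ads DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.advertiser_disclosed_name, row.n_ads)
```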
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A program for managing collections of full spectrum recordings of bats.

v6.2.6660 incorporates the import and export of collections of pictures in the image compare window.
v6.2.6661 fixes some bugs and speed issues in 6660.
v6.2.6680 tries to fix some database updating problems and adds additional debugging in this area.
v7.0.6760 - Major improvements and changes. First define the additional shortcut key in Audacity - CTRL-SHIFT-M=Open menu in focussed track. A new item in the 'View' menu, Analyse and Import, will open a folder of .wav files and sequentially open them in Audacity. When annotated and the label file saved and Audacity closed, the next file will be opened. If the label file is not saved then the process stops and will resume on the next invocation of Analyse and Import on that folder. As each file is opened the label track will be automatically created and named, and the view will zoom to the first 5 seconds of the .wav track.
v7.0.6764 also includes a new report format which (for one or more sessions) gives the number of minutes in each ten-minute window throughout the day in which a species of bat was detected. Rows are given for each species in the recordings. In Excel this looks good as a bar chart or a radar chart.
v7.0.6789 hopefully fixes the problems when trying to update a database that caused the program to crash on startup if the database did not contain the more recent Version table.
v7.0.6799 cosmetic changes to use the normal file selection dialog instead of the folder browser dialog; also, when using Analyse and Import, you no longer need to pick a file when selecting the .wav file folder.
v7.0.6820 adds session data to all report formats, including pass statistics for all species found in that session.
v7.0.6844 adds the ability to add, save, adjust and include in exported images fiducial lines. Lines can be added, deleted or adjusted in the image comparison window and are saved to the database when the window is closed. For exported images the lines are permanently overlaid on the image and are no longer adjustable.
v7.0.6847 makes slight improvements to the aspect ratio of images in the comparison window, and when images are exported the fiducial lines are only included if the FIDS button is depressed.
v7.0.6850 fixes an occasional bug when saving images through Analyse and Import - using filenames in the caption has priority over bats' names. Also improvements in file handling when changing databases - now attempts to recognise if a db is the right type.
v7.0.6858 makes some improvements to image handling, including a modification to the database structure to allow long descriptions for images (previously description+caption had to be less than 250 chars) and the ability to copy images within the application (but not to external applications). A single image may now be used simultaneously as a bat image, a call image or a segment image. Changes to it in one location will be reflected in all the other locations. On deletion the link is removed, and if there are no remaining links for the image then the image itself will be removed from the database.
v7.0.6859 has some improvements to the image handling system. In the batReference view the COMP button now adds all bat and call images for all selected bats to the comparison window. Double-clicking on a bat adds all bat, call and segment images for all the bats selected to the comparison window.
v7.0.6860 removed the COMP button from the bat reference view. Double-clicking in this view transfers all images of bats, calls and recordings to the comparison window. Double-clicking in the ListByBats view transfers all recording images but not the bat and call images to the comparison window. Exported images for recordings use the recording filename plus the start offset of the segment as a filename, or alternatively the image caption.
v7.0.6866 improvements to the grids and to grid scaling and movement, especially for the sonagram grids.
v7.0.6876 added the ability to right-click on a labelled segment in the recordings detail list control to open that recording in Audacity and scroll to the location of that labelled segment. Only one instance of Audacity may be opened at a time or the scrolling does not work. Also made some improvements to the scrolling behaviour of the recording detail window.
Version 7.1 makes significant changes to the way in which the RecordingSessions list is displayed. Because this list can get quite large and therefore takes a long time to load, it now loads the data in discrete pages. At the top of the RecordingSessions list is a new navigation bar with a set of buttons and two combo-boxes. The rightmost combo-box is used to set the number of items that will be loaded and displayed on a page. The selections are currently 10, 25, 50 and 100. Slower machines may find it advantageous to use smaller page sizes in order to speed up load times and reduce the demand for memory and CPU time. The other combo-box allows the selection of a sort field for the session list. Sessions are displayed in columns in a DataGrid which allows columns to be re-sized, moved and sorted. These functions all now only apply to the subset of data that has been loaded as a page. The combo-box allows you to sort the full set of data in the database before loading the page. Thus if the combo-box is set to sort on DATE with a page size of 10, then only the 10 earliest (or the 10 latest, depending on the direction of sorting) sessions in the database will be loaded. The displayed set of sessions can be sorted on the screen by clicking the column headers, but this only changes the order on the screen; it does not load any other sessions from the database. The four buttons can be used to load the next or previous pages or to move to the start or end of the complete database collection. The Next or Previous buttons move the selection by 2/3 of the page size so that there will always be some visual overlap between pages. The sort combo-box has two entries for each field, one with a suffix of ^ and one with a suffix of v. These sort the database in ascending or descending order. Selecting a sort field will update the display and sort the displayed entries on the same field, but the sort direction of the displayed items will be whatever was last used. Clicking the column header will change the direction of sort for the displayed items.
v7.1.6885 updates the database to DB version 6.2 by the addition of two link tables, between bats and recordings and between bats and sessions. These tables enable much faster access to bat-specific data. Also various improvements to the speed of loading data when switching to the ListByBats view, especially with very large databases.
v7.1.6891 further performance improvements in loading ListByBats and in loading images.
v7.1.6901 adds the ability to perform screen grabs of images without needing an external screen grabber program. Shift-click on the 'PASTE' button and drag and resize the semi-transparent window to select a screen area; right-click in the window to capture that portion of the screen. For details refer to Import/Import Pictures.
v7.1.6913 fixed some scaling issues on fiducial lines in the comparison window.
v7.1.6915 bugfix for adjusting fiducial lines - 7.1.6913 removed.
v7.1.6941 improvements and adjustments to grid and fiducial line handling.
v7.1.6951 fixes some problems with the Search dialog.
v7.2.6970 introduces the ability to replay segments at reduced speed or in heterodyne 'bat detector' mode.
v7.2.6971 when opening a recording or segment in Audacity the corresponding .txt file will be opened as a label track. NB this only works if there is only a single copy of Audacity open - subsequent calls with Audacity still open do not open the label track.
v7.2.6978 improvements to heterodyne playback to use a pure sine wave.
v7.2.6984 bug fixes and mods to image handling - image captions can now have a region appended in seconds after the file name.
BRM-Aud-Setup_v7_2_7000.exe - this version includes its own private copy of Audacity 2.3.0 portable, which will be placed in the same folder as BRM and has its own pre-configured configuration file appropriate for use with BRM. This will not interfere with any existing installation of Audacity but provides all the Audacity features required by BRM with no further action by the user. BRM will use this version to display .wav files.
v7.2.7000 also includes a new report format which is tailored to provide data for the Hertfordshire Mammals, Amphibians and Reptiles survey. It also displays the GPS co-ordinates for the Recording Session as an OS Grid Reference as well as latitude and longitude.
v7.2.7010 speed improvements and bug-fixes to opening and running Audacity through BRM. Audacity portable is now located in C:\audacity-win-portable instead of under the BRM program folder.
v7.2.7012 fixed some bugs in report generation when producing the Frequency Table. Enabled the AddTag button in the BatReference pane.
v7.2.7021 upgrades the Audacity component to version 2.3.1 and includes a few minor bug fixes.
TEMPO-Online provides the following functions and services:
- Free access to statistical information.
- Export of tables in .csv and .xls formats, and printing.

What is the content of TEMPO-Online?
The National Institute of Statistics offers a statistical database, TEMPO-Online, that gives access to a large range of information. The content of the database consists of:
- Approximately 1100 statistical indicators, divided into socio-economic fields and sub-fields;
- Metadata associated with the statistical indicators (definition, starting and ending year of the time series, the last period of data loading, statistical methodology, the last update);
- Detailed indicators at the level of statistical characteristic groups and/or sub-groups (e.g. the total number of employees at the end of the year by employee category, activities of the national economy - sections, sexes, areas and counties);
- Time series starting with 1990 until today, with a monthly, quarterly, semi-annual and annual frequency, at national level, development region level, county and commune level.

Search according to key words
The key-word search allows the finding of various objects (tables with statistical variables divided on time series). The search returns results based on the matrix code and on the key words in the title or in the definition of a matrix. The result of the search is shown as a list of matching objects. For a key word, one can use the search section from the menu bar on the left.

Tables
As a whole, the tables that result from a query have a flexible structure. For instance, the user may select the variables and attributes with the help of the query interface, according to his needs. The user can save the table that results from a query in .csv and .xls formats and print it. Note: in order to access tables at place level (very large), the user has to select each county with the respective places, so that access is faster and technical blocks are avoided.
This dataset originates from the experimental study titled "Effects of Endogenous Potassium and Calcium Metal Ions in Corn Stalk on the Characteristics of Its Pyrolysis Gaseous, Solid, and Liquid Products," aiming to obtain correlated data between metal ion concentration, pyrolysis temperature, and the yield and characteristics of pyrolysis products (biochar, bio-oil, and syngas) through systematic experiments, with all data generated via experimental determination and standardized processing.

The data generation process is as follows: first, experimental materials were prepared—corn stalks used as raw material were collected from farms in Guannan County, Lianyungang City, Jiangsu Province, crushed into powder with a particle size of 80–120 meshes using a crusher, dried to absolute dryness in an oven at 105°C, and then sealed for later use; the corn stalk powder (CS-Raw) was subjected to acid washing for ash removal by immersing it in 1 mol/L hydrochloric acid at a solid-to-liquid ratio of 1:10, stirring at room temperature for 12 hours, and then undergoing filtration, rinsing, and drying to obtain the ash-removed sample (CS-AW); subsequently, using KCl and CaCl₂ as metal sources, CS-AW was immersed in deionized water containing the corresponding metal salts at a solid-to-liquid ratio of 1:10, with metal ion concentrations (mass ratio relative to the raw material) set at 2%, 5%, and 7%, and after drying, metal-loaded samples (CS-K-2%/5%/7%, CS-Ca-2%/5%/7%) were obtained.

Pyrolysis experiments were conducted using a self-made fixed-bed device (comprising a gas supply system with high-purity nitrogen cylinders, high-purity oxygen cylinders, and gas flow controllers; a reaction system with a temperature controller and a heating reactor; a liquid collection system with a low-temperature bath and a condenser; and a gas collection system with desiccants and gas collection bags). For each experiment, 3 g of sample was weighed and placed in a quartz tube, purged with N₂ for 10 minutes, then heated to 400°C, 500°C, and 600°C at a heating rate of 20°C/min, held at the target temperature for 20 minutes, and after cooling, the solid residual char was weighed (M1); the mass of the liquid product (M2) was obtained from the mass difference of the condenser, and the gas yield was calculated as 100% minus the solid yield minus the liquid yield.

During data processing, the net organic char yield of biochar was calculated using the formula "(mass of residual char − mass of metal salt) / (mass of raw material − mass of metal salt) × 100%"; the oxygen (O) content on a dry basis was derived by "100% − C − H − N − S − ash content"; the components of bio-oil were qualitatively analyzed via the peak area normalization method using a gas chromatography-mass spectrometry (GC/MS) instrument combined with the NIST spectral library; and the gas components were determined using a gas chromatography (GC) instrument equipped with a thermal conductivity detector (TCD) and a flame ionization detector (FID).
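For reference, the yield and composition formulas quoted above translate directly into code; a minimal sketch (function and argument names are ours, not from the study):

```python
def net_organic_char_yield(residual_char_g, metal_salt_g, raw_material_g):
    """Net organic char yield (%) = (M_char - M_salt) / (M_raw - M_salt) * 100."""
    return (residual_char_g - metal_salt_g) / (raw_material_g - metal_salt_g) * 100.0

def oxygen_by_difference(c_pct, h_pct, n_pct, s_pct, ash_pct):
    """Dry-basis O content (%) = 100 - C - H - N - S - ash."""
    return 100.0 - c_pct - h_pct - n_pct - s_pct - ash_pct

def gas_yield(solid_yield_pct, liquid_yield_pct):
    """Gas yield (%) = 100 - solid yield - liquid yield."""
    return 100.0 - solid_yield_pct - liquid_yield_pct
```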
Experimental characterization relied on various instruments: elemental analysis was performed using an Elementary Vario EL III Automatic Elemental Analyzer (Elementar, Germany); higher heating value was measured using a ZDHW-300A Microcomputer-Automatic Calorimeter (Keda Instrument Co., Ltd., Hebi City); proximate analysis was conducted in accordance with the GB/T 28731—2012 standard; the content of alkali and alkaline earth metals (AAEMs) was determined using an iCAP 7000 Inductively Coupled Plasma Optical Emission Spectrometer (ICP-OES, Thermo Fisher Scientific, USA); thermogravimetric analysis was carried out using a TG209 F1 Libra Thermogravimetric Analyzer (Netzsch, Germany) with an N₂ flow rate of 40 mL/min and a heating rate of 20°C/min up to 800°C; gas component analysis was performed using a GC9890B Gas Chromatograph (Renhua Chromatography Technology Co., Ltd., Nanjing) equipped with a Porapak Q column and a 13X molecular sieve; and liquid component analysis was conducted using an ISQ7000 Gas Chromatography-Mass Spectrometry (GC/MS) Instrument (Thermo Fisher Scientific, USA) with an HP-5MS capillary column and high-purity helium as the carrier gas. In terms of spatiotemporal information, there is no continuous time-series data in the time dimension, only instantaneous experimental data at three pyrolysis temperatures (400°C, 500°C, and 600°C), and all experimental operations were completed within the same time period to ensure consistent conditions; in the spatial dimension, the raw material was collected from a specific farm in Guannan County, Lianyungang City, Jiangsu Province (single-point sampling, no spatial gradient distribution), all experiments were conducted in the laboratories of the Bamboo Industry Institute and the College of Environment and Resources, Zhejiang A & F University, and the spatial resolution focuses on the laboratory experimental equipment and the raw material sampling point, with no large-scale spatial extension data. 
The table data includes 5 structured data tables (Table 1 to Table 5): Table 1, titled "Elemental and Proximate Analysis of Corn Stalk Before and After Acid Washing for Ash Removal and Metal Ion Loading," contains 8 records (covering CS-Raw, CS-AW, and 6 metal-loaded samples), with column labels including elemental analysis (C, H, O, N, S, unit: wt%, on a dry and ash-free basis (daf)), proximate analysis (volatiles, fixed carbon, ash, unit: wt%, on a dry basis (db)), and higher heating value (unit: MJ/kg); Table 2, titled "AAEM Contents in Corn Stalk Before and After Acid Washing for Ash Removal," includes 2 records (CS-Raw, CS-AW), with column labels including AAEM contents (K, Na, Ca, Mg, unit: μg/g) and removal rates (unit: %); Table 3, titled "Residual Char Rate of Corn Stalk Pyrolysis Under Different Concentrations of Potassium Ions and Calcium Ions," has 8 records (the same samples as in Table 1), with column labels being total residual char rate (unit: %) and net organic residual char rate (unit: %); Table 4, titled "Effects of Different Concentrations of Potassium Ions and Calcium Ions on the Basic Characteristics of Corn Stalk Pyrolysis Char," contains 8 records (the same samples as in Table 1), with column labels consistent with those of Table 1 (O is marked as O*, indicating a calculated value on a dry basis); Table 5, titled "Effects of Pyrolysis Temperature on the Elemental and Proximate Analysis of Corn Stalk Pyrolysis Biochar," includes 6 records (400-7% K, 500-7% K, 600-7% K, 400-7% Ca, 500-7% Ca, 600-7% Ca), with column labels consistent with those of Table 1. In terms of data integrity, there is no obvious data missing, and all samples designed in the experiment (8 basic samples and 6 temperature-metal combination samples) have undergone testing for key indicators such as elemental analysis, proximate analysis, yield determination, and component analysis; the sources of errors mainly include mass errors caused by balance precision during sample weighing (e.g., weighing of M1 and M2), leakage errors that may be caused by the tightness of gas collection bags during gas collection, relative analysis errors from the GC/MS peak area normalization method, and weight loss rate errors caused by sample uniformity in thermogravimetric analysis; the experiment reduced errors by controlling conditions such as N₂ purging time (≥10 minutes), heating rate stability (20°C/min), metal salt weighing precision, and consistency of the solid-to-liquid ratio for acid washing, and although no specific error range is clearly given, the data meet the precision requirements of conventional laboratory experiments (e.g., mass weighing error ≤ 0.001 g, temperature control error ≤ ±5°C). The types of data files include structured table data (Excel format), experimental graph data (thermogravimetric curves, product yield and component distribution diagrams, with source files in Origin format), device schematic diagrams (PPT format), and original instrument data (e.g., .raw format of GC/MS, which can be opened using Excel); there are no files in niche formats, and all files are compatible with conventional scientific research software.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction The Long Term Development Statements (LTDS) report on a 0-5 year period, describing a forecast of load on the network and envisioned network developments. The LTDS is published at the end of May and November each year. This is Table 2b from our current LTDS report (published 28 November 2025), showing the transformer information for three-winding (1x High Voltage, 2x Low Voltage) transformers associated with each Grid and Primary substation, where applicable. More information and full reports are available from the landing page below: Long Term Development Statement and Network Development Plan Landing Page
Methodological Approach
Site Functional Locations (FLOCs) are used to associate each transformer with the substation in which it is located, as listed in the "Key characteristics of active Grid and Primary sites" UK Power Networks dataset. An ID field is added to identify the row number for reference purposes.
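A minimal R sketch of the association described above: matching each transformer record to its host substation via the site FLOC and adding a row-number ID. The file names and the join column are placeholders, not the published schema.

# Illustrative only: file names and the join column ("functional_location")
# are placeholders for the actual LTDS Table 2b and site-characteristics exports.
library(readr)
library(dplyr)

transformers <- read_csv("ltds_table_2b_three_winding_transformers.csv")
sites        <- read_csv("grid_and_primary_site_characteristics.csv")

joined <- transformers %>%
  inner_join(sites, by = "functional_location") %>%  # match on the site FLOC
  mutate(id = row_number())                          # reference ID per row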
Quality Control Statement Quality control measures include: verification steps to match features only with confirmed functional locations; manual review and correction of data inconsistencies; and additional verification steps to ensure accuracy in the methodology.
Assurance Statement The Open Data Team and Network Insights Team worked together to ensure data accuracy and consistency.
Other Download dataset information: Metadata (JSON). Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/ To view this data, please register and log in.
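Since the dataset information is offered as JSON metadata, a short R sketch for inspecting a downloaded copy is shown below; the file name is a placeholder for wherever the metadata export is saved locally.

# Placeholder file name: a locally saved copy of the dataset's JSON metadata.
library(jsonlite)

meta <- fromJSON("ltds_table_2b_metadata.json")
str(meta, max.level = 2)   # browse the top-level metadata fields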
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the national survey on drug use and health (nsduh) with r

the national survey on drug use and health (nsduh) monitors illicit drug, alcohol, and tobacco use with more detail than any other survey out there. if you wanna know the average age at first chewing tobacco dip, the prevalence of needle-sharing, the family structure of households with someone abusing pain relievers, even the health insurance coverage of peyote users, you are in the right place. the substance abuse and mental health services administration (samhsa) contracts with the north carolinians over at research triangle institute to run the survey, but the university of michigan's substance abuse and mental health data archive (samhda) holds the keys to this data castle. nsduh in its current form only goes back about a decade, to when samhsa re-designed the methodology and started paying respondents thirty bucks a pop. before that, look for its predecessor - the national household survey on drug abuse (nhsda) - with public use files available back to 1979 (included in these scripts). be sure to read those changes in methodology carefully before you start trying to trend smokers' virginia slims brand loyalty back to 1999.

although (to my knowledge) only the national health interview survey contains r syntax examples in its documentation, the friendly folks at samhsa have shown promise. since their published data tables were run on a restricted-access data set, i requested that they run the same sudaan analysis code on the public use files to confirm that this new r syntax does what it should. they delivered, i matched, pats on the back all around. if you need a one-off data point, samhda is overflowing with options to analyze the data online. you might even find some restricted statistics that won't appear in the public use files. still, that's no substitute for getting your hands dirty. when you tire of menu-driven online query tools and you're ready to bark with the big data dogs, give these puppies a whirl.

the national survey on drug use and health targets the civilian, noninstitutionalized population of the united states aged twelve and older. this new github repository contains three scripts:

1979-2011 - download all microdata.R: authenticate the university of michigan's "i agree with these terms" page; download, import, and save each available year of data (with documentation) back to 1979; convert each pre-packaged stata do-file (.do) into r, run the damn thing, get NAs where they belong.

2010 single-year - analysis examples.R: load a single year of data; limit the table to the variables needed for an example analysis; construct the complex sample survey object; run enough example analyses to make a kitchen sink jealous.

replicate samhsa puf.R: load a single year of data; limit the table to the variables needed for an example analysis; construct the complex sample survey object; print statistics and standard errors matching the target replication table.

click here to view these three scripts

for more detail about the national survey on drug use and health, visit: the substance abuse and mental health services administration's nsduh homepage, research triangle institute's nsduh homepage, and the university of michigan's nsduh homepage.

notes: the 'download all microdata' program intentionally breaks unless you complete the clearly-defined, one-step instruction to authenticate that you have read and agree with the download terms. the script will download the entire public use file archive, but only after this step has been completed. if you contact me for help without reading those instructions, i reserve the right to tease you mercilessly. also: thanks to the great hadley wickham for figuring out how to authenticate in the first place. confidential to sas, spss, stata, and sudaan users: did you know that you don't have to stop reading just because you've run out of candlewax? maybe it's time to switch to r. :D
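a minimal sketch of the "construct the complex sample survey object" step above, using the survey package in r. the design variables (verep, vestr, analwt_c) and the example variable (cigever) follow the pattern of recent nsduh public use files, but treat them as assumptions here and check the codebook for whatever year you load; the tiny data frame is a toy stand-in, not real microdata.

library(survey)

# toy stand-in for one year of public-use microdata (the real files come
# from the download script above) -- assumed column names, fake values
nsduh <- data.frame(
  verep    = c(1, 2, 1, 2),          # primary sampling units
  vestr    = c(10, 10, 20, 20),      # sampling strata
  analwt_c = c(4000, 3500, 5000, 4500),  # person-level analysis weight
  cigever  = c(1, 0, 1, 1)           # placeholder yes/no variable
)

nsduh_design <-
  svydesign(
    ids     = ~verep,
    strata  = ~vestr,
    weights = ~analwt_c,
    data    = nsduh,
    nest    = TRUE
  )

# weighted proportion of the placeholder yes/no variable, with standard error
svymean(~cigever, nsduh_design, na.rm = TRUE)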
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data are organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored as a .csv file.
Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks and their metadata are grouped into data frames by the publishing year of the source kernels. The current version of the corpus includes two code-block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
Marked-up code blocks carry the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
Because the marked-up code blocks store only the numeric id of the semantic type, we also provide a mapping from this id to the semantic type and subclass (actual_graph_2022-06-01.csv).
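As a small, hedged example of working with these tables, the R sketch below loads the labeled snippets and attaches the semantic type and subclass from the mapping file; the join key and the relevance column name are assumptions based on the description above, not the published schema.

# Sketch only: the join key ("graph_vertex_id") and the relevance column
# ("marks") are assumed names, not confirmed columns of the corpus.
library(readr)
library(dplyr)

markup  <- read_csv("markup_data_20220415.csv")
mapping <- read_csv("actual_graph_2022-06-01.csv")

labeled <- markup %>%
  left_join(mapping, by = "graph_vertex_id") %>%   # attach semantic type/subclass
  filter(marks >= 4)                               # keep snippets with high estimated relevance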
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
SPAN-E Level 2 Electron Full 3D Spectra Data
File Naming Format: psp_swp_spa_sf0_L2_16Ax8Dx32E_YYYYMMDD_v01.cdf
The SF0 products are the Full 3D Electron spectra from each individual SPAN-E instrument, SPAN-Ae and SPAN-B. Units are in differential energy flux, degrees, and eV. One spectrum comprises decreasing steps in Energy specified by the number in the filename, alternating sweeps in Theta/Deflection, also specified by the number in the filename, and a number of Phi/Anode directions, also specified by the number in the filename. The sample filename above includes 16 Anodes, 8 Deflections, and 32 Energies.
This data set covers all periods for which the instrument was turned on and taking data in the solar wind in "Full Sweep", normal cadence survey mode. This includes maneuvers affecting the spacecraft attitude and orientation. Measurements taken by SPAN-B during cruise phase periods when the spacecraft is pointed away from the sun are taken in sunlight.

Parker Solar Probe SWEAP Solar Probe Analyzer (SPAN) Electron Data Release Notes
November 19, 2019 Initial Data Release

Overview of Measurements
The SWEAP team is pleased to release the data from Encounter 1 and Encounter 2. The files contain data from the time range October 31, 2018 - June 18, 2019. The prime mission of Parker Solar Probe is to take data when within 0.25 AU of the Sun during its orbit. However, there have been some extended campaign measurements outside of this distance. The data are available for those days that are within 0.25 AU as well as those days when the instruments were operational outside of 0.25 AU.
Each SWEAP data file includes a set of a particular type of measurements over a single observing day. Measurements are provided in Common Data Format (CDF), a self-documenting data framework for which convenient open source tools exist across most scientific computing platforms. Users are strongly encouraged to consult the global metadata in each file and the metadata linked to each variable. The metadata include comprehensive listings of relevant information, including units, coordinate systems, qualitative descriptions, measurement uncertainties, methodologies, links to further documentation, and so forth.

SPAN-E Level 2 Version 01 Release Notes
The SPAN-Ae and SPAN-B instruments together have fields of view covering >90% of the sky; major obstructions to the FOV include the spacecraft heat shield and other intrusions by spacecraft components. Each individual SPAN-E has a FOV of ±60° in Theta and 240° in Phi. The rotation matrices to convert into the spacecraft frame can be found in the individual CDF files, or in the instrument paper.
This data set covers all periods for which the instrument was turned on and taking data in the solar wind in ion mode. This includes maneuvers affecting the spacecraft attitude and orientation. Measurements taken by SPAN-B when the spacecraft is pointed away from the sun are taken in sunlight.
The data quality flags for the SPAN data can be found in the CDF files as QUALITY_FLAG (0=good, 1=bad).

General Remarks for Version 01 Data
Users interested in field-aligned electrons should take care regarding potential blockages from the heat shield when B is near radial, especially in SPAN-Ae. Artificial reductions in strahl width can result.
Due to the relatively high electron temperature in the inner heliosphere, many secondary electrons are generated from spacecraft and instrument surfaces. As a result, electron measurements in this release below 30 eV are not advised for scientific analysis.
The fields of view in SPAN-Ae and SPAN-B have many intrusions by the spacecraft, and erroneous pixels discovered in analysis, in particular near the edges of the FOV, should be viewed with skepticism. Details on FOV intrusion are found in the instrument paper, forthcoming, or by contacting the SPAN-E instrument scientist.
The instrument mechanical attenuators are engaged during the eight days around perihelia 1 and 2, which results in a reduction of the total electron flux into the instrument by a factor of about 10. During these eight days, halo electron measurements are artificially enhanced in the L2 products as a result of the reduced instrument geometric factor and subsequent ground corrections.
A general note for Encounter 1 and Encounter 2 data: a miscalculation in the deflection tables loaded to both SPAN-Ae and SPAN-B resulted in over-deflection of the outermost Theta angles during these encounters. As such, pixels at large Thetas should be ignored. This error was corrected by a table upload prior to Encounter 3.
Lastly, when viewing time gaps in the SPAN-E measurements, be advised that the first data point produced by the instrument after a power-on is the maximum value permitted by internal instrument counters. Therefore, the first data point after power-up is erroneous and should be discarded, as indicated by the quality flags.

SPAN-E Encounter 1 Remarks
SPAN-E operated nominally for the majority of the first encounter. Exceptions include a few instances of corrupted, higher-energy sweep tables and an instrument commanding error for the two hours surrounding perihelion 1. These and other instrument diagnostic tests are indicated with the QUALITY_FLAG variable in the CDFs.
The mechanical attenuator was engaged for the 8 days around perihelion 1; as a result, the microchannel plate (MCP) noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution with this data release if looking for halo electrons when the mechanical attenuator is engaged.

SPAN-E Cruise Phase Remarks
The cruise mode rates of SPAN-E are greatly reduced compared to the encounter mode rates. When the PSP spacecraft is in a communications slew, the SPAN-B instrument occasionally reaches its maximum allowable operating temperature and is powered off by SWEM.
Timing for the SF1 products in cruise phase is not corrected in v01, and thus it is not advised to use those data at this time for scientific analysis. For the SF0 products, typically one spectrum out of every 32 survey spectra is returned, roughly every 15 minutes. One out of every four 27.75 s SF1 spectra is produced every 111 s.

SPAN-E Encounter 2 Remarks
SPAN-E operated nominally for the majority of the second encounter. Exceptions include instrument diagnostic and health checks and a few instances of corrupted high-energy sweep tables. These tests and corrupted table loads are indicated with the QUALITY_FLAG parameter.
The mechanical attenuator was engaged for the 8 days around perihelion 2; as a result, the MCP noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution in this data release if looking for halo electrons when the mechanical attenuator is engaged.

Parker Solar Probe SWEAP Rules of the Road
As part of the development of collaboration with the broader Heliophysics community, the mission has drafted a "Rules of the Road" to govern how PSP instrument data are to be used.
1) Users should consult with the PI to discuss the appropriate use of instrument data or model results and to ensure that the users are accessing the most recently available versions of the data and of the analysis routines. Instrument team Science Operations Centers (SOCs) and/or Virtual Observatories (VOs) should facilitate this process, serving as the contact point between PI and users in most cases.
2) Users should heed the caveats of investigators as to the interpretations and limitations of data or model results. Investigators supplying data or models may insist that such caveats be published. Data and model version numbers should also be specified.
3) Browse products, Quicklook, and Planning data are not intended for science analysis or publication and should not be used for those purposes without consent of the PI.
4) Users should acknowledge the sources of data used in all publications, presentations, and reports: "We acknowledge the NASA Parker Solar Probe Mission and the SWEAP team led by J. Kasper for use of data."
5) Users are encouraged to provide the PI a copy of each manuscript that uses the PI data prior to submission of that manuscript for consideration of publication. On publication, the citation should be transmitted to the PI and any other providers of data.
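The release notes above boil down to two simple per-record screens: keep only QUALITY_FLAG = 0 records and avoid energies below 30 eV. The R sketch below applies those screens to vectors assumed to have been extracted from a daily L2 CDF file with whatever CDF reader the user prefers; apart from QUALITY_FLAG, the names are placeholders.

# Sketch of the screening steps suggested in the release notes, applied to
# vectors assumed to have been extracted from a daily L2 CDF file.
# QUALITY_FLAG comes from the notes above (0 = good, 1 = bad);
# `energy_ev` and `eflux` are placeholder names, not CDF variable names.
screen_spane <- function(quality_flag, energy_ev, eflux) {
  keep <- quality_flag == 0 & energy_ev >= 30   # drop bad records and bins below 30 eV
  eflux[!keep] <- NA                            # mask rather than delete, keeping alignment
  eflux
}

# usage (toy numbers, not real measurements):
screen_spane(quality_flag = c(0, 0, 1),
             energy_ev    = c(25, 100, 300),
             eflux        = c(1e8, 5e7, 2e7))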