MIT License https://opensource.org/licenses/MIT
License information was derived automatically
RegMix Data
Dataset Description
The RegMix Data is a curated dataset derived from the Pile-Uncopyrighted, specifically designed for the RegMix paper (https://huggingface.co/papers/2407.01492). This dataset aims to facilitate the automatic identification of high-performing data mixtures for language model pre-training by formulating it as a regression task.
Key Features:
Size: Approximately 1TB disk space, 250B tokens
Distribution: Follows the natural token… See the full description on the dataset page: https://huggingface.co/datasets/sail/regmix-data.
The Strategic Analytics for Improvement and Learning (SAIL) Value Model is a system for summarizing hospital system performance within the Veterans Health Administration (VHA). SAIL assesses key quality measures in areas such as death rate, complications, and patient satisfaction, as well as overall efficiency, at individual VA Medical Centers (VAMCs).
Attendance information for all hospital outpatient appointments.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Space sails are a continuum of lightweight, thin, large-area, deployable technologies which are pushing forward new frontiers in space mobility and exploration. They encompass solar sails, laser-driven sails, drag sails, magnetic sails, electric sails, deployable membrane reflectors, deployable membrane antennas, and solar power sails. This database contains values of important parameters from 220 different space sails, which have either flown in space or been proposed as mission concepts. The parameters are: the deployed sail area, the spacecraft's total mass, the total sail loading, the characteristic acceleration, the characteristic thrust, the sail's stowed volume, the sail packing efficiency, and the sail thickness. Assumptions and definitions used for each parameter are provided, along with links to the data sources.
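As a rough illustration of how several of the listed parameters relate to one another, the sketch below uses the standard textbook definitions for a solar sail at 1 AU. These definitions, the pressure constant, and the function names are assumptions made here for illustration; the database states its own assumptions and definitions, which may differ, and the relations do not apply to drag, magnetic, or electric sails.

```python
# Solar radiation pressure on a perfectly absorbing surface at 1 AU, in N/m^2.
# Standard textbook value (assumption); not taken from the database itself.
SOLAR_PRESSURE_1AU = 4.56e-6


def sail_loading(total_mass_kg: float, sail_area_m2: float) -> float:
    """Total sail loading: spacecraft mass per unit deployed sail area (kg/m^2)."""
    if sail_area_m2 <= 0:
        raise ValueError("sail area must be positive")
    return total_mass_kg / sail_area_m2


def characteristic_acceleration(
    total_mass_kg: float, sail_area_m2: float, efficiency: float = 1.0
) -> float:
    """Acceleration (m/s^2) of a Sun-facing solar sail at 1 AU.

    The factor 2 models a perfectly reflecting sail; `efficiency` (0..1)
    scales it down for a real, imperfect membrane.
    """
    sigma = sail_loading(total_mass_kg, sail_area_m2)
    return 2.0 * efficiency * SOLAR_PRESSURE_1AU / sigma


def characteristic_thrust(sail_area_m2: float, efficiency: float = 1.0) -> float:
    """Thrust (N) of a Sun-facing solar sail at 1 AU."""
    return 2.0 * efficiency * SOLAR_PRESSURE_1AU * sail_area_m2
```

For a hypothetical 10 kg spacecraft with a 100 m² ideal sail, `sail_loading(10.0, 100.0)` gives 0.1 kg/m² and `characteristic_acceleration(10.0, 100.0)` gives about 9.12e-5 m/s².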
The Child Health System in Wales; includes birth registration and monitoring of child health examinations and immunisations.
The dataset brings together data from local Child Health System databases which are held by NHS Trusts and used by them to administer child immunisation and health surveillance programmes.
The dataset contains all children born, resident or treated in Wales and born after 1987.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Crude Steel: Production: Public Sector: Steel Authority of India Limited (SAIL) data was reported at 15,022.000 Metric Ton th in 2018. This records an increase from the previous number of 14,494.000 Metric Ton th for 2017. Crude Steel: Production: Public Sector: Steel Authority of India Limited (SAIL) data is updated yearly, with a median of 13,507.500 Metric Ton th from Mar 2003 to 2018 over 16 observations. The data reached an all-time high of 15,022.000 Metric Ton th in 2018 and a record low of 11,628.000 Metric Ton th in 2003. Crude Steel: Production: Public Sector: Steel Authority of India Limited (SAIL) data remains in active status in CEIC and is reported by the Joint Plant Committee. The data is categorized under India Premium Database’s Metal and Steel Sector – Table IN.WAA003: Crude Steel: Production (Annual).
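The size of the 2017 to 2018 increase quoted above can be checked with a few lines of arithmetic. This is only a sketch using the two figures from the text; the function and variable names are made up for illustration.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from `previous` to `current`, in percent."""
    return (current - previous) / previous * 100.0


# Figures quoted in the description, in thousand metric tons.
production_2017 = 14_494.0
production_2018 = 15_022.0

increase = production_2018 - production_2017
change = percent_change(production_2017, production_2018)
print(f"{increase:.0f} thousand t, {change:.2f}%")  # prints "528 thousand t, 3.64%"
```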
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
In August 2021, the 107-year-old, 98-meter-long tall ship Statsraad Lehmkuhl departed Norway to return in April 2023, having sailed 55,000 nautical miles and visited 36 ports worldwide. The main goal is to draw attention to, and share knowledge about, the crucial role of the ocean in sustainable development from a global perspective. This dataset contains various marine observations collected in the Atlantic Ocean. This dataset is U.S. State Department MSR U2021-017 as part of the World Data Services for Oceanography. CTD data are in TXT and RSK (RBR CTD) formats, navigation data are in CSV format, pCO2 data are in TXT format, wave radar data are in Python Pickle File (PKL) format, weather station and Ferrybox (through-flow system) data are in JSON format, echosounder data are in Simrad EK80 (.raw) format, and hydrophone sound data are in uncompressed wave format (.wav). The latter two are compressed with gzip.
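Since the hydrophone recordings are described as WAV files compressed with gzip, they can in principle be inspected without unpacking them to disk first. The sketch below uses only the Python standard library; the function name and the `.wav.gz` file naming are assumptions, not something the dataset description specifies.

```python
import gzip
import wave


def describe_gzipped_wav(path: str) -> dict:
    """Return basic audio parameters of a gzip-compressed WAV file.

    The file is decompressed on the fly, so no temporary copy is written.
    """
    with gzip.open(path, "rb") as compressed:
        with wave.open(compressed, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "channels": wav.getnchannels(),
                "sample_width_bytes": wav.getsampwidth(),
                "sample_rate_hz": rate,
                "frames": frames,
                "duration_s": frames / rate,
            }
```

A call such as `describe_gzipped_wav("hydrophone_sample.wav.gz")` (a hypothetical file name) would report the channel count, sample rate, and duration of the recording.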
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project SAIL aimed to improve the scientific understanding of the marine boundary layer by means of a unique monitoring campaign on board the iconic Portuguese tall ship NRP Sagres during its 2020 circumnavigation expedition. This dataset comprises the pre-processed atmospheric measurements from the SAIL campaign. It is derived from the raw measurements (https://doi.org/10.25747/b2ff-kg31) by applying preliminary quality-control and pre-processing procedures. The Jupyter notebooks documenting the pre-processing of the data are publicly available on Zenodo's Project SAIL community (https://zenodo.org/communities/sail). Detailed information on the pre-processing procedures can be found in the project's data management plan (DOI: https://doi.org/10.5281/zenodo.4286209). This dataset currently includes the resources detailed below. Additional resources will be added as they become available. The ReadMe file provides detailed information about the structure of the resources.
This database contains residential and geographical information data about care homes in Wales.
This dataset contains near-surface measurements of oceanographic, meteorological and physical data collected in situ during a survey of the eastern Bering Sea shelf conducted by three autonomous surface vehicles (USVs): Saildrones (SD) 1043, 1046, and 1049. The saildrones were used to conduct an acoustic survey of walleye pollock (Gadus chalcogrammus) in the US exclusive economic zone in summer 2020. This survey is traditionally conducted with crewed research vessels, but was conducted with USVs in response to the cancellation of the ship-based surveys due to safety concerns associated with the COVID-19 pandemic. The USV survey was conducted on 14 transects spaced 74 km apart, spanning the ~80 m to ~1000 m depth contours, with SD 1043 sampling in the south, SD 1046 in the center, and SD 1049 in the north portion of the survey area. All available data are included, which encompass the survey and a portion of the transit to the survey area. The saildrones were equipped with a variety of sensors and instruments, consisting of a thermosalinograph, echo sounder, oxygen optode, fluorometer, SST IR pyrometer, anemometer, meteorological probe, and digital barometer. The oceanographic measurements include skin temperature, salinity, water temperature, water skin temperature, chlorophyll-a, and dissolved oxygen. The atmospheric measurements consist of wind speed and direction, air temperature, relative humidity and air pressure. All the data are in netCDF-CF (underway) format. These data are experimental and have not been quality controlled. These data are made available at the user’s own risk. Users will need to do quality control when using these data. The data from the echosounder will be separately archived at NCEI’s water column sonar data archive.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project SAIL aimed to improve the scientific understanding of the marine boundary layer by means of a unique monitoring campaign on board the iconic Portuguese tall ship NRP Sagres during its 2020 circumnavigation expedition. The campaign focused on the measurement of the atmospheric electric field over the ocean, and on the study of space-driven interactions via the detailed monitoring of gamma, solar and cosmic radiation as well as GNSS signals and atmospheric ionization. The atmospheric measurements are complemented by the collection of fish samples and by underwater monitoring of the ocean state (temperature, conductivity, dissolved oxygen, pH, spectral radiance), providing unique data for the detailed study of ocean-atmosphere fluxes and surface-atmosphere interactions. This dataset comprises the raw atmospheric measurements from the SAIL campaign, including the ship data collected onboard (denoted by the infix SHIP), the sensor data, obtained after correction of logging errors (denoted by the infix SD) and the geosensor data corresponding to georeferenced datafiles (denoted by the infix GD). Further information can be found in the project's data management plan (DOI: https://doi.org/10.5281/zenodo.4286209). This dataset currently includes the resources detailed below. Additional resources will be added as they become available.

ATMOSPHERIC ELECTRIC FIELD

The resource SAIL_SHIP_E1.tar.gz contains the files SAIL_SHIP_E1_yyyymmdd.tgz, each including the hourly files E1_yyyymmdd_HH.txt (where yyyy is the year, mm the month, dd the day, and HH the hour). The files E1_yyyymmdd_HH.txt have the following structure:
col 1: timestamp (seconds.microseconds)
col 2: date (mm/dd/yyyy)
col 3: time (HH:MM:SS)
col 4: voltage (power) (V)
col 5: voltage (internal) (V)
col 6: Panel temperature (deg C)
col 7: Electric field (V/m)
col 8: Leakage current (nA)
col 9: CS110 status (numeric code)
col 10: Internal RH (%)
col 11: shortwave incoming radiation (W/m2)
col 12: shortwave outgoing radiation (W/m2)

The resource SAIL_SHIP_E2.tar.gz contains the files SAIL_SHIP_E2_yyyymmdd.tgz, each including the hourly files E2_yyyymmdd_HH.txt (where yyyy is the year, mm the month, dd the day, and HH the hour). The files E2_yyyymmdd_HH.txt have the following structure:
col 1: timestamp (seconds.microseconds)
col 2: date (mm/dd/yyyy)
col 3: time (HH:MM:SS)
col 4: voltage (power) (V)
col 5: voltage (internal) (V)
col 6: Panel temperature (deg C)
col 7: Electric field (V/m)
col 8: Leakage current (nA)
col 9: CS110 status (numeric code)
col 10: Internal RH (%)

The resource SAIL_SD_E1.tar.gz contains the files SAIL_SD_E1_yyyymmdd.tgz, each including the hourly files E1_yyyymmdd_HH.txt (where yyyy is the year, mm the month, dd the day, and HH the hour), with the following structure:
col 1: timestamp (seconds.microseconds)
col 2: Electric field (V/m)
col 3: Leakage current (nA)
col 4: CS110 status (numeric code)
col 5: Internal RH (%)

The resource SAIL_SD_E2.tar.gz contains the files SAIL_SD_E2_yyyymmdd.tgz, each including the hourly files E2_yyyymmdd_HH.txt (where yyyy is the year, mm the month, dd the day, and HH the hour), with the following structure:
col 1: timestamp (seconds.microseconds)
col 2: Electric field (V/m)
col 3: Leakage current (nA)
col 4: CS110 status (numeric code)
col 5: Internal RH (%)

The resource SAIL_GD_E1.tar.gz contains the files SAIL_GD_E1_yyyymmdd.tgz, each including the hourly files E1_yyyymmdd_HH.txt (where yyyy is the year, mm the month, dd the day, and HH the hour), with the following structure:
col 1: timestamp (seconds.microseconds)
col 2: Electric field (V/m)
col 3: Leakage current (nA)
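The file-naming scheme and the 12-column layout of the SHIP E1 hourly files described above could be handled along the following lines. This is a minimal sketch: the field names are invented here, and the column delimiter is an assumption (whitespace by default), since the description does not state how columns are separated.

```python
import re
from datetime import datetime
from typing import Optional

# Field names for the 12 columns of the SHIP E1 hourly files, in the order
# listed in the description. The names themselves are illustrative.
SHIP_E1_COLUMNS = [
    "timestamp", "date", "time", "voltage_power_V", "voltage_internal_V",
    "panel_temperature_degC", "electric_field_V_per_m", "leakage_current_nA",
    "cs110_status", "internal_rh_percent",
    "shortwave_incoming_W_per_m2", "shortwave_outgoing_W_per_m2",
]

# Matches hourly file names of the form E1_yyyymmdd_HH.txt or E2_yyyymmdd_HH.txt.
FILENAME_PATTERN = re.compile(r"^(E[12])_(\d{8})_(\d{2})\.txt$")


def parse_hourly_filename(name: str):
    """Split an hourly file name into (sensor, start of the hour it covers)."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    sensor, day, hour = match.groups()
    return sensor, datetime.strptime(day + hour, "%Y%m%d%H")


def parse_ship_e1_row(line: str, delimiter: Optional[str] = None) -> dict:
    """Map one data line to named fields.

    ASSUMPTION: columns are whitespace-separated (delimiter=None); pass
    delimiter="," or similar if the real files differ.
    """
    fields = line.strip().split(delimiter)
    if len(fields) != len(SHIP_E1_COLUMNS):
        raise ValueError(f"expected {len(SHIP_E1_COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(SHIP_E1_COLUMNS, fields))
    for key in SHIP_E1_COLUMNS:
        if key not in ("date", "time"):  # date and time stay as strings
            row[key] = float(row[key])
    return row
```

For example, `parse_hourly_filename("E1_20200105_13.txt")` returns the sensor label `"E1"` and a `datetime` for 5 January 2020 at 13:00. The shorter SD and GD layouts could be handled the same way with their own column lists.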
This dataset provides information about the number of properties, residents, and average property values for The Sail cross streets in East Islip, NY.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Crude Steel: Installed Capacity: Public Sector: Steel Authority of India Limited (SAIL) data was reported at 17,519.000 Metric Ton th in 2018. This stayed constant from the previous number of 17,519.000 Metric Ton th for 2017. Crude Steel: Installed Capacity: Public Sector: Steel Authority of India Limited (SAIL) data is updated yearly, with a median of 12,859.000 Metric Ton th from Mar 2003 to 2018 over 16 observations. The data reached an all-time high of 17,519.000 Metric Ton th in 2018 and a record low of 12,696.000 Metric Ton th in 2004. Crude Steel: Installed Capacity: Public Sector: Steel Authority of India Limited (SAIL) data remains in active status in CEIC and is reported by the Joint Plant Committee. The data is categorized under India Premium Database’s Metal and Steel Sector – Table IN.WAA005: Crude Steel: Production: Capacity.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1,922 global export and import shipment records of Sail bearing, with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Attendance and clinical information for all general practice interactions: includes patients' symptoms, investigations, diagnoses, prescribed medication and referrals to tertiary care.
This dataset covers 83% of the population of Wales and 80% of GP practices in Wales. It is linkable with anonymised fields for individuals and GPs to other datasets, including bespoke project-specific cohorts. Each GP practice uses a clinical information system to maintain an electronic health record for each of its patients, capturing the signs, symptoms, test results, diagnoses, prescribed treatment, referrals for specialist treatment and social aspects relating to the patient's home environment.
The majority of the data is entered by the clinician during the patient consultation. Test results are electronically transferred from secondary care systems.
There are no standard rules for recording data within primary care clinical information systems. Therefore, each individual clinician can record information in their own way. The majority use Read code terminology; however, sometimes this is applied behind the scenes by the clinical system and sometimes local codes are used. Read codes are not as precise as ICD-10 or OPCS codes.
Coding standards have been agreed on for conditions monitored by the QOF (Quality and Outcomes Framework) returns. Since the implementation of QOF, these conditions have been coded in a more consistent way.
Time coverage varies between each practice.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States Imports: cif: Indian Mackrls, Marlins, Sail, Spearfish, etc, Fr, Ch data was reported at 0.037 USD mn in Jan 2025. This records a decrease from the previous number of 0.093 USD mn for Dec 2024. United States Imports: cif: Indian Mackrls, Marlins, Sail, Spearfish, etc, Fr, Ch data is updated monthly, with a median of 0.039 USD mn from Apr 2017 to Jan 2025 over 94 observations. The data reached an all-time high of 0.541 USD mn in Dec 2018 and a record low of 0.003 USD mn in Sep 2023. United States Imports: cif: Indian Mackrls, Marlins, Sail, Spearfish, etc, Fr, Ch data remains in active status in CEIC and is reported by the U.S. Census Bureau. The data is categorized under Global Database’s United States – Table US.JA130: Imports: by Commodity: 6 Digit HS Code: HS 1 to 15.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1,882 global export shipment records of Sail Fish, with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.