Loading Datasets in ERDDAP

Latest recommended article published on 2023-09-20 12:08:34

百老

Category: Ocean-related

Article link: https://blog.csdn.net/u010763324/article/details/81906197

GenerateDatasetsXml is a command line program that can generate a rough draft of the dataset XML for almost any type of dataset.

We STRONGLY RECOMMEND that you use GenerateDatasetsXml instead of creating chunks of datasets.xml by hand because:

GenerateDatasetsXml works in seconds. Doing this by hand is at least an hour's work, even when you know what you're doing.
GenerateDatasetsXml does a better job. Doing this by hand requires extensive knowledge of how ERDDAP works. It is unlikely that you will do a better job by hand. (Bob Simons always uses GenerateDatasetsXml for the first draft, and he wrote ERDDAP.)
GenerateDatasetsXml always generates a valid chunk of datasets.xml. Any chunk of datasets.xml that you write will probably have at least a few errors that prevent ERDDAP from loading the dataset. It often takes people hours to diagnose these problems. Don't waste your time. Let GenerateDatasetsXml do the hard work. Then you can refine the .xml by hand if you want.

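Generating the dataset XML with this command-line tool is strongly recommended by Bob Simons, the author of the ERDDAP source code.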

GenerateDatasetsXml first asks you to specify the EDDType (Erd Dap Dataset Type) of the dataset. See the List of Dataset Types (in this document) to figure out which type is appropriate for the dataset you are working on.

So before running the command, you need to be familiar with the supported dataset types and their parameters. I tested with .nc data; the example Bob gives uses ASCII data.

Let's say you run the script: ./GenerateDatasetsXml.sh
Then enter: EDDTableFromNcFiles
Then enter: /u00/data/
Then enter: .*\.nc
Then enter: /u00/data/sampleFile.nc
Then enter: 10
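
The third answer above (`.*\.nc`) is a regular expression for file names. As a rough sketch, assuming the regex has to match the whole file name (an assumption, not something stated above), Python's `re.fullmatch` can preview which files such a pattern would pick up. The file names in the list are hypothetical.

```python
import re

# The file-name regex entered at the third prompt above.
FILE_NAME_REGEX = r".*\.nc"

def matching_files(file_names, regex=FILE_NAME_REGEX):
    """Return the names that the regex matches in full."""
    pattern = re.compile(regex)
    return [name for name in file_names if pattern.fullmatch(name)]

# Hypothetical directory listing, for illustration only.
names = ["sampleFile.nc", "notes.txt", "buoy_2018.nc", "archive.nc.gz"]
print(matching_files(names))
```

Under that assumption, a name such as `archive.nc.gz` is skipped because it does not end in `.nc`.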

DISCLAIMER: The chunk of datasets.xml made by GenerateDatasetsXml isn't perfect. YOU MUST READ AND EDIT THE XML BEFORE USING IT IN A PUBLIC ERDDAP.

The output of the command-line tool is not necessarily correct, so it has to be reviewed and edited.

Diagnostic information and the rough draft of the dataset XML will be written to bigParentDirectory/logs/GenerateDatasetsXml.log .

The rough draft of the dataset XML will be written to bigParentDirectory/logs/GenerateDatasetsXml.out .

The draft XML is first saved to the GenerateDatasetsXml.out file.

If you run into problems, refer to the following:

Other Ways To Diagnose Problems With Datasets
In addition to the two main tools, the following can help:

log.txt is a log file with all of ERDDAP's diagnostic messages.
The Daily Report has more information than the status page, including a list of datasets that didn't load and the exceptions (errors) they generated.
The Status Page is a quick way to check ERDDAP's status from any web browser. It includes a list of datasets that didn't load (although not the related exceptions) and taskThread statistics (showing the progress of EDDGridCopy and EDDTableCopy datasets).
If you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
Or, you can join the ERDDAP Google Group / Mailing List and post your question there.

log.txt contains the error messages, the Daily Report is sent by email, and the status page is also worth checking.

Limits to the Size of a Dataset
You'll see many references to "2 billion" below. More accurately, that is a reference to 2,147,483,647 (2^31-1), which is the maximum value of a 32-bit signed integer. In some computer languages, for example Java (which ERDDAP is written in), that is the largest data type that can be used for many data structures (for example, the size of an array).

For String values (for example, for variable names, attribute names, String attribute values, and String data values), the maximum number of characters per String in ERDDAP is ~2 billion. But in almost all cases, there will be small or large problems if a String exceeds a reasonable size (e.g., 80 characters for variable names and attribute names, and 255 characters for most String attribute values and data values). For example, web pages which display long variable names will be awkwardly wide and long variable names will be truncated if they exceed the limit of the response file type.

For gridded datasets:

The maximum number of axisVariables is ~2 billion.
The maximum number of dataVariables is ~2 billion.
But if a dataset has >100 variables, it will be cumbersome for users to use.
And if a dataset has >1 million variables, your server will need a lot of physical memory and there will be other problems.
The maximum size of each dimension (axisVariable) is ~2 billion values.
I think the maximum total number of cells (the product of all dimension sizes) is unlimited, but it may be ~9e18.
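
The limits above are plain integer arithmetic, so they are easy to check for a given grid. The sketch below is only an illustration: the dimension sizes are hypothetical, and reading "~9e18" as the maximum 64-bit signed integer is an assumption.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the "~2 billion" in the text
INT64_MAX = 2**63 - 1  # about 9.22e18; assumed to be the "~9e18" in the text

def check_grid(dimension_sizes):
    """Return (every dimension fits in a 32-bit signed int, total cell count)."""
    total_cells = 1
    for size in dimension_sizes:
        total_cells *= size
    dims_ok = all(size <= INT32_MAX for size in dimension_sizes)
    return dims_ok, total_cells

# Hypothetical grid: time x latitude x longitude.
dims_ok, cells = check_grid([100_000, 4_320, 8_640])
print(dims_ok, cells, cells <= INT64_MAX)
```

Python integers do not overflow, so the product is exact even when it exceeds the 32-bit range.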

For tabular datasets:

The maximum number of dataVariables is ~2 billion.
But if a dataset has >100 variables, it will be cumbersome for users to use.
And if a dataset has >1 million variables, your server will need a lot of physical memory and there will be other problems.
The maximum number of sources (for example, files) that can be aggregated is ~2 billion.
In some cases, the maximum number of rows from an individual source (for example, a file, but not a database) is ~2 billion rows.
I don't think there are other limits.

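The number of items in these data structures is limited to about 2 billion.

Of course, there are also limits on the size of a single dataset.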

For both gridded and tabular datasets, there are some internal limits on the size of the subset that can be requested by a user in a single request (often related to >2 billion of something or ~9e18 of something), but it is far more likely that a user will hit the file-type-specific limits.

NetCDF version 3 .nc files are limited to 2 GB. (If this is really a problem for someone, let me know: I could add support for the NetCDF version 3 .nc 64-bit extension or NetCDF Version 4, which would increase the limit significantly, but not infinitely.)
Browsers crash after only ~500MB of data, so ERDDAP limits the response to .htmlTable requests to ~400MB of data.
Many data analysis programs have similar limits (for example, the maximum size of a dimension is often ~2 billion values), so there is no reason to work hard to get around the file-type-specific limits.
The file-type-specific limits are useful in that they prevent naive requests for truly huge amounts of data (for example, "give me all of this dataset" when the dataset has 20TB of data), which would take weeks or months to download. The longer the download, the more likely it will fail for a variety of reasons.
The file-type-specific limits are useful in that they force the user to deal with reasonably-sized subsets (for example, dealing with a large gridded dataset via files with data from one time point each).
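
To get a feel for what a "reasonably-sized subset" means for a gridded .nc request, here is a back-of-the-envelope sketch. It rests on several assumptions: "2 GB" is read as 2 GiB, values are 4-byte floats, file header and metadata overhead is ignored, and the grid size is hypothetical.

```python
NC3_LIMIT_BYTES = 2 * 1024**3  # assumption: "2 GB" read as 2 GiB

def time_steps_per_request(lat_count, lon_count, bytes_per_value=4,
                           limit_bytes=NC3_LIMIT_BYTES):
    """Return how many whole time steps fit under the limit (0 if none)."""
    bytes_per_step = lat_count * lon_count * bytes_per_value
    return limit_bytes // bytes_per_step

# Hypothetical grid of 4320 x 8640 cells per time step.
steps = time_steps_per_request(4_320, 8_640)
print(steps)
```

With these numbers only a handful of time steps fit in one file, which is why requesting one time point (or a few) per file is a natural way to split a large gridded dataset.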

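NetCDF version 3 .nc files are limited to 2 GB.

In addition, the datasetID is a required, unique identifier for every dataset. When generated automatically, the IDs all come out the same, so they need to be changed.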

datasetID="aDatasetID" is a REQUIRED attribute within a <dataset> tag which assigns a short (usually <15 characters), unique, identifying name to a dataset.

Valid characters are A-Z, a-z, 0-9, _, and -, but we strongly recommend starting with a letter and then just using A-Z, a-z, 0-9, and _.
DatasetID's are case sensitive, but DON'T create two datasetID's that only differ in upper/lower case letters. It will cause problems on Windows computers (yours and/or a user's computer).
Best practices: We recommend using camelCase.
Best practices: We recommend that the first part be an acronym or abbreviation of the source institution's name and the second part be an acronym or abbreviation of the dataset's name. When possible, we create a name which reflects the source's name for the dataset. For example, we used datasetID="erdPHssta8day" for a dataset from the NOAA NMFS SWFSC Environmental Research Division (ERD) which is designated by the source to be satellite/PH/ssta/8day.
If you want to change a dataset's name, you need to do something to kill off (err, retire) the dataset with the existing name. The two solutions are:
- Shutdown Tomcat/ERDDAP. Change the name. Restart ERDDAP.
- Change the name of the dataset and make a dummy, active="false" dataset to kill off (err, retire) the old dataset:
  <dataset type="EDDTableFromNcFiles" datasetID="theOldName" active="false" />
  You can remove that tag after the old dataset is inactive.
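
The datasetID rules above lend themselves to an automated check before ERDDAP loads the file. The following is a minimal sketch, not part of ERDDAP: the function name, the sample XML, and its root element are hypothetical, and it only covers the character rules and the case-only clash described above.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

ALLOWED = re.compile(r"[A-Za-z0-9_-]+")             # characters listed as valid
RECOMMENDED = re.compile(r"[A-Za-z][A-Za-z0-9_]*")  # the stricter recommendation

def check_dataset_ids(xml_text):
    """Return a list of problems found among the datasetID attributes."""
    root = ET.fromstring(xml_text)
    problems = []
    by_lowercase = defaultdict(list)
    for dataset in root.iter("dataset"):
        dataset_id = dataset.get("datasetID")
        if dataset_id is None:
            problems.append("a <dataset> tag has no datasetID")
            continue
        if not ALLOWED.fullmatch(dataset_id):
            problems.append(f"{dataset_id}: contains invalid characters")
        elif not RECOMMENDED.fullmatch(dataset_id):
            problems.append(f"{dataset_id}: valid, but not in the recommended form")
        by_lowercase[dataset_id.lower()].append(dataset_id)
    # IDs that are identical, or differ only in letter case, clash.
    for ids in by_lowercase.values():
        if len(ids) > 1:
            problems.append("clash (same or case-only difference): " + ", ".join(ids))
    return problems

# Hypothetical sample, for illustration only.
SAMPLE = """<erddapDatasets>
  <dataset type="EDDTableFromNcFiles" datasetID="erdPHssta8day" />
  <dataset type="EDDTableFromNcFiles" datasetID="ErdPHssta8day" />
  <dataset type="EDDTableFromNcFiles" datasetID="8day-data" />
</erddapDatasets>"""

for problem in check_dataset_ids(SAMPLE):
    print(problem)
```

Running a check like this on the edited draft would catch the duplicate IDs that the author mentions for automatically generated XML.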