Problem:
The steps that I take:
create table using the following HQL commands:
CREATE TABLE 10projects(......)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
The table is created and raw data is loaded into the table, but the data reads like:
select * from 10projects:
"""
e565fb42185c6e9f22806ad9d5ac8a
77"""
"""
2e17c8c91cb58132d8103a9aa8797e
80"""
"""
45e7ddbdd7023f1eb65a6cc028d741
4f"""
360009001332
40.841691
-73.875457
Bronx
NY
10460
urban
New York City Dept Of Ed
Bronx
f
f
f
f
f
f
Mr.
f
f
Literacy
Literacy & Language
Books
highest poverty
Grades 9-12
NULL
NULL
NULL
NULL
280.02
341.49
0
308.0
1
f
f
completed
2007-03-08
2007-03-08
2007-03-08
2003-12-31
The double quotes, such as """ e565fb42185c6e9f22806ad9d5ac8a 77""", is not expected to appear.
I expected results to be:
e565fb42185c6e9f22806ad9d5ac8a
77
2e17c8c91cb58132d8103a9aa8797e
80
45e7ddbdd7023f1eb65a6cc028d741
4f
360009001332
40.841691
-73.875457
Bronx
NY
10460
urban
New York City Dept Of Ed
Bronx
f
f
f
f
f
f
Mr.
f
f
Literacy
Literacy & Language
Books
highest poverty
Grades 9-12
NULL
NULL
NULL
NULL
280.02
341.49
0
308.0
1
f
f
completed
2007-03-08
2007-03-08
2007-03-08
2003-12-31
SOLUTION:
1. Using CSV SerDe:
Step 1: download JAR HERE
Step 2: Add jar path to HIVE class path when lunching HIVE.
$ hive --auxpath /path/to/hive-examples.jar
Step 3: Create table using the serde as row format.
CREATE TABLE 10projects2 (projectid STRING, teacher_acctid STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile;
Results: '''' double quotes from triple quotes """ are eliminated, but " one quote still remain.
SOURCE: Hadoop Guide pp.451.
2. Using sed -i linux streaming editor.
SOURCE: Linux sed
The s
Command:
Syntax: s / regexp / replacement /flags
The s command can be followed by zero or more of the following flags:
g
Apply the replacement to all matches to the regexp, not just the first.
g
Apply the replacement to all matches to the regexp, not just the first.
For this case:
sed -i 's/"""//g' opendata_essays.log