Teradata Notes

  

These are my personal notes that I use as a quick help in my work.
You are welcome to read them.


Contents

  • Admin
  • Documentation
  • Teradata Studio UI
  • SQL Editor
  • Databases
  • Connections
  • Tables
  • Table Types
  • Columns
  • Indexes
  • Macros
  • Stored procedures
  • Functions
  • OLAP functions
  • Select
  • Python and Teradataml
  • Other Misc Notes
  • Additional topics

     

     


    Admin

     

    See if database is running:
    pdestate -a

    start database:
    /etc/init.d/tpa start

    select * from dbc.databases;  -- note: this lists both databases and users, with their accounts and owners
    select * from dbc.tables where tablename like '...';
    select * from dbc.users;

    HELP SESSION: to get some information on the current session

     

     


    Documentation

     

    https://www.teradata.com/University/Overview

     

     


    Teradata Studio UI

    fonts, etc: menu > window > preferences > general > appearance
    There is a reset button too

    The DBC user/schema has many useful views.

    Some "schemas" – things listed in the explorer – are users, and some are databases.

    To see only the databases, and not the users: menu > window > preferences > teradata datatools > data source explorer load.
    The icon distinguishes between users and databases, or you can do right-click and see the DDL

    The pattern I see is: most settings live under menu > window > preferences.

    Result set viewer: wrench on the resultset bar, or menu > window > preferences > teradata datatools > result set viewer
    alternate rows in color: enable
    max display row count: 500

    On the history view, a useful column is "notes", which allows adding a note to a history entry.

    Enable suggestions: menu > SQL > code assist autoactivation

    projects:
    stored in the workspace
    See where the projects are stored: right-click on the project in the explorer

    Ctrl+/ to add or remove commenting

     

     


    SQL Editor

    DATABASE db-nm; -- sets the current database to db-nm
    SELECT DATABASE; -- shows the current database

    ctrl+space: templates
    Add templates to preferences > data management > sql development > sql editor > templates

    Query pause button: red button in upper right

     

     


    Databases

    CREATE DATABASE dbname 
        AS 
        PERM=0;  -- minimum
    
    CREATE DATABASE dbname 
        FROM parentdb 
        AS    -- generally parentdb is DBC
        PERM = 6400000000
        SPOOL = 39662995374
        TEMPORARY = 39662995374  -- not mandatory
        FALLBACK                 -- not mandatory
        ACCOUNT= 'accountname'   -- not mandatory
        NO AFTER JOURNAL         -- not mandatory
        NO BEFORE JOURNAL;       -- not mandatory

    (It looks like FALLBACK if there is PERM>0, otherwise NO FALLBACK)

    Permanent space is space available for permanent objects
    Spool space is the part of permanent space available for intermediate results
    Temp space is the unused part of the permanent space, used for Global Temporary tables
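    To check allocated vs. used space per database, DBC has a disk-space view. A quick sketch, assuming the standard DBC.DiskSpaceV view (it reports per AMP, hence the SUMs):

    SELECT  DatabaseName
         ,  SUM(MaxPerm)      AS max_perm
         ,  SUM(CurrentPerm)  AS current_perm
         ,  SUM(CurrentSpool) AS current_spool
    FROM    DBC.DiskSpaceV
    GROUP BY 1
    ORDER BY 1;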

     

     


    Connections

    TMODE=TERA: Teradata mode is case-INsensitive when comparing strings
    TMODE=ANSI: ANSI mode is case-sensitive when comparing strings
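    A sketch of the difference (table and column names made up; the (CASESPECIFIC) qualifier, covered in the Select section, forces case sensitivity either way):

    -- TMODE=TERA: matches 'smith', 'Smith', 'SMITH'; TMODE=ANSI: 'smith' only
    SELECT * FROM hr.emp WHERE last_nm = 'smith';
    -- Force case sensitivity regardless of mode:
    SELECT * FROM hr.emp WHERE last_nm (CASESPECIFIC) = 'smith';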

     

     


    Tables

    SHOW TABLE tbl-nm; Show the DDL statement (error if a view)
    SHOW VIEW vw-nm; Show the DDL statement
    HELP TABLE tbl-nm; Return list of columns as a result set, for a table or view. This includes comments.
    HELP STATS tbl-nm; Return statistics for table or view

    Copy a table:

    CREATE TABLE db.tablename   -- use "..." to surround names if necessary
        AS db.sourcetablename 
        WITH DATA;
    CREATE TABLE db.empty_table  -- creates an empty table
        AS db.sourcetablename 
        WITH NO DATA;            -- notice "NO"

    Create based on a query

    CREATE TABLE tablename 
        AS (SELECT ...) 
        WITH [NO] DATA;

    Minimal table creation:

    CREATE TABLE db.tb_nm                   -- best practice: put database name
        ( a VARCHAR(10) TITLE 'the name' NOT NULL
        , n INTEGER TITLE 'num' FORMAT 'zzz9'
        , ind CHAR(1) TITLE 'Important'
        )
        UNIQUE PRIMARY INDEX( a );

    Additional configurations:

    CREATE [SET|MULTISET] TABLE db.tb_nm    -- multiset accepts duplicates
        , [NO] FALLBACK                     -- a copy is in another AMP
        , NO BEFORE JOURNAL
        , NO AFTER JOURNAL
        , CHECKSUM = [ALL|NONE|DEFAULT]
        , DATABLOCKSIZE = ...               -- small for OLTP, large for OLAP, usually specified by DBA
        , MERGEBLOCKRATIO = 50 PERCENT      -- less used
        , FREESPACE = 50 PERCENT            -- less used
        , MAP = TD_MAP1            -- less used, see maps with   select MapName from dbc.maps;
        ( ....
        )
        UNIQUE PRIMARY INDEX( a );

    With journal:

    CREATE [SET|MULTISET] TABLE db.tb_nm    -- multiset accepts duplicates
        , BEFORE JOURNAL
        , AFTER JOURNAL
        , WITH JOURNAL TABLE = tb_nm_2
        ( ... );

    SHOW TABLE tb_nm; -- show the DDL

    DROP TABLE tb_nm;

    SELECT databasename, tablename, columnname FROM dbc.columns WHERE UPPER(columnname) LIKE UPPER('%event%');
    SELECT databasename, tablename FROM dbc.tables WHERE UPPER(tablename) LIKE UPPER('%event%');
    SELECT databasename FROM dbc.databases;
    SELECT * FROM dbc.databases;

     

     


    Table Types

    Derived table: basically a named subquery used as a table; it exists only for the duration of the query. Example below.
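    For example (table and column names made up):

    SELECT dt.dept, dt.avg_sal
    FROM ( SELECT dept, AVG(sal) AS avg_sal
           FROM   hr.emp
           GROUP BY dept
         ) AS dt                 -- dt is the derived table
    WHERE dt.avg_sal > 50000;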

    Volatile table:
    lives in a user session (not in data dictionary)
    CREATE [SET|MULTISET] VOLATILE TABLE tb_nm AS (SELECT ....) WITH DATA;
    By default, the rows are dropped on commit. To keep:
    CREATE [SET|MULTISET] VOLATILE TABLE tb_nm AS (SELECT ....) [WITH DATA] ON COMMIT PRESERVE ROWS;
    In spool space

    Global temporary table:
    lives during a user session, and visible to other sessions (in data dictionary, other users can see)
    CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tb_nm AS (SELECT ....) WITH DATA;
    By default, the rows are dropped on commit. To keep:
    CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tb_nm AS (SELECT ....) [WITH DATA] ON COMMIT PRESERVE ROWS;

    SET tables do NOT contain duplicate records
    MULTISET tables contain duplicate records
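    A sketch of the difference on insert (my understanding, in Teradata mode: a single-row INSERT of an exact duplicate into a SET table fails, while INSERT...SELECT silently discards the duplicates; ANSI mode behaves differently):

    CREATE SET TABLE db.t1 (a INTEGER, b INTEGER) PRIMARY INDEX (a);
    INSERT INTO db.t1 VALUES (1, 1);
    INSERT INTO db.t1 VALUES (1, 1);          -- error: duplicate row
    INSERT INTO db.t1 SELECT * FROM db.t1;    -- duplicates silently discarded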

     

     


    Columns

    ALTER TABLE tb_nm ADD col_nm data_type; 
    ALTER TABLE tb_nm RENAME col_nm TO col_nm2;
    ALTER TABLE tb_nm DROP col_nm;

    BYTEINT (-128 to 127)
    SMALLINT (2 times bigger)
    INT, INTEGER (again twice as big, about -2*10**9 to 2*10**9)
    BIGINT (8 bytes long)
    DECIMAL(m,n) (1 to 16 bytes), m = total # digits, n = precision
        Does rounding for precision, but fails when too many digits before decimal
    NUMBER(m,n) (1 to 16 bytes), m = total # digits, n = precision. Idem about number of digits.
    FLOAT,REAL (8 bytes, IEEE format, from 10**-308 to 10**308)
    CHAR (fixed)
    VARCHAR (n): variable, with n up to 64K characters
    LONG VARCHAR: shorthand for VARCHAR at its maximum length
    DATE YYYYMMDD 4 bytes
    TIME 6 or 8 bytes long
    TIME WITH TIME ZONE
    TIMESTAMP date and time together
    TIMESTAMP WITH TIME ZONE
    Also large objects, geospatial, XML

    Dates stored as integers:
    ((YEAR - 1900) * 10000) + (MONTH * 100) + DAY
    Now:     CURRENT_DATE (just date)
        CURRENT_TIME (just time)
        CURRENT_TIMESTAMP (date and time down to sub-sec)
    CAST('28/04/2022' as TIMESTAMP FORMAT 'DD/MM/YYYY')
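    A worked example of the encoding: 2022-04-28 is stored as (2022 - 1900) * 10000 + 4 * 100 + 28 = 1220428. To see the internal integer (a sketch):

    SELECT CAST(DATE '2022-04-28' AS INTEGER);   -- 1220428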

    Additional column attributes:
    NOT NULL
    DEFAULT val
    WITH DEFAULT -- 0 for numeric, space for char
    FORMAT 'YYYY/MM/DD' -- remember: single quotes
    UPPERCASE
    CASESPECIFIC -- force case to matter for comparisons
    PRIMARY KEY

    COMPRESS -- compresses nulls
    COMPRESS 'abc' -- compresses nulls and 'abc'
    COMPRESS ('abc', 'def') -- compresses nulls, 'abc' and 'def'
    COMPRESS USING user-defined-function
    DECOMPRESS USING user-defined-function

    Multi-Value Compression (MVC)
    Take up to 255 distinct values out of the stored rows, and store in the table header in each AMP.
    Cannot do this on a primary index column
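    For example, compressing a few frequent values of a non-index column (names made up):

    CREATE TABLE db.clients
        ( client_id INTEGER
        , city      VARCHAR(30) COMPRESS ('Paris', 'London', 'Berlin')
        )
        UNIQUE PRIMARY INDEX ( client_id );   -- COMPRESS not allowed on client_id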

     

     


    Indexes

    Unique primary index and non unique primary index (duplicate values allowed)
    The primary index determines which AMP executes the query (data distribution) via a hashing algorithm
    UNIQUE PRIMARY INDEX (col[s]) -- UPI
    PRIMARY INDEX (col[s]) -- NUPI
    To change the primary index, you have to drop and re-create the table

    Secondary indexes are non-primary indexes. They can be unique or not
    They are not involved in data distribution
    UNIQUE INDEX (col[s]) -- USI
    INDEX (col[s]) -- NUSI
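    Secondary indexes can be added and dropped after table creation (a sketch; the index name is optional):

    CREATE UNIQUE INDEX idx_a (a) ON db.tb_nm;   -- USI
    CREATE INDEX (b, c) ON db.tb_nm;             -- NUSI
    DROP INDEX idx_a ON db.tb_nm;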

     

     


    Macros

    Users need only EXEC privilege, not privileges on the individual objects

    Single transaction.

    REPLACE MACRO abc(p1 INTEGER)    -- for VARCHAR, you have to give a length
    AS 
    (
      SELECT ... WHERE col = :p1  ;   -- ";" after each statement
    );
    EXEC abc(123);
    SHOW MACRO abc;
    DROP MACRO abc;

    DDL is allowed in a macro, but it has to be the last statement.

    REPLACE also works when the macro does not exist yet, so there is no need to use CREATE the first time; it is simpler to always use REPLACE.
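    As noted above, a VARCHAR parameter needs an explicit length (a sketch, names made up):

    REPLACE MACRO find_emp (nm VARCHAR(30))
    AS
    (
      SELECT * FROM hr.emp WHERE last_nm LIKE :nm ;
    );
    EXEC find_emp ('Sm%');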

     

     


    Stored procedures

    CREATE PROCEDURE proc_name (p1 data_type, p2 data_type) 
    BEGIN 
       <SQL or SPL statements>
      SELECT ... WHERE col = :p1  ;
    END;
    CALL proc_name;        -- no parameters
    CALL proc_name(1,2);   -- with parameters

     

     

    REPLACE PROCEDURE  <scm>.<pnm>
       (
        IN var_in   VARCHAR(30)
       ,OUT var_out VARCHAR(250)
       )
    SQL SECURITY INVOKER
    MAIN: BEGIN
     
      DECLARE   a_var VARCHAR(250); 
      DECLARE   i INTEGER;
      SET a_var  = ' ';
      SET i  = 1;
     
     
     
    -- Option 1: explicit cursor; to use it directly: OPEN cr; FETCH ...; CLOSE cr;
    DECLARE cr CURSOR FOR
             SELECT         ...
             FROM   ...
                    ;
     
    -- Option 2: FOR loop with its own inline cursor (no OPEN/FETCH/CLOSE needed)
    for_loop: FOR tab_cr AS cr CURSOR FOR
             SELECT         ...
             FROM   ...   -- repeat what is above
      DO 
         ...
      END FOR for_loop;
     
      IF ( SQLSTATE <> '00000') THEN
         SET var_out = 'A value';
         LEAVE MAIN;
      END IF;
     
      WHILE cond
      DO
         ....
      END WHILE;
     
      SELECT COUNT(*)
      INTO   i
      FROM   a_table;
    
      -- Transaction in Teradata session.  Otherwise, in ANSI, use COMMIT; and ROLLBACK;
      BT;
      update or insert
      ET;
     
     
      CLOSE cr;   -- only needed if the explicit cursor was opened with OPEN cr;
     
      SET var_out = 'A value';
       
    END MAIN;
     
     
    -- This may not be needed because you can execute "in-line"
    CALL DBC.SYSEXECSQL('COLLECT STATS ON '||TRIM(a_db)||'.'||TRIM(a_table)||';');
     
    CREATE MULTISET VOLATILE TABLE VT_a_table AS
    (
    SELECT * FROM ...
    )
    WITH DATA PRIMARY INDEX ( the_id )
    ON COMMIT PRESERVE ROWS;
    
    
    DROP TABLE VT_a_table;   -- best practice to explicitly drop

     

     

    Functions

    -- character-pattern profile: lowercase letters -> 'a', digits -> '9', uppercase -> 'A'
    select otranslate('abcd012zZaA09', 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'aaaaaaaaaaaaaaaaaaaaaaaaaa9999999999AAAAAAAAAAAAAAAAAAAAAAAAAA');
    -- returns 'aaaa999aAaA99'

     

     


    OLAP functions

    <aggregate function> OVER 
        ( [PARTITION BY columnname]
          [ORDER BY columnname]
          [ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING] )

    <aggregate function> can be sum, count, max, min, avg
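    A running-total sketch (table and column names made up):

    SELECT region, sale_dt, amount
         , SUM(amount) OVER ( PARTITION BY region
                              ORDER BY sale_dt
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total
    FROM   sales;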

     

     


    Select

    NOT = (Teradata's inequality operator, same as <>)
    LIKE, NOT LIKE, LIKE ANY ('...', '...'), LIKE SOME ('...', '...'), LIKE ALL ('...', '...')
    LIKE 'ABC\_DEF' ESCAPE '\'
    "%" for 0 or many, "_" for one character
    CASESPECIFIC or NOT CASESPECIFIC: WHERE col (CASESPECIFIC) LIKE 'S%'

    SELECT TOP n ...
    TOP n PERCENT: takes n% rows
    TOP n WITH TIES

    SAMPLE n: takes n rows
    SAMPLE 0.4: takes 40% of rows
    To get a fully random sample:
    SAMPLE RANDOMIZED ALLOCATION 0.1

    CROSS JOIN: no ON clause. "Cross join" is a cartesian join

    RANK() OVER(PARTITION BY ... ORDER BY ... DESC)

    QUALIFY: a filter on ordered window (analytical) function:
    Seems to be the equivalent of "WHERE" or "HAVING" for analytical functions
    Example: see the top for each partition:
        RANK() OVER(PARTITION BY ... ORDER BY ... DESC) col_alias
        QUALIFY col_alias = 1
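    A complete version of that example (names made up):

    SELECT dept, emp_nm, sal
         , RANK() OVER (PARTITION BY dept ORDER BY sal DESC) AS rk
    FROM   hr.emp
    QUALIFY rk = 1;   -- top salary in each department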

    ROW_NUMBER() does not have ties, RANK() has ties

    TYPE(col) returns data type of column

    CONCAT(col, col) or ||
    SUBSTRING(S, start_num, length)
    SUBSTR: idem
    SUBSTRING(str FROM n FOR m): the ANSI syntax
    TRIM, RTRIM, LTRIM
    UPPER, LOWER
    INDEX(str, substr): returns starting pos of substr in str. 0 if not found
    POSITION(substr IN str): same as INDEX
    LENGTH
    OTRANSLATE(str, 'abc', 'xyz') a->x, b->y, c->z
    OTRANSLATE(str, 'abc', '') remove
    OREPLACE(str, o, n) replace every occurrence of o with n
    CAST(x AS INTEGER) verify with TYPE(CAST(... as type))
    CAST(str AS DATE FORMAT 'YYYY-MM-DD') B=blank
    also: TO_CHAR, TO_NUMBER, TO_DATE

    CASE WHEN condition THEN val
         ELSE val
    END 
    CASE value WHEN value THEN val
               ELSE val
    END 

    COALESCE(val, val, ...) first non-null value in the list

    NULLIF(expr1, expr2)
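    NULLIF returns NULL when the two expressions are equal; a classic use is avoiding division by zero (a sketch, names made up):

    SELECT total_amt / NULLIF(nb_rows, 0)   -- NULL instead of a divide error
    FROM   my_stats;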

    Dates:
    CAST(str AS DATE FORMAT 'YYYY-MM-DD') B=blank
    TO_DATE(str, 'YYYY-MM-DD')
    CURRENT_DATE + INTERVAL '3' YEAR
    CURRENT_DATE + INTERVAL '3-01' YEAR TO MONTH or CURRENT_DATE + INTERVAL '03' YEAR + INTERVAL '1' MONTH
    CURRENT_TIMESTAMP + INTERVAL '01 05:10' DAY TO MINUTE This format can be confusing: it is better to do INTERVAL ... + INTERVAL
    WHERE dt_fld BETWEEN (CURRENT_DATE - INTERVAL '4' DAY) AND CURRENT_DATE last 4 days
    EXTRACT(YEAR FROM CURRENT_DATE) -- also MONTH, DAY
    EXTRACT(HOUR FROM CURRENT_TIMESTAMP) -- also MINUTE, SECOND
    interval: see doc

    Time: 6 bytes, stored as three fields, with 4 bytes for seconds with up to 6 digits after decimal
    CAST (col AS TIME WITH TIME ZONE)
    SET TIME ZONE INTERVAL -'4' HOUR
    Get just the time:
    SUBSTRING(TO_CHAR(CAST(EVENTTIME AS TIME)),1,8)
    Format: HH, MI, B (blank or space), T=AM/PM, SS, SSDS(one decimal point), Z=timezone
    FORMAT 'HHhMImSSs' -> 14h32m24s

    Timestamp: 4 bytes for date, 6 for time

    Python and Teradataml

    https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/November-2021/DataFrames-for-Tables-and-Views/DataFrame-Manipulation/Filtering-Rows-and-Columns/filter-Method

    https://docs.teradata.com/r/Getting-Started-Guide-for-Data-Scientists-Using-Python-with-Vantage/December-2020
    https://docs.teradata.com/r/Teradata-Package-for-Python-Function-Reference/November-2021
    https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/November-2021

     

     


    Basic Start

    import teradataml as t
    # or: from teradataml import *
    eng = t.create_context(host='192.168.1.223', username='dbc', password='dbc', database="hr")
    t.db_list_tables("schema name", "%ou%", "[all|table|view|volatile|temp]")  # all parameters are optional
    
    t.db_list_tables("schema_name", "%part_of_table%")

     

     


    Connections

    cn=t.get_connection()
    for row in cn.execute("select * from dbc.databases where databasename like '%H%';"):
        print(row)
    
    def e(s):
        for r in t.get_connection().execute(s):
            print(r)
    e("select trim(databasename), trim(tablename) from dbc.tables where databasename like '%H%' order by 1,2")
    # this is very flexible, and is not as picky as creating a dataframe

     

     


    DataFrames

    df = t.DataFrame.from_table(t.in_schema("dbc", "databases"))
    
    df = t.DataFrame.from_query("select sn, city, upper(state) as st from hr.ville_temperature;")
    # This creates a temporary table (derived?) and requires the appropriate permissions
    
    
    
    df.info()
    df.describe()
    df.dtypes
    df.columns
    df.index  # None if no primary index on the table
    df.shape[0]  # number of rows
    df.sample(3)
    df.count(distinct=True)
    
    df[df.TableKind=="T"].select(["TableKind","tbname","dbname"]).groupby(["TableKind","dbname"]).count().sort(["TableKind","dbname"])
    
    df[(df.col==4) & ~((df.col5==5) | (df.col6==4))].select(["col1", "col2"]).groupby(["col1"]).max()

     

     


    DataFrame Methods

    assign():   adds new columns.  df.assign(new_col = ...., new_col2 = ...).  Add "drop_columns=True" to remove original columns
    concat():   concatenate two DataFrames, like a union operation
    describe():     generates statistics (profiling)
    drop():     drop columns: df.drop(["col1", "col2"], axis=1); drop rows: df.drop(df.col==xyz, axis=0), df.drop([index_val1, index_val2], axis=0)
    dropna():   looks similar to dropna in spark
    filter():   get a subset of rows, or columns
    get():      get columns from DataFrame
    get_values():   get values in the form of a numpy array
    groupby():  .groupby("col") or .groupby(["col1", "col2"]).  Note:   returns all columns
    head():     first n rows
    join():     with how taking left, inner, right, full, or cross.  df1.join(other = df2, on = [df1.col1==df2.col1b,"col2=col3"], how = "left", lsuffix = "t1", rsuffix = "t2")
    map_row():  apply a function to the rows
    map_partition():    apply a function to each partition of rows
    merge():    with how taking inner, left, right, or full (not cross).  df1.merge(right = df2, on = ["col1 = col1b", "col2 = col3"], how = "left", use_index = True, lsuffix = "t1", rsuffix = "t2"). Alternate for on:    on = [df1.col1==df2.col1b, df1.col3!=df2.col4]
    sample():   .sample(n=10), .sample(frac=0.2), .sample(randomize = True)
    select():   select("col"), .select(["col1", "col2"]), .select([["col1", "col2"]]) 
    set_index():    .set_index("col"), .set_index(["col1", "col2"]).  Add a column to an index: .set_index("col3", append = True, drop = True)
    show_query():   show the underlying query.  Usually called without parameters; df.show_query(full_query = True) may be needed to see the full query
    sort():     .sort("col"), .sort(["col1", "col2"], ascending=True)
    sort_index():   sort by the index
    squeeze():  only for one-dimensional objects
    tail():     last n rows
    
    df.info()
    df.keys(), same as .columns
    df.columns: list of column names
    df.dtypes: column names and types
    df.index: usually the primary index of the underlying table or view
    df.shape
    df.size: number of elements=rows*columns.  Get the number of rows with .shape[0]
    df.tdtypes: the column names and the data types of the underlying table
    
    df.to_sql('the_name', primary_index='col', set_table=True, if_exists="replace")
    
    t.DataFrame.from_table(t.in_schema("dbc", "databases"))

     

     


    Other Misc Notes

    Vantage is the new packaging
    licensing : developer (no fees), base, advanced, enterprise
    Interfaces: TD studio (also has an admin pane), TD studio express, TD SQL Asst (old), BTEQ (download under TD tools and utilities)

    joins: left outer join, right outer join, full outer join
    cross join: every row on left with every row on right

    COLLECT [SUMMARY] STATISTICS INDEX (indexname) ON <tablename>;
    COLLECT [SUMMARY] STATISTICS COLUMN (columnname) ON <tablename>;

    COLLECT STATS ON VT_the_name_of_the_volatile_table PRIMARY INDEX(... key columns ...);
    Three ways to collect statistics: on an index, on a column (or group of columns), or SUMMARY (table-level only).

    HELP STATISTICS <table name>; Show collected statistics

    EXPLAIN select * from ... where ...;  gives the explain plan

    JOIN INDEX = materialized view, for improving performance
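    A minimal sketch of a join index (names made up):

    CREATE JOIN INDEX ji_sales AS
    SELECT   c.client_id, c.city, s.amount
    FROM     clients c
    JOIN     sales   s ON c.client_id = s.client_id
    PRIMARY INDEX ( client_id );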

    virtual box:
    to make the db available to the host, go to the VM's network settings and select "Bridged Adapter", then check the IP address with ifconfig inside the VM

     

     


    Additional topics

    Additional topics that need further reading:

    git

    echo "# iom" >> README.md
    git init
    git add README.md
    git commit -m "first commit"
    git branch -M main
    git remote add origin https://github.com/yekesys/iom.git
    git push -u origin main

    create token:

    Unstage:
    git restore --staged .