Archive for the ‘Problem Solving’ Category
I was looking at the contents of a tablespace recently – just the extents, not a block dump – with the intention of shrinking a couple of the associated datafiles (don’t ask why – the reason is both stupid and irrelevant). I needed to know just how much I could shrink immediately, and whether there were any quick wins in terms of moving objects away from the end of the files so that even more space could be reclaimed. I was a little surprised to discover that something I have been doing for over 20 years was not as common knowledge as I thought it was, so I thought I would try to let some more people know how to do it.
So, what’s in a tablespace? It’s made up of a number of datafiles (whether in ASM, RAW or on a journalled file-system is irrelevant for this conversation). Within the datafiles are extents: contiguous groups of blocks – a multiple of the block size – which show where your objects are stored within the datafile.
I wrote this piece of SQL back in 1992 to show where the extents live. It still works today, regardless of whether your tablespaces are dictionary or locally managed:
SELECT tablespace_name, file_id, owner, segment_name,
       block_id begin, blocks, block_id+blocks-1 end,
       bytes/1024 KB, '' free
FROM   sys.dba_extents
WHERE  tablespace_name = 'USERS'
UNION
SELECT tablespace_name, file_id, '' owner, '' segment_name,
       block_id begin, blocks, block_id+blocks-1 end,
       bytes/1024 KB, 'FREE' free
FROM   sys.dba_free_space
WHERE  tablespace_name = 'USERS'
ORDER BY 1,2,5;

TSPC   ID OWNER  SEGMENT_NAME  BEGIN BLOCKS   END     KB FREE
----- --- ------ ------------ ------ ------ ----- ------ ----
USERS   4 SCOTT  DEPT            128      8   135     64
USERS   4 SCOTT  PK_DEPT         136      8   143     64
USERS   4 SCOTT  EMP             144      8   151     64
USERS   4 SCOTT  PK_EMP          152      8   159     64
USERS   4 SCOTT  SALGRADE        160      8   167     64
USERS   4                        168    128   295   1024 FREE
USERS   4 SYS    TABLE_1         296      8   303     64
USERS   4 SYS    TABLE_1         304      8   311     64
USERS   4 SYS    TABLE_1         312      8   319     64
USERS   4 SYS    TABLE_1         320      8   327     64
USERS   4 SYS    TABLE_1         328      8   335     64
USERS   4 SYS    IND_1           336      8   343     64
USERS   4                        344     40   383    320 FREE
USERS   4                        384    128   511   1024 FREE
USERS   4                        512    128   639   1024 FREE
USERS   6 SYS    TABLE_2         128    128   255   1024
USERS   6                        256  12544 12799 100352 FREE
USERS   6 SYS    IND_2         12800    256 13055   2048
So, as we can see from the output above, the tablespace USERS has two datafiles, 4 and 6 (you can identify them fully using the view DBA_DATA_FILES). Only free space at the end of a file can be reclaimed by a resize, and in file 4 the last used block is 343, so we can shrink it immediately by 1024K + 1024K + 320K = 2368K, a bit over 2M.
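The shrink itself is a one-line resize. A sketch for file 4 (the file name here is illustrative – look yours up in DBA_DATA_FILES – and this assumes the common 8K block size, so the 344 blocks that must remain come to 2752K):

-- Shrink file 4 back to just past its highest used block (343).
-- 344 blocks x 8K = 2752K, so 3M leaves a little headroom.
ALTER DATABASE DATAFILE '/u01/oradata/orcl/users01.dbf' RESIZE 3M;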
File 6 is a bit more of a problem. There’s an index, IND_2, unhelpfully located at the end of the datafile, which stops us from shrinking it. If you try to resize the file using ALTER DATABASE DATAFILE ‘+DATA/orcl/…..’ RESIZE 10M, it will fail with the error ORA-03297: file contains used data beyond requested RESIZE value. In this case, the remedy may be as simple as performing an ALTER INDEX IND_2 REBUILD TABLESPACE USERS ONLINE; This will (probably) relocate the index somewhere lower in the file and allow us to shrink the datafile. Moving tables with an ALTER TABLE … MOVE command may prove more difficult depending upon your throughput, because of the extra locks needed at the start and end of the operation, and extra care needs to be taken with any LOB objects involved.
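If you want to do this arithmetic across many files at once, here is a quick sketch against the same dictionary views which gives the smallest size each datafile could currently be resized to (treat it as a lower bound, since it ignores file header overhead):

-- Sketch: minimum resize target per datafile in USERS,
-- i.e. (highest used block + 1) x block size, rounded up to MB.
SELECT e.file_id,
       CEIL((MAX(e.block_id + e.blocks - 1) + 1)
             * ts.block_size / 1024 / 1024) min_mb
FROM   sys.dba_extents e, sys.dba_tablespaces ts
WHERE  ts.tablespace_name = e.tablespace_name
AND    e.tablespace_name  = 'USERS'
GROUP BY e.file_id, ts.block_size
ORDER BY e.file_id;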
Yesterday I had a problem… I noticed that one of our overnight jobs was failing. The job in question is an ad-hoc tidy-up job, run as needed when a performance metric for a specific query indicates that we’re going to be in trouble soon.
So, what was the problem? I have the stack trace captured in the job output:
Finished shrinks at 02-APR-2012 00:12
Creating job data pump export job LOG_EXPORT
User:....... NEW_SCHEMA
Tables:..... 'table-1','table-2'
DECLARE
*
ERROR at line 1:
ORA-31626: job does not exist
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 79
ORA-06512: at "SYS.DBMS_DATAPUMP", line 911
ORA-06512: at "SYS.DBMS_DATAPUMP", line 4356
ORA-06512: at line 22
So, “job does not exist”! What does that mean, and what could be causing this to happen? What has changed? Let’s look at the code at line 22 and see what we’re running:
. . .
v_handle := DBMS_DATAPUMP.OPEN (
              operation => 'EXPORT',
              job_mode  => 'SCHEMA',
              job_name  => 'LOG_EXPORT');
. . .
And that’s it. We’re initiating a DBMS_DATAPUMP job (to extract the data to an external file before manipulation).
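For context, here is a minimal sketch of how the rest of such a job typically continues. The dump file name and the table filter are mine for illustration, not the original code; the table names are taken from the (sanitised) job log:

DECLARE
  v_handle NUMBER;
  v_state  VARCHAR2(30);
BEGIN
  v_handle := DBMS_DATAPUMP.OPEN (
                operation => 'EXPORT',
                job_mode  => 'SCHEMA',
                job_name  => 'LOG_EXPORT');
  -- Attach an output file in a directory object the schema can write to
  DBMS_DATAPUMP.ADD_FILE (
                handle    => v_handle,
                filename  => 'log_export.dmp',
                directory => 'DATA_PUMP_DIR');
  -- Restrict the export to the two tables of interest
  DBMS_DATAPUMP.METADATA_FILTER (
                handle      => v_handle,
                name        => 'NAME_EXPR',
                value       => 'IN (''TABLE-1'',''TABLE-2'')',
                object_path => 'TABLE');
  DBMS_DATAPUMP.START_JOB (v_handle);
  DBMS_DATAPUMP.WAIT_FOR_JOB (v_handle, v_state);
END;
/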
So, what has changed? Since the last invocation of this job we have had a new software release, and for this product that means a whole new schema. We’ve missed something; we must have a permissions problem within the database. Fortunately, we keep the old schema in place, locked, following an upgrade, so all we need to do is compare the permissions for the old and new schemas and we’re done.
select granted_role from dba_role_privs where grantee = 'OLD_SCHEMA'
minus
select granted_role from dba_role_privs where grantee = 'NEW_SCHEMA'
/

no rows selected

select privilege from dba_sys_privs where grantee = 'OLD_SCHEMA'
minus
select privilege from dba_sys_privs where grantee = 'NEW_SCHEMA'
/

no rows selected

select table_name, privilege from dba_tab_privs where grantee = 'OLD_SCHEMA'
minus
select table_name, privilege from dba_tab_privs where grantee = 'NEW_SCHEMA'
/

TABLE_NAME                     PRIVILEGE
------------------------------ ----------------------------------------
DATA_PUMP_DIR                  READ
DATA_PUMP_DIR                  WRITE
Ah ha! There’s the difference! And it is definitely data pump related, so let’s run the grants into the database and run the job again. This will fix it.
grant READ on DIRECTORY data_pump_dir to NEW_SCHEMA;
grant WRITE on DIRECTORY data_pump_dir to NEW_SCHEMA;

... and run the job again...

DECLARE
*
ERROR at line 1:
ORA-31626: job does not exist
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 79
ORA-06512: at "SYS.DBMS_DATAPUMP", line 911
ORA-06512: at "SYS.DBMS_DATAPUMP", line 4356
ORA-06512: at line 22
What! Oh nuts. Why didn’t that work? [because you have not assigned a directory to DBMS_DATAPUMP yet! It'll be useful later though... Now stop jumping to conclusions! ed.] What are we going to do now? We know the permissions match. I’ve checked the O/S and there’s space in the filesystem. DBMS_DATAPUMP is not giving me enough information; what can I do to diagnose this problem further? I know! Why don’t I use datapump externally and see if that makes a difference…
expdp NEW_SCHEMA/..pwd..@..DB.. directory=data_pump_dir dumpfile=neilchandler.dmp job_name=neilchandler

Export: Release 10.2.0.3.0 - 64bit Production on Tuesday, 03 April, 2012 12:54:27
Copyright (c) 2003, 2005, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP and Data Mining Scoring Engine options

ORA-31626: job does not exist
ORA-31633: unable to create master table "NEW_SCHEMA.NEILCHANDLER"
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 95
ORA-06512: at "SYS.KUPV$FT", line 863
ORA-01536: space quota exceeded for tablespace 'USERS'
Oh heck, we can’t create the Data Pump master table because the schema has been denied a space quota on its default tablespace. D’oh! How painful was that to locate?
alter user NEW_SCHEMA quota unlimited on USERS;
And it all worked.
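With hindsight, the schema comparison covered roles, system privileges and object privileges, but not tablespace quotas. A fourth MINUS in the same pattern as above (a sketch) would have found the difference straight away:

select tablespace_name, max_bytes from dba_ts_quotas where username = 'OLD_SCHEMA'
minus
select tablespace_name, max_bytes from dba_ts_quotas where username = 'NEW_SCHEMA'
/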
My question is, why couldn’t DBMS_DATAPUMP tell me that?
A while ago I failed over a database (as planned) to its Data Guard copy, and of course everything worked as expected. Everything, that is, except a couple of reports which get sent directly from the database server early in the morning. The report generation had failed. After some investigation, we discovered that the newly active Data Guard server did not have NLS_DATE_FORMAT set in the environment, and the two reports in question were coded something like this:
select col1, col2, col3
from   user_data
where  user_date >= '25-Nov-2011 00:00:00';

The select was failing with the error ORA-01821: date format not recognized.
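The quick fix is to put the missing setting back, e.g. at session level. A sketch, where the mask is the one the report’s date literal implies:

alter session set NLS_DATE_FORMAT = 'DD-Mon-YYYY HH24:MI:SS';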
If the developer had coded an explicit conversion, then we would not have experienced a problem.
select col1, col2, col3
from   user_data
where  user_date >= to_date('25-Nov-2011 00:00:00','DD-MON-YYYY HH24:MI:SS');
Coding with an implicit date mask is great and works successfully every time, as long as the NLS_DATE_FORMAT in your session matches the date mask you have supplied, which of course it always does. Until something changes and it doesn’t. In my experience, implicit conversion is probably the single greatest source of failure in systems and also one of the hardest to track down. It frequently occurs in a badly designed schema which doesn’t use the correct datatypes. I have seen schemas where everything is stored as a VARCHAR2, including numeric fields. This works fine, as Oracle will happily insert implicit TO_NUMBER functions into your code and return answers which seem correct, until you get some rogue data into the database and everything falls apart.
USER1 @ orcl > -- Create a table but allow generic data, rather than specifying numeric data
USER1 @ orcl > -- The client will take care of validation. Of course it will.
USER1 @ orcl > create table implicit_problem (col1 varchar2(10), col2 varchar2(10));

Table created.

USER1 @ orcl > -- Let's fill the table with reasonable data
USER1 @ orcl > insert into implicit_problem values (1,1);

1 row created.

USER1 @ orcl > insert into implicit_problem values (2,10);

1 row created.

USER1 @ orcl > insert into implicit_problem values (3,66);

1 row created.

USER1 @ orcl > -- Oracle is putting an implicit to_number around col1*col2 to allow the calculation
USER1 @ orcl > select col1,col2,col1*col2 from implicit_problem;

COL1       COL2        COL1*COL2
---------- ---------- ----------
1          1                   1
2          10                 20
3          66                198

USER1 @ orcl > -- And now let's have some incorrectly validated data
USER1 @ orcl > insert into implicit_problem values (4,'A');

1 row created.

USER1 @ orcl > -- And now the implicit conversion fails
USER1 @ orcl > select col1,col2,col1*col2 from implicit_problem;

ERROR:
ORA-01722: invalid number

no rows selected

USER1 @ orcl > -- Cleanup
USER1 @ orcl > drop table implicit_problem;

Table dropped.
It’s much easier (and quicker) to catch bad data going into a system than it is to perform problem resolution afterwards. Always code explicitly for your datatypes; implicit conversion in your code invariably leads to hard-to-find bugs.
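For contrast, a minimal sketch (the table name is mine): declare the columns with the correct datatype and the rogue row is rejected at insert time, where it is cheap to deal with, rather than at query time:

create table explicit_ok (col1 number, col2 number);

insert into explicit_ok values (1,1);    -- succeeds
insert into explicit_ok values (4,'A');  -- fails immediately: ORA-01722: invalid number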
The other night I was sat in the pub with some like-minded individuals discussing the relative merits of the 10046 trace (we Rock! in the pub, dudes!) and somebody asked me how often I had actually used it in anger. A well-respected DBA / Architect maintained it was a pretty useless and difficult option to use, given the topology of modern applications (e.g. how do you find the correct session with all that connection pooling going on from multiple web servers?)
My answer surprised me. I thought back to one client where I spent 90% of my time performance tuning a large (TiB-ish) OLTP/batch hybrid system, and concluded that I had run a 10046 against production about once a year. Once. So if the 10046 is the holy grail of plan information, why wasn’t I using it more? And why did I never use a 10053 against production there?
The answer for me is a little more complex than that given in the pub:
1. As stated above, it’s hard to catch the in-flight session unless the application is instrumented to inject the trace statement when needed (and how many applications are instrumented to help you discover problems? Screen ST03 in SAP is very helpful. Any others in major ERPs? Thought not.) There is a sketch of one workaround after this list.
2. In many places that I have worked, getting authorisation to make any change to a 24×7 mission-critical system is highly bureaucratic, involving cast-iron justification for the change and its positive benefits, a requirement that there will be no adverse effects because of the change, very senior sign-off, more red tape, etc. This makes it a significant amount of work simply to put a trace on, even if you can catch the SQL – sometimes more work than actually fixing the problem.
3. An awful lot of SQL tuning is a fairly blunt affair, as the developer (who is frequently database-blind) has usually missed something obvious. It is frequently a matter of using an index incorrectly, not using one at all (or using a poor one), or failing to filter the data at the right point to minimise the I/O.
4. Most importantly, if you have AWR and ASH, it’s not really needed. For each plan created by the optimizer, the database stores the bind variables along with it, so we can usually understand why the optimizer makes the decisions it makes. ASH contains the main event waits. Why bother trying to capture all of that detail in a trace when you really don’t need it, and it’s all already there, ready to be extracted from the relevant views (e.g. dba_hist_active_sess_history, dba_hist_sql_plan and dba_hist_sql_bind)? See the sketches after this list.
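Two quick sketches to illustrate points 1 and 4 (the client identifier and the bind variable are mine, not from any particular system). For point 1: if the application tags its sessions, DBMS_MONITOR can switch a 10046-style trace on for every pooled connection carrying that tag:

-- Sketch: trace (with waits and binds) every session that sets
-- this client identifier, regardless of connection pooling.
exec DBMS_MONITOR.CLIENT_ID_TRACE_ENABLE(client_id => 'BATCH_RUN', waits => TRUE, binds => TRUE);

And for point 4, the wait detail is usually already sitting in ASH history:

-- Sketch: top wait events recorded for one SQL_ID.
select event, count(*) samples
from   dba_hist_active_sess_history
where  sql_id = :sql_id
and    event is not null
group by event
order by samples desc;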
I have never used a 10053 trace on a production system. I have simply never needed to know the decisions taken by the optimizer in that much detail. Like most DBAs and Oracle consultants, I don’t go from site to site on a weekly basis resolving edge-case problems that the incumbent DBAs haven’t had the time, or possibly the skills, to resolve themselves. I usually don’t need that level of confirmation that I’m right about why the plan is wrong, and I don’t have the time to conclusively prove it over and over again; I just need to get the fix into place and move on to the next problem.
That said, perhaps I should get fully to the bottom of these problems to ensure that they never occur again – which is the fundamental problem with Adaptive Cursor Sharing.
I just thought I would try to get restarted on this blog by telling a quick story about my broken Kindle. I’ll crowbar a bit about Oracle in at the end…
Last night I left my friends in the pub at 9pm and got onto the Tube for my 40 minute journey home to North London. I got out my (2 month old) Kindle and switched it on to continue with the book I’m reading… and the screen only half-restored. The top-half was static and unchanging from the screen saver. Obviously broken, but how? There has been no impact, no twisting or anything. Apparently (checking the internet), this just happens to Kindles sometimes. Manufacturing fault or some such.
Now, I bought this Kindle at PC World (OK, I know, but I did – same price and no couriers involved) so this morning I packaged it up in the original packaging (but without receipt) and drove down to PC World ready for a good argument with “The Tech Guys” about getting a replacement. It all started so promisingly:
Me: “My kindle is broken. The screen is broken.”
TechGuy: “Did you buy extended cover?”
Me: “No.”
TechGuy: “Well it’s not replaceable as that would be classed as accidental damage.”
Me: “It hasn’t been in an accident. It’s not damaged, it’s broken”
TechGuy: “OK, you’ll have to call Amazon about it”
Me – getting all warmed up, smiling: “No. I bought it here**”
TechGuy (immediately): “OK, I’ll get a manager”
What? No argument? Eh? Anyway, the manager comes over, checks the Kindle for damage (there is none) and asks for the receipt, which I don’t have. However, I do have proof of purchase (a credit card statement), so he checks the computer, goes into the back shop and hands me a new 3G Kindle. Thank you PC World. Unexpectedly good service (despite the initial fob-off attempt).
Right now, I’m re-loading all of the Oracle documentation back onto it. Oracle very thoughtfully provide all of their doco in .mobi format. Given that I have been to sites which have no internet link available, where the servers are in very closed DMZs, and where you’re not allowed to use USB sticks or DVDs (the ports are blocked), having all of the Oracle doco in your pocket on a searchable Kindle can be pretty handy.
For the record, I think the Kindle is a truly excellent device. I’m hooked: it saves me carrying around books, manuals, PDFs and white papers, and (at a pinch) it can browse the web in most countries in the world for free over 3G. It’s not the best browser in the world, but it’s great for checking your emails and the news on a beach in Spain, the USA or New Zealand, whilst listening to the (limited but adequate) MP3 player.
**UK Law dictates that your contract is with the seller (i.e. the shop) and not the manufacturer. Don’t be fobbed off!
Well, not mine, but a recent post on the oracle-l mailing list which I thought was worth linking to and repeating. Firstly, because it’s interesting, and secondly, because it shows some good problem solving skills by both the poster and the wider Oracle community.
The poster in question was experiencing a problem whereby his alert log was not updating…
We have an alert.log that was last updated by the database on May 6th. Strangely enough, the log.xml in the alert directory of the diag destination is being updated normally, it is just the plain text alert.log in the trace directory that is not updated. We have bounced the database, changed the diag_destination parameter and I have even grepped all the file descriptors in /proc/*/fd for traces of a possibly opened alert.log - nothing, the alert.log is still not being updated. I tried dbms_system.ksdwrt to force a write to the alert.log - again, the log.xml is updated, the plain text is not.